Handle race conditions from deleting a message and adding/removing related fields simultaneously. #18559
base: main
Hello @zulip/server-emoji members, this pull request was labeled with the "area: emoji" label, so you may want to check it out!
This PR now uses Edit: This was resolved by locking the message row during deletion. |
f7b6af2 to 2e0ae8b
I've tested all scenarios I could think of, and this is ready for review.
I merged the first commit as 8bcdbc7, just to head off merge conflicts.
zerver/views/message_edit.py
Outdated
 @has_request_variables
 def delete_message_backend(
     request: HttpRequest,
     user_profile: UserProfile,
     message_id: int = REQ(converter=to_non_negative_int, path_only=True),
 ) -> HttpResponse:
-    message, ignored_user_message = access_message(user_profile, message_id)
+    message, ignored_user_message = access_message(user_profile, message_id, lock_message=True)
     validate_can_delete_message(user_profile, message)
     try:
         do_delete_messages(user_profile.realm, [message])
What will happen if someone had locked a UserMessage row when we call this? I guess what I'd hope for is that we just finish the other transaction happily and then run this. (In which case it should be illegal to lock a UserMessage if you're later going to want to lock the Message within that transaction).
One possible theory for not needing to think about this sort of question is that we should just always lock the message object (and never lock UserMessage objects).
In any case, I think we should probably put a brief version of the commit message explanation in a comment here, e.g. "We lock the Message object to ensure that any transactions modifying the Message object are serialized properly with deleting the message; this prevents a deadlock that would otherwise happen because ..." except I only spent 30 seconds on drafting that.
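The ordering concern raised in this comment can be illustrated with a minimal sketch. It assumes plain `threading.Lock` objects stand in for the database row locks; the names `message_lock`, `user_message_lock`, and `with_both_rows` are hypothetical and not part of Zulip. If every code path takes the Message lock before the UserMessage lock, two concurrent operations serialize rather than deadlock.

```python
import threading

# Hypothetical stand-ins for the row locks on a Message and one of its
# UserMessage rows; a real implementation would use SELECT ... FOR UPDATE.
message_lock = threading.Lock()
user_message_lock = threading.Lock()

completed = []


def with_both_rows(name):
    # Every caller acquires the locks in the same order: Message first,
    # then UserMessage. With a single global order, no thread can hold
    # one lock while waiting on a thread that holds the other.
    with message_lock:
        with user_message_lock:
            completed.append(name)


threads = [
    threading.Thread(target=with_both_rows, args=(name,))
    for name in ("delete_message", "add_reaction")
]
for t in threads:
    t.start()
for t in threads:
    t.join(timeout=5)
```

The same reasoning is why the comment suggests either never locking a UserMessage before the Message within one transaction, or only ever locking the Message.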
zerver/lib/actions.py
Outdated
"""Should be called while holding a SELECT FOR UPDATE lock
(e.g. via access_message(..., lock_message=True)) on the
Message row, to prevent race conditions.
"""
Thinking more about these comments, could we replace these with something like (pseudocode): `assert message.row_is_locked`, so that the code is self-explanatory? And we could potentially just set that property in `access_message` in the initial commit that adds support for `lock_message`. (And have it be a default value that's false on the `Message` class, so one doesn't need to `getattr`?)
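A minimal sketch of that idea, assuming an in-memory dict in place of the database. `FakeMessage`, `fetch_message`, and `delete_messages` are hypothetical stand-ins for illustration, not Zulip's actual classes or functions.

```python
class FakeMessage:
    # Class-level default, so reading the flag never needs getattr().
    row_is_locked = False

    def __init__(self, message_id, content):
        self.id = message_id
        self.content = content


def fetch_message(store, message_id, lock_message=False):
    message = store[message_id]
    if lock_message:
        # A real implementation would take the row lock here
        # (SELECT ... FOR UPDATE) before setting the flag.
        message.row_is_locked = True
    return message


def delete_messages(store, messages):
    for message in messages:
        # The assertion replaces a docstring asking callers to hold the lock.
        assert message.row_is_locked, "caller must lock the Message row first"
        del store[message.id]
```

One caveat with this sketch: the flag only records that the lock was requested when the object was fetched. It cannot tell whether the transaction holding the lock is still open.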
OK, I'm feeling pretty good about this PR. A few ideas:
|
@timabbott There is also the case of scheduled archiving (the retention.py system) that hasn't been addressed yet: while the archiving cron job is running, it will have various races with message deletion, editing, etc. Perhaps we can leave that for a follow-up PR, but it's worth noting at least.
Yes, agreed, I think that's fine to leave as a follow-up project for now, but let's not lose track of it. (Probably we'll want to open a few issues for the known follow-ups when we close this.)
2e0ae8b to a99c704
@abhijeetbodas2001 let me know when this is ready for the next review.
b4b1418 to c6938d6
446185f to ecae959
Re #18559 (comment), I'm not sure if I understand what needs to be done. Do you mean we create new field on Another idea is that, there's this field used by Django internally -> |
I merged most of this as the series ending with 86d6872, after editing comments/docstrings lightly. Huge thanks for doing this migration @abhijeetbodas2001! @mateuszmandera @alexmv FYI :). In production, we're likely to see a reduction in 500s from races but potentially some deadlock risk, so we should be on the lookout for new deadlocks. I'll post a few comments on the last commit; I'm not sure we have the right scope for the |
https://sentry.io/share/issue/775a4e4173314d34a2e75751e02ed932/ is a race of |
I think we just want to handle the DoesNotExist case + lock the Message row so that it doesn't get deleted while being processed by the queue worker? Something like
diff --git a/zerver/worker/queue_processors.py b/zerver/worker/queue_processors.py
index d94237905e..d72132c452 100644
--- a/zerver/worker/queue_processors.py
+++ b/zerver/worker/queue_processors.py
@@ -738,25 +738,31 @@ class FetchLinksEmbedData(QueueProcessingWorker):
                 "Time spent on get_link_embed_data for %s: %s", url, time.time() - start_time
             )

-        message = Message.objects.get(id=event["message_id"])
-        # If the message changed, we will run this task after updating the message
-        # in zerver.lib.actions.check_update_message
-        if message.content != event["message_content"]:
-            return
-        if message.content is not None:
-            query = UserMessage.objects.filter(
-                message=message.id,
-            )
-            message_user_ids = set(query.values_list("user_profile_id", flat=True))
+        with transaction.atomic():
+            try:
+                message = Message.objects.select_for_update().get(id=event["message_id"])
+            except Message.DoesNotExist:
+                logging.info("Message %s no longer exists.", event["message_id"])
+                return

-            # Fetch the realm whose settings we're using for rendering
-            realm = Realm.objects.get(id=event["message_realm_id"])
+            # If the message changed, we will run this task after updating the message
+            # in zerver.lib.actions.check_update_message
+            if message.content != event["message_content"]:
+                return
+            if message.content is not None:
+                query = UserMessage.objects.filter(
+                    message=message.id,
+                )
+                message_user_ids = set(query.values_list("user_profile_id", flat=True))

-            # If rendering fails, the called code will raise a JsonableError.
-            rendered_content = render_incoming_message(
-                message, message.content, message_user_ids, realm
-            )
-            do_update_embedded_data(message.sender, message, message.content, rendered_content)
+                # Fetch the realm whose settings we're using for rendering
+                realm = Realm.objects.get(id=event["message_realm_id"])
+
+                # If rendering fails, the called code will raise a JsonableError.
+                rendered_content = render_incoming_message(
+                    message, message.content, message_user_ids, realm
+                )
+                do_update_embedded_data(message.sender, message, message.content, rendered_content)


 @assign_queue("outgoing_webhooks")
Also, looking at
I don't see an attachment codepath here, so we can probably just remove the decorator? And even in the case where we want to keep it, it's doing |
ecae959 to dfe2fb3
dfe2fb3 to a8a87de
We may want to avoid locking the One idea would be to have a shorter atomic block with locking just for the end bit in
Hmm, I don't know the background on that, but that does seem worth changing. Do we also want to change all use of transaction.atomic to use |
I think
And then maybe we could have a linter rule to enforce
Yeah, seems unnecessary (probably copied from |
a8a87de to 31d2916
Yeah, I'm fine with a lint rule to enforce that we pass one of those two parameters. (Though maybe that'll fail tests? Not sure how the testing transactions work).
Just an update that the parts of this that are already merged are now running happily in production.
When a race occurs between starring and reacting to a message that the user did not receive, creation of a duplicate UserMessage row is attempted, which raises an `IntegrityError`.

Fix: We already use `FOR UPDATE` queries when adding reactions. This commit makes the code that creates a historical UserMessage on starring use `FOR UPDATE` queries as well. So, if parallel processes attempt to star and react to such a message, the first process to lock the message row blocks the other until its transaction is complete.

Because the UserMessage check in the star codepath is done before acquiring the lock, we need to check again whether the UserMessage exists once the lock is held.

Fixes zulip#16813
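The check, lock, re-check pattern this commit message describes can be sketched as follows. This is an illustration under stated assumptions, not the code from the commit: a per-message `threading.Lock` stands in for the `FOR UPDATE` row lock, an in-memory set stands in for the UserMessage table, and `HistoricalRowStore` and `ensure_row` are hypothetical names.

```python
import threading


class HistoricalRowStore:
    """In-memory stand-in for the UserMessage table (hypothetical)."""

    def __init__(self):
        self._rows = set()
        self._message_locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, message_id):
        # One lock per message id, playing the role of the Message row lock.
        with self._guard:
            return self._message_locks.setdefault(message_id, threading.Lock())

    def exists(self, user_id, message_id):
        return (user_id, message_id) in self._rows

    def ensure_row(self, user_id, message_id):
        """Create the row if missing; return True only if this call created it."""
        # First check happens without the lock, as in the star codepath.
        if self.exists(user_id, message_id):
            return False
        with self._lock_for(message_id):
            # Re-check under the lock: another caller may have created the
            # row between the first check and acquiring the lock.
            if self.exists(user_id, message_id):
                return False
            self._rows.add((user_id, message_id))
            return True
```

Without the second check, two callers that both passed the first check would each try to create the row once they got the lock, which is the duplicate creation the commit is fixing.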
There isn't any attachment code involved here. This was added in c93f1d4, probably accidentally.
If there are nested `transaction.atomic`s, we want to make sure that atomicity is guaranteed for the outermost transaction.
31d2916 to 43986a9
43986a9 to 892c887
Heads up @abhijeetbodas2001, we just merged some commits that conflict with the changes you made in this pull request! You can review this repository's recent commits to see where the conflicts occur. Please rebase your feature branch against the |
4ec3636 to 88b200c
This replaces the old PR #16721, because that one has a lot of pre-Django 3.2 discussion that is now unnecessary.
Part of #16502
Fixes: #16813
Chat: https://chat.zulip.org/#narrow/stream/3-backend/topic/message.20edit.20race.20.2316502