
Handle race conditions from deleting a message and adding/removing related fields simultaneously. #18559

Open · wants to merge 4 commits into main from for_update
Conversation

@abhijeetbodas2001 (Member) commented May 20, 2021

This replaces the old PR #16721, which has a lot of pre-Django-3.2 discussion that is now unnecessary.
Part of #16502
Fixes: #16813
Chat: https://chat.zulip.org/#narrow/stream/3-backend/topic/message.20edit.20race.20.2316502

@zulipbot added the "size: XL" and "area: emoji" labels May 20, 2021
@zulipbot (Member) commented
Hello @zulip/server-emoji members, this pull request was labeled with the "area: emoji" label, so you may want to check it out!

@abhijeetbodas2001 (Member, Author) commented May 20, 2021

This PR now uses captureOnCommitCallbacks, but there's an issue with database transactions which I found while testing this.
https://chat.zulip.org/#narrow/stream/3-backend/topic/message.20edit.20race.20.2316502/near/1184676

Edit: This was resolved by locking the message row during deletion.

@abhijeetbodas2001 abhijeetbodas2001 force-pushed the for_update branch 3 times, most recently from f7b6af2 to 2e0ae8b Compare May 24, 2021 13:25
@abhijeetbodas2001 abhijeetbodas2001 marked this pull request as ready for review May 24, 2021 13:31
@abhijeetbodas2001 (Member, Author) commented
I've tested all scenarios I could think of, and this is ready for review.
@mateuszmandera could you take a look?

@timabbott (Sponsor Member) commented

I merged the first commit as 8bcdbc7, just to head off merge conflicts.

Review thread on zerver/lib/message.py (outdated, resolved):
@has_request_variables
def delete_message_backend(
    request: HttpRequest,
    user_profile: UserProfile,
    message_id: int = REQ(converter=to_non_negative_int, path_only=True),
) -> HttpResponse:
-    message, ignored_user_message = access_message(user_profile, message_id)
+    message, ignored_user_message = access_message(user_profile, message_id, lock_message=True)
    validate_can_delete_message(user_profile, message)
    try:
        do_delete_messages(user_profile.realm, [message])
@timabbott (Sponsor Member) commented May 25, 2021

What will happen if someone had locked a UserMessage row when we call this? I guess what I'd hope for is that we just finish the other transaction happily and then run this. (In which case it should be illegal to lock a UserMessage if you're later going to want to lock the Message within that transaction).

One possible theory for not needing to think about this sort of question is that we should just always lock the message object (and never lock UserMessage objects).

In any case, I think we should probably put a brief version of the commit message explanation in a comment here, e.g. "We lock the Message object to ensure that any transactions modifying the Message object are serialized properly with deleting the message; this prevents a deadlock that would otherwise happen because ..." except I only spent 30 seconds on drafting that.
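The serialization property being discussed can be illustrated with a small stdlib sketch. This is not the Zulip code: a `threading.Lock` stands in for the row-level `SELECT ... FOR UPDATE` lock, and a dict stands in for the Message table.

```python
# Minimal simulation: a per-row lock forces a concurrent edit and delete
# into a strict order, so the edit either completes before the delete or
# observes that the row is already gone.
import threading

row_lock = threading.Lock()   # stands in for SELECT ... FOR UPDATE on the row
messages = {1: "hello"}       # stands in for the Message table
events = []

def edit_message(message_id: int, new_content: str) -> None:
    with row_lock:  # access_message(..., lock_message=True) in the real code
        if message_id not in messages:
            events.append("edit-skipped")  # row was deleted first
            return
        messages[message_id] = new_content
        events.append("edited")

def delete_message(message_id: int) -> None:
    with row_lock:  # the delete path takes the same lock
        messages.pop(message_id, None)
        events.append("deleted")

t1 = threading.Thread(target=edit_message, args=(1, "hi"))
t2 = threading.Thread(target=delete_message, args=(1,))
t1.start(); t2.start()
t1.join(); t2.join()
# Whichever thread wins the lock, we never mutate a deleted row.
```

Either ordering is acceptable; what the lock rules out is the interleaving where the edit observes the row, the delete removes it, and the edit then writes to a row that no longer exists.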

Review threads on zerver/lib/actions.py (outdated, resolved)
"""Should be called while holding a SELECT FOR UPDATE lock
(e.g. via access_message(..., lock_message=True)) on the
Message row, to prevent race conditions.
"""
@timabbott (Sponsor Member) commented

Thinking more about these comments, could we replace these with something like (pseudocode):

assert message.row_is_locked

so that the code is self-explanatory. And we could potentially just set that property in access_message in the initial commit that adds support for lock_message. (And have it be a default value that's false on the Message class, so one doesn't need to getattr?)
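A plain-Python sketch of this `row_is_locked` idea (the attribute and helper names here are hypothetical, not an actual Django or Zulip API):

```python
# Hypothetical sketch only: "row_is_locked" is the proposed marker
# attribute, with a class-level default of False so callers never need
# getattr(). access_message stands in for the real helper, which would
# fetch via Message.objects.select_for_update() when lock_message=True.
class Message:
    row_is_locked = False  # default on the class

def access_message(message: Message, lock_message: bool = False) -> Message:
    if lock_message:
        # Real code: fetch the row with .select_for_update() inside a
        # transaction, then mark the returned instance.
        message.row_is_locked = True
    return message

def do_remove_reaction(message: Message) -> None:
    # Self-explanatory guard, replacing the docstring-only contract:
    assert message.row_is_locked
    # ... actual reaction-removal logic would go here ...
```

As the follow-up discussion notes, a caller that ignores the contract could still set the attribute by hand, so this is documentation-by-assertion rather than real enforcement.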

Review thread on zerver/lib/actions.py (outdated, resolved)
@timabbott (Sponsor Member) commented

OK, I'm feeling pretty good about this PR. A few ideas:

  • Can we drop the lock_usermessage stuff for now, and leave that for a follow-up PR? I feel like it has various tricky things to think about, and it might be worth getting all the lock_message stuff in before turning to that in full.
  • I note this detail in the Django documentation: "By default, select_for_update() locks all rows that are selected by the query. For example, rows of related objects specified in select_related() are locked in addition to rows of the queryset’s model. If this isn’t desired, specify the related objects you want to lock in select_for_update(of=(...)) using the same fields syntax as select_related(). Use the value 'self' to refer to the queryset’s model.". I think we probably do want our lock_message logic to pass of=("self",), i.e. not try to lock foreign keys, since only the Message object itself has these races; we're not super worried about the UserProfile object, and want to take minimal locks.

@mateuszmandera (Contributor) commented

@timabbott There is also the case of scheduled archiving (the retention.py system) that hasn't been addressed yet - that is, the archiving cron job will have various races with message deletion/editing etc. while it runs. Perhaps we can leave that for a follow-up PR, but it's worth noting at least.

@timabbott (Sponsor Member) commented

Yes, agreed, I think that's fine to leave as a follow-up project for now but let's not lose track of it.

(Probably we'll want to open a few issues when we close this for the known follow-ups).

@timabbott (Sponsor Member) commented

@abhijeetbodas2001 let me know when this is ready for a next review.

@abhijeetbodas2001 abhijeetbodas2001 force-pushed the for_update branch 2 times, most recently from b4b1418 to c6938d6 Compare June 3, 2021 15:55
@abhijeetbodas2001 (Member, Author) commented

Re #18559 (comment), I'm not sure I understand what needs to be done. Do you mean we should create a new field on the Message class for this purpose? I don't see how that would help - a bad caller which ignores the docstring could still set that field to True, right?

Another idea: there's a field Django uses internally, QuerySet.query.select_for_update: bool. If we refactor access_message to return a QuerySet (instead of a Message), we could probably use that directly to assert that the row is locked.

@timabbott (Sponsor Member) commented

I merged most of this as the series ending with 86d6872, after editing comments/docstrings lightly. Huge thanks for doing this migration @abhijeetbodas2001!

@mateuszmandera @alexmv FYI :). In production, we're likely to see a reduction in 500s from races but potentially some deadlock risk, so we should be on the lookout for new deadlocks.

I'll post a few comments on the last commit; I'm not sure we have the right scope for the @transaction.atomic block.

Review threads on zerver/lib/actions.py (outdated, resolved)
@alexmv (Collaborator) commented Jun 5, 2021

https://sentry.io/share/issue/775a4e4173314d34a2e75751e02ed932/ is a race of embed_links with message deletion, which I don't think is covered by this yet.

@mateuszmandera (Contributor) commented

I think we just want to handle the DoesNotExist case + lock the Message row so that it doesn't get deleted while being processed by the queue worker?

Something like

diff --git a/zerver/worker/queue_processors.py b/zerver/worker/queue_processors.py
index d94237905e..d72132c452 100644
--- a/zerver/worker/queue_processors.py
+++ b/zerver/worker/queue_processors.py
@@ -738,25 +738,31 @@ class FetchLinksEmbedData(QueueProcessingWorker):
                 "Time spent on get_link_embed_data for %s: %s", url, time.time() - start_time
             )
 
-        message = Message.objects.get(id=event["message_id"])
-        # If the message changed, we will run this task after updating the message
-        # in zerver.lib.actions.check_update_message
-        if message.content != event["message_content"]:
-            return
-        if message.content is not None:
-            query = UserMessage.objects.filter(
-                message=message.id,
-            )
-            message_user_ids = set(query.values_list("user_profile_id", flat=True))
+        with transaction.atomic():
+            try:
+                message = Message.objects.select_for_update().get(id=event["message_id"])
+            except Message.DoesNotExist:
+                logging.info("Message %s no longer exists.", event["message_id"])
+                return
 
-            # Fetch the realm whose settings we're using for rendering
-            realm = Realm.objects.get(id=event["message_realm_id"])
+            # If the message changed, we will run this task after updating the message
+            # in zerver.lib.actions.check_update_message
+            if message.content != event["message_content"]:
+                return
+            if message.content is not None:
+                query = UserMessage.objects.filter(
+                    message=message.id,
+                )
+                message_user_ids = set(query.values_list("user_profile_id", flat=True))
 
-            # If rendering fails, the called code will raise a JsonableError.
-            rendered_content = render_incoming_message(
-                message, message.content, message_user_ids, realm
-            )
-            do_update_embedded_data(message.sender, message, message.content, rendered_content)
+                # Fetch the realm whose settings we're using for rendering
+                realm = Realm.objects.get(id=event["message_realm_id"])
+
+                # If rendering fails, the called code will raise a JsonableError.
+                rendered_content = render_incoming_message(
+                    message, message.content, message_user_ids, realm
+                )
+                do_update_embedded_data(message.sender, message, message.content, rendered_content)
 
 
 @assign_queue("outgoing_webhooks")

@mateuszmandera (Contributor) commented Jun 5, 2021

Also, looking at do_update_embedded_data it seems to have a redundant atomic() decorator from 5 years ago?

# We use transaction.atomic to support select_for_update in the attachment codepath.
@transaction.atomic
def do_update_embedded_data(

I don't see an attachment codepath here, so we can probably just remove the decorator? And even if we wanted to keep it, it does send_event inside the transaction, which is a pattern we want to avoid.

@timabbott (Sponsor Member) commented

We may want to avoid locking the Message row while calling render_incoming_message, just because that's a potentially slow operation and every situation where we hold a lock for a long time is costly.

One idea would be to have a shorter atomic block with locking just for the end bit in do_update_embedded_data.
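The shape of that idea, as a stdlib sketch: do the slow rendering outside the lock, then re-validate under a short lock before writing. A `threading.Lock` stands in for the row lock and `slow_render` stands in for `render_incoming_message`; none of these are the real functions.

```python
# Hold the lock only for the final write, not for the slow rendering.
import threading

lock = threading.Lock()  # stands in for SELECT ... FOR UPDATE
message = {"content": "hello", "rendered": None}

def slow_render(content: str) -> str:
    return f"<p>{content}</p>"  # stand-in for render_incoming_message

def update_embedded_data() -> None:
    snapshot = message["content"]
    rendered = slow_render(snapshot)  # slow work, no lock held
    with lock:  # short critical section, like a small transaction.atomic
        if message["content"] != snapshot:
            return  # message changed meanwhile; a later task will rerun
        message["rendered"] = rendered

update_embedded_data()
```

The re-check of the content snapshot inside the lock is what makes it safe to have done the rendering optimistically outside it.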

> Also, looking at do_update_embedded_data it seems to have a redundant atomic() decorator from 5 years ago?

Hmm, I don't know the background on that, but that does seem worth changing.

Do we also want to change all use of transaction.atomic to use savepoint=False across the codebase, and lint for it, so that we don't have to consider the possibility of nested @transaction.atomic?

@abhijeetbodas2001 (Member, Author) commented Jun 7, 2021

I think durable (new in Django 3.2) might be better for ensuring there are no nested transaction.atomic blocks. From Django's docs:

It is sometimes useful to ensure an atomic block is always the outermost atomic block, ensuring that any database changes are committed when the block is exited without errors. This is known as durability and can be achieved by setting durable=True. If the atomic block is nested within another it raises a RuntimeError.

And then maybe we could have a linter rule to enforce durable=True is always passed in transaction.atomic?

> Also, looking at do_update_embedded_data it seems to have a redundant atomic() decorator from 5 years ago?

Yeah, seems unnecessary (probably copied from do_update_message accidentally?). I added a commit to fix it.

@timabbott (Sponsor Member) commented

Yeah, durable=True works for places where we intend that -- I think we might have a place where we want savepoint=False because the function can be called from multiple code paths, only some of which already have a transaction.

I'm fine with a lint rule to enforce that we pass one of those two parameters. (Though maybe that'll fail tests? Not sure how the testing transactions work).

@timabbott (Sponsor Member) commented

Just an update that the parts of this that are already merged are now running happily in production.

When a race occurs between starring and reacting to a message
not received by the user, duplicate UserMessage creation is
attempted, which throws an `IntegrityError`.

Fix:
We already use `FOR UPDATE` queries when adding reactions.
This commit makes the code which creates historical
UserMessage rows for starring use `FOR UPDATE` queries too.

So, if there are parallel processes attempting to star and
react to such a message, the first process to lock the
message row will block the other one until the first
transaction is complete.

Because the UserMessage check in the star codepath is done
before acquiring the lock, we need to check again, after
acquiring it, whether the UserMessage exists.

Fixes zulip#16813
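The check / lock / re-check pattern this commit message describes can be sketched with stdlib stand-ins. The lock replaces the `FOR UPDATE` row lock, and `insert_user_message` simulates an INSERT hitting the unique constraint; none of these are the real Zulip functions.

```python
import threading

lock = threading.Lock()  # stands in for the FOR UPDATE lock on the Message row
user_messages = set()    # rows in a table with a unique constraint
errors = []

def insert_user_message(key) -> None:
    # Simulates an INSERT: a duplicate violates the unique constraint.
    if key in user_messages:
        raise RuntimeError("IntegrityError: duplicate UserMessage")
    user_messages.add(key)

def ensure_user_message(user_id: int, message_id: int) -> None:
    key = (user_id, message_id)
    if key in user_messages:  # cheap existence check, done before the lock
        return
    with lock:  # the first writer blocks the others here
        if key in user_messages:  # must re-check under the lock
            return
        try:
            insert_user_message(key)
        except RuntimeError as err:
            errors.append(err)

# Eight "parallel processes" starring/reacting to the same message:
threads = [
    threading.Thread(target=ensure_user_message, args=(7, 42)) for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Dropping the re-check inside the lock is exactly the race being fixed: two workers pass the unlocked check, and the second INSERT raises the `IntegrityError`.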
There isn't any attachment code involved here.
This was added in c93f1d4, probably
accidentally.
If there are nested `transaction.atomic`s, we want to make sure that
atomicity is guaranteed for the outermost transaction.
@zulipbot (Member) commented Sep 7, 2021

Heads up @abhijeetbodas2001, we just merged some commits that conflict with the changes you made in this pull request! You can review this repository's recent commits to see where the conflicts occur. Please rebase your feature branch against the upstream/main branch and resolve your pull request's merge conflicts accordingly.

Labels: area: emoji, has conflicts, size: XL
Successfully merging this pull request may close these issues.

Emoji reactions on historical messages can race with themselves or message starring