Add RetryableTransaction, update RetryableQuery to prevent database deadlocks in EntityWriteGateway #1668
93c0cf3 to ceace3b
I updated the changelog. More importantly: I updated the |
28b290b to 64290cb
I added the UPGRADE INFORMATION part of the changelog.
64290cb to b4fa53f
Hello, thank you for creating this pull request. Please use this issue to track the state of your pull request.
Hey @hanneswernery , |
618f909 to 3b28730
thanks : ) |
Since nothing happened here for a while, I prepared a hotfix from this PR and applied it to our installation. The mentioned issue is not fixed, so it seems something else regarding transaction handling is broken besides this PR.
Here are two
I guess these could be wrapped in the new |
These ain't working: (8289d01, 9da6473 @OliverSkroblin)
These lines catch exceptions which are needed for the control flow of transactions, i.e. which have to reach the new

It also does not make sense to split up large queries to avoid deadlocks. This code is (or should be) wrapped in one transaction either way, and InnoDB decides which transaction to roll back based on the size of the transaction, i.e. how many rows are modified, not on the number of queries. Best solution here is using

Fixing all mentioned issues (wrapping ProductIndexer and CategoryIndexer in RetryableTransaction and removing the catch blocks, i.e. removing the split updates) resolves NEXT-15805.
By the way, I don't see why the sleep call should be necessary: I guess in most cases the sleep statement will only prolong the execution of the script. If a transaction is long-running, it might help. But transactions should not be long-running anyway, and I doubt that a few microseconds will help in that case. IMHO it would be better to leave the sleep out for now and observe how it works out in practice.
@UlrichThomasGabor Thanks for your feedback, we will have a look at this internally. Unfortunately, we are also waiting for Shopware to merge this. I will ask them again whether they can merge this PR.
9b9aecb to 3190f5a
@hanneswernery Why do you think it makes a difference if queries are executed as batch or non-batch inside of a transaction? From my understanding it should not make a difference:
Or more specifically: From my understanding, executing such work non-batchwise inside of a transaction only produces SQL overhead, but will not result in fewer deadlocks.
Hi @UlrichThomasGabor, my last changes are the result of a discussion with @OliverSkroblin. The original PR caused some unwanted behaviour changes on the current

The main reason is that regular non-retryable exceptions were no longer mapped (translated into a user-facing error). This is because this mapping is only called on non-batchwise executions. The original Shopware code manually retried every exception non-batchwise.
@OliverSkroblin Why is this behavior unwanted? Basically, this method makes no sense to me: If the intent of "batch vs. non-batch" is "should be executed as one transaction or as multiple", then the solution should be to split the transaction or not. That does not happen at the moment. If the intent of "batch vs. non-batch" is "should block other transactions as little as possible", then changing the isolation level is the way to go. (If this is possible. Otherwise, it becomes complicated...) Either way, the last commit does not make it worse, but in my opinion it does not solve any problem. I would revert it, remove any non-batch code, and if this performs badly in practice, then check whether isolation levels can be lowered for certain transactions. I don't get why batch vs. non-batch has anything to do with throwing exceptions. This sounds like an unrelated issue to me.
Hi Ulrich, first of all, thank you for looking so deeply into this topic. Regarding your questions and suggestions:

This has brought some performance optimizations, and it also happened more often that these places ran into a deadlock or a unique constraint violation. So I added the try/catch here to try a single insert/update/delete first and otherwise do a single update/insert.

The sleep call is necessary because there could be not only 2 but also 10 transactions working on the same records. These then run into a deadlock at the same time and will also retry the update at the same time. The sleep call just makes sure that these retries happen slightly staggered.

Perhaps there is a misunderstanding here. The behavior is not unintentional, but rather breaking. We have functions in the core that rely on the event thrown there in the try/catch. When the entity gateway was changed, the pipeline no longer passed, and the tests for duplicate constraints and error reporting were red. That is why I reported this back. However, if this error reporting costs us performance as well as stability in the transactions, then I am happy to replace it with an alternative in the next major. Only as it is now, I can't take it into the core because:

As discussed with @OliverSkroblin, I will take another look at the PR and remove the "unnecessary non-batch manual retry" (😅) again. Instead, I will refactor the error mapping a bit so that it works with batchwise executions. With these changes, this PR will add the correct retry logic without breaking the error mapping and without unnecessary retries 👍
What is the community chat? The forum?
Yes, this conflicts in general with transactions. This is "more performant" because transactions are shorter and therefore are less likely to deadlock. But this "improvement" comes at a cost, i.e. one indexing action is split into multiple transactions and when an exception occurs, the previous transactions are already committed and the current one fails, thus bringing inconsistency into the data. Additionally, when the control flow is coming from EntityGateway and it already is in a transaction, this "improvement" is not possible at all. There are various ways to improve the situation. We can just call, if you want to know my opinion. Might be better than discussing in text form.
Well, it can stay, it's not seriously problematic, but personally I think it also works when it's not there.
Yes, this might require more than one step. But the current code, and also the code after merging this PR, is just not working correctly. It's either failing more often than necessary, or data consistency can be lost with a deadlock in the worst case.
d300637 to 2846efd
@hanneswernery Looks good (to me). Should be squashed I guess.
2846efd to cb79c07
cb79c07 to 83acc4c
…ansaction-safe. fixes shopware/shopware#1668
1. Why is this change necessary?
As discussed with @OliverSkroblin, when writing entities in multiple threads and/or transactions at the same time, deadlocks may occur in the database. When a deadlock occurs within a transaction, MySQL rolls back the transaction internally. Therefore the whole transaction can be retried. This is also stated in the MySQL error message:
The EntityWriteGateway uses a transaction but uses a RetryableQuery inside this transaction (code). In case of a deadlock, a single query of the transaction is now retried (code) even though the transaction was already rolled back internally by the database. Since queries inside a multi-query transaction may depend on each other, this causes all kinds of unintended behaviour (e.g. foreign key constraint violations).
The description of the RetryableException that is caught also suggests retrying the transaction, not a single query.
Therefore the current RetryableQuery only works when running a single query outside of a transaction.
2. What does this change do, exactly?
This PR adds a RetryableTransaction class which can be used to execute a number of commands (queries) inside a transaction which can be rolled back and retried as a whole without side effects or unwanted behaviour.
Additionally, the current usages of RetryableQuery are updated to check whether or not the query is executed inside a transaction. If this is the case, the RetryableQuery will not be retried when a deadlock occurs.
For that, the functions execute and retryable have been updated in a backwards-compatible way to accept a Doctrine\DBAL\Connection $connection to ensure this new safe behaviour.
3. Describe each step to reproduce the issue or behaviour.
Run two complex processes, e.g. create new versions of an existing order and update many products at the same time.
A real life example is the order document creation via the administration. This creates a new order version and triggers a complex product subscriber from the Pickware ERP plugin. Since this is a race condition error, your results may vary.
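To make the behaviour described in section 2 more concrete, here is a minimal, language-neutral sketch in Python. It is not the PR's PHP code. FakeConnection, DeadlockError, retry_transaction and retry_query are hypothetical names made up for this illustration, and the retry count and sleep range are arbitrary assumptions. The sketch only shows the two ideas: a deadlock inside a transaction leads to a retry of the whole transaction, and a single query is retried only when it does not run inside a transaction.

```python
import random
import time


class DeadlockError(Exception):
    """Hypothetical stand-in for a driver's retryable deadlock exception."""


class FakeConnection:
    """Hypothetical in-memory stand-in for a database connection."""

    def __init__(self):
        self.depth = 0        # transaction nesting level
        self.pending = []     # statements of the open transaction
        self.committed = []   # statements that were made durable

    def begin(self):
        self.depth += 1

    def commit(self):
        self.depth -= 1
        if self.depth == 0:
            self.committed.extend(self.pending)
            self.pending = []

    def rollback(self):
        self.depth = 0
        self.pending = []

    def in_transaction(self):
        return self.depth > 0

    def execute(self, statement):
        # Outside a transaction every statement is committed immediately.
        target = self.pending if self.in_transaction() else self.committed
        target.append(statement)


def _staggered_pause(attempt, sleep):
    # Random pause so that competing writers do not all retry at once.
    # The range is an arbitrary assumption for this sketch.
    sleep(random.uniform(0, 0.01 * (attempt + 1)))


def retry_transaction(conn, work, max_retries=10, sleep=time.sleep):
    """Run work() in a transaction; on deadlock, retry the whole unit."""
    for attempt in range(max_retries + 1):
        conn.begin()
        try:
            result = work()
            conn.commit()
            return result
        except DeadlockError:
            conn.rollback()  # discard everything, then start over
            if attempt == max_retries:
                raise
            _staggered_pause(attempt, sleep)
        except Exception:
            conn.rollback()  # non-retryable errors are passed on unchanged
            raise


def retry_query(conn, query, max_retries=10, sleep=time.sleep):
    """Retry a single query on deadlock, but never inside a transaction."""
    for attempt in range(max_retries + 1):
        try:
            return query()
        except DeadlockError:
            # Inside a transaction the enclosing unit has to be retried,
            # so the exception must reach the caller untouched.
            if conn.in_transaction() or attempt == max_retries:
                raise
            _staggered_pause(attempt, sleep)
```

In this sketch, a callable passed to retry_transaction may itself call retry_query: because the connection reports an open transaction, the inner call re-raises the deadlock immediately and the outer loop repeats all statements together.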
4. Please link to the relevant issues (if any).
5. Checklist