Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add RetryableTransaction, update RetryableQuery to prevent database deadlocks in EntityWriteGateway #1668

Conversation

hanneswernery
Copy link
Contributor

1. Why is this change necessary?

As discussed with @OliverSkroblin when writing entities in multiple threads and/or transactions at the same time deadlocks may occur in the database. When a deadlock occurs within a transaction, MySQL rolls back the transaction internally. Therefore the whole transaction can be retried. This is also stated in the MySQL error message:

ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

The EntityWriteGateway uses a transaction but uses a RetryableQuery inside this transaction (code). In case of a deadlock a single query of the transaction is now retried (code) even though the transaction was already rolled back internally by the database. Since queries inside a multi-query-transaction may depend on each other, this causes all kinds of unintended behaviour (i.e. foreign key constraint violations).
The description of the RetryableException that is caught also suggest to retry the transaction, not a single query.
Therefore the usage of the current RetryableQuery only works when running a single query in no transaction.

2. What does this change do, exactly?

This PR adds a RetryableTransaction class which can be used to execute a number of commands (queries) inside a transaction which can be rolled back and retried as a whole without side effects or unwanted behaviour.
Additionally the current usages of RetryableQuery are updated to check whether or not the query is executed inside a transaction. Now if this is the case, the RetryableQuery will not be retried when a deadlock occurs.
For that the functions execute and retryable have been backwards-compatibly updated to accept a Doctrine\DBAL\Connection $connection to ensure this new safe behaviour.

3. Describe each step to reproduce the issue or behaviour.

Run two complex processes, e.g. create new versions of an existing order and update many products at the same time.
A real life example is the order document creation via the administration. This creates a new order version and triggers a complex product subscriber from the Pickware ERP plugin. Since this is a race condition error, your results may vary.

4. Please link to the relevant issues (if any).

5. Checklist

  • I have written tests and verified that they fail without my change
  • I have squashed any insignificant commits
  • I have created a changelog file with all necessary information about my changes
  • I have written or adjusted the documentation according to my changes
  • This change has comments for package types, values, functions, and non-obvious lines of code
  • I have read the contribution requirements and fulfil them.

@hanneswernery hanneswernery force-pushed the pickware/fix-retryablequery-in-sql-transactions branch from 93c0cf3 to ceace3b Compare March 3, 2021 17:03
@hanneswernery
Copy link
Contributor Author

I updated the changelog.

More importantly: I updated the EntityWriteGateway so it actually uses the new RetryableTransaction - which is described in the PR description and was the main reason for this PR. I lost these code changes somehow before opening the PR.

@hanneswernery hanneswernery force-pushed the pickware/fix-retryablequery-in-sql-transactions branch 3 times, most recently from 28b290b to 64290cb Compare March 5, 2021 09:28
@hanneswernery
Copy link
Contributor Author

I added the UPGRADE INFORMATION part of the changelog.
I changed the RetryableQuery constructor as discussed.
I rebased the branch onto trunk.

@hanneswernery hanneswernery force-pushed the pickware/fix-retryablequery-in-sql-transactions branch from 64290cb to b4fa53f Compare March 5, 2021 09:43
@shopwareBot
Copy link

Hello,

thank you for creating this pull request.
I have opened an issue on our Issue Tracker for you. See the issue link: https://issues.shopware.com/issues/NEXT-14082

Please use this issue to track the state of your pull request.

@J-Rahe J-Rahe assigned J-Rahe and unassigned bneumann97 Jun 4, 2021
@J-Rahe
Copy link
Contributor

J-Rahe commented Jun 4, 2021

Hey @hanneswernery ,
sorry that it took so long to get back to this.
I just checked out your PR and wanted to do the usual rebase, but there are a few merge conflicts.
Could you rebase your PR and fix these conflicts? Then I can put it into our internal review and hopefully get it done soon : )

@hanneswernery hanneswernery force-pushed the pickware/fix-retryablequery-in-sql-transactions branch 4 times, most recently from 618f909 to 3b28730 Compare June 7, 2021 06:41
@J-Rahe
Copy link
Contributor

J-Rahe commented Jun 7, 2021

thanks : )

@UlrichThomasGabor
Copy link
Contributor

What is the status of this PR? The faulty handling of deadlocks seems to cause other problems, e.g. https://issues.shopware.com/issues/NEXT-15805

Since nothing happened here for a while I prepared a hotfix from this PR and applied it to our installation. The mentioned issue is not fixed, so it seems something else regarding transaction handling is broken besides this PR.

@UlrichThomasGabor
Copy link
Contributor

Here are two beginTransaction calls where the transaction code is not wrapped in a try-catch block. Don't know if this is the cause, but it looks wrong to me either way:

I guess these could be wrapped in the new RetryableTransaction class as well.

@UlrichThomasGabor
Copy link
Contributor

These ain't working: (8289d01, 9da6473 @OliverSkroblin)

These lines catch exceptions which are needed for the control flow of transactions, i.e. which have to reach the new RetryableTransaction class.

It also does not make sense to split up large queries for deadlocks. This code is (or should be) wrapped in one transaction either way and InnoDB decides based on the size of the transaction, i.e. how many rows are modified, which one to roll back. Not based on the number of queries.

Best solution here is using RetryableQuery, which works with the new RetryableTransaction class and which works when the Updater is not wrapped in this class.

Fixing all mentioned issues (wrap ProductIndexer and CategoryIndexer in RetryableTransaction and removing the catch blocks, i.e. remove the split updates) resolves NEXT-15805.

@UlrichThomasGabor
Copy link
Contributor

By the way, I don't see why the sleep call should be necessary:
https://github.com/shopware/platform/pull/1668/files#diff-104cca720ab1150a9e4052ed394ce25bce03969849e2aeec954156e41189ba21R41
If a deadlock occurs, only one transaction is rolled back. The other one is committed. Likely, when the failing one is retried, the other one has already run through.

I guess in most cases the sleep statement will only prolong the execution of the script. If a transaction is long-running, it might help. But transactions should not be long running anyway and I doubt that a few microseconds will help in that case.

IMHO it would be better to leave the sleep out at the moment and observe how it works out in practice.

@windaishi
Copy link
Contributor

@UlrichThomasGabor Thanks for your feedback, we will have a look at this internally. Unfortunately we also are waiting for Shopware to merge this. I will ask them again, if they can merge this PR.

@hanneswernery hanneswernery force-pushed the pickware/fix-retryablequery-in-sql-transactions branch 2 times, most recently from 9b9aecb to 3190f5a Compare August 19, 2021 09:42
@UlrichThomasGabor
Copy link
Contributor

@hanneswernery Why do you think it makes a difference if queries are executed as batch or non-batch inside of a transaction?

From my understanding it should not make a difference:

It also does not make sense to split up large queries for deadlocks. This code is (or should be) wrapped in one transaction either way and InnoDB decides based on the size of the transaction, i.e. how many rows are modified, which one to roll back. Not based on the number of queries.

Or more specific: From my understanding, performing such work inside of a transaction non-batchy produces only SQL overhead, but will not result in less deadlocks.

@hanneswernery
Copy link
Contributor Author

hanneswernery commented Aug 19, 2021

Hi @UlrichThomasGabor , my last changes are the result of a discussion with @OliverSkroblin. The original PR caused some unwanted behaviour changes on the current trunk.

Main reason is that regular non-retryabe exceptions were not mapped (translated into a user-facing error) anymore. This is because this mapping is only called on non-batchwise executions. The original Shopware code retried every exception non-batchwise manually.
The original PR and the somewhat more correct retry logic that does not retry non-retryable errors (non-batchwise) broke this mapping behaviour.
The last commit reintroduces this manual retry which in turn calls the mapping again. The current state is still pending, I am waiting for a response from @OliverSkroblin

@UlrichThomasGabor
Copy link
Contributor

@OliverSkroblin Why is this behavior unwanted?

Basically, this method makes no sense to me:
https://github.com/shopware/platform/blob/8e56c285549ee9019a6cebffb15af4262361ef79/src/Core/Framework/DataAbstractionLayer/Dbal/EntityWriteGateway.php#L116-L146
This is all wrapped in one transaction, so should be executed atomically. Wether, this is executed as a single or as multiple SQL statements, has no effect on whether a deadlock will occur or not. It just prolongs the execution by generating overhead and thus conflicts with the most-basic best practice that transactions should be as fast/small as possible.

If the intend of "batch vs. non-batch" is "should be executed as one transaction or as multiple", then the solution should be to split the transaction or not. That does not happen at the moment.

If the intend of "batch vs. non-batch" is "should block other transaction as least as possible", then changing the isolation level is the way to go. (If this is possible. Otherwise, it becomes complicated...)

Either way, the last commit does not make it worse, but in my opinion does not solve any problem. I would revert it, remove any non-batch code, and if this performs bad in practice, then check if isolation levels can be lowered for certain transactions.

I don't get why batch vs. non-batch has anything to do with throwing Exceptions. This sounds like an unrelated issue to me.

@OliverSkroblin
Copy link
Member

Hi Ulrich,

first of all, thank you for looking so deeply into this topic.
As you probably noticed I'm not very active on github, so feel free to ping me in the community chat ;).

Regarding your questions and suggestions:

  1. Add RetryableTransaction, update RetryableQuery to prevent database deadlocks in EntityWriteGateway #1668 (comment)

This has brought some performance optimizations, and it also happened more often that these places ran into a deadlock or unique constraints. So I added the try catch here to try an single insert/update/delete first and otherwise do a single update/insert.
Of course it can be that this now collides with the new RetryableTransaction.

  1. Add RetryableTransaction, update RetryableQuery to prevent database deadlocks in EntityWriteGateway #1668 (comment)

The sleep call is necessary, because there could be not only 2 but also 10 transactions that are trying out the same records. These run then at the same time into a deadlock and will try at the same time also again an update. This sleep call is just to make sure that this happens a little bit staggered.

  1. Add RetryableTransaction, update RetryableQuery to prevent database deadlocks in EntityWriteGateway #1668 (comment)

Perhaps there is a misunderstanding here. The behavior is not unintentional, but rather breaking. We have functions in the core that rely on the event thrown there in the try catch. By changing the entity gateway for the first time the pipeline did not run anymore and tests for duplicate constraints and error reportings were red. Therefore, I played this information back.

However, if this error reporting causes us to have a performance as well as stability loss on the transactions, then I am happy to replace this with an alternative in the next major.

Only as it is now I can't take it into the core because:

  1. It's a breaking change because existing mechanics will be undermined.
  2. The pipelines of us are red.

@hanneswernery
Copy link
Contributor Author

As discussed with @OliverSkroblin , I will take another look at the PR and will remove the "unnecessary non-batch manual retry" (😅) again. Instead I will refactor the error mapping a bit so it will work with batchwise executions.

With these changes, this PR will add the correct retry logic without breaking the error mapping and without unnecessary retries 👍

@UlrichThomasGabor
Copy link
Contributor

Hi Ulrich,

first of all, thank you for looking so deeply into this topic.
As you probably noticed I'm not very active on github, so feel free to ping me in the community chat ;).

What is the community chat? The forum?

Regarding your questions and suggestions:

  1. Add RetryableTransaction, update RetryableQuery to prevent database deadlocks in EntityWriteGateway #1668 (comment)

This has brought some performance optimizations, and it also happened more often that these places ran into a deadlock or unique constraints. So I added the try catch here to try an single insert/update/delete first and otherwise do a single update/insert.
Of course it can be that this now collides with the new RetryableTransaction.

Yes, this conflicts in general with transactions.

This is "more performant" because transactions are shorter and therefore are less likely to deadlock. But this "improvement" comes at a cost, i.e. one indexing action is split into multiple transactions and when an exception occurs, the previous transactions are already committed and the current one fails, thus bringing inconsistency into the data.

Additionally, when the control flow is coming from EntityGateway and it already is in a transaction, this "improvement" is not possible at all.

There are various ways to improve the situation. We can just call, if you want to know my opinion. Might be better than discussing in text form.

  1. Add RetryableTransaction, update RetryableQuery to prevent database deadlocks in EntityWriteGateway #1668 (comment)

The sleep call is necessary, because there could be not only 2 but also 10 transactions that are trying out the same records. These run then at the same time into a deadlock and will try at the same time also again an update. This sleep call is just to make sure that this happens a little bit staggered.

Well, it can stay, it's not seriously problematic, but personally I think it also works when it's not there.

  1. Add RetryableTransaction, update RetryableQuery to prevent database deadlocks in EntityWriteGateway #1668 (comment)

Perhaps there is a misunderstanding here. The behavior is not unintentional, but rather breaking. We have functions in the core that rely on the event thrown there in the try catch. By changing the entity gateway for the first time the pipeline did not run anymore and tests for duplicate constraints and error reportings were red. Therefore, I played this information back.

However, if this error reporting causes us to have a performance as well as stability loss on the transactions, then I am happy to replace this with an alternative in the next major.

Only as it is now I can't take it into the core because:

  1. It's a breaking change because existing mechanics will be undermined.
  2. The pipelines of us are red.

Yes, this might require more than one step. But the current code, and also the code after merging this PR, is just not working correctly. It's either failing more often than necessary, or data consistency can be lost with a deadlock in the worst case.

@mitelg
Copy link
Member

mitelg commented Aug 25, 2021

What is the community chat? The forum?

https://slack.shopware.com/ 😉

@hanneswernery hanneswernery force-pushed the pickware/fix-retryablequery-in-sql-transactions branch 2 times, most recently from d300637 to 2846efd Compare August 26, 2021 12:52
@UlrichThomasGabor
Copy link
Contributor

@hanneswernery Looks good (to me). Should be squashed I guess.

@hanneswernery hanneswernery force-pushed the pickware/fix-retryablequery-in-sql-transactions branch from 2846efd to cb79c07 Compare September 6, 2021 15:44
@hanneswernery hanneswernery force-pushed the pickware/fix-retryablequery-in-sql-transactions branch from cb79c07 to 83acc4c Compare September 6, 2021 15:47
shopwareBot pushed a commit to shopware/core that referenced this pull request Sep 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

10 participants