
Revert #469 and return to taking shared locks first #471

Merged

merged 1 commit into master from revert-lock-order on May 21, 2021

Conversation

jamadden
Member

Stress testing a large, busy Zope application indicated that performance did not improve with the #469 lock ordering and was actually slightly worse.

The following is part of an internal discussion of why that might have been the case.

In the test run, there were only 8 UnableToLockRowsToModifyError recorded as triggering retries, vs 66 UnableToLockRowsToReadCurrentError. In at least one case, a transaction got both, first Modify then ReadCurrent on the retry.

In the other thread about BTree sizes, we ran through an analysis of all the reported ReadCurrent errors and came to the conclusion that they're all due to modifications to the BTrees involved in the IntId utility.

The hypothesis is that many transactions wind up waiting on an exclusive lock for one of those objects, and only after waiting for some period do they discover the ReadCurrent failure and doom the transaction.

If they check for the ReadCurrent failure first, and either find one or detect a deadlock, they doom the transaction quickly without waiting, and hopefully a retry succeeds.
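Concretely, the ordering this PR returns to takes the shared (readCurrent) locks first and refuses to wait on them, so a doomed transaction fails immediately instead of queuing behind an exclusive lock. The following is only an illustrative sketch in PostgreSQL terms, using a hypothetical `object_state` table and placeholder OID-array parameters, not the actual RelStorage stored procedures:

```sql
-- Illustrative sketch of the shared-locks-first ordering (hypothetical
-- object_state table and $1/$2 OID-array parameters; not the real procs).

-- 1. Lock readCurrent rows in shared mode without waiting: if another
--    transaction holds an exclusive lock here, this transaction is doomed
--    anyway, so fail fast (SQLSTATE 55P03) and let the retry machinery run.
SELECT zoid
  FROM object_state
 WHERE zoid = ANY ($1)
 ORDER BY zoid
   FOR SHARE NOWAIT;

-- 2. Only then take exclusive locks on the rows being modified; waiting
--    here is acceptable because a conflict is usually just contention,
--    not a guaranteed conflict error.
SELECT zoid
  FROM object_state
 WHERE zoid = ANY ($2)
 ORDER BY zoid
   FOR UPDATE;
```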

I think what we didn't pay close enough attention to in the original discussion about changing the lock order is that any time there was a deadlock, it always really represented a doomed transaction --- there would be no way to recover without retrying anyway. The hope was that (1) instead of letting the RDBMS kill an arbitrary transaction, we would be in control of which transaction died, thus letting the one with the most work already invested proceed and commit; and (2) maybe we could do more conflict resolution. (1) didn't pan out because it turns out that if two or more processes are sitting waiting on locks and the RDBMS kills one of them, neither one really had much of an advantage over the other --- they were basically both in the same place. (2) didn't pan out because one of the transactions was doomed anyway.

This also fixes two bugs in the PostgreSQL implementation (a sketch of the corrected pattern follows the list):

  • Selecting the right kind of lock error to raise.
  • Fixing a syntax error when raising an exception in the stored procedure.
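
For the second bullet, the fix amounts to getting the RAISE syntax right and attaching the lock-related SQLSTATE so the driver can map it to the proper exception class. Below is a hedged sketch of that pattern, not the actual procedure body:

```sql
-- Sketch only: valid PL/pgSQL RAISE syntax with an explicit errcode.
-- 'lock_not_available' (55P03) is what a NOWAIT lock failure raises and
-- what should map to the UnableToLockRows* exceptions; a server-detected
-- deadlock is the separate 'deadlock_detected' (40P01) condition.
DO $$
BEGIN
    RAISE EXCEPTION 'readCurrent rows locked by another transaction'
        USING ERRCODE = 'lock_not_available';
END
$$;
```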

@jamadden jamadden merged commit 1c37cee into master May 21, 2021
@jamadden jamadden deleted the revert-lock-order branch May 21, 2021 20:22