row_cache: update _prev_snapshot_pos even if apply_to_incomplete() is preempted #17138
… preempted
Commit e81fc1f accidentally broke the control flow of `row_cache::do_update()`.
Before that commit, the body of the loop was wrapped in a lambda. Thus, to break out of the loop, `return` was used. The bad commit removed the lambda, but didn't update the `return` accordingly. Thus, since the commit, the statement doesn't just break out of the loop as intended, but also skips the code after the loop, which updates `_prev_snapshot_pos` to reflect the work done by the loop.
As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted, `do_update()` fails to update `_prev_snapshot_pos`. It remains in a stale state until `do_update()` runs again and either finishes or is preempted outside of `updater`.
If we read a partition processed by `do_update()` but not covered by `_prev_snapshot_pos`, we will read stale data (from the previous snapshot), which will be remembered in the cache as the current data.
This results in outdated data being returned by the replica. (And perhaps in something worse if range tombstones are involved. I didn't investigate this possibility in depth.)
Note: for queries with CL>1, occurrences of this bug are likely to be hidden by reconciliation, because the reconciled query will only see stale data if the queried partition is affected by the bug on *all* queried replicas at the time of the query.
Fixes scylladb#16759
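To illustrate the control-flow mistake described above, here is a minimal toy sketch. It is not the actual `row_cache` code: `toy_cache`, `update_one`, `do_update_buggy`, `do_update_fixed` and the `preempt_after` parameter are all hypothetical names, and preemption is simulated by a simple counter. It only shows how a `return` left over from a lambda body skips the bookkeeping after the loop, while `break` does not.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for the cache state (hypothetical; not ScyllaDB's row_cache).
struct toy_cache {
    std::size_t processed = 0;         // partitions the loop has actually updated
    std::size_t prev_snapshot_pos = 0; // partitions [0, prev_snapshot_pos) are recorded as updated
};

// Hypothetical updater: updates one partition and reports whether it was
// "preempted". Preemption is simulated: it happens once `processed`
// reaches `preempt_after`.
inline bool update_one(toy_cache& c, std::size_t preempt_after) {
    ++c.processed;
    return c.processed == preempt_after;
}

// Buggy shape: `return` was fine while the loop body lived in a lambda,
// but in a plain loop it leaves the whole function.
inline void do_update_buggy(toy_cache& c, std::size_t total, std::size_t preempt_after) {
    assert(c.processed <= total);
    while (c.processed < total) {
        if (update_one(c, preempt_after)) {
            return; // also skips the position update below
        }
    }
    c.prev_snapshot_pos = c.processed; // record the work done by the loop
}

// Fixed shape: `break` leaves only the loop, so the position is always recorded.
inline void do_update_fixed(toy_cache& c, std::size_t total, std::size_t preempt_after) {
    assert(c.processed <= total);
    while (c.processed < total) {
        if (update_one(c, preempt_after)) {
            break;
        }
    }
    c.prev_snapshot_pos = c.processed; // record the work done by the loop
}
```

In this toy model, calling `do_update_buggy(c, 10, 3)` on a fresh `toy_cache` leaves `prev_snapshot_pos` at 0 although three partitions were processed, whereas `do_update_fixed(c, 10, 3)` records 3. A later buggy run that is not preempted catches up, which mirrors the "stale until `do_update()` runs again" window described in the commit message.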
Affects all versions since 3.3.
Good catch. We should add some kind of unit test for this scenario.
Very good. Please follow up with a unit test or an improvement to the randomized stress tests. In addition we should have integration tests with RF=1 to prevent read repair from covering up such bugs. @roydahan let's add something to S-C-T scenarios.
Agreed (on both :)).
… preempted (same commit message as above) Fixes #16759 Closes #17138 (cherry picked from commit ed98102)
@bhalevy Reproduce? From the issue report and from the source code of janusgraph, it appeared that the problem is happening even with just basic inserts and selects. And given the reported circumstances, both mine and @tgrabiec's intuition was that this felt like it involved a race between cache population and cache update. So I wrote a workload which attempts to cause such a race, by spamming a scenario of the form: insert, wait for several seconds (the duration of flush) and select. And this turned out to be enough. I had a list of other ideas to try, but the first one worked.
I think it was backported everywhere and we can remove the label, no?
@michoecho - can we remove the label? Anything else missing?
We can remove the label.