row_cache: update _prev_snapshot_pos even if apply_to_incomplete() is preempted #17138

michoecho · 2024-02-02T20:14:50Z

Commit e81fc1f accidentally broke the control flow of row_cache::do_update().

Before that commit, the body of the loop was wrapped in a lambda. Thus, to break out of the loop, return was used.

The bad commit removed the lambda, but didn't update the return accordingly. Thus, since the commit, the statement doesn't just break out of the loop as intended, but also skips the code after the loop, which updates _prev_snapshot_pos to reflect the work done by the loop.

As a result, whenever apply_to_incomplete() (the updater) is preempted, do_update() fails to update _prev_snapshot_pos. It remains in a stale state, until do_update() runs again and either finishes or is preempted outside of updater.
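The control-flow mistake described above can be sketched with a small stand-alone model. This is only an illustration, not the ScyllaDB code: `model_cache`, `step`, `updater_fn`, `do_update_buggy` and `do_update_fixed` are hypothetical names, and the position is modelled as a plain index into a vector of partitions.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical, simplified model of the loop; not the actual ScyllaDB code.
enum class step { done, preempted };
using updater_fn = std::function<step(int)>;

struct model_cache {
    std::size_t prev_snapshot_pos = 0; // stands in for _prev_snapshot_pos
};

// Shape of the bug: `return` was meant to end the loop only, but without the
// enclosing lambda it leaves the whole function and skips the update below.
inline void do_update_buggy(model_cache& c, const std::vector<int>& partitions,
                            const updater_fn& updater) {
    std::size_t i = c.prev_snapshot_pos;
    while (i < partitions.size()) {
        if (updater(partitions[i]) == step::preempted) {
            return; // the position update after the loop never runs
        }
        ++i;
    }
    c.prev_snapshot_pos = i;
}

// Shape of the fix: `break` ends the loop, so the position update still runs.
inline void do_update_fixed(model_cache& c, const std::vector<int>& partitions,
                            const updater_fn& updater) {
    std::size_t i = c.prev_snapshot_pos;
    while (i < partitions.size()) {
        if (updater(partitions[i]) == step::preempted) {
            break;
        }
        ++i;
    }
    c.prev_snapshot_pos = i; // reflects the work done by the loop
}
```

In this model, if the updater is preempted on the third of three partitions, the buggy variant leaves the position at its value on entry, while the fixed variant advances it past the two partitions that were fully processed.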

If we read a partition processed by do_update() but not covered by _prev_snapshot_pos, we will read stale data (from the previous snapshot), which will be remembered in the cache as the current data.
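A minimal sketch of why a stale position leads to a stale read, under a strong simplifying assumption: the choice between the previous and the current snapshot is modelled as a single comparison of the key against the position. `model_source` and `populate_from` are hypothetical names used only for this illustration; the real cache logic is more involved.

```cpp
#include <map>
#include <string>

// Hypothetical model: two snapshots of the underlying data plus a boundary.
struct model_source {
    std::map<int, std::string> prev_snapshot;
    std::map<int, std::string> current_snapshot;
    int prev_snapshot_pos; // keys below this are treated as already updated
};

// Assumption for this sketch: a cache miss for `key` is populated from the
// current snapshot only if the key is covered by prev_snapshot_pos, and from
// the previous snapshot otherwise.
inline std::string populate_from(const model_source& s, int key) {
    const auto& snap = key < s.prev_snapshot_pos ? s.current_snapshot
                                                 : s.prev_snapshot;
    auto it = snap.find(key);
    return it == snap.end() ? std::string() : it->second;
}
```

If the update has already processed key 1 but the position was left at 0, this model populates the cache from the previous snapshot, which is the stale read described above.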

This results in outdated data being returned by the replica. (And perhaps in something worse if range tombstones are involved. I didn't investigate this possibility in depth.)

Note: for queries with CL>1, occurrences of this bug are likely to be hidden by reconciliation, because the reconciled query will only see stale data if the queried partition is affected by the bug on all queried replicas at the time of the query.
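The masking effect can be illustrated with a toy reconciliation that keeps the reply with the newest write timestamp. This is a simplified sketch, not the actual reconciliation code; `replica_reply` and `reconcile` are hypothetical names, and a single value with a single timestamp stands in for a whole partition.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical reply from one replica: a value and the timestamp of its write.
struct replica_reply {
    std::int64_t write_timestamp;
    std::string value;
};

// Keeps the reply with the newest write timestamp.
// Assumes `replies` is non-empty.
inline replica_reply reconcile(const std::vector<replica_reply>& replies) {
    return *std::max_element(
        replies.begin(), replies.end(),
        [](const replica_reply& a, const replica_reply& b) {
            return a.write_timestamp < b.write_timestamp;
        });
}
```

As long as at least one queried replica returns the newer write, the reconciled result is fresh; the stale value surfaces only when every queried replica returns it.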

Fixes #16759

michoecho · 2024-02-02T20:15:28Z

Affects all versions since 3.3.

scylladb-promoter · 2024-02-02T22:07:11Z

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest
✅ - Unit Tests

Build Details:

Duration: 1 hr 50 min
Builder: spider1.cloudius-systems.com

tgrabiec

Good catch. We should add some kind of unit test for this scenario.

avikivity · 2024-02-04T09:20:10Z

Very good. Please follow up with a unit test or an improvement to the randomized stress tests.

In addition we should have integration tests with RF=1 to prevent read repair from covering up such bugs. @roydahan let's add something to S-C-T scenarios.

bhalevy · 2024-02-04T12:07:06Z

> Good catch. We should add some kind of unit test for this scenario.

Agreed (on both :)).
@michoecho, how did you reproduce the issue?

Commit referencing this pull request: Fixes #16759, Closes #17138 (cherry picked from commit ed98102)

michoecho · 2024-02-05T10:53:25Z

> @michoecho, how did you reproduce the issue?

@bhalevy Reproduce? From the issue report and from the source code of janusgraph, it appeared that the problem happens even with just basic inserts and selects. Given the reported circumstances, both my and @tgrabiec's intuition was that it involved a race between cache population and cache update. So I wrote a workload which attempts to cause such a race by repeating a scenario of the form: insert, wait for several seconds (the duration of a flush), and select. This turned out to be enough.

I had a list of other ideas to try, but the first one worked.

mykaul · 2024-02-19T11:49:53Z

I think it was backported everywhere and we can remove the label, no?

mykaul · 2024-03-13T12:17:21Z

> I think it was backported everywhere and we can remove the label, no?

@michoecho - can we remove the label? Anything else missing?

michoecho · 2024-03-13T12:21:15Z

> I think it was backported everywhere and we can remove the label, no?
>
> @michoecho - can we remove the label? Anything else missing?

We can remove the label.

michoecho requested a review from tgrabiec as a code owner February 2, 2024 20:14

michoecho requested a review from avikivity February 2, 2024 20:15

michoecho added the backport/5.2 and backport/5.4 labels Feb 2, 2024

tgrabiec approved these changes Feb 3, 2024

scylladb-promoter closed this in ed98102 Feb 4, 2024

mykaul added the Backport candidate label Feb 7, 2024

michoecho mentioned this pull request Feb 7, 2024

row_cache: test cache consistency during multi-partition cache updates #17208

Merged

michoecho removed the Backport candidate, backport/5.2, and backport/5.4 labels Mar 13, 2024
