Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

row_cache: update _prev_snapshot_pos even if apply_to_incomplete() is preempted #17138

Closed
wants to merge 1 commit into from

Conversation

michoecho
Copy link
Contributor

Commit e81fc1f accidentally broke the control flow of row_cache::do_update().

Before that commit, the body of the loop was wrapped in a lambda. Thus, to break out of the loop, return was used.

The bad commit removed the lambda, but didn't update the return accordingly. Thus, since the commit, the statement doesn't just break out of the loop as intended, but also skips the code after the loop, which updates _prev_snapshot_pos to reflect the work done by the loop.

As a result, whenever apply_to_incomplete() (the updater) is preempted, do_update() fails to update _prev_snapshot_pos. It remains in a stale state, until do_update() runs again and either finishes or is preempted outside of updater.

If we read a partition processed by do_update() but not covered by _prev_snapshot_pos, we will read stale data (from the previous snapshot), which will be remembered in the cache as the current data.

This results in outdated data being returned by the replica. (And perhaps in something worse if range tombstones are involved. I didn't investigate this possibility in depth).

Note: for queries with CL>1, occurences of this bug are likely to be hidden by reconciliation, because the reconciled query will only see stale data if the queried partition is affected by the bug on on all queried replicas at the time of the query.

Fixes #16759

… preempted

Commit e81fc1f accidentally broke the control
flow of row_cache::do_update().

Before that commit, the body of the loop was wrapped in a lambda.
Thus, to break out of the loop, `return` was used.

The bad commit removed the lambda, but didn't update the `return` accordingly.
Thus, since the commit, the statement doesn't just break out of the loop as
intended, but also skips the code after the loop, which updates `_prev_snapshot_pos`
to reflect the work done by the loop.

As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted,
`do_update()` fails to update `_prev_snapshot_pos`. It remains in a
stale state, until `do_update()` runs again and either finishes or
is preempted outside of `updater`.

If we read a partition processed by `do_update()` but not covered by
`_prev_snapshot_pos`, we will read stale data (from the previous snapshot),
which will be remembered in the cache as the current data.

This results in outdated data being returned by the replica.
(And perhaps in something worse if range tombstones are involved.
I didn't investigate this possibility in depth).

Note: for queries with CL>1, occurences of this bug are likely to be hidden
by reconciliation, because the reconciled query will only see stale data if
the queried partition is affected by the bug on on *all* queried replicas
at the time of the query.

Fixes scylladb#16759
@michoecho
Copy link
Contributor Author

Affects all versions since 3.3.

@michoecho michoecho added backport/5.2 Issues that should be backported to 5.2 branch once they'll be fixed backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed labels Feb 2, 2024
@scylladb-promoter
Copy link
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Container Test
✅ - dtest
✅ - Unit Tests

Build Details:

  • Duration: 1 hr 50 min
  • Builder: spider1.cloudius-systems.com

Copy link
Contributor

@tgrabiec tgrabiec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. We should add some kind of unit test for this scenario.

@avikivity
Copy link
Member

Very good. Please follow up with a unit test or an improvement to the randomized stress tests.

In addition we should have integration tests with RF=1 to prevent read repair from covering up such bugs. @roydahan let's add something to S-C-T scenarios.

@bhalevy
Copy link
Member

bhalevy commented Feb 4, 2024

Good catch. We should add some kind of unit test for this scenario.

Agreed (on both :)).
@michoecho, how did you reproduce the issue?

avikivity pushed a commit that referenced this pull request Feb 4, 2024
… preempted

Commit e81fc1f accidentally broke the control
flow of row_cache::do_update().

Before that commit, the body of the loop was wrapped in a lambda.
Thus, to break out of the loop, `return` was used.

The bad commit removed the lambda, but didn't update the `return` accordingly.
Thus, since the commit, the statement doesn't just break out of the loop as
intended, but also skips the code after the loop, which updates `_prev_snapshot_pos`
to reflect the work done by the loop.

As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted,
`do_update()` fails to update `_prev_snapshot_pos`. It remains in a
stale state, until `do_update()` runs again and either finishes or
is preempted outside of `updater`.

If we read a partition processed by `do_update()` but not covered by
`_prev_snapshot_pos`, we will read stale data (from the previous snapshot),
which will be remembered in the cache as the current data.

This results in outdated data being returned by the replica.
(And perhaps in something worse if range tombstones are involved.
I didn't investigate this possibility in depth).

Note: for queries with CL>1, occurences of this bug are likely to be hidden
by reconciliation, because the reconciled query will only see stale data if
the queried partition is affected by the bug on on *all* queried replicas
at the time of the query.

Fixes #16759

Closes #17138

(cherry picked from commit ed98102)
avikivity pushed a commit that referenced this pull request Feb 4, 2024
… preempted

Commit e81fc1f accidentally broke the control
flow of row_cache::do_update().

Before that commit, the body of the loop was wrapped in a lambda.
Thus, to break out of the loop, `return` was used.

The bad commit removed the lambda, but didn't update the `return` accordingly.
Thus, since the commit, the statement doesn't just break out of the loop as
intended, but also skips the code after the loop, which updates `_prev_snapshot_pos`
to reflect the work done by the loop.

As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted,
`do_update()` fails to update `_prev_snapshot_pos`. It remains in a
stale state, until `do_update()` runs again and either finishes or
is preempted outside of `updater`.

If we read a partition processed by `do_update()` but not covered by
`_prev_snapshot_pos`, we will read stale data (from the previous snapshot),
which will be remembered in the cache as the current data.

This results in outdated data being returned by the replica.
(And perhaps in something worse if range tombstones are involved.
I didn't investigate this possibility in depth).

Note: for queries with CL>1, occurences of this bug are likely to be hidden
by reconciliation, because the reconciled query will only see stale data if
the queried partition is affected by the bug on on *all* queried replicas
at the time of the query.

Fixes #16759

Closes #17138

(cherry picked from commit ed98102)
@michoecho
Copy link
Contributor Author

michoecho commented Feb 5, 2024

@michoecho, how did you reproduce the issue?

@bhalevy Reproduce? From the issue report and from the source code of janusgraph, it appeared that the problem is happening even with just basic inserts and selects. And given the reported circumstances, both mine and @tgrabiec's intuition was that this felt like it involved a race between cache population and cache update. So I wrote a workload which attempts to cause such a race, by spamming a scenario of the form: insert, wait for several seconds (the duration of flush) and select. And this turned out to be enough.

I had a list of other ideas to try, but the first one worked.

syuu1228 pushed a commit to syuu1228/scylla that referenced this pull request Feb 19, 2024
… preempted

Commit e81fc1f accidentally broke the control
flow of row_cache::do_update().

Before that commit, the body of the loop was wrapped in a lambda.
Thus, to break out of the loop, `return` was used.

The bad commit removed the lambda, but didn't update the `return` accordingly.
Thus, since the commit, the statement doesn't just break out of the loop as
intended, but also skips the code after the loop, which updates `_prev_snapshot_pos`
to reflect the work done by the loop.

As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted,
`do_update()` fails to update `_prev_snapshot_pos`. It remains in a
stale state, until `do_update()` runs again and either finishes or
is preempted outside of `updater`.

If we read a partition processed by `do_update()` but not covered by
`_prev_snapshot_pos`, we will read stale data (from the previous snapshot),
which will be remembered in the cache as the current data.

This results in outdated data being returned by the replica.
(And perhaps in something worse if range tombstones are involved.
I didn't investigate this possibility in depth).

Note: for queries with CL>1, occurences of this bug are likely to be hidden
by reconciliation, because the reconciled query will only see stale data if
the queried partition is affected by the bug on on *all* queried replicas
at the time of the query.

Fixes scylladb#16759

Closes scylladb#17138
@mykaul
Copy link
Contributor

mykaul commented Feb 19, 2024

I think it was backported everywhere and we can remove the label, no?

@mykaul
Copy link
Contributor

mykaul commented Mar 13, 2024

I think it was backported everywhere and we can remove the label, no?

@michoecho - can we remove the label? Anything else missing?

@michoecho
Copy link
Contributor Author

I think it was backported everywhere and we can remove the label, no?

@michoecho - can we remove the label? Anything else missing?

We can remove the label.

@michoecho michoecho removed Backport candidate backport/5.2 Issues that should be backported to 5.2 branch once they'll be fixed backport/5.4 Issues that should be backported to 5.4 branch once they'll be fixed labels Mar 13, 2024
dgarcia360 pushed a commit to dgarcia360/scylla that referenced this pull request Apr 30, 2024
… preempted

Commit e81fc1f accidentally broke the control
flow of row_cache::do_update().

Before that commit, the body of the loop was wrapped in a lambda.
Thus, to break out of the loop, `return` was used.

The bad commit removed the lambda, but didn't update the `return` accordingly.
Thus, since the commit, the statement doesn't just break out of the loop as
intended, but also skips the code after the loop, which updates `_prev_snapshot_pos`
to reflect the work done by the loop.

As a result, whenever `apply_to_incomplete()` (the `updater`) is preempted,
`do_update()` fails to update `_prev_snapshot_pos`. It remains in a
stale state, until `do_update()` runs again and either finishes or
is preempted outside of `updater`.

If we read a partition processed by `do_update()` but not covered by
`_prev_snapshot_pos`, we will read stale data (from the previous snapshot),
which will be remembered in the cache as the current data.

This results in outdated data being returned by the replica.
(And perhaps in something worse if range tombstones are involved.
I didn't investigate this possibility in depth).

Note: for queries with CL>1, occurences of this bug are likely to be hidden
by reconciliation, because the reconciled query will only see stale data if
the queried partition is affected by the bug on on *all* queried replicas
at the time of the query.

Fixes scylladb#16759

Closes scylladb#17138
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inserted data only becomes available after restart
6 participants