
Fix a few rare bugs in row cache #15945

Merged: 8 commits from fix_15483 into master, Nov 17, 2023
Conversation

michoecho
Contributor

This is a loose collection of fixes to rare row cache bugs flushed out by running test_concurrent_reads_and_eviction several million times. See individual commits for details.

Fixes #15483

@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
❌ - Sanity Tests
✅ - Unit Tests

Failed Tests (1/21259):

Build Details:

@bhalevy
Member

bhalevy commented Nov 5, 2023

🔴 CI State: FAILURE

✅ - Build ❌ - Sanity Tests ✅ - Unit Tests

Failed Tests (1/21259):

Build Details:

#15918

@nyh (Contributor) left a comment

All the patches look like legal C++ to me ;-) But unfortunately I'm not familiar enough with how the Scylla caching code works to know if any of these changes are correct. Let's get someone who is familiar with this code (@tgrabiec ? @avikivity ? @denesb ?) to review it.

I would have liked to see a unit test that reproduces the need for every one of these changes, as this would be an even better "proof" that these fixes are correct than I can ever hope to achieve by reviewing the patches. I realize you discovered these issues through randomized testing, which is great, but surely you had to be able to reproduce each one of these issues - either by writing a new test, or by picking a specific random seed for the randomized test? In that case, we should know what these new tests/seeds are, so we can run them every time to ensure regressions do not happen in these cases, which we now know are not covered by any other test (otherwise, we would have found these bugs already).

    if (!e.is_linked()) [[unlikely]] {
        _snp->tracker()->touch(e);
    }
}
Contributor

I know all of these issues were found by a randomized test, but could we add a separate regression test for each one?

Another thing you can do is to keep a list of random seeds that must be run on every run of the randomized test because they reproduced known bugs. The problem with this approach, though, is that you have to trust that the random number algorithm will never change, and I don't see how you can trust this unless you add your own random number generator function, and then rerun your million random tests to find reproducers for each of these bugs :-(

Member

Those random seeds would be invalidated by any change to the algorithm, even if we made preemption choices deterministic.

Member

What we might do is add assertions (perhaps not in release mode) about invariants, so that the failure happens earlier. Thus, a failure that would otherwise not be detected (because we didn't read from the bad cache entry) would still cause a crash.
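
(For illustration: a minimal sketch of such a debug-only invariant check, using placeholder types rather than Scylla's actual cache structures. The invariant shown is the one from this PR's first fix.)

// Sketch only: the struct and the check are illustrative placeholders,
// not Scylla's real types.
#include <cassert>

struct cache_entry {
    bool continuous;     // the interval ending at this entry is continuous
    bool linked_in_lru;  // the entry is linked into the eviction (LRU) list
};

inline void check_entry_invariants(const cache_entry& e) {
#ifndef NDEBUG
    // Example invariant: an unlinked (evicted) entry must never be marked
    // continuous. Asserting it crashes at the point of violation, even if
    // the bad entry is never read afterwards.
    assert(e.linked_in_lru || !e.continuous);
#endif
}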

@michoecho
Contributor Author

michoecho commented Nov 5, 2023

@nyh

Let's get someone who is familiar with this code (@tgrabiec ? @avikivity ? @denesb ?) to review it.

@tgrabiec is the reviewer for row cache. I also requested @denesb because he spent some time debugging these issues before I took over.

I realize you discovered these issues through randomized testing, which is great, but surely you had to be able to reproduce each one of these issues - either by writing a new test, or by picking a specific random seed for the randomized test? In that case, we should know what these new tests/seeds are, so we can run them every time to ensure regressions do not happen in these cases, which we now know are not covered by any other test (otherwise, we would have found these bugs already).

This test is nondeterministic. It performs many concurrent operations (reads, updates, evictions) on the cache, and its outcome depends on how exactly they are preempted and ordered by the reactor.

To debug the test, I had to make it deterministic: I modified Seastar so that preemption wasn't timing-dependent, but pseudorandom. Only then could I get the same execution every time.

But this technique only works for debugging, where you don't change any logic between runs. It can't be used for regression tests. As soon as you make some nontrivial change to the code, the seed will no longer be testing the case you want to test.
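
(For illustration, the pseudorandom-preemption technique might look roughly like the sketch below. This is not the actual Seastar modification; should_preempt() is a hypothetical stand-in for Seastar's timing-based preemption check.)

// Standalone sketch: seed-driven preemption decisions instead of
// timing-dependent ones.
#include <cstdint>
#include <random>

class pseudorandom_preempter {
    std::mt19937 _rng;
    std::bernoulli_distribution _coin{0.01}; // preempt on ~1% of checks
public:
    explicit pseudorandom_preempter(uint32_t seed) : _rng(seed) {}

    // With a fixed seed, the same sequence of preemption decisions is
    // produced on every run, so a failing interleaving can be replayed.
    bool should_preempt() { return _coin(_rng); }
};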

I would have liked to see a unit test that reproduces the need for every one of these changes, as this would be an even better "proof" that these fixes are correct than I can ever hope to achieve by reviewing the patches.

Unfortunately most of the bugs fixed by this PR are not testable. They require a specific state which can only be arrived at when preemptions happen at just the right points.

@michoecho
Contributor Author

Unfortunately most of the bugs fixed by this PR are not testable. They require a specific state which can only be arrived at when preemptions happen at just the right points.

Well, of course we could construct the specific state (at the point right before the buggy function runs) manually. But such a test would be very fragile.

Another thing you can do is to keep a list of random seeds that must be run on every run of the randomized test because they reproduced known bugs.

We could also run the test 10 million times before releases. (For reference: the rarest of the bugs fixed by this PR reproduced about once in ~300,000 runs).

But there is only so much you can do with hand-written or random tests. The space of states to test is just too big. I wish we had a more systematic approach.

@nyh
Contributor

nyh commented Nov 5, 2023

Let's get someone who is familiar with this code (@tgrabiec ? @avikivity ? @denesb ?) to review it.

@tgrabiec is the reviewer for row cache. I also requested @denesb because he spent some time debugging these issues before I took over.

Yes, I know, this is why I asked them again. Pinging people on github isn't rude - it's the only real way to remind people if you want them to review your code. Because people don't normally (at least, I don't...) perform a query of all open PRs where they were mentioned in the past. Of course, it's also possible that Tomek and Botond have much better memory than I do, and remember they need to review your patch and just didn't get around to it yet, in which case - sorry for the spam.

To debug the test, I had to make it deterministic: I modified Seastar so that preemption wasn't timing-dependent, but pseudorandom. Only then could I get the same execution every time.

But this technique only works for debugging, where you don't change any logic between runs. It can't be used for regression tests. As soon as you make some nontrivial change to the code, the seed will no longer be testing the case you want to test.

I see. Thanks. Did all these bugs depend on Seastar preemption, or just some of them? For bugs that can only be reproduced with a particular sequence of preemptions, maybe we could add some hooks in normal Seastar (without hacking Seastar) for disabling and forcing preemption, and then somehow use those hooks in unit tests?

Unfortunately most of the bugs fixed by this PR are not testable. They require a specific state which can only be arrived at when preemptions happen at just the right points.

Then how can you prove - or at least did you "prove" to yourself - that your fixes really fixed those bugs which you found?
How can anyone in the future ever modify the code which you change here without worrying he might re-introduce these bugs and not be aware that it happens?

I have another idea. You said that you discovered all these bugs using a randomized test. How long did this run take? Maybe we need to include this randomized test - with its million random attempts (and not just a handful) - in our battery of long-running tests - which we don't have to run on every CI run, but maybe we need to re-run at least once a week or something?

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Sanity Tests
✅ - Unit Tests

Build Details:

@michoecho
Contributor Author

michoecho commented Nov 5, 2023

Of course, it's also possible that Tomek and Botond have much better memory than I do, and remember they need to review your patch and just didn't get around to it yet, in which case - sorry for the spam.

This PR was posted on Friday evening, and now it's Sunday evening.

Did all these bugs depend on Seastar preemption, or just some of them?

I think the bug in partition_version: fix violation of "older versions are evicted first" during schema upgrades is the only one that can be reproduced without depending on specific preemption sequences.

For bugs that can only be reproduced with a particular sequence of preemptions, maybe we could add some hooks in normal Seastar (without hacking Seastar) for disabling and forcing preemption, and then somehow use those hooks in unit tests?

I don't see any sane way of solving the problem with hooks and injections. Yes, if you add enough of them, you can force preemption to work exactly as you need, but is this something worth doing? How much effort is reasonable to be put into writing tests designed extremely specifically to reproduce an already-fixed bug?

Then how can you prove - or at least did you "prove" to yourself - that your fixes really fixed those bugs which you found?

  1. By understanding the bugs (via detailed observation — through logs and debuggers — of the failing random test cases).
  2. By validating that the random test isn't failing over millions of runs (i.e., orders of magnitude more than the mean time between failures before the patch).

How can anyone in the future ever modify the code which you change here without worrying he might re-introduce these bugs and not be aware that it happens?

Without worrying? He can't.

And even if you could write a test that prevents specifically these bugs from being reintroduced, he would still have to worry about introducing new bugs in the other 1000 places where they can be introduced.

I have another idea. You said that you discovered all these bugs using a randomized test. How long did this run take?

I ran it with shell loops (i.e. the test binary had to do its setup and teardown every time — so perhaps it can be sped up), and it was doing ~2 attempts per second per core. So a million attempts on a 16-core machine took about 1000000 / 16 / (2/s) ≈ 31250 s ≈ 9 hours.

Maybe we need to include this randomized test - with its million random attempts (and not just a handful) - in our battery of long-running tests - which we don't have to run on every CI run, but maybe we need to re-run at least once a week or something?

There's probably no reason to run it periodically. Row cache code doesn't change every week. But it could be a good idea to do that when testing patchsets relevant to the cache.

@denesb
Contributor

denesb commented Nov 6, 2023

@nyh the randomized test which revealed these bugs has been failing intermittently for some time now. Because it is run in CI, it gets a lot of runtime spread over many PRs, and it has been telling us that there are bugs in the cache.
All this is to say -- for bugs found by looking into failures produced by a [flaky] test, I don't think we need separate regression tests -- there is already a test failing before and passing after.

@nyh
Contributor

nyh commented Nov 6, 2023

@nyh the randomized test which revealed these bugs has been failing intermittently for some time now. Because it is run in CI, it gets a lot of runtime spread over many PRs, and it has been telling us that there are bugs in the cache. All this is to say -- for bugs found by looking into failures produced by a [flaky] test, I don't think we need separate regression tests -- there is already a test failing before and passing after.

On one hand, you're right. On the other hand, not entirely:

First of all, we (or at least I) don't have a good idea of how frequently or infrequently this test fails in actual Jenkins runs - if it only failed once a month, how would we ever notice if it fails again? @michoecho had to run this test in a loop for many hours to see some of these bugs - it is possible that we never saw them happen in any Jenkins run.

Second, and more importantly, as a reviewer I always suspect a fix that doesn't have a regression test - a test that fails before the fix and passes after it. Without such a regression test it's hard to know if a certain fix, which "looks" reasonable, actually fixes anything. This PR has many separate changes; how does its author or its reviewer know all of them are necessary or correct? But I agree that if @michoecho ran his many-hour loop again after these patches, and it passed every time, it increases the confidence that at least this PR is correct. It won't prevent other people from breaking this code in the future and then not running the many-hour loop, but I guess nothing is perfect.

@bhalevy
Member

bhalevy commented Nov 6, 2023

Also, we should consider the benefit of merging these fixes despite imperfect testing (and the risk of them introducing something new that our current test doesn't cover, even with 1M runs), vs. the risk of exposure if we don't merge those fixes, beyond the test failing from time to time.

@nyh
Contributor

nyh commented Nov 6, 2023

Also, we should consider the benefit of merging these fixes despite imperfect testing (and the risk of them introducing something new that our current test doesn't cover, even with 1M runs), vs. the risk of exposure if we don't merge those fixes, beyond the test failing from time to time.

I agree. This is why I asked that someone who is intimately familiar with this code (i.e., not me) review this PR before I merge it, because even if I don't fully trust the testing, a person who is familiar with this code can look at the fixes and say "wow, that's right! great catch!". In a less optimistic case, such a reviewer may also look at the fix and say "no, this doesn't do what you think, if you didn't see the bug again after this fix you were just lucky".

I also agree that if @michoecho personally ran his 1-million-time test before and after his patches, and it failed many times before and not even once after, then his patches are most likely all correct, and I'm just making a mountain out of a mole-hill :-)

@bhalevy
Member

bhalevy commented Nov 6, 2023

@tgrabiec @denesb please review and ack

@denesb
Contributor

denesb commented Nov 6, 2023

I also agree that if @michoecho personally ran his 1-million-time test before and after his patches, and it failed many times before and not even once after, then his patches are most likely all correct, and I'm just making a mountain out of a mole-hill :-)

The test failed quite often; I could reproduce a failure with a mere --repeat=20. After making some changes to the test, it would fail on every other run. @michoecho maybe we could have a variant of this test with the equivalent of debuggable=true, which really translates to a smaller list of possible key values, producing more collisions and interesting overlaps. This greatly improved the chances of reproducing bugs when debugging this.
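
(A sketch of the key-space shrinking idea; the function name and pool sizes are invented for illustration.)

// Drawing keys from a small pool makes concurrent reads, updates and
// evictions collide on the same rows far more often.
#include <cstdint>
#include <random>
#include <string>

std::string make_random_key(std::mt19937& rng, bool debuggable) {
    // Tiny pool in the "debuggable" variant, large pool otherwise.
    const uint32_t pool_size = debuggable ? 8 : 100000;
    std::uniform_int_distribution<uint32_t> dist(0, pool_size - 1);
    return "key" + std::to_string(dist(rng));
}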

@denesb (Contributor) left a comment

LGTM, but I cannot claim to have understood all the patches. In fact, as soon as I read "older versions are evicted first", my head starts to shut down.

We need a review from @tgrabiec.

@avikivity
Member

LGTM, but I cannot claim to have understood all the patches. In fact, as soon as I read "older versions are evicted first", my head starts to shut down.

We need a review from @tgrabiec.

We need to simplify MVCC

@@ -510,7 +510,7 @@ bool cache_flat_mutation_reader::ensure_population_lower_bound() {
         return false;
     }

-    if (cmp(cur.position(), _last_row.position()) != 0) {
+    if (cmp(cur.table_position(), _last_row.position()) != 0) {
         return false;
Member

This one looks as if a simple unit test can be constructed for it (I'm not saying that creating the unit test would be simple, just that in the end it would probably be a large sequence of range tombstones or something).

@avikivity
Member

The conclusion of my review is that @tgrabiec must review it.

@bhalevy
Member

bhalevy commented Nov 8, 2023

ping @tgrabiec

@@ -696,7 +696,7 @@ bool cache_flat_mutation_reader::maybe_add_to_cache(const range_tombstone_change
     if (ensure_population_lower_bound()) {
         // underlying may emit range_tombstone_change fragments with the same position.
         // In such case, the range to which the tombstone from the first fragment applies is empty and should be ignored.
-        if (q_cmp(_last_row.position(), it->position()) < 0) {
+        if (q_cmp(_last_row.position(), it->position()) != 0) {
@tgrabiec (Contributor) commented Nov 8, 2023

I wonder why the operator changed. Did you find a case where _last_row.position() is > it->position()? That would be incorrect as it violates monotonicity, and the code below doesn't handle it properly, so the old comparison may actually be more correct.

Contributor

Oh, we're using query-domain comparator for table-domain positions here. For equality, it should give the same result. But it's a bit confusing, so a comment would be good.

Contributor Author

You got it right: < only works as intended for non-reverse queries; != should work as intended for both types of queries.

I will add a comment about this.
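
(A toy model of the point: reversing a comparator flips strict orderings but preserves equality, so a != 0 check is direction-agnostic while < 0 is not. This is a simplification, not Scylla's position_in_partition comparator.)

#include <cassert>

int cmp(int a, int b) { return (a < b) ? -1 : (a > b) ? 1 : 0; }
// Stands in for the query-domain comparator of a reverse read.
int reverse_cmp(int a, int b) { return cmp(b, a); }

int main() {
    assert(cmp(1, 2) < 0 && reverse_cmp(1, 2) > 0);       // "< 0" flips meaning
    assert((cmp(1, 2) != 0) == (reverse_cmp(1, 2) != 0)); // "!= 0" does not
    assert((cmp(2, 2) != 0) == (reverse_cmp(2, 2) != 0));
    return 0;
}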

Contributor Author

Added a comment.

mutation/mutation_partition_v2.cc (review thread resolved)
@michoecho force-pushed the fix_15483 branch 2 times, most recently from 9e107e4 to c10d6a5 on November 8, 2023 19:02
@scylladb-promoter
Contributor

🔴 CI State: FAILURE

✅ - Build
✅ - Sanity Tests
❌ - Unit Tests

Failed Tests (6/21815):

Build Details:

@bhalevy
Member

bhalevy commented Nov 16, 2023

@michoecho please rebase to get a clean baseline for re-testing.

@scylladb-promoter
Contributor

🔴 CI State: ABORTED

❌ - Build

Build Details:

It is illegal for an unlinked last dummy to be continuous
(this is how last dummies respect the "older versions are evicted first" rule),
but it is technically possible for an unlinked last dummy to be
made continuous by read_from_underlying. This commit fixes that.

Found by row_cache_test.

The bug is very unlikely to happen in practice because the relevant
rows_entry is bumped in LRU before read_from_underlying starts.
For the bug to manifest, the entry has to fall down to the end of the
LRU list and be evicted before read_from_underlying() ends.
Usually it takes several minutes for an entry to fall out of LRU,
and read_from_underlying takes maybe a few hundred milliseconds.

And even if the above happened, a new version would still need to appear,
and it would need to have its continuous last dummy evicted
before it's merged.
…r_bound() with reverse reads

`_last_row` is in table schema, while `cur.position()` is in query schema
(which is either equal to table schema, or its reverse).

Thus, the comparison affected by this patch doesn't work as intended.
In reverse reads, the check will pass even if `_last_row` has the same key,
but opposite bound weight to `cur`, which will lead to the dummy being inserted
at the wrong position, which can e.g. wrongly extend a range tombstone.

Fix that.
…e reads

`_last_row` is in table schema, but it is sometimes compared with positions in
query schema. This leads to unexpected behaviour when reverse reads
are used.
The previous patch fixed one such case, which was affecting correctness.

As far as I can tell, the three cases affected by this patch aren't
a correctness problem, but can cause some intervals to fail to be made continuous.
(And they won't be cached even if the same read is repeated many times).
…in_latest() with reverse reads

The FIXME comment claims that setting continuity isn't very important in this
place, but in fact this is just wrong.

If two calls to read_from_underlying() get into a race, the one which finishes
later can call ensure_entry_in_latest() on a position which lies inside a
continuous interval in the newest version. If we don't take care to preserve
the total continuity of the version, this can punch a hole in the continuity of the
newest version, potentially reverting the affected interval to an older version.

Fix that.
…tion races with reverse reads

Cache population routines insert new row entries.

In non-reverse reads, the new entries (except for the lower bound of the query
range) are filled with the correct continuity and range tombstones immediately
after insertion, because that information has already arrived from underlying
when the entries are inserted.

But in reverse reads, it's the interval *after* the newly-inserted entry
that's made continuous. The continuity information in the new entries isn't
filled. When two population routines race, the one which comes later can
punch holes in the continuity left by the first routine, which can break
the "older versions are evicted first" rule and revert the affected
interval to an older version.

To fix this, we must make sure that inserting new row entries doesn't
change the total continuity of the version.
…tinuity()

To reflect the final range tombstone change in the populated range,
maybe_update_continuity() might insert a dummy at `before_key(_next_row.table_position())`.

But the relevant logic breaks down in the special case when that position is
equal to `_last_row.position()`. The code treats the dummy as a part of
the (_last_row, _next_row) range, but this is wrong in the special case.

This can lead to inconsistent state. For example, `_last_row` can be wrongly made
continuous, or its range tombstone can be wrongly nulled.

The proposed fix is to only modify the dummy if it was actually inserted.
If it had been inserted beforehand (which is true in the special case, because
of the `ensure_population_lower_bound()` call earlier), then it's already in a
valid state and doesn't need changes.
…in ensure_population_lower_bound()

ensure_population_lower_bound() guarantees that _last_row is valid or null.

However, it fails to provide this guarantee in the special rare case when
`_population_range_starts_before_all_rows == true` and _last_row is non-null.

(This can happen in practice if there is a dummy at before_all_clustering_rows
and eviction makes the `(before_all_clustering_rows, ...)` interval
discontinous. When the interval is read in this state, _last_row will point to
the dummy, while _population_range_starts_before_all_rows will still be true.)

In this special case, `ensure_population_lower_bound()` does not refresh
`_last_row`, so it can be non-null but invalid after the call.
If it is accessed in this state, undefined behaviour occurs.
This was observed to happen in a test,
in the `read_from_underlying() -- maybe_drop_last_entry()` codepath.

The proposed fix is to make the meaning of _population_range_starts_before_all_rows
closer to its real intention. Namely: it's supposed to handle the special case of a
left-open interval, not the case of an interval starting at -inf.
…" during schema upgrades

A schema upgrade appends a MVCC version B after an existing version A.

The last dummy in B is added to the front of LRU,
so it will be evicted after the entries in A.

This alone doesn't quite violate the "older versions are evicted first" rule,
because the new last dummy carries no information. But apply_monotonically
generally assumes that entries on the same position have the obvious
eviction order, even if they carry no information. Thus, after the merge,
the rule can become broken.

The proposed fix is as follows:

- In the case where A is merged into B, the merged last dummy
  inherits the link of A.
- The merging of B into anything is prevented until its merge with A is finished.

This is relatively hacky, because it still involves a state that
goes against some natural expectations granted by the "older versions..."
rule. A less hacky fix would be to ensure that the new dummy is inserted
into a proper place in the eviction order to begin with.

Or, better yet, we could eliminate the rule altogether.
Aside from being very hard to maintain, it also prevents the introduction
of any eviction algorithm other than LRU.
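
(For readers unfamiliar with the rule discussed throughout these commit messages, here is a toy model of "older versions are evicted first": eviction always removes entries from the oldest version that still has any, so a newer version's continuity never silently falls back to stale data. This is a simplification, not Scylla's MVCC.)

#include <cassert>
#include <deque>
#include <vector>

struct toy_version { std::vector<int> entries; };
using toy_snapshot = std::deque<toy_version>; // front = oldest version

// Evict one entry, respecting the rule: take it from the oldest version
// that still holds anything.
void evict_one(toy_snapshot& snap) {
    for (auto& v : snap) {
        if (!v.entries.empty()) {
            v.entries.pop_back();
            return;
        }
    }
}

int main() {
    toy_snapshot snap{{{1, 2}}, {{3}}}; // version 0 (older), version 1 (newer)
    evict_one(snap);                    // must come from version 0, not 1
    assert(snap.front().entries.size() == 1 && snap.back().entries.size() == 1);
    return 0;
}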
@michoecho
Contributor Author

Rebased and restarted CI.

@scylladb-promoter
Contributor

🟢 CI State: SUCCESS

✅ - Build
✅ - Unit Tests
✅ - Sanity Tests

Build Details:

@bhalevy
Member

bhalevy commented Nov 16, 2023

@scylladb/scylla-maint pls merge

@scylladb-promoter merged commit 6bcf3ac into scylladb:master on Nov 17, 2023
3 checks passed
denesb pushed a commit that referenced this pull request Jan 15, 2024
This is a loose collection of fixes to rare row cache bugs flushed out by running test_concurrent_reads_and_eviction several million times. See individual commits for details.

Fixes #15483

Closes #15945

* github.com:scylladb/scylladb:
  partition_version: fix violation of "older versions are evicted first" during schema upgrades
  cache_flat_mutation_reader: fix a broken iterator validity guarantee in ensure_population_lower_bound()
  cache_flat_mutation_reader: fix a continuity loss in maybe_update_continuity()
  cache_flat_mutation_reader: fix continuity losses during cache population races with reverse reads
  partition_snapshot_row_cursor: fix a continuity loss in ensure_entry_in_latest() with reverse reads
  cache_flat_mutation_reader: fix some cache mispopulations with reverse reads
  cache_flat_mutation_reader: fix a logic bug in ensure_population_lower_bound() with reverse reads
  cache_flat_mutation_reader: never make an unlinked last dummy continuous

(cherry picked from commit 6bcf3ac)
Successfully merging this pull request may close these issues.

test_concurrent_reads_and_eviction: Mutation read doesn't match any expected version