Fix a few rare bugs in row cache #15945
Conversation
🔴 CI State: FAILURE - Build Failed Tests (1/21259)
All the patches look like legal C++ to me ;-) But unfortunately I'm not familiar enough with how the Scylla caching code works to know if any of these changes are correct. Let's get someone who is familiar with this code (@tgrabiec ? @avikivity ? @denesb ?) to review it.
I would have liked to see a unit test that reproduces the need for every one of these changes, as this would be an even better "proof" that these fixes are correct than I can ever hope to achieve by reviewing the patches. I realize you discovered these issues through randomized testing, which is great, but surely you had to be able to reproduce each one of these issues - either by writing a new test - or by picking a specific random seed for the randomized test? In that case, we should know what these new tests/seeds are, so we can run them every time to ensure regressions do not happen in these cases which we now know are not being tested by any other test (otherwise, we would have known about these bugs already).
        if (!e.is_linked()) [[unlikely]] {
            _snp->tracker()->touch(e);
        }
    }
I know all of these issues were found by a randomized test, but could we add a separate regression test for each one?
Another thing you can do is to keep a list of random seeds that must be run on every run of the randomized test because they reproduced known bugs. The problem with this approach, though, is that you have to trust that the random number algorithm will never change, and I don't see how you can trust this unless you add your own random number generator function, and then rerun your million random tests to find reproducers for each of these bugs :-(
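The idea of owning the random number generator, so that stored reproducer seeds stay meaningful across toolchain changes, can be sketched like this (illustrative only, not Scylla code; `xorshift64` is just one small self-contained generator chosen for the example):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the suggestion above: pin the test's randomness to a tiny
// generator the test itself owns (xorshift64 here). Unlike the standard
// library's engines and distributions, its output for a given seed can
// never change between compiler or library versions, so a list of
// "known reproducer" seeds keeps reproducing the same byte stream.
struct xorshift64 {
    uint64_t state;
    explicit xorshift64(uint64_t seed) : state(seed ? seed : 1) {}
    uint64_t next() {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        return state;
    }
};
```

Of course, as noted above, this only pins the random stream; any change to how the test consumes that stream still invalidates the stored seeds.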
Those random seeds would be invalidated by any change to the algorithm, even if we made preemption choices deterministic.
What we might do is add assertions (perhaps not in release mode) about invariants, so that the failure happens earlier. Thus, a failure that would otherwise not be detected (because we didn't read from the bad cache entry) would still cause a crash.
@tgrabiec is the reviewer for row cache. I also requested @denesb because he spent some time debugging these issues before I took over.
This test is nondeterministic. It performs many concurrent operations (reads, updates, evictions) on the cache, and its outcome depends on how exactly they are preempted and ordered by the reactor. To debug the test, I had to make it deterministic: I modified Seastar so that preemption wasn't timing-dependent, but pseudorandom. Only then could I get the same execution every time. But this technique only works for debugging, where you don't change any logic between runs. It can't be used for regression tests. As soon as you make some nontrivial change to the code, the seed will no longer be testing the case you want to test.
Unfortunately most of the bugs fixed by this PR are not testable. They require a specific state which can only be arrived at when preemptions happen at just the right points.
Well, of course we could construct the specific state (at the point right before the buggy function runs) manually. But such a test would be very fragile.
We could also run the test 10 million times before releases. (For reference: the rarest of the bugs fixed by this PR were reproducing about once in ~300000 runs). But there is only so much you can do with hand-written or random tests. The space of states to test is just too big. I wish we had a more systematic approach.
Yes, I know, this is why I asked them again. Pinging people on github isn't rude - it's the only real way to remind people if you want them to review your code. Because people don't normally (at least, I don't...) perform a query of all open PRs where they were mentioned in the past. Of course, it's also possible that Tomek and Botond have much better memory than I do, and remember they need to review your patch and just didn't get around to it yet, in which case - sorry for the spam.
I see. Thanks. Did all these bugs depend on Seastar preemption - or just some of them? For bugs that really can only be reproduced with a particular sequence of preemptions, maybe we could add some hooks in normal Seastar (without hacking Seastar) for disabling and forcing preemption, and then somehow use those hooks in unit tests?
Then how can you prove - or at least did you "prove" to yourself - that your fixes really fixed those bugs which you found? I have another idea. You said that you discovered all these bugs using a randomized test. How long did this run take? Maybe we need to include this randomized test - with its million random attempts (and not just a handful) - in our battery of long-running tests, which we don't have to run on every CI run, but could re-run at least once a week or something?
🟢 CI State: SUCCESS ✅
This PR was posted on Friday evening, and now it's Sunday evening.
I think the bug in
I don't see any sane way of solving the problem with hooks and injections. Yes, if you add enough of them, you can force preemption to work exactly as you need, but is this something worth doing? How much effort is reasonable to be put into writing tests designed extremely specifically to reproduce an already-fixed bug?
Without worrying? He can't. And even if you could write a test that prevents specifically these bugs from being reintroduced, he would still have to worry about introducing new bugs in the other 1000 places where they can be introduced.
I ran it with shell loops (i.e. the test binary had to do its setup and teardown every time — so perhaps it can be sped up), and it was doing ~2 attempts per second per core. So a million attempts on a 16-core machine took about 1000000 / 16 / (2/s) = 9 hours.
There's probably no reason to run it periodically. Row cache code doesn't change every week. But it could be a good idea to do that when testing patchsets relevant to the cache.
@nyh the randomized test which revealed these bugs was failing intermittently for some time now. Because it is run in CI, it gets a lot of runtime spread over many PRs and it was telling us that there are bugs in the cache for some time now.
On one hand, you're right. On the other hand, not entirely: First of all, we (or at least I) don't have a good idea of how frequently or infrequently this test fails on actual Jenkins runs - if it only failed once a month, how will we ever notice if it fails again? @michoecho had to run this test in a loop for many hours to see some of these bugs - it is possible that we never saw them happen in any Jenkins run. Second, and more importantly, as a reviewer I always suspect a fix that doesn't have a regression test - a test that shows the failure before the fix, and passes after the fix. Without such a regression test it's hard to know if a certain fix, which "looks" reasonable, actually fixes anything. This PR has many separate changes; how does its author or its reviewer know all of them are necessary or correct? But I agree that if @michoecho ran his many-hour loop again after these patches, and it passed all of them, it increases the confidence that at least this PR is correct. It won't prevent other people from breaking this code in the future and then not running the many-hour loop, but I guess nothing is perfect.
Also, we should consider the benefit of merging these fixes despite imperfect testing (and the risk of them introducing something new that our current test doesn't cover, even with 1M runs), vs. the risk of exposure if we don't merge those fixes, beyond the test failing from time to time.
I agree. This is why I asked that someone who is intimately familiar with this code (i.e., not me) review this PR before I merge it, because even if I don't fully trust the testing, a person who is familiar with this code can look at the fixes and say "wow, that's right! great catch!". In a less optimistic case, such a reviewer may also look at the fix and say "no, this doesn't do what you think, if you didn't see the bug again after this fix you were just lucky". I also agree that if @michoecho personally ran his 1-million-time test before and after his patches, and it failed many times before and not even once after, then his patches are most likely all correct, and I'm just making a mountain out of a mole-hill :-)
The test failed quite often, I could reproduce a failure with a mere --repeat=20. After making some changes to the test, it would fail on every other run. @michoecho maybe we could have a variant of this test with the equivalent of debuggable=true, which really translates to a smaller list of possible key values, producing more collisions and interesting overlaps. This was greatly improving the chances of reproducing bugs when debugging this.
LGTM, but I cannot claim to have understood all the patches. In fact, as soon as I read "older versions are evicted first", my head starts to shut down.
We need a review from @tgrabiec.
We need to simplify MVCC
@@ -510,7 +510,7 @@ bool cache_flat_mutation_reader::ensure_population_lower_bound() {
         return false;
     }

-    if (cmp(cur.position(), _last_row.position()) != 0) {
+    if (cmp(cur.table_position(), _last_row.position()) != 0) {
         return false;
This one looks as if a simple unit test can be constructed for it (I'm not saying that creating the unit test would be simple, just that in the end it would probably be a large sequence of range tombstones or something).
The conclusion of my review is that @tgrabiec must review it.
ping @tgrabiec
@@ -696,7 +696,7 @@ bool cache_flat_mutation_reader::maybe_add_to_cache(const range_tombstone_change
     if (ensure_population_lower_bound()) {
         // underlying may emit range_tombstone_change fragments with the same position.
         // In such case, the range to which the tombstone from the first fragment applies is empty and should be ignored.
-        if (q_cmp(_last_row.position(), it->position()) < 0) {
+        if (q_cmp(_last_row.position(), it->position()) != 0) {
I wonder why the operator changed. Did you find a case where _last_row.position() is > it->position()? That would be incorrect as it violates monotonicity, and the code below doesn't handle it properly, so the old comparison may actually be more correct.
Oh, we're using query-domain comparator for table-domain positions here. For equality, it should give the same result. But it's a bit confusing, so a comment would be good.
You got it right: `<` only works as intended for non-reverse queries; `!=` should work as intended for both types of queries. I will add a comment about this.
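The reason equality survives reversal while strict comparison doesn't can be shown with a toy comparator pair (illustrative types only, not Scylla's actual comparators):

```cpp
#include <cassert>

// A three-way comparator in the "table" domain, and a "reverse query"
// comparator that negates it. Strict results (< 0, > 0) flip sign
// between the two domains, but a zero (equality) result is preserved,
// so a `!= 0` check behaves the same for forward and reverse reads
// while a `< 0` check does not.
struct table_cmp {
    int operator()(int a, int b) const {
        return a < b ? -1 : (a > b ? 1 : 0);
    }
};
struct reverse_query_cmp {
    table_cmp base;
    int operator()(int a, int b) const { return -base(a, b); }
};
```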
Added a comment.
Force-pushed from 9e107e4 to c10d6a5 (compare)
🔴 CI State: FAILURE - Build Failed Tests (6/21815)
@michoecho please rebase to get a clean baseline for re-testing.
🔴 CI State: ABORTED ❌
It is illegal for an unlinked last dummy to be continuous (this is how last dummies respect the "older versions are evicted first" rule), but it is technically possible for an unlinked last dummy to be made continuous by read_from_underlying. This commit fixes that. Found by row_cache_test. The bug is very unlikely to happen in practice because the relevant rows_entry is bumped in LRU before read_from_underlying starts. For the bug to manifest, the entry has to fall down to the end of the LRU list and be evicted before read_from_underlying() ends. Usually it takes several minutes for an entry to fall out of LRU, and read_from_underlying takes maybe a few hundred milliseconds. And even if the above happened, there still needs to appear a new version, which needs to have its continuous last dummy evicted before it's merged.
…r_bound() with reverse reads `_last_row` is in table schema, while `cur.position()` is in query schema (which is either equal to table schema, or its reverse). Thus, the comparison affected by this patch doesn't work as intended. In reverse reads, the check will pass even if `_last_row` has the same key, but opposite bound weight to `cur`, which will lead to the dummy being inserted at the wrong position, which can e.g. wrongly extend a range tombstone. Fix that.
…e reads `_last_row` is in table schema, but it is sometimes compared with positions in query schema. This leads to unexpected behaviour when reverse reads are used. The previous patch fixed one such case, which was affecting correctness. As far as I can tell, the three cases affected by this patch aren't a correctness problem, but can cause some intervals to fail to be made continuous. (And they won't be cached even if the same read is repeated many times).
…in_latest() with reverse reads The FIXME comment claims that setting continuity isn't very important in this place, but in fact this is just wrong. If two calls to read_from_underlying() get into a race, the one which finishes later can call ensure_entry_in_latest() on a position which lies inside a continuous interval in the newest version. If we don't take care to preserve the total continuity of the version, this can punch a hole in the continuity of the newest version, potentially reverting the affected interval to an older version. Fix that.
…tion races with reverse reads Cache population routines insert new row entries. In non-reverse reads, the new entries (except for the lower bound of the query range) are filled with the correct continuity and range tombstones immediately after insertion, because that information has already arrived from underlying when the entries are inserted. But in reverse reads, it's the interval *after* the newly-inserted entry that's made continuous. The continuity information in the new entries isn't filled. When two population routines race, the one which comes later can punch holes in the continuity left by the first routine, which can break the "older versions are evicted first" rule and revert the affected interval to an older version. To fix this, we must make sure that inserting new row entries doesn't change the total continuity of the version.
…tinuity() To reflect the final range tombstone change in the populated range, maybe_update_continuity() might insert a dummy at `before_key(_next_row.table_position())`. But the relevant logic breaks down in the special case when that position is equal to `_last_row.position()`. The code treats the dummy as a part of the (_last_row, _next_row) range, but this is wrong in the special case. This can lead to inconsistent state. For example, `_last_row` can be wrongly made continuous, or its range tombstone can be wrongly nulled. The proposed fix is to only modify the dummy if it was actually inserted. If it had been inserted beforehand (which is true in the special case, because of the `ensure_population_lower_bound()` call earlier), then it's already in a valid state and doesn't need changes.
…in ensure_population_lower_bound() ensure_population_lower_bound() guarantees that _last_row is valid or null. However, it fails to provide this guarantee in the special rare case when `_population_range_starts_before_all_rows == true` and _last_row is non-null. (This can happen in practice if there is a dummy at before_all_clustering_rows and eviction makes the `(before_all_clustering_rows, ...)` interval discontinuous. When the interval is read in this state, _last_row will point to the dummy, while _population_range_starts_before_all_rows will still be true.) In this special case, `ensure_population_lower_bound()` does not refresh `_last_row`, so it can be non-null but invalid after the call. If it is accessed in this state, undefined behaviour occurs. This was observed to happen in a test, in the `read_from_underlying() -- maybe_drop_last_entry()` codepath. The proposed fix is to make the meaning of _population_range_starts_before_all_rows closer to its real intention. Namely: it's supposed to handle the special case of a left-open interval, not the case of an interval starting at -inf.
…" during schema upgrades A schema upgrade appends an MVCC version B after an existing version A. The last dummy in B is added to the front of LRU, so it will be evicted after the entries in A. This alone doesn't quite violate the "older versions are evicted first" rule, because the new last dummy carries no information. But apply_monotonically generally assumes that entries on the same position have the obvious eviction order, even if they carry no information. Thus, after the merge, the rule can become broken. The proposed fix is as follows:
- In the case where A is merged into B, the merged last dummy inherits the link of A.
- The merging of B into anything is prevented until its merge with A is finished.
This is relatively hacky, because it still involves a state that goes against some natural expectations granted by the "older versions..." rule. A less hacky fix would be to ensure that the new dummy is inserted into a proper place in the eviction order to begin with. Or, better yet, we could eliminate the rule altogether. Aside from being very hard to maintain, it also prevents the introduction of any eviction algorithm other than LRU.
Rebased and restarted CI. |
🟢 CI State: SUCCESS ✅
@scylladb/scylla-maint pls merge |
This is a loose collection of fixes to rare row cache bugs flushed out by running test_concurrent_reads_and_eviction several million times. See individual commits for details.

Fixes #15483
Closes #15945

* github.com:scylladb/scylladb:
  partition_version: fix violation of "older versions are evicted first" during schema upgrades
  cache_flat_mutation_reader: fix a broken iterator validity guarantee in ensure_population_lower_bound()
  cache_flat_mutation_reader: fix a continuity loss in maybe_update_continuity()
  cache_flat_mutation_reader: fix continuity losses during cache population races with reverse reads
  partition_snapshot_row_cursor: fix a continuity loss in ensure_entry_in_latest() with reverse reads
  cache_flat_mutation_reader: fix some cache mispopulations with reverse reads
  cache_flat_mutation_reader: fix a logic bug in ensure_population_lower_bound() with reverse reads
  cache_flat_mutation_reader: never make an unlinked last dummy continuous

(cherry picked from commit 6bcf3ac)