test_concurrent_reads_and_eviction: Mutation read doesn't match any expected version #15483
I looked into it and it appears to be a real problem. It seems that reverse reads sometimes read range tombstones incorrectly, but it will take some work to understand the cause.
Cc @denesb
Could this be a regression caused by the v2 internal representation?
@michoecho did you manage to reproduce this? I couldn't reproduce it even with
Yes, I managed to reproduce it, but it's rare. It reproduced about once per ~20 minutes of running in a loop. However, I noticed that when I set
Indeed, I immediately got a failing run once I set
@michoecho if you don't mind, I will take over this issue.
Sure, why not. But when you are done with it, please tell me how you debugged it.
Ok, will do.
Wow, I was super lucky and got a really small mutation reproducer:
Usually the printout is 10K+ lines.

```diff
--- /dev/fd/63	2023-10-09 09:13:21.545162034 -0400
+++ /dev/fd/62	2023-10-09 09:13:21.545162034 -0400
@@ -1,9 +1,9 @@
-test/lib/mutation_assertions.hh(134): fatal error: in "test_concurrent_reads_and_eviction": Mutations differ, expected
+ ...but got:
 {table: 'ks.cf', key: {'pk': ef4b9e22, token: -8892763474812993290}, mutation_partition: {
   tombstone: {tombstone: timestamp=3330002, deletion_time=777000602},
   range_tombstones: {
-    {range_tombstone: start={position: clustered, ckp{00040e8c563e0004e1f8219b}, 1}, end={position: clustered, ckp{00040e8c563e0004f32c13c1}, -1}, {tombstone: timestamp=3330002, deletion_time=777000918}},
-    {range_tombstone: start={position: clustered, ckp{00040e8c563e0004f32c13c1}, -1}, end={position: clustered, ckp{000416eb7874}, -1}, {tombstone: timestamp=3330002, deletion_time=777001113}},
+    {range_tombstone: start={position: clustered, ckp{00040e8c563e0004e1f8219b}, 1}, end={position: clustered, ckp{00040e8c563e}, 1}, {tombstone: timestamp=3330002, deletion_time=777000918}},
+    {range_tombstone: start={position: clustered, ckp{00040e8c563e}, 1}, end={position: clustered, ckp{000416eb7874}, -1}, {tombstone: timestamp=3330002, deletion_time=777001113}},
     {range_tombstone: start={position: clustered, ckp{000416eb7874}, -1}, end={position: clustered, ckp{000416eb7874}, 1}, {tombstone: timestamp=3330002, deletion_time=777001120}},
     {range_tombstone: start={position: clustered, ckp{000416eb7874}, 1}, end={position: clustered, ckp{00047750d3ef}, 1}, {tombstone: timestamp=3330002, deletion_time=777001128}},
     {range_tombstone: start={position: clustered, ckp{00047750d3ef}, 1}, end={position: clustered, ckp{000486dd59e5000416eb7874}, -1}, {tombstone: timestamp=3330002, deletion_time=777001121}},
```
Some observations I've made; they should help:
You weren't super lucky; it happens quite often. There is a fairly high chance that the range tombstones will remove all rows, so that only range tombstones remain in the compacted output.
Thanks for the tips. From what I see, the problem is that the first range tombstone is extended. It should only go as far as
After a lot of staring at different results and annotating the generated mutations with metadata, it seems that the bug is that the read will sometimes mix in range tombstones from a later version. I don't know yet how or why this happens.
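For reference, a toy sketch of the invariant at stake (hypothetical types, not Scylla's actual tombstone API): for any clustering range covered by two MVCC versions, the merge must keep the tombstone with the higher timestamp, so a tombstone overridden by another version must never show up in the read.

```cpp
#include <algorithm>
#include <cstdint>
#include <tuple>

// Toy tombstone, mirroring the fields seen in the printouts above.
struct tombstone {
    int64_t timestamp;      // write timestamp; higher wins
    int64_t deletion_time;  // tie-breaker
};

inline bool operator<(const tombstone& a, const tombstone& b) {
    return std::tie(a.timestamp, a.deletion_time) <
           std::tie(b.timestamp, b.deletion_time);
}

// Merge rule for a range covered by both versions: the newer (greater)
// tombstone must win; the older one must not "bleed through".
inline tombstone merge(tombstone a, tombstone b) {
    return std::max(a, b);
}
```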
I expected this to be tough, but I got more than I bargained for. I managed to narrow the problem down to prefix keys. I think we are bitten by a broken prefix-key comparison in reverse mode.
After yet more investigation, both of my above comments are wrong. The bug is not limited to prefix keys, and the "wrong" keys do not come from later versions; they "bleed through" from earlier versions. I have manually verified the merging rules in the test, and the expected mutation is correct. Which means that we have a bug in the cache, where it merges range tombstones from two versions incorrectly: sometimes a range tombstone which is overridden by a later version is still included. I still don't understand why we need reverse reads for this to reproduce. The bug itself can happen with forward reads, but only if there are prior reverse reads. Maybe a reverse read mis-populates the cache and a later read then steps on this? I don't know yet. My 3 main suspicions for the source of the bug are:
Debugging is made extra hard by the fact that, despite hard-coding the random seeds, the test has innate randomness due to the timing between the concurrent reads and the concurrent updates/eviction. This means that each time I add more logging, I have to analyze a new case, digging up information from ~50K lines of logs.
If this is a problem, make the test deterministic. This test doesn't do any IO and doesn't depend on real time, so you only have to ensure that need_preempt() yields the same answers on every run.

One way to do it is to run as usual, but write down all the outcomes of need_preempt() and replay them on the next run.

A much easier (but more invasive) way is to make the result of need_preempt() a function of a seeded random engine:

```diff
+
+namespace testing {
+    extern thread_local std::default_random_engine local_random_engine;
+}
+
 bool need_preempt() noexcept {
+    return testing::local_random_engine() % 100 == 0;
 #ifndef SEASTAR_DEBUG
     // prevent compiler from eliminating loads in a loop
     std::atomic_signal_fence(std::memory_order_seq_cst);
```

I just verified that this also works for this test. With this, once you find a failing seed, it will always keep resulting in the same failed run.
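A minimal sketch of the "record, then replay" alternative mentioned above (hypothetical helper names and wiring, not actual Seastar code): on the recording run, every preemption decision is appended to a trace; on the replay run, the recorded answers are fed back so the same interleaving repeats.

```cpp
#include <deque>

// Trace of preemption decisions; would be persisted when a run fails.
static std::deque<bool> g_preempt_trace;
static bool g_replaying = false;

bool need_preempt_traced(bool real_outcome) {
    if (g_replaying) {
        bool v = g_preempt_trace.front(); // replay the recorded decision
        g_preempt_trace.pop_front();
        return v;
    }
    g_preempt_trace.push_back(real_outcome); // record this run's decision
    return real_outcome;
}
```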
This seems to have worked (random-seed for
Yes.
It might be useful sometimes, but it also covers only a subset of a general problem. Non-reproducible tests are evil. In a better world, we would have designed our test frameworks in such a way that all tests can be run with recording, and replayed when they fail. It's probably not even that hard to achieve. I think it would be sufficient to record and replay the trace of:
Seastar is in a very good position for making all tests determinizable, and it's a shame that we don't make use of it. We could be two weeks of work away from eliminating the problem of non-reproducible test failures altogether.
Found another bug. Hope this is the last.
@bhalevy please find time for it
@denesb Don't tease us. What are the bugs? Are they fun?
I still don't know what the second bug is. I just know it exists. As for the first one, it is this:

```diff
diff --git a/cache_flat_mutation_reader.hh b/cache_flat_mutation_reader.hh
index 8f5fc0a412..6de2fb2cd0 100644
--- a/cache_flat_mutation_reader.hh
+++ b/cache_flat_mutation_reader.hh
@@ -513,7 +513,9 @@ bool cache_flat_mutation_reader::ensure_population_lower_bound() {
         return false;
     }
-    if (cmp(cur.position(), _last_row.position()) != 0) {
+    if (cmp(cur.table_position(), _last_row.position()) != 0) {
         return false;
     }
```
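A toy model of the bug class this fix belongs to (integers standing in for clustering positions; the real position_in_partition machinery works differently): in a reverse read the cursor's position lives in the reversed domain, while _last_row remembers a table-order position, so a naive comparison reports a mismatch for what is actually the same position.

```cpp
#include <cassert>

// Reversed-domain encoding of a clustering position (toy: negation).
int to_reversed(int table_pos) { return -table_pos; }
int to_table(int reversed_pos) { return -reversed_pos; }

int main() {
    int last_row = 5;                  // remembered in table order
    int cur = to_reversed(5);          // cursor position, reversed domain
    assert(cur != last_row);           // naive compare says "different"...
    assert(to_table(cur) == last_row); // ...but it's the same position
}
```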
The second bug is much more elusive. Each time I'm convinced I found it, it turns out the code at hand is correct and the bug happened at an earlier point, creating bad data.
I found at least one logic bug:

```diff
diff --git a/cache_flat_mutation_reader.hh b/cache_flat_mutation_reader.hh
index 30f2f5c503..db93546980 100644
--- a/cache_flat_mutation_reader.hh
+++ b/cache_flat_mutation_reader.hh
@@ -579,12 +579,12 @@ void cache_flat_mutation_reader::maybe_update_continuity() {
         if (insert_result.second) {
             clogger.trace("csm {}: L{}: inserted dummy at {}", fmt::ptr(this), __LINE__, insert_result.first->position());
             _snp->tracker()->insert(*insert_result.first);
+            clogger.trace("csm {}: set_continuous({}), prev={}, rt={}", fmt::ptr(this), insert_result.first->position(),
+                    _last_row.position(), _current_tombstone);
+            insert_result.first->set_continuous(true);
+            insert_result.first->set_range_tombstone(_current_tombstone);
         }
-        clogger.trace("csm {}: set_continuous({}), prev={}, rt={}", fmt::ptr(this), insert_result.first->position(),
-                _last_row.position(), _current_tombstone);
-        insert_result.first->set_continuous(true);
-        insert_result.first->set_range_tombstone(_current_tombstone);
         clogger.trace("csm {}: set_continuous({})", fmt::ptr(this), e.position());
         e.set_continuous(true);
     });
 } else {
```

Explanation: assume that the cache contains the following row entries for a partition:
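In generic terms, the hazard the proposed fix guards against can be sketched like this (toy code; std::map stands in for the cache's row-entry tree and all names are hypothetical): the dummy entry's attributes may only be set when the insertion actually happened, otherwise the state of an entry that was already present gets clobbered.

```cpp
#include <map>

struct entry {
    bool continuous = false; // toy stand-in for rows_entry continuity
    long tombstone_ts = 0;   // toy stand-in for its range tombstone
};

void upsert_dummy(std::map<int, entry>& cache, int key, long current_ts) {
    auto [it, inserted] = cache.emplace(key, entry{});
    if (inserted) {
        // Fresh dummy: the reader knows exactly what it established here.
        it->second.continuous = true;
        it->second.tombstone_ts = current_ts;
    }
    // If the entry already existed, its continuity and tombstone reflect
    // information we must not overwrite.
}
```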
I'm not sure if the fix proposed above is sound, though.

Also, I found a third, unrelated bug (which results in undefined behavior, rather than a test failure):

```diff
diff --git a/cache_flat_mutation_reader.hh b/cache_flat_mutation_reader.hh
index 3a71326de4..30f2f5c503 100644
--- a/cache_flat_mutation_reader.hh
+++ b/cache_flat_mutation_reader.hh
@@ -341,7 +342,7 @@ future<> cache_flat_mutation_reader::do_fill_buffer() {
         });
     }
     _state = state::reading_from_underlying;
-    _population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema) && !_read_context.is_reversed();
+    _population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema) && !_read_context.is_reversed() && !_last_row;
     _underlying_upper_bound = _next_row_in_range ? position_in_partition::before_key(_next_row.position())
             : position_in_partition(_upper_bound);
     if (!_read_context.partition_exists()) {
```

There is a place in the code which assumes that _population_range_starts_before_all_rows implies there is no _last_row. But the test is still failing. I'm trying to find the fourth bug now.
@denesb The rabbit hole goes deep. Other than the bug you've found, I've found 4 bugs so far (I've listed two in the previous comment; the other two are continuity breaks during races in cache population), and it's not over yet. I'm trying to understand the sixth bug now.
After fixing 8 bugs (or 10, depending on how you count), I haven't seen the test fail for 100000 consecutive runs. I'll let it run a million times and then I'll get to properly documenting and submitting the patches.
It points at the thing being way too complicated for our own good. I don't know what to suggest, apart from flattening MVCC to use the timestamp as a discriminant instead of a version object.
@denesb For the record, my debugging strategy was:
In other words: identify the bad rows_entry, then set watchpoints on it. Watchpoints are indispensable for debugging this.
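A hypothetical gdb session illustrating the strategy (the address and member name are made up): once logging pinpoints the bad rows_entry, a location watchpoint on it makes the debugger stop at the exact write that corrupts it.

```
# Made-up address and member name, for illustration only.
(gdb) watch -l ((rows_entry*)0x60700012abc0)->_flags
(gdb) continue
# When the watchpoint fires, `bt` shows exactly which code path
# modified the entry.
```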
Watchpoints are indeed awesome; the hard part is coming up with what is worth watching. Great job!
Possibly reversible debugging can help in these cases.
rr can't support aio or io_uring, so we'll have to add an epoll+threadpool reactor backend that shunts disk I/O through the syscall thread. Shouldn't be hard.
This is a loose collection of fixes to rare row cache bugs flushed out by running test_concurrent_reads_and_eviction several million times. See individual commits for details.

Fixes #15483
Closes #15945

* github.com:scylladb/scylladb:
  - partition_version: fix violation of "older versions are evicted first" during schema upgrades
  - cache_flat_mutation_reader: fix a broken iterator validity guarantee in ensure_population_lower_bound()
  - cache_flat_mutation_reader: fix a continuity loss in maybe_update_continuity()
  - cache_flat_mutation_reader: fix continuity losses during cache population races with reverse reads
  - partition_snapshot_row_cursor: fix a continuity loss in ensure_entry_in_latest() with reverse reads
  - cache_flat_mutation_reader: fix some cache mispopulations with reverse reads
  - cache_flat_mutation_reader: fix a logic bug in ensure_population_lower_bound() with reverse reads
  - cache_flat_mutation_reader: never make an unlinked last dummy continuous

(cherry picked from commit 6bcf3ac)
Backported to 5.4; 5.2 is not affected.
@michoecho can you double-check this, please? AFAIK the bugs fixed here were introduced by the v2 migration. Did I miss something?
That's hard to say. Most of the affected code has changed greatly between 5.2 and 5.4, so it's hard to see which bugs could have also been present in 5.2. But, for example, I think that 47299d6 also applies to 5.2.
Please create a 5.2 backport PR with the patches you think apply to it.
This was hit in CI with an unrelated change, so it can be attributed to baseline version b87660f
https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3706/artifact/testlog/x86_64/release/boost.row_cache_test.test_concurrent_reads_and_eviction.3.log
boost.row_cache_test.test_concurrent_reads_and_eviction.3.log.gz
@michoecho please look into this