test_concurrent_reads_and_eviction: Mutation read doesn't match any expected version #15483

Closed
bhalevy opened this issue Sep 20, 2023 · 43 comments · Fixed by #15945
Assignees: denesb
Labels: area/row cache, Backport candidate, P1 Urgent, symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework)
Milestone: 5.4

Comments

@bhalevy
Member

bhalevy commented Sep 20, 2023

This was hit in CI with an unrelated change, so it can be attributed to baseline version b87660f:
https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/3706/artifact/testlog/x86_64/release/boost.row_cache_test.test_concurrent_reads_and_eviction.3.log

random-seed=2470087052
random_mutation_generator seed: 2568563592
test/boost/row_cache_test.cc(3447): fatal error: in "test_concurrent_reads_and_eviction": Mutation read doesn't match any expected version ...

boost.row_cache_test.test_concurrent_reads_and_eviction.3.log.gz

@michoecho please look into this

@bhalevy bhalevy added the area/row cache, P2 High Priority, and symptom/ci stability (Issues that failed in ScyllaDB CI - tests and framework) labels Sep 20, 2023
@michoecho
Contributor

I looked into it and it appears to be a real problem. It seems that reverse reads are reading range tombstones incorrectly sometimes. But it will take some work to understand the cause.

@mykaul mykaul added this to the 5.4 milestone Sep 27, 2023
@bhalevy
Member Author

bhalevy commented Sep 28, 2023

I looked into it and it appears to be a real problem. It seems that reverse reads are reading range tombstones incorrectly sometimes. But it will take some work to understand the cause.

Cc @denesb

@denesb
Contributor

denesb commented Sep 28, 2023

Could this be a regression caused by the v2 internal representation?


@avikivity avikivity added P1 Urgent and removed P2 High Priority labels Oct 9, 2023
@denesb
Contributor

denesb commented Oct 9, 2023

@michoecho did you manage to reproduce this? I couldn't reproduce it even with --count=100 and the seed from the last failure hardcoded.

@michoecho
Contributor

michoecho commented Oct 9, 2023

Yes, I managed to reproduce it, but it's rare. It reproduced about once per ~20 minutes of running in a loop. --count=100 is not good enough.

However, I noticed that when I set debuggable = true; in test/lib/mutation_source_test.cc, the failure reproduces very easily. (About once per ~3 runs.)

@denesb
Contributor

denesb commented Oct 9, 2023

However, I noticed that when I set debuggable = true; in test/lib/mutation_source_test.cc, the failure reproduces very easily. (About once per ~3 runs.)

Indeed, I immediately got a failing run once I set debuggable = true. Thanks for the tip.

@denesb
Contributor

denesb commented Oct 9, 2023

@michoecho if you don't mind, I will take over this issue.

@michoecho
Contributor

@michoecho if you don't mind, I will take over this issue.

Sure, why not. But when you are done with it, please tell me how you debugged it. test_concurrent_reads_and_eviction is very hard to debug. I'm interested in what techniques other people use.

@denesb denesb self-assigned this Oct 9, 2023
@denesb
Contributor

denesb commented Oct 9, 2023

@michoecho if you don't mind, I will take over this issue.

Sure, why not. But when you are done with it, please tell me how you debugged it. test_concurrent_reads_and_eviction is very hard to debug. I'm interested in what techniques other people use.

Ok, will do.

@denesb
Contributor

denesb commented Oct 9, 2023

Wow, I was super lucky and got a really small mutation reproducer:

test/lib/mutation_assertions.hh(134): fatal error: in "test_concurrent_reads_and_eviction": Mutations differ, expected {table: 'ks.cf', key: {'pk': ef4b9e22, token: -8892763474812993290}, mutation_partition: {
  tombstone: {tombstone: timestamp=3330002, deletion_time=777000602},
  range_tombstones: {
    {range_tombstone: start={position: clustered, ckp{00040e8c563e0004e1f8219b}, 1}, end={position: clustered, ckp{00040e8c563e0004f32c13c1}, -1}, {tombstone: timestamp=3330002, deletion_time=777000918}},
    {range_tombstone: start={position: clustered, ckp{00040e8c563e0004f32c13c1}, -1}, end={position: clustered, ckp{000416eb7874}, -1}, {tombstone: timestamp=3330002, deletion_time=777001113}},
    {range_tombstone: start={position: clustered, ckp{000416eb7874}, -1}, end={position: clustered, ckp{000416eb7874}, 1}, {tombstone: timestamp=3330002, deletion_time=777001120}},
    {range_tombstone: start={position: clustered, ckp{000416eb7874}, 1}, end={position: clustered, ckp{00047750d3ef}, 1}, {tombstone: timestamp=3330002, deletion_time=777001128}},
    {range_tombstone: start={position: clustered, ckp{00047750d3ef}, 1}, end={position: clustered, ckp{000486dd59e5000416eb7874}, -1}, {tombstone: timestamp=3330002, deletion_time=777001121}},
    {range_tombstone: start={position: clustered, ckp{000486dd59e5000416eb7874}, -1}, end={position: clustered, ckp{00048b105fe0000416eb7874}, -1}, {tombstone: timestamp=3330002, deletion_time=777001120}},
    {range_tombstone: start={position: clustered, ckp{00048b105fe0000416eb7874}, -1}, end={position: clustered, ckp{0004e1f8219b}, -1}, {tombstone: timestamp=3330002, deletion_time=777001117}},
    {range_tombstone: start={position: clustered, ckp{0004e1f8219b}, -1}, end={position: clustered, ckp{0004ef4b9e22}, -1}, {tombstone: timestamp=3330002, deletion_time=777001104}},
    {range_tombstone: start={position: clustered, ckp{0004ef4b9e22}, -1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=3330002, deletion_time=777001004}}},
  rows: [
  ]
}
}
 ...but got: {table: 'ks.cf', key: {'pk': ef4b9e22, token: -8892763474812993290}, mutation_partition: {
  tombstone: {tombstone: timestamp=3330002, deletion_time=777000602},
  range_tombstones: {
    {range_tombstone: start={position: clustered, ckp{00040e8c563e0004e1f8219b}, 1}, end={position: clustered, ckp{00040e8c563e}, 1}, {tombstone: timestamp=3330002, deletion_time=777000918}},
    {range_tombstone: start={position: clustered, ckp{00040e8c563e}, 1}, end={position: clustered, ckp{000416eb7874}, -1}, {tombstone: timestamp=3330002, deletion_time=777001113}},
    {range_tombstone: start={position: clustered, ckp{000416eb7874}, -1}, end={position: clustered, ckp{000416eb7874}, 1}, {tombstone: timestamp=3330002, deletion_time=777001120}},
    {range_tombstone: start={position: clustered, ckp{000416eb7874}, 1}, end={position: clustered, ckp{00047750d3ef}, 1}, {tombstone: timestamp=3330002, deletion_time=777001128}},
    {range_tombstone: start={position: clustered, ckp{00047750d3ef}, 1}, end={position: clustered, ckp{000486dd59e5000416eb7874}, -1}, {tombstone: timestamp=3330002, deletion_time=777001121}},
    {range_tombstone: start={position: clustered, ckp{000486dd59e5000416eb7874}, -1}, end={position: clustered, ckp{00048b105fe0000416eb7874}, -1}, {tombstone: timestamp=3330002, deletion_time=777001120}},
    {range_tombstone: start={position: clustered, ckp{00048b105fe0000416eb7874}, -1}, end={position: clustered, ckp{0004e1f8219b}, -1}, {tombstone: timestamp=3330002, deletion_time=777001117}},
    {range_tombstone: start={position: clustered, ckp{0004e1f8219b}, -1}, end={position: clustered, ckp{0004ef4b9e22}, -1}, {tombstone: timestamp=3330002, deletion_time=777001104}},
    {range_tombstone: start={position: clustered, ckp{0004ef4b9e22}, -1}, end={position: clustered, ckp{}, 1}, {tombstone: timestamp=3330002, deletion_time=777001004}}},
  rows: [
  ]
}
}

Usually the printout is 10K+ lines.
The diff:

--- /dev/fd/63	2023-10-09 09:13:21.545162034 -0400
+++ /dev/fd/62	2023-10-09 09:13:21.545162034 -0400
@@ -1,9 +1,9 @@
-test/lib/mutation_assertions.hh(134): fatal error: in "test_concurrent_reads_and_eviction": Mutations differ, expected 
+ ...but got: 
 {table: 'ks.cf', key: {'pk': ef4b9e22, token: -8892763474812993290}, mutation_partition: {
   tombstone: {tombstone: timestamp=3330002, deletion_time=777000602},
   range_tombstones: {
-    {range_tombstone: start={position: clustered, ckp{00040e8c563e0004e1f8219b}, 1}, end={position: clustered, ckp{00040e8c563e0004f32c13c1}, -1}, {tombstone: timestamp=3330002, deletion_time=777000918}},
-    {range_tombstone: start={position: clustered, ckp{00040e8c563e0004f32c13c1}, -1}, end={position: clustered, ckp{000416eb7874}, -1}, {tombstone: timestamp=3330002, deletion_time=777001113}},
+    {range_tombstone: start={position: clustered, ckp{00040e8c563e0004e1f8219b}, 1}, end={position: clustered, ckp{00040e8c563e}, 1}, {tombstone: timestamp=3330002, deletion_time=777000918}},
+    {range_tombstone: start={position: clustered, ckp{00040e8c563e}, 1}, end={position: clustered, ckp{000416eb7874}, -1}, {tombstone: timestamp=3330002, deletion_time=777001113}},
     {range_tombstone: start={position: clustered, ckp{000416eb7874}, -1}, end={position: clustered, ckp{000416eb7874}, 1}, {tombstone: timestamp=3330002, deletion_time=777001120}},
     {range_tombstone: start={position: clustered, ckp{000416eb7874}, 1}, end={position: clustered, ckp{00047750d3ef}, 1}, {tombstone: timestamp=3330002, deletion_time=777001128}},
     {range_tombstone: start={position: clustered, ckp{00047750d3ef}, 1}, end={position: clustered, ckp{000486dd59e5000416eb7874}, -1}, {tombstone: timestamp=3330002, deletion_time=777001121}},

@michoecho
Contributor

michoecho commented Oct 9, 2023

Some observations I've made that should help:

  • Commenting out the possibility of reverse reads from the test eliminates the failures. Thus it seems that reverse reads are essential.
  • The bug seems to be about range tombstones. You can set the clustering row number in random_mutation_generator to 0 and it will still happen.
  • You can change random_mutation_generator::impl::gen_timestamp() to generate monotonically growing timestamps, and the bug will still happen. This makes it easier to see that some range tombstones are wrongly applied. (By default, with debuggable, the timestamp of all tombstones will almost always be 3330002, which makes some things harder to see).
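For illustration, a minimal sketch of the monotonic-timestamp idea from the last bullet (hypothetical type and member names, not Scylla's actual random_mutation_generator::impl):

// Hypothetical sketch only. It contrasts an assumed default behaviour
// (timestamps drawn from a tiny range, so most tombstones collide on the
// same value, e.g. 3330002 in the logs above) with a monotonic variant
// that makes every tombstone uniquely identifiable in the printouts.
#include <cstdint>
#include <random>

struct timestamp_gen {
    std::mt19937_64 rnd{2568563592u};  // seed taken from the failure log above
    std::int64_t last = 0;

    std::int64_t gen_default() {       // assumed default: narrow random range
        return 3330000 + static_cast<std::int64_t>(rnd() % 3);
    }
    std::int64_t gen_monotonic() {     // debug variant: strictly increasing
        return ++last;
    }
};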

@michoecho
Contributor

Wow, I was super lucky and got a really small mutation reproducer:

You weren't super lucky, it happens quite often. There is a fairly high chance that the range tombstones will remove all rows and only range tombstones will remain in the compacted output.

@denesb
Contributor

denesb commented Oct 9, 2023

Thanks for the tips. From what I see, the problem is that the first range tombstone is extended. It should only go as far as 00040e8c563e0004f32c13c1 but it is extended to cover the entire 00040e8c563e prefix.


@denesb
Contributor

denesb commented Oct 10, 2023

After a lot of staring at different results and annotating the generated mutations with metadata, it seems that the bug is that the read will sometimes mix in range tombstones from a later version. I don't know yet how or why this happens.

@denesb
Contributor

denesb commented Oct 11, 2023

I expected this to be tough, but I got more than I bargained for.

I managed to narrow down the problem to prefix keys. I think we are bitten by a broken prefix-key comparison in reverse mode.

@denesb
Contributor

denesb commented Oct 13, 2023

After yet more investigation, both of my comments above are wrong. The bug is not limited to prefix keys, and the "wrong" keys do not come from later versions; they "bleed through" from earlier versions. I have manually verified the merging rules in the test, and the expected mutation is correct. Which means that we have a bug in the cache, where it merges range tombstones from two versions incorrectly: sometimes a range tombstone that is overridden by a later version is still included.

I still don't understand why we need reverse reads for this to reproduce. The bug itself can happen with forward reads, but only if there are prior reverse reads. Maybe a reverse read mis-populates the cache and a later read then trips over it? I don't know yet.

My three main suspects for the source of the bug are:

  • switching between reading from cache and underlying
  • partition snapshot row cursor
  • cache population logic

Debugging is made extra hard by the fact that despite hard-coding the random seeds, the test has innate randomness due to the timing between the concurrent reads and the concurrent updates/eviction. This means that each time I add more logging, I have to analyze a new case, digging up information from ~50K lines of logs.

@michoecho
Contributor

michoecho commented Oct 13, 2023

Debugging is made extra hard by the fact that despite hard-coding the random seeds, the test has innate randomness due to the timing between the concurrent reads and the concurrent updates/eviction. This means that each time I add more logging, I have to analyze a new case, digging up information from ~50K lines of logs.

If this is a problem, make the test deterministic. This test doesn't do any IO and doesn't depend on real time, so you only have to ensure that need_preempt() returns the same sequence of decisions on every run.

One way to do it is to run as usual, but write down all the outcomes of need_preempt() to a file. After you get a file from a failed run, make need_preempt() read its decisions from this file, and you will always get the same run.
That's what I did when debugging #12462, and it worked well.
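A minimal sketch of the record/replay idea (a hypothetical helper, not the actual patch used for #12462):

// Hypothetical illustration: record each need_preempt() outcome to a file on
// a failing run, then replay the same sequence on subsequent runs. This is
// not an existing Seastar facility.
#include <cstddef>
#include <cstdio>
#include <vector>

class preempt_trace {
    std::vector<char> _decisions;
    std::size_t _next = 0;
    bool _replaying = false;
public:
    // Record mode: remember the decision the real need_preempt() made.
    bool record(bool decision) {
        _decisions.push_back(decision ? 1 : 0);
        return decision;
    }
    // Replay mode: hand back the recorded decisions in the original order.
    bool replay() {
        return _next < _decisions.size() && _decisions[_next++];
    }
    bool replaying() const { return _replaying; }
    void save(const char* path) const {
        if (std::FILE* f = std::fopen(path, "wb")) {
            std::fwrite(_decisions.data(), 1, _decisions.size(), f);
            std::fclose(f);
        }
    }
    void load(const char* path) {
        if (std::FILE* f = std::fopen(path, "rb")) {
            char c;
            while (std::fread(&c, 1, 1, f) == 1) {
                _decisions.push_back(c);
            }
            std::fclose(f);
            _replaying = true;
        }
    }
};

On a recording run, need_preempt() would pass its real result through record() and the trace would be saved when the test fails; on a replay run it would return replay() instead, reproducing the same interleaving.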

A much easier (but more invasive) way is to make the result of need_preempt() pseudorandom. I.e. do this:

+
+namespace testing {
+    extern thread_local std::default_random_engine local_random_engine;
+}
+
 bool need_preempt() noexcept {
+    return testing::local_random_engine() % 100 == 0;
 #ifndef SEASTAR_DEBUG
     // prevent compiler from eliminating loads in a loop
     std::atomic_signal_fence(std::memory_order_seq_cst);

I just verified that this also works for this test. With this, once you find a failing seed, it will always result in the same failed run.

@denesb
Contributor

denesb commented Oct 16, 2023

This seems to have worked (a random seed for need_preempt()); thanks for the tip.

@avikivity
Member

This seems to have worked (a random seed for need_preempt()); thanks for the tip.

Sounds like we should make it official and expose it as a command-line option, possibly available only in non-release modes.

Yes

@michoecho
Contributor

michoecho commented Oct 16, 2023

Sounds like we should make it official and expose it as a command-line option, possibly available only in non-release modes.

It might be useful sometimes, but it covers only a subset of a more general problem.

Non-reproducible tests are evil. In a better world, we would have designed our test frameworks in such a way that all tests can be run with recording, and be replayed when they fail.

It's probably not even that hard to reach. I think it would be sufficient to record and replay the trace of:

  1. All preempt_needed decisions.
  2. The results of all clock reads.
  3. The IDs of all I/O completions and their lengths.
  4. The IDs of timer completions.
  5. The shard IDs of smp message receives.
  6. The wakeups of condition_variables due to SIGTERM and SIGHUP.

Seastar is in a very good position for making all tests determinizable, and it's a shame that we don't make use of it. We could be two weeks of work away from eliminating the problem of non-reproducible test failures altogether.
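For illustration, one possible (hypothetical) shape of such a record/replay layer over the inputs listed above; none of these types exist in Seastar today:

#include <cstddef>
#include <cstdint>
#include <variant>
#include <vector>

// One recorded outcome per source of nondeterminism in the list above.
struct preempt_event { bool needed; };
struct clock_event   { std::uint64_t nanos; };
struct io_event      { std::uint64_t request_id; std::size_t length; };
struct timer_event   { std::uint64_t timer_id; };
struct smp_event     { unsigned from_shard; };

using trace_event = std::variant<preempt_event, clock_event, io_event, timer_event, smp_event>;

class determinism_trace {
    std::vector<trace_event> _events;
    std::size_t _cursor = 0;
    bool _replay = false;
public:
    // Record mode: the reactor appends every nondeterministic outcome.
    void record(trace_event e) {
        if (!_replay) {
            _events.push_back(e);
        }
    }
    // Replay mode: the reactor consumes the recorded outcomes in order,
    // which pins the task interleaving to the original failing run.
    const trace_event& next() { return _events.at(_cursor++); }
    void start_replay() {
        _replay = true;
        _cursor = 0;
    }
};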

@denesb
Contributor

denesb commented Oct 16, 2023

I found a bug. But it looks like there might be more.

Found another bug. Hope this is the last.

@avikivity
Member

Sounds like we should make it official and expose it as a command-line option, possibly available only in non-release modes.

It might be useful sometimes, but it covers only a subset of a more general problem.

Non-reproducible tests are evil. In a better world, we would have designed our test frameworks in such a way that all tests can be run with recording, and be replayed when they fail.

It's probably not even that hard to reach. I think it would be sufficient to record and replay the trace of:

  1. All preempt_needed decisions.
  2. The results of all clock reads.
  3. The IDs of all I/O completions and their lengths.
  4. The IDs of timer completions.
  5. The shard IDs of smp message receives.
  6. The wakeups of condition_variables due to SIGTERM and SIGHUP.

Seastar is in a very good position for making all tests determinizable, and it's a shame that we don't make use of it. We could be two weeks of work away from eliminating the problem of non-reproducible test failures altogether.

@bhalevy please find time for it

@michoecho
Contributor

michoecho commented Oct 16, 2023

@bhalevy please find time for it

@bhalevy Maybe I could do this (making Seastar-based tests deterministic/replayable) as a "20% project"?

@bhalevy
Member Author

bhalevy commented Oct 16, 2023

@bhalevy please find time for it

@bhalevy Maybe I could do this (making Seastar-based tests deterministic/replayable) as a "20% project"?

Yes, it'd be a very good use of your time. Thanks

@michoecho
Contributor

I found a bug. But it looks like there might be more.

Found another bug. Hope this is the last.

@denesb Don't tease us. What are the bugs? Are they fun?

@denesb
Contributor

denesb commented Oct 18, 2023

I found a bug. But it looks like there might be more.

Found another bug. Hope this is the last.

@denesb Don't tease us. What are the bugs? Are they fun?

I still don't know what the second bug is. I just know it exists.

As for the first one, it is this:

diff --git a/cache_flat_mutation_reader.hh b/cache_flat_mutation_reader.hh
index 8f5fc0a412..6de2fb2cd0 100644
--- a/cache_flat_mutation_reader.hh
+++ b/cache_flat_mutation_reader.hh
@@ -513,7 +513,9 @@ bool cache_flat_mutation_reader::ensure_population_lower_bound() {
             return false;
         }
 
-        if (cmp(cur.position(), _last_row.position()) != 0) {
+        if (cmp(cur.table_position(), _last_row.position()) != 0) {
             return false;
         }

@denesb
Contributor

denesb commented Oct 18, 2023

The second bug is much more elusive. Each time I'm convinced I've found it, it turns out the code at hand is correct and the bug happened at an earlier point, creating bad data.

@michoecho
Contributor

michoecho commented Oct 20, 2023

I found at least one logic bug:

diff --git a/cache_flat_mutation_reader.hh b/cache_flat_mutation_reader.hh
index 30f2f5c503..db93546980 100644
--- a/cache_flat_mutation_reader.hh
+++ b/cache_flat_mutation_reader.hh
@@ -579,12 +579,12 @@ void cache_flat_mutation_reader::maybe_update_continuity() {
                         if (insert_result.second) {
                             clogger.trace("csm {}: L{}: inserted dummy at {}", fmt::ptr(this), __LINE__, insert_result.first->position());
                             _snp->tracker()->insert(*insert_result.first);
+                            clogger.trace("csm {}: set_continuous({}), prev={}, rt={}", fmt::ptr(this), insert_result.first->position(),
+                                          _last_row.position(), _current_tombstone);
+                            insert_result.first->set_continuous(true);
+                            insert_result.first->set_range_tombstone(_current_tombstone);
                         }
-                        clogger.trace("csm {}: set_continuous({}), prev={}, rt={}", fmt::ptr(this), insert_result.first->position(),
-                                      _last_row.position(), _current_tombstone);
-                        insert_result.first->set_continuous(true);
-                        insert_result.first->set_range_tombstone(_current_tombstone);
                         clogger.trace("csm {}: set_continuous({})", fmt::ptr(this), e.position());
                         e.set_continuous(true);
                     });
                 } else {

Explanation: assume that the cache contains the following row entries for a partition:

  • (pos: before 1, continuous: true, rt: ts=0)
  • (pos: at 1, continuous: false, rt: ts=1)

When a reader attempts to read this partition, it will attempt to read underlying for the range (before 1, at 1). This sets _current_tombstone to an empty (null) tombstone, then the reader reads underlying (without consuming any rows, since the range is obviously empty), and then calls maybe_update_continuity(). maybe_update_continuity() will see that, after finishing the read from underlying, _current_tombstone is null, which is different from the tombstone of _next_row. So it will insert a dummy before _next_row and set its rt to _current_tombstone.

But that's invalid, because an entry at this position (before 1) already exists and is meaningful. Setting its rt to null makes the ts=0 tombstone disappear.

I'm not sure if the fix proposed above is sound, though.

Also, I found a third, unrelated bug (which results in undefined behavior, rather than a test failure):

diff --git a/cache_flat_mutation_reader.hh b/cache_flat_mutation_reader.hh
index 3a71326de4..30f2f5c503 100644
--- a/cache_flat_mutation_reader.hh
+++ b/cache_flat_mutation_reader.hh
@@ -341,7 +342,7 @@ future<> cache_flat_mutation_reader::do_fill_buffer() {
             });
         }
         _state = state::reading_from_underlying;
-        _population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema) && !_read_context.is_reversed();
+        _population_range_starts_before_all_rows = _lower_bound.is_before_all_clustered_rows(*_schema) && !_read_context.is_reversed() && !_last_row;
         _underlying_upper_bound = _next_row_in_range ? position_in_partition::before_key(_next_row.position())
                                                      : position_in_partition(_upper_bound);
         if (!_read_context.partition_exists()) {

There is a place in the code which assumes that _population_range_starts_before_all_rows implies !_last_row. But this is not necessarily true if there exists a rows_entry with position=before_all_rows. In this case _last_row might be referenced even if invalid, leading e.g. to segfaults or assert violations.

But the test is still failing. I'm trying to find the fourth bug now.

@michoecho
Contributor

@denesb The rabbit hole goes deep. Other than the bug you've found, I've found 4 bugs so far (I've listed two in the previous comment, the other two are continuity breaks during races in cache population), and it's not over yet. I'm trying to understand the sixth bug now.

@michoecho
Contributor

After fixing 8 bugs (or 10, depending on how you count), I haven't seen the test fail for 100000 consecutive runs. I'll let it run a million times and then I'll get to properly documenting and submitting the patches.

@avikivity
Member

It points at the thing being way too complicated for our own good.

I don't know what to suggest, apart from flattening MVCC to use the timestamp as a discriminant instead of a version object.

@michoecho
Contributor

@denesb For the record, my debugging strategy was:

  1. Apply the deterministic preemption patch, set debuggable = true.
  2. Run the test in a loop until it fails, write down the seed.
  3. Find the difference between the expected and the actual output. Write down its keys and timestamps.
  4. Log the full contents of the cache entry before and after each update and read. Using the keys and timestamps, look for the first moment where the difference appears. (Where the wrong row entry appears, or where something supposed to shadow it disappears). Include a log ID (a monotonically rising number) in each such log.
  5. Add the address of each row entry (and its _flags, and its _row_tombstone) to the logs. Write down the addresses of the suspicious row entry (identified in step 4).
  6. Run under GDB until the last good log (via a breakpoint on the log line, conditional on the log ID), then set a watchpoint on the bad row entry, and run until the first bad log, looking at the backtraces of each watchpoint trigger.
  7. When you find the moment where the state becomes bad, you can break there and figure out the cause.

In other words: identify the bad rows_entry, then set watchpoints on it. Watchpoints are indispensable for debugging this.
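A small sketch of the log-ID trick from steps 4 and 6 (hypothetical helper; the actual debugging patch is not shown in this thread):

#include <cstdint>
#include <string>
#include <fmt/core.h>

// A plain global, monotonically rising log ID. Because it is a global, gdb
// can stop at a specific log line with a conditional breakpoint, e.g. to stop
// just before log-id 31337 is printed:
//   break log_with_id if g_log_id == 31336
inline std::uint64_t g_log_id = 0;

inline void log_with_id(const std::string& msg) {
    fmt::print("[log-id {}] {}\n", ++g_log_id, msg);
}

From there, a hardware watchpoint on the suspicious rows_entry's fields (step 6) catches the exact moment the state goes bad.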

@denesb
Contributor

denesb commented Oct 26, 2023

Watchpoints are indeed awesome; the hard part is coming up with what is worth watching. Great job!

@avikivity
Member

Possibly reversible debugging can help in these cases.

@avikivity
Member

rr can't support aio or io_uring, so we'll have to add an epoll+threadpool reactor backend that shunts disk I/O through the syscall thread. Shouldn't be hard.

denesb pushed a commit that referenced this issue Jan 15, 2024
This is a loose collection of fixes to rare row cache bugs flushed out by running test_concurrent_reads_and_eviction several million times. See individual commits for details.

Fixes #15483

Closes #15945

* github.com:scylladb/scylladb:
  partition_version: fix violation of "older versions are evicted first" during schema upgrades
  cache_flat_mutation_reader: fix a broken iterator validity guarantee in ensure_population_lower_bound()
  cache_flat_mutation_reader: fix a continuity loss in maybe_update_continuity()
  cache_flat_mutation_reader: fix continuity losses during cache population races with reverse reads
  partition_snapshot_row_cursor: fix a continuity loss in ensure_entry_in_latest() with reverse reads
  cache_flat_mutation_reader: fix some cache mispopulations with reverse reads
  cache_flat_mutation_reader: fix a logic bug in ensure_population_lower_bound() with reverse reads
  cache_flat_mutation_reader: never make an unlinked last dummy continuous

(cherry picked from commit 6bcf3ac)
@denesb
Contributor

denesb commented Jan 15, 2024

Backported to 5.4; 5.2 is not affected.

@denesb
Contributor

denesb commented Jan 15, 2024

5.2 is not affected

@michoecho can you double check this please? AFAIK the bugs fixed here were introduced by the v2 migration. Did I miss something?

@michoecho
Contributor

5.2 is not affected

@michoecho can you double check this please? AFAIK the bugs fixed here were introduced by the v2 migration. Did I miss something?

That's hard to say. Most of the affected code has changed greatly between 5.2 and 5.4, so it's hard to tell which bugs could also have been present in 5.2.

But, for example, I think that 47299d6 also applies to 5.2.

@denesb
Contributor

denesb commented Jan 15, 2024

But, for example, I think that 47299d6 also applies to 5.2.

Please create a 5.2 backport PR with the patches you think apply to it.
