row_cache_test/test_concurrent_reads_and_eviction fails sometimes. #12462

Closed · michoecho opened this issue Jan 6, 2023 · 31 comments

Labels: area/test, P1 Urgent, symptom/ci stability, type/bug

@michoecho (Contributor)

During a CI run for an unrelated commit, row_cache_test/test_concurrent_reads_and_eviction failed with a logic error:

00:00:18  test/boost/row_cache_test.cc(0): Entering test case "test_concurrent_reads_and_eviction"
00:00:18  INFO  2023-01-06 00:24:43,853 seastar - Reactor backend: linux-aio
00:00:18  WARN  2023-01-06 00:24:43,854 [shard 0] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:13, perf_event_open() failed: Permission denied)
00:00:18  INFO  2023-01-06 00:24:43,856 [shard 0] seastar - Created fair group io-queue-0, capacity rate 2147483:2147483, limit 12582912, rate 16777216 (factor 1), threshold 2000
00:00:18  INFO  2023-01-06 00:24:43,856 [shard 0] seastar - IO queue uses 0.75ms latency goal for device 0
00:00:18  INFO  2023-01-06 00:24:43,856 [shard 0] seastar - Created io group dev(0), length limit 4194304:4194304, rate 2147483647:2147483647
00:00:18  INFO  2023-01-06 00:24:43,856 [shard 0] seastar - Created io queue dev(0) capacities: 512:2000:2000 1024:3000:3000 2048:5000:5000 4096:9000:9000 8192:17000:17000 16384:33000:33000 32768:65000:65000 65536:129000:129000 131072:257000:257000
00:00:18  WARN  2023-01-06 00:24:43,860 [shard 1] seastar - Creation of perf_event based stall detector failed, falling back to posix timer: std::system_error (error system:13, perf_event_open() failed: Permission denied)
00:00:18  INFO  2023-01-06 00:24:43,861 [shard 1] seastar - Created fair group io-queue-0, capacity rate 2147483:2147483, limit 12582912, rate 16777216 (factor 1), threshold 2000
00:00:18  INFO  2023-01-06 00:24:43,861 [shard 1] seastar - IO queue uses 0.75ms latency goal for device 0
00:00:18  INFO  2023-01-06 00:24:43,861 [shard 1] seastar - Created io group dev(0), length limit 4194304:4194304, rate 2147483647:2147483647
00:00:18  random-seed=960153225
00:00:18  random_mutation_generator seed: 3210547813
00:00:18  unknown location(0): fatal error: in "test_concurrent_reads_and_eviction": std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}
00:00:18  test/boost/row_cache_test.cc(3415): last checkpoint

This is not a fluke (i.e. a cosmic bitflip), because I could reproduce this failure on my PC — but it only happened once over hundreds of runs (with the same seed), so it's most likely timing-related.

Command used:

build/release/test/boost/row_cache_test_g --run_test=test_concurrent_reads_and_eviction --report_level=no --logger=HRF,test_suite:XML,test_suite,/home/michal/praca/scylla-latency/testlog/release/xml/boost.row_cache_test.test_concurrent_reads_and_eviction.1.xunit.xml --catch_system_errors=no --color_output=false -- -c2 -m2G --unsafe-bypass-fsync 1 --kernel-page-cache 1 --blocked-reactor-notify-ms 2000000 --collectd 0 --max-networking-io-control-blocks=100 --random-seed=960153225

Scylla commit:
1d273a98b989a4a3cacc84b2f5d1f38dc207ed45

@michoecho (Contributor, Author)

cc @tgrabiec Does this look like a known bug? Or is the test perhaps wrong?

@tgrabiec (Contributor) commented Jan 7, 2023

I recall that @bhalevy also observed this in one CI run recently. This is the second time I've heard about this. It could be a regression we should investigate.

@bhalevy added the type/bug, area/test, and triage/master labels Jan 8, 2023
@DoronArazii added this to the 5.2 milestone Jan 8, 2023
@DoronArazii

@bhalevy is anyone dealing with this? Is it in your team's queue?

@DoronArazii modified the milestones: 5.2, 5.3 Feb 2, 2023
@mykaul added the symptom/ci stability label Feb 9, 2023
@DoronArazii

@bhalevy is anyone dealing with this? Is it in your team's queue?

@michoecho ?

@michoecho (Contributor, Author)

> @bhalevy is anyone dealing with this? Is it in your team's queue?
>
> @michoecho ?

Nobody picked this bug up.

@bhalevy Should I pick it up?

@bhalevy (Member) commented Mar 30, 2023

> > @bhalevy is anyone dealing with this? Is it in your team's queue?
> >
> > @michoecho ?
>
> Nobody picked this bug up.
>
> @bhalevy Should I pick it up?

yes, please do

@DoronArazii removed the triage/master label Apr 3, 2023
@DoronArazii

Increasing the priority, as we are trying hard to address CI-stability cases.

/Cc @michoecho @bhalevy

@DoronArazii added the P1 Urgent label Apr 13, 2023
@DoronArazii

@michoecho did you have a chance to look into it?

@michoecho (Contributor, Author)

No, not yet.

By the way, after this bug was seen, #12048 was merged, which brought major changes to the handling of range tombstones in the cache. It's quite likely that the bug was swept up in the rewrite and disappeared as a side effect. (But it is still present in versions <= 5.2.)

@michoecho (Contributor, Author) commented Apr 21, 2023

@denesb After a long debugging session, I arrived at the conclusion that this bug is caused by this line (note: the entire discussion here applies only to 5.2, not to master):

const bool allow_eq = end_of_range || upper_bound.is_after_all_clustered_rows(_schema);
(added in 5cc5fd4), specifically by the end_of_range part.

If the generator is flushed right on the end bound of a range tombstone, and the next tombstone is adjacent to it,
the lower-position tombstone becomes "prev" and the next tombstone becomes "current", even though the latter is outside the flush range, and this messes up the logic.

Here's a unit test which proves the problem:

SEASTAR_TEST_CASE(test_range_tombstone_change_generator) {
    simple_schema s;
    auto rtcg = range_tombstone_change_generator(*s.schema());
    auto rt0 = s.make_range_tombstone(query::clustering_range::make_starting_with(s.make_ckey(0)));
    auto rt1 = s.make_range_tombstone(query::clustering_range::make_starting_with(s.make_ckey(1)));
    rtcg.consume(rt0);
    rtcg.consume(rt1);

    auto pos = position_in_partition::before_key(s.make_ckey(1));

    auto changes = std::vector<range_tombstone_change>();
    rtcg.flush(pos, [&] (range_tombstone_change rtc) {changes.push_back(rtc);}, true);

    BOOST_REQUIRE(changes.size() > 0);
    BOOST_REQUIRE(changes[0].equal(*s.schema(), range_tombstone_change(rt0.position(), rt0.tomb)));
    BOOST_REQUIRE(changes.size() > 1); // Fails.
    BOOST_REQUIRE(changes[1].equal(*s.schema(), range_tombstone_change(pos, tombstone())));
    BOOST_REQUIRE(changes.size() == 2);
    return make_ready_future<>();
}

Does this make sense? Please review the unit test and the logic of range_tombstone_change_generator::flush, and tell me if this really is a bug.

My other problem is: I don't understand how this situation occurs in practice. I don't know how to write a high level test (i.e. based on memtable/cache operations, rather than calling range_tombstone_change_generator directly) which triggers this bug. Do you have an idea about how to do that?

I debugged this issue by determinizing row_cache_test/test_concurrent_reads_and_eviction. The failure was reproducing about once per hour of repeated runs (with the same seed), so I couldn't debug it comfortably. I solved the problem by dumping all nondeterministic decisions (i.e. the sequence of results of need_preempt()) to a file, and when a run eventually failed, I got a reliable reproducer by replaying the decisions which led to the failed run.

This reproducer was enough to (slowly and painfully) lead me to the range_tombstone_change_generator problem, but it's too complex for me to figure out what kind of sequence of cache reads and updates may lead to the bug being triggered.

I'm looking for a simpler reproducer to make sure that the cache didn't get into some bad state in the first place, and that the range_tombstone_change_generator bug wasn't only triggered as a side effect of that bad state.

cc @bhalevy @tgrabiec Maybe you can also give me a hint about how to write a simpler reproducer for this bug.

@michoecho (Contributor, Author)

@DoronArazii This is a 5.2 correctness bug. Since it wasn't seen earlier, it's likely a 5.2 regression. Maybe it should be considered a release blocker.

@tgrabiec (Contributor)

> I solved the problem by dumping all nondeterministic decisions (i.e. the sequence of results of need_preempt()) to a file, and when a run eventually failed, I got a reliable reproducer by replaying the decisions which led to the failed run.

Would it be possible to upstream this mode so that we can debug such failures more easily in the future?

@tgrabiec (Contributor)

Looking at the contract of range_tombstone_change_generator::flush() with end_of_range = true, the implementation is wrong here:

while (!_range_tombstones.empty() && should_flush(_range_tombstones.begin()->end_position())) {

It should call should_flush() on position(), not end_position().

I don't understand why we need this change of behavior, though. The cache reader uses it here:

inline
void cache_flat_mutation_reader::move_to_next_range() {
    if (_queued_underlying_fragment) {
        add_to_buffer(*std::exchange(_queued_underlying_fragment, {}));
    }
    flush_tombstones(position_in_partition::for_range_end(*_ck_ranges_curr), true);

Setting the flag to true is supposed to emit an rtc at position position_in_partition::for_range_end(*_ck_ranges_curr), but since this is end of range, this rtc carries no useful information. The only rtc we should emit at that position is the one which changes the tombstone to null, according to the contract of flat_mutation_reader_v2:

///   1) The ranges of non-neutral clustered tombstones must be enclosed in requested
///      ranges. In other words, range tombstones don't extend beyond boundaries of requested ranges.

When I refactored this code, I didn't try to understand the existing logic; I replaced it with what I thought was correct.

@tgrabiec (Contributor)

For reference, the new logic in master looks like this:


inline
void cache_flat_mutation_reader::move_to_next_range() {
    if (_current_tombstone) {
        clogger.trace("csm {}: move_to_next_range: emit rtc({}, null)", fmt::ptr(this), _upper_bound);
        push_mutation_fragment(mutation_fragment_v2(*_schema, _permit, range_tombstone_change(_upper_bound, {})));
        _current_tombstone = {};
    }

@tgrabiec (Contributor)

Actually, I was confused in my analysis. end_of_range=true is not just supposed to emit positions which are equal to the upper_bound, as the documentation of flush() suggests; it is also supposed to emit an rtc to null at upper_bound if there is an active range tombstone (that's what the cache reader expects). The quoted line is fine, but the logic is incomplete.

Here's a row_cache-level test case:


SEASTAR_THREAD_TEST_CASE(test_range_tombstone_is_closed) {
    simple_schema s;
    tests::reader_concurrency_semaphore_wrapper semaphore;
    cache_tracker tracker;
    memtable_snapshot_source underlying(s.schema());

    auto pk = s.make_pkey(0);
    auto pr = dht::partition_range::make_singular(pk);

    mutation m1(s.schema(), pk);
    auto rt0 = s.make_range_tombstone(*position_range_to_clustering_range(position_range(
            position_in_partition::before_key(s.make_ckey(1)),
            position_in_partition::before_key(s.make_ckey(3))), *s.schema()));
    m1.partition().apply_delete(*s.schema(), rt0);
    s.add_row(m1, s.make_ckey(0), "v1");

    underlying.apply(m1);

    row_cache cache(s.schema(), snapshot_source([&] { return underlying(); }), tracker);
    populate_range(cache);

    // Create a reader to pin the MVCC version
    auto rd0 = cache.make_reader(s.schema(), semaphore.make_permit(), pr);
    auto close_rd0 = deferred_close(rd0);
    rd0.set_max_buffer_size(1);
    rd0.fill_buffer().get();

    mutation m2(s.schema(), pk);
    auto rt1 = s.make_range_tombstone(*position_range_to_clustering_range(position_range(
            position_in_partition::before_key(s.make_ckey(1)),
            position_in_partition::before_key(s.make_ckey(2))), *s.schema()));
    m2.partition().apply_delete(*s.schema(), rt1);
    apply(cache, underlying, m2);

    // State of cache:
    //  v2: ROW(0), RT(before(1), before(2))@t1
    //  v1: RT(before(1), before(3))@t0

    // range_tombstone_change_generator will work with the stream: RT(1, before(2))@t1, RT(before(2), before(3))@t0
    // It's important that there is an RT which starts exactly at the slice upper bound to trigger
    // the problem, and the RT will be in the stream only because it is a residual from RT(before(1), before(3)),
    // which overlaps with the slice in the older version. That's why we need two MVCC versions.

    auto slice = partition_slice_builder(*s.schema())
        .with_range(*position_range_to_clustering_range(position_range(
                position_in_partition::before_key(s.make_ckey(0)),
                position_in_partition::before_key(s.make_ckey(2))), *s.schema()))
        .build();

    assert_that(cache.make_reader(s.schema(), semaphore.make_permit(), pr, slice))
        .produces_partition_start(pk)
        .produces_row_with_key(s.make_ckey(0))
        .produces_range_tombstone_change(start_change(rt1))
        .produces_range_tombstone_change(end_change(rt1))
        .produces_partition_end()
        .produces_end_of_stream();
}

@tgrabiec (Contributor)

This fixes it:

diff --git a/range_tombstone_change_generator.hh b/range_tombstone_change_generator.hh
index 6f98be5dceb..e5fd3b8cb84 100644
--- a/range_tombstone_change_generator.hh
+++ b/range_tombstone_change_generator.hh
@@ -114,6 +114,7 @@ class range_tombstone_change_generator {
         // It cannot get adjacent later because prev->end_position() < upper_bound,
         // so nothing == prev->end_position() can be added after this invocation.
         if (prev && (_range_tombstones.empty()
+                     || end_of_range
                      || (cmp(prev->end_position(), _range_tombstones.begin()->position()) < 0))) {
             consumer(range_tombstone_change(prev->end_position(), tombstone())); // [2]
         }

But I'm not sure it's the best way.

@michoecho (Contributor, Author)

> Here's a row_cache-level test case:

Thanks, that's what I was looking for.

My attempt looked like this:

SEASTAR_TEST_CASE(test_invariant) {
    return seastar::async([] {
        simple_schema s;
        tests::reader_concurrency_semaphore_wrapper semaphore;
        cache_tracker tracker;
        memtable_snapshot_source underlying(s.schema());
        row_cache cache(s.schema(), snapshot_source([&] { return underlying(); }), tracker);

        auto pk = s.make_pkey(0);
        auto m1 = mutation(s.schema(), pk);
        auto rt1 = s.make_range_tombstone(query::clustering_range::make_ending_with(s.make_ckey(1)));
        m1.partition().apply_delete(*s.schema(), rt1);
        apply(cache, underlying, m1);

        auto pr = dht::partition_range::make_singular(pk);
        auto make_reader = [&] {
            auto rd = cache.make_reader(s.schema(), semaphore.make_permit(), pr);
            rd.set_max_buffer_size(1);
            rd.fill_buffer().get();
            return rd;
        };
        auto rd1 = make_reader();
        auto close_rd1 = deferred_close(rd1);

        auto m2 = mutation(s.schema(), pk);
        auto rt2 = s.make_range_tombstone(query::clustering_range::make_ending_with(s.make_ckey(0)));
        m2.partition().apply_delete(*s.schema(), rt2);
        apply(cache, underlying, m2);

        using bound = query::clustering_range::bound;
        auto end_bound = bound(s.make_ckey(0), true);
        auto slice_range = query::clustering_range::make_ending_with(bound(end_bound));
        auto slice = partition_slice_builder(*s.schema())
                .with_range(slice_range)
                .build();

        assert_that(cache.make_reader(s.schema(), semaphore.make_permit(), pr, slice))
                .produces_partition_start(pk)
                .produces_range_tombstone_change(start_change(rt2))
                .produces_range_tombstone_change(range_tombstone_change(position_in_partition::after_key(*s.schema(), s.make_ckey(0)), tombstone()))
                .produces_partition_end()
                .produces_end_of_stream();
    });
}

but when it didn't work, I decided that I probably misunderstood how range tombstones are read and that I should ask for help before spending too much time on debugging. I now see that I had only forgotten to populate the cache after the first apply.

> But I'm not sure it's the best way.

My idea for a fix was:

diff --git a/range_tombstone_change_generator.hh b/range_tombstone_change_generator.hh
index 6f98be5dce..48d3fc0043 100644
--- a/range_tombstone_change_generator.hh
+++ b/range_tombstone_change_generator.hh
@@ -83,7 +83,7 @@ class range_tombstone_change_generator {

         position_in_partition::tri_compare cmp(_schema);
         std::optional<range_tombstone> prev;
-        const bool allow_eq = end_of_range || upper_bound.is_after_all_clustered_rows(_schema);
+        const bool allow_eq = upper_bound.is_after_all_clustered_rows(_schema);
         const auto should_flush = [&] (position_in_partition_view pos) {
             const auto res = cmp(pos, upper_bound);
             if (allow_eq) {

In this case the range tombstone ending at upper_bound would be handled by the last two ifs instead of the first while and if. But I also don't know whether that's a good approach. Let's wait for others' opinions.

> Would it be possible to upstream this mode so that we can debug such failures more easily in the future?

It would, with some effort, but I'm not sure how useful it would be in general. This method is good if preemptions are the only source of nondeterminism in the test (which might be the case for quite a few tests, I guess), but as soon as more sources get involved (e.g. timers, SMP or IO), it won't help.
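
For illustration, here is a minimal, self-contained sketch of the record/replay idea. All names here (decision_trace and its methods) are hypothetical and not part of Scylla or Seastar; the actual debugging run simply routed the results of need_preempt() through a similar recorder and wrote them to a file.

#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): records a sequence of boolean
// decisions, e.g. the results of need_preempt(), and replays them verbatim.
class decision_trace {
    std::vector<char> _decisions;
    std::size_t _replay_idx = 0;
public:
    // Record mode: remember each outcome as the test observes it.
    bool record(bool outcome) {
        _decisions.push_back(outcome ? 1 : 0);
        return outcome;
    }
    // Replay mode: return the recorded outcomes in the original order.
    bool replay() {
        return _replay_idx < _decisions.size() && _decisions[_replay_idx++];
    }
    // Persist the trace of a failing run so it can be replayed later.
    void save(const std::string& path) const {
        std::ofstream(path, std::ios::binary).write(_decisions.data(), _decisions.size());
    }
    static decision_trace load(const std::string& path) {
        decision_trace t;
        std::ifstream in(path, std::ios::binary);
        t._decisions.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
        return t;
    }
};

// Usage sketch: while hunting for the failure, answer every "preempt here?"
// question via trace.record(need_preempt()) and save() the trace of the run
// that fails. To debug, load() that file and answer the same questions from
// replay(), which makes the failing interleaving deterministic.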

@tgrabiec (Contributor) commented Apr 22, 2023

> It would, with some effort, but I'm not sure how useful it would be in general. This method is good if preemptions are the only source of nondeterminism in the test (which might be the case for quite a few tests, I guess), but as soon as more sources get involved (e.g. timers, SMP or IO), it won't help.

Even if it works only for the test_concurrent_reads_and_eviction test, I think it's worth it. This test is notorious for finding hard-to-reproduce problems.

tgrabiec added a commit to tgrabiec/scylla that referenced this issue Apr 24, 2023
… range tombstone

Adds a reproducer for scylladb#12462, which doesn't manifest in master any
more after f73e2c9. It's still useful
to keep the test to avoid regressions.

The bug manifests by the reader throwing:

  std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}

The reason is that prior to the rework of the cache reader,
range_tombstone_change_generator::flush() was used with end_of_range=true to
produce the closing range_tombstone_change and it did not handle
correctly the case when there are two adjacent range tombstones and
flush(pos, end_of_range=true) is called such that pos is the boundary
between the two.
denesb pushed a commit that referenced this issue Apr 25, 2023
… range tombstone

Adds a reproducer for #12462, which doesn't manifest in master any
more after f73e2c9. It's still useful
to keep the test to avoid regressions.

The bug manifests by the reader throwing:

  std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}

The reason is that prior to the rework of the cache reader,
range_tombstone_change_generator::flush() was used with end_of_range=true to
produce the closing range_tombstone_change and it did not handle
correctly the case when there are two adjacent range tombstones and
flush(pos, end_of_range=true) is called such that pos is the boundary
between the two.

Closes #13665
@mykaul (Contributor) commented Apr 25, 2023

@tgrabiec, @bhalevy, @michoecho - I missed the question above (or, more accurately, the answer) - is this a 5.2.0 blocker? I sincerely hope we can fix it in 5.2.1 or so...

@DoronArazii

@avikivity please have a look and determine if it's a blocker

kbr-scylla pushed a commit that referenced this issue Apr 25, 2023
… range tombstone

Adds a reproducer for #12462, which doesn't manifest in master any
more after f73e2c9. It's still useful
to keep the test to avoid regressions.

The bug manifests by the reader throwing:

  std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}

The reason is that prior to the rework of the cache reader,
range_tombstone_change_generator::flush() was used with end_of_range=true to
produce the closing range_tombstone_change and it did not handle
correctly the case when there are two adjacent range tombstones and
flush(pos, end_of_range=true) is called such that pos is the boundary
between the two.

Closes #13665
@avikivity (Member)

According to #12462 (comment), this was introduced in 5.1 and is therefore not a 5.2 regression, and therefore not a 5.2 blocker.

However, we should fix it promptly.

kbr-scylla pushed a commit that referenced this issue Apr 25, 2023
… range tombstone

Adds a reproducer for #12462, which doesn't manifest in master any
more after f73e2c9. It's still useful
to keep the test to avoid regressions.

The bug manifests by the reader throwing:

  std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}

The reason is that prior to the rework of the cache reader,
range_tombstone_change_generator::flush() was used with end_of_range=true to
produce the closing range_tombstone_change and it did not handle
correctly the case when there are two adjacent range tombstones and
flush(pos, end_of_range=true) is called such that pos is the boundary
between the two.

Closes #13665
@michoecho (Contributor, Author)

@denesb @bhalevy This is the issue we were just talking about. Please take a look.

@denesb (Contributor) commented May 3, 2023

> This fixes it:
>
> diff --git a/range_tombstone_change_generator.hh b/range_tombstone_change_generator.hh
> index 6f98be5dceb..e5fd3b8cb84 100644
> --- a/range_tombstone_change_generator.hh
> +++ b/range_tombstone_change_generator.hh
> @@ -114,6 +114,7 @@ class range_tombstone_change_generator {
>          // It cannot get adjacent later because prev->end_position() < upper_bound,
>          // so nothing == prev->end_position() can be added after this invocation.
>          if (prev && (_range_tombstones.empty()
> +                     || end_of_range
>                       || (cmp(prev->end_position(), _range_tombstones.begin()->position()) < 0))) {
>              consumer(range_tombstone_change(prev->end_position(), tombstone())); // [2]
>          }
>
> But I'm not sure it's the best way.

To the extent I was able to understand the original context for 5cc5fd4, I think this is the right way.
If all the unit tests are passing, it should be good. I remember I had to add end_of_range because the slicing unit tests were giving me trouble; if they continue to pass, then we should be fine.

@mykaul (Contributor) commented May 8, 2023

@michoecho - will you be sending a patch to fix this? Is there anyone else who should be assigned to it?

@michoecho (Contributor, Author)

> @michoecho - will you be sending a patch to fix this?

Yes.

michoecho added a commit to michoecho/scylla that referenced this issue May 15, 2023
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.

In such a case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.

This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.

Fixes scylladb#12462
michoecho pushed a commit to michoecho/scylla that referenced this issue May 15, 2023
… range tombstone

Adds a reproducer for scylladb#12462.

The bug manifests by the reader throwing:

  std::logic_error: Stream ends with an active range tombstone: {range_tombstone_change: pos={position: clustered,ckp{},-1}, {tombstone: timestamp=-9223372036854775805, deletion_time=2}}

The reason is that prior to the fix range_tombstone_change_generator::flush()
was used with end_of_range=true to produce the closing range_tombstone_change
and it did not handle correctly the case when there are two adjacent range
tombstones and flush(pos, end_of_range=true) is called such that pos is the
boundary between the two.

Cherry-picked from a717c80.
tgrabiec added a commit that referenced this issue May 15, 2023
… from Michał Chojnowski

range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.

In such a case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.

This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.

Fixes #12462

Closes #13894

* github.com:scylladb/scylladb:
  tests: row_cache: Add reproducer for reader producing missing closing range tombstone
  range_tombstone_change_generator: fix an edge case in flush()
michoecho added a commit to michoecho/scylla that referenced this issue May 16, 2023
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.

In such a case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.

This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.

Fixes scylladb#12462
@raphaelsc (Member)

Backport to 5.1 missing apparently.

@michoecho (Contributor, Author)

> Backport to 5.1 missing apparently.

Note it still has the backport candidate label, though. I thought the maintainers just hadn't gotten to it yet.

@raphaelsc (Member)

> > Backport to 5.1 missing apparently.
>
> Note it still has the backport candidate label, though. I thought the maintainers just hadn't gotten to it yet.

@scylladb/scylla-maint ping

denesb pushed a commit that referenced this issue Jun 27, 2023
range_tombstone_change_generator::flush() mishandles the case when two range
tombstones are adjacent and flush(pos, end_of_range=true) is called with pos
equal to the end bound of the lesser-position range tombstone.

In such a case, the start change of the greater-position rtc will be accidentally
emitted, and there won't be an end change, which breaks reader assumptions by
ending the stream with an unclosed range tombstone, triggering an assertion.

This is due to a non-strict inequality used in a place where strict inequality
should be used. The modified line was intended to close range tombstones
which end exactly on the flush position, but this is unnecessary because such
range tombstones are handled by the last `if` in the function anyway.
Instead, this line caused range tombstones beginning right after the flush
position to be emitted sometimes.

Fixes #12462

Closes #13906

(cherry picked from commit 9b0679c)
@denesb (Contributor) commented Jun 27, 2023

Backported to 5.1 as f13f895.

@denesb (Contributor) commented Jun 27, 2023

BTW 9b0679c is present in 5.2 as 24d966f, but it has no cherry-pick notice. Maybe the backport was done without -x?

@michoecho (Contributor, Author)

> BTW 9b0679c is present in 5.2 as 24d966f, but it has no cherry-pick notice. Maybe the backport was done without -x?

My fault. I initially thought that this bug wasn't present in master (I thought the buggy code had been deleted), so I posted a PR against 5.2. After it got merged, I realized that it is present in master after all, so I posted a separate PR against master. (So it wasn't a standard backport procedure.)
I should have added a cross-reference to the master PR, but I didn't.
