storage: assert no inflight writes when closing segment #17032
Conversation
Register the inflight write before calling append on the segment. Append may dispatch a background write and a subsequent call to `maybe_advance_stable_offset` before the append continuation runs. This was happening in CI when running with a RAM disk (very fast storage). When the appender advances the stable offset, a callback for advancing the stable offset at the segment level is also called, `segment::advance_stable_offset`. This method uses the inflight writes to determine how far the segment stable offset can be advanced. Because the last inflight write wasn't registered yet, the stable offset could get stuck. In theory, because we use the segment stable offset as the reader upper bound, this can lead to a situation where acked data is not visible, and worse consequences.
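To make the race concrete, here is a minimal sketch of the corrected ordering in the append path. It is simplified for illustration: the function name `do_append`, the surrounding types, and the bookkeeping in the continuation are assumptions, not the actual Redpanda code.

```cpp
// Minimal sketch (not the real segment::append) of the corrected ordering.
// Assumes _inflight is an ordered map from end-of-write file offset to the
// batch's last model offset.
ss::future<> segment::do_append(const model::record_batch& b) {
    auto start_physical_offset = _appender->file_byte_offset();
    auto expected_end_physical = start_physical_offset + b.size_bytes();

    // Register the inflight write *before* dispatching the append. The
    // appender may kick off a background write whose completion calls
    // segment::advance_stable_offset() before the continuation below runs,
    // and that callback must be able to see this entry; otherwise the
    // stable offset can get stuck.
    _inflight.emplace(expected_end_physical, b.last_offset());

    // proxy serialization to segment_appender
    return _appender->append(b).then(
      [this, &b, start_physical_offset, expected_end_physical] {
          // Bookkeeping for the completed append (dirty offset, indexing,
          // etc.). Previously the inflight registration happened here,
          // which was too late.
          _tracker.dirty_offset = b.last_offset();
      });
}
```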
@@ -96,6 +96,8 @@ ss::future<> segment::close() {
    co_await do_flush();
    co_await do_close();

    vassert(_inflight.empty(), "Closing segment with inflight writes.");
Is this a problem if the appender continues to live on and complete the inflight writes? Here `_inflight` seems to only be tracking the inflight writes, not owning them.
We did acquire the write lock above, we did flush too. Why would we expect outstanding inflight writes?
Ahh I didn't notice that you had the fix below.
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46144#018e3a16-8890-4ec4-9d10-1d3032c6e5e1
    _inflight.emplace(expected_end_physical, b.last_offset());

    // proxy serialization to segment_appender
    auto write_fut = _appender->append(b).then(
-     [this, &b, start_physical_offset] {
+     [this, &b, start_physical_offset, expected_end_physical] {
> When the appender advances the stable offset, a callback for advancing the stable offset at the segment level is also called, `segment::advance_stable_offset`. This method uses the inflight writes to determine how far the segment stable offset can be advanced.

Ahh, I see now. Great find. I'm very surprised this never happened in Debug builds, where the scheduler randomizes the order of execution of runnable continuations, and that it took very fast storage to trigger.

> Because the last inflight write wasn't registered yet, the stable offset can get stuck.

This part is pretty cool: normally we could drop the upcall, but it would only really matter if no new appends arrived promptly.
> I'm very surprised this never happened in Debug builds

+1. Also, I'm wondering if this is somehow more likely with the new dirty tracking logic in the batch cache, though nothing really jumps out.
Generally it was masked by the code in `segment::do_flush`:
ss::future<> segment::do_flush() {
_generation_id++;
if (!_appender) {
return ss::make_ready_future<>();
}
auto o = _tracker.dirty_offset;
auto fsize = _appender->file_byte_offset();
return _appender->flush().then([this, o, fsize] {
// never move committed offset backward, there may be multiple
// outstanding flushes once the one executed later in terms of offset
// finishes we guarantee that all previous flushes finished.
_tracker.committed_offset = std::max(o, _tracker.committed_offset);
// ########## HERE ##########
_tracker.stable_offset = _tracker.committed_offset;
// ########## HERE ##########
_reader->set_file_size(std::max(fsize, _reader->file_size()));
clear_cached_disk_usage();
});
}
However (very important!) this code also contains a bug: with concurrent flushes, data can go missing. See #17039.
@nvartolomei needs a cover letter too.
Great catch! We really ought to keep a diagram of sorts around to keep track of the control flow of the append path and all its background work...
Is this not marked for backport because the chance of a reader being unable to read acked data is very low (maybe even only theoretical)?
> because we use the segment stable offset as the reader upper bound, this can lead to a situation where acked data is not visible and worse consequences.

Just making sure: even though we're adding to `_inflight` before appending, we aren't at risk of having readers read that data, right (especially important since the append may not have been dispatched yet)? It's just that now, `advance_stable_offset` will at least be able to consider the in-flight write even if we haven't called append() yet?
Correct, `advance_stable_offset` operates on file offsets which were actually written, and on inflight entries which are at or below that offset.
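For illustration, a rough sketch of that logic, assuming `_inflight` is an ordered map from end-of-write file offset to the batch's last model offset (names and exact bookkeeping are assumptions, not the exact Redpanda implementation):

```cpp
// Sketch only: advance the stable offset based on how many bytes the
// appender has durably written. Only inflight entries whose end file offset
// is at or below the written offset can advance the stable offset.
void segment::advance_stable_offset(size_t written_file_offset) {
    // first entry that ends *beyond* what has been written so far
    auto it = _inflight.upper_bound(written_file_offset);
    if (it == _inflight.begin()) {
        return; // nothing fully written yet; stable offset unchanged
    }
    // the last fully written entry gives the new stable offset
    _tracker.stable_offset = std::prev(it)->second;
    _inflight.erase(_inflight.begin(), it);
}
```

This is also why registering the entry before the append is dispatched matters: the callback only ever sees what is already in `_inflight`.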
/cdt
Dirty entries are ones that aren't eligible for eviction as they might not yet have been written durably to disk. Evicting them would otherwise lead to "disappeared" data on a cache miss. We are making an implicit assumption that readers that want to read beyond the storage stable offset will always read through the batch cache. Otherwise, the read-after-write semantics won't hold, as the data buffered in the appender chunk cache isn't exposed.

---

Batch cache entries are `batch_cache::range` data structures. A range is a 32KiB memory arena with some ancillary data structures. It is associated with a `batch_cache_index`. A `batch_cache_index` has a 1-to-1 mapping with a segment. When a batch is put in the cache, the `batch_cache_index` will append to an existing range or allocate a new one and insert it into the `batch_cache`.

We propose to track dirty batches at `batch_cache::range` granularity rather than per batch. Using the insight that dirty batches are inserted with monotonically increasing offsets, and the fact that we always mark as clean (clear the dirty flag) a prefix of offsets, it is enough to track only the maximum dirty batch offset in a range and advance it when new dirty batches are inserted. For marking offsets as clean (on flush), we check if `clean_offset >= max_dirty_offset` and mark the range as clean. In the alternative case, nothing needs to be done.

A `batch_cache_index` can be associated with multiple ranges which potentially need to be marked as clean. As an optimization, we propose to track which ranges might contain dirty data and, when the request for marking a prefix of offsets comes in, to iterate only through this subset of ranges. We will use this information together with the already existing index which maps offsets to ranges (`absl::btree_map<model::offset, batch_cache::entry>`) to iterate only through the entries and ranges which contain the dirty offsets we are interested in. The complexity of this iteration is O(N), where N is the number of dirty batches that must be marked as clean.

To this end, we are introducing a new data structure called `batch_cache_dirty_tracker` to keep track of the minimum and maximum dirty offsets in an index, which we need for the optimization. Technically, we could achieve a similar performance optimization by doing a binary search over the index using the desired mark-clean `up_to` offset as a needle, iterating backwards through ranges, marking them as clean, and stopping once we hit a clean range. However, the new data structure has some additional benefits.

This data structure facilitates some invariant checks:
- Dirty offsets are monotonic: `vassert(_max <= range.first && _max < range.second, ...)`
- Truncation is requested only with a clean index: `vassert(_dirty_tracker.clean(), ...)`. See the next section for why this is important.
- At the end of the batch cache / batch cache index lifetime there are no dirty batches left in the index. This already caught 2 bugs in the existing code: 1) redpanda-data#17032 2) redpanda-data#17039

Also, it improves the debugging experience, as we include the state of the dirty tracker in the `batch_cache_index` ostream insertion operator.

---

Truncation needs special attention too. When truncation is requested, we must evict batches with offsets higher than the truncation point but still retain dirty batches with smaller offsets. Since we can't have holes in the ranges and can evict only at range granularity, we require a flush prior to truncation. It is already present, so no changes are required.
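A minimal sketch of what such a tracker could look like, based only on the description above; the class shape, member names, and the sentinel choice are assumptions for illustration, not the actual implementation:

```cpp
// Hypothetical sketch of a per-index dirty-offset tracker: it records only
// the min/max dirty offsets and asserts that dirty batches arrive with
// monotonically increasing offsets.
class batch_cache_dirty_tracker {
public:
    // called when a dirty batch covering offsets [range.first, range.second]
    // is inserted into the index
    void mark_dirty(std::pair<model::offset, model::offset> range) {
        vassert(
          _max <= range.first && _max < range.second,
          "dirty offsets must be monotonic");
        if (clean()) {
            _min = range.first;
        }
        _max = range.second;
    }

    // called on flush: the prefix of offsets up to `up_to` is now durable
    void mark_clean(model::offset up_to) {
        if (up_to >= _max) {
            // everything tracked so far is durable; reset to the clean state
            _min = model::offset{};
            _max = model::offset{};
        }
        // otherwise a suffix is still dirty and nothing needs to be done
    }

    bool clean() const {
        return _min == model::offset{} && _max == model::offset{};
    }

private:
    // default-constructed offsets are used here as a "no dirty data" sentinel
    model::offset _min;
    model::offset _max;
};
```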
Known cdt failures: unrelated.
Register the inflight write before calling append on the segment. Append may dispatch a background write and a subsequent call to `maybe_advance_stable_offset` before the append continuation runs. This was happening in CI when running with a RAM disk (very fast storage).

When the appender advances the stable offset, a callback for advancing the stable offset at the segment level is also called, `segment::advance_stable_offset`. This method uses the inflight writes to determine how far the segment stable offset can be advanced. Because the last inflight write wasn't registered yet, the stable offset could get stuck.

In theory, because we use the segment stable offset as the reader upper bound, this can lead to a situation where acked data is not visible, and worse consequences.
Backports Required
Release Notes