storage: ensure monotonic stable offset updates #17039
Conversation
This change is similar to redpanda-data#1826 but for stable offsets this time.
/dt
Dirty entries are ones that aren't eligible for eviction because they might not yet have been written durably to disk. Evicting them would otherwise lead to "disappeared" data on a cache miss. We are making an implicit assumption that readers that want to read beyond the storage stable offset will always read through the batch cache. Otherwise, the read-after-write semantics won't hold, as the data buffered in the appender chunk cache isn't exposed.

---

Batch cache entries are `batch_cache::range` data structures. A range is a 32KiB memory arena with some ancillary data structures. It is associated with a `batch_cache_index`, and a `batch_cache_index` has a 1-to-1 mapping with a segment. When a batch is put in the cache, the `batch_cache_index` will append to an existing range or allocate a new one and insert it into the `batch_cache`.

We propose to track dirty batches at `batch_cache::range` granularity rather than per batch. Using the insight that dirty batches are inserted with monotonically increasing offsets, and the fact that we always mark a prefix of offsets as clean (clear the dirty flag), it is enough to track only the maximum dirty batch offset in a range and advance it when new dirty batches are inserted. For marking offsets as clean (on flush) we check whether `clean_offset >= max_dirty_offset` and, if so, mark the range as clean; otherwise nothing needs to be done.

A `batch_cache_index` can be associated with multiple ranges which potentially need to be marked as clean. As an optimization, we propose to track which ranges might contain dirty data and, when a request to mark a prefix of offsets clean comes in, iterate only through this subset of ranges. We use this information together with the already existing index which maps offsets to ranges (`absl::btree_map<model::offset, batch_cache::entry>`) to iterate only through the entries and ranges which contain the dirty offsets we are interested in. The complexity of this iteration is O(N), where N is the number of dirty batches that must be marked as clean.

To this end, we are introducing a new data structure called `batch_cache_dirty_tracker` to keep track of the minimum and maximum dirty offsets in an index, which is what the optimization needs. Technically, we could achieve a similar performance optimization by doing a binary search over the index using the desired mark-clean `up_to` offset as a needle, iterating backwards through ranges marking them as clean, and stopping once we hit a clean range. However, the new data structure has some additional benefits. It facilitates several invariant checks:

- Dirty offsets are monotonic: `vassert(_max <= range.first && _max < range.second, ...)`
- Truncation is requested only with a clean index: `vassert(_dirty_tracker.clean(), ...)`. See the next section for why this is important.
- At the end of the batch cache / batch cache index lifetime there are no dirty batches left in the index. This already caught 2 bugs in the existing code: 1) redpanda-data#17032 2) redpanda-data#17039

It also improves the debugging experience, as we include the state of the dirty tracker in the `batch_cache_index` ostream insertion operator.

---

Truncation needs special attention too. When truncation is requested we must evict batches with offsets higher than the truncation point but still retain dirty batches with smaller offsets. Since we can't have holes in the ranges and can evict only at range granularity, we require a flush prior to truncation. It is already present, so no changes are required.
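Below is a minimal, compilable sketch of the scheme described above. The names here (`cache_range`, `dirty_tracker`, `cache_index`) and the use of `std::map` in place of `absl::btree_map` are assumptions for illustration; this is a sketch of the idea, not the actual Redpanda implementation.

```cpp
// Minimal sketch of the dirty tracking described above. The names here
// (cache_range, dirty_tracker, cache_index) and std::map standing in for
// absl::btree_map are assumptions, not the actual Redpanda types.
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <optional>

using offset = std::int64_t;

// Per-range state: dirty batches arrive with monotonically increasing
// offsets and are cleaned as a prefix, so tracking only the maximum dirty
// batch offset in the range is enough.
struct cache_range {
    std::optional<offset> max_dirty_offset;

    void mark_dirty(offset last) { max_dirty_offset = last; }

    // Everything up to and including clean_offset is durable: either the
    // whole range is now clean, or there is nothing to do.
    void mark_clean(offset clean_offset) {
        if (max_dirty_offset && clean_offset >= *max_dirty_offset) {
            max_dirty_offset.reset();
        }
    }

    bool dirty() const { return max_dirty_offset.has_value(); }
};

// Index-level tracker of the [min, max] offset window that may still hold
// dirty batches; mark_clean only has to walk entries inside this window.
class dirty_tracker {
public:
    void mark_dirty(offset first, offset last) {
        // Invariant: dirty batches are inserted with monotonic offsets.
        assert(!_max || (*_max <= first && *_max < last));
        if (!_min) { _min = first; }
        _max = last;
    }

    void mark_clean(offset up_to) {
        if (!_max) { return; }
        if (up_to >= *_max) {
            _min.reset();
            _max.reset();
        } else if (up_to >= *_min) {
            _min = up_to + 1; // only a prefix became clean
        }
    }

    bool clean() const { return !_max; }
    offset min() const { return *_min; } // only valid when !clean()

private:
    std::optional<offset> _min;
    std::optional<offset> _max;
};

// Stand-in for batch_cache_index: one entry per cached batch, keyed by the
// batch base offset and pointing at the range (arena) that holds it.
struct cache_index {
    std::map<offset, std::shared_ptr<cache_range>> entries;
    dirty_tracker tracker;

    void put_dirty(offset first, offset last, std::shared_ptr<cache_range> r) {
        r->mark_dirty(last);
        tracker.mark_dirty(first, last);
        entries[first] = std::move(r);
    }

    // Walk only the entries that may hold dirty batches <= up_to; the cost
    // is proportional to the number of dirty batches that become clean.
    void mark_clean(offset up_to) {
        if (tracker.clean() || up_to < tracker.min()) { return; }
        for (auto it = entries.lower_bound(tracker.min());
             it != entries.end() && it->first <= up_to;
             ++it) {
            it->second->mark_clean(up_to);
        }
        tracker.mark_clean(up_to);
    }
};

int main() {
    cache_index idx;
    auto r = std::make_shared<cache_range>();
    idx.put_dirty(0, 9, r);   // dirty batch [0, 9]
    idx.put_dirty(10, 19, r); // dirty batch [10, 19] lands in the same range
    idx.mark_clean(9);        // partial flush: the range stays dirty
    assert(r->dirty());
    idx.mark_clean(19);       // everything durable now
    assert(!r->dirty() && idx.tracker.clean());
}
```

On top of something like `tracker.clean()` the real code can also assert the truncation invariant (`vassert(_dirty_tracker.clean(), ...)`): a flush, and therefore a mark-clean pass, runs before truncation, so no dirty batches are lost when ranges are evicted at truncation time.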
overall lgtm, just some clarifying questions
@@ -340,7 +340,7 @@ ss::future<> segment::do_flush() {
       // outstanding flushes once the one executed later in terms of offset
       // finishes we guarantee that all previous flushes finished.
       _tracker.committed_offset = std::max(o, _tracker.committed_offset);
-      _tracker.stable_offset = _tracker.committed_offset;
+      _tracker.stable_offset = std::max(o, _tracker.stable_offset);
Trying to connect the dots here: should `o` here be exactly the same as what was advanced to during the appender flush?
Also, did you have a specific sequence in mind that leads to this being non-monotonic previously? That would really help understand this change (this change looks much more intuitive, but I just want to know what went wrong before).
> Trying to connect the dots here: should `o` here be exactly the same as what was advanced to during the appender flush?
Yes.
> Also, did you have a specific sequence in mind that leads to this being non-monotonic previously? That would really help understand this change (this change looks much more intuitive, but I just want to know what went wrong before).
The updated test fails without this change.
- A issues a flush and stores the dirty offset (N) in `o`.
- B does 10 additional writes, issues a flush, stores the dirty offset (N+10) in `o`, actually flushes, and updates `stable_offset` to N+10.
- A finally does the actual `appender::flush` and moves `stable_offset` back to N, overwriting the value set by B.
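A toy, compilable model of that interleaving follows (the `offset_tracker` struct and the hard-coded offsets are assumptions for illustration, not the real `segment`/`appender` code): B's flush, which captured the higher dirty offset, completes first, and A's completes afterwards with the stale one; taking `std::max` against the current `stable_offset` keeps the late completion from moving it backwards.

```cpp
// Toy model of the race above; offset_tracker and the hard-coded offsets
// are illustrative, not the real segment/appender code.
#include <algorithm>
#include <cassert>
#include <cstdint>

struct offset_tracker {
    std::int64_t committed_offset = -1;
    std::int64_t stable_offset = -1;
};

// Completion of a flush that captured dirty offset `o` when it was issued.
void flush_completed(offset_tracker& t, std::int64_t o) {
    t.committed_offset = std::max(o, t.committed_offset);
    // The fix: a flush that captured an older offset and completes late can
    // no longer move the stable offset backwards. Overwriting it with the
    // stale captured value, as the pre-fix sequence above effectively did,
    // would drop it back to N in the scenario below.
    t.stable_offset = std::max(o, t.stable_offset);
}

int main() {
    offset_tracker t;
    const std::int64_t n = 100;

    flush_completed(t, n + 10); // B captured N+10 and completes first
    flush_completed(t, n);      // A captured N and completes last

    assert(t.committed_offset == n + 10);
    assert(t.stable_offset == n + 10); // stays monotonic with the fix
}
```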
/backport v23.3.x
/backport v23.2.x
Failed to create a backport PR to v23.2.x branch. I tried: