
storage: support dirty entries in batch cache #16954

Merged

Conversation

@nvartolomei (Contributor) commented Mar 7, 2024

Cover letter is in the first commit of the PR https://github.com/redpanda-data/redpanda/pull/16954/commits

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@andrwng (Contributor) left a comment:

Not a thorough review yet, but I like the direction

src/v/storage/batch_cache.h (outdated, resolved)
@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch from 49c1e29 to 9767245 on March 11, 2024 14:11
@dotnwat (Member) left a comment:
i think the direction here makes sense.

src/v/storage/segment.cc (outdated, resolved)
src/v/storage/segment.cc (resolved)
src/v/storage/batch_cache.h (outdated, resolved)
src/v/storage/batch_cache.cc (outdated, resolved)
src/v/storage/batch_cache.h (outdated, resolved)
@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch from 9767245 to 0409642 on March 12, 2024 21:53
@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch 2 times, most recently from d9cf0db to a169ce1 on March 14, 2024 12:40
@nvartolomei nvartolomei changed the title from "write caching rfc draft" to "storage: support dirty entries in batch cache" on Mar 14, 2024
@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch from a169ce1 to ce420b9 on March 14, 2024 21:32
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Mar 14, 2024
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Mar 14, 2024
@vbotbuildovich (Collaborator) commented Mar 15, 2024

@nvartolomei (Contributor, Author) commented:

/cdt

@andrwng (Contributor) left a comment:
Nothing really blocking. Some clarifying questions, some test suggestions, but overall this looks good!

src/v/storage/batch_cache.h (resolved)
src/v/storage/batch_cache.cc (outdated, resolved)
src/v/storage/tests/batch_cache_test.cc (resolved)
src/v/storage/tests/batch_cache_test.cc (resolved)
Comment on lines +377 to +383
```cpp
log
  ->truncate(
    storage::truncate_config(truncate_offset, ss::default_priority_class()))
  .get();

// Reclaim everything from cache.
storage::testing_details::log_manager_accessor::batch_cache(mgr).clear();
```
Contributor:

nit: might also be interesting to continue appending dirty batches, to make sure we don't hit any of the vasserts ensuring monotonicity

nvartolomei (Contributor, Author) replied:

That's a good call. I'm thinking about a good place to add this that will also cover the randomization you mentioned earlier. We have a read_write_truncate test which I'm thinking about expanding (or copying) to cover the batch cache too.

nvartolomei (Contributor, Author) replied:
Likely a follow-up PR.

@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch from ce420b9 to 32f82e1 on March 15, 2024 14:24
Maintain the `stable_offset <= dirty_offset` invariant.
`advance_stable_offset` may be called before the continuation attached
to `segment_appender::append` where we advance the dirty offset.
@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch from 32f82e1 to f6bcdd7 on March 15, 2024 14:48
```diff
@@ -648,8 +648,13 @@ void segment::advance_stable_offset(size_t filepos) {
 }

 _reader->set_file_size(it->first);
```
Contributor:
nit: squash with the previous commit?

nvartolomei (Contributor, Author) replied:
🤦 git absorb and rebase didn't work great here. Will do after I get the CI results.

src/v/storage/segment.cc (resolved)
src/v/storage/batch_cache.cc (outdated, resolved)
I plan to extend the test suite to cover write-caching functionality
too. In the case of write caching we will want to read past the
committed offset.

I couldn't infer a good motivation for bounding test readers to the
committed offset. I believe that not having the upper bound by default
can help uncover hidden bugs both in the implementation and in the
testing code.

If anything needs to bound reads, it must do so explicitly, just like
regular application code does.

Dirty entries are ones that aren't eligible for eviction because they
might not yet have been written durably to disk. Evicting them would
otherwise lead to "disappeared" data on a cache miss.

We are making an implicit assumption that readers that want to read
beyond the storage stable offset will always read through the batch
cache. Otherwise, the read-after-write semantics won't hold, as the data
buffered in the appender chunk cache isn't exposed.
---

Batch cache entries are `batch_cache::range` data structures. A range is
a 32KiB memory arena with some ancillary data structures. It is
associated with a `batch_cache_index`. A `batch_cache_index` has a
1-to-1 mapping with a segment. When a batch is put in the cache, the
`batch_cache_index` will append to an existing range or allocate a new
one and insert it into the `batch_cache`.
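
To make the layering concrete, here is a minimal sketch of that put path. All type and member names below (`range_sketch`, `cache_sketch`, `index_sketch`, `put`, `fits`) are illustrative stand-ins, not the actual `batch_cache` API:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

using offset_t = int64_t;

// Stand-in for batch_cache::range: a fixed-size arena plus bookkeeping.
struct range_sketch {
    static constexpr size_t arena_size = 32 * 1024; // 32KiB arena
    std::vector<char> arena = std::vector<char>(arena_size);
    size_t used = 0;

    bool fits(size_t bytes) const { return used + bytes <= arena_size; }
    void append(const char* data, size_t bytes) {
        std::copy(data, data + bytes, arena.begin() + used);
        used += bytes;
    }
};

// Stand-in for batch_cache: owns the ranges of all indexes (for eviction).
struct cache_sketch {
    std::vector<std::shared_ptr<range_sketch>> ranges;
};

// Stand-in for batch_cache_index: one per segment, maps batch offsets to
// the range holding them.
struct index_sketch {
    cache_sketch* cache;
    std::map<offset_t, std::shared_ptr<range_sketch>> index;

    // Put a serialized batch: append to the most recent range if it fits,
    // otherwise allocate a new range and hand it to the cache.
    void put(offset_t base_offset, const char* data, size_t bytes) {
        std::shared_ptr<range_sketch> r;
        if (!index.empty() && index.rbegin()->second->fits(bytes)) {
            r = index.rbegin()->second;
        } else {
            r = std::make_shared<range_sketch>();
            cache->ranges.push_back(r);
        }
        r->append(data, bytes);
        index.emplace(base_offset, std::move(r));
    }
};
```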

We propose to track dirty batches at `batch_cache::range` granularity
rather than per batch.

Using the insight that dirty batches are inserted with monotonically
increasing offsets, and the fact that we always mark a prefix of offsets
as clean (clearing the dirty flag), it is enough to track only the
maximum dirty batch offset in a range and advance it when new dirty
batches are inserted. When marking offsets as clean (on flush), we check
whether `clean_offset >= max_dirty_offset` and, if so, mark the range as
clean; otherwise, nothing needs to be done.
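
A hedged sketch of the per-range bookkeeping this implies; the names here (`record_dirty`, `mark_clean`, `evictable`, `_max_dirty_offset`) are illustrative rather than the exact `batch_cache::range` interface:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using offset_t = int64_t;

// Only the dirty-tracking slice of a range is shown here.
class range_dirty_state {
public:
    // Called when a dirty batch ending at `last_offset` is appended.
    // Dirty batches arrive with monotonically increasing offsets, so the
    // maximum offset alone describes all dirty data in the range.
    void record_dirty(offset_t last_offset) {
        assert(!_max_dirty_offset || *_max_dirty_offset < last_offset);
        _max_dirty_offset = last_offset;
    }

    // Called on flush with the highest offset known to be durable.
    // Cleaning always covers a prefix of offsets, so once the clean
    // offset reaches the maximum dirty offset the whole range is clean.
    void mark_clean(offset_t clean_offset) {
        if (_max_dirty_offset && clean_offset >= *_max_dirty_offset) {
            _max_dirty_offset.reset();
        }
    }

    // A range holding dirty data must not be evicted.
    bool evictable() const { return !_max_dirty_offset.has_value(); }

private:
    std::optional<offset_t> _max_dirty_offset;
};
```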

A `batch_cache_index` can be associated with multiple ranges which
potentially need to be marked as clean. As an optimization, we propose
to track which ranges might contain dirty data and, when a request to
mark a prefix of offsets as clean comes in, to iterate only through this
subset of ranges. We use this information together with the already
existing index which maps offsets to ranges
(`absl::btree_map<model::offset, batch_cache::entry>`) to iterate only
through the entries and ranges which contain the dirty offsets we are
interested in. The complexity of this iteration is O(N), where N is the
number of dirty batches that must be marked as clean.

To this end, we are introducing a new data structure called
`batch_cache_dirty_tracker` that keeps track of the minimum and maximum
dirty offsets in an index, which is exactly what this optimization
needs.
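
A minimal sketch of what such a tracker could look like; the method names (`mark_dirty`, `mark_clean`, `clean`, `min`) follow the prose above, but the real `batch_cache_dirty_tracker` interface may differ:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

using offset_t = int64_t;

// Tracks the window [min, max] of offsets in a batch_cache_index that may
// still be dirty: it lets mark_clean return early when nothing is dirty
// and start iterating at the first possibly-dirty range.
class dirty_tracker_sketch {
public:
    // A dirty batch range [first, last] was put into the index.
    void mark_dirty(offset_t first, offset_t last) {
        // Dirty offsets are inserted monotonically.
        assert(clean() || (_max <= first && _max < last));
        if (clean()) {
            _min = first;
        }
        _max = last;
    }

    // The prefix [.., up_to] has become durable; shrink or reset the window.
    void mark_clean(offset_t up_to) {
        if (clean() || up_to < _min) {
            return; // nothing dirty in that prefix
        }
        if (up_to >= _max) {
            _min = std::numeric_limits<offset_t>::max(); // fully clean again
            _max = std::numeric_limits<offset_t>::min();
        } else {
            _min = up_to + 1;
        }
    }

    bool clean() const { return _min > _max; }
    offset_t min() const { return _min; }

private:
    offset_t _min = std::numeric_limits<offset_t>::max();
    offset_t _max = std::numeric_limits<offset_t>::min();
};
```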

Technically, we could achieve a similar performance optimization by
doing a binary search over the index using the desired mark-clean
`up_to` offset as a needle, iterating backwards through ranges, marking
them as clean, and stopping once we hit a clean range. However, the new
data structure has some additional benefits.

It facilitates some invariant checks:
  - Dirty offsets are monotonic: `vassert(_max <= range.first && _max <
    range.second, ...)`
  - Truncation is requested only with a clean index:
    `vassert(_dirty_tracker.clean(), ...)`. See the next section for why
    this is important.
  - At the end of batch cache / batch cache index lifetime there are no
    dirty batches left in the index. This already caught 2 bugs in the
    existing code:
      1) redpanda-data#17032
      2) redpanda-data#17039

Also, it improves the debugging experience as we include the state of
the dirty tracker in the `batch_cache_index` ostream insertion operator.

---

Truncation needs special attention too. When truncation is requested we
must evict batches with offsets higher than the truncation point but
still retain dirty batches with smaller offsets. Since we can't have
holes in the ranges and can evict only at range granularity, we require
a flush prior to truncation. That flush is already present, so no
changes are required. This proves that `<` is the right comparison
operator in the following if statement of the
`batch_cache_index::mark_clean` routine:

```cpp
if (_dirty_tracker.clean() || up_to < _dirty_tracker.min()) {
    // No dirty data in the cache.
    return;
}
```
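
To make the flush-before-truncate ordering concrete, here is a self-contained sketch with stand-in types (`range_stub`, `index_stub`); the real code expresses the same requirement with `vassert(_dirty_tracker.clean(), ...)` inside `batch_cache_index`:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using offset_t = int64_t;

// Stand-ins for batch_cache::range and batch_cache_index.
struct range_stub {
    offset_t last_offset;
    bool dirty = false;
};

struct index_stub {
    std::map<offset_t, range_stub> ranges; // keyed by the range's first offset

    bool clean() const {
        for (const auto& [base, r] : ranges) {
            if (r.dirty) {
                return false;
            }
        }
        return true;
    }

    // Truncate the cached data at `truncate_at`. The caller must have
    // flushed (and marked clean) everything first: eviction works at range
    // granularity, so dropping a dirty range would lose unpersisted data.
    void truncate(offset_t truncate_at) {
        assert(clean()); // mirrors vassert(_dirty_tracker.clean(), ...)
        for (auto it = ranges.begin(); it != ranges.end();) {
            // Evict every range that holds offsets past the truncation
            // point; the retained ranges contain only clean data.
            if (it->second.last_offset > truncate_at) {
                it = ranges.erase(it);
            } else {
                ++it;
            }
        }
    }
};
```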
@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch from f6bcdd7 to 31c451a on March 15, 2024 21:44
@nvartolomei (Contributor, Author)

@dotnwat (Member) commented Mar 19, 2024:

> Cover letter is in the first commit of the PR https://github.com/redpanda-data/redpanda/pull/16954/commits

Is this still true?

@dotnwat (Member) left a comment:
looks great. left a couple questions, but not blockers. really glad to see so many assertions in there!

Comment on lines +557 to +560
```cpp
// that case, the batch entry should be considered clean.
_tracker.stable_offset < b.last_offset()
  ? batch_cache::is_dirty_entry::yes
  : batch_cache::is_dirty_entry::no);
```
Member:

looks ok. curious why not to do this in the batch cache itself by tracking the cleaned-up-to offset? i guess maybe are doing that accounting outside the cache?

Comment on lines +208 to +234
```cpp
// Maximum dirty batch offset that is stored in this range. If a range
// contains any dirty batch offsets we prevent its eviction so that
// readers get cache hits until the data is persisted to disk and we can
// rehydrate on a miss.
//
// Dirty batches are inserted with monotonically increasing offsets and
// marked as clean with an upper bound.
//
// You can imagine the put operation moving one cursor forward and
// mark_clean operation moving another one from behind. Once they
// overlap, the range is considered clean.
//
// Let's take the following example range with clean and dirty batches:
//
// +------------+------------+------------+----------+
// |  c=10..15  |  d=16..18  |  d=19..25  |  c=4..6  |
// +------------+------------+------------+----------+
//                                        ^
// _max_dirty_offset=25 ------------------+
//
// Note 1: Dirty batches always are monotonically increasing but clean
// batches might not be as they might have been added by concurrent
// read operations.
//
// Note 2: We don't actually track whether a particular batch is clean
// or dirty. The `c` and `d` on the diagram are for illustration
// purposes only.
```
Member:

🔥

Comment on lines +468 to +469
```cpp
/// 2. Truncation is requested only with a clean index (we can evict only at
///    range level).
```
Member:

i take this truncation isn't the same as truncating a segment where i guess it would be fine to throw away dirty data?

```cpp
  first != _index.end(),
  "Iterator must exist if dirty tracker isn't clean.");

auto last = std::next(find_first(up_to_inclusive));
```
Member:

what guarantees that find_first here doesn't return end()?

@dotnwat dotnwat merged commit 8a2deb5 into redpanda-data:dev Mar 19, 2024
13 of 16 checks passed