
storage: support dirty entries in batch cache #16954

Merged

Conversation

@nvartolomei (Contributor) commented Mar 7, 2024

Cover letter is in the first commit of the PR https://github.com/redpanda-data/redpanda/pull/16954/commits

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@andrwng (Contributor) left a comment:

Not a thorough review yet, but I like the direction

src/v/storage/batch_cache.h (outdated, resolved)
@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch from 49c1e29 to 9767245 on March 11, 2024 14:11
@dotnwat (Member) left a comment:
i think the direction here makes sense.

src/v/storage/segment.cc (outdated, resolved)
src/v/storage/segment.cc (resolved)
src/v/storage/batch_cache.h (outdated, resolved)
src/v/storage/batch_cache.cc (outdated, resolved)
src/v/storage/batch_cache.h (outdated, resolved)
@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch from 9767245 to 0409642 on March 12, 2024 21:53
@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch 2 times, most recently from d9cf0db to a169ce1 on March 14, 2024 12:40
@nvartolomei nvartolomei changed the title from "write caching rfc draft" to "storage: support dirty entries in batch cache" on Mar 14, 2024
@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch from a169ce1 to ce420b9 on March 14, 2024 21:32
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Mar 14, 2024
@redpanda-data redpanda-data deleted a comment from vbotbuildovich Mar 14, 2024
@vbotbuildovich (Collaborator) commented Mar 15, 2024

@nvartolomei (Contributor, Author) commented:

/cdt

@andrwng (Contributor) left a comment:
Nothing really blocking. Some clarifying questions, some test suggestions, but overall this looks good!

src/v/storage/batch_cache.h (resolved)
src/v/storage/batch_cache.cc (outdated, resolved)
src/v/storage/tests/batch_cache_test.cc (resolved)
src/v/storage/tests/batch_cache_test.cc (resolved)
Comment on lines +377 to +383
```cpp
log
  ->truncate(
    storage::truncate_config(truncate_offset, ss::default_priority_class()))
  .get();

// Reclaim everything from cache.
storage::testing_details::log_manager_accessor::batch_cache(mgr).clear();
```
Contributor:

nit: might also be interesting to continue appending dirty batches, to make sure we don't hit any of the vasserts ensuring monotonicity

nvartolomei (Contributor, Author) replied:

That's a good call. I'm thinking about a good place to add this that will also cover the randomization you mentioned earlier. We have a read_write_truncate test which I'm thinking about expanding (or copying) to cover the batch cache too.

nvartolomei (Contributor, Author) replied:
Likely a follow-up PR.

@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch from ce420b9 to 32f82e1 on March 15, 2024 14:24
Maintain the `stable_offset <= dirty_offset` invariant.
`advance_stable_offset` may be called before the continuation attached
to `segment_appender::append` where we advance the dirty offset.
@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch from 32f82e1 to f6bcdd7 on March 15, 2024 14:48
```diff
@@ -648,8 +648,13 @@ void segment::advance_stable_offset(size_t filepos) {
 }

 _reader->set_file_size(it->first);
```
Contributor:
nit: squash with the previous commit?

nvartolomei (Contributor, Author) replied:
🤦 git absorb and rebase didn't work great here. Will do after I get the CI results.

src/v/storage/segment.cc (resolved)
src/v/storage/batch_cache.cc (outdated, resolved)
I plan to extend the test suite to cover write-caching functionality
too. In the case of write caching we will want to read past the
committed offset.

I couldn't infer a good motivation for bounding test readers to the
committed offset. I believe that not having the upper bound by default
can help uncover hidden bugs both in the implementation and in the
testing code.

If anything needs to bound reads, it must do so explicitly, just like
regular application code does.

Dirty entries are ones that aren't eligible for eviction because they
might not yet have been written durably to disk. Evicting them would
otherwise lead to "disappeared" data on a cache miss.

We are making an implicit assumption that readers that want to read
beyond the storage stable offset will always read through the batch
cache. Otherwise, the read-after-write semantics won't hold, as the data
buffered in the appender chunk cache isn't exposed.
---

Batch cache entries are `batch_cache::range` data structures. A range is
a 32KiB memory arena with some ancillary data structures. It is
associated with a `batch_cache_index`. A `batch_cache_index` has a
1-to-1 mapping with a segment. When a batch is put in the cache, the
`batch_cache_index` will append to an existing range or allocate a new
one and insert it into the `batch_cache`.
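
To make the layering concrete, here is a minimal sketch of that put path. All type and member names below (`range_sketch`, `cache_sketch`, `index_sketch`, `put`, `fits`) are illustrative stand-ins, not the actual `batch_cache` API:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <map>
#include <memory>
#include <vector>

using offset_t = int64_t;

// Stand-in for batch_cache::range: a fixed-size arena plus bookkeeping.
struct range_sketch {
    static constexpr size_t arena_size = 32 * 1024; // 32KiB arena
    std::vector<char> arena = std::vector<char>(arena_size);
    size_t used = 0;

    bool fits(size_t bytes) const { return used + bytes <= arena_size; }
    void append(const char* data, size_t bytes) {
        std::copy(data, data + bytes, arena.begin() + used);
        used += bytes;
    }
};

// Stand-in for batch_cache: owns the ranges of all indexes (for eviction).
struct cache_sketch {
    std::vector<std::shared_ptr<range_sketch>> ranges;
};

// Stand-in for batch_cache_index: one per segment, maps batch offsets to
// the range holding them.
struct index_sketch {
    cache_sketch* cache;
    std::map<offset_t, std::shared_ptr<range_sketch>> index;

    // Put a serialized batch: append to the most recent range if it fits,
    // otherwise allocate a new range and hand it to the cache.
    void put(offset_t base_offset, const char* data, size_t bytes) {
        std::shared_ptr<range_sketch> r;
        if (!index.empty() && index.rbegin()->second->fits(bytes)) {
            r = index.rbegin()->second;
        } else {
            r = std::make_shared<range_sketch>();
            cache->ranges.push_back(r);
        }
        r->append(data, bytes);
        index.emplace(base_offset, std::move(r));
    }
};
```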

We propose to track dirty batches at `batch_cache::range` granularity
rather than per batch.

Using the insight that dirty batches are inserted with monotonically
increasing offsets, and the fact that we always mark a prefix of offsets
as clean (clearing the dirty flag), it is enough to track only the
maximum dirty batch offset in a range and advance it when new dirty
batches are inserted. When marking offsets as clean (on flush), we check
whether `clean_offset >= max_dirty_offset` and, if so, mark the range as
clean; otherwise, nothing needs to be done.
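
A hedged sketch of the per-range bookkeeping this implies; the names here (`record_dirty`, `mark_clean`, `evictable`, `_max_dirty_offset`) are illustrative rather than the exact `batch_cache::range` interface:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using offset_t = int64_t;

// Only the dirty-tracking slice of a range is shown here.
class range_dirty_state {
public:
    // Called when a dirty batch ending at `last_offset` is appended.
    // Dirty batches arrive with monotonically increasing offsets, so the
    // maximum offset alone describes all dirty data in the range.
    void record_dirty(offset_t last_offset) {
        assert(!_max_dirty_offset || *_max_dirty_offset < last_offset);
        _max_dirty_offset = last_offset;
    }

    // Called on flush with the highest offset known to be durable.
    // Cleaning always covers a prefix of offsets, so once the clean
    // offset reaches the maximum dirty offset the whole range is clean.
    void mark_clean(offset_t clean_offset) {
        if (_max_dirty_offset && clean_offset >= *_max_dirty_offset) {
            _max_dirty_offset.reset();
        }
    }

    // A range holding dirty data must not be evicted.
    bool evictable() const { return !_max_dirty_offset.has_value(); }

private:
    std::optional<offset_t> _max_dirty_offset;
};
```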

A `batch_cache_index` can be associated with multiple ranges which
potentially need to be marked as clean. As an optimization, we propose
to track which ranges might contain dirty data and, when a request to
mark a prefix of offsets as clean comes in, to iterate only through this
subset of ranges. We use this information together with the already
existing index which maps offsets to ranges
(`absl::btree_map<model::offset, batch_cache::entry>`) to iterate only
through the entries and ranges which contain the dirty offsets we are
interested in. The complexity of this iteration is O(N), where N is the
number of dirty batches that must be marked as clean.

To this end, we are introducing a new data structure called
`batch_cache_dirty_tracker` that keeps track of the minimum and maximum
dirty offsets in an index, which is exactly what this optimization
needs.
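
A minimal sketch of what such a tracker could look like; the method names (`mark_dirty`, `mark_clean`, `clean`, `min`) follow the prose above, but the real `batch_cache_dirty_tracker` interface may differ:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

using offset_t = int64_t;

// Tracks the window [min, max] of offsets in a batch_cache_index that may
// still be dirty: it lets mark_clean return early when nothing is dirty
// and start iterating at the first possibly-dirty range.
class dirty_tracker_sketch {
public:
    // A dirty batch range [first, last] was put into the index.
    void mark_dirty(offset_t first, offset_t last) {
        // Dirty offsets are inserted monotonically.
        assert(clean() || (_max <= first && _max < last));
        if (clean()) {
            _min = first;
        }
        _max = last;
    }

    // The prefix [.., up_to] has become durable; shrink or reset the window.
    void mark_clean(offset_t up_to) {
        if (clean() || up_to < _min) {
            return; // nothing dirty in that prefix
        }
        if (up_to >= _max) {
            _min = std::numeric_limits<offset_t>::max(); // fully clean again
            _max = std::numeric_limits<offset_t>::min();
        } else {
            _min = up_to + 1;
        }
    }

    bool clean() const { return _min > _max; }
    offset_t min() const { return _min; }

private:
    offset_t _min = std::numeric_limits<offset_t>::max();
    offset_t _max = std::numeric_limits<offset_t>::min();
};
```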

Technically, we could achieve a similar performance optimization by
doing a binary search over the index using the desired mark-clean
`up_to` offset as a needle, iterating backwards through ranges, marking
them as clean, and stopping once we hit a clean range. However, the new
data structure has some additional benefits.

It facilitates some invariant checks:
  - Dirty offsets are monotonic: `vassert(_max <= range.first && _max <
    range.second, ...)`
  - Truncation is requested only with a clean index:
    `vassert(_dirty_tracker.clean(), ...)`. See the next section for why
    this is important.
  - At the end of batch cache / batch cache index lifetime there are no
    dirty batches left in the index. This already caught 2 bugs in the
    existing code:
      1) redpanda-data#17032
      2) redpanda-data#17039

Also, it improves the debugging experience as we include the state of
the dirty tracker in the `batch_cache_index` ostream insertion operator.

---

Truncation needs special attention too. When truncation is requested we
must evict batches with offsets higher than the truncation point but
still retain dirty batches with smaller offsets. Since we can't have
holes in the ranges and can evict only at range granularity, we require
a flush prior to truncation. That flush is already present, so no
changes are required. This proves that `<` is the right comparison
operator in the following if statement of the
`batch_cache_index::mark_clean` routine:

```cpp
if (_dirty_tracker.clean() || up_to < _dirty_tracker.min()) {
    // No dirty data in the cache.
    return;
}
```
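
To make the flush-before-truncate ordering concrete, here is a self-contained sketch with stand-in types (`range_stub`, `index_stub`); the real code expresses the same requirement with `vassert(_dirty_tracker.clean(), ...)` inside `batch_cache_index`:

```cpp
#include <cassert>
#include <cstdint>
#include <map>

using offset_t = int64_t;

// Stand-ins for batch_cache::range and batch_cache_index.
struct range_stub {
    offset_t last_offset;
    bool dirty = false;
};

struct index_stub {
    std::map<offset_t, range_stub> ranges; // keyed by the range's first offset

    bool clean() const {
        for (const auto& [base, r] : ranges) {
            if (r.dirty) {
                return false;
            }
        }
        return true;
    }

    // Truncate the cached data at `truncate_at`. The caller must have
    // flushed (and marked clean) everything first: eviction works at range
    // granularity, so dropping a dirty range would lose unpersisted data.
    void truncate(offset_t truncate_at) {
        assert(clean()); // mirrors vassert(_dirty_tracker.clean(), ...)
        for (auto it = ranges.begin(); it != ranges.end();) {
            // Evict every range that holds offsets past the truncation
            // point; the retained ranges contain only clean data.
            if (it->second.last_offset > truncate_at) {
                it = ranges.erase(it);
            } else {
                ++it;
            }
        }
    }
};
```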
@nvartolomei nvartolomei force-pushed the nv/write-caching-rfc-final-draft branch from f6bcdd7 to 31c451a on March 15, 2024 21:44
@nvartolomei (Contributor, Author)

@dotnwat (Member) commented Mar 19, 2024:

> Cover letter is in the first commit of the PR https://github.com/redpanda-data/redpanda/pull/16954/commits

Is this still true?

@dotnwat (Member) left a comment:
looks great. left a couple questions, but not blockers. really glad to see so many assertions in there!

Comment on lines +557 to +560
```cpp
// that case, the batch entry should be considered clean.
_tracker.stable_offset < b.last_offset()
  ? batch_cache::is_dirty_entry::yes
  : batch_cache::is_dirty_entry::no);
```
Member:

looks ok. curious why not to do this in the batch cache itself by tracking the cleaned-up-to offset? i guess maybe are doing that accounting outside the cache?

Comment on lines +208 to +234
```cpp
// Maximum dirty batch offset that is stored in this range. If a range
// contains any dirty batch offsets we prevent its eviction so that
// readers get cache hits until the data is persisted to disk and we can
// rehydrate on a miss.
//
// Dirty batches are inserted with monotonically increasing offsets and
// marked as clean with an upper bound.
//
// You can imagine the put operation moving one cursor forward and
// mark_clean operation moving another one from behind. Once they
// overlap, the range is considered clean.
//
// Let's take the following example range with clean and dirty batches:
//
// +------------+------------+------------+----------+
// |  c=10..15  |  d=16..18  |  d=19..25  |  c=4..6  |
// +------------+------------+------------+----------+
//                                        ^
// _max_dirty_offset=25 ------------------+
//
// Note 1: Dirty batches always are monotonically increasing but clean
// batches might not be as they might have been added by concurrent
// read operations.
//
// Note 2: We don't actually track whether a particular batch is clean
// or dirty. The `c` and `d` on the diagram are for illustration
// purposes only.
```
Member:

🔥

Comment on lines +468 to +469
```cpp
/// 2. Truncation is requested only with a clean index (we can evict only at
///    range level).
```
Member:

i take this truncation isn't the same as truncating a segment where i guess it would be fine to throw away dirty data?

```cpp
  first != _index.end(),
  "Iterator must exist if dirty tracker isn't clean.");

auto last = std::next(find_first(up_to_inclusive));
```
Member:

what guarantees that find_first here doesn't return end()?

@dotnwat dotnwat merged commit 8a2deb5 into redpanda-data:dev Mar 19, 2024
13 of 16 checks passed