storage: support dirty entries in batch cache #16954
Conversation
Force-pushed from 1ecda4f to 49c1e29
Not a thorough review yet, but I like the direction
Force-pushed from 49c1e29 to 9767245
I think the direction here makes sense.
Force-pushed from 9767245 to 0409642

Force-pushed from d9cf0db to a169ce1

Force-pushed from a169ce1 to ce420b9
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46224#018e3f3a-9095-4243-9c79-309ffdfc5b96

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46311#018e4444-106f-400e-aa73-bdd1e2e0f872
/cdt
Nothing really blocking. Some clarifying questions, some test suggestions, but overall this looks good!
```cpp
log
  ->truncate(
    storage::truncate_config(truncate_offset, ss::default_priority_class()))
  .get();

// Reclaim everything from cache.
storage::testing_details::log_manager_accessor::batch_cache(mgr).clear();
```
nit: might also be interesting to continue appending dirty batches, to make sure we don't hit any of the vasserts ensuring monotonicity
That's a good call. I'm thinking about a good place to add this that will also cover the randomization you mentioned earlier. We have a read_write_truncate test which I'm thinking about expanding (or copying) to cover the batch cache too.
Likely a follow-up PR.
Force-pushed from ce420b9 to 32f82e1
Maintain the `stable_offset <= dirty_offset` invariant. `advance_stable_offset` may be called before the continuation attached to `segment_appender::append` in which we advance the dirty offset.
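A minimal sketch of one way to maintain this invariant (the `offset_tracker` struct, its member names, and the clamping approach below are illustrative assumptions, not the PR's actual code):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical tracker illustrating stable_offset <= dirty_offset.
struct offset_tracker {
    int64_t stable_offset{-1};
    int64_t dirty_offset{-1};

    // Runs in the continuation attached to segment_appender::append.
    void advance_dirty_offset(int64_t o) {
        dirty_offset = std::max(dirty_offset, o);
    }

    // May run before that continuation, so it also pulls the dirty
    // offset forward to preserve the invariant.
    void advance_stable_offset(int64_t o) {
        stable_offset = std::max(stable_offset, o);
        dirty_offset = std::max(dirty_offset, stable_offset);
    }
};
```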
Force-pushed from 32f82e1 to f6bcdd7
```diff
@@ -648,8 +648,13 @@ void segment::advance_stable_offset(size_t filepos) {
 }

 _reader->set_file_size(it->first);
```
nit: squash with the previous commit?
🤦 git absorb and rebase didn't work great here. Will do after I get the CI results.
I plan to extend the test suite to cover the write caching functionality too. In the case of write caching we will want to read past the committed offset. I couldn't infer a good motivation for bounding test readers to the committed offset. I believe that not having the upper bound by default can help uncover hidden bugs both in the implementation and in the testing code. If anything needs to bound reads, it must do so explicitly, just like regular application code does.
Dirty entries are ones that aren't eligible for eviction, as they might not yet have been written durably to disk. Evicting them would otherwise lead to "disappeared" data on a cache miss. We make an implicit assumption that readers that want to read beyond the storage stable offset will always read through the batch cache. Otherwise, the read-after-write semantics won't hold, as the data buffered in the appender chunk cache isn't exposed.

---

Batch cache entries are `batch_cache::range` data structures. A range is a 32KiB memory arena with some ancillary data structures. It is associated with a `batch_cache_index`, and a `batch_cache_index` has a 1-to-1 mapping with a segment. When a batch is put in the cache, the `batch_cache_index` appends to an existing range or allocates a new one and inserts it into the `batch_cache`.

We propose to track dirty batches at `batch_cache::range` granularity rather than per batch. Using the insight that dirty batches are inserted with monotonically increasing offsets, and the fact that we always mark a prefix of offsets as clean (clear the dirty flag), it is enough to track only the maximum dirty batch offset in a range and advance it when new dirty batches are inserted. For marking offsets as clean (on flush) we check whether `clean_offset >= max_dirty_offset` and, if so, mark the range as clean. Otherwise, nothing needs to be done.

A `batch_cache_index` can be associated with multiple ranges which potentially need to be marked as clean. As an optimization, we propose to track which ranges might contain dirty data and, when the request to mark a prefix of offsets clean comes in, to iterate only through this subset of ranges. We use this information together with the already existing index which maps offsets to ranges (`absl::btree_map<model::offset, batch_cache::entry>`) to iterate only through the entries and ranges that contain the dirty offsets we are interested in. The complexity of this iteration is O(N), where N is the number of dirty batches that must be marked as clean.

To this end, we are introducing a new data structure called `batch_cache_dirty_tracker` to keep track of the minimum and maximum dirty offsets in an index, which we need for the optimization. Technically, we could achieve a similar performance optimization by doing a binary search over the index using the desired mark-clean `up_to` offset as a needle, iterating backwards through ranges marking them as clean, and stopping once we hit a clean range. However, the new data structure has some additional benefits. It facilitates several invariant checks:

- Dirty offsets are monotonic: `vassert(_max <= range.first && _max < range.second, ...)`.
- Truncation is requested only with a clean index: `vassert(_dirty_tracker.clean(), ...)`. See the next section for why this is important.
- At the end of the batch cache / batch cache index lifetime there are no dirty batches left in the index. This already caught 2 bugs in the existing code: redpanda-data#17032 and redpanda-data#17039.

It also improves the debugging experience, as we include the state of the dirty tracker in the `batch_cache_index` ostream insertion operator.

---

Truncation needs special attention too. When truncation is requested, we must evict batches with offsets higher than the truncation point but still retain dirty batches with smaller offsets. Since we can't have holes in the ranges and can evict only at range granularity, we require a flush prior to truncation. It is already present, so no changes are required.
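A minimal sketch of a dirty tracker like the one described above, assuming it keeps a `[min, max]` window of dirty offsets; plain `int64_t` and `assert` stand in for `model::offset` and `vassert`, and the partial-clean behavior is a simplification, not the PR's verbatim code:

```cpp
#include <cassert>
#include <cstdint>

// Sketch: tracks the window of dirty offsets in a batch_cache_index.
// A (-1, -1) window means the index is clean.
class batch_cache_dirty_tracker {
public:
    bool clean() const { return _max == -1; }
    int64_t min() const { return _min; }
    int64_t max() const { return _max; }

    // Insertion of a dirty batch spanning [first, last]. Mirrors the
    // monotonicity invariant quoted in the cover letter above.
    void track(int64_t first, int64_t last) {
        assert(_max <= first && _max < last);
        if (clean()) {
            _min = first;
        }
        _max = last;
    }

    // A prefix of offsets up to `up_to` became durable. This sketch only
    // resets the window on a full clean; a partial clean leaves `_min`
    // as a conservative lower bound, which merely costs extra iteration
    // later and never causes a wrong early return.
    void mark_clean(int64_t up_to) {
        if (!clean() && up_to >= _max) {
            _min = _max = -1;
        }
    }

private:
    int64_t _min{-1};
    int64_t _max{-1};
};
```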
The reasoning above proves that `<` is the right comparison operator in the following if statement of the `batch_cache_index::mark_clean` routine: `up_to` is inclusive, so when `up_to` equals `_dirty_tracker.min()` the minimum dirty batch is being cleaned and we must not return early (a `<=` check would incorrectly skip it).

```cpp
if (_dirty_tracker.clean() || up_to < _dirty_tracker.min()) {
    // No dirty data in the cache.
    return;
}
```
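For concreteness, the surrounding flow might look roughly like this self-contained sketch. Here a `std::map` stands in for the `absl::btree_map`, `find_first` is modeled as "last entry with key <= offset", and all names are assumptions rather than the PR's verbatim code:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

// Toy range: only the dirty bookkeeping matters for this sketch.
struct range {
    int64_t max_dirty{-1};
    void mark_clean(int64_t up_to) {
        if (max_dirty != -1 && up_to >= max_dirty) {
            max_dirty = -1; // range is clean and evictable again
        }
    }
};

struct index_sketch {
    std::map<int64_t, range*> _index; // batch base offset -> owning range
    int64_t _dirty_min{-1}, _dirty_max{-1};

    // Entry that holds offset `o`: the last entry with key <= o.
    std::map<int64_t, range*>::iterator find_first(int64_t o) {
        auto it = _index.upper_bound(o);
        return it == _index.begin() ? _index.end() : std::prev(it);
    }

    void mark_clean(int64_t up_to) {
        if (_dirty_max == -1 || up_to < _dirty_min) {
            return; // no dirty data at or below up_to
        }
        auto first = find_first(_dirty_min);
        assert(first != _index.end()); // mirrors the vassert in the PR
        // Visit only the entries that can hold dirty offsets: O(N) in
        // the number of dirty batches being marked clean.
        auto last = std::next(find_first(std::min(up_to, _dirty_max)));
        for (auto it = first; it != last; ++it) {
            it->second->mark_clean(up_to);
        }
        if (up_to >= _dirty_max) {
            _dirty_min = _dirty_max = -1;
        }
    }
};
```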
Force-pushed from f6bcdd7 to 31c451a
Known failure:

Is this still true?
Looks great. Left a couple of questions, but not blockers. Really glad to see so many assertions in there!
```cpp
// that case, the batch entry should be considered clean.
_tracker.stable_offset < b.last_offset()
  ? batch_cache::is_dirty_entry::yes
  : batch_cache::is_dirty_entry::no);
```
Looks OK. Curious why not do this in the batch cache itself by tracking the cleaned-up-to offset? I guess maybe we are doing that accounting outside the cache?
```cpp
// Maximum dirty batch offset that is stored in this range. If a range
// contains any dirty batch offsets we prevent its eviction so that
// readers get cache hits until the data is persisted to disk and we can
// rehydrate on a miss.
//
// Dirty batches are inserted with monotonically increasing offsets and
// marked as clean with an upper bound.
//
// You can imagine the put operation moving one cursor forward and the
// mark_clean operation moving another one from behind. Once they
// overlap, the range is considered clean.
//
// Let's take the following example range with clean and dirty batches:
//
// +------------+------------+------------+----------+
// |  c=10..15  |  d=16..18  |  d=19..25  |  c=4..6  |
// +------------+------------+------------+----------+
//                                  ^
// _max_dirty_offset=25 ------------+
//
// Note 1: Dirty batches are always monotonically increasing, but clean
// batches might not be, as they might have been added by concurrent
// read operations.
//
// Note 2: We don't actually track whether a particular batch is clean
// or dirty. The `c` and `d` on the diagram are for illustration
// purposes only.
```
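As a hedged illustration, here is the diagram's scenario expressed as a tiny runnable walk-through (`toy_range` and its members are made-up names, not the actual `batch_cache::range` API):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Toy stand-in for the range's dirty bookkeeping.
struct toy_range {
    int64_t max_dirty{-1}; // mirrors _max_dirty_offset; -1 means clean
    bool dirty() const { return max_dirty != -1; }
    void put_dirty(int64_t last) { max_dirty = std::max(max_dirty, last); }
    void mark_clean(int64_t up_to) {
        if (up_to >= max_dirty) {
            max_dirty = -1;
        }
    }
};

int main() {
    toy_range r;
    // Clean batches (c=10..15, c=4..6) never move the dirty cursor.
    r.put_dirty(18);  // d=16..18
    r.put_dirty(25);  // d=19..25 -> max_dirty=25, as in the diagram
    r.mark_clean(20); // a prefix was flushed, but 21..25 is still dirty
    assert(r.dirty());
    r.mark_clean(25); // the cursors overlap: range is clean, evictable
    assert(!r.dirty());
}
```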
🔥
```cpp
/// 2. Truncation is requested only with a clean index (we can evict only at
///    range level).
```
I take it this truncation isn't the same as truncating a segment, where I guess it would be fine to throw away dirty data?
```cpp
first != _index.end(),
"Iterator must exist if dirty tracker isn't clean.");

auto last = std::next(find_first(up_to_inclusive));
```
What guarantees that `find_first` here doesn't return `end()`?
The cover letter is in the first commit of the PR: https://github.com/redpanda-data/redpanda/pull/16954/commits
Backports Required
Release Notes