storage: assert no inflight writes when closing segment #17032
Conversation
Register the inflight write before calling append on the segment. Append may dispatch a background write and a subsequent call to `maybe_advance_stable_offset` before the append continuation runs. This was happening in CI when running with a RAM disk (very fast storage). When the appender advances the stable offset, a callback for advancing the stable offset at the segment level is also called, `segment::advance_stable_offset`. This method uses the inflight writes to determine how far the segment stable offset can be advanced. Because the last inflight write wasn't registered yet, the stable offset could get stuck. In theory, because we use the segment stable offset as the reader upper bound, this can lead to a situation where acked data is not visible, and worse consequences.
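To make the race concrete, here is a minimal sketch of the corrected ordering in the append path. It is simplified for illustration: the function name `do_append`, the surrounding types, and the bookkeeping in the continuation are assumptions, not the actual Redpanda code.

```cpp
// Minimal sketch (not the real segment::append) of the corrected ordering.
// Assumes _inflight is an ordered map from end-of-write file offset to the
// batch's last model offset.
ss::future<> segment::do_append(const model::record_batch& b) {
    auto start_physical_offset = _appender->file_byte_offset();
    auto expected_end_physical = start_physical_offset + b.size_bytes();

    // Register the inflight write *before* dispatching the append. The
    // appender may kick off a background write whose completion calls
    // segment::advance_stable_offset() before the continuation below runs,
    // and that callback must be able to see this entry; otherwise the
    // stable offset can get stuck.
    _inflight.emplace(expected_end_physical, b.last_offset());

    // proxy serialization to segment_appender
    return _appender->append(b).then(
      [this, &b, start_physical_offset, expected_end_physical] {
          // Bookkeeping for the completed append (dirty offset, indexing,
          // etc.). Previously the inflight registration happened here,
          // which was too late.
          _tracker.dirty_offset = b.last_offset();
      });
}
```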
@@ -96,6 +96,8 @@ ss::future<> segment::close() {
    co_await do_flush();
    co_await do_close();

    vassert(_inflight.empty(), "Closing segment with inflight writes.");
Is this a problem if the appender continues to live on and complete the inflight writes? Here `_inflight` seems to only be tracking the inflight writes, not owning them.
We did acquire the write lock above, we did flush too. Why would we expect outstanding inflight writes?
Ahh I didn't notice that you had the fix below.
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46144#018e3a16-8890-4ec4-9d10-1d3032c6e5e1
    _inflight.emplace(expected_end_physical, b.last_offset());

    // proxy serialization to segment_appender
    auto write_fut = _appender->append(b).then(
-     [this, &b, start_physical_offset] {
+     [this, &b, start_physical_offset, expected_end_physical] {
> When the appender advances the stable offset, a callback for advancing the stable offset at the segment level is also called, `segment::advance_stable_offset`. This method uses the inflight writes to determine how far the segment stable offset can be advanced.

Ahh, I see now. Great find. I'm very surprised this never happened in Debug builds, where the scheduler randomizes the order of execution of runnable continuations, and that it took very fast storage to trigger.

> Because the last inflight write wasn't registered yet, the stable offset can get stuck.

This part is pretty cool: normally we could drop the upcall, but it would only really matter if no new appends arrived promptly.
> I'm very surprised this never happened in Debug builds

+1. Also, I'm wondering if this is somehow more likely with the new dirty tracking logic in the batch cache, though nothing really jumps out.
Generally it was masked by the code in `segment::do_flush`:
ss::future<> segment::do_flush() {
_generation_id++;
if (!_appender) {
return ss::make_ready_future<>();
}
auto o = _tracker.dirty_offset;
auto fsize = _appender->file_byte_offset();
return _appender->flush().then([this, o, fsize] {
// never move committed offset backward, there may be multiple
// outstanding flushes once the one executed later in terms of offset
// finishes we guarantee that all previous flushes finished.
_tracker.committed_offset = std::max(o, _tracker.committed_offset);
// ########## HERE ##########
_tracker.stable_offset = _tracker.committed_offset;
// ########## HERE ##########
_reader->set_file_size(std::max(fsize, _reader->file_size()));
clear_cached_disk_usage();
});
}
However (very important!) this code also contains a bug: with concurrent flushes, data can go missing. See #17039.
@nvartolomei needs a cover letter too.
Great catch! We really ought to keep a diagram of sorts around to keep track of the control flow of the append path and all its background work...
Is this not marked for backport because the chance of a reader being unable to read acked data is very low (maybe even only theoretical)?
> because we use the segment stable offset as the reader upper bound, this can lead to a situation where acked data is not visible and worse consequences.

Just making sure: even though we're adding to `_inflight` before appending, we aren't at risk of having readers read that data, right (especially important since the append may not have been dispatched yet)? It's just that now, `advance_stable_offset` will at least be able to consider the in-flight write even if we haven't called append() yet?
Correct, `advance_stable_offset` operates on file offsets which were actually written, and on inflight entries which are at or below that offset.
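For illustration, a rough sketch of that logic, assuming `_inflight` is an ordered map from end-of-write file offset to the batch's last model offset (names and exact bookkeeping are assumptions, not the exact Redpanda implementation):

```cpp
// Sketch only: advance the stable offset based on how many bytes the
// appender has durably written. Only inflight entries whose end file offset
// is at or below the written offset can advance the stable offset.
void segment::advance_stable_offset(size_t written_file_offset) {
    // first entry that ends *beyond* what has been written so far
    auto it = _inflight.upper_bound(written_file_offset);
    if (it == _inflight.begin()) {
        return; // nothing fully written yet; stable offset unchanged
    }
    // the last fully written entry gives the new stable offset
    _tracker.stable_offset = std::prev(it)->second;
    _inflight.erase(_inflight.begin(), it);
}
```

This is also why registering the entry before the append is dispatched matters: the callback only ever sees what is already in `_inflight`.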
/cdt
Dirty entries are ones that aren't eligible for eviction as they might not yet have been written durably to disk. Evicting them would otherwise lead to "disappeared" data on a cache miss. We are making an implicit assumption that readers that want to read beyond the storage stable offset will always read through the batch cache. Otherwise, the read-after-write semantics won't hold, as the data buffered in the appender chunk cache isn't exposed.

---

Batch cache entries are `batch_cache::range` data structures. A range is a 32KiB memory arena with some ancillary data structures. It is associated with a `batch_cache_index`. A `batch_cache_index` has a 1-to-1 mapping with a segment. When a batch is put in the cache, the `batch_cache_index` will append to an existing range or allocate a new one and insert it into the `batch_cache`.

We propose to track dirty batches at `batch_cache::range` granularity rather than per batch. Using the insight that dirty batches are inserted with monotonically increasing offsets, and the fact that we always mark as clean (clear the dirty flag) a prefix of offsets, it is enough to track only the maximum dirty batch offset in a range and advance it when new dirty batches are inserted. For marking offsets as clean (on flush), we check if `clean_offset >= max_dirty_offset` and mark the range as clean. In the alternative case, nothing needs to be done.

A `batch_cache_index` can be associated with multiple ranges which potentially need to be marked as clean. As an optimization, we propose to track which ranges might contain dirty data and, when the request for marking a prefix of offsets comes in, to iterate only through this subset of ranges. We will use this information together with the already existing index which maps offsets to ranges (`absl::btree_map<model::offset, batch_cache::entry>`) to iterate only through the entries and ranges which contain the dirty offsets we are interested in. The complexity of this iteration is O(N), where N is the number of dirty batches that must be marked as clean.

To this end, we are introducing a new data structure called `batch_cache_dirty_tracker` to keep track of the minimum and maximum dirty offsets in an index, which we need for the optimization. Technically, we could achieve a similar performance optimization by doing a binary search over the index using the desired mark-clean `up_to` offset as a needle, iterating backwards through ranges, marking them as clean, and stopping once we hit a clean range. However, the new data structure has some additional benefits.

This data structure facilitates some invariant checks:
- Dirty offsets are monotonic: `vassert(_max <= range.first && _max < range.second, ...)`
- Truncation is requested only with a clean index: `vassert(_dirty_tracker.clean(), ...)`. See the next section for why this is important.
- At the end of the batch cache / batch cache index lifetime there are no dirty batches left in the index. This already caught 2 bugs in the existing code: 1) redpanda-data#17032 2) redpanda-data#17039

Also, it improves the debugging experience, as we include the state of the dirty tracker in the `batch_cache_index` ostream insertion operator.

---

Truncation needs special attention too. When truncation is requested, we must evict batches with offsets higher than the truncation point but still retain dirty batches with smaller offsets. Since we can't have holes in the ranges and can evict only at range granularity, we require a flush prior to truncation. It is already present, so no changes are required.
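A minimal sketch of what such a tracker could look like, based only on the description above; the class shape, member names, and the sentinel choice are assumptions for illustration, not the actual implementation:

```cpp
// Hypothetical sketch of a per-index dirty-offset tracker: it records only
// the min/max dirty offsets and asserts that dirty batches arrive with
// monotonically increasing offsets.
class batch_cache_dirty_tracker {
public:
    // called when a dirty batch covering offsets [range.first, range.second]
    // is inserted into the index
    void mark_dirty(std::pair<model::offset, model::offset> range) {
        vassert(
          _max <= range.first && _max < range.second,
          "dirty offsets must be monotonic");
        if (clean()) {
            _min = range.first;
        }
        _max = range.second;
    }

    // called on flush: the prefix of offsets up to `up_to` is now durable
    void mark_clean(model::offset up_to) {
        if (up_to >= _max) {
            // everything tracked so far is durable; reset to the clean state
            _min = model::offset{};
            _max = model::offset{};
        }
        // otherwise a suffix is still dirty and nothing needs to be done
    }

    bool clean() const {
        return _min == model::offset{} && _max == model::offset{};
    }

private:
    // default-constructed offsets are used here as a "no dirty data" sentinel
    model::offset _min;
    model::offset _max;
};
```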
Known cdt failures: unrelated.
Register the inflight write before calling append on the segment. Append may dispatch a background write and a subsequent call to `maybe_advance_stable_offset` before the append continuation runs. This was happening in CI when running with a RAM disk (very fast storage).

When the appender advances the stable offset, a callback for advancing the stable offset at the segment level is also called, `segment::advance_stable_offset`. This method uses the inflight writes to determine how far the segment stable offset can be advanced. Because the last inflight write wasn't registered yet, the stable offset could get stuck.

In theory, because we use the segment stable offset as the reader upper bound, this can lead to a situation where acked data is not visible, and worse consequences.
Backports Required
Release Notes