storage: mutex compaction and prefix truncation #17019
Conversation
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46056#018e334b-d5ab-4516-b85f-363aa9f7123f

ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/46184#018e3d42-db78-4474-9563-e0dc22e6ab11
I believe this can also be fixed by just taking a read/write lock on the segment but the change is more intrusive as it requires additional refactoring.
+1, fwiw this solution still feels intuitive, given it's used for other flavors of segment removal
/cdt
```cpp
ssx::semaphore_units seg_rewrite_units
  = co_await _segment_rewrite_lock.get_units();
```
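To illustrate the idea behind this hunk, here is a simplified, self-contained sketch (made-up names, plain `std::mutex` instead of the semaphore whose units are awaited above): both the adjacent-segment compaction path and the prefix-truncation path take the same lock, so neither can run while the other is rewriting or removing segment files.

```cpp
// Simplified illustration of the mutual-exclusion idea; not Redpanda code.
// The real fix awaits units from _segment_rewrite_lock (see the hunk above).
#include <mutex>

struct disk_log {
    std::mutex segment_rewrite_lock; // stand-in for _segment_rewrite_lock

    void compact_adjacent_segments() {
        std::lock_guard guard(segment_rewrite_lock);
        // Merge segment N+1 into segment N and swap the staging file in place;
        // prefix truncation cannot remove these files concurrently.
    }

    void prefix_truncate() {
        std::lock_guard guard(segment_rewrite_lock);
        // Remove whole segments below the new start offset; compaction cannot
        // be rewriting them concurrently.
    }
};
```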
> I believe this can also be fixed by just taking a read/write lock on the segment but the change is more intrusive as it requires additional refactoring.
What's the gist of the refactor? If this is more intuitive and the change is small then it might make sense?
Routines for closing the segment acquire a write lock. The refactor I was thinking about would change these methods so that they can be passed the already-held lock, to avoid trying and failing to write-lock it twice.
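A rough sketch of what that alternative could look like (hypothetical names, not the existing Redpanda interfaces): the caller acquires the segment's write lock once and the close/remove helpers receive the already-held lock instead of acquiring it themselves.

```cpp
// Hypothetical sketch of the "pass the held lock down" refactor; all names
// are made up for illustration.
#include <shared_mutex>

struct segment {
    std::shared_mutex rw_lock;

    // The helper takes the already-held write lock instead of locking itself,
    // so a caller that already holds it doesn't fail on a second acquire.
    void close(std::unique_lock<std::shared_mutex>& held_write_lock) {
        // Assumes held_write_lock.owns_lock(); flush and close handles here.
        (void)held_write_lock;
    }
};

void remove_segment(segment& seg) {
    std::unique_lock write_lock(seg.rw_lock); // single acquisition
    seg.close(write_lock);                    // no second write-lock attempt
}
```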
A crash was observed in CI caused by storage layer metadata inconsistency and accompanied by data loss.

```
Assert failure: (src/v/raft/consensus.cc:2434) 'last_included_term.has_value()'
Unable to get term for snapshot last included offset: 9499, log: {
  offsets: {start_offset:9457, committed_offset:11164, committed_offset_term:2, dirty_offset:11164, dirty_offset_term:2},
  is_closed: false,
  segments: [
    {size: 12, [
      {offset_tracker:{term:2, base_offset:9500, committed_offset:9669, dirty_offset:9669},
```

Notice that `start_offset` as reported by the storage layer is lower than the `base_offset` of the first segment. `start_offset` must be greater-or-equal.

In the logs the following lines were observed:

```
Compacting 2 adjacent segments: [
  Segment 1: {offset_tracker:{term:1, base_offset:9190, committed_offset:9456, dirty_offset:9456} ...
  Segment 2: {offset_tracker:{term:1, base_offset:9457, committed_offset:9499, dirty_offset:9499} ...
```

Followed shortly after by:

```
log_eviction_stm.cc:164 - requested to write raft snapshot (prefix_truncate) at 9456
disk_log_impl.cc:917 - Final compacted segment {offset_tracker:{term:1, base_offset:9190, committed_offset:9499, dirty_offset:9499}
segment_utils.cc:483 - swapping compacted segment temp file /var/lib/redpanda/data/kafka/test-topic/0_28/9190-1-v1.log.compaction.staging with the segment /var/lib/redpanda/data/kafka/test-topic/0_28/9190-1-v1.log
disk_log_impl.cc:2403 - Removing "/var/lib/redpanda/data/kafka/test-topic/0_28/9190-1-v1.log" (remove_prefix_full_segments, {offset_tracker:{term:1, base_offset:9190, committed_offset:9456, dirty_offset:9456} ...
```

This points to a race condition between adjacent segment compaction and prefix truncation. Segment 2 was "folded" into Segment 1. When prefix truncation tried to remove the pre-compaction Segment 1, it ended up removing data for Segment 2 as well. The result is data loss and metadata inconsistency.

We fix this by introducing mutual exclusion of the truncation and compaction routines, similar to what we do for suffix truncation.

---

I believe this can also be fixed by just taking a read/write lock on the segment, but the change is more intrusive as it requires additional refactoring.
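For illustration, here is a minimal sketch of the invariant the assertion above tripped over; `segment_offsets` and `check_log_invariant` are hypothetical names, not Redpanda APIs.

```cpp
// Sketch of the violated invariant: the log's start offset must never fall
// below the base offset of its first segment.
#include <cassert>
#include <cstdint>
#include <vector>

struct segment_offsets {
    int64_t base_offset;
    int64_t committed_offset;
};

void check_log_invariant(int64_t start_offset,
                         const std::vector<segment_offsets>& segments) {
    if (segments.empty()) {
        return;
    }
    // In the crash above start_offset was 9457 while the first segment's
    // base_offset was 9500, so this check would have fired.
    assert(start_offset >= segments.front().base_offset);
}
```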
af74190 to d2ee527 (Compare)
/cdt
All CDT failures are known and referenced in #17042. Please merge this. 🙇‍♂️
awesome cover letter, thanks for the explanation.
should this not be backported? we've only seen it in CI and it's a change to concurrency. Happy either way, just wanted to check.
/backport v23.3.x

/backport v23.2.x
A crash was observed in CI caused by storage layer metadata inconsistency and accompanied by data loss.
Notice that `start_offset` as reported by the storage layer is lower than the `base_offset` of the first segment. `start_offset` must be greater-or-equal.

In the logs, the compaction of two adjacent segments (base offsets 9190 and 9457) was observed, followed shortly after by a prefix-truncate raft snapshot at offset 9456 and removal of the pre-compaction segment file (see the log excerpts quoted above).
This points to a race condition between adjacent segment compaction and prefix truncation. Segment 2 was "folded" into Segment 1. When prefix truncation tried to remove the pre-compaction Segment 1 it ended up removing data for Segment 2 as well. The result is data loss and metadata inconsistency.
We fix this by introducing mutual exclusion of the truncation and compaction routines, similar to what we do for suffix truncation.
I believe this can also be fixed by just taking a read/write lock on the segment but the change is more intrusive as it requires additional refactoring.
Reproduced with these changes https://gist.github.com/nvartolomei/d25ee2b9b2c25e6a2726f96ad89ab723
Backports Required
Release Notes
Bug Fixes