storage: mutex compaction and prefix truncation #17019

Merged (1 commit) on Mar 22, 2024

Conversation

@nvartolomei (Contributor) commented Mar 12, 2024

A crash was observed in CI caused by storage layer metadata inconsistency and accompanied by data loss.

```
Assert failure: (src/v/raft/consensus.cc:2434) 'last_included_term.has_value()' Unable to get term for snapshot last included offset: 9499, log: {
    offsets: {start_offset:9457, committed_offset:11164, committed_offset_term:2, dirty_offset:11164, dirty_offset_term:2},
    is_closed: false, segments: [
        {size: 12, [
            {offset_tracker:{term:2, base_offset:9500, committed_offset:9669, dirty_offset:9669},
```

Notice that `start_offset` as reported by the storage layer is lower than the `base_offset` of the first segment; `start_offset` must be greater than or equal to it.

In the logs the following lines were observed:

```
Compacting 2 adjacent segments: [
   Segment 1: {offset_tracker:{term:1, base_offset:9190, committed_offset:9456, dirty_offset:9456} ...
   Segment 2: {offset_tracker:{term:1, base_offset:9457, committed_offset:9499, dirty_offset:9499} ...
```

Followed shortly after by:

```
log_eviction_stm.cc:164 - requested to write raft snapshot (prefix_truncate) at 9456

disk_log_impl.cc:917 - Final compacted segment {offset_tracker:{term:1, base_offset:9190, committed_offset:9499, dirty_offset:9499}
segment_utils.cc:483 - swapping compacted segment temp file /var/lib/redpanda/data/kafka/test-topic/0_28/9190-1-v1.log.compaction.staging with the segment /var/lib/redpanda/data/kafka/test-topic/0_28/9190-1-v1.log

disk_log_impl.cc:2403 - Removing "/var/lib/redpanda/data/kafka/test-topic/0_28/9190-1-v1.log" (remove_prefix_full_segments, {offset_tracker:{term:1, base_offset:9190, committed_offset:9456, dirty_offset:9456} ...
```

This points to a race condition between adjacent segment compaction and prefix truncation. Segment 2 was "folded" into Segment 1. When prefix truncation tried to remove the pre-compaction Segment 1 it ended up removing data for Segment 2 as well. The result is data loss and metadata inconsistency.

We fix this by introducing mutual exclusion between the prefix truncation and compaction routines, similar to what we already do for suffix truncation.
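
To make the intent concrete, here is a minimal standalone sketch of the idea. It is not the actual Redpanda code (which runs on Seastar coroutines and takes `ssx::semaphore_units` from `_segment_rewrite_lock`, as shown in the diff discussion below); the class and member names are purely illustrative:

```
#include <cstdint>
#include <mutex>
#include <vector>

// Simplified model of a log: one metadata entry per segment.
struct segment_meta {
    int64_t base_offset;
    int64_t committed_offset;
};

class disk_log_model {
public:
    // Adjacent segment compaction: fold segment i+1 into segment i.
    void compact_adjacent(size_t i) {
        std::lock_guard guard(_rewrite_lock); // serialize with truncation
        if (i + 1 >= _segments.size()) {
            return;
        }
        _segments[i].committed_offset = _segments[i + 1].committed_offset;
        _segments.erase(_segments.begin() + i + 1);
    }

    // Prefix truncation: drop whole segments that lie entirely below new_start.
    void prefix_truncate(int64_t new_start) {
        std::lock_guard guard(_rewrite_lock); // serialize with compaction
        while (!_segments.empty()
               && _segments.front().committed_offset < new_start) {
            _segments.erase(_segments.begin());
        }
    }

private:
    std::mutex _rewrite_lock;
    std::vector<segment_meta> _segments;
};
```

With both paths holding the same lock, prefix truncation can no longer run against the pre-compaction view of Segment 1 and remove a file that, after the swap, also holds Segment 2's data.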


I believe this could also be fixed by just taking a read/write lock on the segment, but that change is more intrusive as it requires additional refactoring.


Reproduced with these changes https://gist.github.com/nvartolomei/d25ee2b9b2c25e6a2726f96ad89ab723

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Fix a race condition between prefix truncation (delete records) and adjacent segment compaction that could lead to crashes and data loss.

@vbotbuildovich (Collaborator) commented Mar 12, 2024

@andrwng (Contributor) left a comment

> I believe this can also be fixed by just taking a read/write lock on the segment but the change is more intrusive as it requires additional refactoring.

+1, fwiw this solution still feels intuitive, given it's used for other flavors of segment removal

@nvartolomei (Contributor, Author) commented:

/cdt

Comment on lines +2482 to +2483:

```
ssx::semaphore_units seg_rewrite_units
= co_await _segment_rewrite_lock.get_units();
```

@dotnwat (Member) commented Mar 14, 2024

> I believe this can also be fixed by just taking a read/write lock on the segment but the change is more intrusive as it requires additional refactoring.

What's the gist of the refactor? If this is more intuitive and the change is small then it might make sense?

@nvartolomei (Contributor, Author) replied:

Routines for closing the segment acquire a write lock. The refactor I had in mind would change these methods so that the already-held lock can be passed in, avoiding a second, failing attempt to write-lock the same segment.
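
For illustration, a hedged sketch of what that refactor could look like; the `segment_model` names and signatures below are hypothetical, not the actual Redpanda segment API. The caller takes the write lock once and hands the held lock to the close routine, so the routine does not try (and fail) to lock the same segment a second time:

```
#include <cassert>
#include <shared_mutex>
#include <utility>

class segment_model {
public:
    using write_lock = std::unique_lock<std::shared_mutex>;

    write_lock lock_for_write() { return write_lock(_rw_lock); }

    // Current shape: the routine locks the segment itself, so a caller that
    // already holds the write lock cannot call it without locking twice.
    void close() {
        auto lock = lock_for_write();
        do_close();
    }

    // Refactored shape: the routine accepts the lock the caller already holds.
    void close(write_lock held) {
        assert(held.owns_lock());
        do_close();
    }

private:
    void do_close() { /* flush buffers, remove files, ... */ }
    std::shared_mutex _rw_lock;
};

// Caller: lock once, do its own work, then pass the held lock into close().
void remove_segment(segment_model& s) {
    auto lock = s.lock_for_write();
    // ... caller-side work that must also happen under the write lock ...
    s.close(std::move(lock));
}
```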

@nvartolomei (Contributor, Author) commented:

/cdt

@nvartolomei (Contributor, Author) commented:

All CDT failures are known and referenced in #17042

Please merge this. 🙇‍♂️

@dotnwat (Member) left a comment

awesome cover letter, thanks for the explanation.

should this not be backported? we've only seen it in CI, and these are concurrency changes. Happy either way, just wanted to check.

@dotnwat merged commit def0cd5 into redpanda-data:dev on Mar 22, 2024 (14 of 17 checks passed)
@vbotbuildovich (Collaborator) commented:

/backport v23.3.x

@vbotbuildovich (Collaborator) commented:

/backport v23.2.x
