storage: mutex compaction and prefix truncation #17019

Merged (1 commit) on Mar 22, 2024

Conversation

@nvartolomei (Contributor) commented Mar 12, 2024

A crash was observed in CI caused by storage layer metadata inconsistency and accompanied by data loss.

```
Assert failure: (src/v/raft/consensus.cc:2434) 'last_included_term.has_value()' Unable to get term for snapshot last included offset: 9499, log: {
    offsets: {start_offset:9457, committed_offset:11164, committed_offset_term:2, dirty_offset:11164, dirty_offset_term:2},
    is_closed: false, segments: [
        {size: 12, [
            {offset_tracker:{term:2, base_offset:9500, committed_offset:9669, dirty_offset:9669},
```

Notice that `start_offset` as reported by the storage layer is lower than the `base_offset` of the first segment; `start_offset` must be greater than or equal to it.

In the logs the following lines were observed:

```
Compacting 2 adjacent segments: [
   Segment 1: {offset_tracker:{term:1, base_offset:9190, committed_offset:9456, dirty_offset:9456} ...
   Segment 2: {offset_tracker:{term:1, base_offset:9457, committed_offset:9499, dirty_offset:9499} ...
```

Followed shortly after by:

```
log_eviction_stm.cc:164 - requested to write raft snapshot (prefix_truncate) at 9456

disk_log_impl.cc:917 - Final compacted segment {offset_tracker:{term:1, base_offset:9190, committed_offset:9499, dirty_offset:9499}
segment_utils.cc:483 - swapping compacted segment temp file /var/lib/redpanda/data/kafka/test-topic/0_28/9190-1-v1.log.compaction.staging with the segment /var/lib/redpanda/data/kafka/test-topic/0_28/9190-1-v1.log

disk_log_impl.cc:2403 - Removing "/var/lib/redpanda/data/kafka/test-topic/0_28/9190-1-v1.log" (remove_prefix_full_segments, {offset_tracker:{term:1, base_offset:9190, committed_offset:9456, dirty_offset:9456} ...
```

This points to a race condition between adjacent segment compaction and prefix truncation. Segment 2 was "folded" into Segment 1. When prefix truncation tried to remove the pre-compaction Segment 1 it ended up removing data for Segment 2 as well. The result is data loss and metadata inconsistency.

We fix this by introducing mutual exclusion between the prefix truncation and compaction routines, similar to what we already do for suffix truncation.
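
To make the intent concrete, here is a minimal standalone sketch of the idea. It is not the actual Redpanda code (which runs on Seastar coroutines and takes `ssx::semaphore_units` from `_segment_rewrite_lock`, as shown in the diff discussion below); the class and member names are purely illustrative:

```
#include <cstdint>
#include <mutex>
#include <vector>

// Simplified model of a log: one metadata entry per segment.
struct segment_meta {
    int64_t base_offset;
    int64_t committed_offset;
};

class disk_log_model {
public:
    // Adjacent segment compaction: fold segment i+1 into segment i.
    void compact_adjacent(size_t i) {
        std::lock_guard guard(_rewrite_lock); // serialize with truncation
        if (i + 1 >= _segments.size()) {
            return;
        }
        _segments[i].committed_offset = _segments[i + 1].committed_offset;
        _segments.erase(_segments.begin() + i + 1);
    }

    // Prefix truncation: drop whole segments that lie entirely below new_start.
    void prefix_truncate(int64_t new_start) {
        std::lock_guard guard(_rewrite_lock); // serialize with compaction
        while (!_segments.empty()
               && _segments.front().committed_offset < new_start) {
            _segments.erase(_segments.begin());
        }
    }

private:
    std::mutex _rewrite_lock;
    std::vector<segment_meta> _segments;
};
```

With both paths holding the same lock, prefix truncation can no longer run against the pre-compaction view of Segment 1 and remove a file that, after the swap, also holds Segment 2's data.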


I believe this could also be fixed by just taking a read/write lock on the segment, but that change is more intrusive as it requires additional refactoring.


Reproduced with these changes https://gist.github.com/nvartolomei/d25ee2b9b2c25e6a2726f96ad89ab723

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Fix a race condition between prefix truncation (delete records) and adjacent segment compaction that could lead to crashes and data loss.

@vbotbuildovich (Collaborator) commented Mar 12, 2024

@andrwng (Contributor) left a comment

> I believe this can also be fixed by just taking a read/write lock on the segment but the change is more intrusive as it requires additional refactoring.

+1, fwiw this solution still feels intuitive, given it's used for other flavors of segment removal

@nvartolomei (Contributor, Author) commented:

/cdt

Comment on lines +2482 to +2483:

```
ssx::semaphore_units seg_rewrite_units
= co_await _segment_rewrite_lock.get_units();
```

@dotnwat (Member) commented Mar 14, 2024

> I believe this can also be fixed by just taking a read/write lock on the segment but the change is more intrusive as it requires additional refactoring.

What's the gist of the refactor? If this is more intuitive and the change is small then it might make sense?

@nvartolomei (Contributor, Author) replied:

Routines for closing the segment acquire a write lock. The refactor I had in mind would change these methods so that the already-held lock can be passed in, avoiding a second, failing attempt to write-lock the same segment.
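
For illustration, a hedged sketch of what that refactor could look like; the `segment_model` names and signatures below are hypothetical, not the actual Redpanda segment API. The caller takes the write lock once and hands the held lock to the close routine, so the routine does not try (and fail) to lock the same segment a second time:

```
#include <cassert>
#include <shared_mutex>
#include <utility>

class segment_model {
public:
    using write_lock = std::unique_lock<std::shared_mutex>;

    write_lock lock_for_write() { return write_lock(_rw_lock); }

    // Current shape: the routine locks the segment itself, so a caller that
    // already holds the write lock cannot call it without locking twice.
    void close() {
        auto lock = lock_for_write();
        do_close();
    }

    // Refactored shape: the routine accepts the lock the caller already holds.
    void close(write_lock held) {
        assert(held.owns_lock());
        do_close();
    }

private:
    void do_close() { /* flush buffers, remove files, ... */ }
    std::shared_mutex _rw_lock;
};

// Caller: lock once, do its own work, then pass the held lock into close().
void remove_segment(segment_model& s) {
    auto lock = s.lock_for_write();
    // ... caller-side work that must also happen under the write lock ...
    s.close(std::move(lock));
}
```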

@nvartolomei (Contributor, Author) commented:

/cdt

@nvartolomei (Contributor, Author) commented:

All CDT failures are known and referenced in #17042

Please merge this. 🙇‍♂️

@dotnwat (Member) left a comment

awesome cover letter, thanks for the explanation.

should this not be backported? we've only seen it in CI, and these are concurrency changes. Happy either way, just wanted to check.

@dotnwat merged commit def0cd5 into redpanda-data:dev on Mar 22, 2024 (14 of 17 checks passed)
@vbotbuildovich (Collaborator) commented:

/backport v23.3.x

@vbotbuildovich (Collaborator) commented:

/backport v23.2.x
