
storage: clean up staging files on deletion #7912

Merged: 3 commits into redpanda-data:dev, Jan 10, 2023

Conversation

andrwng (Contributor) commented Dec 22, 2022

If the generation_id is updated while we write the compaction output, we
end up returning early without keeping track of the staging files. This
can leave files behind even after the partition is removed, since we
currently don't allow removing the NTP directory while any unexpected
files exist.

This PR addresses this in two ways:

  • by removing all files suffixed with ".staging" when a partition is deleted
  • by immediately removing staging files if exiting out of compaction early

The latter approach, as implemented in this PR, doesn't cover every instance of an aborted compaction, just the cases seen in the wild and commonly hit in the storage unit tests. Tackling this more holistically will be a broader change that takes more time and is harder to backport.
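
As a rough illustration of the first bullet above, the cleanup boils down to filtering the NTP directory for the ".staging" suffix before the directory itself is removed. The sketch below is not the PR's actual code: it uses synchronous std::filesystem rather than Seastar's asynchronous file API, assumes C++20, and the function name is hypothetical.

#include <filesystem>
#include <string>
#include <string_view>

// Illustrative only: delete any leftover compaction staging files in an NTP
// directory so that removing the directory afterwards doesn't fail on
// "unexpected" files.
inline void remove_staging_files(const std::filesystem::path& ntp_dir) {
    constexpr std::string_view staging_suffix = ".staging";
    for (const auto& entry : std::filesystem::directory_iterator(ntp_dir)) {
        if (!entry.is_regular_file()) {
            continue;
        }
        const std::string name = entry.path().filename().string();
        if (std::string_view(name).ends_with(staging_suffix)) {
            std::filesystem::remove(entry.path());
        }
    }
}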

Backports Required

  • none - not a bug fix
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v22.3.x
  • v22.2.x
  • v22.1.x

UX Changes

Release Notes

Bug Fixes

  • Files left over from aborted compactions will now be cleaned up more robustly.

andrwng (Contributor, Author) commented Dec 22, 2022

Context here is that there's a cluster that has several orphaned files, and in their logs I see:

2022-12-09T21:30:56+09:00 {} 2022-12-09T12:30:56.667153985Z stderr F INFO 2022-12-09 12:30:56,666 [shard 0] storage-gc - disk_log_impl.cc:529 - Aborting compaction of a segment: {offset_tracker:{term:1, base_offset:117199576, committed_offset:133010063, dirty_offset:133010063}, compacted_segment=1, finished_self_compaction=1, generation={68371}, reader={/var/lib/redpanda/data/<ntp>/1_173872/117199576-1-v1.log, (1932550142 bytes)}, writer=nullptr, cache=nullptr, compaction_index:nullopt, closed=1, tombstone=1, index={file:/var/lib/redpanda/data/<ntp>/1_173872/117199576-1-v1.base_index, offsets:{0}, index:{header_bitflags:0, base_offset:{0}, max_offset:{133010063}, base_timestamp:{timestamp: 1670492269909}, max_timestamp:{timestamp: 1670501374930}, index(52261,52261,52261)}, step:32768, needs_persistence:0}}. Generation id mismatch, previous generation: 68370

In my runs of the test, I couldn't reliably reproduce the adjacent segment compaction merging abort that I expected to, but I'm fairly certain this is the code path being hit. Open to further test suggestions.

I also considered making a more holistic change that passed an out-parameter for callers to populate with files to clean up, but opted to go with a less invasive approach to start.

@@ -2620,6 +2620,12 @@ FIXTURE_TEST(write_truncate_compact, storage_test_fixture) {
info("produce_done");
truncate.get();
info("truncate_done");

// Ensure we've cleaned up all our staging segments such that a removal of
A reviewer (Contributor) commented:

Is it possible to do a more localized unit test of the compaction code that twiddles the generation to force the abort path and validates deletion? Perhaps not, but it would be nice to have a test that deterministically exercises it.

andrwng (Contributor, Author) replied:

Agreed, it'd be nice to have a better way to reproduce this bug, though I changed approaches so a more targeted test makes a bit less sense for this PR.

jcsp previously approved these changes Dec 22, 2022
piyushredpanda added this to the v22.3.10 milestone Dec 22, 2022
piyushredpanda (Contributor) commented:

Would be awesome to get this in for v22.3.10, scheduled 6th Jan, @andrwng

andrwng (Contributor, Author) commented Dec 23, 2022

> Would be awesome to get this in for v22.3.10, scheduled 6th Jan, @andrwng

Will keep that in mind. It needs some updates though; after injecting the failure I'm still seeing some leftover files.

andrwng changed the title from "storage: clean up after aborted compaction" to "storage: clean up staging files on deletion" on Jan 5, 2023
andrwng (Contributor, Author) commented Jan 5, 2023

In manually twiddling the generation ID condition to always trigger the aborted adjacent segment compaction path, I found more edge cases in cleanup that made this change a bit trickier. To boot, I found myself chasing down a staging file that I ultimately couldn't find the source of (and thus couldn't find a place to clean it up). Our implementation of using staging files seems a little brittle, so if going down the route of cleaning up after abort, perhaps we should tackle this even more holistically (e.g. alongside any crash-consistency efforts).

For now, I've changed approaches to just clean up staging files on removal. It's not the best approach, but it is an improvement over what we have today.

EDIT: as I was typing this up, I felt a nagging sense that we should still do some cleanup where we can, so I've also brought back the cleanup from the initial draft.

andrwng marked this pull request as ready for review January 5, 2023 23:30
andrwng (Contributor, Author) commented Jan 6, 2023

ducktape test failure: #8072

andrwng (Contributor, Author) commented Jan 6, 2023

CI failure is #8084

@@ -470,6 +470,10 @@ ss::future<std::optional<size_t>> do_self_compact_segment(
"generation: {}, skipping compaction",
s->get_generation_id(),
segment_generation);
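// The aborted pass may have left a partially written staging output behind;
// remove it now so it doesn't block a later removal of the NTP directory.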
const ss::sstring staging_file = s->reader().path().to_staging();
if (co_await ss::file_exists(staging_file)) {
co_await ss::remove_file(staging_file);
A reviewer (Contributor) commented:

Let's log an error here so that we notice if it is happening in tests.

andrwng (Contributor, Author) replied:

Logged at the removal site (hitting this path just indicates a concurrent compaction, which isn't problematic).

If the generation_id is updated while we write the compaction output, we
end up returning early without keeping track of the staging files. This
could result in files being left over, even after removal of the
partition since we currently don't allow removing the NTP directory
while any unexpected files exist.

This commit addresses this by removing all files suffixed with
".staging" when a partition is deleted.

I considered an alternate fix wherein we kept track of all staging files
while compacting, but opted to scrap the approach, as it became a fairly
invasive change with several edge cases (e.g. staging files when
compacting a staged segment), and this fix will likely need to be
backported, so a simpler approach is preferable.

If the generation_id is updated while we write the compaction output, we
end up returning early without keeping track of the staging files. This
could result in files being left over, even after removal of the
partition.

This commit addresses this by immediately removing files that may go
unused upon exiting early out of a compaction due to a generation ID
mismatch.
Comment on lines +590 to +591
if (co_await ss::file_exists(ss::sstring(f))) {
co_await ss::remove_file(ss::sstring(f));
A reviewer (Member) commented:

if the race condition here is a concern, you could call remove_file and ignore an exception containing an ENOENT error.

andrwng (Contributor, Author) replied:

Yeah this seems like a good idea. It's unclear to what extent these operations race with one another, but I can imagine there being some race with truncation that results in weird behavior. Will revisit this, since it looks like there are still some leftover files.
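
For reference, the reviewer's suggestion might look roughly like the sketch below. This is not the PR's code: it assumes Seastar reports the failed unlink as a std::system_error carrying the errno (adjust the catch if a different exception type surfaces), and the helper name is made up.

#include <seastar/core/coroutine.hh>
#include <seastar/core/seastar.hh>
#include <seastar/core/sstring.hh>

#include <cerrno>
#include <system_error>

namespace ss = seastar;

// Sketch: remove a staging file, treating "already gone" as success so that a
// concurrent removal (or an earlier cleanup pass) doesn't turn into a failure.
ss::future<> remove_staging_file_if_present(ss::sstring path) {
    try {
        co_await ss::remove_file(path);
    } catch (const std::system_error& e) {
        if (e.code().value() != ENOENT) {
            throw;
        }
        // Someone else already removed the file; nothing left to do.
    }
}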

piyushredpanda modified the milestones: v22.3.10, v22.3.x-next on Jan 9, 2023
jcsp merged commit efe7d8a into redpanda-data:dev on Jan 10, 2023
jcsp (Contributor) commented Jan 10, 2023

/backport v22.3.x

daisukebe (Contributor) commented:

/backport v22.2.x
