Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: fix windowed compaction error handling #16928

Merged
merged 4 commits into from
Mar 7, 2024

Conversation

andrwng
Copy link
Contributor

@andrwng andrwng commented Mar 7, 2024

We were previously using the l-value-variant of
model::record_batch_reader::consume(), which doesn't have a built-in
call to finally(). For a log_reader, this meant that in some scenarios
(e.g. if the reducer were to fail) we could end up not calling
finally(), and therefore wouldn't clear the underlying segment reader.

This is exactly what happened, as we hit:

std::runtime_error (lz4f_compressend:ERROR_dstMaxSize_tooSmall)

errors as we compressed/decompressed batches, and then ultimately
crashed because the log_reader was not closed:

Assert failure: (.../redpanda/redpanda/src/v/storage/log_reader.h:141) '!_iterator.reader' log reader destroyed with live reader"

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Fixes a bug in windowed compaction that could cause Redpanda to crash when an error occurs while reading batches.

The staging file was missing a '.'. Note that these files are transient,
so there isn't a compactibility concern here.
An upcoming test will inject an exception to simulate a failure case
seen in the wild that is otherwise hard to reproduce.

I'd considered using the file_sanitizer-based error injection, but it
seems that infra only really shines at making Redpanda hit asserts, so
it's not quite what I'm looking for (my upcoming test will assert that
we _don't_ crash in a particular error scenario).
This will be useful to test methods that need a feature table (e.g.
deduplicate_segments() has a feature table as an argument).
We were previously using the l-value-variant of
model::record_batch_reader::consume(), which doesn't have a built-in
call to finally(). For a log_reader, this meant that in some scenarios
(e.g. if the reducer were to fail) we could end up not calling
finally(), and therefore wouldn't clear the underlying segment reader.

This is exactly what happened, as we hit:

std::runtime_error (lz4f_compressend:ERROR_dstMaxSize_tooSmall)

errors as we compressed/decompressed batches, and then ultimately
crashed because the log_reader was not closed:

Assert failure: (.../redpanda/redpanda/src/v/storage/log_reader.h:141) '!_iterator.reader' log reader destroyed with live reader"
@@ -227,7 +227,7 @@ ss::future<index_state> deduplicate_segment(
&cmp_idx_writer,
inject_reader_failure);

auto new_idx = co_await rdr.consume(
auto new_idx = co_await std::move(rdr).consume(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice. i just lost my foot looking at this.

I think this if fine, but i would leave a ticket to fix record_batch_reader::impl::consume, by chaining inside consume()& an then_wrapped() to clean up in case of exceptions

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we have append_entries_request::serde_async_write that does a consume like this, maybe it warrants a similar treatment

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or at least a comment around this in reader::consume about the exception safety

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I wasn't sure what the rationale was but I assumed it was intentional that one doesn't call the other. At least I see that some other call-sites of the l-value variant explicitly include a finally() call

@andrwng andrwng merged commit 3261c5e into redpanda-data:dev Mar 7, 2024
17 checks passed
@vbotbuildovich
Copy link
Collaborator

/backport v23.3.x

@vbotbuildovich
Copy link
Collaborator

Failed to create a backport PR to v23.3.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-16928-v23.3.x-857 remotes/upstream/v23.3.x
git cherry-pick -x 62fd4f83eee12ca0bd52d63de58e1b2ad4c80102 13f55371a6d23e64cf0729614c5eb715bcd3a49c 958aaf766b032c18bae659cceecb3078c55bd1b6 cc06447e3f3974427591bc2a7e16491aef653c17

Workflow run logs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants