[v23.1.x] cloud_storage: tolerate exceptions in the hydration loop #10917

Merged
merged 3 commits into redpanda-data:v23.1.x on May 24, 2023

VladLazar (Contributor) commented May 22, 2023

Backport of PR #10828 and #9786.

The fix in #10828 unveiled a deadlock in input_fanout_stream which was fixed in #9786.

Vlad Lazar added 2 commits May 22, 2023 12:30
This commit extends `remote_segment` to track the status of the
hydration loop. Previously, if a hydration was requested via
`remote_segment::hydrate()` whilst the loop was not running, the request
would stall until the client timed out. This also prevented Redpanda
from cleaning up the Kafka connection properly.

With this patch, the client gets an error instead and the connection is
released.
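
The fail-fast behaviour described above can be illustrated with a minimal sketch. It is not Redpanda's code: `segment_sketch`, `hydrate_result`, `start_loop()` and `stop_loop()` are hypothetical names, and the real `remote_segment` is asynchronous and built on Seastar. The sketch only shows the idea of tracking whether the loop is running and rejecting a request immediately when it is not.

```cpp
#include <cassert>

// Hypothetical result type; the real hydrate() API is not shown in this PR.
enum class hydrate_result { queued, loop_not_running };

// Hypothetical stand-in for remote_segment, tracking the loop status.
class segment_sketch {
public:
    void start_loop() { _loop_running = true; }
    void stop_loop() { _loop_running = false; }

    // Fail fast instead of queueing a request that nothing will serve.
    hydrate_result hydrate() {
        if (!_loop_running) {
            return hydrate_result::loop_not_running;
        }
        ++_pending;
        return hydrate_result::queued;
    }

    int pending() const { return _pending; }

private:
    bool _loop_running = false;
    int _pending = 0;
};
```

Because the caller receives an error right away, it can release its resources (here, the Kafka connection) instead of waiting for a client timeout.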
Previously, any unexpected exceptions thrown in the hydration loop
caused it to stop, leaving the `remote_segment` in an undetermined
state. We have seen this in the field when permissions for the cloud
storage cache prevent writes.

With this patch, these exceptions are tolerated and hydration can be
retried.
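
As a rough illustration of tolerating exceptions in a loop, the sketch below catches any exception thrown by a single attempt and carries on with the next one. The names `run_hydration_loop` and `loop_stats` are assumptions made for this example, and the synchronous `std::function` callbacks stand in for what is asynchronous work in the real hydration loop.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical counters, used only to observe the loop's behaviour.
struct loop_stats {
    int succeeded = 0;
    int failed = 0;
};

// Runs every attempt in order. An exception fails that attempt only;
// the loop itself keeps running so later attempts can still succeed.
inline loop_stats
run_hydration_loop(const std::vector<std::function<void()>>& attempts) {
    loop_stats stats;
    for (const auto& attempt : attempts) {
        try {
            attempt();
            ++stats.succeeded;
        } catch (...) {
            // Tolerate unexpected errors (e.g. a cache write failure).
            ++stats.failed;
        }
    }
    return stats;
}
```

In this model a failed attempt is simply counted; a retry corresponds to submitting the same work again as a later attempt.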
When there are multiple consumers in a fanout with a deep queue and some
consumers have progressed ahead, if a slow consumer exits early by
detaching, this causes many buffers in the front of the queue to have
their bitmasks set. When a fast consumer then does the next read, it
finds a buffer to read from, sets its own bit in the mask, and then
finds the buffer eligible to pop. But the invariant that the popped
buffer should be at the front of the queue is now broken.

The changes here clean up the front of the queue before setting any
masks, so that any buffers which have their mask already set to all
bits are popped out before we proceed with the current read. This
maintains the invariant.

(cherry picked from commit bc13bca)
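
The scenario in this commit message can be modelled with a small sketch. This is a simplified, synchronous model and not the actual `input_fanout_stream` implementation: `fanout_sketch`, its members, and the limit of fewer than 32 consumers are assumptions for the example. Each buffer carries a bitmask of the consumers that are done with it, and `read()` pops fully consumed buffers from the front before it sets any bit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical model of a fanout queue; assumes fewer than 32 consumers.
class fanout_sketch {
public:
    explicit fanout_sketch(unsigned consumers)
      : _all((std::uint32_t{1} << consumers) - 1) {}

    // New buffers start out marked as done for already detached consumers.
    void push(int value) { _queue.push_back({value, _detached}); }

    // A detaching consumer is treated as done with every queued buffer.
    void detach(unsigned id) {
        _detached |= bit(id);
        for (auto& b : _queue) {
            b.mask |= bit(id);
        }
    }

    std::optional<int> read(unsigned id) {
        // Clean up first: pop buffers that every consumer is done with.
        while (!_queue.empty() && _queue.front().mask == _all) {
            _queue.pop_front();
        }
        for (auto& b : _queue) {
            if (b.mask & bit(id)) {
                continue; // this consumer has already read the buffer
            }
            b.mask |= bit(id);
            const int value = b.value;
            if (b.mask == _all) {
                // Invariant: a buffer eligible to pop is at the front.
                assert(&b == &_queue.front());
                _queue.pop_front();
            }
            return value;
        }
        return std::nullopt;
    }

    std::size_t size() const { return _queue.size(); }

private:
    struct buffer {
        int value;
        std::uint32_t mask;
    };

    static std::uint32_t bit(unsigned id) { return std::uint32_t{1} << id; }

    std::uint32_t _all;
    std::uint32_t _detached = 0;
    std::deque<buffer> _queue;
};
```

With two consumers and three queued buffers, if consumer 0 reads two buffers and consumer 1 then detaches, the first two buffers are fully masked but still queued. The next `read(0)` pops them during clean-up, so the third buffer is at the front when it becomes eligible to pop. Without the clean-up loop, the assertion in this model would fail.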
VladLazar requested review from andrwng, jcsp and abhijat and removed request for andrwng May 22, 2023 15:35
VladLazar merged commit 24ee08b into redpanda-data:v23.1.x May 24, 2023
23 checks passed