[v23.3.x] cloud_storage: Add watchdog utility and use it to detect stuck eviction loop #16476


vbotbuildovich
Collaborator

Backport of PR #16466

@vbotbuildovich vbotbuildovich added this to the v23.3.x-next milestone Feb 5, 2024
@vbotbuildovich vbotbuildovich added the kind/backport PRs targeting a stable branch label Feb 5, 2024
Sometimes we want to detect situations in which things get stuck. Every
Redpanda service usually has a stop method which is invoked during
shutdown and which can prevent graceful shutdown when it hangs.
Another example is the various eviction/housekeeping loops in the code.
When these loops are stuck, Redpanda runs out of resources. We want
to be able to detect this.

The 'watchdog' utility is a simple safeguard. When the object is created
it takes a timeout and a callback. When the timeout expires the callback is
invoked. The watchdog can be created on the stack before an asynchronous
operation that can potentially get stuck.

(cherry picked from commit f44d820)
(cherry picked from commit 0f221f8)
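
The watchdog described above can be illustrated with a minimal sketch. This is not the code from this PR: Redpanda is built on Seastar, while the sketch below swaps in a plain `std::thread` with a condition variable so that it stays self-contained. The class layout, the constructor signature, and the `fires_within` helper are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for the watchdog; NOT the Redpanda implementation.
// It takes a timeout and a callback. The callback runs only if the object
// is still alive when the timeout expires.
class watchdog {
public:
    watchdog(std::chrono::milliseconds timeout, std::function<void()> on_timeout) {
        assert(on_timeout != nullptr); // an empty callback would be a bug
        _worker = std::thread([this, timeout, cb = std::move(on_timeout)] {
            std::unique_lock<std::mutex> lock(_mutex);
            // wait_for returns the predicate value: true means the watchdog
            // was disarmed in time, false means the timeout expired first.
            const bool disarmed = _cv.wait_for(
              lock, timeout, [this] { return _disarmed; });
            lock.unlock();
            if (!disarmed) {
                cb();
            }
        });
    }

    watchdog(const watchdog&) = delete;
    watchdog& operator=(const watchdog&) = delete;

    // Leaving the scope disarms the watchdog and waits for the worker.
    ~watchdog() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _disarmed = true;
        }
        _cv.notify_one();
        _worker.join();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _disarmed = false;
    std::thread _worker;
};

// Hypothetical helper: guards a simulated operation that takes `work`
// and reports whether the watchdog callback fired.
bool fires_within(std::chrono::milliseconds timeout, std::chrono::milliseconds work) {
    std::atomic<bool> fired{false};
    {
        watchdog wd(timeout, [&fired] { fired = true; });
        std::this_thread::sleep_for(work); // stands in for the guarded operation
    } // wd is destroyed here; it is disarmed if it has not fired yet
    return fired.load();
}
```

Creating the object on the stack ties the guarded region to a scope: if the operation finishes in time, the destructor disarms the watchdog and the callback never runs. In the eviction loop the callback would presumably only log, so a fired watchdog reports the problem without interrupting the operation.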
The logic of the eviction loop did not change. The watchdog callback
only logs a message at error level, which makes this situation easier
to detect. An alert can be created based on this log message.

(cherry picked from commit c1be886)
remote_segment_record_batch_reader::stop method.

(cherry picked from commit 443177e)
Previously, the remote_partition used the async_manifest_view instance to
acquire the ntp. This is not safe in some cases because the
async_manifest_view could already be disposed. Normally this doesn't
happen, but it can happen in tests, and it might happen in Redpanda when
something is stuck during the shutdown process.

(cherry picked from commit 5b0017f)
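
The commit message does not say how the ntp is obtained after this change. One common way to avoid this kind of lifetime dependency is to keep an owned copy of the value instead of reading it through the other object on every call. The sketch below shows that pattern only; `manifest_view`, `partition`, and the sample ntp string are hypothetical stand-ins, not the Redpanda classes.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Hypothetical stand-in for an object that may be disposed early.
struct manifest_view {
    std::string ntp;
};

// Hypothetical stand-in for the dependent object. It copies the ntp at
// construction time, so get_ntp() never touches the view afterwards.
class partition {
public:
    explicit partition(const manifest_view& view)
      : _ntp(view.ntp) {}

    const std::string& get_ntp() const { return _ntp; }

private:
    std::string _ntp; // owned copy, independent of the view's lifetime
};

// Reads the ntp after the view is gone, as could happen during shutdown.
std::string ntp_after_view_disposed() {
    auto view = std::make_unique<manifest_view>(manifest_view{"kafka/topic/0"});
    partition p(*view);
    view.reset(); // the view is disposed first
    assert(view == nullptr);
    return p.get_ntp(); // still valid: served from the owned copy
}
```

Had `partition` stored a reference or raw pointer to the view and read `ntp` through it, the last call would access a destroyed object.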
@Lazin Lazin force-pushed the backport-pr-16466-v23.3.x-165 branch from 67fee04 to 8fe10ed Compare February 5, 2024 16:27
@piyushredpanda
Contributor

Failure is #16013

@piyushredpanda piyushredpanda merged commit 6fa86dc into redpanda-data:v23.3.x Feb 5, 2024
15 of 17 checks passed
@piyushredpanda piyushredpanda modified the milestones: v23.3.x-next, v23.3.5 Feb 7, 2024
Labels
area/redpanda kind/backport PRs targeting a stable branch
3 participants