[v23.3.x] cloud_storage: Add watchdog utility and use it to detect stuck eviction loop #16476
Merged: piyushredpanda merged 9 commits into redpanda-data:v23.3.x from vbotbuildovich:backport-pr-16466-v23.3.x-165 on Feb 5, 2024
Sometimes we want to detect situations in which things get stuck. Every Redpanda service usually has a stop method that is invoked during shutdown and that can prevent graceful shutdown if it hangs. Another example is the various eviction/housekeeping loops in the code. When these loops are stuck, Redpanda runs out of resources. We want to be able to detect this. The 'watchdog' utility is a simple safeguard. When the object is created, it takes a timeout and a callback. When the timeout expires, the callback is invoked. The watchdog can be created on the stack before an asynchronous operation that can potentially get stuck. (cherry picked from commit f44d820)
(cherry picked from commit 0f221f8)
The logic of the eviction loop did not change. The watchdog callback only logs a message at the error level, which makes this situation easier to detect. An alert could be created based on this log message. (cherry picked from commit c1be886)
(cherry picked from commit 63125a7)
(cherry picked from commit 27ea035)
(cherry picked from commit f5b9ac2)
(cherry picked from commit 7305843)
remote_segment_record_batch_reader::stop method. (cherry picked from commit 443177e)
Previously, the remote_partition used the async_manifest_view instance to acquire the ntp. This is not safe in some cases because the async_manifest_view could already be disposed. Normally this doesn't happen, but it can happen in tests, and it might happen in Redpanda when something is stuck during the shutdown process. (cherry picked from commit 5b0017f)
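For readers who want a concrete picture of the watchdog described in the first commit message above, here is a minimal sketch using plain standard-library threads. It is not the implementation from this PR (Redpanda is built on Seastar, and the real utility presumably uses Seastar primitives); the class name `watchdog`, its constructor signature, and the helper `run_with_watchdog` are all hypothetical names chosen for illustration. The sketch only shows the idea: the object takes a timeout and a callback, the callback runs if the timeout expires, and destroying the object on scope exit disarms it.

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for the watchdog idea; NOT Redpanda's actual class.
class watchdog {
public:
    watchdog(std::chrono::milliseconds timeout, std::function<void()> on_timeout)
      : _worker([this, timeout, cb = std::move(on_timeout)] {
            std::unique_lock<std::mutex> lock(_mutex);
            // wait_for returns false when the timeout elapsed and the
            // watchdog was never disarmed, i.e. the guarded operation is stuck.
            if (!_cv.wait_for(lock, timeout, [this] { return _disarmed; })) {
                lock.unlock();
                cb();
            }
        }) {}

    // Leaving the scope disarms the watchdog and waits for the worker to exit.
    ~watchdog() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _disarmed = true;
        }
        _cv.notify_one();
        _worker.join();
    }

    watchdog(const watchdog&) = delete;
    watchdog& operator=(const watchdog&) = delete;

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _disarmed = false;
    // Declared last so the members above exist before the thread starts.
    std::thread _worker;
};

// Hypothetical helper: runs `op` under a watchdog created on the stack and
// reports whether the watchdog fired before `op` returned.
inline bool run_with_watchdog(
  std::chrono::milliseconds timeout, const std::function<void()>& op) {
    std::atomic<bool> fired{false};
    {
        // A real callback would log at error level; here it only sets a flag.
        watchdog wd(timeout, [&fired] { fired = true; });
        op();
    }
    return fired.load();
}
```

As in the commit message for the eviction loop, the callback in this sketch does not change the control flow of the guarded operation; it only records that the timeout was exceeded, which is the hook an error-level log line (and an alert based on it) would attach to.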
Lazin force-pushed the backport-pr-16466-v23.3.x-165 branch from 67fee04 to 8fe10ed February 5, 2024 16:27
Lazin approved these changes Feb 5, 2024
Failure is #16013
Backport of PR #16466