cloud_storage: Add watchdog utility and use it to detect stuck eviction loop #16466
Conversation
Sometimes we want to detect situations where things get stuck. Every Redpanda service usually has a stop method which is invoked during shutdown and which can prevent graceful shutdown when it hangs. Another example is the various eviction/housekeeping loops in the code. When these loops get stuck, Redpanda runs out of resources. We want to be able to detect this. The 'watchdog' utility is a simple safeguard. The object takes a timeout and a callback on construction. When the timeout expires, the callback is invoked. The watchdog can be created on the stack before an asynchronous operation that can potentially get stuck.
The logic of the eviction loop did not change. The watchdog callback only logs a message at error level so that it is easier to detect this situation. An alert could be created based on this log message.
Force-pushed from f02ff3d to 27ea035
remote_segment_record_batch_reader::stop method.
CI failure - #16471
LGTM, this might be pretty useful in general for debugging shutdown hangs.
Previously, the remote_partition used the async_manifest_view instance to acquire the ntp. This is not safe in some cases because the async_manifest_view could already be disposed. Normally this doesn't happen, but it can happen in tests, and it might happen in Redpanda when something gets stuck during the shutdown process.
Force-push: added
ducktape was retried in https://buildkite.com/redpanda/redpanda/builds/44693#018d7974-3cee-4c8e-a3b2-6284fff1a0af
/backport v23.3.x
/backport v23.2.x |
It's a nice little utility for debugging, but I don't think we should rely on it too much. How do you see us retiring these? It feels like once they are added it's going to be hard to remove them. Are we adding sufficient logging and debugging elsewhere to debug the hang once it occurs?
Why should it be difficult to remove? The code that uses
We have a few places with very strong liveness guarantees, the eviction loop for instance. If it gets stuck then the redpanda process will eventually have problems (TS readers will get stuck) and will need to be restarted. And the cause and effect in this case are distant: in one case around two days passed between the problem and the outcome. We should of course fix the root cause, but the
I don't mean that it is mechanically/technically difficult to remove, I mean that it won't get removed because of the perception that it is watching out for something "bad" that might happen. Right now that's true, but it seems like we should solve the underlying problem?
I agree, but a major code reorg is required to avoid this pitfall. We often use a pattern where an object manages at least one background fiber. I believe that in many cases this is an anti-pattern: it relies on implicit ordering of things and/or synchronization and is generally prone to concurrency bugs. And in this case it led to a liveness issue which can't be solved in general. We can't really guarantee that the eviction loop will never get stuck, because future bugs down the stack could potentially make it stuck. So I agree that it's difficult to remove this watchdog, because it requires us to rethink the way this subsystem works.
@Lazin got it. I didn't realize we would need to do major work to avoid the core issue. Thanks!
Usage:
This utility is used in the `remote_partition::run_eviction_loop` method. This is a background fiber that disposes of segments and readers asynchronously (it eliminates the need for the active reader to wait for the previous reader to stop). When this loop can't make progress we can run out of resources and all readers will get stuck. To detect this, the `watchdog` is created. The callback of the watchdog logs an error message.

Backports Required
Release Notes
Improvements