
cloud_storage: Add watchdog utility and use it to detect stuck eviction loop #16466

Merged
merged 9 commits on Feb 5, 2024

Conversation

@Lazin (Contributor) commented Feb 3, 2024

Sometimes we want to detect situations where things get stuck. Almost every
Redpanda service has a stop method which is invoked during
shutdown and which could prevent graceful shutdown if it hangs.
Another example is the various eviction/housekeeping loops in the code.
When these loops are stuck, Redpanda runs out of resources. We want
to be able to detect this.

The 'watchdog' utility is a simple safeguard. The object takes a timeout
and a callback on construction. When the timeout expires, the callback is
invoked. The watchdog can be created on the stack before an asynchronous
operation that can potentially get stuck.

Usage:

ss::future<> service::stop() {
  watchdog wd(10s, [] {
    vlog(logger.error, "service::stop hang");
  });
  _as.request_abort();
  co_await _gate.close();  // <-- operation that can potentially get stuck
}

This utility is used in the remote_partition::run_eviction_loop method. This is a background fiber that disposes of segments and readers asynchronously (it eliminates the need for a new reader to wait for the previous reader to stop). When this loop can't make progress, we can run out of resources and all readers will get stuck. To detect this, a watchdog is created; its callback logs an error message.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

Improvements

  • Improves observability by allowing Redpanda to detect that some internal processes are stuck.

The logic of the eviction loop did not change. The watchdog callback
only logs a message at error level so it's easier to detect this
situation. An alert could be created based on this log message.
@Lazin (Contributor, Author) commented Feb 4, 2024

CI failure - #16471

@abhijat (Contributor) previously approved these changes Feb 5, 2024 and left a comment:

LGTM, this might be pretty useful in general for debugging shutdown hangs.

Previously, the remote_partition used the async_manifest_view instance to
acquire the ntp. This is not safe in some cases because the
async_manifest_view could be disposed. Normally this doesn't happen, but
it could happen in tests, and it might happen in Redpanda when
something is stuck during the shutdown process.
@Lazin (Contributor, Author) commented Feb 5, 2024

Force-push: added an ntp field to the remote_partition. Some of our tests start and stop remote_partition instances outside the main application, which makes it impossible to acquire the ntp in the stop method. This doesn't happen normally in Redpanda, but it is still safer to have a field in the object itself rather than depend on some external lifetime hierarchy.


@Lazin merged commit 44fb8fe into redpanda-data:dev on Feb 5, 2024 (17 checks passed)
@vbotbuildovich (Collaborator):

/backport v23.3.x

@vbotbuildovich (Collaborator):

/backport v23.2.x

@dotnwat (Member) left a comment:

It's a nice little utility for debugging, but I don't think we should be relying on it too much. How do you see retiring these? It feels like once they are added it's going to be hard to remove them. Are we adding sufficient logging and debugging elsewhere to debug the hang once it occurs?

@Lazin (Contributor, Author) commented Feb 7, 2024

> It's a nice little utility for debugging, but I don't think we should be relying on it too much. How do you see retiring these? It feels like once they are added it's going to be hard to remove them. Are we adding sufficient logging and debugging elsewhere to debug the hang once it occurs?

Why should it be difficult to remove? The code that uses the watchdog doesn't depend on it. The watchdog can easily be removed at any moment because it's just logging.

@Lazin (Contributor, Author) commented Feb 7, 2024

We have a few places with very strong liveness guarantees, the eviction loop for instance. If it gets stuck, then the Redpanda process will eventually have problems (TS readers will get stuck) and will need to be restarted. And the cause and effect in this case are distant: in one case around two days passed between the problem and the outcome. We should of course fix the root cause, but the watchdog is a nice safety net for such cases.

@dotnwat (Member) commented Apr 10, 2024

> Why should it be difficult to remove?

I don't mean that it is mechanically/technically difficult to remove, I mean that it won't get removed because of perception that it is watching out for something "bad" that might happen. Right now that's true, but it seems like we should solve the underlying problem?

@Lazin (Contributor, Author) commented Apr 10, 2024

> I mean that it won't get removed because of perception that it is watching out for something "bad" that might happen. Right now that's true, but it seems like we should solve the underlying problem?

I agree, but a major code reorg is required to avoid this pitfall. We often use a pattern where an object manages at least one background fiber. I believe that in many cases this is an anti-pattern: it relies on implicit ordering of things and/or synchronization and is generally prone to concurrency bugs. In this case it led to a liveness issue which can't be solved in general. We can't really guarantee that the eviction loop will never get stuck, because future bugs down the stack could potentially make it stuck. So I agree that it's difficult to remove this watchdog, because removing it requires us to rethink the way this subsystem works.

@dotnwat (Member) commented Apr 11, 2024

> I agree but a major code reorg is required to avoid this pitfall

@Lazin got it. I didn't realize we would need to do major work to avoid the core issue. Thanks!

4 participants