
cloud_storage: Add watchdog utility and use it to detect stuck eviction loop #16466

Merged
merged 9 commits on Feb 5, 2024

Conversation

@Lazin (Contributor) commented Feb 3, 2024

Sometimes we want to detect situations where things get stuck. Almost every
Redpanda service has a stop method which is invoked during
shutdown and which could prevent graceful shutdown if it hangs.
Another example is the various eviction/housekeeping loops in the code.
When these loops are stuck, Redpanda runs out of resources. We want
to be able to detect this.

The 'watchdog' utility is a simple safeguard. The object takes a timeout
and a callback on construction. When the timeout expires, the callback is
invoked. The watchdog can be created on the stack before an asynchronous
operation that can potentially get stuck.

Usage:

ss::future<> service::stop() {
  watchdog wd(10s, [] {
    vlog(logger.error, "service::stop hang");
  });
  _as.request_abort();
  co_await _gate.close();  // <-- operation that can potentially get stuck
}

This utility is used in the remote_partition::run_eviction_loop method. This is a background fiber that disposes of segments and readers asynchronously (it eliminates the need for a new reader to wait for the previous reader to stop). When this loop can't make progress, we can run out of resources and all readers will get stuck. To detect this, a watchdog is created; its callback logs an error message.

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

Improvements

  • Improves observability by allowing Redpanda to detect that some internal processes are stuck.

The logic of the eviction loop did not change. The watchdog callback
only logs a message at error level so it's easier to detect this
situation. An alert could be created based on this log message.
@Lazin (Contributor, Author) commented Feb 4, 2024

CI failure - #16471

@abhijat (Contributor) previously approved these changes Feb 5, 2024 and left a comment:

LGTM, this might be pretty useful in general for debugging shutdown hangs.

Previously, the remote_partition used the async_manifest_view instance to
acquire the ntp. This is not safe in some cases because the
async_manifest_view could be disposed. Normally this doesn't happen, but
it could happen in tests, and it might happen in Redpanda when
something is stuck during the shutdown process.
@Lazin (Contributor, Author) commented Feb 5, 2024

Force-push: added an ntp field to the remote_partition. Some of our tests start and stop remote_partition instances outside the main application, which makes it impossible to acquire the ntp in the stop method. This doesn't happen normally in Redpanda, but it is still safer to have a field in the object itself rather than depend on some external lifetime hierarchy.


@Lazin merged commit 44fb8fe into redpanda-data:dev on Feb 5, 2024 (17 checks passed)
@vbotbuildovich (Collaborator):

/backport v23.3.x

@vbotbuildovich (Collaborator):

/backport v23.2.x

@dotnwat (Member) left a comment:

It's a nice little utility for debugging, but I don't think we should be relying on it too much. How do you see retiring these? It feels like once they are added it's going to be hard to remove them. Are we adding sufficient logging and debugging elsewhere to debug the hang once it occurs?

@Lazin (Contributor, Author) commented Feb 7, 2024

> It's a nice little utility for debugging, but I don't think we should be relying on it too much. How do you see retiring these? It feels like once they are added it's going to be hard to remove them. Are we adding sufficient logging and debugging elsewhere to debug the hang once it occurs?

Why should it be difficult to remove? The code that uses the watchdog doesn't depend on it. The watchdog can easily be removed at any moment because it's just logging.

@Lazin (Contributor, Author) commented Feb 7, 2024

We have a few places with very strong liveness guarantees, the eviction loop for instance. If it gets stuck, then the Redpanda process will eventually have problems (TS readers will get stuck) and will need to be restarted. And the cause and effect in this case are distant: in one case around two days passed between the problem and the outcome. We should of course fix the root cause, but the watchdog is a nice safety net for such cases.

@dotnwat (Member) commented Apr 10, 2024

> Why should it be difficult to remove?

I don't mean that it is mechanically/technically difficult to remove, I mean that it won't get removed because of perception that it is watching out for something "bad" that might happen. Right now that's true, but it seems like we should solve the underlying problem?

@Lazin (Contributor, Author) commented Apr 10, 2024

> I mean that it won't get removed because of perception that it is watching out for something "bad" that might happen. Right now that's true, but it seems like we should solve the underlying problem?

I agree, but a major code reorg is required to avoid this pitfall. We often use a pattern where an object manages at least one background fiber. I believe that in many cases this is an anti-pattern: it relies on implicit ordering of things and/or synchronization and is generally prone to concurrency bugs. In this case it led to a liveness issue which can't be solved in general. We can't really guarantee that the eviction loop will never get stuck, because future bugs down the stack could potentially make it stuck. So I agree that it's difficult to remove this watchdog, because removing it requires us to rethink the way this subsystem works.

@dotnwat (Member) commented Apr 11, 2024

> I agree but a major code reorg is required to avoid this pitfall

@Lazin got it. I didn't realize we would need to do major work to avoid the core issue. Thanks!

4 participants