
Reduce risk of thundering herd issues in retry logic #2351

Closed
aptituz opened this issue Nov 9, 2023 · 0 comments · Fixed by #2353

aptituz commented Nov 9, 2023

Is your feature request related to a problem? Please describe.

In an application using the cloud-config server we recently observed a situation where the config server seemingly became overloaded by a large number of config client requests. We are still investigating the issue, but during the investigation we noticed that there is room for improvement in the retry logic of the config client.

By default the config client uses an ExponentialBackOffPolicy with an initial interval of 1000 ms, a multiplier of 1.1 and a maximum of 6 retries. This policy leads to retries after roughly 1000 ms, 1100 ms, 1210 ms, 1331 ms, 1464 ms and 1610 ms. This retry logic is fine for many scenarios, but can cause a thundering herd problem in others.
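
For illustration, a minimal sketch of roughly what this default corresponds to in plain spring-retry terms (the 2000 ms max-interval cap and the helper class are assumptions on my side, not the actual wiring inside the config client):

```java
import org.springframework.retry.backoff.ExponentialBackOffPolicy;
import org.springframework.retry.policy.SimpleRetryPolicy;
import org.springframework.retry.support.RetryTemplate;

class DefaultRetrySketch {

    // Mirrors the described defaults: 6 attempts, exponential backoff with an
    // initial interval of 1000 ms and multiplier 1.1, and no jitter, so every
    // client sleeps ~1000, 1100, 1210, 1331, 1464 and 1610 ms between attempts.
    static RetryTemplate defaultLikeRetryTemplate() {
        ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
        backOff.setInitialInterval(1000);
        backOff.setMultiplier(1.1);
        backOff.setMaxInterval(2000); // assumed cap, see note above

        RetryTemplate template = new RetryTemplate();
        template.setRetryPolicy(new SimpleRetryPolicy(6));
        template.setBackOffPolicy(backOff);
        return template;
    }
}
```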

In our case (which is probably quite common), both client and server were running in Kubernetes, so while the client aborted after 6 attempts and stopped with an exception, the pod would be restarted and try again (with kubelet-controlled backoff between restarts). We believe this led to a situation where our system was unable to recover on its own: the retry behaviour of the client added insult to injury for the already overloaded config server, since all clients would essentially retry at the same time.

Describe the solution you'd like

The standard solution to a thundering herd problem is to add some random jitter to the backoff intervals.

As spring-cloud-config uses spring-retry, a solution could be to use the ExponentialRandomBackOffPolicy (https://github.com/spring-projects/spring-retry/blob/main/src/main/java/org/springframework/retry/backoff/ExponentialRandomBackOffPolicy.java) instead of the ExponentialBackOffPolicy by default, or to make it configurable as an alternative to the current default.
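
A minimal sketch of the proposed swap, reusing the parameters from the sketch above (again, this is illustrative rather than the config client's actual wiring):

```java
import org.springframework.retry.backoff.ExponentialRandomBackOffPolicy;

class JitteredBackOffSketch {

    // Drop-in replacement for the policy in the earlier sketch: same parameters,
    // but ExponentialRandomBackOffPolicy randomizes each sleep within the
    // exponential window, so clients that fail at the same moment spread their
    // retries out instead of hitting the config server in lockstep.
    static ExponentialRandomBackOffPolicy jitteredBackOff() {
        ExponentialRandomBackOffPolicy backOff = new ExponentialRandomBackOffPolicy();
        backOff.setInitialInterval(1000);
        backOff.setMultiplier(1.1);
        backOff.setMaxInterval(2000); // assumed cap, as above
        return backOff;
    }
}
```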

Describe alternatives you've considered

Your documentation suggests that it is possible to provide a custom RetryOperationsInterceptor, which might be an alternative worth considering for individual users like us. I haven't looked into it in detail yet (a rough sketch follows below), because it seems that having a random component in the delay in Spring Cloud directly would make for a safer default.
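
For completeness, a sketch of what such a workaround might look like on the client side, assuming the configServerRetryInterceptor bean id mentioned in the documentation applies here (class name and wiring are illustrative):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.retry.backoff.ExponentialRandomBackOffPolicy;
import org.springframework.retry.interceptor.RetryInterceptorBuilder;
import org.springframework.retry.interceptor.RetryOperationsInterceptor;

@Configuration
class ConfigClientRetryConfiguration {

    // A custom interceptor bean that replaces the client's built-in retry
    // settings and adds jitter via ExponentialRandomBackOffPolicy.
    @Bean
    RetryOperationsInterceptor configServerRetryInterceptor() {
        ExponentialRandomBackOffPolicy backOff = new ExponentialRandomBackOffPolicy();
        backOff.setInitialInterval(1000);
        backOff.setMultiplier(1.1);
        backOff.setMaxInterval(2000);

        return RetryInterceptorBuilder.stateless()
                .maxAttempts(6)
                .backOffPolicy(backOff)
                .build();
    }
}
```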
