
Reduce risk of thundering herd issues in retry logic #2351

Closed
aptituz opened this issue Nov 9, 2023 · 0 comments · Fixed by #2353

aptituz commented Nov 9, 2023

Is your feature request related to a problem? Please describe.

In an application using the cloud-config server we recently observed a situation where the config server seemingly became overloaded by a large number of config client requests. We are still investigating the issue, but during the investigation we noticed that there is room for improvement in the retry logic of the config client.

By default the config client uses an ExponentialBackOffPolicy with an initial interval of 1000 ms, a multiplier of 1.1 and a maximum of 6 retries. This policy leads to retries after roughly 1000 ms, 1100 ms, 1210 ms, 1331 ms, 1464 ms and 1610 ms. This retry logic is fine for many scenarios, but can cause a thundering herd problem in others.
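
For illustration, a minimal sketch of roughly what this default corresponds to in plain spring-retry terms (the 2000 ms max-interval cap and the helper class are assumptions on my side, not the actual wiring inside the config client):

```java
import org.springframework.retry.backoff.ExponentialBackOffPolicy;
import org.springframework.retry.policy.SimpleRetryPolicy;
import org.springframework.retry.support.RetryTemplate;

class DefaultRetrySketch {

    // Mirrors the described defaults: 6 attempts, exponential backoff with an
    // initial interval of 1000 ms and multiplier 1.1, and no jitter, so every
    // client sleeps ~1000, 1100, 1210, 1331, 1464 and 1610 ms between attempts.
    static RetryTemplate defaultLikeRetryTemplate() {
        ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
        backOff.setInitialInterval(1000);
        backOff.setMultiplier(1.1);
        backOff.setMaxInterval(2000); // assumed cap, see note above

        RetryTemplate template = new RetryTemplate();
        template.setRetryPolicy(new SimpleRetryPolicy(6));
        template.setBackOffPolicy(backOff);
        return template;
    }
}
```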

In our case (which is probably quite common), both client and server were running in Kubernetes, so while the client aborted after 6 attempts and stopped with an exception, the pod would be restarted and try again (with kubelet-controlled backoff between restarts). We believe this led to a situation where our system was unable to recover on its own: the retry behaviour of the client added insult to injury for the already overloaded config server, since all clients would essentially retry at the same time.

Describe the solution you'd like

The standard solution to a thundering herd problem is to add some random jitter to the backoff intervals.

As spring-cloud-config uses spring-retry, a solution could be to use the ExponentialRandomBackOffPolicy (https://github.com/spring-projects/spring-retry/blob/main/src/main/java/org/springframework/retry/backoff/ExponentialRandomBackOffPolicy.java) instead of the ExponentialBackOffPolicy by default, or to make it configurable as an alternative to the current default.
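
A minimal sketch of the proposed swap, reusing the parameters from the sketch above (again, this is illustrative rather than the config client's actual wiring):

```java
import org.springframework.retry.backoff.ExponentialRandomBackOffPolicy;

class JitteredBackOffSketch {

    // Drop-in replacement for the policy in the earlier sketch: same parameters,
    // but ExponentialRandomBackOffPolicy randomizes each sleep within the
    // exponential window, so clients that fail at the same moment spread their
    // retries out instead of hitting the config server in lockstep.
    static ExponentialRandomBackOffPolicy jitteredBackOff() {
        ExponentialRandomBackOffPolicy backOff = new ExponentialRandomBackOffPolicy();
        backOff.setInitialInterval(1000);
        backOff.setMultiplier(1.1);
        backOff.setMaxInterval(2000); // assumed cap, as above
        return backOff;
    }
}
```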

Describe alternatives you've considered

Your documentation suggests that it is possible to provide a custom RetryOperationsInterceptor, which might be an alternative worth considering for individual users like us. I haven't looked into it in detail yet (a rough sketch follows below), because it seems that having a random component in the delay in Spring Cloud directly would make for a safer default.
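
For completeness, a sketch of what such a workaround might look like on the client side, assuming the configServerRetryInterceptor bean id mentioned in the documentation applies here (class name and wiring are illustrative):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.retry.backoff.ExponentialRandomBackOffPolicy;
import org.springframework.retry.interceptor.RetryInterceptorBuilder;
import org.springframework.retry.interceptor.RetryOperationsInterceptor;

@Configuration
class ConfigClientRetryConfiguration {

    // A custom interceptor bean that replaces the client's built-in retry
    // settings and adds jitter via ExponentialRandomBackOffPolicy.
    @Bean
    RetryOperationsInterceptor configServerRetryInterceptor() {
        ExponentialRandomBackOffPolicy backOff = new ExponentialRandomBackOffPolicy();
        backOff.setInitialInterval(1000);
        backOff.setMultiplier(1.1);
        backOff.setMaxInterval(2000);

        return RetryInterceptorBuilder.stateless()
                .maxAttempts(6)
                .backOffPolicy(backOff)
                .build();
    }
}
```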
