Is your feature request related to a problem? Please describe.
In an application using the cloud-config server we recently observed a situation where the config server seemingly became overloaded by a large number of config client requests. We are still investigating the issue, but while doing so we noticed potential room for improvement in the retry logic of the config client.
By default the config client uses an ExponentialBackOffPolicy with an initial interval of 1000ms, a multiplier of 1.1 and a maximum of 6 retries. This policy leads to retries after roughly 1000ms, 1100ms, 1210ms, 1331ms, 1464ms and 1610ms. This retry logic is fine for many scenarios, but can cause a thundering herd problem in others.
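For illustration, here is a rough sketch of what that default corresponds to in spring-retry terms. This is only a simplified approximation of the client's setup, not the actual wiring inside spring-cloud-config, and the fetchConfig() call is a hypothetical placeholder:

```java
import org.springframework.retry.backoff.ExponentialBackOffPolicy;
import org.springframework.retry.policy.SimpleRetryPolicy;
import org.springframework.retry.support.RetryTemplate;

public class DefaultRetryIllustration {

    public static void main(String[] args) {
        RetryTemplate retryTemplate = new RetryTemplate();

        // Mirrors the documented defaults: initial interval 1000ms, multiplier 1.1.
        ExponentialBackOffPolicy backOff = new ExponentialBackOffPolicy();
        backOff.setInitialInterval(1000);
        backOff.setMultiplier(1.1);
        retryTemplate.setBackOffPolicy(backOff);

        // Give up with an exception after at most 6 attempts.
        retryTemplate.setRetryPolicy(new SimpleRetryPolicy(6));

        // Placeholder standing in for the config client's request to the server.
        String config = retryTemplate.execute(context -> fetchConfig());
        System.out.println(config);
    }

    private static String fetchConfig() {
        // Hypothetical stand-in for the actual HTTP call to the config server.
        return "remote configuration";
    }
}
```

Because every client starts from the same 1000ms interval with no randomness, clients that fail at the same moment also retry at the same moments.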
In our case (which is probably a very common setup), both client and server were running in Kubernetes. So while a client would abort after 6 attempts and stop with an exception, its pod would be restarted and try again (with kubelet-controlled backoff between the restarts). We believe this led to a situation where our system was not able to recover on its own: the retry behaviour of the client added insult to injury for the already overloaded config server, since on every retry round all clients would essentially hit it again at the same time.
Describe the solution you'd like
The standard solution to a thundering herd problem is to add some random jitter to the backoff intervals. As spring-cloud-config uses spring-retry, a solution could be to use the ExponentialRandomBackOffPolicy (https://github.com/spring-projects/spring-retry/blob/main/src/main/java/org/springframework/retry/backoff/ExponentialRandomBackOffPolicy.java) instead of the ExponentialBackOffPolicy by default, or to make it configurable as an alternative to the current default.
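A rough sketch of what that could look like, using the same parameters as the current default (again only an approximation, not the client's actual wiring):

```java
import org.springframework.retry.backoff.ExponentialRandomBackOffPolicy;
import org.springframework.retry.policy.SimpleRetryPolicy;
import org.springframework.retry.support.RetryTemplate;

public class JitteredRetryIllustration {

    public static void main(String[] args) {
        RetryTemplate retryTemplate = new RetryTemplate();

        // Same parameters as the current default, but ExponentialRandomBackOffPolicy
        // multiplies each computed delay by a random factor, so a fleet of restarting
        // clients no longer retries in lockstep.
        ExponentialRandomBackOffPolicy backOff = new ExponentialRandomBackOffPolicy();
        backOff.setInitialInterval(1000);
        backOff.setMultiplier(1.1);
        retryTemplate.setBackOffPolicy(backOff);

        retryTemplate.setRetryPolicy(new SimpleRetryPolicy(6));
    }
}
```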
Describe alternatives you've considered
As the documentation suggests, it is possible to implement a RetryOperationsInterceptor, so this might be an alternative worth considering for individual users like us. I haven't looked into it in detail yet, because having some random component in the delay directly in spring cloud seems like it would make for a safer default.
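For completeness, a minimal sketch of that alternative, assuming the configServerRetryInterceptor bean id described in the documentation (the class name here is hypothetical):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.retry.backoff.ExponentialRandomBackOffPolicy;
import org.springframework.retry.interceptor.RetryInterceptorBuilder;
import org.springframework.retry.interceptor.RetryOperationsInterceptor;

@Configuration
public class JitteredConfigClientRetryConfiguration {

    // Overrides the config client's retry behaviour with a jittered backoff.
    // The bean id "configServerRetryInterceptor" is the one the documentation
    // mentions for customizing config client retries.
    @Bean(name = "configServerRetryInterceptor")
    public RetryOperationsInterceptor configServerRetryInterceptor() {
        ExponentialRandomBackOffPolicy backOff = new ExponentialRandomBackOffPolicy();
        backOff.setInitialInterval(1000);
        backOff.setMultiplier(1.1);

        return RetryInterceptorBuilder.stateless()
                .backOffPolicy(backOff)
                .maxAttempts(6)
                .build();
    }
}
```

If I understand the docs correctly, this bean would also need to be visible to the config client's early bootstrap/retry setup (e.g. registered as a bootstrap configuration), which is an additional assumption in this sketch, so a randomized default in spring cloud itself still seems preferable.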