Introducing kubernetes ratelimit queue for configuration reloading #3147

Closed
yue9944882 opened this issue Apr 7, 2018 · 5 comments

@yue9944882
Contributor

yue9944882 commented Apr 7, 2018

Do you want to request a feature or report a bug?

Feature Request

What did you expect to see?

After #2848, we have unlocked some newer utilities from the Kubernetes client-go library. I think it's time to introduce k8s.io/client-go/util/workqueue to Traefik.

This package provides a rate-limited queue, and we can use it to avoid unnecessary churn and overhead, e.g. when an Ingress or other resources change frequently. This would be helpful for large-scale deployments on Kubernetes.

We could also expose an option in the [kubernetes] section of the TOML file to configure the rate limit, and default it to roughly 10s (not sure about this number).
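For illustration only, here is a minimal sketch of what that could look like, assuming the provider enqueues object keys instead of reloading directly; the 10-second pacing, the key format, and the idea of a TOML option are assumptions, while the workqueue calls themselves come from client-go:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Hypothetical default from the proposal: let at most one reload through
	// roughly every 10s (both the value and the TOML option are placeholders).
	limiter := &workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Every(10*time.Second), 1)}
	queue := workqueue.NewRateLimitingQueue(limiter)

	// Instead of reloading the configuration on every watch event, the
	// Kubernetes provider would only enqueue a key for the changed object.
	queue.AddRateLimited("default/my-ingress")

	// A worker drains the queue and triggers the actual reload.
	key, _ := queue.Get()
	fmt.Printf("reloading configuration for %v\n", key)
	queue.Done(key)
	queue.ShutDown()
}
```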

What did you see instead?

Currently, every small change in Kubernetes triggers a configuration reload instantly.

@timoreimann
Contributor

Note that Traefik already does some form of throttling after a configuration has been rendered: the ProvidersThrottleDuration parameter controls how long to delay a configuration update so that we apply a new configuration at most once per configured interval. The default value is 2 seconds.
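To make the described behavior concrete, here is a rough illustrative sketch of such a throttle (not Traefik's actual implementation; the function and channel names are made up):

```go
package main

import (
	"fmt"
	"time"
)

// throttleConfigs forwards at most one configuration per interval and always
// keeps only the most recent one seen during that interval, so bursts of
// updates collapse into a single reload.
func throttleConfigs(in <-chan string, interval time.Duration) <-chan string {
	out := make(chan string)
	go func() {
		defer close(out)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		var pending string
		var hasPending bool
		for {
			select {
			case cfg, ok := <-in:
				if !ok {
					return
				}
				pending, hasPending = cfg, true // newer updates overwrite older pending ones
			case <-ticker.C:
				if hasPending {
					out <- pending
					hasPending = false
				}
			}
		}
	}()
	return out
}

func main() {
	in := make(chan string)
	out := throttleConfigs(in, 2*time.Second)

	// Three rapid updates: only the last one gets applied once the interval elapses.
	go func() {
		in <- "config-1"
		in <- "config-2"
		in <- "config-3"
	}()

	fmt.Println("applying", <-out)
}
```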

Throttling even further at the provider level would move the rate limiting closer to the source of the event. I'm not entirely sure how much of a win that would be though given that the provider work happens all in-memory; I suppose it largely depends on the number of objects being managed by Traefik as you mentioned as well.

To my understanding, there's another use case for the work queue, which is to control the rate at which we should try re-processing events that we could not apply. This would be helpful to replace the static resync period that we have right now with a more reasonable exponential backoff pattern.

WDYT?

@yue9944882
Contributor Author

yue9944882 commented Apr 10, 2018

This would be helpful to replace the static resync period that we have right now with a more reasonable exponential backoff pattern.

@timoreimann Hmm, I am somewhat confused about the use of exponential backoff here.
Sure, the backoff could prevent Traefik from doing useless re-processing; OTOH it would also delay the correct event we need when it arrives. So how could we identify the correct event?
Alternatively, to prevent noisy Kubernetes events from flooding the queue, we could set a max backoff delay for the event queue so that too many noisy events do not delay the correct event for too long. For that, maybe we could use ItemExponentialFailureRateLimiter from client-go/util/workqueue.
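For reference, that rate limiter takes both a base delay and a maximum delay, so the per-item backoff is capped. A minimal sketch with placeholder durations (the package and function names are made up):

```go
package kubernetesprovider

import (
	"time"

	"k8s.io/client-go/util/workqueue"
)

// newCappedBackoffQueue builds a work queue whose retry delay grows
// exponentially per item (1s, 2s, 4s, ...) but never exceeds the 30s cap,
// so noisy or repeatedly failing items cannot push work out indefinitely.
// The concrete durations are placeholders, not a recommendation.
func newCappedBackoffQueue() workqueue.RateLimitingInterface {
	limiter := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 30*time.Second)
	return workqueue.NewRateLimitingQueue(limiter)
}
```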

@timoreimann
Contributor

My experience with the work queue in client-go is still fairly limited, so take everything I say or claim with a grain of salt. :-) From what I know, it is possible to put elements back into the queue and revisit them at a periodicity dictated by an exponential factor. The idea is that, if an element cannot be processed for some reason (say, we cannot set up an Ingress because we can't access the secret for RBAC reasons), we want to conserve resources by retrying at longer and longer intervals, hoping that the situation will eventually be remedied.
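As a sketch of that put-back pattern with client-go's work queue (the syncIngress helper and the key format are hypothetical; only the queue methods are from the library):

```go
package kubernetesprovider

import "k8s.io/client-go/util/workqueue"

// syncIngress stands in for a hypothetical reconcile function that rebuilds
// the configuration for the object identified by a "namespace/name" key.
func syncIngress(key string) error { return nil }

// processNextItem is the usual worker-loop pattern: on failure the key is
// re-queued with a growing per-item delay, and on success Forget resets that
// key's backoff so other keys are never penalized.
func processNextItem(queue workqueue.RateLimitingInterface) bool {
	key, quit := queue.Get()
	if quit {
		return false
	}
	defer queue.Done(key)

	if err := syncIngress(key.(string)); err != nil {
		// Could not apply the event (e.g. a secret is unreadable for RBAC
		// reasons): retry this key later with exponential backoff.
		queue.AddRateLimited(key)
		return true
	}

	// Processed successfully: drop this key's failure counter.
	queue.Forget(key)
	return true
}
```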

On second thought, making this work may require changing how the Ingress controller in Traefik operates as we currently treat every event as a signal to process everything again. I suppose we'd have to carefully assess if the change in design is really worth it.

I also assumed that the backoff applied when putting an element back into the queue for failure reasons can be different from the backoff affecting new, incoming events. Ideally, the backoff should be indexable so that offending Ingresses can be backed off while others continue to be processed without any penalty whatsoever. Again, I'm lacking experience on what's possible in this field.

In summary, we may or may not have two separate feature requests hidden under the surface.

Anyway... I'm not opposed to the idea of throttling the rate of incoming events per se. I'd like to make sure, though, that there are actual users who have measured a negative impact of too frequently changing configurations and would benefit from the proposed feature. If that user is you, @yue9944882, please let me know. :-)

@yue9944882
Contributor Author

yue9944882 commented Apr 17, 2018

I'd like to make sure though that there are actual users who have measured a negative impact of too frequently changing configurations and would benefit from the proposed feature.

@timoreimann Thanks. Maybe we could wait and see whether any user hits this issue. I'm not seeing any negative impact from this currently; it's just a theoretical concern about whether frequent changes could have an impact.

@timoreimann
Contributor

Sure, let's keep our eyes and ears open and come back to this issue when/if we need to.

I'll close the issue if you don't mind until we can observe a real-world business need. Thanks!

traefik locked and limited conversation to collaborators Sep 1, 2019