feat: add health track to ratelimit middleware #12042
Open
What does this PR do?
In v3.4 a Redis backend was introduced for the ratelimit middleware.
This introduces a hard dependency in the critical path: if Redis is down or suffers a transient failure, the whole middleware goes down.
This is not a problem with the in-memory backend, where updating the bucket cannot fail, but it is with Redis: the backend is remote, and many things can glitch (the network, the backend itself, etc.).
This PR introduces three new configuration settings:
Rate limiting stops being enforced (requests go through unthrottled) for a period of backoffDuration once backoffThreshold requests have failed within the configured time window.
I considered adding a simple "denyOnError" flag, but in my experience that would be a poor choice: when the backend is down, performance would be severely hit, since every request would still try to connect to Redis until the timeout is reached.
When not configured, the current behaviour is preserved: a failure to update the bucket makes the request fail with a 500 error.
I had some additional ideas that were ultimately not implemented, in order to keep this change focused and as simple as possible.
I implemented this in the ratelimit middleware rather than inside the Redis limiter itself, because I believe that is where the responsibility belongs: other limiters added in the future could reuse it, and the purpose is to deal with failures in the limiter component.
Motivation
Ingress components are critical and should be resilient to failures when possible.
Additional Notes
Fixes #12043