@david-garcia-garcia
Contributor

@david-garcia-garcia david-garcia-garcia commented Sep 5, 2025

What does this PR do?

In v3.4 a redis backend was introduced for the ratelimit middleware.

This introduces a hard dependency in the critical path: if redis is down or has a transient failure, the whole middleware goes down.

This is not a problem with the in-memory backend, because the chance of failure when updating the bucket is zero. With redis it is a problem: the backend is remote and many things can glitch (the network, the backend itself, etc.).

This PR introduces three new configuration settings:

  • backoffDuration
  • backoffTimeout
  • backoffThreshold

The ratelimit is bypassed (requests go through without being limited) for a period of backoffDuration once backoffThreshold requests have failed within a time window of backoffTimeout.

I considered adding a simple "denyOnError" flag, but in my experience this would not work well: when the backend is down, performance is severely hit because every request tries to connect to redis until the timeout is reached.

Related issues:

When not configured, current behaviour is honored, where failure to update the bucket results in the request returning a 500 error.

I had some additional ideas that were finally not implemented to try and keep this focused and as simple as possible:

  • We might want to limit how long the backend can stay unhealthy, and then fully shut down the middleware once we consider it totally dead. The current backoff behaviour that lets requests through would be a "transient" shutdown, and a second parameter would define when, if the backoff persists for too long, the middleware is truly shut down and users receive a 500.
  • Instead of a fixed backoff, use exponential backoff until reaching backoffDuration

I implemented this in the Ratelimit middleware instead of inside the Redis Limiter itself because I believe that is where the responsibility belongs: other limiters added in the future could reuse it, and its purpose is to deal with failures in the limiter component.

Motivation

Ingress components are critical and should be resilient to failures when possible.

More

  • Added/updated tests
  • Added/updated documentation

Additional Notes

Fixes #12043

@nmengin
Contributor

nmengin commented Sep 8, 2025

Hello @david-garcia-garcia,
Thank you for your contribution.

We've set the status to "design-review" to allow us to check the PR and ensure there is no deep impact on Traefik before moving forward.

We'll keep you updated once the analysis is done.

Development

Successfully merging this pull request may close these issues.

Add health track to ratelimit middleware