[suggestion] smarter default rate limit backoff strategy #6141
The issue, of course, is that Sidekiq doesn't know which jobs use which limiters. If you use a separate queue for limited jobs, that job counter becomes extremely easy to implement.
Right, but what if there were an API for the job to declare itself dependent on a rate limiter, plus an estimate of how dependent, e.g. "this job makes 4 API calls"? That way, when the job is enqueued, we "allocate" 4 requests in the limiter's "backlog counter". When the job finishes running or dies, we "deallocate" the estimated 4 requests.

Even if the job itself doesn't have such an API, the rate limiter could expose one that lets us keep a counter of the backlog (e.g. we manually increment and decrement that counter, possibly by hooking into the job lifecycle, but not necessarily). If the counter is maintained, the rate limiter's default backoff strategy can be smarter. Of course we can implement this independently of Sidekiq, but it feels appropriate to me for it to be part of the rate limiting feature.
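The "allocate/deallocate" idea above can be sketched as a small counter object. Everything here is hypothetical illustration (the class and method names are invented, not a Sidekiq API); a real version would keep the counts in Redis so every Sidekiq process shares them, rather than in an in-memory Hash:

```ruby
# Hypothetical per-limiter "backlog counter" of estimated rate-limited calls.
# In production this would live in Redis (e.g. INCRBY/DECRBY on a key per
# limiter); a plain Hash is used here so the sketch is self-contained.
class BacklogCounter
  def initialize
    @counts = Hash.new(0) # limiter name => estimated pending calls
  end

  # Called when a job that uses the limiter is enqueued.
  def allocate(limiter_name, estimated_calls)
    @counts[limiter_name] += estimated_calls
  end

  # Called when the job finishes or dies; never go below zero.
  def deallocate(limiter_name, estimated_calls)
    @counts[limiter_name] = [@counts[limiter_name] - estimated_calls, 0].max
  end

  def backlog(limiter_name)
    @counts[limiter_name]
  end
end
```

A job estimating 4 API calls would `allocate("some_limiter", 4)` at enqueue time and `deallocate("some_limiter", 4)` on completion or death.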
This makes sense, though in my use case I feel like that would create too many queues and wouldn't be the best way to handle it. Additionally, the backlog needs to include matching jobs in the … Would middleware be the right way to hook into "enqueued" and "processed/dead"?
I see your viewpoint. Sidekiq is generic infrastructure meant to be useful to all apps. As I develop more and more specialized APIs, those APIs become less and less useful to the majority. Rate limiting isn't useful to everyone, but it's useful to many. Your follow-up suggestion would be useful to a far smaller subset, in my experience; the hardest part of my job is determining where to stop building APIs. What I've found is that adding more complexity for a minority of users is a bad tradeoff. You can certainly build this complexity yourself, but I can't without more social proof that it's useful to the majority of people using rate limiting. Client middleware would be the place to start; I don't recall whether client middleware runs when jobs are scheduled, enqueued, or both. If you come up with a solution and I see demand for that solution from others, then I have cause to bake this in.
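A client-middleware approach could look roughly like the sketch below. The `call(worker_class, job, queue, redis_pool)` signature is Sidekiq's real client middleware interface, but the payload keys (`"limiter_name"`, `"limiter_calls"`) are invented conventions, and the injected counter stands in for shared storage such as Redis:

```ruby
# Sketch: bump a backlog counter whenever a job declaring a limiter
# dependency is pushed. The counter is injected so this snippet is
# self-contained; in production it would be Redis-backed.
class LimiterBacklogClientMiddleware
  def initialize(counter)
    @counter = counter # e.g. Hash.new(0) here, Redis in production
  end

  # Standard Sidekiq client middleware signature; must yield to
  # continue the push.
  def call(worker_class, job, queue, redis_pool)
    if (name = job["limiter_name"])
      @counter[name] += job.fetch("limiter_calls", 1)
    end
    yield
  end
end
```

It would be registered via `Sidekiq.configure_client { |config| config.client_middleware { |chain| chain.add(...) } }`; a matching server middleware would decrement on completion.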
Totally understand. I've got something implemented via middleware running on our own setup now, and will monitor to see how it does. The only issues I couldn't figure out were:
Ideally, I would decrement the counter when a job dies and increment it again when it gets retried, because a dead job really shouldn't factor into the throttle-rate calculation. For now I don't do that, and dead jobs continue to count as backlogged. While not ideal, this is probably OK in practice, because under normal operating conditions we don't expect many dead jobs. Thanks!
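The "decrement on death" half of this is reachable via Sidekiq's death handlers (`Sidekiq.configure_server { |config| config.death_handlers << handler }`); it's the "re-increment when a dead job is manually retried" half that has no built-in hook. A sketch of the death side, with the counter injected and the payload keys invented as before:

```ruby
# Sketch: build a death handler that releases a dead job's estimated
# calls from the backlog counter. Handler receives (job_payload,
# exception), matching Sidekiq's death_handlers contract.
def limiter_death_handler(counter)
  lambda do |job, _exception|
    if (name = job["limiter_name"])
      calls = job.fetch("limiter_calls", 1)
      counter[name] = [counter[name] - calls, 0].max
    end
  end
end
```

Registering it would be `config.death_handlers << limiter_death_handler(counter)` inside `Sidekiq.configure_server`.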
@stevenou Yeah, those are some of the harder issues to overcome.
Sidekiq doesn't have a great solution for following jobs as they change state. Actually, adding support for callbacks/event handlers for various job state changes throughout the system sounds like an interesting idea for Sidekiq 8. |
As I understand it, the default backoff strategy for rate limiting is linear.
When one enqueues a large number of jobs that use the same rate limiter, many of those jobs will fail immediately and get rescheduled to execute in a very short amount of time. Assuming the rate limit didn't magically increase, the vast majority of those reschedules will hit the limit again and get rescheduled. This goes on until one gets further up the linear backoff ladder, at which point jobs start getting rescheduled for much later than they need to be, and there is lots of idle time where there is plenty of rate limit available to execute jobs, but everything is rescheduled for much later.
For example, with a rate limit of 10 per minute and 10k jobs enqueued, ideally you want to execute 10 jobs per minute evenly over 1000 minutes (see https://github.com/sidekiq/sidekiq/wiki/Ent-Rate-Limiting#limiting-is-not-throttling). However, that kind of throttling only works if you know the number of jobs upfront. If you don't, and you end up enqueueing 10k jobs in a short timeframe under the default backoff strategy, it may take 3 days to get through all the jobs instead of 17 hours.
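As a quick sanity check of the arithmetic in that example:

```ruby
# 10k jobs at a limit of 10 jobs/minute: the best case is to drain the
# backlog evenly over the minimum time the limit allows.
jobs = 10_000
rate_per_minute = 10

drain_minutes = jobs / rate_per_minute # 1000 minutes
drain_hours   = drain_minutes / 60.0   # ~16.7 hours, i.e. the "17 hours" above
```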
What I am proposing is a heuristic in the backoff strategy that basically does this: look at the limiter's rate and the current backlog size, compute how long it would take to drain the backlog at that rate, and reschedule the job to a point spread across that drain window.
That will cause the backoff strategy to reschedule jobs close to optimally. The rate limit may be dynamic and the number of jobs in the backlog will be constantly changing, but this approach tries to minimize both idle time and over-limit hits.
This requires a counter per rate limiter that tracks how many jobs have been enqueued but not yet processed or died (i.e. the number of jobs in the backlog). We could possibly maintain it by hooking into job lifecycle events.
There are also other complications: for example, a single job may actually hit the rate limiter 4 times, so the math needs to multiply the backlog by the calls per job to estimate the drain window accurately.
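The heuristic, including the calls-per-job multiplier, can be sketched as a small function. All names here are illustrative, not a Sidekiq API; a real implementation would plug into the limiter's backoff policy and read the backlog count from shared storage:

```ruby
# Sketch of the proposed backoff heuristic: spread reschedules across
# the time needed to drain the current backlog, instead of a fixed
# linear ladder. Returns a delay in seconds.
def optimal_backoff(backlog_jobs, rate_per_minute, calls_per_job: 1)
  # Time (in minutes) to drain the backlog at the limiter's rate,
  # accounting for jobs that hit the limiter more than once.
  drain_minutes = (backlog_jobs * calls_per_job) / rate_per_minute.to_f

  # Pick a random point within the drain window so reschedules spread
  # out rather than all landing at the same instant.
  rand * drain_minutes * 60.0
end
```

With 10k jobs, 4 limiter calls each, and a limit of 10/minute, every reschedule lands somewhere inside the ~4000-minute drain window instead of piling up at the bottom of a linear ladder.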
Anyway, I am running something like this in production right now (though instead of a counter, I am literally counting the number of jobs and caching the result), and without a doubt the timing of job execution is much closer to optimal. But I feel like this could be part of the rate limiting feature itself, as all rate limiters provided by Sidekiq could benefit from smarter backoff.