Similar retry times for multiple failed jobs cause jobs to fail again #480
Comments
There is already a delay for retries but you're probably running into an edge case where, when all of your retries run, they themselves cause the problem. I wonder if you could handle this by using another queue (do we have custom retry queues?) and firing up another Sidekiq server to serially process the retries?
@bbhoss That would be rad. I've run into the same issue before, but not badly enough for me to fix it.
Cool, I'll work on a pull request then. It looks like I only need to tweak the formula here, correct?
Exactly.
@mperham It doesn't look like you have any existing test infrastructure for the retry_at time. Would you like me to add some or is a simple 1-line patch to the DELAY proc ok? |
I'm ok with a one line patch. |
Added a splay to the delay time for retried jobs. [Fixes #480]
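The patch itself isn't quoted in the thread; a minimal sketch of what a splayed delay proc could look like follows. The `DELAY` name and the `(count ** 4) + 15` base backoff are assumptions modeled on Sidekiq's documented exponential retry formula, not the exact merged code:

```ruby
# Sketch of a retry-delay proc with a random "splay" added.
# count is the number of retries attempted so far.
# (count ** 4) + 15 is the assumed exponential base backoff;
# rand(30) * (count + 1) spreads simultaneous retries across a
# window that widens with each attempt, so a batch of jobs that
# failed together stops retrying in lockstep.
DELAY = proc do |count|
  (count ** 4) + 15 + (rand(30) * (count + 1))
end
```

With a splay like this, ten jobs that all fail at the same moment get first-retry times spread across a 30-second window instead of hitting the database again simultaneously.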
So I'm running into a problem where I have a full queue of jobs that are super DB-intensive. This situation is very rare with my app, but a specific kind of failure would cause me to hit it. The issue is that when these jobs execute, some of them time out because the combined load on the DB causes the queries to either lock on one another or just take much longer in general. When this happens, the jobs' execution time expires (as specified by the `timeout` option) and they go into the retries "queue." The actual issue is when the jobs are retried: since they are all retried at around the same time, the exact same failure happens, and the cycle continues. To me, a possible solution is to add a "splay" to the retry time so that the actual retry times increasingly diverge. It seems like this would be simple enough to add, since it should just be a matter of multiplying in a random number in the algorithm that determines the retry time. Would you be interested in this feature?
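The thundering-herd effect described above can be illustrated with a small sketch. The formulas here are hypothetical stand-ins, not Sidekiq's actual code:

```ruby
# Hypothetical illustration: ten jobs all fail at t = 0 and each
# computes its first retry time (count = 0).
fixed   = proc { |count| (count ** 4) + 15 }            # no splay
splayed = proc { |count| (count ** 4) + 15 + rand(30) } # with splay

no_splay   = Array.new(10) { fixed.call(0) }
with_splay = Array.new(10) { splayed.call(0) }

# Without splay, every job retries at exactly t = 15, recreating the
# combined DB load; with splay, the retries land anywhere in [15, 45),
# so the queries no longer pile onto the database at the same instant.
```

This is why a purely deterministic backoff doesn't help here: it delays the stampede without dispersing it.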