Similar retry times for multiple failed jobs causes jobs to fail again #480

Closed
bbhoss opened this issue Oct 30, 2012 · 6 comments

Comments

@bbhoss
Contributor

bbhoss commented Oct 30, 2012

I'm running into a problem where I have a full queue of jobs that are extremely DB-intensive. This situation is very rare with my app, but a specific kind of failure could trigger it. When these jobs execute, some of them time out because the combined load on the DB causes the queries to either lock on one another or just take much longer in general. When that happens, the jobs exceed their execution time (as specified by the timeout option) and go into the retries "queue."

The actual issue is when the jobs are retried. Since they are all retried at around the same time, the exact same thing happens and the cycle continues. To me, a possible solution is to add a "splay" to the retry time, so that the actual retry times spread further apart with each attempt. It seems simple enough to add: it should just be a matter of multiplying in a random number in the algorithm that determines the retry time. Would you be interested in this feature?
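
A minimal sketch of the splay idea, assuming a polynomial base backoff of `count**4 + 15` seconds and a random offset of up to 29 seconds scaled by the retry count. Both constants and the method names `base_delay` and `splayed_delay` are made up for illustration and are not taken from Sidekiq's source:

```ruby
# Hypothetical base backoff in seconds; the formula is an assumption.
def base_delay(count)
  (count**4) + 15
end

# Adds a random offset that widens with the retry count, so jobs that
# failed together drift further apart on every attempt.
# The splay width (0..29 seconds per step) is also an assumption.
def splayed_delay(count, rng = Random.new)
  base_delay(count) + rng.rand(30) * (count + 1)
end

# Five jobs that all failed for the second time get different retry delays.
rng = Random.new(42)
delays = Array.new(5) { splayed_delay(2, rng) }
puts delays.inspect
```

Scaling the random term by `count + 1` is what makes the spread grow: jobs that keep colliding are pushed further apart on each retry instead of staying in lockstep.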

@jc00ke
Contributor

jc00ke commented Oct 30, 2012

There is already a delay for retries but you're probably running into an edge case where, when all of your retries run, they themselves cause the problem.

I wonder if you could handle this by using another queue (do we have custom retry queues?) and firing up another sidekiq server to serially process the retries?

@mperham
Collaborator

mperham commented Oct 30, 2012

@bbhoss That would be rad. I've run into the same issue before but not badly enough for me to fix it.

@bbhoss
Contributor Author

bbhoss commented Oct 30, 2012

Cool, I'll work on a pull request then. It looks like I only need to tweak the formula here, correct?

@mperham
Collaborator

mperham commented Oct 30, 2012

Exactly.

@bbhoss
Contributor Author

bbhoss commented Oct 30, 2012

@mperham It doesn't look like you have any existing test infrastructure for the retry_at time. Would you like me to add some or is a simple 1-line patch to the DELAY proc ok?
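
If coverage for the retry time were wanted, one option would be to pass a seeded random number generator into the delay proc and check that every delay stays within known bounds. This is only a sketch: `SPLAYED_DELAY`, its formula, and `delay_within_bounds?` are hypothetical names and values, and the real DELAY proc's signature may differ:

```ruby
# Hypothetical delay proc that takes the RNG as an argument so a test
# can make it deterministic. The formula is an assumption.
SPLAYED_DELAY = proc { |count, rng| (count**4) + 15 + rng.rand(30) * (count + 1) }

# Returns true when every sampled delay lies in [base, base + max_splay].
def delay_within_bounds?(delay, count, samples: 100, seed: 1234)
  rng = Random.new(seed)
  base = (count**4) + 15
  max_splay = 29 * (count + 1)
  samples.times.all? do
    value = delay.call(count, rng)
    value >= base && value <= base + max_splay
  end
end

all_ok = (0..5).all? { |count| delay_within_bounds?(SPLAYED_DELAY, count) }
puts all_ok
```

Checking bounds rather than exact values keeps the check stable even if the splay's random source changes.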

@mperham
Collaborator

mperham commented Oct 30, 2012

I'm ok with a one line patch.

mperham added a commit that referenced this issue Oct 30, 2012
Added a splay to the delay time for retried jobs. [Fixes #480]