Similar retry times for multiple failed jobs cause jobs to fail again #480
Comments
There is already a delay for retries but you're probably running into an edge case where, when all of your retries run, they themselves cause the problem. I wonder if you could handle this by using another queue (do we have custom retry queues?) and firing up another Sidekiq server to serially process the retries?
@bbhoss That would be rad. I've run into the same issue before, but not badly enough for me to fix it.
Cool, I'll work on a pull request then. It looks like I only need to tweak the formula here, correct?
Exactly.
@mperham It doesn't look like you have any existing test infrastructure for the retry_at time. Would you like me to add some or is a simple 1-line patch to the DELAY proc ok? |
I'm ok with a one line patch. |
Added a splay to the delay time for retried jobs. [Fixes #480]
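The patch itself isn't quoted in the thread; a minimal sketch of what a splayed delay proc could look like follows. The `DELAY` name and the `(count ** 4) + 15` base backoff are assumptions modeled on Sidekiq's documented exponential retry formula, not the exact merged code:

```ruby
# Sketch of a retry-delay proc with a random "splay" added.
# count is the number of retries attempted so far.
# (count ** 4) + 15 is the assumed exponential base backoff;
# rand(30) * (count + 1) spreads simultaneous retries across a
# window that widens with each attempt, so a batch of jobs that
# failed together stops retrying in lockstep.
DELAY = proc do |count|
  (count ** 4) + 15 + (rand(30) * (count + 1))
end
```

With a splay like this, ten jobs that all fail at the same moment get first-retry times spread across a 30-second window instead of hitting the database again simultaneously.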
So I'm running into a problem where I have a full queue of jobs that are super DB-intensive. This situation is very rare with my app, but a specific kind of failure would cause me to hit it. The issue is that when these jobs execute, some of them time out because the combined load on the DB causes the queries to either lock on one another or just take much longer in general. When this happens, the jobs' execution time expires (as specified by the `timeout` option) and they go into the retries "queue." The actual issue is when the jobs are retried: since they are all retried at around the same time, the exact same failure happens, and the cycle continues. To me, a possible solution is to add a "splay" to the retry time so that the actual retry times increasingly diverge. It seems like this would be simple enough to add, since it should just be a matter of multiplying in a random number in the algorithm that determines the retry time. Would you be interested in this feature?
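The thundering-herd effect described above can be illustrated with a small sketch. The formulas here are hypothetical stand-ins, not Sidekiq's actual code:

```ruby
# Hypothetical illustration: ten jobs all fail at t = 0 and each
# computes its first retry time (count = 0).
fixed   = proc { |count| (count ** 4) + 15 }            # no splay
splayed = proc { |count| (count ** 4) + 15 + rand(30) } # with splay

no_splay   = Array.new(10) { fixed.call(0) }
with_splay = Array.new(10) { splayed.call(0) }

# Without splay, every job retries at exactly t = 15, recreating the
# combined DB load; with splay, the retries land anywhere in [15, 45),
# so the queries no longer pile onto the database at the same instant.
```

This is why a purely deterministic backoff doesn't help here: it delays the stampede without dispersing it.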