Consider per-worker timeout overrides when rescuing jobs #350

Merged
1 commit merged into master on May 13, 2024

Conversation

@brandur (Contributor) commented May 11, 2024

This one came up while I was thinking about the job-specific rescue
threshold floated in [1].

I was going to suggest a possible workaround: set an aggressive rescue
threshold combined with a low job timeout globally, then override the
timeout on any specific job workers that need to run longer than the
new low global job timeout. But then I realized this wouldn't work
because the job rescuer doesn't account for job-specific timeouts -- it
just rescues or discards everything it finds beyond the configured
rescue threshold.
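
For concreteness, the workaround would have looked roughly like this. A minimal sketch only: the `SlowReportArgs`/`SlowReportWorker` types and the specific durations are made up for illustration.

```go
package main

import (
	"context"
	"time"

	"github.com/riverqueue/river"
)

// SlowReportArgs is a made-up args type for a job that legitimately runs for
// hours.
type SlowReportArgs struct{}

func (SlowReportArgs) Kind() string { return "slow_report" }

type SlowReportWorker struct {
	river.WorkerDefaults[SlowReportArgs]
}

// Timeout overrides the low client-level JobTimeout for this worker only.
func (w *SlowReportWorker) Timeout(job *river.Job[SlowReportArgs]) time.Duration {
	return 4 * time.Hour
}

func (w *SlowReportWorker) Work(ctx context.Context, job *river.Job[SlowReportArgs]) error {
	// ... long-running work ...
	return nil
}

func newConfig() *river.Config {
	workers := river.NewWorkers()
	river.AddWorker(workers, &SlowReportWorker{})

	return &river.Config{
		// Low global timeout plus an aggressive rescue threshold; the few
		// long-running workers override Timeout themselves. Before this
		// change, the rescuer would have rescued or discarded a running
		// slow_report job soon after the 5 minute threshold, well short of
		// its 4 hour worker-level timeout.
		JobTimeout:           time.Minute,
		RescueStuckJobsAfter: 5 * time.Minute,
		Workers:              workers,
	}
}

func main() { _ = newConfig() }
```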

Here, add new logic to address that problem. Luckily, we were already
pulling worker information to look up a possible custom retry schedule,
so we just have to piggyback on that to also examine a possible custom
work timeout.
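
Conceptually the extra check is small. A simplified sketch of the idea, not the actual rescuer code: `workerTimeout` stands in for whatever the worker's `Timeout` override returns, and is zero when there's no override.

```go
package rescuersketch

import "time"

// stillWithinCustomTimeout sketches the decision: a job the rescuer finds
// beyond the global rescue threshold is left alone when its worker's own
// timeout means it may legitimately still be running.
func stillWithinCustomTimeout(now, attemptedAt time.Time, clientJobTimeout, workerTimeout time.Duration) bool {
	timeout := clientJobTimeout
	if workerTimeout > 0 {
		// The worker overrides Timeout, so judge the job against that value
		// rather than the client-level default.
		timeout = workerTimeout
	}
	return now.Sub(attemptedAt) < timeout
}
```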

[1] #347

@brandur force-pushed the brandur-allow-long-worker-timeouts branch from 89c47f3 to bec5206 on May 11, 2024 06:30
@brandur requested a review from @bgentry on May 11, 2024 06:34
@bgentry (Contributor) left a comment

Good idea and LGTM. I think this could also use a doc change on https://riverqueue.com/docs/maintenance-services#rescuer ?

Noticing two other things about that doc section:

  1. It doesn't mention hardware faults or software crashes as a potential cause of a "stuck" job, which we should probably add.
  2. "transaction job completion" should probably be "transactional job completion"

@brandur (author) commented May 13, 2024

@brandur merged commit 793f370 into master on May 13, 2024
10 checks passed
@brandur deleted the brandur-allow-long-worker-timeouts branch on May 13, 2024 06:16
brandur added a commit that referenced this pull request May 20, 2024
Prepare version 0.6.1 for release, including the changes from #350 (no
premature rescue for jobs with long custom timeouts), #363 (exit with
status 1 in case of bad command/flags) in CLI, and #364 (fix migration
version 4 to be re-runnable).
@brandur mentioned this pull request on May 20, 2024
brandur added a commit that referenced this pull request May 21, 2024
Prepare version 0.6.1 for release, including the changes from #350 (no
premature rescue for jobs with long custom timeouts), #363 (exit with
status 1 in case of bad command/flags) in CLI, and #364 (fix migration
version 4 to be re-runnable).