Consider per-worker timeout overrides when rescuing jobs #350

Merged
1 commit merged into master on May 13, 2024

Conversation

@brandur (Contributor) commented May 11, 2024

This one came up while I was thinking about the job-specific rescue
threshold floated in [1].

I was going to suggest a possible workaround: set an aggressive rescue
threshold combined with a low job timeout globally, then override the
timeout on any specific job workers that need to run longer than the
new low global job timeout. But then I realized this wouldn't work
because the job rescuer doesn't account for job-specific timeouts -- it
just rescues or discards everything it finds beyond the configured
rescue threshold.
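
For concreteness, the workaround would have looked roughly like this. A minimal sketch only: the `SlowReportArgs`/`SlowReportWorker` types and the specific durations are made up for illustration.

```go
package main

import (
	"context"
	"time"

	"github.com/riverqueue/river"
)

// SlowReportArgs is a made-up args type for a job that legitimately runs for
// hours.
type SlowReportArgs struct{}

func (SlowReportArgs) Kind() string { return "slow_report" }

type SlowReportWorker struct {
	river.WorkerDefaults[SlowReportArgs]
}

// Timeout overrides the low client-level JobTimeout for this worker only.
func (w *SlowReportWorker) Timeout(job *river.Job[SlowReportArgs]) time.Duration {
	return 4 * time.Hour
}

func (w *SlowReportWorker) Work(ctx context.Context, job *river.Job[SlowReportArgs]) error {
	// ... long-running work ...
	return nil
}

func newConfig() *river.Config {
	workers := river.NewWorkers()
	river.AddWorker(workers, &SlowReportWorker{})

	return &river.Config{
		// Low global timeout plus an aggressive rescue threshold; the few
		// long-running workers override Timeout themselves. Before this
		// change, the rescuer would have rescued or discarded a running
		// slow_report job soon after the 5 minute threshold, well short of
		// its 4 hour worker-level timeout.
		JobTimeout:           time.Minute,
		RescueStuckJobsAfter: 5 * time.Minute,
		Workers:              workers,
	}
}

func main() { _ = newConfig() }
```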

Here, add new logic to address that problem. Luckily, we were already
pulling worker information to look up a possible custom retry schedule,
so we just have to piggyback on that to also examine a possible custom
work timeout.
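
Conceptually the extra check is small. A simplified sketch of the idea, not the actual rescuer code: `workerTimeout` stands in for whatever the worker's `Timeout` override returns, and is zero when there's no override.

```go
package rescuersketch

import "time"

// stillWithinCustomTimeout sketches the decision: a job the rescuer finds
// beyond the global rescue threshold is left alone when its worker's own
// timeout means it may legitimately still be running.
func stillWithinCustomTimeout(now, attemptedAt time.Time, clientJobTimeout, workerTimeout time.Duration) bool {
	timeout := clientJobTimeout
	if workerTimeout > 0 {
		// The worker overrides Timeout, so judge the job against that value
		// rather than the client-level default.
		timeout = workerTimeout
	}
	return now.Sub(attemptedAt) < timeout
}
```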

[1] #347

@brandur force-pushed the brandur-allow-long-worker-timeouts branch from 89c47f3 to bec5206 on May 11, 2024 06:30
@brandur requested a review from @bgentry on May 11, 2024 06:34
@bgentry (Contributor) left a comment

Good idea and LGTM. I think this could also use a doc change on https://riverqueue.com/docs/maintenance-services#rescuer ?

Noticing two other things about that doc section:

  1. It doesn't mention hardware faults or software crashes as a potential cause of a "stuck" job, which we should probably add.
  2. "transaction job completion" should probably be "transactional job completion"

@brandur (author) commented May 13, 2024

@brandur merged commit 793f370 into master on May 13, 2024
10 checks passed
@brandur deleted the brandur-allow-long-worker-timeouts branch on May 13, 2024 06:16
brandur added a commit that referenced this pull request May 20, 2024
Prepare version 0.6.1 for release, including the changes from #350 (no
premature rescue for jobs with long custom timeouts), #363 (exit with
status 1 in case of bad command/flags) in CLI, and #364 (fix migration
version 4 to be re-runnable).
@brandur mentioned this pull request on May 20, 2024
brandur added a commit that referenced this pull request May 21, 2024
Prepare version 0.6.1 for release, including the changes from #350 (no
premature rescue for jobs with long custom timeouts), #363 (exit with
status 1 in case of bad command/flags) in CLI, and #364 (fix migration
version 4 to be re-runnable).