Killed Resque jobs cannot be retried using ActiveJob #49734
Comments
This is a bit of an elaborate test case, so it will take time to review. I found #41214 to be somewhat related, but I'm wondering if you can actually execute anything after receiving the |
Yes - to trigger the behaviour, the first worker has to die in circumstances that cannot be caught, and another worker has to detect that the former worker failed. The test is contrived, but the OOM killer, pod scaling, etc. make this scenario all too common in production.
It's similar - but SIGKILL cannot be caught.
You can configure your pod scaler to gracefully terminate the processes though, right?
Graceful termination partially mitigates the issue, but jobs do still get killed. Before we moved to ActiveJob, we could catch and process these killed jobs using the standard Resque hooks in the job class, as if they were any other error. With the ActiveJob Resque adapter, the errors are silently lost: because the jobs are all queued to run a single wrapper class that then invokes ActiveJob, it's not possible to add Resque hooks at the job class level. The errors are visible in the resque-web backend and the jobs can be manually re-queued, but the following does nothing without the referenced PR or something similar:
class MyJob < ActiveJob::Base
  retry_on Resque::DirtyExit
end
Resque workers can be killed.
If they are killed with SIGKILL, the error handling in ActiveJob doesn't kick in, because it's not raised as an exception within the job code.
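To illustrate why no in-process error handling can help here, the following is a minimal sketch in plain Ruby (no Resque or ActiveJob involved, POSIX only). The forked child stands in for a worker running a job; the `sleep` call and the pipe used for observation are assumptions made for the demonstration. After SIGKILL, neither the `rescue` nor the `ensure` block in the child gets a chance to run.

```ruby
# Demonstrates that SIGKILL bypasses rescue/ensure in the killed process.
reader, writer = IO.pipe

pid = fork do
  reader.close
  begin
    writer.puts "started"
    writer.flush
    sleep 60 # stand-in for a long-running job
  rescue Exception => e
    # Never reached on SIGKILL: the signal is not delivered as an exception.
    writer.puts "rescued #{e.class}"
  ensure
    # Never reached on SIGKILL either.
    writer.puts "ensure ran"
    writer.flush
  end
end

writer.close
reader.gets                    # wait until the child is inside the begin block
Process.kill("KILL", pid)      # what the OOM killer or a forced pod stop does
_, status = Process.wait2(pid)
child_output = reader.read     # anything the child wrote after "started"

puts "signaled: #{status.signaled?}, signal: #{status.termsig}"
puts "output after kill: #{child_output.inspect}"
```

Only another process, such as the parent here or a surviving Resque worker, can notice that the job died.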
The failures can be detected in Resque because other workers call prune_dead_workers and trigger on_failure_XXX hooks on the job class, which can be handled, but ActiveJob currently misses these exceptions and cannot trigger retry logic.
Steps to reproduce
rescue_from(Resque::DirtyExit) { retry_job }
Expected behavior
It should be possible to handle the exception in the ActiveJob class.
Actual behavior
It's not possible to handle the exception in ActiveJob without additional resque behaviour added to the JobWrapper class.
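As a rough sketch of what such wrapper-level behaviour could look like, here is a self-contained simulation. Every class below is a hypothetical stand-in, not the real Resque or ActiveJob code: DirtyExit plays the role of Resque::DirtyExit, MyJob mimics a job with a rescue_from-style handler, and fail_job imitates how Resque looks up on_failure hooks by name on the enqueued class. The hook name on_failure_forward and the "job_class" key are assumptions for illustration.

```ruby
# Stand-ins only: none of these classes are the real Resque or ActiveJob code.
class DirtyExit < StandardError; end # plays the role of Resque::DirtyExit

# Plays the role of an ActiveJob class with a rescue_from-style handler table.
class MyJob
  HANDLERS = { DirtyExit => :retry_job }.freeze
  attr_reader :retried

  # Returns true if a handler was registered for the error and was run.
  def rescue_with_handler(error)
    handler = HANDLERS[error.class]
    return false unless handler

    send(handler)
    true
  end

  def retry_job
    @retried = true
  end
end

# Plays the role of the single wrapper class that every job is enqueued as.
class JobWrapper
  class << self
    attr_accessor :last_job

    # Resque-style failure hook: receives the exception and the job arguments,
    # rebuilds the job object and forwards the error to its handlers.
    def on_failure_forward(error, job_data)
      job = Object.const_get(job_data.fetch("job_class")).new
      self.last_job = job
      job.rescue_with_handler(error)
    end
  end
end

# Plays the role of a surviving worker failing a dead worker's job: hooks are
# looked up by name on the enqueued (wrapper) class, never on MyJob itself.
def fail_job(payload_class, error, *args)
  hooks = payload_class.methods.grep(/\Aon_failure/).sort
  hooks.map { |hook| payload_class.send(hook, error, *args) }
end

results = fail_job(JobWrapper, DirtyExit.new("worker killed"), { "job_class" => "MyJob" })
puts results.inspect
puts JobWrapper.last_job.retried.inspect
```

The point of the sketch is the dispatch path: because the hook lookup only sees the wrapper class, a handler declared on the job class runs only if the wrapper forwards the failure to it.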
System configuration
Rails version: 7.0.0-7.2.0pre (at least)
Ruby version: Any