Killed Resque jobs cannot be retried using ActiveJob #49734

geoffyoungs · 2023-10-21T19:25:48Z

Resque workers can be killed.

If a worker is killed with SIGKILL, the error handling in ActiveJob doesn't kick in, because no exception is raised within the job code.

Resque can still detect these failures: other workers call prune_dead_workers, which triggers the on_failure_XXX hooks on the job class, where the failure can be handled. ActiveJob currently misses these exceptions, so it cannot trigger its retry logic.
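
The hook dispatch described above can be illustrated without Resque installed. The following is a simplified simulation, not Resque's actual code: `DirtyExit`, `run_failure_hooks`, `PlainResqueJob`, and `JobWrapper` are hypothetical stand-ins defined here. The only thing borrowed from Resque is the convention that class methods named `on_failure_*` on the job's own class receive the exception followed by the job arguments.

```ruby
# Stand-in for Resque::DirtyExit (assumption: defined locally for illustration).
class DirtyExit < RuntimeError; end

# Simplified imitation of failure-hook dispatch: call every class method
# whose name starts with "on_failure" on the class the job was enqueued as.
def run_failure_hooks(payload_class, exception, args)
  hooks = payload_class.methods.grep(/\Aon_failure/).sort
  hooks.each { |hook| payload_class.public_send(hook, exception, *args) }
  hooks
end

# A job enqueued directly with Resque can define its own failure hook.
class PlainResqueJob
  @retried = []

  class << self
    attr_reader :retried

    def on_failure_retry(exception, *args)
      @retried << args if exception.is_a?(DirtyExit)
    end
  end
end

# Stand-in for the single wrapper class that every ActiveJob job is enqueued as.
# It defines no failure hooks, so nothing reaches the real job class.
class JobWrapper
  def self.perform(job_data); end
end

run_failure_hooks(PlainResqueJob, DirtyExit.new("worker killed"), [42])
puts "plain job retries: #{PlainResqueJob.retried.inspect}"

found = run_failure_hooks(JobWrapper, DirtyExit.new("worker killed"), [{ "job_class" => "SomeJob" }])
puts "hooks found on wrapper: #{found.inspect}"
```

In this simulation the plain job records the failed arguments, while the wrapper exposes no hooks at all, which is the gap this issue is about.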

Steps to reproduce

1. Define an ActiveJob job class.
2. Add rescue_from(Resque::DirtyExit) { retry_job } to it.
3. Enqueue the job with the Resque adapter, and kill the worker mid-job with SIGKILL.
4. Wait for the dead worker to be pruned.
5. The error is visible in the Resque failure queue, but the retry never happens.

# frozen_string_literal: true

require "bundler/inline"

# this requires redis-server to be in PATH

gemfile(true) do
  source "https://rubygems.org"

  git_source(:github) { |repo| "https://github.com/#{repo}.git" }

  gem "rails", github: "rails/rails", branch: "main"
  # gem "rails", github: "geoffyoungs/rails", branch: "resque-dirty-exit-active-job"
  gem 'redis'
  gem 'resque'
end

require "active_support"
require "active_support/core_ext/object/blank"
require "active_job"
require "resque"
require "fileutils" # FileUtils.rm_f is used in setup and teardown
require "minitest/autorun"

ENV['QUEUE'] = 'std'
ENV['FORK_PER_JOB'] = 'false'
REDIS_PORT = 8765
REDIS_DB = "resque_dirty_exit_active_job.rdb#{Process.pid}"
ActiveJob::Base.queue_adapter = :resque

class BugTest < Minitest::Test
  class Job < ActiveJob::Base
    def self.status=(value)
      Resque.redis.set('job_status', value)
    end

    def self.status
      Resque.redis.get('job_status')
    end

    queue_as ENV['QUEUE']

    rescue_from(Resque::DirtyExit) do |exception|
      Job.status = 'retry'
      retry_job
    end

    def perform
      sleep 2
      Job.status = 'done'
    end
  end

  def setup
    spawn_redis
    connect_to_redis
    clear_redis
    FileUtils.rm_f(REDIS_DB)
  end

  def teardown
    kill_redis
    FileUtils.rm_f(REDIS_DB)
  end

  def test_whether_job_is_retried_after_dirty_exit
    Job.status = 'start'

    Job.perform_later

    assert_equal 'start', Job.status

    work_for(1)

    wait_for_workers_to_be_pruned

    assert_equal 'retry', Job.status

    work_for(3)

    assert_equal 'done', Job.status
  end

  private

  # Run a worker in a forked child for `time` seconds, then SIGKILL it.
  def work_for(time = 0.5)
    pid = fork do
      connect_to_redis
      worker = Resque::Worker.new
      worker.prepare
      worker.heartbeat
      worker.work(1)
      exit!
    end
    sleep(time)
    kill('KILL', pid)
  end

  def spawn_redis
    @redis ||= spawn('redis-server', '--port', REDIS_PORT.to_s, '--dbfilename', REDIS_DB, out: File::NULL)
  end

  def clear_redis
    Resque::Failure.clear
    Resque.remove_queue('std')
    Job.status = ''
  end

  def kill_redis
    kill('INT', @redis)
    @redis = nil
  end

  def kill(signal, pid)
    Process.kill(signal, pid)
    Process.waitpid(pid)
  end

  def connect_to_redis
    Resque.redis = Redis.new(port: REDIS_PORT)
  end

  def wait_for_workers_to_be_pruned
    while (workers = Resque::Worker.all).any?
      sleep(0.1)
      workers.first.prune_dead_workers
    end
  end
end

Expected behavior

It should be possible to handle the exception in the ActiveJob class.

Actual behavior

It's not possible to handle the exception in ActiveJob without additional Resque behaviour added to the JobWrapper class.

System configuration

Rails version: 7.0.0-7.2.0pre (at least)

Ruby version: Any


zzak · 2023-10-30T05:50:34Z

This is a bit of an elaborate test case, so it will take time to review.

I found #41214 to be somewhat related, but I'm wondering if you can actually execute anything after receiving the SIGKILL. 🤔

geoffyoungs · 2023-10-30T06:26:10Z

> This is a bit of an elaborate test case, so it will take time to review.

Yes. To trigger the behaviour, the first worker has to die in a way that cannot be caught, and another worker has to detect that the first one failed. The test is contrived, but the OOM killer, pod scaling, etc. make this scenario all too common in production.

> I found #41214 to be somewhat related, but I'm wondering if you can actually execute anything after receiving the SIGKILL. 🤔

It's similar - but SIGKILL cannot be caught.

> The SIGKILL signal is used to cause immediate program termination. It cannot be handled or ignored, and is therefore always fatal. It is also not possible to block this signal.

https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html
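
This difference is easy to observe from Ruby. The sketch below is a standalone illustration, assuming a Unix-like system where `fork` is available; `run_and_signal` is a hypothetical helper name, not part of Resque or ActiveJob. It forks a child that tries to report cleanup from an `ensure` block, then terminates the child with the given signal.

```ruby
# Fork a child that attempts cleanup in an `ensure` block, send it `signal`,
# and return [whatever the child managed to write, its Process::Status].
def run_and_signal(signal)
  reader, writer = IO.pipe     # carries the child's cleanup message
  ready_r, ready_w = IO.pipe   # lets the parent wait until the child is running

  pid = fork do
    reader.close
    ready_r.close
    begin
      ready_w.write("r")
      ready_w.close            # flushes the byte so the parent can proceed
      sleep 10                 # simulate a long-running job
    ensure
      writer.write("cleanup ran")
      writer.close
    end
  end

  writer.close
  ready_w.close
  ready_r.read(1)              # block until the child is inside the begin block
  ready_r.close

  Process.kill(signal, pid)
  _pid, status = Process.wait2(pid)
  output = reader.read         # EOF once every write end is closed
  reader.close
  [output, status]
end

term_output, _term_status = run_and_signal("TERM")
kill_output, kill_status = run_and_signal("KILL")

puts "after TERM: #{term_output.inspect}"
puts "after KILL: #{kill_output.inspect} (terminated by signal #{kill_status.termsig})"
```

With `TERM`, Ruby raises a `SignalException` in the child, so the `ensure` block gets to run. With `KILL`, the child is gone before any Ruby code can execute, so the only evidence left is the exit status seen by another process.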

zzak · 2023-10-30T09:37:34Z

You can configure your pod scaler to gracefully terminate the processes though, right?

geoffyoungs · 2023-10-30T12:53:04Z

Graceful termination partially mitigates the issue - but jobs do still get killed.

Before we moved to ActiveJob we could catch and process these killed jobs using the standard Resque hooks in the job class, as if they were any other error.

With the ActiveJob Resque adapter, the errors are silently lost. Because all jobs are queued to run through a single wrapper class that then invokes ActiveJob, it's not possible to add Resque hooks at the job class level. The errors are visible in the resque-web backend and the jobs can be manually re-queued, but the following does nothing:

class MyJob < ActiveJob::Base
  retry_on Resque::DirtyExit
end

without the referenced PR or something similar.
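
One possible shape for "something similar" is a failure hook on the wrapper that forwards the exception to the job's own handlers. The sketch below is a plain-Ruby simulation under stated assumptions, not the code from the referenced PR: `FakeJobBase`, `FakeJobWrapper`, `ExampleJob`, and `DirtyExit` are hypothetical stand-ins, and only the `on_failure_*` hook shape and the `"job_class"` key in the serialized payload are modeled on Resque and ActiveJob.

```ruby
# Stand-in for Resque::DirtyExit (assumption: defined locally for illustration).
class DirtyExit < RuntimeError; end

# Minimal imitation of a job base class with rescue_from-style handlers.
class FakeJobBase
  class << self
    def handlers
      @handlers ||= {}
    end

    def rescue_from(error_class, &block)
      handlers[error_class] = block
    end

    # Rebuild a job instance from its serialized payload.
    def deserialize(job_data)
      Object.const_get(job_data.fetch("job_class")).new(job_data.fetch("arguments"))
    end
  end

  attr_reader :arguments

  def initialize(arguments)
    @arguments = arguments
  end

  # Run the first matching handler; return whether one was found.
  def rescue_with_handler(exception)
    _klass, handler = self.class.handlers.find { |klass, _| exception.is_a?(klass) }
    return false unless handler

    instance_exec(exception, &handler)
    true
  end
end

# Stand-in for the adapter's wrapper class, with a failure hook added.
class FakeJobWrapper
  def self.on_failure_forward_to_job(exception, job_data)
    FakeJobBase.deserialize(job_data).rescue_with_handler(exception)
  end
end

class ExampleJob < FakeJobBase
  RETRIED = []
  rescue_from(DirtyExit) { |_exception| RETRIED << arguments }
end

payload = { "job_class" => "ExampleJob", "arguments" => [7] }
handled = FakeJobWrapper.on_failure_forward_to_job(DirtyExit.new("worker killed"), payload)
puts "handled: #{handled}, retried: #{ExampleJob::RETRIED.inspect}"
```

The design point is that the hook lives on the one class Resque actually knows about, and the payload carries enough information to find the real job class and its handlers.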

geoffyoungs changed the title ~~Killed Resque workers cannot be retried using ActiveJob~~ Killed Resque jobs cannot be retried using ActiveJob Oct 21, 2023

geoffyoungs added a commit to geoffyoungs/rails that referenced this issue Oct 21, 2023

Catch and handle Resque::DirtyExit exceptions in ActiveJob. Fixes rai…

64df5ce

…ls#49734

paulreece added third party issue activejob attached PR labels Oct 21, 2023

zzak linked a pull request Oct 30, 2023 that will close this issue

Catch and handle Resque::DirtyExit exceptions in ActiveJob #49735

Open

3 tasks
