Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Killed Resque jobs cannot be retried using ActiveJob #49734

Open
geoffyoungs opened this issue Oct 21, 2023 · 4 comments · May be fixed by #49735
Open

Killed Resque jobs cannot be retried using ActiveJob #49734

geoffyoungs opened this issue Oct 21, 2023 · 4 comments · May be fixed by #49735

Comments

@geoffyoungs
Copy link

Resque workers can be killed.

If they are killed with SIGKILL, the error handling in ActiveJob doesn't kick in, because it's not raised as an exception within the job code.

The failures can be detected in Resque because other workers call prune_dead_workers and trigger on_failure_XXX hooks on the job class, which can be handled, but ActiveJob currently misses these exceptions and cannot trigger retry logic.

Steps to reproduce

  1. Create an ActiveJob instance
  2. Add rescue_from(Resque::DirtyExit) { retry_job }
  3. Enqueue in resque, and kill it mid-job with SIGKILL
  4. Wait for the worker to be pruned
  5. The error will be visible in the resque failure queue, but the retry will never happen.
# frozen_string_literal: true

require "bundler/inline"

# this requires redis-server to be in PATH

gemfile(true) do
  source "https://rubygems.org"

  git_source(:github) { |repo| "https://github.com/#{repo}.git" }

  gem "rails", github: "rails/rails", branch: "main"
  # gem "rails", github: "geoffyoungs/rails", branch: "resque-dirty-exit-active-job"
  gem 'redis'
  gem 'resque'
end

require "active_support"
require "active_support/core_ext/object/blank"
require "active_job"
require "resque"
require "minitest/autorun"

ENV['QUEUE'] = 'std'
ENV['FORK_PER_JOB'] = 'false'
REDIS_PORT = 8765
REDIS_DB = 'resque_dirty_exit_active_job.rdb'+$$.to_s
ActiveJob::Base.queue_adapter = :resque

class BugTest < Minitest::Test
  class Job < ActiveJob::Base
    def self.status=(value)
      Resque.redis.set('job_status', value)
    end

    def self.status
      Resque.redis.get('job_status')
    end

    queue_as ENV['QUEUE']

    rescue_from(Resque::DirtyExit) do |exception|
      Job.status = 'retry'
      retry_job
    end

    def perform
      sleep 2
      Job.status = 'done'
    end
  end

  def setup
    spawn_redis
    connect_to_redis
    clear_redis
    FileUtils.rm_f(REDIS_DB)
  end

  def teardown
    kill_redis
    FileUtils.rm_f(REDIS_DB)
  end

  def test_whether_job_is_retried_after_dirty_exit
    Job.status = 'start'

    Job.perform_later

    assert Job.status.eql?('start')

    work_for(1)

    wait_for_workers_to_be_pruned

    assert Job.status.eql?('retry')

    work_for(3)

    assert Job.status.eql?('done')
  end

  private

  def work_for(time=0.5)
    pid = fork {
      connect_to_redis
      worker = Resque::Worker.new
      worker.prepare
      worker.heartbeat
      worker.work(1)
      exit!
    }
    sleep(time)
    kill('KILL', pid)
  end

  def spawn_redis
    @redis ||= spawn(['redis-server', '--port', REDIS_PORT.to_s, '--dbfilename', REDIS_DB].join(' '), out: File.open('/dev/null', 'w'))
  end

  def clear_redis
    Resque::Failure.clear
    Resque.remove_queue('std')
    Job.status = ''
  end

  def kill_redis
    kill('INT', @redis)
    @redis = nil
  end

  def kill(signal, pid)
    Process.kill(signal, pid)
    Process.waitpid(pid)
  end

  def connect_to_redis
    Resque.redis = Redis.new(port: REDIS_PORT)
  end

  def wait_for_workers_to_be_pruned
    while (workers = Resque::Worker.all).any?
      sleep(0.1)
      workers.first.prune_dead_workers
    end
  end
end

Expected behavior

It should be possible to handle the exception in the ActiveJob class.

Actual behavior

It's not possible to handle the exception in ActiveJob without additional resque behaviour added to the JobWrapper class.

System configuration

Rails version: 7.0.0-7.2.0pre (at least)

Ruby version: Any

@geoffyoungs geoffyoungs changed the title Killed Resque workers cannot be retried using ActiveJob Killed Resque jobs cannot be retried using ActiveJob Oct 21, 2023
@zzak
Copy link
Member

zzak commented Oct 30, 2023

This is a bit of an elaborate test case, so it will take time to review.

I found #41214 to be somewhat related, but I'm wondering if you can actually execute anything after receiving the SIGKILL. 🤔

@zzak zzak linked a pull request Oct 30, 2023 that will close this issue
3 tasks
@geoffyoungs
Copy link
Author

This is a bit of an elaborate test case, so it will take time to review.

Yes - to trigger the behaviour, the first worker has to die in circumstances that cannot be caught and another worker has to detect that the former worker failed. The test is contrived, but OOM killer, pod scaling etc make this scenario all too common in production.

I found #41214 to be somewhat related, but I'm wondering if you can actually execute anything after receiving the SIGKILL. 🤔

It's similar - but SIGKILL cannot be caught.

The SIGKILL signal is used to cause immediate program termination. It cannot be handled or ignored, and is therefore always fatal. It is also not possible to block this signal. https://www.gnu.org/software/libc/manual/html_node/Termination-Signals.html

@zzak
Copy link
Member

zzak commented Oct 30, 2023

You can configure your pod scaler to gracefully terminate the processes, though right?

@geoffyoungs
Copy link
Author

Graceful termination partially mitigates the issue - but jobs do still get killed.

Before we moved to ActiveJob we could catch and process these killed jobs using the standard Resque hooks in the job class, as if they were any other error.

With the ActiveJob resque adapter, the errors are silently lost (because the jobs are all queued to run a single wrapper class that then invokes ActiveJob, it's not possible to add resque hooks at a job class level). The errors are visible in the rescue-web backend and the jobs can be manually re-queued, but the following does nothing:

class MyJob < ActiveJob::Base
  retry_on Resque::DirtyExit
end

without the referenced PR or something similar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants