Multiple Terms on Heroku #893

Closed

softwaregravy opened this issue Mar 22, 2013 · 37 comments

Comments

@softwaregravy
Contributor

We were following https://github.com/defunkt/resque/pull/638 pretty closely, and implemented handling on our side in line with this: http://hone.heroku.com/resque/2012/08/21/resque-signals.html

In our rescue logic, we try to re-queue ourselves for another worker.

Here's an example stack trace from a failed job:

Exception: Resque::TermException
Error: SIGTERM
/app/vendor/bundle/ruby/1.9.1/gems/resque-1.23.0/lib/resque/worker.rb:320:in `block in unregister_signal_handlers'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/connection/ruby.rb:112:in `call'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/connection/ruby.rb:112:in `getaddrinfo'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/connection/ruby.rb:112:in `connect'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/connection/ruby.rb:174:in `connect'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/client.rb:263:in `establish_connection'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/client.rb:69:in `connect'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/client.rb:282:in `ensure_connected'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/client.rb:173:in `block in process'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/client.rb:248:in `logging'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/client.rb:172:in `process'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/client.rb:147:in `call_pipelined'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/client.rb:121:in `block in call_pipeline'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/client.rb:235:in `with_reconnect'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis/client.rb:119:in `call_pipeline'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis.rb:2020:in `block in multi'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis.rb:36:in `block in synchronize'
/usr/local/lib/ruby/1.9.1/monitor.rb:201:in `mon_synchronize'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis.rb:36:in `synchronize'
/app/vendor/bundle/ruby/1.9.1/gems/redis-3.0.2/lib/redis.rb:2012:in `multi'
/app/vendor/bundle/ruby/1.9.1/gems/redis-namespace-1.2.1/lib/redis/namespace.rb:220:in `method_missing'
/app/vendor/bundle/ruby/1.9.1/gems/resque-loner-1.2.1/lib/resque-ext/job.rb:19:in `create_with_loner'
/app/vendor/bundle/ruby/1.9.1/gems/resque-1.23.0/lib/resque.rb:240:in `enqueue_to'
/app/vendor/bundle/ruby/1.9.1/gems/resque-1.23.0/lib/resque.rb:221:in `enqueue'
/app/app/workers/resque/bidtable/bid_table_by_campaign_id_executor.rb:23:in `rescue in perform'
/app/app/workers/resque/bidtable/bid_table_by_campaign_id_executor.rb:8:in `perform'

That seems right -- we get the TermException on shutdown. However, the trace shows it coming from inside our rescue block. Our perform looks like this:

def perform
  # clever code here
rescue Resque::TermException
  # this next line is line 23 (per the trace above)
  Resque.enqueue(BidTableByCampaignIdExecutor, campaign_id)
end

So we catch the term exception, then we try to re-enqueue ourselves, but get another term exception right away.

Any thoughts?

I'm trying to reproduce this in a more deterministic way right now, but wanted to bring this up in case anyone knew what was going on.

Resque version 1.23.0

Procfile for worker:

bidtable_worker:  env TERM_CHILD=1 RESQUE_TERM_TIMEOUT=9 QUEUES=bidtable bundle exec rake resque:work 

Update: I've been trying to reproduce this deterministically, and I don't seem to be able to. We do get this error in production fairly often.

@softwaregravy
Contributor Author

Closing until I can repro this

@chrisyuska

+1 to this issue. We're seeing the same thing, though it occurs infrequently and sporadically.

@steveklabnik
Member

Awesome, a reproduction would be fantastic. I can't. :(

@softwaregravy
Contributor Author

The latest from my investigations/experimentation:

  • It is a second TermException being raised while we're still handling the first one.
  • TermExceptions often surface as other exceptions: HTTP disconnect errors if we had a persistent connection open (some REST clients raise their own exception instead), or ActiveRecord errors if the TERM arrives mid-query.

I suspect that, deep down, a lot of these clients rescue Exception, and so they swallow our TermException and raise their own. Huge pain.

We've reduced RESQUE_TERM_TIMEOUT from the 9 above to 4. This seems to have helped, but I can't back that statement up with anything.
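
For reference, that change is just the one value in the Procfile line from above (same setup, only the timeout lowered):

bidtable_worker:  env TERM_CHILD=1 RESQUE_TERM_TIMEOUT=4 QUEUES=bidtable bundle exec rake resque:work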

On our side, we're looking at:

  1. Catching Exception and string-matching the error message for TERM, since most of the exceptions we see getting thrown reference the causing exception in their message.
  2. Monkey-patching the signal handler to set a variable that we teach all our workers to check, and thus handle this in the application layer (a rough sketch follows this list). Handling it in the application layer isn't that ugly, since we already rescue the TermException and re-enqueue there. The monkey-patching itself isn't ideal, obviously.
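
A hypothetical sketch of option 2 (the term_requested flag is invented here for illustration and is not part of Resque):

require 'resque'

module Resque
  class Worker
    class << self
      # Set by the signal handler, polled by application code.
      attr_accessor :term_requested
    end

    def unregister_signal_handlers
      # Record the TERM in a flag instead of raising Resque::TermException
      # from inside the trap.
      trap('TERM') { self.class.term_requested = true }
      trap('INT', 'DEFAULT')
      # (QUIT/USR1/USR2 resets omitted for brevity.)
    end
  end
end

# Inside a job's perform, something like:
#   if Resque::Worker.term_requested
#     Resque.enqueue(self, *args)  # re-enqueue the remaining work
#     return
#   end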

If we get something that works reliably, I'll try to come back with an update.

@steveklabnik
Member

Awesome, thanks so much. These errors are so tricky!

@softwaregravy
Contributor Author

@steveklabnik

Okay, I've got it. I can consistently reproduce a case in which Heroku sends multiple TERMs on shutdown. So the problem is definitely multiple TERMs: you get the first one, catch it, and enter your rescue block; the second one arrives while you're in the middle of [posting to Airbrake | sending stats to New Relic | re-enqueuing yourself] and interrupts that too.

Probably can't follow this link, but for reference: https://help.heroku.com/tickets/84111

I will try and copy my examples over later today.

Long story short, I can "fix" resque by ignoring subsequent terms.

require 'resque'
require 'resque/errors'

module Resque
  class Worker
    def unregister_signal_handlers
      trap('TERM') do
        trap('TERM') do
          # ignore subsequent TERMs
        end
        raise TermException.new("SIGTERM")
      end

      trap('INT', 'DEFAULT')

      begin
        trap('QUIT', 'DEFAULT')
        trap('USR1', 'DEFAULT')
        trap('USR2', 'DEFAULT')
      rescue ArgumentError
      end
    end
  end
end

Also, RESQUE_TERM_TIMEOUT is either slightly misleading or not working. With a value of 8, my process is consistently killed by Heroku before reaching that timeout. I'm going to experiment with this later.

@steveklabnik
Member

Awesome, thanks so much for diving into this. Huh.

@softwaregravy
Contributor Author

Here we go -- all of this is from the Heroku support thread.

TL;DR: Here's a monkey patch, tested with Resque 1.24.1, that seems to have solved our problem.

require 'resque'
require 'resque/errors'

module Resque
  class Worker
    def unregister_signal_handlers
      trap('TERM') do
        trap('TERM') do
          # ignore subsequent TERMs
        end
        raise TermException.new("SIGTERM")
      end

      trap('INT', 'DEFAULT')

      begin
        trap('QUIT', 'DEFAULT')
        trap('USR1', 'DEFAULT')
        trap('USR2', 'DEFAULT')
      rescue ArgumentError
      end
    end
  end
end

Here's the full version starting with verifying multiple TERMs.

require 'resque/errors'

class FakeWorker
  @queue = :bidtable
  class << self 
    def perform
      Rails.logger.info "FakeWorker2: starting up"
      while true
        Rails.logger.info "FakeWorker2: while loop iterating"
        begin 
          Rails.logger.info "FakeWorker2: sleeping"
          sleep 2
          Rails.logger.info "FakeWorker2: woke up"
        rescue Resque::TermException => e
          Rails.logger.info "FakeWorker2: rescued a term exception"
        end 
      end 
      Rails.logger.info "FakeWorker2: stopping naturally"
      rescue => e
        Rails.logger.info "FakeWorker2: Caught some other exception: #{e}"
      end 
    end 
end 

Below we can see that we get "rescued a term exception" twice:

2013-05-01T01:33:21.013587+00:00 app[bidtable_worker.1]: FakeWorker2: woke up
2013-05-01T01:33:21.014005+00:00 app[bidtable_worker.1]: FakeWorker2: while loop iterating
2013-05-01T01:33:21.014144+00:00 app[bidtable_worker.1]: FakeWorker2: sleeping
2013-05-01T01:33:21.115690+00:00 heroku[api]: Scale to bidtable_worker=0, core_worker=2, emr_worker=1, lineitem_worker=1, maintenance_worker=1, scheduler=1, weather_worker=1, web=2 by heroku@thinknear.com
2013-05-01T01:33:23.024274+00:00 app[bidtable_worker.1]: FakeWorker2: woke up
2013-05-01T01:33:23.026301+00:00 app[bidtable_worker.1]: FakeWorker2: while loop iterating
2013-05-01T01:33:23.026351+00:00 app[bidtable_worker.1]: FakeWorker2: sleeping
2013-05-01T01:33:24.129548+00:00 heroku[bidtable_worker.1]: Stopping all processes with SIGTERM
2013-05-01T01:33:24.884940+00:00 app[bidtable_worker.1]: FakeWorker2: rescued a term exception
2013-05-01T01:33:24.884940+00:00 app[bidtable_worker.1]: FakeWorker2: while loop iterating
2013-05-01T01:33:24.884940+00:00 app[bidtable_worker.1]: FakeWorker2: sleeping
2013-05-01T01:33:24.905371+00:00 app[bidtable_worker.1]: FakeWorker2: rescued a term exception
2013-05-01T01:33:24.905560+00:00 app[bidtable_worker.1]: FakeWorker2: while loop iterating
2013-05-01T01:33:24.905660+00:00 app[bidtable_worker.1]: FakeWorker2: sleeping
2013-05-01T01:33:26.914141+00:00 app[bidtable_worker.1]: FakeWorker2: woke up
2013-05-01T01:33:26.914408+00:00 app[bidtable_worker.1]: FakeWorker2: while loop iterating
2013-05-01T01:33:26.914524+00:00 app[bidtable_worker.1]: FakeWorker2: sleeping
2013-05-01T01:33:21.089235+00:00 heroku[bidtable_worker.1]: State changed from up to down
2013-05-01T01:33:28.924192+00:00 app[bidtable_worker.1]: FakeWorker2: woke up
2013-05-01T01:33:28.924192+00:00 app[bidtable_worker.1]: FakeWorker2: while loop iterating
2013-05-01T01:33:28.924339+00:00 app[bidtable_worker.1]: FakeWorker2: sleeping
2013-05-01T01:33:30.933645+00:00 app[bidtable_worker.1]: FakeWorker2: woke up
2013-05-01T01:33:30.944030+00:00 app[bidtable_worker.1]: FakeWorker2: while loop iterating
2013-05-01T01:33:30.944735+00:00 app[bidtable_worker.1]: FakeWorker2: sleeping
2013-05-01T01:33:32.954114+00:00 app[bidtable_worker.1]: FakeWorker2: woke up
2013-05-01T01:33:32.954337+00:00 app[bidtable_worker.1]: FakeWorker2: while loop iterating
2013-05-01T01:33:32.954457+00:00 app[bidtable_worker.1]: FakeWorker2: sleeping
2013-05-01T01:33:34.803445+00:00 heroku[bidtable_worker.1]: Error R12 (Exit timeout) -> At least one process failed to exit within 10 seconds of SIGTERM
2013-05-01T01:33:34.803617+00:00 heroku[bidtable_worker.1]: Stopping remaining processes with SIGKILL
2013-05-01T01:33:35.454248+00:00 heroku[bidtable_worker.1]: Process exited with status 0

Repeated this behavior a few times. Verification:

require 'resque/errors'

class FakeWorker

  @queue = :bidtable

  class << self 
    def perform
      Rails.logger.info "FakeWorker: starting up"

      while true
        Rails.logger.info "FakeWorker: while loop iterating"
        begin 
          Rails.logger.info "FakeWorker: sleeping"
          sleep 2
          Rails.logger.info "FakeWorker: woke up"
        rescue Resque::TermException => e
          Rails.logger.info "FakeWorker: rescued a term exception TERM TERM"

          while true
            Rails.logger.info "FakeWorker: INNER while loop iterating"
            begin
              Rails.logger.info "FakeWorker: sleeping"
              sleep 2
              Rails.logger.info "FakeWorker: woke up"
            rescue Resque::TermException => e
              Rails.logger.info "FakeWorker: INNER while loop rescued a term exception TERM TERM"
            end 
          end 

        end 
      end 
      Rails.logger.info "FakeWorker: stopping naturally"
    rescue => e
      Rails.logger.info "FakeWorker: Caught some other exception: #{e}"
    end 
  end 
end 

The key line in this next log is "INNER while loop rescued a term exception TERM TERM":

2013-05-01T18:25:29.663657+00:00 app[bidtable_worker.1]: FakeWorker: starting up
2013-05-01T18:25:29.663657+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T18:25:29.663657+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:31.672796+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T18:25:31.673419+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T18:25:31.673561+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:33.683736+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T18:25:33.684142+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T18:25:33.684268+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:35.692938+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T18:25:35.693215+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T18:25:35.693340+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:37.702343+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T18:25:37.702343+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T18:25:37.702343+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:39.712313+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T18:25:39.712375+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T18:25:39.712375+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:39.804478+00:00 heroku[bidtable_worker.1]: State changed from up to down
2013-05-01T18:25:39.826774+00:00 heroku[api]: Scale to bidtable_worker=0, web=2 by heroku@thinknear.com
2013-05-01T18:25:41.722378+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T18:25:41.722378+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T18:25:41.722378+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:43.520024+00:00 heroku[bidtable_worker.1]: Stopping all processes with SIGTERM
2013-05-01T18:25:43.732644+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T18:25:43.732963+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T18:25:43.733088+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:44.424080+00:00 app[bidtable_worker.1]: FakeWorker: rescued a term exception TERM TERM
2013-05-01T18:25:44.424080+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T18:25:44.424080+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:44.443483+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop rescued a term exception TERM TERM
2013-05-01T18:25:44.443808+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T18:25:44.443934+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:46.452305+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T18:25:46.452305+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T18:25:46.452305+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:48.462392+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T18:25:48.462711+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T18:25:48.462859+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:50.472165+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T18:25:50.472345+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T18:25:50.472468+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:52.482754+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T18:25:52.483129+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T18:25:52.483255+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T18:25:54.099724+00:00 heroku[bidtable_worker.1]: Error R12 (Exit timeout) -> At least one process failed to exit within 10 seconds of SIGTERM
2013-05-01T18:25:54.099950+00:00 heroku[bidtable_worker.1]: Stopping remaining processes with SIGKILL
2013-05-01T18:25:54.925145+00:00 heroku[bidtable_worker.1]: Process exited with status 0

I first tried putting trap('TERM', 'IGNORE') in my rescue block. That didn't work. Then I put this in:

          trap('TERM') do
            Rails.logger.info "FakeWorker: trapped another term. trying to ignore this one TERM TERM"
          end

And it worked. Full example:

require 'resque/errors'

class FakeWorker

  @queue = :bidtable

  class << self
    def perform
      Rails.logger.info "FakeWorker: starting up"

      while true
        Rails.logger.info "FakeWorker: while loop iterating"
        begin
          Rails.logger.info "FakeWorker: sleeping"
          sleep 2
          Rails.logger.info "FakeWorker: woke up"
        rescue Resque::TermException => e
          Rails.logger.info "FakeWorker: rescued a term exception TERM TERM"
          trap('TERM') do
            Rails.logger.info "FakeWorker: trapped another term. trying to ignore this one TERM TERM"
          end

          while true
            Rails.logger.info "FakeWorker: INNER while loop iterating"
            begin
              Rails.logger.info "FakeWorker: sleeping"
              sleep 2
              Rails.logger.info "FakeWorker: woke up"
            rescue Resque::TermException => e
              Rails.logger.info "FakeWorker: INNER while loop rescued a term exception TERM TERM"
            end
          end

        end
      end
      Rails.logger.info "FakeWorker: stopping naturally"
    rescue => e
      Rails.logger.info "FakeWorker: Caught some other exception: #{e}"
    end
  end
end    

Note that we keep going after trapping the second TERM:

2013-05-01T19:05:12.922487+00:00 app[bidtable_worker.1]: FakeWorker: starting up
2013-05-01T19:05:12.922487+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:05:12.922487+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:05:14.925428+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:05:14.925637+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:05:14.925749+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:05:16.935343+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:05:16.943383+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:05:16.943383+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:05:18.945465+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:05:18.945465+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:05:18.945465+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:05:20.958536+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:05:20.958536+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:05:20.958536+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:05:22.965073+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:05:22.965073+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:05:22.965292+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:05:23.839416+00:00 heroku[bidtable_worker.1]: Stopping all processes with SIGTERM
2013-05-01T19:05:24.705489+00:00 app[bidtable_worker.1]: FakeWorker: rescued a term exception TERM TERM
2013-05-01T19:05:24.705489+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T19:05:24.705489+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:05:24.725492+00:00 app[bidtable_worker.1]: FakeWorker: trapped another term. trying to ignore this one TERM TERM
2013-05-01T19:05:26.715351+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:05:26.715602+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T19:05:26.715709+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:05:28.724521+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:05:28.725442+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T19:05:28.726585+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:05:30.736132+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:05:30.736320+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T19:05:30.736424+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:05:32.744447+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:05:32.744447+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T19:05:32.745381+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:05:33.643878+00:00 heroku[bidtable_worker.1]: Error R12 (Exit timeout) -> At least one process failed to exit within 10 seconds of SIGTERM
2013-05-01T19:05:33.644031+00:00 heroku[bidtable_worker.1]: Stopping remaining processes with SIGKILL
2013-05-01T19:05:35.198728+00:00 heroku[bidtable_worker.1]: Process exited with status 0
2013-05-01T19:05:20.390151+00:00 heroku[bidtable_worker.1]: State changed from up to down
2013-05-01T19:05:20.635070+00:00 heroku[api]: Scale to bidtable_worker=0, web=2 by heroku@thinknear.com

After playing around a bit, I tried out a monkey patch:

require 'resque'
require 'resque/errors'

module Resque
  class Worker
    def unregister_signal_handlers
      trap('TERM') do
        trap('TERM') do
          # ignore subsequent TERMs
        end
        raise TermException.new("SIGTERM")
      end

      trap('INT', 'DEFAULT')

      begin
        trap('QUIT', 'DEFAULT')
        trap('USR1', 'DEFAULT')
        trap('USR2', 'DEFAULT')
      rescue ArgumentError
      end
    end
  end
end

with an example class:

require 'resque/errors'

class FakeWorker

  @queue = :bidtable

  class << self 
    def perform
      Rails.logger.info "FakeWorker: starting up"

      while true
        Rails.logger.info "FakeWorker: while loop iterating"
        begin 
          Rails.logger.info "FakeWorker: sleeping"
          sleep 2
          Rails.logger.info "FakeWorker: woke up"
        rescue Resque::TermException => e
          Rails.logger.info "FakeWorker: rescued a term exception TERM TERM"

          while true
            Rails.logger.info "FakeWorker: INNER while loop iterating"
            begin
              Rails.logger.info "FakeWorker: sleeping"
              sleep 2
              Rails.logger.info "FakeWorker: woke up"
            rescue Resque::TermException => e
              Rails.logger.info "FakeWorker: INNER while loop rescued a term exception TERM TERM"
            end 
          end 

        end 
      end 
      Rails.logger.info "FakeWorker: stopping naturally"
    rescue => e
      Rails.logger.info "FakeWorker: Caught some other exception: #{e}"
    end 
  end 
end 

works fine:

2013-05-01T19:25:15.455695+00:00 app[bidtable_worker.1]: FakeWorker: starting up
2013-05-01T19:25:15.456433+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:25:15.456433+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:17.464980+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:25:17.464980+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:25:17.464980+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:19.474791+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:25:19.474791+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:25:19.474791+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:21.485038+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:25:21.485038+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:25:21.485038+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:23.495558+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:25:23.495558+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:25:23.495558+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:25.506605+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:25:25.506605+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:25:25.506605+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:27.518892+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:25:27.522329+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:25:27.522500+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:27.512421+00:00 heroku[api]: Scale to bidtable_worker=0, web=2 by heroku@thinknear.com
2013-05-01T19:25:29.530452+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:25:29.539650+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:25:29.539855+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:31.148319+00:00 heroku[bidtable_worker.1]: Stopping all processes with SIGTERM
2013-05-01T19:25:31.541695+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:25:31.546919+00:00 app[bidtable_worker.1]: FakeWorker: while loop iterating
2013-05-01T19:25:31.546919+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:27.492164+00:00 heroku[bidtable_worker.1]: State changed from up to down
2013-05-01T19:25:32.028712+00:00 app[bidtable_worker.1]: FakeWorker: rescued a term exception TERM TERM
2013-05-01T19:25:32.028712+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T19:25:32.028712+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:34.031483+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:25:34.031483+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T19:25:34.031483+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:36.041852+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:25:36.041852+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T19:25:36.041852+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:38.052469+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:25:38.052469+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T19:25:38.052469+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:40.062807+00:00 app[bidtable_worker.1]: FakeWorker: woke up
2013-05-01T19:25:40.062807+00:00 app[bidtable_worker.1]: FakeWorker: INNER while loop iterating
2013-05-01T19:25:40.062807+00:00 app[bidtable_worker.1]: FakeWorker: sleeping
2013-05-01T19:25:41.345321+00:00 heroku[bidtable_worker.1]: Error R12 (Exit timeout) -> At least one process failed to exit within 10 seconds of SIGTERM
2013-05-01T19:25:41.345515+00:00 heroku[bidtable_worker.1]: Stopping remaining processes with SIGKILL
2013-05-01T19:25:42.566499+00:00 heroku[bidtable_worker.1]: Process exited with status 0

I verified with a well-behaved worker as well. If the worker process finishes in time, all is well.

@steveklabnik
Member

Amazing.

Would you like to work up a patch with this? You've already gone above and beyond the call of duty on this one, so if not, I will, but I don't want to if you already are.

@softwaregravy
Contributor Author

Sure

@divoxx

divoxx commented May 1, 2013

I'm glad to see this being discussed but I believe there is another, more important, issue here.

Raising an exception from within a signal handler is a terrible idea. It's nondeterministic: you can't tell where the exception will be raised from, because it's raised regardless of where execution happens to be at the time, which makes it really hard to rescue. It could be raised from pretty much any class or method in any loaded library.

Instead of doing that, it should signal the workers to shut down gracefully, re-queuing if necessary, and then quit. It also doesn't make sense to leave that in the hands of the developer, because you're leaking Resque internals into the application using it.

@steveklabnik
Member

Raising an exception from within a signal handler is a terrible idea.

Agreed. That said, I'm trying to change 1-x-stable's behavior as little as possible. Resque as a project has a ton of technical debt, and changing this makes a lot of sense for 2.0, but given the number of people that rely on Resque as it currently stands, changing things massively is irresponsible.

I'll file a ticket to start tracking this particular bit of debt.

@divoxx

divoxx commented May 1, 2013

Although I agree with staying backwards compatible, the current implementation makes it impossible to have a solid, reliable Resque setup, and even harder to stop TermExceptions from popping up all over the place.

If this gets changed, though, it should not impact anyone, since I don't believe people are depending on this behavior -- more likely they're just getting annoyed by it :) The worst that could happen is that existing rescue Resque::TermException blocks become unreachable.

@maletor

maletor commented May 1, 2013

I tend to agree with @divoxx here. Having been frustrated by this personally, I would prefer to target 1-x-stable.

Heroku right now.

@steveklabnik
Member

I guess that's true. The main message I want to get across as the maintainer of Resque is that I don't want to 'zomg change EVERYTHING' every time there's a bug. Too many people rely on it for that kind of churn. In 2.0, on the other hand...

That said, you are right: this shouldn't affect anyone negatively. My reaction was one of conservatism, not opposition. So sure, if someone wants to change it now, let's do it.

@divoxx

divoxx commented May 1, 2013

@maletor LOL :)

@steveklabnik Yes, I definitely don't want Resque to change drastically in 1.x either.

Time is a little bit scarce right now, but if no one else steps up I will give it a try as soon as I can. If anyone else has the time and wants to give it a shot, I'd appreciate it.

@dwilkie

dwilkie commented May 2, 2013

@softwaregravy I'm having similar issues with ActiveRecord, where I'm getting:

undefined method `result' for #<Resque::TermException: SIGTERM>
/app/vendor/bundle/ruby/2.0.0/gems/activerecord-3.2.13/lib/active_record/connection_adapters/postgresql_adapter.rb:1147:in `translate_exception'

Looking at translate_exception here https://github.com/rails/rails/blob/master/activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb#L713 I can see how this relates exactly to what you were saying before that "term exceptions often cause/manifest as other exceptions. Examples are getting HTTP disconnect errors if we had a persistent connection open (similar is a rest client exception from some clients), ActiveRecord errors if the term exception arrives mid-query."

I can see how your patch fixes the issue where Heroku is sending multiple TERMs, but how does it also fix the second point, where the TermException is caught by ActiveRecord?

@softwaregravy
Contributor Author

It wouldn't.

Before I found this error I was correcting my code by catching Exception (instead of TermException) and matching the message against the word SIGTERM. Because of this confounding behavior, I was unable to conclusively determine whether that was a reliable mitigation strategy. Hopefully I'll have a better answer next week.

@dwilkie

dwilkie commented May 2, 2013

I see. Looking at https://github.com/rails/rails/blob/v3.2.13/activerecord/lib/active_record/connection_adapters/abstract_adapter.rb#L281
it seems like catching Exception and matching it against the word SIGTERM wouldn't work in this case since that method in ActiveRecord also rescues from Exception.

@softwaregravy
Contributor Author

@dwilkie

We don't write to our database during our recovery, so I can't verify just yet. I plan to.

Looking at the code you linked to, though, I would expect this to work:

  1. The rescue on line 281 catches the SIGTERM.
  2. message is set to "Resque::TermException: SIGTERM: SELECT * ..."
  3. That message goes into translate_exception,
  4. which creates a new ActiveRecord::StatementInvalid using the aforementioned message,
  5. which is then raised.

So in your code, you should be able to catch this like:

rescue Exception => e
  if e.message =~ /SIGTERM/
    # handle like a Resque::TermException
  else
    raise
  end
end

This is what we do. And since Resque::TermException#message contains "SIGTERM", we don't even bother rescuing TermException separately; we just have the one rescue block.

We do this now, but I don't know if it works perfectly. It was this not working reliably that led us to discover that Heroku was sending the TERMs twice. Once we've run my patch in production for a few days, I should be able to state with confidence whether this works. Right now, I'd say I'm 95% sure.

@dwilkie

dwilkie commented May 2, 2013

@softwaregravy Really appreciate your input on this. I can see how your solution might work; however, in this case it's not clean. The problem is that the error raised is undefined method `result' for #<Resque::TermException: SIGTERM>
/app/vendor/bundle/ruby/2.0.0/gems/activerecord-3.2.13/lib/active_record/connection_adapters/postgresql_adapter.rb:114

So in your response above, 1-3 are correct, but in 4 the error is raised on this line https://github.com/rails/rails/blob/v3.2.13/activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb#L1147 because TermException doesn't respond to 'result'. If it did respond to 'result' and 'error_field', it would be re-raised in https://github.com/rails/rails/blob/v3.2.13/activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb#L1153 and then caught in the worker. It just seems wrong that TermException should have to respond to these methods, so I'm not sure what the correct solution is.

@softwaregravy
Contributor Author

Good point.

@divoxx

divoxx commented May 2, 2013

Like I mentioned, the real underlying issue is that the exception is being raised from within a signal trap. That means an exception will be raised from whatever point in the application happens to be running when the signal is received, so it can surface pretty much anywhere. That is why you are seeing different exceptions; it makes the behavior totally unpredictable.

The only way to really fix this is to stop raising the exception from the signal trap and simply signal the worker to shutdown instead.

@softwaregravy
Contributor Author

@divoxx I'm not disagreeing with you, but I'm not agreeing with you either. At the heart of it, we're dealing with multiple processes and with signals passing between them. One such signal is "it's time to shut down now". Manifesting these as an exception is a very common paradigm (e.g. Java's InterruptedException), and it can work well, but it takes some diligence. It has its warts, and I'm not saying the approach shouldn't be reconsidered, but the paradigm is not without precedent.

To play devil's advocate:

The only way to really fix this is to stop raising the exception from the signal trap and simply signal the worker to shutdown instead.

If you call a database or have any external dependency, then your only reliable option is probably to raise an exception anyway.

All that said, this is definitely something worth exploring/debating for v2.

@divoxx

divoxx commented May 2, 2013

@softwaregravy raising an exception is not the problem; raising an exception from within a signal trap is, because it's unpredictable.

Also, in Java, InterruptedException is standard and is raised by the VM. That's not how Ruby signal handling works, and you can't expect the whole Ruby ecosystem and its libraries to change just because Resque raises an exception when a signal arrives.

@softwaregravy
Contributor Author

@divoxx
For the sake of argument: I have been under the impression that rescuing Exception is a well-understood bad practice, especially if you're then going to rely on the caught exception to implement specific, non-standard methods.
http://stackoverflow.com/questions/10048173/why-is-it-bad-style-to-rescue-exception-e-in-ruby

That said, I'm starting to agree with you. What would you propose instead? A method hook that could be called from the trap?

@divoxx

divoxx commented May 2, 2013

@softwaregravy answering your question from the Rails issue:

Because Resque dispatches each job execution in a forked process, and it's the master process that receives the signal, the master process can just re-queue the job and kill the child process. That works even if the child process is stuck in a slow query or something.

Of course, it would probably be a good idea to have a way for jobs to specify a cleanup procedure, but that's easily done.

And last but not least, yes, rescuing Exception is a bad idea. You usually want to act on specific errors and not catch everything, especially if you're just swallowing it.

@divoxx

divoxx commented May 2, 2013

@softwaregravy yeah, a hook for the worker is probably the best way to do it, since even though re-queuing the job is common, it's definitely not a good default.

@softwaregravy
Contributor Author

@divoxx So, full circle back to discussing my suggested change to Rails. Let's say we had the hook; I still need to exit my method. I would choose to just raise an exception, since there's a chance I'm in the middle of a 20s DB operation. Should my error have to implement pg-exception-specific methods? Won't I be brittle and potentially break on changes inside the pg gem? What if I'm also calling Redis and the redis gem does something similar to the pg extension? What if I add a new dependency -- do I have to dig through its source code to find out what special methods those clients require their exceptions to support? Even though those exceptions are internal and not part of the interface (hence ActiveRecord wrapping the exception)?

@divoxx

divoxx commented May 2, 2013

@softwaregravy I think you are just proving my point. Don't raise an exception. And if you do, don't expect libraries to change because of something you do.

Maybe you're not understanding my proposal, so here is the outline:

  • The parent process receives a SIGTERM
  • It waits N seconds for the job to finish
  • If the child process finished and exited, the parent exits
  • Otherwise, it sends SIGTERM to the child process
  • The child process executes Job.shutdown if available and immediately exits
  • The parent process exits

If you had an open database connection, the shutdown method can disconnect. And even if you don't do that, the child process will exit and the OS will close/release its file descriptors.
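
A rough, illustrative sketch of that flow (not actual Resque code; Job.shutdown, the grace period, and the exact fork/wait structure here are assumptions):

TERM_TIMEOUT = 5 # seconds the parent gives the job to finish on its own

def run_job(job)
  child = fork do
    # Child: run an optional cleanup hook on TERM, then exit immediately.
    trap('TERM') do
      job.shutdown if job.respond_to?(:shutdown) # hypothetical cleanup hook
      exit!(0)
    end
    job.perform
  end

  term_received = false
  trap('TERM') { term_received = true } # parent only records the signal

  loop do
    break if Process.waitpid(child, Process::WNOHANG) # child finished on its own
    if term_received
      sleep TERM_TIMEOUT # grace period for the job to finish
      begin
        Process.kill('TERM', child) # forward TERM; child runs its hook and exits
      rescue Errno::ESRCH
        # child already exited
      end
      Process.wait(child)
      break
    end
    sleep 0.1
  end
end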

@softwaregravy
Contributor Author

With the current implementation (were it all working correctly) the child can:

  • receive a signal that it needs to shut down (currently a SIGTERM)
  • re-enqueue itself with its progress updated

We routinely take a job of processing, say, 100,000 records and break it into 100 workers that process 1,000 each. Each worker might resemble the following:

e.g.

class SomeWorker
  def self.perform(start_id = 2000, end_id = 3000, batch_size = 100)
    current_end_id = end_id
    while start_id < end_id
      current_end_id = [start_id + batch_size, end_id].min
      SomeModel.where(:id => (start_id..current_end_id)).each {|x| x.do_something }
      start_id = current_end_id
    end
  rescue Exception => e
    if e.message =~ /SIGTERM/
      Resque.enqueue(SomeWorker, start_id, end_id, batch_size)
    else
      raise
    end
  end
end

Whether or not we fix this problem by fixing Rails, I think it's incorrect for Rails to be rescuing Exception and then expecting that exception to implement context-specific methods.

@ecoffey
Member

ecoffey commented May 7, 2013

I think the ideas @divoxx is proposing should definitely be explored in Resque 2. Is that something you would be willing to spike?

@jeyb

jeyb commented Sep 23, 2013

Here is my proposal for it. I'll be testing this and will let you know how it goes. https://github.com/doximity/resque/pull/2

@jeyb

jeyb commented Sep 25, 2013

This has been working well for me so far: all Resque::TermExceptions are being caught and the jobs re-enqueued (by code I added in the on_failure_retry method). We also had a problem with workers not being detached and, as a result, were seeing things like "553 of 567 Workers Working" when it was nowhere near that. That problem also seems to be resolved by this.

If you use resque mailer, this might be useful: https://github.com/doximity/resque_mailer/pull/1

@JackDanger
Member

@jeyb any update on your doximity#2 proposal? Still working? Do you recommend this be merged as-is or modified?

@jeyb

jeyb commented Jul 18, 2014

@JackDanger it has worked well for us since the last post. However, we are no longer relying on Heroku, so we no longer need this patch.

@jeremy
Contributor

jeremy closed this as completed Feb 15, 2018