New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multiple Terms on Heroku #893
Comments
|
Closing until I can repro this |
|
+1 to this issue. We're having the same issue, though it occurs infrequently and sporadically. |
|
Awesome, a reproduction would be fantastic. I can't. :( |
|
The latest from my investigations/experimentation:
I suspect deep down in a lot of these clients they are catching exception, and thus grabbing our term exception and throwing their own. Huge pain. we've reduced the RESQUE_TERM_TIMEOUT from 9 in the above to 4. This seems to have helped, but I can't back that statement up with anything. On our side, we're looking at:
If we get something that works reliably, I'll try and come back to update |
|
Awesome, thanks so much. These errors are so tricky! |
|
Okay, I've got it. I am consistently reproducing a case in which Heroku is sending multiple TERMs on shutdown. So, the problem is definitely getting multiple terms. You get 1, catch and enter your rescue block. 2nd one comes in while you're in the middle of [Post-to-airbrake | sending-stats-to-newrelic | reenqueing-yourself] and looks. Probably can't follow this link, but for reference: https://help.heroku.com/tickets/84111 I will try and copy my examples over later today. Long story short, I can "fix" resque by ignoring subsequent terms. require 'resque'
require 'resque/errors'
module Resque
class Worker
def unregister_signal_handlers
trap('TERM') do
trap ('TERM') do
# ignore subsequent terms
end
raise TermException.new("SIGTERM")
end
trap('INT', 'DEFAULT')
begin
trap('QUIT', 'DEFAULT')
trap('USR1', 'DEFAULT')
trap('USR2', 'DEFAULT')
rescue ArgumentError
end
end
end
end Also, RESQUE_TERM_TIMEOUT is slightly misleading or not working. With a value of 8, I am consistently killed by Heroku before reaching this timeout. I'm going to experiment on this later. |
|
Awesome, thanks so much for diving into this. Huh. |
|
Here we go -- all this from the heroku support thread TLDR: Here's a Monkey patch tested with Resque 1.24.1 that seems to have solved our problem. require 'resque'
require 'resque/errors'
module Resque
class Worker
def unregister_signal_handlers
trap('TERM') do
trap ('TERM') do
# ignore subsequent terms
end
raise TermException.new("SIGTERM")
end
trap('INT', 'DEFAULT')
begin
trap('QUIT', 'DEFAULT')
trap('USR1', 'DEFAULT')
trap('USR2', 'DEFAULT')
rescue ArgumentError
end
end
end
end Here's the full version starting with verifying multiple TERMs. require 'resque/errors'
class FakeWorker
@queue = :bidtable
class << self
def perform
Rails.logger.info "FakeWorker2: starting up"
while true
Rails.logger.info "FakeWorker2: while loop iterating"
begin
Rails.logger.info "FakeWorker2: sleeping"
sleep 2
Rails.logger.info "FakeWorker2: woke up"
rescue Resque::TermException => e
Rails.logger.info "FakeWorker2: rescued a term exception"
end
end
Rails.logger.info "FakeWorker2: stopping naturally"
rescue => e
Rails.logger.info "FakeWorker2: Caught some other exception: #{e}"
end
end
end Below we can see that we get 2x: rescued a term exception Repeated this behavior a few times. Verification: require 'resque/errors'
class FakeWorker
@queue = :bidtable
class << self
def perform
Rails.logger.info "FakeWorker: starting up"
while true
Rails.logger.info "FakeWorker: while loop iterating"
begin
Rails.logger.info "FakeWorker: sleeping"
sleep 2
Rails.logger.info "FakeWorker: woke up"
rescue Resque::TermException => e
Rails.logger.info "FakeWorker: rescued a term exception TERM TERM"
while true
Rails.logger.info "FakeWorker: INNER while loop iterating"
begin
Rails.logger.info "FakeWorker: sleeping"
sleep 2
Rails.logger.info "FakeWorker: woke up"
rescue Resque::TermException => e
Rails.logger.info "FakeWorker: INNER while loop rescued a term exception TERM TERM"
end
end
end
end
Rails.logger.info "FakeWorker: stopping naturally"
rescue => e
Rails.logger.info "FakeWorker: Caught some other exception: #{e}"
end
end
end The key line in this next loop is "INNER while loop rescued a term exception TERM TERM" I first tried putting this in my rescue statement: trap('TERM') do
Rails.logger.info "FakeWorker: trapped another term. trying to ignore this one TERM TERM"
endAnd it worked. Full example: require 'resque/errors'
class FakeWorker
@queue = :bidtable
class << self
def perform
Rails.logger.info "FakeWorker: starting up"
while true
Rails.logger.info "FakeWorker: while loop iterating"
begin
Rails.logger.info "FakeWorker: sleeping"
sleep 2
Rails.logger.info "FakeWorker: woke up"
rescue Resque::TermException => e
Rails.logger.info "FakeWorker: rescued a term exception TERM TERM"
trap('TERM') do
Rails.logger.info "FakeWorker: trapped another term. trying to ignore this one TERM TERM"
end
while true
Rails.logger.info "FakeWorker: INNER while loop iterating"
begin
Rails.logger.info "FakeWorker: sleeping"
sleep 2
Rails.logger.info "FakeWorker: woke up"
rescue Resque::TermException => e
Rails.logger.info "FakeWorker: INNER while loop rescued a term exception TERM TERM"
end
end
end
end
Rails.logger.info "FakeWorker: stopping naturally"
rescue => e
Rails.logger.info "FakeWorker: Caught some other exception: #{e}"
end
end
end Note we keep going after trapping the second term After playing around a bit, tried out a monkey patch: require 'resque'
require 'resque/errors'
module Resque
class Worker
def unregister_signal_handlers
trap('TERM') do
trap ('TERM') do
# ignore subsequent terms
end
raise TermException.new("SIGTERM")
end
trap('INT', 'DEFAULT')
begin
trap('QUIT', 'DEFAULT')
trap('USR1', 'DEFAULT')
trap('USR2', 'DEFAULT')
rescue ArgumentError
end
end
end
end with an example class: require 'resque/errors'
class FakeWorker
@queue = :bidtable
class << self
def perform
Rails.logger.info "FakeWorker: starting up"
while true
Rails.logger.info "FakeWorker: while loop iterating"
begin
Rails.logger.info "FakeWorker: sleeping"
sleep 2
Rails.logger.info "FakeWorker: woke up"
rescue Resque::TermException => e
Rails.logger.info "FakeWorker: rescued a term exception TERM TERM"
while true
Rails.logger.info "FakeWorker: INNER while loop iterating"
begin
Rails.logger.info "FakeWorker: sleeping"
sleep 2
Rails.logger.info "FakeWorker: woke up"
rescue Resque::TermException => e
Rails.logger.info "FakeWorker: INNER while loop rescued a term exception TERM TERM"
end
end
end
end
Rails.logger.info "FakeWorker: stopping naturally"
rescue => e
Rails.logger.info "FakeWorker: Caught some other exception: #{e}"
end
end
end works fine: I verified with a well-behaved worker as well. If the worker process finishing, all is well. |
|
Amazing. Would you like to work up a patch with this? You've already gone above and beyond the call of duty on this one, so if not, I will, but I don't want to if you already are. |
|
Sure |
|
I'm glad to see this being discussed but I believe there is another, more important, issue here. Raising an exception from within a signal handler is a terrible idea. It's undeterministic, it's impossible to tell where the exception was raised from cause it will raise independently of where the execution is, which makes it really hard to rescue from. It could raise pretty much from any class/method from any loaded library. Instead of doing that, it should signal the workers to gracefully shutdown, re-queuing if necessary, and quitting. It also doesn't make sense to leave that to the hand of the developer, because you're leaking resque internals to the application using it. |
Agreed. That said, I'm trying to change I'll file a ticket to start tracking this particular bit of debt. |
|
Although I agree with being backwards compatible, the current implementation makes it impossible to have a solid/reliable resque setup and even more impossible to stop having SigtermExceptions popping up all over the place. If this gets changed though, it should not impact anyone since I don't believe people are depending on this behavior but more likely getting annoyed by it :) The worst that could happen is that existing |
|
I tend to agree with @divoxx here. Having been frustrated by this personally, I would prefer to target 1-x-stable. |
|
I guess that's true. I guess the main message I want to get across as the maintainer of Resque is that I don't want to 'zomg change EVERYTHING' every time there's a bug. Too many people rely on it to handle things like that. In 2.0, on the other hand.... That said, you are right: this shouldn't affect anyone negatively. My reaction was one of conservatism, not opposition. So sure, if someone wants to change it now, let's do it. |
|
@maletor LOL :) @steveklabnik Yes, I definitely don't want resque to change drastically in 1.x as well. Time is a little bit scarse right now but if no one else steps up, I will give it a try as soon as I can. If anyone else have the time and wanna give it a shot, I'd would appreciate. |
|
@softwaregravy I'm having similar issues with ActiveRecord where I'm geting: Looking at I can see how your patch fixes the issue where Heroku is sending multiple TERMS but how does it also fix the second point where the TERM exception is caught by ActiveRecord? |
|
It wouldn't Before I found this error I as correcting my code by catching Exception (instead of TermException) and matching the message against the word SIGTERM. Because of this confounding behavior, I was unable to conclusively determine if that was a reliable mitigation strategy. Hopefully I'll have a better answer next week. |
|
I see. Looking at https://github.com/rails/rails/blob/v3.2.13/activerecord/lib/active_record/connection_adapters/abstract_adapter.rb#L281 |
|
We don't write to our database during our recovery, so I can't verify just yet. I plan to. looking at the code you linked to, thought, I would expect this to work
so in your code, you should be able to catch this like: rescue Exception => e
if e.message =~ /SIGTERM/
# handle like a Resque::TermException
else
raise
end
endThis is what we do. And since the We do this now, but I don't know if it works perfectly. It was this not working that lead us to discover that heroku was sending the term's twice. Once we've run my patch in production for a few days, I should be able to state with confidence that this works. Right now, I'd say I'm 95% sure |
|
@softwaregravy Really appreciate your input on this. I can see how your solution might work, however in this case its not clean. The problem is that the error raised is undefined method `result' for #<Resque::TermException: SIGTERM> So in your response above 1-3 are correct but in 4, the error is raised in this line https://github.com/rails/rails/blob/v3.2.13/activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb#L1147 because TermException doesn't respond to 'result'. If it did respond to 'result' and 'error_field' it would be re-raised in https://github.com/rails/rails/blob/v3.2.13/activerecord/lib/active_record/connection_adapters/postgresql_adapter.rb#L1153 then caught in the worker. It just seems wrong that TermException should have to respond to these methods so I'm not sure what the correct solution is. |
|
Good point. |
|
Like I mentioned, the real underlying issue is the fact that the exception is being raised from within a signal trap. That means that a exception will be raised from the point in the application that is running at the time the signal is received, causing the exception to be raised pretty much from anywhere. That is why you guys are seeing different exceptions, it makes the behavior totally unpredictable. The only way to really fix this is to stop raising the exception from the signal trap and simply signal the worker to shutdown instead. |
|
@divoxx I'm not disagreeing with you, but I'm not agreeing with you either. At the heart of it, we're dealing with multiple processes and with signals passing between them. One such signal is that "it's time to shut down now". Manifesting these as an exception is a very common paradigm (e.g. java's To be devil's advocate:
If you call a database or have any external dependency, then probably your only reliable option here is to throw an exception anyway. All that said, definitely something worth exploring/debating for v2 |
|
@softwaregravy raising an exception is not the problem, raising an exception from within a trap signal is, because its unpredictable. Also, in Java, InterruptedException is standard and is raised by the VM. That's not how ruby signal handling works and you can't expect the whole Ruby ecosystem and libraries to change just because Resque raises an exception when that happens. |
|
@divoxx That said, I'm starting to agree with you. What would you propose instead? a method hook that could be called from the trap? |
|
@softwaregravy answering your question from the rails issue: Because Resque dispatch each job execution in a forked process, and it's the master process that receives the signal, the master process can just re-queue the job and kill the child process, which would work even if the child process is stuck in a slow query or something. Of course, it would probably be a good idea to have a way for the jobs to specify a cleanup procedure, but that's easily done. And last but not least, yes, rescuing |
|
@softwaregravy yeah, a hook for the worker is prob the best way to do it since even though re-queing the job is common it's definitely not a good default |
|
@divoxx So, full circle back to discussing my suggested change to rails. Lets say we had the hook, then I need to exit my method. I would choose to just raise an exception since there's a chance I'm in the middle of a 20s db operation. Should my error have to implement pg-exception-specific methods? won't I be brittle and potentially break on changes inside the pg gem? What if I'm also calling Redis and redis does something similar to the pg-extension? What if I add a new dependency, do I have to dig through that source code to find out what special methods those dependency clients require their exceptions to support? Even though those exceptions are internal and not part of the interface (hence active_record wrapping the exception)? |
|
@softwaregravy I think you are just proving my point. Don't raise an exception. And if you do don't expect libraries to change because of something you do. Maybe you're not understanding my proposal, so here is the outline:
If you had a open database connection, the shutdown method can disconnect. And even if you don't do that, the child process will quit and OS will close/release file descriptors. |
|
With the current implementation (were it all working correctly) the child can
We routinely take a job of processing, say 100,000 records and break them into 100 workers that process 1000 each. Each worker might resemble the following: e.g. class SomeWorker
def self.perform(start_id = 2000, end_id = 3000, batch_size = 100)
current_end_id = end_id
while start_id < end_id
current_end_id = [start_id + batch_size, end_id].min
SomeModel.where(:id => (start_id..current_end_id)).each {|x| x.do_something }
start_id = current_end_id
end
rescue Exception => e
if e.message =~ SIGTERM
Resque.enqueue(SomeWorker, start_id, end_id, batch_size)
else
raise
end
end
endWhether or not we fix this problem by fixing Rails or not, I think it's incorrect for Rails to be catching |
|
I think that the ideas @divoxx are proposing should definitely be explored in Resque 2. Is that something you would be willing to spike? |
|
Here is my proposal for it. I'll be testing this and will let you know how it goes. https://github.com/doximity/resque/pull/2 |
|
This has been working well for me so far, all If you use resque mailer, this might be useful: https://github.com/doximity/resque_mailer/pull/1 |
|
@jeyb any update on your doximity#2 proposal? Still working? Do you recommend this be merged as-is or modified? |
|
@JackDanger it has worked well for us since the last post. However, we are no longer relying on Heroku therefore we no longer need this patch. |
|
Heroku TERM support: https://github.com/iloveitaly/resque-heroku-signals#readme |
We were following https://github.com/defunkt/resque/pull/638 pretty closely, and implemented handling on our side in line with this: http://hone.heroku.com/resque/2012/08/21/resque-signals.html
In our rescue logic, we try to re-queue ourselves for another worker.
Here's an example failed stack trace:
Seems right -- we get the term exception on shutdown. However, this is coming from our rescue block. Our perform looks like this:
So we catch the term exception, then we try to re-enqueue ourselves, but get another term exception right away.
Any thoughts?
I'm trying to reproduce this in a more deterministic way right now, but wanted to bring this up in case anyone knew what was going on.
Resque version 1.23.0
Procfile for worker:
Update: I've been trying to reproduce this deterministically, and I don't seem to be able to. We do get his error in production fairly often.
The text was updated successfully, but these errors were encountered: