requires resque EXACTLY 2.0.0 #5

Closed

jrochkind opened this issue Jun 7, 2021 · 12 comments

Comments

@jrochkind

The dependency expressed on resque is for exactly 2.0.0. resque 2.0.1 or 2.1.0 would not be usable with resque-heroku-signals.

https://github.com/iloveitaly/resque-heroku-signals/blob/master/resque-heroku-signals.gemspec#L22

This is not a problem right now because resque 2.0.0 is the latest release of resque, and it may possibly remain so forever... But it's still not very good practice; the dependency should probably be `~> 2.0` instead?
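For illustration, here's roughly what the two constraint styles look like in a gemspec (a sketch, not the gem's actual file):

```ruby
Gem::Specification.new do |spec|
  spec.name    = "resque-heroku-signals"
  spec.version = "2.0.0"

  # exact pin, as the gemspec has today: only resque 2.0.0 ever satisfies it
  spec.add_dependency "resque", "2.0.0"

  # pessimistic constraint, as suggested: any resque >= 2.0 and < 3.0
  # spec.add_dependency "resque", "~> 2.0"
end
```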

Or is the idea that you'd release a new version of resque-heroku-signals with every single release of resque, even 2.0.1?

It's a bit weird that this is a separate gem instead of being built into resque in the first place; is the takeaway that resque is pretty much unmaintained?

@iloveitaly
Collaborator

@jrochkind there's a line on the readme which explains this a bit:

Since this gem monkeypatches the Heroku worker the gemspec is locked to a x.x.x version of Resque to ensure the monkeypatched logic is compatible with any changes in the original Resque logic. Issue a PR if this is not compatible with the version of resque you are using.

Monkeypatching isn't great practice, but it's the easiest option here, and the strict dependency in the gemspec mitigates some of the risk.
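To illustrate the tradeoff, this is the generic shape of the approach (a sketch only, not this gem's actual code; the method name is hypothetical):

```ruby
require "resque"

module Resque
  class Worker
    # Reopening the class and replacing a method means the replacement is
    # written against one specific resque release, hence the exact pin.
    def some_heroku_signal_handling # hypothetical method name
      # replacement logic mirroring the pinned resque version goes here
    end
  end
end
```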

@jrochkind
Author

jrochkind commented Jun 8, 2021

OK, thanks! Sorry for missing that. Makes some sense!

Really, this fix should be in resque! If resque never releases another version again it won't matter. But it's going to be a pain for everyone if resque releases, say, a 2.0.1 with a security patch some day: we'd all need to update in a hurry because it's a security patch, but we couldn't until there's a new release of resque-heroku-signals, and then hopefully you're around and have time to act quickly!

Resque's current maintenance story is kind of painful. :(

@iloveitaly
Collaborator

Resque's current maintenance story is kind of painful. :(

Yeah, it is. If you know of anyone there who is open to inviting others to help with maintenance, let me know / drop me an email.

@jrochkind
Author

I don't know anyone there, but the hard part is usually finding volunteers for maintenance. I wonder if we should try harder to let them know you are such a volunteer. But yeah, this stuff is hard! `gem owner resque` mostly lists people who I think aren't involved anymore. :( rafaelfranca is the recognizable name who's still involved, but he's so busy it's hard to get his attention for anything, and I don't have a personal relationship with him.

This stuff is really challenging.

Thanks for your work on this gem! I still prefer resque to sidekiq, and it's been so confusing to figure out how to get resque to restart gracefully in non-heroku environments. Now that I'm moving to heroku, at least you've solved it there!

@iloveitaly
Collaborator

iloveitaly commented Jun 8, 2021 via email

@jrochkind
Author

Hey sorry, this isn't exactly relevant, but it is SO NICE to find someone to actually talk to about resque...

  • I wasn't totally sure that the heroku graceful shutdown was working properly...
  • So looking at the source I realized there was some useful logging around it that might tell me what's really going on (https://github.com/resque/resque/blob/master/lib/resque/worker.rb#L536-L550)....
  • First I had to figure out that resque by default doesn't log ANYTHING, and figure out how to get it to log and at the :debug level too, which is all documented kind of confusingly...
  • only to find, after that, that logging in new_kill_child in fact does not work, and has not worked since ruby 2.0, because new_kill_child is always called inside a signal trap handler, and ruby 2.0 won't let you log to ::Logger inside a signal trap handler (Logger locks a mutex, and mutexes can't be locked from trap context; see the sketch below).
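Here's a minimal sketch of that restriction (assuming Ruby 2.x; this isn't resque code, just the underlying behavior):

```ruby
# Logger synchronizes writes with a mutex, and in Ruby 2.x locking a mutex
# from inside a trap handler raises ThreadError, so Logger calls made in
# Signal.trap blocks fail.
mutex = Mutex.new

Signal.trap("TERM") do
  begin
    mutex.synchronize { $stdout.puts "this never runs" }
  rescue ThreadError => e
    $stdout.puts "can't log from trap context: #{e.message}"
  end
end

Process.kill("TERM", Process.pid)
sleep 0.1 # give the trap handler a chance to run
```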

OMG. Phew! While I've been using resque for a while without problems, something about trying to migrate to heroku is making me bang my head against a variety of abandoned resque problems. (To be fair, I never fully solved the graceful shutdown problem before either, but in my pre-heroku setup it just didn't come up as much.)

But, ok, are you @iloveitaly currently using resque on heroku, and with TERM_CHILD set and resque-heroku-signals installed, do you have some confidence it's working? Would you care to share the values you use for RESQUE_PRE_SHUTDOWN_TIMEOUT and RESQUE_TERM_TIMEOUT? I think the sum of them should be less than heroku's 30s (?), but I don't understand what happens after resque sends the TERM well enough to have an idea of how to distribute time between the two.
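For context, this is roughly the relationship I mean (an illustrative sketch, not resque's actual code):

```ruby
# As I understand it: the worker waits RESQUE_PRE_SHUTDOWN_TIMEOUT before
# sending TERM to the child, then waits up to RESQUE_TERM_TIMEOUT before a
# KILL, so the two together need to fit inside heroku's ~30s shutdown window.
pre_shutdown = ENV.fetch("RESQUE_PRE_SHUTDOWN_TIMEOUT", "0").to_f
term_timeout = ENV.fetch("RESQUE_TERM_TIMEOUT", "4").to_f
warn "timeouts exceed heroku's ~30s window" if pre_shutdown + term_timeout >= 30
```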

Thanks for any advice... definitely considering getting off of resque (or forking resque?), but there are so few options these days... it's kind of a monoculture.

@jrochkind
Author

(I have RESQUE_PRE_SHUTDOWN_TIMEOUT set to 20, but I really don't think it's actually waiting 20 seconds. If I watch the resque dashboard with 'live poll', I see the workers disappearing (and being recorded as failed jobs; at least they are successfully recorded as failed jobs!) in what seems to me much quicker than 20 seconds.

Which is why I wanted to investigate the logging to see if I could see what's actually going on... but THAT'S impossible, so not really sure what to do)

@iloveitaly
Collaborator

iloveitaly commented Jun 9, 2021

@jrochkind

  • First I had to figure out that resque by default doesn't log ANYTHING, and figure out how to get it to log and at the :debug level too, which is all documented kind of confusingly...

If there are any documentation improvements to be made, make sure to submit them to the resque repo. I just got privileges to start helping triage PRs and issues there, so hopefully I'll be able to start getting things merged there too.

But, ok, are you @iloveitaly currently using resque on heroku

Yes, I am and it's working fine. The exact values I'm using for my setup are documented here:

https://github.com/iloveitaly/resque-heroku-signals#example-procfile

Do those match what you have in place in production on your end?

@jrochkind
Author

Thanks!

I'm not changing INTERVAL from its default, but I don't think that matters. I may do some more experimentation tomorrow and figure out some hacky way to get logging, for the purposes of seeing what's going on with the new_ shutdown.

I just got privileges to start helping triage PRs and issues there

Great news!

I don't quite understand what's intended with the logging design well enough to PR a doc change; I just sort of hacked it to work enough for debugging for now. (Are you supposed to set a custom formatter that does something other than swallow all log entries (regardless of the log level set!), as the default one does? Or are you just supposed to set a whole new logger? If using Rails, is `Resque.logger = Rails.logger` a fine idea or a bad one? It works, although the resque log lines in the context of the Rails log aren't always easily identifiable as resque lines, which is confusing. I honestly haven't figured out how to set up logging in a way that seems 'standard' to me.)
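Concretely, the kind of setup I mean is something like this (a sketch of what I hacked together, not necessarily the intended configuration):

```ruby
require "resque"
require "logger"

# Replace the default logger (whose formatter swallows output) with a plain
# stdout Logger at debug level.
Resque.logger = Logger.new($stdout)
Resque.logger.level = Logger::DEBUG

# or, in a Rails initializer:
# Resque.logger = Rails.logger
```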

@jrochkind
Author

@iloveitaly OK, in order to try to observe what's going on and verify "graceful" shutdown....

I have RESQUE_PRE_SHUTDOWN_TIMEOUT=20 and RESQUE_TERM_TIMEOUT=8. Resque 2.0.0, resque-heroku-signals 2.0.0.

  1. I hackily patched resque to log just via $stdout.puts (roughly as sketched after this list), so I can actually see all log lines, even the ones in signal traps.
  2. I have one worker dyno with one resque process
  3. I enqueue some jobs that just do sleep 200, so I can ensure they will be interrupted
  4. I do a CLI heroku ps:restart worker
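The hack in step 1 was roughly this shape (a hypothetical reconstruction, not the exact patch):

```ruby
require "resque"

# A logger whose methods write straight to $stdout, sidestepping the
# mutex-in-trap problem that silences the normal ::Logger.
class PutsLogger
  %i[debug info warn error fatal].each do |severity|
    define_method(severity) do |msg = nil, &block|
      $stdout.puts "resque [#{severity}]: #{msg || block&.call}"
    end
  end

  # no-ops so code that tries to configure the logger doesn't blow up
  def level=(*); end
  def formatter=(*); end
end

Resque.logger = PutsLogger.new
```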

It can be really confusing to observe even with logs... but I think it does look right as far as the timing of when the job is killed. (copy and paste from paperclip)

Shutdown begins at 11:38:06, great, and says "Waiting 20.0s for child process to exit". So we'd expect the process won't be interrupted until around 11:38:26... and it looks like that's what happens: at 11:38:26 we get "Sending TERM signal to child 11" and "[resque-heroku] received second term signal, throwing term exception", resulting in Resque::TermException.

And the job does show up in the resque failed queue, great.

The only somewhat unusual thing observed is this also shows up in logs (twice) at that 11:38:26 time:

resque: error: Child 11 already quit and reaped.

That's from a rescue in resque code... https://github.com/resque/resque/blob/master/lib/resque/worker.rb#L554

So it's hard to say what's triggering it, as no stack trace of the original exception is logged. Perhaps it's just an artifact of the algorithm under the heroku environment, to be expected, and not to be considered a problem even though it's logged as "error".

(And of course, ordinarily there's no way to see it, as you can't log to a ruby Logger in a signal trap handler, so without hacky patching you're never going to see any of these log lines anyway).

OK, I think I reassured myself! Thanks for listening. :)

In other news, if you're looking for PRs or issues to triage: to get things working right on heroku, in addition to this gem I also needed to monkey-patch in resque/resque#1739. I'm not totally sure that's the right solution (is it leaving orphan redis connections open in some cases?) and I don't fully understand what's going on there, but I was getting those SSL_read: sslv3 alert bad record mac errors, and no longer do with that patch applied.

@jrochkind
Author

jrochkind commented Jun 9, 2021

Actually there is one remaining problem I am seeing...

In some cases, the killed worker leaves itself registered in resque. So the count of workers in resque admin is off, there is one "phantom" worker.

  1. I'm not sure why the worker ends up remaining on the resque list of workers despite being gone. RESQUE_TERM_TIMEOUT=8 should be plenty of time for the worker to remove itself from the resque list before dying? But this doesn't seem to happen every time, so I guess it's some kind of race condition.

  2. I'm not sure why these aren't pruned by resque. While resque has some logic that's supposed to eventually prune workers that haven't sent a heartbeat in a while, I'm not sure it ever actually prunes them; there may be a bug in that logic. :( That worker does show up in Resque::Worker.all_workers_with_expired_heartbeats, and it looks like prune_dead_workers should get called and remove these on, at least, worker startup... but restarting my worker dynos (and thus my resque workers) never actually seems to prune them.

They can be cleaned up with `Resque::Worker.all_workers_with_expired_heartbeats.each { |w| w.unregister_worker }` (not sure if there's a better way to do this).
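(A sketch of how that cleanup could be wrapped for reuse, assuming a Rails app; the task name is made up:)

```ruby
namespace :resque do
  desc "Unregister workers whose heartbeats have expired"
  task prune_phantom_workers: :environment do
    Resque::Worker.all_workers_with_expired_heartbeats.each(&:unregister_worker)
  end
end
```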

@iloveitaly
Collaborator

@jrochkind the zombie worker issue is something I've run into. There's a thread here where others have reported it as well:

resque/resque#1704

Hoping someone can write up a PR and I can help get it merged.
