
triggering shutdown by setting a redis flag #434

Merged (4 commits, Jan 12, 2015)

Conversation

@jtushman commented Oct 9, 2014

Problem Statement
We run rq on Heroku, and Heroku's way of shutting down the worker process that runs rq does not ensure a safe shutdown.

From Heroku docs: https://devcenter.heroku.com/articles/dynos

When a worker is stopped by ps:scale, by pushing out a new release, or by any other Heroku operation, the dyno manager sends the worker SIGTERM.

The process then has 10 seconds to shut down gracefully.

If the process is still alive after that, SIGKILL is sent.

So if a job takes more than 10 seconds (which many of our jobs do), we are out of luck: the job will be killed at a potentially unsafe point.

So we needed an approach where we could set a flag in Redis that the rq worker can poll.

I introduced the key 'rq:worker:pause_work', which any process can set. If the worker sees that it has been set, it hops into a pause loop until the key is deleted.

In Worker#work, at the top of the while True: block:

    try:
        before_state = None
        notified = False

        # Spin here for as long as the pause key is set in Redis.
        while Worker.paused() and not self.stopped:

            if burst:
                # In burst mode there is nothing to wait for -- bail out.
                self.log.warn('Paused in burst mode -- exiting.')
                self.log.warn('Note: There could still be unperformed jobs on the queue.')
                raise StopRequested()

            if not notified:
                self.log.warn('Pause flag detected -- worker is paused.')
                before_state = self.get_state()
                self.set_state('paused')
                notified = True
            time.sleep(1)

        # Once the pause key is cleared, restore the worker's previous state.
        if before_state is not None:
            self.set_state(before_state)
    except StopRequested:
        break
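For illustration, any outside process (a deploy script, say) could raise or clear that pause key with plain redis-py calls. This is only a sketch -- the helper names below are made up, and only the key name comes from this PR:

    from redis import StrictRedis

    PAUSE_KEY = 'rq:worker:pause_work'  # key name from this first draft of the PR

    def request_pause(redis_url='redis://localhost:6379/0'):
        """Ask every worker watching this key to stop picking up new jobs."""
        conn = StrictRedis.from_url(redis_url)
        conn.set(PAUSE_KEY, 1)

    def clear_pause(redis_url='redis://localhost:6379/0'):
        """Let paused workers resume normal operation."""
        conn = StrictRedis.from_url(redis_url)
        conn.delete(PAUSE_KEY)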

Contributor (inline review comment)

This is being pretty nit-picky, but there's a race condition here: the paused key could be set between this if statement and the while paused loop, and because Worker.paused() makes a network call, this window is actually a few milliseconds long.

@jtushman (Author)

Cool @nvie, I have cleaned this up and removed the "provisional" tag -- it seems to be working well for us. Let me know your thoughts.

Thanks for the great library

@selwin (Collaborator) commented Oct 25, 2014

From your problem description, shouldn't this situation be handled by the process manager (supervisor or upstart) running rqworker? It feels cleaner that way.

@jtushman (Author)

Hi @selwin. I wish that were the case, but with Heroku we are unable to access the process manager.

Our main option is heroku ps:scale worker=0, which does the following:

  1. Send SIGTERM
  2. Wait 10 seconds (non configurable)
  3. Send SIGKILL

I cannot find it now, but the Heroku folks recommended a database-driven solution like the one implemented here.

I know it adds some complexity to an elegant solution -- but I think this will help a lot of people that use this on Heroku or similar systems.

I wonder if @kennethreitz (the Python advocate at Heroku) has anything to add to this discussion.

@selwin (Collaborator) commented Oct 28, 2014

Ah ok, got it.

rqworker already listens for the SIGTERM signal and will perform a warm shutdown when that signal is received. If it doesn't already do that, then we should open an issue and treat this as a bug.

In the case that a job takes longer than 10 seconds to complete and the worker is killed by SIGKILL, the master branch has a StartedJobRegistry which keeps track of started jobs that didn't finish properly, so you'll be able to track jobs that are killed by cold shutdowns. These jobs will eventually be moved to the FailedQueue, so you can choose to retry or delete failed jobs from there.
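For illustration, requeueing jobs that landed in the failed queue after a cold shutdown could look roughly like this (a sketch only; the exact helpers vary between RQ versions):

    from redis import StrictRedis
    from rq import get_failed_queue

    conn = StrictRedis()
    failed = get_failed_queue(conn)   # jobs killed by a cold shutdown end up here
    for job in failed.jobs:
        failed.requeue(job.id)        # or job.delete(), if the job should be dropped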

Does that help?

@ThisGuyCodes (Contributor)

Sort of. But that still leaves us without a way to shut down cleanly. Intentionally failing and then retrying whatever job is currently running is certainly an option, but hardly a good one.

@nvie (Collaborator) commented Oct 29, 2014

Hey @jtushman, thanks for the PR. Definitely a missing feature. Heroku is a well-known platform that can cause annoying issues like this.

I think the initial stab you took at it is fine, but I have a few comments:

  • What I don't like yet is the "global nature" of it. We need a way of controlling the spin-down process for each worker individually.
  • The .pause() method (because it's a method) implies it's scoped to the worker, but it's not. Calling .pause() on one worker will pause them all.

What if we flip the logic here? We can have each worker check whether it may continue with the next job in some way. These conditionals act as traffic lights, essentially. And while we're at it, why not have more than one such conditional?

Each worker could then be instantiated as follows:

$ rqworker -c foo -c bar

This would mean that rqworker will continue processing jobs unless the Redis key foo or bar exists. As soon as any of these keys exists, it should go into "paused" mode. (Not specifying any traffic lights using -c means: never check, always continue -- i.e. the current behaviour.)

This way you can have fine-grained control over your deployments. The simplest case would be:

hi: rqworker -c heroku:maintenance high
lo: rqworker -c heroku:maintenance high default low

Now, to start a new deployment, you would simply create the Redis key heroku:maintenance (e.g. using SET heroku:maintenance 1). It might even be reasonable to have an extra CLI subcommand to set those in a specific env. It'd be extra nice if that tool supported setting these values with an expiry, too:

$ rq flag --expires-in=60 heroku:maintenance   # briefly pause
$ rq flag heroku:maintenance   # set flag, until unset
$ rq flag -d heroku:maintenance  # unset flag

What do you think of this alternative? We could easily document this option as the recommended way of doing Heroku deployments.
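A worker-side check for those traffic lights could look something like this -- purely a sketch of the proposal, with a hypothetical helper name:

    from redis import StrictRedis

    def may_continue(conn, traffic_lights):
        """Return True when none of the given 'traffic light' keys exist.

        `traffic_lights` would come from the proposed -c options, e.g.
        ['heroku:maintenance']. An empty list means: never check, always continue.
        """
        return not any(conn.exists(key) for key in traffic_lights)

    conn = StrictRedis()
    if not may_continue(conn, ['heroku:maintenance']):
        pass  # go into "paused" mode instead of dequeueing the next job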

@nvie (Collaborator) commented Oct 29, 2014

Actually, we should be scoping these flags. Perhaps the best solution is to use a global set, like rq:flags, using SADD rq:flags maintenance to raise a flag and SISMEMBER rq:flags maintenance to test it each round. The idea remains the same, though.
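In redis-py terms, that scoped variant boils down to something like this (illustrative only):

    conn.sadd('rq:flags', 'maintenance')                # raise the flag
    paused = conn.sismember('rq:flags', 'maintenance')  # each worker tests this every round
    conn.srem('rq:flags', 'maintenance')                # lower the flag again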

@selwin (Collaborator) commented Oct 29, 2014

Sorry, but it feels like there's something I'm missing. Doesn't Worker already perform a clean shutdown when a SIGTERM is sent? (Please ignore the case where RQ is performing a long-running job.)

@nvie (Collaborator) commented Oct 29, 2014

Yes, it will. The effect is essentially the same: it will finish the current work, then stop. The difference is that in the Heroku context you cannot control the Unix process itself, so (1) you cannot send it SIGTERM, and (2) Heroku will kill it after 10 seconds, even if the current job requires more time.

This approach provides a control mechanism driven via Redis, and therefore independent of the platform it's running on. The deployment would essentially be:

  1. Run rq flag maintenance
  2. Wait until all work is done, no matter how long it takes (safe)
  3. Do the deployment
  4. Run rq flag -d maintenance

@ThisGuyCodes (Contributor)

@selwin we can't ignore the case when RQ is performing a long-running job, because that's the entire problem we're trying to solve (a job takes longer than Heroku gives us to shut down).

I do like @nvie's approach, but it also gave me an idea: along the same lines, I think it also makes sense for each queue to have a pause key. It may require a bit more work (and in fact might need to be a separate conversation), but I feel that would cover more use cases (stopping a specific kind of work instead of specific workers).

Also, on Heroku, if we shut down the process (instead of doing some internal pause/sleep), Heroku will automatically restart it. So no matter which approach we take, we need an internal pause/sleep rather than a shutdown (that's what this PR does so far, but I felt it was worth noting).

@selwin (Collaborator) commented Oct 29, 2014

@nvie from my understanding of Heroku's docs, when Heroku initiates a shutdown, SIGKILL is automatically sent to the process.

@conslo I may be mistaken, but this is what I think: if a shutdown is requested by Heroku's dyno manager, it won't restart the process in the same dyno when the process stops (it doesn't make sense to restart a process they want to shut down). This line in Heroku's docs about dynos implies that it is enough for us to simply shut down the process when we receive Heroku's shutdown signal:

The application processes have ten seconds to shut down cleanly (ideally, they will do so more quickly than that).

Heroku's own code sample directly below also supports my hypothesis:

STDOUT.sync = true
puts "Starting up"

trap('TERM') do
  puts "Graceful shutdown"
  exit
end

loop do
  puts "Pretending to do work"
  sleep 3
end
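A rough Python equivalent of that sample, for comparison (not RQ's actual signal handling, just the same idea):

    import signal
    import sys
    import time

    def graceful_shutdown(signum, frame):
        print("Graceful shutdown")
        sys.exit(0)

    signal.signal(signal.SIGTERM, graceful_shutdown)

    print("Starting up")
    while True:
        print("Pretending to do work")
        time.sleep(3)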

So in short, from my limited understanding of Heroku's dynos, I'm not convinced that this suspend/resume feature will help in handling the dyno manager's shutdown requests more elegantly.

There could be other use cases for suspending workers, of course. And if we really want to implement this pause/resume feature, I'd suggest the following APIs:

rq suspend queue_name --connection-params
rq resume queue_name --connection-params

Thoughts?

@ThisGuyCodes (Contributor)

@selwin everything you said is correct. If we respond to the shutdown request by stopping the process within ten seconds, things will shut down cleanly. And this feature will not, in fact, aid in handling those shutdown requests. But that's not what we're trying to fix.

Currently, when we go to deploy new code, we have two choices:

  • Kill our workers and cause whatever jobs may be running to fail (we have very few jobs that take < 10 seconds)

Or

  • Check worker status until no jobs are running (may take a while depending on queue)
  • (race condition here)
  • Kill the workers to prevent new jobs from starting. (heroku ps:scale worker 0)
  • Push new code.
  • Scale the workers back up. (heroku ps:scale worker N)

What we want is a cleaner choice, which as proposed amounts to this:

  • Flip a bit in redis.
  • Wait for all workers to report as paused.
  • Push new code (this causes workers to restart with new codebase)
  • Flip bit back.

This would allow us to push new code knowing that we wouldn't cause jobs to fail. Currently this guarantee isn't possible.
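For illustration, that workflow might look roughly like this with redis-py and RQ's Worker.all(); the key name and the 'paused' state string follow the discussion in this thread, so treat it as a sketch rather than the final API:

    import time
    from redis import StrictRedis
    from rq import Worker

    conn = StrictRedis()
    conn.set('rq:suspend', 1)   # flip the bit

    # wait for every worker to report as paused before pushing new code
    while any(w.get_state() != 'paused' for w in Worker.all(connection=conn)):
        time.sleep(1)

    # ... push new code here; workers restart against the new codebase ...

    conn.delete('rq:suspend')   # flip the bit back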

@ThisGuyCodes (Contributor)

But I do like your suspend/resume queue idea, yes.

@jtushman (Author)

⬆️ I just made a preliminary implementation of @nvie's suggestions. Let me know your thoughts. If we like it, I will add tests, etc.

@jtushman (Author)

Oh, and I could not do the expiry thing with SADD (I am not a Redis expert, but I do not think expiry is supported for individual set members).

@selwin (Collaborator) commented Nov 1, 2014

@conslo thanks for the explanation. I see what you mean now.

In terms of implementation, I prefer a more explicit rq suspend and rq resume approach rather than rq flag suggested by @nvie. @nvie, what do you think?

@nvie
Copy link
Collaborator

nvie commented Nov 1, 2014

I initially went with the flag approach because I thought it'd be useful to allow you to mix and match the behaviour to your liking — i.e. not necessarily tie suspending to queues. For example, you could set a single flag and let all your workers watch that flag, so suspending all workers just takes setting a single flag. Or you could stop one worker, but not others.

But perhaps stopping per queue is also a good option. To support flags-per-queue, the change required is going to be a bit more involved, because we need to pull apart the BLPOP behaviour. Specifically, this line essentially needs to change to watch queue_keys - flagged_queues. Coordinating this alternative will be a bit trickier (i.e. making sure the worker will pause and keep checking), but it can certainly be done.
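Filtering the keys handed to BLPOP might be as simple as the sketch below (the rq:flags set name is just the idea from the earlier comment, not an actual API):

    def keys_to_watch(conn, queue_keys):
        """Drop queues whose pause flag is set before handing the keys to BLPOP."""
        return [key for key in queue_keys
                if not conn.sismember('rq:flags', key)]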

I think there's a case for both options. Does @jtushman have a preference?

@selwin (Collaborator) commented Nov 1, 2014

What if we initially just go for the global suspend/resume approach to keep things simple?

You can use a different Redis DB number if you want to configure this on a per-queue basis.
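For illustration, splitting workers across Redis DB numbers would look roughly like this (the names are made up):

    from redis import StrictRedis
    from rq import Queue, Worker

    # workers you want to control separately talk to different Redis DBs
    conn_a = StrictRedis(db=0)
    conn_b = StrictRedis(db=1)

    worker_a = Worker([Queue('high', connection=conn_a)], connection=conn_a)
    worker_b = Worker([Queue('low', connection=conn_b)], connection=conn_b)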


Inline review comment on Vagrantfile and .idea (Collaborator):

Please ignore these in a global ignore config.

@nvie (Collaborator) commented Nov 1, 2014

Sounds good, @selwin. If we all agree on the flag solution, let's go ahead and review the current PR, polish it, add tests and documentation.

@selwin (Collaborator) commented Nov 1, 2014

Oh, I meant using "rq suspend" and "rq resume", but on a global basis to make it easy. Is that ok?


@jtushman (Author) commented Nov 2, 2014

Hey guys,

Either approach works for me. I think the CLI of rq suspend / rq resume would be easier to document and explain and would probably serve 90% of the need. We can keep the flag implementation under the hood -- so we can extend it easily -- and give that functionality to power users off the bat.

I can work on this early next week.

@nvie (Collaborator) commented Nov 3, 2014

Sure, let's start with the rq suspend and rq resume per queue implementation, as @selwin suggested. Thanks for wanting to work on this @jtushman! (Btw, the flag implementation does not work for the per-queue case, see the 2nd paragraph in this comment, but per-queue is simpler to understand I think, so better. Let's forget the flag idea for now.)

@selwin (Collaborator) commented Nov 3, 2014

Terribly sorry if there was any miscommunication, but I think I meant "rq suspend" and "rq resume" on a global basis to keep things simple. Please see my comment here: #434 (comment)

I share @nvie's feeling that implementing this on a per-queue basis may be a bit hairy.

@nvie (Collaborator) commented Nov 3, 2014

Oh, were you suggesting in #434 (comment) to do it like the current implementation by @jtushman, but with a terminology change (i.e. instead of "flags" or "pause", name them "suspend" and "resume")?

@selwin (Collaborator) commented Nov 3, 2014

Yes, like @jtushman suggested, I think we should do it the simplest possible way, i.e. the Worker should check whether the rq:suspend key exists and suspend if it does, and just leave it at that.

Since this covers the large majority of use cases, we should just leave it at that and forget about the ability to set arbitrary flags for now. Anything more complex can be implemented using custom worker classes (this is why we built custom worker classes ;).
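That check is small enough to sketch in a few lines; the key name follows this comment, and the helper shown here is illustrative rather than the merged code:

    def is_suspended(connection):
        """Return True when the global suspend key is set."""
        return bool(connection.exists('rq:suspend'))

    # inside the worker's dequeue loop, before picking up the next job:
    #     while is_suspended(connection):
    #         time.sleep(1)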

@jtushman (Author) commented Nov 4, 2014

Ok took a stab at the global pause/resume approach. Added tests. Thoughts?

@selwin (Collaborator) commented Nov 5, 2014

I'll try to find some time to review this PR this weekend. Thanks!




def suspend(connection, ttl=None):
if ttl:
Collaborator (inline comment):

This should be if ttl is None because passing 0 into ttl should not cause the key to live forever.

@jtushman (Author):

Hmmm -- I think we should make that check at the CLI level, if at all. To me, setting ttl=0 SHOULD cause the key to live forever (or until resume is called). That seems to me to be expected TTL behavior.

But I will default to you / @nvie lemme know.

Everything else is easy breezy
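To make the difference concrete, here is a sketch of the two readings being discussed; the key name and calls are illustrative, not the merged code:

    def suspend(connection, ttl=None):
        # With `if ttl:`, a ttl of 0 falls through to a plain SET and the key
        # lives until resume is called; with `if ttl is not None:`, a ttl of 0
        # would be passed through to Redis, which rejects a zero expiry.
        if ttl:
            connection.setex('rq:suspend', ttl, 1)
        else:
            connection.set('rq:suspend', 1)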

@coveralls

Coverage Status

Coverage increased (+0.06%) when pulling 0f3ea8c on jtushman:redis-based-shutdown into 786d3c5 on nvie:master.


@coveralls

Coverage Status

Coverage increased (+0.2%) when pulling 162a02f on jtushman:redis-based-shutdown into 786d3c5 on nvie:master.

@jtushman (Author) commented Dec 2, 2014

Cool -- how does this look? BTW, I know this is getting to be a fair number of commits (25). Should we squash them? I am not sure what that does to the conversation on the PR.

Note: reposting my comment from an outdated diff, since it's collapsed now:

Regarding the arg error on connection#set: that was a redis version issue. I was running 2.9.x, but setup.py only requires 2.7.0.
(I do have a test for it: test_cli#test_suspend_with_ttl.)

That's why I updated requirements.txt to match setup.py, so Travis will catch these errors.

@jtushman (Author)

Hey guys -- any thoughts on this? Let me know if you have any additional feedback.

@selwin (Collaborator) commented Dec 12, 2014

I'll review this again this weekend. Yeah, it would be great if we could squash the commits together, as there was a fair bit of noise from the earlier implementation attempts.

@jtushman (Author)

Squashed away!

@@ -158,7 +163,12 @@ def worker(url, config, burst, name, worker_class, job_class, queue_class, path,
worker_class = import_attribute(worker_class)
queue_class = import_attribute(queue_class)

if is_suspended(conn):
click.secho("The worker has been paused, run reset_paused", fg='red')
Collaborator (inline comment):

This should say: "RQ is currently suspended, to resume job execution run rq resume".

@selwin (Collaborator) commented Dec 15, 2014

@jtushman also, it looks like this PR doesn't merge cleanly anymore. Mind taking a look? If you can fix this within the next few days, I'll refrain from merging other PRs so we can get this merged in quickly, since this PR touches quite a number of files.

Thanks for the effort and hard work!

@coveralls

Coverage Status

Coverage decreased (-0.15%) when pulling 1ae5a12 on jtushman:redis-based-shutdown into cd0c3c9 on nvie:master.

@coveralls

Coverage Status

Coverage increased (+0.19%) when pulling 60c27d5 on jtushman:redis-based-shutdown into cd0c3c9 on nvie:master.

@selwin (Collaborator) commented Dec 24, 2014

Aside from a few tiny typos and docs, I think this PR is ready. Any objection if I merge this in, @nvie?

@jtushman (Author) commented Jan 5, 2015

Happy New Year @nvie, @selwin! I am back from break -- lemme know if this needs any more polish

@selwin (Collaborator) commented Jan 12, 2015

I'm going to merge this in. Thanks a lot for your hard work, @jtushman!

selwin added a commit that referenced this pull request on Jan 12, 2015: "triggering shutdown by setting a redis flag"

@selwin merged commit 31dcb57 into rq:master on Jan 12, 2015
@jtushman (Author)

Woot!

@dfrankow

I see that this was merged in, including into the rq command-line client ("rq suspend" and "rq resume").

But how exactly should this be used by people running on Heroku? Are there any docs?
