
triggering shutdown by setting a redis flag #434

Merged (4 commits, Jan 12, 2015)

Conversation

@jtushman commented Oct 9, 2014

Problem Statement
We run rq on Heroku, and Heroku's way of shutting down the worker process that runs rq does not ensure a safe shutdown.

From Heroku docs: https://devcenter.heroku.com/articles/dynos

When a worker is stopped by ps:scale, by pushing out a new release, or by any other Heroku operation, the dyno manager sends the worker SIGTERM.

The process then has 10 seconds to shut down gracefully.

If the process is still alive after that, SIGKILL is sent.

So if a job takes more than 10 seconds (which many of our jobs do), we are out of luck: the job will be killed at a potentially unsafe point.

So we needed an approach where we could set a flag in Redis that the rq worker can poll.

I introduced the key 'rq:worker:pause_work', which any process can set. If the worker sees that it has been set, it hops into a pause loop until the key is deleted.

In Worker#work, at the top of the while True: block:

    try:
        before_state = None
        notified = False

        # Spin here for as long as the pause key is set in Redis.
        while Worker.paused() and not self.stopped:

            if burst:
                # In burst mode there is nothing to wait for -- bail out.
                self.log.warn('Paused in burst mode -- exiting.')
                self.log.warn('Note: There could still be unperformed jobs on the queue.')
                raise StopRequested()

            if not notified:
                self.log.warn('Pause flag detected -- worker is paused.')
                before_state = self.get_state()
                self.set_state('paused')
                notified = True
            time.sleep(1)

        # Once the pause key is cleared, restore the worker's previous state.
        if before_state is not None:
            self.set_state(before_state)
    except StopRequested:
        break
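For illustration, any outside process (a deploy script, say) could raise or clear that pause key with plain redis-py calls. This is only a sketch -- the helper names below are made up, and only the key name comes from this PR:

    from redis import StrictRedis

    PAUSE_KEY = 'rq:worker:pause_work'  # key name from this first draft of the PR

    def request_pause(redis_url='redis://localhost:6379/0'):
        """Ask every worker watching this key to stop picking up new jobs."""
        conn = StrictRedis.from_url(redis_url)
        conn.set(PAUSE_KEY, 1)

    def clear_pause(redis_url='redis://localhost:6379/0'):
        """Let paused workers resume normal operation."""
        conn = StrictRedis.from_url(redis_url)
        conn.delete(PAUSE_KEY)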

Contributor (inline review comment)

This is being pretty nit-picky, but there's a race condition here: the paused key could be set between this if statement and the while paused loop, and because Worker.paused() makes a network call, this window is actually a few milliseconds long.

@jtushman (Author)

Cool @nvie, I have cleaned this up and removed the "provisional" tag -- it seems to be working well for us. Let me know your thoughts.

Thanks for the great library

@selwin (Collaborator) commented Oct 25, 2014

From your problem description, shouldn't this situation be handled by the process manager (supervisor or upstart) running rqworker? It feels cleaner that way.

@jtushman (Author)

Hi @selwin. I wish that were the case, but with Heroku we are unable to access the process manager.

Our main option is heroku ps:scale worker=0, which does the following:

  1. Send SIGTERM
  2. Wait 10 seconds (non configurable)
  3. Send SIGKILL

I cannot find it now, but the Heroku folks recommended a database-driven solution like the one implemented here.

I know it adds some complexity to an elegant solution -- but I think this will help a lot of people that use this on Heroku or similar systems.

I wonder if @kennethreitz (the Python advocate at Heroku) has anything to add to this discussion.

@selwin (Collaborator) commented Oct 28, 2014

Ah ok, got it.

rqworker already listens for the SIGTERM signal and will perform a warm shutdown when that signal is received. If it doesn't already do that, then we should open an issue and treat this as a bug.

In the case that a job takes longer than 10 seconds to complete and the worker is killed by SIGKILL, the master branch has a StartedJobRegistry which keeps track of started jobs that didn't finish properly, so you'll be able to track jobs that are killed by cold shutdowns. These jobs will eventually be moved to the FailedQueue, so you can choose to retry or delete failed jobs from there.
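For illustration, requeueing jobs that landed in the failed queue after a cold shutdown could look roughly like this (a sketch only; the exact helpers vary between RQ versions):

    from redis import StrictRedis
    from rq import get_failed_queue

    conn = StrictRedis()
    failed = get_failed_queue(conn)   # jobs killed by a cold shutdown end up here
    for job in failed.jobs:
        failed.requeue(job.id)        # or job.delete(), if the job should be dropped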

Does that help?

@ThisGuyCodes (Contributor)

Sort of. But that still leaves us without a way to shut down cleanly. Intentionally failing and then retrying whatever job is currently running is certainly an option, but hardly a good one.

@nvie (Collaborator) commented Oct 29, 2014

Hey @jtushman, thanks for the PR. Definitely a missing feature. Heroku is a well-known platform that can cause annoying issues like this.

I think the initial stab you took at it is fine, but I have a few comments:

  • What I don't like yet is the "global nature" of it. We need a way of controlling the spin-down process for each worker individually.
  • The .pause() method (because it's a method) implies it's scoped to the worker, but it's not. Calling .pause() on one worker will pause them all.

What if we flip the logic here? We can have each worker check whether it may continue with the next job in some way. These conditionals act as traffic lights, essentially. And while we're at it, why not have more than one such conditional?

Each worker could then be instantiated as follows:

$ rqworker -c foo -c bar

This would mean that rqworker will continue processing jobs unless the Redis key foo or bar exists. As soon as any of these keys exists, it should go into "paused" mode. (Not specifying any traffic lights using -c means: never check, always continue -- i.e. the current behaviour.)

This way you can have fine-grained control over your deployments. The simplest case would be:

hi: rqworker -c heroku:maintenance high
lo: rqworker -c heroku:maintenance high default low

Now, to start a new deployment, you would simply create the Redis key heroku:maintenance (e.g. using SET heroku:maintenance 1). It might even be reasonable to have an extra CLI subcommand to set those in a specific env. It'd be extra nice if that tool supported setting these values with an expiry, too:

$ rq flag --expires-in=60 heroku:maintenance   # briefly pause
$ rq flag heroku:maintenance   # set flag, until unset
$ rq flag -d heroku:maintenance  # unset flag

What do you think of this alternative? We could easily document this option as the recommended way of doing Heroku deployments.
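A worker-side check for those traffic lights could look something like this -- purely a sketch of the proposal, with a hypothetical helper name:

    from redis import StrictRedis

    def may_continue(conn, traffic_lights):
        """Return True when none of the given 'traffic light' keys exist.

        `traffic_lights` would come from the proposed -c options, e.g.
        ['heroku:maintenance']. An empty list means: never check, always continue.
        """
        return not any(conn.exists(key) for key in traffic_lights)

    conn = StrictRedis()
    if not may_continue(conn, ['heroku:maintenance']):
        pass  # go into "paused" mode instead of dequeueing the next job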

@nvie (Collaborator) commented Oct 29, 2014

Actually, we should be scoping these flags. Perhaps the best solution is to use a global set, like rq:flags, using SADD rq:flags maintenance to raise a flag and SISMEMBER rq:flags maintenance to test it each round. The idea remains the same, though.
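In redis-py terms, that scoped variant boils down to something like this (illustrative only):

    conn.sadd('rq:flags', 'maintenance')                # raise the flag
    paused = conn.sismember('rq:flags', 'maintenance')  # each worker tests this every round
    conn.srem('rq:flags', 'maintenance')                # lower the flag again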

@selwin (Collaborator) commented Oct 29, 2014

Sorry, but it feels like there's something I'm missing. Doesn't Worker already perform a clean shutdown when a SIGTERM is sent? (Please ignore the case where RQ is performing a long-running job.)

@nvie (Collaborator) commented Oct 29, 2014

Yes, it will. The effect is essentially the same: it will finish the current work, then stop. The difference is that in the Heroku context you cannot control the Unix process itself, so (1) you cannot send it SIGTERM, and (2) Heroku will kill it after 10 seconds, even if the current job requires more time.

This approach provides a control mechanism driven via Redis, and therefore independent of the platform it's running on. The deployment would essentially be:

  1. Run rq flag maintenance
  2. Wait until all work is done, no matter how long it takes (safe)
  3. Do the deployment
  4. Run rq flag -d maintenance

@ThisGuyCodes (Contributor)

@selwin we can't ignore the case when RQ is performing a long-running job, because that's the entire problem we're trying to solve (a job takes longer than Heroku gives us to shut down).

I do like @nvie's approach, but it also gave me an idea: along the same lines, I think it also makes sense for each queue to have a pause key. It may require a bit more work (and in fact might need to be a separate conversation), but I feel that would cover more use cases (stopping a specific kind of work instead of specific workers).

Also, on Heroku, if we shut down the process (instead of doing some internal pause/sleep), Heroku will automatically restart it. So no matter which approach we take, we need an internal pause/sleep rather than a shutdown (that's what this PR does so far, but I felt it was worth noting).

@selwin (Collaborator) commented Oct 29, 2014

@nvie from my understanding of Heroku's docs, when Heroku initiates a shutdown, SIGKILL is automatically sent to the process.

@conslo I may be mistaken, but this is what I think: if a shutdown is requested by Heroku's dyno manager, it won't restart the process in the same dyno when the process stops (it doesn't make sense to restart a process they want to shut down). This line in Heroku's docs about dynos implies that it is enough for us to simply shut down the process when we receive Heroku's shutdown signal:

The application processes have ten seconds to shut down cleanly (ideally, they will do so more quickly than that).

Heroku's own code sample directly below also supports my hypothesis:

STDOUT.sync = true
puts "Starting up"

trap('TERM') do
  puts "Graceful shutdown"
  exit
end

loop do
  puts "Pretending to do work"
  sleep 3
end
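A rough Python equivalent of that sample, for comparison (not RQ's actual signal handling, just the same idea):

    import signal
    import sys
    import time

    def graceful_shutdown(signum, frame):
        print("Graceful shutdown")
        sys.exit(0)

    signal.signal(signal.SIGTERM, graceful_shutdown)

    print("Starting up")
    while True:
        print("Pretending to do work")
        time.sleep(3)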

So in short, from my limited understanding of Heroku's dynos, I'm not convinced that this suspend/resume feature will help in handling the dyno manager's shutdown requests more elegantly.

There could be other use cases for suspending workers, of course. And if we really want to implement this pause/resume feature, I'd suggest the following APIs:

rq suspend queue_name --connection-params
rq resume queue_name --connection-params

Thoughts?

@ThisGuyCodes (Contributor)

@selwin everything you said is correct. If we respond to the shutdown request by stopping the process within ten seconds, things will shut down cleanly. And this feature will not, in fact, aid in handling those shutdown requests. But that's not what we're trying to fix.

Currently, when we go to deploy new code, we have two choices:

  • Kill our workers and cause whatever jobs may be running to fail (we have very few jobs that take < 10 seconds)

Or

  • Check worker status until no jobs are running (may take a while depending on queue)
  • (race condition here)
  • Kill the workers to prevent new jobs from starting. (heroku ps:scale worker 0)
  • Push new code.
  • Scale the workers back up. (heroku ps:scale worker N)

What we want is a cleaner choice, which as proposed amounts to this:

  • Flip a bit in redis.
  • Wait for all workers to report as paused.
  • Push new code (this causes workers to restart with new codebase)
  • Flip bit back.

This would allow us to push new code knowing that we wouldn't cause jobs to fail. Currently this guarantee isn't possible.
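For illustration, that workflow might look roughly like this with redis-py and RQ's Worker.all(); the key name and the 'paused' state string follow the discussion in this thread, so treat it as a sketch rather than the final API:

    import time
    from redis import StrictRedis
    from rq import Worker

    conn = StrictRedis()
    conn.set('rq:suspend', 1)   # flip the bit

    # wait for every worker to report as paused before pushing new code
    while any(w.get_state() != 'paused' for w in Worker.all(connection=conn)):
        time.sleep(1)

    # ... push new code here; workers restart against the new codebase ...

    conn.delete('rq:suspend')   # flip the bit back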

@ThisGuyCodes (Contributor)

But I do like your suspend/resume queue idea, yes.

@jtushman (Author)

⬆️ I just made a preliminary implementation of @nvie's suggestions. Let me know your thoughts. If we like it, I will add tests, etc.

@jtushman (Author)

Oh, and I could not do the expiry thing with SADD (I am not a Redis expert, but I do not think expiry is supported for individual set members).

@selwin (Collaborator) commented Nov 1, 2014

@conslo thanks for the explanation. I see what you mean now.

In terms of implementation, I prefer a more explicit rq suspend and rq resume approach rather than rq flag suggested by @nvie. @nvie, what do you think?

@nvie
Copy link
Collaborator

nvie commented Nov 1, 2014

I initially went with the flag approach because I thought it'd be useful to allow you to mix and match the behaviour to your liking — i.e. not necessarily tie suspending to queues. For example, you could set a single flag and let all your workers watch that flag, so suspending all workers just takes setting a single flag. Or you could stop one worker, but not others.

But perhaps stopping per queue is also a good option. To support flags-per-queue, the change required is going to be a bit more involved, because we need to pull apart the BLPOP behaviour. Specifically, this line essentially needs to change to watch queue_keys - flagged_queues. Coordinating this alternative will be a bit trickier (i.e. making sure the worker will pause and keep checking), but it can certainly be done.
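Filtering the keys handed to BLPOP might be as simple as the sketch below (the rq:flags set name is just the idea from the earlier comment, not an actual API):

    def keys_to_watch(conn, queue_keys):
        """Drop queues whose pause flag is set before handing the keys to BLPOP."""
        return [key for key in queue_keys
                if not conn.sismember('rq:flags', key)]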

I think there's a case for both options. Does @jtushman have a preference?

@selwin (Collaborator) commented Nov 1, 2014

What if we initially just go for the global suspend/resume approach to keep things simple?

You can use a different Redis DB number if you want to configure this on a per-queue basis.
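For illustration, splitting workers across Redis DB numbers would look roughly like this (the names are made up):

    from redis import StrictRedis
    from rq import Queue, Worker

    # workers you want to control separately talk to different Redis DBs
    conn_a = StrictRedis(db=0)
    conn_b = StrictRedis(db=1)

    worker_a = Worker([Queue('high', connection=conn_a)], connection=conn_a)
    worker_b = Worker([Queue('low', connection=conn_b)], connection=conn_b)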


Inline review comment on Vagrantfile and .idea (Collaborator):

Please ignore these in a global ignore config.

@nvie (Collaborator) commented Nov 1, 2014

Sounds good, @selwin. If we all agree on the flag solution, let's go ahead and review the current PR, polish it, add tests and documentation.

@selwin (Collaborator) commented Nov 1, 2014

Oh, I meant using "rq suspend" and "rq resume", but on a global basis to make it easy. Is that ok?


@jtushman (Author) commented Nov 2, 2014

Hey guys,

Either approach works for me. I think the CLI of rq suspend / rq resume would be easier to document and explain and would probably serve 90% of the need. We can keep the flag implementation under the hood -- so we can extend it easily -- and give that functionality to power users off the bat.

I can work on this early next week.

@nvie (Collaborator) commented Nov 3, 2014

Sure, let's start with the rq suspend and rq resume per queue implementation, as @selwin suggested. Thanks for wanting to work on this @jtushman! (Btw, the flag implementation does not work for the per-queue case, see the 2nd paragraph in this comment, but per-queue is simpler to understand I think, so better. Let's forget the flag idea for now.)

@selwin (Collaborator) commented Nov 3, 2014

Terribly sorry if there was any miscommunication, but I think I meant "rq suspend" and "rq resume" on a global basis to keep things simple. Please see my comment here: #434 (comment)

I share @nvie's feeling that implementing this on a per-queue basis may be a bit hairy.

@nvie (Collaborator) commented Nov 3, 2014

Oh, were you suggesting in #434 (comment) to do it like the current implementation by @jtushman, but with a terminology change (i.e. instead of "flags" or "pause", name them "suspend" and "resume")?

@selwin (Collaborator) commented Nov 3, 2014

Yes, like @jtushman suggested, I think we should do it the simplest possible way, i.e. the Worker should check whether the rq:suspend key exists and suspend if it does, and just leave it at that.

Since this covers the large majority of use cases, we should just leave it at that and forget about the ability to set arbitrary flags for now. Anything more complex can be implemented using custom worker classes (this is why we built custom worker classes ;).
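That check is small enough to sketch in a few lines; the key name follows this comment, and the helper shown here is illustrative rather than the merged code:

    def is_suspended(connection):
        """Return True when the global suspend key is set."""
        return bool(connection.exists('rq:suspend'))

    # inside the worker's dequeue loop, before picking up the next job:
    #     while is_suspended(connection):
    #         time.sleep(1)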

@jtushman (Author) commented Nov 4, 2014

Ok took a stab at the global pause/resume approach. Added tests. Thoughts?

@selwin (Collaborator) commented Nov 5, 2014

I'll try to find some time to review this PR this weekend. Thanks!




def suspend(connection, ttl=None):
if ttl:
Collaborator (inline comment):

This should be if ttl is None because passing 0 into ttl should not cause the key to live forever.

@jtushman (Author):

Hmmm -- I think we should make that check at the CLI level, if at all. To me, setting ttl=0 SHOULD cause the key to live forever (or until resume is called). That seems to me to be expected TTL behavior.

But I will default to you / @nvie lemme know.

Everything else is easy breezy
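To make the difference concrete, here is a sketch of the two readings being discussed; the key name and calls are illustrative, not the merged code:

    def suspend(connection, ttl=None):
        # With `if ttl:`, a ttl of 0 falls through to a plain SET and the key
        # lives until resume is called; with `if ttl is not None:`, a ttl of 0
        # would be passed through to Redis, which rejects a zero expiry.
        if ttl:
            connection.setex('rq:suspend', ttl, 1)
        else:
            connection.set('rq:suspend', 1)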

@coveralls

Coverage Status

Coverage increased (+0.06%) when pulling 0f3ea8c on jtushman:redis-based-shutdown into 786d3c5 on nvie:master.


@coveralls

Coverage Status

Coverage increased (+0.2%) when pulling 162a02f on jtushman:redis-based-shutdown into 786d3c5 on nvie:master.

@jtushman (Author) commented Dec 2, 2014

Cool -- how does this look? BTW, I know this is getting to be a fair number of commits (25). Should we squash them? I am not sure what that does to the conversation on the PR.

Note: reposting my comment from an outdated diff, since it's collapsed now:

Regarding the arg error on connection#set: that was a redis version issue. I was running 2.9.x, but setup.py only requires 2.7.0.
(I do have a test for it: test_cli#test_suspend_with_ttl.)

That's why I updated requirements.txt to match setup.py, so Travis will catch these errors.

@jtushman (Author)

Hey guys -- any thoughts on this? Let me know if you have any additional feedback.

@selwin (Collaborator) commented Dec 12, 2014

I'll review this again this weekend. Yeah, it would be great if we could squash the commits together, as there was a fair bit of noise from the earlier implementation attempts.

@jtushman (Author)

Squashed away!

@@ -158,7 +163,12 @@ def worker(url, config, burst, name, worker_class, job_class, queue_class, path,
worker_class = import_attribute(worker_class)
queue_class = import_attribute(queue_class)

if is_suspended(conn):
click.secho("The worker has been paused, run reset_paused", fg='red')
Collaborator (inline comment):

This should say: "RQ is currently suspended, to resume job execution run rq resume".

@selwin (Collaborator) commented Dec 15, 2014

@jtushman also, it looks like this PR doesn't merge cleanly anymore. Mind taking a look? If you can fix this within the next few days, I'll refrain from merging other PRs so we can get this merged in quickly, since this PR touches quite a number of files.

Thanks for the effort and hard work!

@coveralls

Coverage Status

Coverage decreased (-0.15%) when pulling 1ae5a12 on jtushman:redis-based-shutdown into cd0c3c9 on nvie:master.

@coveralls

Coverage Status

Coverage increased (+0.19%) when pulling 60c27d5 on jtushman:redis-based-shutdown into cd0c3c9 on nvie:master.

@selwin (Collaborator) commented Dec 24, 2014

Aside from a few tiny typos and docs, I think this PR is ready. Any objection if I merge this in, @nvie?

@jtushman (Author) commented Jan 5, 2015

Happy New Year @nvie, @selwin! I am back from break -- lemme know if this needs any more polish

@selwin (Collaborator) commented Jan 12, 2015

I'm going to merge this in. Thanks a lot for your hard work, @jtushman!

selwin added a commit that referenced this pull request on Jan 12, 2015: "triggering shutdown by setting a redis flag"

@selwin merged commit 31dcb57 into rq:master on Jan 12, 2015
@jtushman (Author)

Woot!

@dfrankow

I see that this was merged in, including into the rq command-line client ("rq suspend" and "rq resume").

But how exactly should this be used by people running on Heroku? Are there any docs?
