triggering shutdown by setting a redis flag #434
Conversation
```python
self.log.warn('Paused in burst mode -- exiting.')
self.log.warn('Note: There could still be unperformed jobs on the queue')
break
```
This is being pretty nit-picky, but there's a race condition here: the paused key could be set between this `if` statement and the `while paused` loop, and because `Worker.paused()` makes a network call, this window is actually a few milliseconds long.
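(Roughly the shape being described -- a hypothetical sketch of the two flag reads, where the key name and loop structure are assumptions, not the actual diff:)

```python
import time
import redis

conn = redis.Redis()
PAUSE_KEY = 'rq:worker:paused'  # hypothetical flag key

def paused():
    return conn.exists(PAUSE_KEY)  # a Redis round-trip on every call

burst = True
if paused() and burst:
    print('Paused in burst mode -- exiting.')
else:
    # The pause key can be set during this few-millisecond window, so a
    # burst-mode worker could end up waiting below instead of exiting.
    while paused():
        time.sleep(1)
```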
Cool @nvie, I have cleaned this up and removed the "provisional" tag -- it seems to be working well for us. Let me know your thoughts. Thanks for the great library!
From your problem description, shouldn't this situation be handled by the process manager (supervisor or upstart) running the worker?
Hi @selwin. I wish that were the case, but with Heroku we are unable to access the process manager. Our main option is to restart the dyno, which sends the process a SIGTERM and then, after 10 seconds, kills it.
I can't find it now, but the Heroku folks recommended a database-driven solution like the one implemented here. I know it adds some complexity to an elegant solution -- but I think this will help a lot of people who use this on Heroku or similar systems. I wonder if @kennethreitz (the Python advocate at Heroku) has anything to add to this discussion.
Ah ok, got it.

In the case that a job takes longer than 10 seconds to complete and the worker is killed by a SIGKILL, the job will fail and can then be requeued. Does that help?
Sort of. But that still leaves us without a way to cleanly shut down. Intentionally failing and then retrying whatever job is currently running certainly is an option, but hardly a good one.
Hey @jtushman, thanks for the PR -- definitely a missing feature, and Heroku is a well-known platform that can cause annoying issues like this. I think the initial stab you took at it is fine, but I have a few comments:
What if we flip the logic here? We can have each worker check whether it may continue with the next job. These conditionals act as traffic lights, essentially. And while we're at it, why not have more than one such conditional? Each worker could then be instantiated as follows:

```
$ rqworker -c foo -c bar
```

This would mean that this worker checks both the `foo` and `bar` flags before picking up its next job. This way you can have fine-grained control over your deployments. The simplest case would be a single flag that all workers watch.

Now, to start a new deployment, you would simply create the Redis key first:

```
$ rq flag --expires-in=60 heroku:maintenance   # briefly pause
$ rq flag heroku:maintenance                   # set flag, until unset
$ rq flag -d heroku:maintenance                # unset flag
```

What do you think of this alternative? We could easily document this option as the recommended way of doing Heroku deployments.
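For illustration, a minimal sketch of what such a traffic-light check could look like inside a worker loop -- the `rq:flag:` key prefix and the `may_continue` helper are assumptions here, not an actual rq API:

```python
import time
import redis

FLAG_PREFIX = 'rq:flag:'  # assumed key-naming scheme, for illustration only

def may_continue(connection, flags):
    """Green light only when none of the watched flag keys exist."""
    return not any(connection.exists(FLAG_PREFIX + flag) for flag in flags)

connection = redis.Redis()
watched = ['heroku:maintenance']  # e.g. from `rqworker -c heroku:maintenance`

while True:
    if not may_continue(connection, watched):
        time.sleep(1)   # red light: idle instead of dequeueing
        continue
    break               # green light: go dequeue and perform the next job
```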
Actually, we should be scoping these flags. Perhaps the best solution is to use a single global set to hold all active flags.
Sorry, but it feels like there's something that I'm missing: doesn't sending the worker a SIGTERM already make it finish the current job and then stop?
Yes, it will. The effect essentially is the same: it will finish the current work, then stop. The difference is that in the Heroku context you cannot control the Unix process itself, so (1) you cannot send it SIGTERM, and (2) Heroku will kill it after 10 seconds, even if the current job requires more time. This approach provides a control mechanism that works via Redis, and is therefore independent of the platform it's running on. The deployment would essentially be: set the flag, wait for the workers to finish their current jobs, push the new code, then remove the flag.
@selwin we can't ignore the case where RQ is performing a long-running job, because that's the entire problem we're trying to solve (a job takes longer than Heroku gives us to shut down). I do like @nvie's approach, but it also gave me an idea: along the same lines, I think it also makes sense for each queue to have a pause key. It may require a bit more work (and in fact might need to be a separate conversation), but I feel that would have more use cases (stopping a specific kind of work instead of specific workers). Also, on Heroku, if we shut down the process (instead of doing some internal pause/sleep), Heroku will automatically restart it. Thus, no matter which approach we take, we need an internal pause/sleep instead of a shutdown (that's what this PR does so far, but I felt it worth noting).
@nvie from my understanding of Heroku's docs, when Heroku initiates a shutdown, the process receives a SIGTERM and then has ten seconds to exit before it is forcibly killed.

@conslo I may be mistaken, but this is what I think: if a shutdown is requested by Heroku's Dyno manager, it won't restart the process in the same dyno when the process stops (it doesn't make sense to restart a process they want to shut down). This line in Heroku's docs about Dynos implies that it is enough for us to simply shut down the process when we receive Heroku's shutdown signal.
Heroku's own code sample directly below also supports my hypothesis:

```ruby
STDOUT.sync = true

puts "Starting up"

trap('TERM') do
  puts "Graceful shutdown"
  exit
end

loop do
  puts "Pretending to do work"
  sleep 3
end
```

So in short, from my limited understanding of Heroku's Dynos, I'm not convinced that this suspend/resume feature will help in handling the Dyno manager's shutdown requests more elegantly. There could be other use cases for suspending workers, of course. And if we really want to implement this pause/resume feature, I'd suggest the following APIs: `rq suspend` and `rq resume`.
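(For comparison, here is a rough Python counterpart of that Ruby sample -- a sketch of the same trap-TERM-and-exit pattern only, not rq's actual signal handling:)

```python
import signal
import sys
import time

def graceful_shutdown(signum, frame):
    # Heroku sends SIGTERM first and SIGKILLs the process ~10 seconds later,
    # so exiting promptly here is enough for a clean shutdown.
    print("Graceful shutdown")
    sys.exit(0)

signal.signal(signal.SIGTERM, graceful_shutdown)

print("Starting up")
while True:
    print("Pretending to do work")
    time.sleep(3)
```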
Thoughts?
@selwin everything you said is correct. If we respond to the shutdown request by stopping the process within ten seconds, things will shut down cleanly, and this feature will not in fact aid in handling those shutdown requests. But that's not what we're trying to fix. Currently, when we go to deploy new code, we have two choices: deploy immediately and let Heroku kill whatever job is running at a potentially unsafe point, or intentionally fail the running jobs and retry them afterwards.

What we want is a cleaner choice, which as proposed amounts to this: suspend the workers, let the jobs currently being performed finish, deploy the new code, and then resume.

This would allow us to push new code knowing that we wouldn't cause jobs to fail. Currently this guarantee isn't possible.
But your suspend/resume queue idea I do like, yes.
⬆️ I just made a preliminary implementation of @nvie's suggestions. Let me know your thoughts. If we like it, I will add tests, etc.
Oh, and I could not do the expiry thing with a Redis set -- members of a set can't be given individual TTLs.
I initially went with the flag approach because I thought it'd be useful to allow you to mix and match the behaviour to your liking -- i.e. not necessarily tie suspension to queues. For example, you could set a single flag and let all your workers watch it, so suspending all workers just takes setting a single flag. Or you could stop one worker, but not others. But perhaps stopping per queue is also a good option. To support flags-per-queue, the required change is going to be a bit more involved, because we need to pull apart the worker's dequeueing logic. I think there's a case for both options. Does @jtushman have a preference?
What if we initially just go for the global suspend/resume approach to keep things simple? You can use a different Redis DB number if you want to configure this on a per-queue basis.
```
Vagrantfile
.idea
```
Please ignore these via a global ignore config instead.
Sounds good, @selwin. If we all agree on the flag solution, let's go ahead and review the current PR, polish it, add tests and documentation.
Oh, I meant using "rq suspend" and "rq resume", but on a global basis to make it easy. Is that ok?
Hey guys, either approach works for me. I think the CLI of `rq suspend` / `rq resume` would be easier to document and explain, and it would probably serve 90% of the need. We can have the flag implementation under the hood -- so we can extend it easily -- and give that functionality to power users off the bat. I can work on this early next week.
Sure, let's start with the global `rq suspend` / `rq resume` commands then.
Terribly sorry if there was any miscommunication, but I think I meant "rq suspend" and "rq resume" on a global basis to keep things simple. Please see my comment here: #434 (comment). I share @nvie's feeling that implementing this on a per-queue basis may be a bit hairy.
Oh, were you suggesting in #434 (comment) to do it like the current implementation by @jtushman, but with a terminology change (i.e. instead of "flags" or "pause", name them "suspend" and "resume")?
Yes, like @jtushman suggested, I think we should do it the simplest possible way, i.e. a single global suspend/resume switch. Since this covers a large majority of use cases, we should just leave it at that and forget about the ability to set arbitrary flags for now. Anything more complex can be implemented using custom worker classes (this is why we built custom worker classes ;).
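For illustration, such a global switch could boil down to something like this -- a minimal sketch, where the `rq:suspended` key name and the exact signatures are assumptions, not settled API at this point in the discussion:

```python
WORKERS_SUSPENDED = 'rq:suspended'  # assumed key name, for illustration

def suspend(connection, ttl=None):
    """Set the global suspend flag; with a ttl (seconds), it clears itself."""
    connection.set(WORKERS_SUSPENDED, 1)
    if ttl is not None:
        connection.expire(WORKERS_SUSPENDED, ttl)

def resume(connection):
    """Delete the flag so workers start picking up jobs again."""
    connection.delete(WORKERS_SUSPENDED)

def is_suspended(connection):
    """Workers would call this before dequeueing their next job."""
    return bool(connection.exists(WORKERS_SUSPENDED))
```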
Ok, took a stab at the global pause/resume approach and added tests. Thoughts?
I'll try to find some time to review this PR this weekend. Thanks!
```python
def suspend(connection, ttl=None):
    if ttl:
```
This should be `if ttl is None`, because passing `0` into `ttl` should not cause the key to live forever.
Hmmm -- I think we should make that check at the CLI level, if at all. To me, setting `ttl=0` SHOULD cause the key to live forever (or until resume is called); that seems to me to be expected ttl behavior.

But I will defer to you / @nvie -- let me know.

Everything else is easy breezy.
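To make the disagreement concrete: the two readings only differ when `ttl=0` is passed. A sketch for comparison, with neither function being the code under review:

```python
# Reviewer's reading: only ttl=None means "no expiry". ttl=0 reaches the
# expire() call, and in Redis, EXPIRE with 0 deletes the key immediately.
def suspend_strict(connection, key, ttl=None):
    connection.set(key, 1)
    if ttl is not None:
        connection.expire(key, ttl)

# The other reading: any falsy ttl (None or 0) means "live until resumed".
def suspend_lenient(connection, key, ttl=None):
    connection.set(key, 1)
    if ttl:
        connection.expire(key, ttl)
```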
Cool -- how does this look? BTW, I know this is getting to be a fair number of commits (25). Should we squash them? I am not sure what that does to the conversation on the PR.

Note: reposting my comment from an outdated diff, since it's collapsed now: regarding the arg error, that's why I updated requirements.txt to match setup.py, so Travis will catch these errors.
Hey guys -- any thoughts on this? Let me know if you have any additional feedback.
I'll review this again this weekend. Yeah, it would be great if we could squash the commits together, as there was a fair bit of noise from the earlier implementation attempts.
Squashed away!
```diff
@@ -158,7 +163,12 @@ def worker(url, config, burst, name, worker_class, job_class, queue_class, path,
     worker_class = import_attribute(worker_class)
     queue_class = import_attribute(queue_class)

+    if is_suspended(conn):
+        click.secho("The worker has been paused, run reset_paused", fg='red')
```
This should say: `RQ is currently suspended, to resume job execution run "rq resume"`.
@jtushman also, it looks like this PR doesn't merge cleanly anymore. Mind taking a look at this? If you can fix this within the next few days, I'll refrain from merging other PRs so we can get this merged in quickly, since this PR touches quite a number of files. Thanks for the effort and hard work!
Aside from a few tiny typos and docs, I think this PR is ready. Any objection if I merge this in, @nvie?
I'm going to merge this in. Thanks a lot for your hard work, @jtushman!
triggering shutdown by setting a redis flag
Woot!
I see that this was merged in, including into the rq command-line client ("rq suspend" and "rq resume"). But how exactly should this be used for people running on Heroku? Are there any docs?
Problem Statement
We use rq on Heroku, and their way of shutting down the worker that runs rq does not ensure a safe shutdown.
From the Heroku docs: https://devcenter.heroku.com/articles/dynos
So if a job takes more than 10 seconds (which many of our jobs do), we are out of luck: the job will be killed at a potentially unsafe point.
So we needed an approach where we could set a flag in Redis that the rq worker can poll.
I introduced the key 'rq:worker:pause_work', which any process can set. If the worker sees that it has been set, it hops into a pause loop until the key is deleted.
In `Worker#work`, at the top of the `while True:` block.
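A sketch of that originally proposed check -- the key name comes from the description above, while the helper shape and poll interval are assumptions:

```python
import time

PAUSE_KEY = 'rq:worker:pause_work'  # the key named in the description

def wait_while_paused(connection, poll_interval=1):
    """Called at the top of the worker's `while True:` loop: block until
    some other process deletes the pause key."""
    while connection.exists(PAUSE_KEY):
        time.sleep(poll_interval)
```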