Locking #176

Merged
merged 16 commits into master from locking on Mar 21, 2013

3 participants
Contributor

bvandenbos commented Aug 21, 2012

Adding locking to support multiple scheduler processes.

dbalatero and 1 other commented on an outdated diff Aug 21, 2012

lib/resque/scheduler_locking.rb
+ end
+
+ def process_id
+ Process.pid
+ end
+
+ def is_master?
+ acquire_master_lock! || has_master_lock?
+ end
+
+ def master_lock_value
+ [hostname, process_id].join(':')
+ end
+
+ def master_lock_key
+ :master_lock

dbalatero Aug 21, 2012

Contributor

You might consider a more specific key here, like :resque_scheduler_master_lock.


bvandenbos Aug 22, 2012

Contributor

I think resque namespaces all the redis keys it uses, but yeah, that's a good point. I'll update it.

dbalatero commented on an outdated diff Aug 21, 2012

lib/resque/scheduler_locking.rb
+# skipped since eventually a master will come online and it will process
+# everything that is ready (no matter how old it is). Scheduled jobs work
+# like cron - if you stop cron, no jobs fire while it's stopped and it doesn't
+# fire jobs that were missed when it starts up again.
+
+module Resque
+
+ module SchedulerLocking
+
+ # The TTL (in seconds) for the master lock
+ def lock_timeout=(v)
+ @lock_timeout = v
+ end
+
+ def lock_timeout
+ @lock_timeout ||= 60 * 3 # 3 minutes

dbalatero Aug 21, 2012

Contributor

It should probably warn/fail if poll_sleep_amount is greater than the lock_timeout. Otherwise, you'll probably get processes bouncing back and forth between each other, which is not really the desired behavior (even if there are no side effects).


dbalatero Aug 21, 2012

Contributor

(maybe suggesting a minimum value in the README would be good - at least 60.seconds?)

dbalatero commented on the diff Aug 21, 2012

lib/resque/scheduler.rb
@@ -133,8 +138,10 @@ def load_schedule_job(name, config)
if !config[interval_type].nil? && config[interval_type].length > 0
args = optionizate_interval_value(config[interval_type])
@@scheduled_jobs[name] = rufus_scheduler.send(interval_type, *args) do

dbalatero Aug 21, 2012

Contributor

Just to make sure - is it ok to call rufus_scheduler.send without asserting is_master? first?


bvandenbos Aug 21, 2012

Contributor

Yup - it's ok because the assertion is in the block. So, rufus-scheduler fires the events on every machine, but only the master processes them. That way when a machine becomes master, no changes to the rufus-scheduler state are needed.

dbalatero and 1 other commented on an outdated diff Aug 21, 2012

lib/resque/scheduler_locking.rb
+
+ def master_lock_value
+ [hostname, process_id].join(':')
+ end
+
+ def master_lock_key
+ :master_lock
+ end
+
+ def extend_lock!
+ # If the master fails to checkin for 3 minutes, the lock is released and is up for grabs
+ Resque.redis.expire(master_lock_key, lock_timeout)
+ end
+
+ def acquire_master_lock!
+ if Resque.redis.setnx(master_lock_key, master_lock_value)

dbalatero Aug 21, 2012

Contributor

Is it possible to setnx with a timeout all in one shot? If the program dies here without calling extend_lock! (super low possibility, but…), it will lock out all the other processes until you manually clear the lock.


bvandenbos Aug 22, 2012

Contributor

Good catch. I honestly can't remember if I was aware of this deficiency or not. In any case, Redis doesn't have a setnxex command... which is too bad, because that would be awesome. I don't think you can use a MULTI to get the desired effect here, since we need to check the return value of SETNX before we issue the EXPIRE. Let me know if you think of something clever. At the very least, we should probably have a comment there documenting the problem.


dbalatero Aug 22, 2012

Contributor

Someone on IRC solved this with Lua scripting, but that would require Redis 2.6+ (scripting isn't available in earlier versions).

http://pastie.org/4566109

The relevant code (the pastie is Go, but the embedded Lua is what matters) is:

// Returns 1 if the lock was acquired.
var lockScript = redis.NewScript(
`if redis.call('SETNX', KEYS[1], ARGV[1]) == 1
then
    redis.call('EXPIRE', KEYS[1], 60)
    return 1
else
    return 0
end`)

And checking for holding the lock:

// Returns 1 if the lock is held.
var holdScript = redis.NewScript(
`local res = redis.call('GET', KEYS[1])
if res ~= ARGV[1]
then
    return 0
end
redis.call('EXPIRE', KEYS[1], 60)
return 1`)

dbalatero Aug 22, 2012

Contributor

(Redis will atomically run the Lua code.)


dbalatero Aug 22, 2012

Contributor

One thing we could do is inject different locking behaviors for pre-2.6 and 2.6+ Redis, and warn if someone is running an older version. We'd just need two implementations of acquire_lock and check_if_locked.

Contributor

dbalatero commented Aug 21, 2012

A general comment: I think it should release its lock immediately via ensure on shutdown. Otherwise, for example, if your deploy process restarts resque-scheduler it will take 180 seconds to come back up (as it will need to wait for Redis to expire the lock key).

Contributor

dbalatero commented Aug 21, 2012

I'm just brain dumping other cases here, no matter how trivial (just to try and be complete…)

  1. resque-scheduler loses its network connection to Redis.

A: The lock will expire in N seconds, and any remaining connected resque-scheduler processes will acquire the lock and become master.

  2. resque-scheduler is shot in the head with kill -9.

A: The lock will expire in N seconds.

  3. Redis dies.

A: All resque-scheduler processes are screwed and won't process anything anyway.

Contributor

bvandenbos commented Aug 21, 2012

For case 3, you could put Redis behind ZooKeeper with replication and failover, then set up resque/resque-scheduler to use the ZooKeeper-managed Redis.

Contributor

dbalatero commented Aug 21, 2012

For case 3, you could put Redis behind ZooKeeper with replication and failover, then set up resque/resque-scheduler to use the ZooKeeper-managed Redis.

Yep, I have that with redis_failover and it's working great. But, if someone doesn't have that, then worst-case jobs will not be processed until things come back online. Luckily, they won't be double/triple-processed, which I see as a worse problem.

Contributor

bvandenbos commented Aug 22, 2012

Should definitely release the lock on shutdown.

bvandenbos and others added some commits Aug 22, 2012

@bvandenbos: releasing master lock on shutdown, changing lock key (cb65088)

@dbalatero: Get a better redis-server in test. The previous implementation was causing intermittent seg faults on my machine. (9f55f13)

@dbalatero: Refactor locking, and allow for different locks. This is work-in-progress: tests fail on Redis 2.6. (6d955b6)

@bvandenbos: Merge pull request #177 from dbalatero/resilent_2.4_locking ("Resilient 2.4 locking") (787787c)

Sergio Tulentsev commented Sep 1, 2012

When can we expect this stuff to be released? :)

Contributor

bvandenbos commented Sep 1, 2012

We're waiting on a pull request to redis-namespace to support scripting in Redis.

On Sep 1, 2012, at 12:46 AM, Sergio Tulentsev (notifications@github.com) wrote:

When can we expect this stuff to be released? :)



bvandenbos added a commit that referenced this pull request Mar 21, 2013

@bvandenbos: Merge pull request #176 from bvandenbos/locking ("Locking") (709832c)

bvandenbos merged commit 709832c into master Mar 21, 2013

bjones pushed a commit to bjones/resque-scheduler that referenced this pull request May 30, 2014

@bvandenbos: Merge pull request #176 from bvandenbos/locking ("Locking") (3d22586)