Skip to content

HTTPS clone URL

Subversion checkout URL

You can clone with
or
.
Download ZIP

Loading…

Adding a scheduler locking mechanism #165

Closed
wants to merge 1 commit into from

7 participants

@parrish

I basically just implemented what was proposed in issue #161.

Seems to be working fine, though there is a very narrow window for lock contention between calling has_lock? and try_lock? here

@bvandenbos
Owner

Thanks for digging into this. I've got another pull request similar to this one (https://github.com/bvandenbos/resque-scheduler/pull/164/files). Both of these are missing some cases.

This was a little more complex than I originally thought, but I've got what I believe to be a complete solution here:

https://github.com/bvandenbos/resque-scheduler/compare/locking

Not ready to merge it into master, but I'd love for a few more sets of eyes on it and some validation of my assumptions. I'll post something similar in the other pull request as well.

@bvandenbos bvandenbos closed this
@cheald

(I can't comment on the commit, so this'll have to do!)

One solution to the desynchronized clocks issue is to synchronize time via Redis itself. This is actually really easy to do. If you're on Redis 2.6, you can trivially use TIME to synchronize your clients. However, if you aren't (and most people aren't), then you can implement time coordination by doing the following:

  • Clients attempt to SETNX on a timesynch value with their system timestamp as UTC
  • If the client successfully sets the value, then it sets the expiry of the the timesynch key to some far-future value (tens of years is fine!)
  • After the attempted SETNX, each client then performs a GET of the key, and a TTL of the key. You can derive how long it's been since the key was set, and from that, derive any drift between the "master" client and your own client. Adjust any time calls by the drift offset. You can resynch periodically by performing the same GET and TTL calls.

This would let you use a much lower timeout value than 3 minutes; you just have to have a timeout greater than your maximum projected round-trip network latency.

Once you have a synchronized time, then it's trivial to solve the scheduled jobs issue, as you don't have to worry about time drift resulting in a job being enqueued multiple times. Systems could have wildly divergent system times and still be scheduling things properly.

Additionally, it should be unnecessary to use the is_master? check around the load_schedule_job queueing, because load_schedule_job is called from update_schedule which is called from the event loop, which is already locked. You would want to check the master lock in load_schedule!, but in general, I think it's better to keep the lock checks in the points of entry, rather than down in the guts of the operation if at all possible.

@bvandenbos
Owner
@parrish

I just browsed through https://github.com/bvandenbos/resque-scheduler/compare/locking.
It looks fine to me, though I agree that the synchronize around load_schedule_job is unnecessary.

The double check in has_master_lock? and the lock/checking order in is_master? appears to solve the potential for lock contention as well.

@adamgotterer

Curious if this made it to current stable gem? The docs are still suggesting not running more then one scheduler instance.

@shedd

We are also interested in running Resque Scheduler across more than one instance to improve resiliency in our environment. Would love to see this merged in.

@adamgotterer

There hasn't been an update here in 5mo. Seems like this gem might not be supported any longer.

@parrish

Just to chime in, https://github.com/bvandenbos/resque-scheduler/compare/locking does look like a better way to do it, but I've been running this patch in production for a couple months now without any problems.

@cyphactor

@bvandenbos what is the hold up with merging https://github.com/bvandenbos/resque-scheduler/compare/locking into master? Is there something broken with it?

@lloydmeta

:+1: +1 Would love to see this merged in.

@bvandenbos
Owner

Sorry guys. 1) I've been crazy busy. 2) I've switch to sidekiq + clockwork so I haven't had the chance to production test this.

Unrelated: If someone wants to help with the maintenance of this gem, I'd love the help.

Let's merge it.

@bvandenbos
Owner

and by merge "it" I mean my lock branch.. just to be clear.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Commits on Jun 22, 2012
  1. @parrish
This page is out of date. Refresh to see the latest.
Showing with 38 additions and 3 deletions.
  1. +38 −3 lib/resque/scheduler.rb
View
41 lib/resque/scheduler.rb
@@ -1,5 +1,6 @@
require 'rufus/scheduler'
require 'thwait'
+require 'socket'
module Resque
@@ -8,7 +9,6 @@ class Scheduler
extend Resque::Helpers
class << self
-
# If true, logs more stuff...
attr_accessor :verbose
@@ -31,6 +31,38 @@ def poll_sleep_amount
@poll_sleep_amount ||= 5 # seconds
end
+ def hostname
+ Socket.gethostbyname(Socket.gethostname).first
+ end
+
+ def process_id
+ Process.pid
+ end
+
+ def identifier
+ [hostname, process_id].join ':'
+ end
+
+ def lock_timeout
+ 60
+ end
+
+ def lock_key
+ 'schedule:lock'
+ end
+
+ def has_lock?
+ Resque.redis.get(lock_key) == identifier
+ end
+
+ def try_lock?
+ Resque.redis.setnx lock_key, identifier
+ end
+
+ def update_lock_expiry
+ Resque.redis.expire lock_key, lock_timeout
+ end
+
# Schedule all jobs and continually look for delayed jobs (never returns)
def run
$0 = "resque-scheduler: Starting"
@@ -48,8 +80,11 @@ def run
# Now start the scheduling part of the loop.
loop do
begin
- handle_delayed_items
- update_schedule if dynamic
+ if has_lock? || try_lock?
+ update_lock_expiry
+ handle_delayed_items
+ update_schedule if dynamic
+ end
rescue Errno::EAGAIN, Errno::ECONNRESET => e
warn e.message
end
Something went wrong with that request. Please try again.