Ephemeral node for exclusive lock not cleaned up when failure happens during lock acquisition #51

Closed
ryanlecompte opened this Issue Oct 7, 2012 · 2 comments

ryanlecompte commented Oct 7, 2012

I was finally able to reproduce the issue that we were looking at in IRC on Friday. Here's an isolated test case:

In process 1:

zk = ZK.new('..')
l = zk.locker('foo')

l.lock!(true)

In process 2:

zk = ZK.new('..')
l = zk.locker('foo')

l.lock!(true)

Now press Ctrl-C in process 2 to simulate an exception/error happening while inside the blocking #lock! call. (In my original scenario, I produced this error condition by killing a couple of ZK servers.)

Now, when you do:

zk.children(l.root_locker_path)

You'll notice that there are some entries there that never get cleaned up.

I would love for this to get cleaned up somehow if possible. In the case of redis_failover, it produces a situation where the non-master node managers never get a chance to become the real master: they hang forever on their next #lock!(true) call, because their previous #lock!(true) call failed and their old ephemeral sequential znode was never deleted.
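To see why a stale znode causes every later contender to hang: in a ZooKeeper-style exclusive lock, ownership goes to the child with the lowest sequence number, and everyone else waits on the node ahead of them. The sketch below is illustrative pure Ruby (not ZK's actual implementation); the `ex…` child names and helper functions are assumptions modeled on ZooKeeper's sequential-znode naming.

```ruby
# Parse the numeric sequence suffix that ZooKeeper appends to
# sequential znodes (e.g. "ex0000000005" => 5).
def sequence_number(znode_name)
  znode_name[/\d+\z/].to_i
end

# A ZooKeeper-style exclusive lock is "held" by whichever child
# sorts first by sequence number; all later children wait.
def lock_holder(children)
  children.min_by { |name| sequence_number(name) }
end

children = %w[ex0000000007 ex0000000005 ex0000000006]
lock_holder(children)  # => "ex0000000005"
```

If "ex0000000005" is a leftover ephemeral node from a contender whose #lock! call died, nodes 6 and 7 wait on it indefinitely, which is exactly the redis_failover hang described above.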

As a workaround, I'm doing the following hack in redis_failover (master):

        # we manually attempt to delete the lock path before
        # acquiring the lock, since currently the lock doesn't
        # get cleaned up if there is a connection error while
        # the client was previously blocked in the #lock! call.
        if path = @zk_lock.lock_path
          @zk.delete(path, :ignore => :no_node)
        end
        @zk_lock.lock!(true)

@ghost assigned slyphon Oct 7, 2012

slyphon added a commit that referenced this issue Oct 7, 2012

Probable fix for #51, more aggressive cleanup of lock state
Basically runs cleanup if we're exiting the lock method and we didn't
acquire the lock (based on @locked), which is what we actually *mean*
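The shape of that fix can be sketched as follows. This is a hedged stand-in, not the actual ZK source: `Locker`, `@znode_exists`, and the simulated connection failure are all illustrative. The key idea from the commit message is that cleanup runs from an ensure clause whenever #lock! exits without @locked having been set.

```ruby
class Locker
  attr_reader :znode_exists

  def initialize(fail_while_blocked: false)
    @locked = false
    @znode_exists = false
    @fail_while_blocked = fail_while_blocked
  end

  def lock!(blocking = false)
    @znode_exists = true  # create the ephemeral sequential znode
    # simulate a connection error while blocked waiting for the lock
    raise "connection lost" if blocking && @fail_while_blocked
    @locked = true
  ensure
    # the fix: if we're leaving without having acquired the lock,
    # delete our znode instead of leaving it behind
    @znode_exists = false unless @locked
  end
end

l = Locker.new(fail_while_blocked: true)
l.lock!(true) rescue nil
l.znode_exists  # => false (the stale znode was cleaned up)
```

On the success path @locked is set, so the ensure clause leaves the znode in place while the lock is held; only a failed acquisition triggers the cleanup.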

ryanlecompte commented Oct 7, 2012

Awesome! I just tested your new branch locally and it fixes my redis_failover situation.


ryanlecompte commented Oct 8, 2012

Fixed in ZK 1.7.2. Thanks @slyphon!
