Process locks up with failover client #13
This is interesting. I don't use resque, so I actually haven't run into this specific issue before. I will definitely dig into it, though, and try to reproduce it. It may be related to forking and the ZK client. Could you please provide me with a simple resque worker sample that demonstrates the problem? Also, have you been able to successfully use the client from other parts of your code? On May 10, 2012, at 3:57 AM, Matt Conway <reply@reply.github.com> wrote:
After killing those workers and having a couple more respawn, here is what the dumps look like (both in the "Starting" procline state):
Yes, the client works fine from other parts of my code. Not sure how to package up resque for you, since there are a bunch of moving parts to set this up for our codebase. I was going to extract it into a rubber recipe, and then I could put it on a throwaway instance for you, or let you into the instance I am testing on now.
That's good to hear that it's working in other parts of your code base. I suspect this may have something to do specifically with resque and the fact that it forks workers. @slyphon, could you please take a look at these thread dumps? The recent traces seem to indicate something happening in the ZK layer. Do you think this is fixed in the new version of ZK that handles forking better? On May 10, 2012, at 4:42 AM, Matt Conway <reply@reply.github.com> wrote:
Hey, I'm the ZK maintainer. I just got 1.4.0 released yesterday with the zookeeper-1.0.3 driver, which contains a bunch of fixes for fork behavior. I'm not too familiar with resque specifics, but there may be action we need to take to prevent this: specifically, you need to call reopen() on the zk instance immediately after forking. We should definitely run redis_failover with the latest ZK and test it.
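The reopen-after-fork rule described above can be illustrated without ZooKeeper at all. The sketch below is a gem-free toy (the ForkAwareConn class is invented for illustration, not part of the zk gem): a handle remembers the PID that created it, and the child process repairs it immediately after fork(), analogous to calling reopen() on the zk instance.

```ruby
# Toy stand-in for a connection that must not be reused across fork().
# ForkAwareConn is invented for illustration; it only tracks the owning PID.
class ForkAwareConn
  def initialize
    @owner_pid = Process.pid
  end

  # True when this handle was created in a different (parent) process.
  def stale?
    @owner_pid != Process.pid
  end

  # Analogue of zk.reopen: rebuild state so the handle belongs to this process.
  def reopen
    @owner_pid = Process.pid
    self
  end
end

conn = ForkAwareConn.new
reader, writer = IO.pipe

pid = fork do
  reader.close
  conn.reopen if conn.stale?               # repair immediately after forking
  writer.puts(conn.stale? ? 'stale' : 'fresh')
  writer.close
  exit!(0)
end

writer.close
Process.wait(pid)
result = reader.read.strip
puts result
```

A real ZK handle inherited across fork() is similarly "stale" in the child; the difference is that the real repair re-establishes a session rather than just updating a PID.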
Great, thanks Jonathan. I will test it today with the latest ZK and a simple resque worker. On May 10, 2012, at 4:50 AM, Jonathan Simms <reply@reply.github.com> wrote:
I'll try updating zk and zookeeper. For the reopen(), how do I call that (I'll try without it first)? Resque-pool has a post-fork handler I can run arbitrary code in, where I already create a new connection to redis (effectively RedisFailover.new); I'm just not sure what to call reopen() on, or if it's even necessary if I'm creating a new client. Thanks.
Yeah, we definitely need to use zookeeper-1.0.3; there was a shutdown ordering problem that caused a lockup on Linux after a fork. It worked great on Mac OS, but apparently with Linux's pthread implementation and the zookeeper C client, after a fork you have to shut down the existing client completely before trying to start a new one. I now have a very thorough spec for testing forking, event delivery, etc.
Excellent! This sounds like it could be the problem then. Matt, this should be as simple as specifying the version in redis_failover.gemspec to be the one that Jonathan recently released. It would be awesome if you could give that a shot. Also, you probably don't need to deal with reopen if you are creating a brand new RedisFailover::Client instance, since you will get a brand new ZK instance that's tied to that new client. On May 10, 2012, at 5:09 AM, Jonathan Simms <reply@reply.github.com> wrote:
I think you need to call close on the original, though. That was the bug: the zkc connection that survived the fork was damaged goods, and if you started a new zkc connection before closing the old one, you'd deadlock. I designed the fork recovery code with reopen() in mind (for @eric's use case), so maybe we can do a short code review later to make sure it will work in this case.
Ok, using these versions it still hangs. Do I need to explicitly call reopen/close? If so, how?
Ok, looking at RedisFailover::Client, I don't see a way of explicitly calling close; once @ryanlecompte gets online I'll consult with him. Sorry, I'm in NY and I have to hop on the train in a few minutes, or I'd take a whack at this myself. Also, I'm sorry for all the trouble; handling a fork() turned out to be way harder than it appeared.
Never fear, I'm used to the forking pain :)
Okay, I just checked in (master) an update to redis_failover.gemspec to use the new ZK version, along with a new RedisFailover::Client#shutdown method that you should invoke in your after_fork hook, right before you create your new RedisFailover::Client instance. I just did a sanity check locally and it seems to work. Could you give master a shot, @wr0ngway, and let us know how it works for you? Thanks!
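Putting the advice above together, an after_fork hook might look like the following sketch. This is a configuration fragment, not a runnable standalone example: it assumes the zk/redis_failover gems are installed, and the global variable name and zkservers address are illustrative assumptions, not values from this thread.

```ruby
# Hypothetical wiring per the comment above (names and address are assumptions).
Resque.after_fork do |job|
  # Tear down the connection inherited from the parent process FIRST --
  # a surviving pre-fork ZK connection is what caused the lockups.
  $redis_client.shutdown if $redis_client

  # Only then build a fresh client, which gets a brand-new ZK instance
  # tied to it, so no explicit reopen() call is needed here.
  $redis_client = RedisFailover::Client.new(:zkservers => 'localhost:2181')
end
```

The ordering matters: per the earlier comments, starting a new connection before fully closing the inherited one is exactly the sequence that deadlocked on Linux.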
In trying to verify this, I saw the following stack trace. Should one be able to reuse a client after shutdown, similar to the way a normal redis client reconnects? It looks like reconnect_zk doesn't exist - bad refactoring?
This should automatically shut down the underlying connection when a fork happens (fail-safe), regardless of user action. Related to ryanlecompte/redis_failover#13.
I'm working on a possible fix for the hang in zookeeper. I need to test this with the full stack to make sure it does the right thing, but this should prevent the hang by immediately shutting down the underlying low-level C wrapper after a fork; then the reopen will just repair zookeeper ruby-space state.
Hey Matt, are you on IRC (freenode)? If so, it would be great if you could join #zk-gem so we could diagnose this a bit more easily. I'll be there in 15 minutes. Ryan On May 10, 2012, at 8:49 AM, Matt Conway <reply@reply.github.com> wrote:
I'm seeing this SEGV now (right when I shut down failover within the forked resque worker):
Can you give latest master a try? I just checked in the fixes around … So basically, in your after_fork, you want to do … On Thu, May 10, 2012 at 9:07 AM, Matt Conway <reply@reply.github.com> wrote:
Ok, I did a pretty much full rewrite of the C backend of zookeeper between Friday and Saturday. Tests are passing for zookeeper on all platforms and ruby versions, and all ZK tests are passing except the fork test on Rubinius (but, really, who cares), so I'll cut a release on Monday (5/14) and we'll give it a whirl.
Awesome work! This will be great to test on Monday. On May 13, 2012, at 8:46 AM, Jonathan Simms <reply@reply.github.com> wrote:
Ok, I tested and it looks like this is fixed. Good job! Now that I can get further, I'm running into a new issue, which I think is a redis_failover problem; I will create a new issue.
This may not be the fault of redis_failover, but all else being the same, when I have it enabled my resque workers lock up on start. Here are a couple of thread dumps: in the first, the worker doesn't get past the "Starting" procline; in the second, it got to the "Waiting" procline but never picks up jobs. I used gdb.rb to get the traces.
"Starting" procline:
"Waiting" procline: