Can you provide any more context for the following error? I've been seeing a lot of them since upgrading from 0.4.4 to 1.2.12. I'm surprised to see timeouts, because the system had been using Zookeeper for quite a while before this change without a problem. Thanks for any insight/ideas.
Caught exception: Zookeeper::Exceptions::ContinuationTimeoutError
Exception Message: "response for meth: :exists, args: [32348, "/somenode", nil, nil], not received within 30 seconds"
Between 0.4.4 and 1.2.12 there was a major rearchitecting of the backend to support use with resque and other fork-oriented frameworks. I switched from the 'mt' multi-threaded zookeeper C library to the 'st' single-threaded one. The mt lib would spawn its own POSIX event thread, and this proved utterly impossible to deal with safely when calling fork. The 'st' backend is essentially an event pump, and the way the zookeeper gem works now is by running its own ruby thread that iterates event delivery.
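To make the "event pump" idea concrete, here is a minimal, illustrative sketch of the pattern: a single ruby thread drains a queue of events and delivers them to registered callbacks, so the C layer never owns a thread of its own. The class and method names here are hypothetical, not the zookeeper gem's actual internals.

```ruby
require 'thread'

# Sketch of an event pump: one dispatcher thread pops events off a queue
# and hands them to registered handlers. (Hypothetical names, not the
# zookeeper gem's real API.)
class EventPump
  def initialize
    @queue    = Queue.new
    @handlers = Hash.new { |h, k| h[k] = [] }
    @thread   = Thread.new { run }
  end

  # Register a callback for an event type.
  def on(event_type, &blk)
    @handlers[event_type] << blk
  end

  # The backend pushes events here instead of invoking callbacks itself.
  def push(event_type, payload)
    @queue << [event_type, payload]
  end

  # Drain remaining events, then stop the dispatcher thread.
  def shutdown
    @queue << :stop
    @thread.join
  end

  private

  def run
    loop do
      item = @queue.pop
      break if item == :stop
      type, payload = item
      @handlers[type].each { |h| h.call(payload) }
    end
  end
end
```

Because everything funnels through one ruby thread, pausing that thread effectively stops event delivery, which is what makes the fork dance below possible at all.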
This was done so that you can "stop the world" before calling fork, call fork, then reopen the connections in the child, and "restart the world" in the parent without losing state in the parent, and without mysteriously segfaulting. This proved to be incredibly hard to do, especially because there are lots of weird bugs in 1.8.7 around threading and ConditionVariable.
There's the context.
So in order to provide what looks like a blocking API for you, it's necessary to take the request you make, suspend your thread, submit the request to the single-threaded asynchronous backend, wait for the response to be delivered, then wake your thread with the return value. This is a potential source of deadlocks, because there are situations where the response may never come. I made the ContinuationTimeoutError fire after 30s because the session timeout is negotiated to 20s; once that much time has passed, your session is invalid and you will never get a response to your request (even if notice of the expiry is never delivered to the client by the server).
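The suspend-and-wake mechanism described above can be sketched in a few lines: the caller's thread parks on a condition variable until the event thread delivers a result, and gives up with an error after a deadline. Only the ContinuationTimeoutError name mirrors the gem; the rest of this is an assumption-laden illustration, not the gem's actual code.

```ruby
require 'thread'

# Illustrative only: blocking facade over an async backend.
class ContinuationTimeoutError < StandardError; end

class Continuation
  def initialize(timeout)
    @timeout = timeout
    @mutex   = Mutex.new
    @cond    = ConditionVariable.new
    @value   = nil
    @done    = false
  end

  # Called by the event thread when the response arrives.
  def deliver(value)
    @mutex.synchronize do
      @value = value
      @done  = true
      @cond.signal
    end
  end

  # Called by the user's thread: blocks until deliver or the deadline.
  def block_until_result!
    @mutex.synchronize do
      deadline = Time.now + @timeout
      until @done
        remaining = deadline - Time.now
        if remaining <= 0
          raise ContinuationTimeoutError, "no response within #{@timeout}s"
        end
        @cond.wait(@mutex, remaining)   # loop again on spurious wakeup
      end
      @value
    end
  end
end
```

If the event thread never runs (starvation) or the session has silently expired, `deliver` is never called and the caller sees the timeout error, which matches the failure mode in the original report.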
It looks like you're using 1.8, so there's the possibility that the event thread is getting starved by something (in 0.4.4 the event thread was POSIX and controlled by the underlying zkc lib, in 1.2.12 the event thread in 1.8 is 'green' and is at the mercy of the ruby thread scheduler).
I've seen this kind of error happen when the Zookeeper cluster starts getting really loaded. It's hard to say exactly what would cause this, as things are time dependent between the client and server, and a lot can go wrong.
One thing to do is check your server logs to make sure there aren't timeouts or obvious errors happening there. As for the client, well, that's a little harder to answer without knowing more details about how you're using it.
I hope that helps!
Just as a side-note, using 0.4.4 we (Crowdflower, Matt and I) successfully used forking for resque on REE by taking a policy of abandoning the previous connection entirely (without closing on child side). Since we only ever forked once and the parent didn't use the zookeeper connection at all, this was an acceptable resource loss. Closing the connection led to very strange behavior.
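The fork-once policy described above amounts to: open a fresh connection in the child, and never touch (and in particular never close) the parent's old handle from the child. A minimal sketch, with a stand-in `Connection` class rather than the zookeeper gem's real one:

```ruby
# Stand-in for a real connection object; illustrative only.
class Connection
  attr_reader :open
  def initialize; @open = true; end
  def close;      @open = false; end
end

parent_conn = Connection.new

pid = fork do
  # Child: abandon parent_conn entirely -- no close, no reuse.
  # Closing the inherited handle here is what led to strange behavior.
  child_conn = Connection.new
  exit!(child_conn.open ? 0 : 1)  # exit! skips at_exit hooks
end
Process.wait(pid)
```

Since the process forks only once and the parent never uses its handle again, leaking that one connection is an acceptable cost.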
We're going to roll back to the old gem because this bug is hitting us hard. Sadly, we're currently wedded to REE in one specific (and important) part of our software.
Thanks for confirming our suspicions that it's ruby 1.8 thread weather conditions.
Ahh, ok. Is there any reason not to just start the connection in the child immediately upon fork?
That's a good question. Right now our deploy of zookeeper-1.2.12 is extremely unreliable and it's introducing inconsistencies in our database. We'll roll back and investigate that approach once we've triaged our system.
Given how extremely difficult it was to write a reliable zookeeper lock in 1.8, I'm sort of nervous about moving anything involving threading up into greenville. I imagine the motivation for your change was to enable higher-level manipulation of the event queue and support for more friendly watch and async response functionality.
You might want to note our experience with REE in your README. Ruby 1.8 holdouts like us are rare but we tend to have massive codebases.
Added note about REE stability and green threads
In terms of the locking, you may want to take a look at the implementation I have in ZK, which I spent a lot of time on, and is probably the most used feature of that library.
+1 for Locker! I use it in http://github.com/ryanlecompte/redis_failover to allow redundancy between multiple redis_node_manager daemons.
Slyphon, if you have a good solution for how to reliably break off a locked operation in the event of a zk disconnection (in REE), I would be very interested. I looked at Locker and, environment considerations aside, our implementations are quite similar.
I'll contact you offline.
by chance, are you guys using the mysql (as opposed to the mysql2) gem?
@KirinDave: Are you using the mysql gem? If you aren't using the mysql2 gem, you'll likely run into problems whenever a mysql query takes longer than the timeout (which tends to happen when you aren't looking).
Also, I've been very happy with using the zk gem and using lock.assert! to ensure you still have the lock.
@eric @slyphon We actually use postgres with DataMapper
Any other C extensions that do blocking network calls?
@mrgordon are you sure you're not using the mysql gem?!
I'm pretty sure the postgres gem can be run in a 1.8-friendly (non-blocking) mode; I'm not sure whether DM can, though. I'm way out of my depth here, but it might be worth checking whether, while one thread runs SELECT pg_sleep(45);, another thread can do, well, anything.
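That diagnostic generalizes beyond postgres: run the suspect call in one thread and check whether a second thread still makes progress. A stdlib-only sketch of the idea (using `sleep` as a stand-in for the query, since testing against a live database isn't possible here; the method name is mine, and note that on modern MRI with native threads only a C call that holds the GVL starves its siblings, whereas on 1.8 green threads any blocking C call does):

```ruby
# Run `suspect_call` in one thread for up to `duration` seconds while a
# probe thread counts ticks. If the probe never ticks, other threads were
# starved by the call. (Hypothetical helper, illustrative only.)
def other_threads_starved?(duration = 0.5, &suspect_call)
  ticks = 0
  probe = Thread.new do
    loop { ticks += 1; sleep 0.01 }
  end
  Thread.new(&suspect_call).join(duration)
  probe.kill
  ticks.zero?
end
```

With the pg gem you would pass a block that executes `SELECT pg_sleep(45);`; if the probe stops ticking, the extension is blocking the whole interpreter and the zookeeper event thread would be starved the same way.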
Well, we've just upgraded to 1.4.2. Still the same behavior.
Expected; nothing has changed in 1.8 threading, or in the way zookeeper uses threads, that would have helped.