allow timeout on API calls for pro-active backoff? #62

johnrfrank · 2013-03-20T12:06:22Z

Is there a way to set a timeout on calls like zk.get_children so that if the server is overloaded, the client application can proactively back off? Backing off when the connection is lost is also useful, but waiting for the connection to drop is sometimes too late.

For example, if there are many clients and only a small amount of data available in zk for them to operate on, then the clients need a way of backing off.

Let me provide more context:

We are using ZooKeeper to coordinate a list of tasks for a pool of workers.
The number of tasks starts off at 10^5 and there are 10^3 workers.
Workers try to "win" a task by picking taskID = random.choice(zk.get_children('available')) and then attempting zk.create('pending/%s' % taskID).
The worker compute processes are in far-flung data centers across the continent, and are heavily loaded when executing the task, so we have observed very long heartbeat intervals. To cope with this, we have set the session timeout to the absurdly long value of 15 minutes. This has actually worked well.
This simple locking approach works fine until there are 10^2 tasks remaining, at which point the 10^3 workers do not know that only one in ten of them will win a task. So they clobber the server. It's a thundering herd at the application level.
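For readers following along, the claiming step described above might look roughly like the sketch below. The function name `try_claim_task` and its parameters are hypothetical; the client is any object with `get_children` and `create` methods, and the "node already exists" exception is passed in (with kazoo this would presumably be `kazoo.exceptions.NodeExistsError`). The absolute paths are also an assumption.

```python
import random


def try_claim_task(zk, already_exists_exc,
                   available="/available", pending="/pending"):
    """Pick a random available task and try to claim it.

    Returns the task ID on success, or None if there was nothing to
    claim or another worker created the pending node first.
    """
    task_ids = zk.get_children(available)
    if not task_ids:
        return None
    task_id = random.choice(task_ids)
    try:
        # create() fails if the node already exists, so at most one
        # worker can win any given task.
        zk.create("%s/%s" % (pending, task_id))
    except already_exists_exc:
        return None
    return task_id
```

With 10^3 workers and 10^2 tasks left, most calls return None, and each of those still costs the server a full get_children round trip.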

So, I see two solutions:

Simplest: set a timeout on get_children, and when a worker hits the timeout, it should back off a lot, like several minutes.
More complex, but probably a better overall design: reorganize the logic to use something like leader election, where only the leader gets to take a task from the "available" pool and then steps down as leader.
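A minimal sketch of the first option, assuming the client exposes an asynchronous variant of get_children whose result object accepts a timeout (kazoo has `get_children_async`, and its async result has `get(block=True, timeout=None)`). The helper name `poll_children` and the backoff range are made up for illustration, and the timeout exception class is passed in because it depends on the handler and library version.

```python
import random
import time


def poll_children(zk, path, timeout_exc, call_timeout=5.0,
                  backoff=(120.0, 300.0), sleep=time.sleep):
    """Return the children of `path`, or None after a long backoff.

    If the server does not answer within `call_timeout` seconds, treat
    that as a sign of overload and sleep for a random number of seconds
    in the `backoff` range before returning None.
    """
    pending = zk.get_children_async(path)
    try:
        return pending.get(block=True, timeout=call_timeout)
    except timeout_exc:
        # Randomizing the sleep keeps the workers from all waking up
        # and hitting the server again at the same moment.
        sleep(random.uniform(*backoff))
        return None
```

Note that giving up on the result does not cancel the request; the server still does the work, so this only reduces follow-up load.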

Any other ideas?

ekimekim · 2013-06-25T07:49:27Z

Forgive me if I'm wrong, I only started using Kazoo recently, but doesn't kazoo.retry.RetrySleeper do exactly what you want? Also, note that you can create a custom KazooRetry that automatically retries with a backoff as per the RetrySleeper.

max deadline, transition properly when connection fails to LOST, and setup separate connection retry behavior from client command retry behavior. Patches by Mike Lundy.

bbangert · 2013-07-17T01:16:24Z

KazooRetry now supports time deadlines, so if retrying with backoff takes too long overall, that's now handled. This should address this situation.
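To illustrate the idea of retrying with backoff under an overall deadline, here is a plain-Python sketch. It is not kazoo's implementation; `retry_with_deadline`, `DeadlineExceeded`, and the parameter names are hypothetical, only loosely modeled on the kind of knobs a retry helper tends to offer.

```python
import random
import time


class DeadlineExceeded(Exception):
    """Raised when the next retry would run past the overall deadline."""


def retry_with_deadline(func, retry_on, max_tries=5, delay=0.1,
                        backoff=2.0, max_jitter=0.8, deadline=None,
                        sleep=time.sleep, clock=time.monotonic):
    """Call func() until it succeeds, backing off between failures.

    `deadline` is the total number of seconds allowed for all attempts.
    """
    stop_at = None if deadline is None else clock() + deadline
    attempt = 0
    while True:
        try:
            return func()
        except retry_on:
            attempt += 1
            if attempt >= max_tries:
                raise
            # Exponential backoff with jitter to spread the workers out.
            wait = delay * (1 + random.random() * max_jitter)
            if stop_at is not None and clock() + wait > stop_at:
                raise DeadlineExceeded("gave up after %d tries" % attempt)
            sleep(wait)
            delay *= backoff
```

A worker could wrap its get_children call in something like this and treat DeadlineExceeded as the signal to back off for several minutes.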

novas0x2a mentioned this issue Jun 26, 2013

Extend KazooRetry to support time deadlines #102

Closed

bbangert closed this as completed Jul 17, 2013
