When initializing a KazooClient and connecting to a ZK ensemble, the user has two places where they can pass in a timeout:
kazoo.client.KazooClient takes a timeout kwarg that specifies the requested client session timeout, which is used to negotiate a session timeout with the server. I'll refer to what is passed in to the constructor as the client session timeout, and the eventual negotiated result as the negotiated session timeout.
KazooClient.start takes a timeout kwarg that specifies how long to wait for a connection to ZK overall, without resetting the timer on retries/switching to a different address. I'll call this the client start timeout.
Before a session has ever been negotiated with the server (i.e. when trying to connect to ZK initially) Kazoo passes the client session timeout to kazoo.handlers.utils.create_tcp_connection, which means that the entire session timeout is used as the timeout for each attempted TCP connection. But later, the negotiated session timeout is divided by the number of hosts in the ensemble, and this divided value is used when retrying requests or recovering from a broken connection. Assuming that the client session timeout and the negotiated session timeout are equal, this means that the timeout for each attempted TCP connection is much shorter after a session has been negotiated.
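To make the asymmetry concrete, here is a small sketch in plain Python with made-up numbers (a 10-second session timeout and a 3-host ensemble); the division mirrors the behavior I'm describing, not kazoo's exact code:

```python
# Hypothetical numbers: 10-second session timeout, 3-host ensemble.
session_timeout = 10.0
num_hosts = 3

# Before a session exists: each TCP connect attempt gets the full
# client session timeout.
initial_connect_timeout = session_timeout

# After negotiation: the negotiated session timeout is divided across
# the ensemble, so each attempt gets a much shorter window.
per_host_connect_timeout = session_timeout / num_hosts

print(initial_connect_timeout)   # 10.0
print(per_host_connect_timeout)  # ~3.33
```

So with identical client and negotiated session timeouts, a connect attempt that was allowed 10 seconds before negotiation is only allowed about 3.3 seconds afterward.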
This doesn't seem to make sense. To be consistent, it seems to me that Kazoo should divide the passed client session timeout by the number of hosts when trying to create an initial connection, so that it can try to connect to each host in the ensemble. Otherwise, when the client start timeout is equal to the client session timeout, a single failed connection can eat up the entire timeout and cause Kazoo to throw a timeout_exception even if there are healthy hosts in the connect string.
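The failure mode can be sketched with a toy simulation (pure Python; the blocking is modeled with a fake elapsed-time counter, and the host names and numbers are hypothetical):

```python
# Simulate start(): iterate over hosts, where each connect attempt to a
# dead host blocks for the full per-attempt timeout, and give up once the
# overall start-timeout budget is exhausted.
def simulate_start(hosts, start_timeout, per_attempt_timeout, dead):
    elapsed = 0.0
    for host in hosts:
        if elapsed >= start_timeout:
            return None  # timeout_exception: budget exhausted
        if host in dead:
            elapsed += per_attempt_timeout  # attempt blocks until its timeout
        else:
            return host  # healthy host: connection succeeds
    return None

hosts = ["zk1", "zk2", "zk3"]  # hypothetical ensemble
session_timeout = 10.0

# Per-attempt timeout == full session timeout, start timeout == session
# timeout: the single dead host zk1 burns the entire budget before zk2
# is ever tried.
print(simulate_start(hosts, 10.0, session_timeout, {"zk1"}))      # None

# Dividing the per-attempt timeout by the ensemble size leaves room to
# try every host within the same overall budget.
print(simulate_start(hosts, 10.0, session_timeout / 3, {"zk1"}))  # zk2
```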
If a user sets a low client session timeout, they might run into issues where reconnecting or submitting requests times out much faster than they expect, because this timeout is later divided by the ensemble size. If they set the timeouts equal or have the client session timeout higher, then they could burn their entire KazooClient.start timeout on a single unresponsive host without ever trying the other hosts in the connect string.
One workaround is to set the client start timeout to NUM_HOSTS * client session timeout, but this seems weird to me: why would we want a different timeout for creating the session than for keeping it alive?
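For reference, the workaround amounts to scaling the start timeout by the ensemble size so that every host can be tried at the full per-attempt timeout (hypothetical numbers; the kazoo calls are shown only in comments, not executed):

```python
# Hypothetical values for the workaround: scale the start timeout by the
# ensemble size so each of the 3 hosts can be tried at the full
# session timeout.
num_hosts = 3
session_timeout = 10.0
start_timeout = num_hosts * session_timeout  # 30.0

# With kazoo this would look roughly like (requires a running ensemble):
#   client = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181",
#                        timeout=session_timeout)
#   client.start(timeout=start_timeout)
print(start_timeout)  # 30.0
```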
At any rate, I'm not sure if this is intended behavior. Could someone please help me understand? Of course, if it's not intended, I'm happy to submit a PR.
Thanks for your issue. It seems to me that the problem you are describing has been fixed in PR #540. The initial Connect() request (which defines the client session timeout you are describing) is now sent with a timeout that takes into account the number of hosts in the ZK ensemble.
I'm closing the issue now but feel free to re-open it in case I misunderstood the underlying problem.
I think this issue was touched on in #205, btw.