
Kazoo uses the entire session timeout on each attempted TCP connection before negotiating a session #517

Closed
graysonchao opened this issue Jul 25, 2018 · 1 comment


@graysonchao

When initializing a KazooClient and connecting to a ZK ensemble, the user has two places where they can pass in a timeout:

  1. kazoo.client.KazooClient takes a timeout kwarg that specifies the requested client session timeout, which is used to negotiate a session timeout with the server. I'll refer to the value passed to the constructor as the client session timeout, and the eventual negotiated result as the negotiated session timeout.
  2. KazooClient.start takes a timeout kwarg that specifies how long to wait overall for a connection to ZK; the timer is not reset on retries or when switching to a different address. I'll call this the client start timeout. (Both are shown in the sketch below.)
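
To make the distinction concrete, here is a minimal sketch of where each timeout is passed; the host list and values are illustrative:

```python
from kazoo.client import KazooClient

# Client session timeout: the requested session timeout, used to negotiate
# the actual (negotiated) session timeout with the server.
client = KazooClient(
    hosts="zk1:2181,zk2:2181,zk3:2181",
    timeout=10.0,  # seconds
)

# Client start timeout: how long start() waits overall for a connection;
# the timer is not reset on retries or when switching hosts.
client.start(timeout=15)
```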

Before a session has ever been negotiated with the server (i.e. when trying to connect to ZK initially) Kazoo passes the client session timeout to kazoo.handlers.utils.create_tcp_connection, which means that the entire session timeout is used as the timeout for each attempted TCP connection. But later, the negotiated session timeout is divided by the number of hosts in the ensemble, and this divided value is used when retrying requests or recovering from a broken connection. Assuming that the client session timeout and the negotiated session timeout are equal, this means that the timeout for each attempted TCP connection is much shorter after a session has been negotiated.
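
To illustrate the asymmetry described above, here is a rough sketch of the behavior (not kazoo's actual code; the function and argument names are made up for illustration):

```python
def effective_connect_timeout(session_timeout, negotiated_timeout, hosts, have_session):
    """Approximate per-connection timeout as described above (illustrative only)."""
    if not have_session:
        # Before any session exists, the full client session timeout is used
        # for each attempted TCP connection.
        return session_timeout
    # After negotiation, the timeout is divided across the hosts in the ensemble.
    return negotiated_timeout / len(hosts)

# With a 10s timeout and a 3-host ensemble:
#   before negotiation -> 10s per connection attempt
#   after negotiation  -> ~3.3s per connection attempt
```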

This doesn't seem to make sense. To be consistent, it seems to me that Kazoo should divide the passed client session timeout by the number of hosts when trying to create an initial connection, so that it can try to connect to each host in the ensemble. Otherwise, when the client start timeout is equal to the client session timeout, a single failed connection can eat up the entire timeout and cause Kazoo to throw a timeout_exception even if there are healthy hosts in the connect string.

If a user sets a low client session timeout, they might run into issues where reconnecting or submitting requests times out much faster than they expect, because this timeout is later divided by the ensemble size. If they set the timeouts equal or have the client session timeout higher, then they could burn their entire KazooClient.start timeout on a single unresponsive host without ever trying the other hosts in the connect string.

One workaround is to set the client start timeout to NUM_HOSTS * the client session timeout, but this seems odd to me: why would we want a different timeout for creating the session than for keeping it alive?
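
For concreteness, that workaround would look something like this (host list and values are illustrative):

```python
from kazoo.client import KazooClient

hosts = "zk1:2181,zk2:2181,zk3:2181"
session_timeout = 10.0

client = KazooClient(hosts=hosts, timeout=session_timeout)
# Give start() enough budget to try every host, even if a single unresponsive
# host burns a full session timeout's worth of time.
client.start(timeout=len(hosts.split(",")) * session_timeout)
```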

At any rate, I'm not sure if this is intended behavior. Could someone please help me understand? Of course, if it's not intended, I'm happy to submit a PR.

I think this issue was touched on in #205, btw.

@StephenSorriaux
Member

Hi @graysonchao,

Thanks for opening this issue. It seems to me that the problem you are describing was fixed in PR #540: the initial Connect() request (which carries the client session timeout you are describing) is now sent with a timeout that takes into account the number of hosts in the ZK ensemble.

I'm closing the issue now but feel free to re-open it in case I misunderstood the underlying problem.
