
fix(core): Adjust connection timeout and retry logic #685

Open
wants to merge 1 commit into master from feat/connection_retry
Conversation

@ceache (Contributor) commented Nov 24, 2022

Why is this needed?

Do not use the session timeout as the connection timeout. This value is too large and results in a bad, non-responsive server holding up the client long enough for the session to time out.

Proposed Changes

  • Use the KazooRetry object to apply an increasing backoff timeout and cycle over all servers quickly, working around bad servers with minimal impact.

Does this PR introduce any breaking change?

This change should be transparent at a high level.
It makes the client connection attempts more aggressive, on purpose.
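
For context, a minimal sketch of constructing the KazooRetry object mentioned above and handing it to the client as its connection_retry. KazooRetry and the keyword arguments shown do exist in kazoo, but the specific values are illustrative and not taken from this PR:

from kazoo.client import KazooClient
from kazoo.retry import KazooRetry

# Illustrative values only: retry "forever" (max_tries=-1), starting at 100 ms
# and backing off exponentially, capped at 5 s per attempt, so a dead server is
# skipped quickly instead of consuming the whole session timeout.
connection_retry = KazooRetry(max_tries=-1, delay=0.1, backoff=2, max_delay=5)

zkc = KazooClient(
    hosts="nonresponsive:2181,goodhost:2181",
    timeout=60,  # requested session timeout, as in the issue report below
    connection_retry=connection_retry,
)
zkc.start()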

Resolved review threads (outdated): kazoo/retry.py, kazoo/tests/test_connection.py (×2)
@ceache force-pushed the feat/connection_retry branch 2 times, most recently from 4a89840 to 0170cdf on November 24, 2022 04:13
@ceache self-assigned this on Nov 24, 2022
@ceache (Contributor, Author) commented Nov 24, 2022

@python-zk/maintainers This is an early draft of a fix to the connection logic. The original issue that sent me down this path is as follows:

zkc = KazooClient(hosts=["nonresponsive:2181", "goodhost:2181"], timeout=60)
zkc.start()

On a new connection, this results in a painful 60-second delay wasted on the bad server before the good one is tried.
On a reconnect, it is worse: the connection gets stuck attempting the bad server with a TCP timeout of 60 seconds, i.e. equal to the session timeout. By the time it gives up and moves on to the good host, the session is already lost.

@codecov-commenter commented Nov 24, 2022

Codecov Report

Patch coverage: 100.00% and project coverage change: -2.26% ⚠️

Comparison is base (2c36d69) 96.65% compared to head (fd57818) 94.40%.

❗ Current head fd57818 differs from pull request most recent head e34e407. Consider uploading reports for the commit e34e407 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #685      +/-   ##
==========================================
- Coverage   96.65%   94.40%   -2.26%     
==========================================
  Files          27       58      +31     
  Lines        3554     8413    +4859     
==========================================
+ Hits         3435     7942    +4507     
- Misses        119      471     +352     
Impacted Files Coverage Δ
kazoo/protocol/connection.py 96.28% <100.00%> (-1.44%) ⬇️
kazoo/retry.py 100.00% <100.00%> (ø)
kazoo/tests/test_connection.py 100.00% <100.00%> (ø)

... and 41 files with indirect coverage changes


@ceache (Contributor, Author) commented Nov 24, 2022

I am pondering whether the code below should be changed too.

I really do not think session_timeout_in_sec / number_of_hosts is appropriate here.
I would like it to use the client's connection_retry, or even command_retry.

        connect_result, zxid = self._invoke(
            client._session_timeout / 1000.0 / len(client.hosts), connect
        )
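
For illustration, with the hypothetical values from the example above (a requested 60 s session timeout and two hosts), that expression gives each Connect attempt a 30-second budget:

# Hypothetical values matching the example above; not taken from the PR itself.
session_timeout_ms = 60 * 1000  # requested session timeout, in milliseconds
n_hosts = 2                     # ["nonresponsive:2181", "goodhost:2181"]

connect_timeout = session_timeout_ms / 1000.0 / n_hosts
print(connect_timeout)  # 30.0 seconds potentially wasted on an unhealthy server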

@ztzg (Contributor) left a comment:

LGTM, in general—except possibly for the default initial cur_delay. I have left a few comments.

-        connect_result, zxid = self._invoke(
-            client._session_timeout / 1000.0 / len(client.hosts), connect
-        )
+        connect_result, zxid = self._invoke(timeout, connect)
Contributor:

I agree that _session_timeout, being the requested session timeout and thus potentially arbitrarily larger than the actual session timeout, is an unfortunate choice.

Thinking about it some more, I am not sure that using the retry's cur_delay as-is is a good choice either. It may have been sufficient for the TCP (and potentially SSL) handshakes, but this request/response cycle involves a round trip through the JVM, disk logging, quorum packets (learner with global sessions), etc.

I can imagine many "false starts" (and thus gratuitous network & server activity) if the default cur_delay is kept too low.

Contributor:

Interestingly, _authenticate_with_sasl and AUTH_XID below reuse connect_timeout—which, at least, is related to the negotiated timeout—for "numerous" write and read operations which have to complete in a timely manner, as they do not "touch" the session.

The good news is that the session is "fresh" at that point. But in theory, the negotiated timeout should be used to determine a deadline for the whole sequence, so that it aborts before causing a session expiry if e.g. SASL authentication proceeds too slowly.
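
A minimal sketch of the deadline idea described above. The helper below is illustrative and not part of kazoo's API; the point is that one absolute deadline, derived from the negotiated timeout, bounds every step of the sequence:

import time

def remaining(deadline):
    # Illustrative helper: time left before the deadline; give up once exhausted.
    left = deadline - time.monotonic()
    if left <= 0:
        raise TimeoutError("handshake exceeded the negotiated session timeout")
    return left

# Hypothetical usage: bound the whole post-Connect sequence (auth, SASL, ...) by
# the negotiated session timeout instead of reusing connect_timeout per operation.
negotiated_timeout = 10.0  # seconds, assumed for illustration
deadline = time.monotonic() + negotiated_timeout
# Each subsequent read/write would then use something like:
#     sock.settimeout(remaining(deadline))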

Contributor:

(Note that this is not handled in the other clients either.)

Contributor Author:

I think that the "session timeout" is an indication of what you want (as you point out, before negotiation against the server's config) for how long your session will remain "alive" if the client disconnects. In other words, how much time the client has to reconnect.

I have a more "radical" view than yours, I think, which is that it has nothing to do with how long it takes to establish a TCP connection between the client and the servers.

The increasing-backoff approach is a good way to adapt to ZooKeeper servers that may be deployed on the local network (~10s of ms), across data centers (~100s of ms), or across continents. The default value is more fitting for the first two use cases and, yes, possibly less so for wider deployments, but for such setups it is easy to tweak that value.

That is my opinion for the TCP connection timeout.

The request timeout, which applies once the TCP connection is established, is something else.
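
To illustrate how quickly an increasing backoff adapts across those latency regimes, here is a small sketch of the per-attempt delays starting at 100 ms with a doubling factor (jitter and the maximum-delay cap of the real KazooRetry are ignored here):

def backoff_schedule(initial=0.1, factor=2.0, tries=6):
    # Successive per-attempt delays, ignoring jitter and any maximum-delay cap.
    delay = initial
    for _ in range(tries):
        yield delay
        delay *= factor

print(list(backoff_schedule()))
# Roughly [0.1, 0.2, 0.4, 0.8, 1.6, 3.2]: a LAN or same-DC host answers within
# the first try or two, while a cross-continent host is reached after a few
# cheap retries.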

Contributor Author:

For request/reply messages on an established TCP connection, including Connect requests.

The original code used client._session_timeout / 1000.0 / len(client.hosts) for the Connect (pre-negotiation).

After that, all other request/reply exchanges use these two values, connect_timeout for writes and read_timeout for reads:

connect_timeout = negotiated_session_timeout / len(client.hosts)
read_timeout = negotiated_session_timeout * 2.0 / 3.0

My rationale for changing the timeout on the first Connect is that it is possible to have a healthy host (which accepts the TCP connection) but an unhealthy JVM/ZooKeeper that does not handle the Connect request.

In such a case, the client would wait a long time, much of its total session timeout, before timing out and moving on to the next server.
Waiting arbitrarily long, when we already know how quickly a successful TCP connect completes, is not ideal, especially when we have not yet settled on a server and there could be others.

But I could be convinced that this should be part of a different change.
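
For reference, a small sketch of those two derived values, using the formulas quoted above; the negotiated 10 s session timeout and two hosts are assumed for illustration:

negotiated_session_timeout = 10.0  # seconds, assumed for illustration
n_hosts = 2                        # len(client.hosts)

connect_timeout = negotiated_session_timeout / n_hosts   # 5.0 s budget for writes
read_timeout = negotiated_session_timeout * 2.0 / 3.0    # ~6.67 s budget for reads
print(connect_timeout, read_timeout)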

@@ -614,7 +614,9 @@ def _connect_attempt(self, host, hostip, port, retry):

         try:
             self._xid = 0
-            read_timeout, connect_timeout = self._connect(host, hostip, port)
+            read_timeout, connect_timeout = self._connect(
+                host, hostip, port, timeout=retry.cur_delay
Contributor:

The logic looks good, but the default initial _cur_delay is 100ms. Do we really want to keep it that low?

@ceache marked this pull request as ready for review on January 30, 2023 06:03

jeffwidman previously approved these changes Jan 30, 2023

@jeffwidman (Member) left a comment:

At a high-level this makes complete sense and the implementation looks solid. Reality of course is that timeouts are always tricky, so we may need to further tweak once this is live in prod and folks start to see how it behaves.

Commit message:

Do not use the session timeout as connection timeout.
This value is too large and results in a bad non-responsive server
holding up the client long enough for the session to timeout.

Use the KazooRetry object to use an increasing backoff timeout and cycle
over all servers quickly, working around bad servers with minimal
impact.
@StephenSorriaux (Member) commented:
Just like the others, I think the changes (+ explanations) make a lot of sense. Although, just like @ztzg commented, the new default value might seem a little low compared to the previous one (100ms vs 10s). I like the changes making the client less lazy, but I am not sure that kind of default value would suit every current possible use of the client and, if possible, I would prefer we try not to change it.

I noticed that in ConnectionHandler, the retry_sleeper is only used for the zk_loop; do you think some logic around the initial connection_retry value passed to the client would be doable/make sense? My (possibly naive) idea would be to use your new behavior if connection_retry is given (i.e. the user was kind of expecting this behavior), but keep the previous value if connection_retry is None (i.e. just keep things like before).
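
A minimal sketch of that suggestion; the helper name and the user_supplied_retry flag are hypothetical, not code from this PR:

def choose_connect_timeout(client, retry, user_supplied_retry):
    # Illustrative helper (not in kazoo): pick the per-attempt Connect timeout.
    if user_supplied_retry:
        # Suggested new behavior: bound the attempt by the retry's backoff delay.
        return retry.cur_delay
    # Previous behavior: a share of the requested session timeout per host.
    return client._session_timeout / 1000.0 / len(client.hosts)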

@ztzg (Contributor) commented Jun 13, 2024

Hi @StephenSorriaux, all,

It would be great to have this fixed in master, as teams using unpatched versions of Kazoo regularly experience "outages."

> Just like the others, I think the changes (+ explanations) make a lot of sense. Although, just like @ztzg commented, the new default value might seem a little low compared to the previous one (100ms vs 10s).

Note that 10s is (supposed to be) the session lifetime, and that by using it as a connection timeout, the session is dead by the time any unresponsive TCP/IP operation aborts. Which is… suboptimal!

I would not have dared 100ms, but agree with @ceache that the backoff should cause clients to quickly adapt to the actual network.

In any case, I vote for it over repeated puzzled users :)

> I like the changes making the client less lazy, but I am not sure that kind of default value would suit every current possible use of the client and, if possible, I would prefer we try not to change it.

As Jeff wrote, "timeouts are always tricky"—but I don't understand why we would want to keep using this value, as it basically guarantees that the session is… dead… by the deadline. Am I missing something?

It seems that the default value should ideally be below (lifetime / n_hosts), to give the client a chance to try each host once—and even lower if we want to allow multiple "rounds."

For the record, the Java client initially sets the TCP/IP connect timeout to (requested_lifetime / n_hosts), and adjusts it to (negotiated_lifetime / n_hosts) once the negotiated lifetime is known—but it does not apply backoff.
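
For comparison, a small sketch of that Java-style calculation; the 12 s requested lifetime, 10 s negotiated lifetime, and 3 hosts are assumed for illustration:

requested_lifetime = 12.0   # seconds, requested session timeout (assumed)
negotiated_lifetime = 10.0  # seconds, as negotiated with the server (assumed)
n_hosts = 3

initial_connect_timeout = requested_lifetime / n_hosts    # 4.0 s before negotiation
adjusted_connect_timeout = negotiated_lifetime / n_hosts  # ~3.33 s once negotiated
print(initial_connect_timeout, adjusted_connect_timeout)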

> I noticed that in ConnectionHandler, the retry_sleeper is only used for the zk_loop; do you think some logic around the initial connection_retry value passed to the client would be doable/make sense? My (possibly naive) idea would be to use your new behavior if connection_retry is given (i.e. the user was kind of expecting this behavior), but keep the previous value if connection_retry is None (i.e. just keep things like before).

This would alleviate the need for a patched Kazoo, but would IMHO still be a suboptimal default.

Cheers, -D

Labels: none yet
Projects: none yet
Linked issues: none yet
6 participants