Delay in cluster client detecting failover? #2308
Adaptive refresh triggers have a timeout of 30 seconds to back off from event bursts. This is to protect the cluster from rapid connects/disconnects and topology retrievals.
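For reference, that back-off window is configurable through `ClusterTopologyRefreshOptions`. A minimal sketch (the 30-second value mirrors the default described above; `clusterClient` is a placeholder for an existing `RedisClusterClient`):

```java
import java.time.Duration;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;

ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
        .enableAllAdaptiveRefreshTriggers()
        // Back-off window between adaptive refreshes; 30 seconds is the default
        .adaptiveRefreshTriggersTimeout(Duration.ofSeconds(30))
        .build();

clusterClient.setOptions(ClusterClientOptions.builder()
        .topologyRefreshOptions(refreshOptions)
        .build());
```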
thank you @mp911de - does explicitly calling refreshPartitions() share this same timeout? To add some additional context: anytime there's an exception from any command, I'm sleeping for 4s, calling refreshPartitions(), and retrying. However, even after the failover is complete and the Redis server is stable, it still sometimes takes my client 30+ seconds to adjust. Is there any obvious reason for this lag?
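The retry pattern described above might look roughly like this (a sketch only; `clusterClient` and `command` are hypothetical placeholders, interruption and error handling are simplified, and only a single retry is shown):

```java
// Hypothetical retry wrapper, as described above: on any command exception,
// wait 4 s, force a topology refresh, then retry the command once.
try {
    return command.call();
} catch (RedisException e) {
    Thread.sleep(4000);                 // back off before refreshing
    clusterClient.refreshPartitions();  // force a topology refresh
    return command.call();              // retry the command
}
```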
We are seeing similar behaviour on our Redis cluster of seven nodes: 3 × primary and 4 × replica. When we fail over one of our primaries to a replica, Lettuce threads hang for 30-60s and then come right, whereas the Redis failover appears to occur within ~1s. We have set:

```java
.enableAllAdaptiveRefreshTriggers()
.adaptiveRefreshTriggersTimeout(Duration.ofSeconds(2))
```

to attempt to encourage Lettuce to unblock itself, but without success. We would love to get to the bottom of this, as it makes it challenging for us to maintain Redis nodes. We can reproduce the issue easily in our setup and we're happy to test configuration! This is our full client config:

```java
final ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
        .enablePeriodicRefresh(Duration.of(15, ChronoUnit.MINUTES))
        .enableAllAdaptiveRefreshTriggers()
        .adaptiveRefreshTriggersTimeout(Duration.ofSeconds(2))
        .closeStaleConnections(true)
        .build();

final SocketOptions socketOptions = SocketOptions.builder()
        .connectTimeout(Duration.of(5, ChronoUnit.SECONDS))
        .tcpNoDelay(true)
        .build();

client.setOptions(ClusterClientOptions.builder()
        .autoReconnect(true)
        /* Filter out failed nodes https://github.com/lettuce-io/lettuce-core/issues/2318 */
        .nodeFilter(node -> !node.getFlags().contains(NodeFlag.FAIL))
        .timeoutOptions(TimeoutOptions.enabled())
        .pingBeforeActivateConnection(true)
        .topologyRefreshOptions(topologyRefreshOptions)
        .socketOptions(socketOptions)
        .build());
```
Can you record debug logs from the failover period? If we can see what happens under the hood, we might be able to identify what takes so long.
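Lettuce logs through SLF4J, so one way to capture such logs is to raise the `io.lettuce.core` logger to DEBUG for the duration of the failover test. A sketch assuming Logback is the SLF4J backend:

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

// Raise Lettuce's loggers to DEBUG programmatically (assumes Logback
// as the SLF4J backend; the cast fails with other backends).
Logger lettuceLogger = (Logger) LoggerFactory.getLogger("io.lettuce.core");
lettuceLogger.setLevel(Level.DEBUG);
```

The same effect can be achieved declaratively in the logging backend's configuration file; either way, the topology-refresh and reconnect activity around the failover window is what's of interest here.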
If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 30 days, this issue will be closed.
I have a cluster client configured with enableAllAdaptiveRefreshTriggers() and also call refreshPartitions() on every retry. From a Redis server perspective, failover occurs in < 5 seconds, but the client takes upwards of 30 seconds to adjust. Is this gap between client and server expected with this configuration, and if so, why? (Apologies if this is outlined in documentation somewhere; I was unable to find it.)