
Delay in cluster client detecting failover? #2308

Closed
emilykimhuynh opened this issue Jan 30, 2023 · 5 comments
Labels
status: feedback-reminder (We've sent a reminder that we need additional information before we can continue)
status: waiting-for-feedback (We need additional information before we can continue)

Comments

@emilykimhuynh

I have a cluster client configured with enableAllAdaptiveRefreshTriggers(), and I also call refreshPartitions() on every retry. From the Redis server's perspective, failover completes in under 5 seconds, but the client takes upward of 30 seconds to adjust. Is this gap between client and server expected with this configuration, and if so, why? (Apologies if this is outlined in the documentation somewhere; I was unable to find it.)

@mp911de
Collaborator

mp911de commented Jan 31, 2023

Adaptive refresh triggers have a timeout of 30 seconds to back off from event bursts. This is to protect the cluster from rapid connects/disconnects and topology retrievals.
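
For reference, a minimal sketch of shortening that back-off via adaptiveRefreshTriggersTimeout(); the 5-second value is only an example, and clusterClient is assumed to be an existing RedisClusterClient:

final ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
	.enableAllAdaptiveRefreshTriggers()
	/* shorten the adaptive-refresh back-off from the 30-second default */
	.adaptiveRefreshTriggersTimeout(Duration.ofSeconds(5))
	.build();

clusterClient.setOptions(ClusterClientOptions.builder()
	.topologyRefreshOptions(refreshOptions)
	.build());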

@emilykimhuynh
Author

emilykimhuynh commented Jan 31, 2023

Thank you @mp911de. Does explicitly calling refreshPartitions() share this same timeout? To add some context: any time a command throws an exception, I sleep for 4 seconds, call refreshPartitions(), and retry. However, even after the failover is complete and the Redis server is stable, it still sometimes takes my client 30+ seconds to adjust. Is there any obvious reason for this lag?
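
(For illustration, a minimal sketch of the retry pattern described above; the method name, retry count, and key lookup are illustrative, not the actual application code.)

import java.time.Duration;

import io.lettuce.core.RedisException;
import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.api.StatefulRedisClusterConnection;

String readWithRetry(RedisClusterClient clusterClient,
		StatefulRedisClusterConnection<String, String> connection, String key) throws InterruptedException {
	final int maxRetries = 5; // illustrative value
	for (int attempt = 0; attempt < maxRetries; attempt++) {
		try {
			return connection.sync().get(key);
		} catch (RedisException e) {
			Thread.sleep(Duration.ofSeconds(4).toMillis()); // back off before retrying
			clusterClient.refreshPartitions(); // explicitly refresh the cluster topology
		}
	}
	throw new IllegalStateException("Command still failing after " + maxRetries + " retries");
}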

@karlvr

karlvr commented May 29, 2023

We are seeing similar behaviour on our Redis cluster of seven nodes: 3 primaries and 4 replicas. When we fail over one of our primaries to a replica, Lettuce threads hang for 30-60 s and then recover on their own, whereas the Redis failover itself appears to complete within ~1 s.

We have set:

	.enableAllAdaptiveRefreshTriggers()
	.adaptiveRefreshTriggersTimeout(Duration.ofSeconds(2))

... to try to encourage Lettuce to unblock itself, but without success. We would love to get to the bottom of this, as it makes it challenging for us to maintain our Redis nodes. We can reproduce the issue easily in our setup, and we're happy to test configuration changes!

This is our full client config:

final ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
	.enablePeriodicRefresh(Duration.of(15, ChronoUnit.MINUTES))
	.enableAllAdaptiveRefreshTriggers()
	.adaptiveRefreshTriggersTimeout(Duration.ofSeconds(2))
	.closeStaleConnections(true)
	.build();

final SocketOptions socketOptions = SocketOptions.builder()
	.connectTimeout(Duration.of(5, ChronoUnit.SECONDS))
	.tcpNoDelay(true)
	.build();

client.setOptions(ClusterClientOptions.builder()
	.autoReconnect(true)
	/* Filter out failed nodes https://github.com/lettuce-io/lettuce-core/issues/2318 */
	.nodeFilter(node -> !node.getFlags().contains(NodeFlag.FAIL))
	.timeoutOptions(TimeoutOptions.enabled())
	.pingBeforeActivateConnection(true)
	.topologyRefreshOptions(topologyRefreshOptions)
	.socketOptions(socketOptions)
	.build());

@mp911de
Collaborator

mp911de commented May 30, 2023

Can you record debug logs from the failover period? If we see what happens under the hood, then we might be able to identify what takes so long.
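
One way to capture those logs (assuming Logback as the SLF4J backend; an equivalent logger entry in logback.xml works too) is to raise Lettuce's loggers to DEBUG before triggering the failover:

import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

// Raise Lettuce's loggers to DEBUG so topology refreshes and reconnects show up in the log.
Logger lettuceLogger = (Logger) LoggerFactory.getLogger("io.lettuce.core");
lettuceLogger.setLevel(Level.DEBUG);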

@mp911de added the status: waiting-for-feedback label on May 30, 2023
github-actions bot commented May 30, 2024

If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 30 days this issue will be closed.

github-actions bot added the status: feedback-reminder label on May 30, 2024