Delay in cluster client detecting failover? #2308
Adaptive refresh triggers have a timeout of 30 seconds to back off from event bursts. This is to protect the cluster from rapid connects/disconnects and topology retrievals.
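For reference, that back-off window is configurable through `ClusterTopologyRefreshOptions`. A minimal sketch (the 30-second value mirrors the default described above; `clusterClient` is a placeholder for an existing `RedisClusterClient`):

```java
import java.time.Duration;
import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;

ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
        .enableAllAdaptiveRefreshTriggers()
        // Back-off window between adaptive refreshes; 30 seconds is the default
        .adaptiveRefreshTriggersTimeout(Duration.ofSeconds(30))
        .build();

clusterClient.setOptions(ClusterClientOptions.builder()
        .topologyRefreshOptions(refreshOptions)
        .build());
```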
thank you @mp911de - does explicitly calling refreshPartitions() share this same timeout? To add some additional context: anytime there's an exception from any command, I'm sleeping for 4s, calling refreshPartitions(), and retrying. However, even after the failover is complete and the Redis server is stable, it still sometimes takes my client 30+ seconds to adjust. Is there any obvious reason for this lag?
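The retry pattern described above might look roughly like this (a sketch only; `clusterClient` and `command` are hypothetical placeholders, interruption and error handling are simplified, and only a single retry is shown):

```java
// Hypothetical retry wrapper, as described above: on any command exception,
// wait 4 s, force a topology refresh, then retry the command once.
try {
    return command.call();
} catch (RedisException e) {
    Thread.sleep(4000);                 // back off before refreshing
    clusterClient.refreshPartitions();  // force a topology refresh
    return command.call();              // retry the command
}
```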
We are seeing similar behaviour on our Redis cluster of seven nodes: 3 × primary and 4 × replica. When we fail over one of our primaries to a replica, Lettuce threads hang for 30-60s and then come right, whereas the Redis failover appears to occur within ~1s. We have set:

```java
.enableAllAdaptiveRefreshTriggers()
.adaptiveRefreshTriggersTimeout(Duration.ofSeconds(2))
```

to attempt to encourage Lettuce to unblock itself, but without success. We would love to get to the bottom of this, as it makes it challenging for us to maintain Redis nodes. We can reproduce the issue easily in our setup and we're happy to test configuration! This is our full client config:

```java
final ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
        .enablePeriodicRefresh(Duration.of(15, ChronoUnit.MINUTES))
        .enableAllAdaptiveRefreshTriggers()
        .adaptiveRefreshTriggersTimeout(Duration.ofSeconds(2))
        .closeStaleConnections(true)
        .build();

final SocketOptions socketOptions = SocketOptions.builder()
        .connectTimeout(Duration.of(5, ChronoUnit.SECONDS))
        .tcpNoDelay(true)
        .build();

client.setOptions(ClusterClientOptions.builder()
        .autoReconnect(true)
        /* Filter out failed nodes https://github.com/lettuce-io/lettuce-core/issues/2318 */
        .nodeFilter(node -> !node.getFlags().contains(NodeFlag.FAIL))
        .timeoutOptions(TimeoutOptions.enabled())
        .pingBeforeActivateConnection(true)
        .topologyRefreshOptions(topologyRefreshOptions)
        .socketOptions(socketOptions)
        .build());
```
Can you record debug logs from the failover period? If we can see what happens under the hood, we might be able to identify what takes so long.
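Lettuce logs through SLF4J, so one way to capture such logs is to raise the `io.lettuce.core` logger to DEBUG for the duration of the failover test. A sketch assuming Logback is the SLF4J backend:

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.Logger;
import org.slf4j.LoggerFactory;

// Raise Lettuce's loggers to DEBUG programmatically (assumes Logback
// as the SLF4J backend; the cast fails with other backends).
Logger lettuceLogger = (Logger) LoggerFactory.getLogger("io.lettuce.core");
lettuceLogger.setLevel(Level.DEBUG);
```

The same effect can be achieved declaratively in the logging backend's configuration file; either way, the topology-refresh and reconnect activity around the failover window is what's of interest here.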
If you would like us to look at this issue, please provide the requested information. If the information is not provided within the next 30 days, this issue will be closed.
I have a cluster client configured with enableAllAdaptiveRefreshTriggers() and also call refreshPartitions() on every retry. From a Redis server perspective, failover occurs in < 5 seconds, but the client takes upwards of 30 seconds to adjust. Is this gap between client and server expected with this configuration, and if so, why? (Apologies if this is outlined in documentation somewhere; I was unable to find it.)