
Best practice for Cluster Topology Refresh #339

Closed
oklahomer opened this issue Aug 23, 2016 · 3 comments
Labels
for: stackoverflow A question that is better suited to stackoverflow.com

@oklahomer commented Aug 23, 2016

This is a question rather than an issue or bug report.
My lettuce version: 4.2.1

Recently I found an issue discussing topology refresh and the use of ClusterClientOptions. In this comment @mp911de recommends enabling ClusterTopologyRefreshOptions#enableAllAdaptiveRefreshTriggers and setting a relatively long period for ClusterTopologyRefreshOptions#enablePeriodicRefresh, or even getting rid of periodic refresh entirely.

This makes sense to me because adaptive refresh is only triggered by MOVED, ASK, and reconnection-timeout events, so the cost is minimal. The official documentation also recommends this kind of strategy:

An alternative is to just refresh the whole client-side cluster layout using the CLUSTER NODES or CLUSTER SLOTS commands when a MOVED redirection is received. When a redirection is encountered, it is likely multiple slots were reconfigured rather than just one, so updating the client configuration as soon as possible is often the best strategy.
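
As I read it, that recommendation translates into something like the snippet below. This is only a sketch: the 15-minute period is an arbitrary value I picked to stand for "relatively long", and the enablePeriodicRefresh line could be dropped to disable periodic refresh entirely.

ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
                .enableAllAdaptiveRefreshTriggers()           // react to MOVED, ASK, and reconnect events
                .enablePeriodicRefresh(15, TimeUnit.MINUTES)  // long safety-net interval; omit to disable periodic refresh
                .build();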

So here comes my question.

What is the best practice for Cluster Topology Refresh settings?

Currently the wiki has the sample code below, which enables both periodic refresh and adaptive refresh triggers:
https://github.com/mp911de/lettuce/wiki/Client-options#cluster-specific-options

ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(30, TimeUnit.SECONDS)
                .enableAllAdaptiveRefreshTriggers()
                .build();

client.setOptions(ClusterClientOptions.builder()
                       .topologyRefreshOptions(topologyRefreshOptions)
                       .build());

I understand these are not necessarily the recommended settings for topology refresh, but this confuses me a bit, so let me make sure.

My assumptions are:

  • A client only has to set the adaptive refresh triggers, ClusterTopologyRefreshOptions#enableAllAdaptiveRefreshTriggers, for regular use, since they refresh the topology view on demand.
  • The application code is responsible for retrying commands that fail because of MOVED, ASK, or a reconnection timeout, while lettuce is responsible for refreshing the topology view with the above setting.
  • If the RedisClusterClient instance is used for pub/sub and is expected to reconnect on master-slave failover, the client still needs to set ClusterTopologyRefreshOptions#enablePeriodicRefresh with a reasonable interval, depending on the application, so that the subscribing connection can check its target node's state.

To supplement the 2nd assumption -- my understanding is still vague here, though -- ClusterClientOptions#cancelCommandsOnReconnectFailure is used to cancel enqueued commands when reconnection fails, and it doesn't handle the command retries caused by a topology refresh. It seems like com.lambdaworks.redis.cluster.PooledClusterConnectionProvider#getConnection still has a chance to throw an exception on the first connection fetch during or after a topology refresh. ref. gist
I still wonder how I should detect command failures caused by this particular case and handle retries.
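
For what it's worth, the application-side retry I have in mind is a small wrapper along the lines below. This is only a sketch under my own assumptions: RetryOnRefresh, the attempt count, and the backoff are illustrations of mine, not lettuce API; it simply retries a synchronous command when a RedisException surfaces while the topology is being refreshed.

import java.util.function.Supplier;

import com.lambdaworks.redis.RedisException;

// Hypothetical application-side helper, not part of lettuce.
public final class RetryOnRefresh {

    // Retries the given synchronous command up to maxAttempts times,
    // sleeping briefly between attempts while the topology settles.
    public static <T> T execute(Supplier<T> command, int maxAttempts) {
        RedisException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return command.get();
            } catch (RedisException e) {
                last = e; // e.g. connection fetch failed during or after a topology refresh
                try {
                    Thread.sleep(100L * attempt); // simple linear backoff
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
        throw last;
    }
}

It would be called as, e.g., RetryOnRefresh.execute(() -> sync.get("key"), 3), where sync is a synchronous cluster command API instance.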

To sum up, the option settings for regular usage become something like the following:

ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
                .enableAllAdaptiveRefreshTriggers()
                .build();

And for cluster pub/sub, it'll be something like the following:

ClusterTopologyRefreshOptions topologyRefreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(30, TimeUnit.SECONDS)
                .enableAllAdaptiveRefreshTriggers()
                .build();

Am I right about my assumptions above, and is there any recommended way to handle retries?
Thanks in advance.

@mp911de mp911de added the for: stackoverflow A question that is better suited to stackoverflow.com label Aug 23, 2016
@mp911de (Collaborator) commented Aug 23, 2016

Thanks for your question. You're dealing with a couple of issues here.

First, and most important, the attached gist depicts a bug. Would you care to file an issue?

About topology refreshing: There are different use-cases, and the only truly reliable mechanism is no topology changes and no refreshing at all. Every other approach tries to compensate in one way or another. I know that a frozen topology is not feasible in the real world, so we have to stick with one or the other refreshing mechanism.

My rationale for why the current refreshing mechanisms are compensations:

  1. Adaptive refreshing only works in the scope of the data/nodes a client accesses. If you never experience a disconnect or a redirection, the client is not able to discover a change. That's fine for slot changes, but in the case of a master/slave failover that happens before the client opens a connection, there's currently no chance to discover it. I think lettuce could improve here by checking the cluster node role after connecting.
  2. Periodic refresh is a generic attempt to catch 'em all. It comes at the cost of scheduling and cluster utilization. The connection is established in a blocking fashion (again, something lettuce could improve). Refreshing in a large cluster (say 200 nodes) previously opened up to 200 additional connections. Refreshing is now configurable to open connections only to the initially specified seed nodes (see the sketch after this list). Because of the scheduling nature, it's not possible to obtain changes in real time; they are always seen with a delay of up to the scheduling interval.
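
For illustration, limiting the periodic refresh to the seed nodes looks roughly like the snippet below. This is only a sketch against the 4.x API; the 60-second interval is an arbitrary example, and clusterClient stands in for the RedisClusterClient instance from the earlier snippets.

ClusterTopologyRefreshOptions refreshOptions = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(60, TimeUnit.SECONDS)  // example interval only
                .dynamicRefreshSources(false)                 // query only the initially specified seed nodes
                .build();

clusterClient.setOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(refreshOptions)
                .build());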

How to improve Redis Cluster topology updates:
Redis Sentinel utilizes Pub/Sub to communicate topology changes; something similar would be great for Redis Cluster.

About cancelCommandsOnReconnectFailure: That's a feature to release/cancel commands in case a reconnection attempt does not succeed. It's not Redis Cluster-specific; it protects the application from a server that went down and never comes back, so the application is not blocked permanently.
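
A rough sketch of where that flag sits (it lives on the client options builder, not on ClusterTopologyRefreshOptions; topologyRefreshOptions is assumed from the earlier snippets):

client.setOptions(ClusterClientOptions.builder()
                .cancelCommandsOnReconnectFailure(true)          // cancel queued commands when a reconnect attempt fails
                .topologyRefreshOptions(topologyRefreshOptions)
                .build());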

About your gist: The gist shows a bug. The command is redirected to a host that was not connected before. The connection attempt fails with a deadlock. This issue is not related to reconnection.

@oklahomer (Author)

Thanks for your immediate response.
I totally missed your point below.

If you never experience a disconnect or a redirection, the client is not able to discover a change. That's fine for slot changes, but in the case of a master/slave failover that happens before the client opens a connection, there's currently no chance to discover it.

As for the gist with the exception that I attached here, I filed a separate issue, #340, so I'm closing this one.
Thanks again for your detailed and immediate response.

@SushmaReddyLoka commented Aug 17, 2020

What does this mean?

The connection is established in a blocking fashion (again something lettuce could improve)

  • By any chance, has this been improved in version 5.3.0?

It comes at the cost of scheduling and cluster utilization.

  • How does Redis cluster utilization matter here? Can you please elaborate on this?

That's fine for slot changes, but in the case of a master/slave failover that happens before the client opens a connection, there's currently no chance to discover it. I think lettuce could improve here by checking the cluster node role after connecting.

Has any improvement been made here?
@mp911de
