ConnectionWatchdog tries to reconnect to the node's previous IP #1343

Closed
DamianFekete opened this issue Jul 14, 2020 · 5 comments
Labels
type: bug A general bug

Comments

@DamianFekete

Bug Report

Lettuce's ConnectionWatchdog keeps trying to connect to old IPs of Redis nodes.

Current Behavior

  • Cluster with 6 nodes (3 master nodes, each with one replica), running under Kubernetes
  • Redis nodes are restarted one by one, each getting a new IP address
  • The ConnectionWatchdog doesn't seem to use this information and keeps trying to connect to the previous IPs.

RoundRobinSocketAddressSupplier says that the IP address of 3da56d06b4a34c20ee560d3ed28a2679ba089a30 is 10.6.21.237, but RedisStateMachine has it as 10.6.37.76.

Log messages
09:08:44.918    DEBUG           RedisStateMachine        Decoded LatencyMeteredCommand [type=CLUSTER, output=StatusOutput [output=
ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 10.6.34.87:6379@16379 master - 0 1594717721000 90 connected 10923-16383
72641dc0aa44972fd1ffd31ad1342cbcfd01b4fb 10.6.26.31:6379@16379 myself,slave ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 0 1594717722000 84 connected
9bc0e2cb46f424cac90d8816485d1a2728919765 10.6.13.141:6379@16379 slave b4cb4eb98ca653e888d3b2ec931898ab7c97b867 0 1594717723043 92 connected
b4cb4eb98ca653e888d3b2ec931898ab7c97b867 10.6.21.140:6379@16379 master - 0 1594717722039 92 connected 0-5460
fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 10.6.11.155:6379@16379 master - 0 1594717721036 86 connected 5461-10922
3da56d06b4a34c20ee560d3ed28a2679ba089a30 10.6.37.76:6379@16379 slave fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 0 1594717724047 86 connected
, error='null'], commandType=io.lettuce.core.cluster.topology.TimedAsyncCommand], empty stack: true

09:09:08.805    DEBUG           RoundRobinSocketAddressSupplier                Resolved SocketAddress 10.6.21.237:6379 using for Cluster node 3da56d06b4a34c20ee560d3ed28a2679ba089a30
09:09:08.806    DEBUG           ReconnectionHandler      Reconnecting to Redis at 10.6.21.237:6379

09:09:08.844    DEBUG           RedisStateMachine         Decoded LatencyMeteredCommand [type=CLUSTER, output=StatusOutput [output=
ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 10.6.34.87:6379@16379 master - 0 1594717747000 90 connected 10923-16383
9bc0e2cb46f424cac90d8816485d1a2728919765 10.6.13.141:6379@16379 slave b4cb4eb98ca653e888d3b2ec931898ab7c97b867 0 1594717745384 92 connected
3da56d06b4a34c20ee560d3ed28a2679ba089a30 10.6.37.76:6379@16379 slave fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 0 1594717747390 86 connected
fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 10.6.11.155:6379@16379 myself,master - 0 1594717744000 86 connected 5461-10922
b4cb4eb98ca653e888d3b2ec931898ab7c97b867 10.6.21.140:6379@16379 master - 0 1594717748401 92 connected 0-5460
72641dc0aa44972fd1ffd31ad1342cbcfd01b4fb 10.6.26.31:6379@16379 slave ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 0 1594717747000 90 connected
, error='null'], commandType=io.lettuce.core.cluster.topology.TimedAsyncCommand], empty stack: true

09:09:18.824    DEBUG           ConnectionWatchdog      [channel=0x03ba1129, /10.6.13.138:40214 -> /10.6.21.237:6379, last known addr=/10.6.21.237:6379] scheduleReconnect()

09:09:18.824    DEBUG           ConnectionWatchdog       Cannot reconnect to [10.6.21.237:6379]: connection timed out: /10.6.21.237:6379 io.netty.channel.ConnectTimeoutException: connection timed out: /10.6.21.237:6379
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe$1.run(AbstractNioChannel.java:261)
        at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
        at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
        at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.lang.Thread.run(Thread.java:834)

09:09:18.825    DEBUG           ConnectionWatchdog        [channel=0x03ba1129, /10.6.13.138:40214 -> /10.6.21.237:6379, last known addr=/10.6.21.237:6379] Reconnect attempt 61, delay 30000ms

Lettuce Configuration

Relevant Spring Boot Configuration
spring.redis.lettuce.cluster.refresh.adaptive=true
spring.redis.lettuce.cluster.refresh.period=1M

custom.redis.lettuce.cluster.refresh.dynamic-sources=false (dynamicRefreshSources)

spring.redis.cluster.nodes=\
  redis-cluster-0.redis-cluster:6379,\
  redis-cluster-1.redis-cluster:6379,\
  redis-cluster-2.redis-cluster:6379,\
  redis-cluster-3.redis-cluster:6379,\
  redis-cluster-4.redis-cluster:6379,\
  redis-cluster-5.redis-cluster:6379
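
For context, a rough sketch (my assumption, not verified against the Spring Boot property mapping) of the ClusterTopologyRefreshOptions this configuration is meant to correspond to:

import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import java.time.Duration;

class RefreshOptionsSketch {

    static ClusterClientOptions clientOptions() {
        ClusterTopologyRefreshOptions refresh = ClusterTopologyRefreshOptions.builder()
                .enablePeriodicRefresh(Duration.ofMinutes(1)) // refresh.period=1M (presumably one minute)
                .enableAllAdaptiveRefreshTriggers()           // refresh.adaptive=true
                .dynamicRefreshSources(false)                 // custom dynamic-sources property
                .build();

        return ClusterClientOptions.builder()
                .topologyRefreshOptions(refresh)
                .build();
    }
}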

Expected behavior/code

ConnectionWatchdog should stop trying to connect to a previous IP address of a Redis node (which is now known to have another IP address).

Environment

  • Lettuce version(s): 5.3.1.RELEASE
  • Redis version: 6.0.1
  • Spring Boot: 2.3.1
DamianFekete added the type: bug A general bug label Jul 14, 2020
@mp911de
Collaborator

mp911de commented Jul 14, 2020

The round-robin mechanism uses the given seed nodes for reconnects, meaning that DNS caching needs to be considered. That mechanism is used for the default connection, which is used for key-less commands.
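
One way to take JVM-level DNS caching out of the picture (a sketch, not something from this issue) is to let Lettuce resolve hostnames through DirContextDnsResolver, which performs a fresh DNS lookup instead of going through the InetAddress cache:

import io.lettuce.core.resource.ClientResources;
import io.lettuce.core.resource.DefaultClientResources;
import io.lettuce.core.resource.DirContextDnsResolver;

class DnsAwareResources {

    // Sketch: ClientResources with a resolver that queries DNS directly on each resolution,
    // bypassing the JVM InetAddress cache.
    static ClientResources create() {
        return DefaultClientResources.builder()
                .dnsResolver(new DirContextDnsResolver())
                .build();
    }
}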

@DamianFekete
Author

@mp911de, thanks for your quick reply.

I've seen a couple of GitHub issues mentioning DNS caching, but I think it does not apply here. I also forgot to mention that we're using Java 11.

In the following log we have:

  • redis-cluster-3: ff06a774b13ad88a63b39dcd0c9325caf3e4fa16
  • Initial IP address: 10.6.35.174
  • New IP address: 10.6.35.180

Why I think this is not a DNS issue:

  • Shortly before 12:19:14 the Redis node is restarted (Connection refused)
  • At 12:19:14 the hostname cannot be resolved (UnknownHostException: Name or service not known)
  • At 12:21:39 Lettuce connects to the new IP
  • At 12:21:49 we can see the relationship between the hostname and the new IP (redis-cluster-3.redis-cluster/10.6.35.180:6379)
    • redis-cluster-3.redis-cluster/10.6.35.180:6379, last known addr=redis-cluster-3.redis-cluster/10.6.35.180:6379
  • Even an hour later (at 13:32:45) Lettuce keeps trying to connect to the old IP (Resolved SocketAddress 10.6.35.174:6379 using for Cluster node ff06a774b13ad88a63b39dcd0c9325caf3e4fa16)
Logs

12:06:38.3	ConnectionWatchdog            	[channel=0x6595437c, /10.6.21.236:51826 -> redis-cluster-3.redis-cluster.pwtool-dev-emea/10.6.35.174:6379, last known addr=redis-cluster-3.redis-cluster.pwtool-dev-emea/10.6.35.174:6379] userEventTriggered(ctx, io.lettuce.core.ConnectionEvents$Activated@9f2d948)		

12:06:38.4	RedisStateMachine             	Decoded LatencyMeteredCommand [type=CLUSTER, output=StatusOutput [output=ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 10.6.35.174:6379@16379 master - 0 1594641995126 90 connected 10923-16383
3da56d06b4a34c20ee560d3ed28a2679ba089a30 10.6.21.229:6379@16379 slave fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 0 1594641996129 86 connected
9bc0e2cb46f424cac90d8816485d1a2728919765 10.6.13.243:6379@16379 myself,master - 0 1594641996000 91 connected 0-5460
b4cb4eb98ca653e888d3b2ec931898ab7c97b867 10.6.18.196:6379@16379 slave 9bc0e2cb46f424cac90d8816485d1a2728919765 0 1594641997131 91 connected
fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 10.6.26.146:6379@16379 master - 0 1594641998133 86 connected 5461-10922
72641dc0aa44972fd1ffd31ad1342cbcfd01b4fb 10.6.37.184:6379@16379 slave ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 0 1594641994124 90 connected
, error='null'], commandType=io.lettuce.core.cluster.topology.TimedAsyncCommand], empty stack: true		


12:19:14.9	ConnectionWatchdog            	[channel=0x6595437c, /10.6.21.236:51826 -> redis-cluster-3.redis-cluster/10.6.35.174:6379, last known addr=redis-cluster-3.redis-cluster/10.6.35.174:6379] channelInactive()		
12:19:14.9	ConnectionWatchdog            	[channel=0x6595437c, /10.6.21.236:51826 -> redis-cluster-3.redis-cluster/10.6.35.174:6379, last known addr=redis-cluster-3.redis-cluster/10.6.35.174:6379] Reconnect attempt 1, delay 1ms		
12:19:14.9	ConnectionWatchdog            	[channel=0x6595437c, /10.6.21.236:51826 -> redis-cluster-3.redis-cluster/10.6.35.174:6379, last known addr=redis-cluster-3.redis-cluster/10.6.35.174:6379] Reconnect attempt 2, delay 2ms		
12:19:14.9	ConnectionWatchdog            	[channel=0x6595437c, /10.6.21.236:51826 -> redis-cluster-3.redis-cluster/10.6.35.174:6379, last known addr=redis-cluster-3.redis-cluster/10.6.35.174:6379] scheduleReconnect()		
12:19:14.9	ConnectionWatchdog            	[channel=0x6595437c, /10.6.21.236:51826 -> redis-cluster-3.redis-cluster/10.6.35.174:6379, last known addr=redis-cluster-3.redis-cluster/10.6.35.174:6379] scheduleReconnect()		
12:19:14.9	ConnectionWatchdog            	[channel=0xad2ec2f2, /10.6.21.236:34340 -> /10.6.35.174:6379, last known addr=/10.6.35.174:6379] channelInactive()		
12:19:14.9	ConnectionWatchdog            	[channel=0xad2ec2f2, /10.6.21.236:34340 -> /10.6.35.174:6379, last known addr=/10.6.35.174:6379] Reconnect attempt 1, delay 1ms		
12:19:14.9	ConnectionWatchdog            	[channel=0xad2ec2f2, /10.6.21.236:34340 -> /10.6.35.174:6379, last known addr=/10.6.35.174:6379] scheduleReconnect()		
12:19:14.9	ConnectionWatchdog            	Cannot reconnect to [10.6.35.174:6379]: Connection refused: /10.6.35.174:6379	io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: /10.6.35.174:6379
Caused by: java.net.ConnectException: Connection refused

12:19:14.9	ConnectionWatchdog            	Cannot reconnect to [redis-cluster-3.redis-cluster:6379]: redis-cluster-3.redis-cluster: Name or service not known	java.net.UnknownHostException: redis-cluster-3.redis-cluster: Name or service not known

12:19:14.9	ConnectionWatchdog            	Reconnecting, last destination was /10.6.35.174:6379		
12:19:14.9	ConnectionWatchdog            	Reconnecting, last destination was redis-cluster-3.redis-cluster/10.6.35.174:6379		

12:19:15.0	ConnectionWatchdog            	[channel=0x6595437c, /10.6.21.236:51826 -> redis-cluster-3.redis-cluster/10.6.35.174:6379, last known addr=redis-cluster-3.redis-cluster/10.6.35.174:6379] Reconnect attempt 3, delay 4ms		
12:19:15.0	ConnectionWatchdog            	[channel=0x6595437c, /10.6.21.236:51826 -> redis-cluster-3.redis-cluster/10.6.35.174:6379, last known addr=redis-cluster-3.redis-cluster/10.6.35.174:6379] scheduleReconnect()		
12:19:15.0	ConnectionWatchdog            	Cannot reconnect to [redis-cluster-3.redis-cluster:6379]: redis-cluster-3.redis-cluster	java.net.UnknownHostException: redis-cluster-3.redis-cluster
	at java.base/java.net.InetAddress$CachedAddresses.get(InetAddress.java:797)
	at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1505)

12:19:15.0	ConnectionWatchdog            	Reconnecting, last destination was redis-cluster-3.redis-cluster:6379		
12:19:15.0	ReconnectionHandler           	Reconnecting to Redis at redis-cluster-3.redis-cluster:6379		
12:19:19.2	DefaultEndpoint               	[channel=0x2696ce8c, /10.6.21.236:59556 -> /10.6.35.174:6379, epid=0x72] write() done		
12:19:19.2	DefaultEndpoint               	[channel=0x2696ce8c, /10.6.21.236:59556 -> /10.6.35.174:6379, epid=0x72] write() done		
12:19:19.2	DefaultEndpoint               	[channel=0x2696ce8c, /10.6.21.236:59556 -> /10.6.35.174:6379, epid=0x72] writeToDisconnectedBuffer() buffering (disconnected) command TimedAsyncCommand [type=CLUSTER, output=StatusOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]		
12:19:19.2	DefaultEndpoint               	[channel=0x2696ce8c, /10.6.21.236:59556 -> /10.6.35.174:6379, epid=0x72] writeToDisconnectedBuffer() buffering (disconnected) command TimedAsyncCommand [type=INFO, output=StatusOutput [output=null, error='null'], commandType=io.lettuce.core.protocol.Command]		
12:19:19.2	RedisStateMachine             	Decoded LatencyMeteredCommand [type=CLUSTER, output=StatusOutput [output=fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 10.6.26.146:6379@16379 master - 1594642753999 1594642749000 86 connected 5461-10922
b4cb4eb98ca653e888d3b2ec931898ab7c97b867 10.6.18.196:6379@16379 slave 9bc0e2cb46f424cac90d8816485d1a2728919765 0 1594642759015 91 connected
ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 10.6.35.174:6379@16379 master - 1594642758010 1594642752998 90 connected 10923-16383
3da56d06b4a34c20ee560d3ed28a2679ba089a30 10.6.21.237:6379@16379 myself,slave fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 0 1594642754000 85 connected
9bc0e2cb46f424cac90d8816485d1a2728919765 10.6.13.243:6379@16379 master - 1594642750992 1594642746980 91 connected 0-5460
72641dc0aa44972fd1ffd31ad1342cbcfd01b4fb 10.6.37.184:6379@16379 slave ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 1594642755003 1594642748000 90 connected
, error='null'], commandType=io.lettuce.core.cluster.topology.TimedAsyncCommand], empty stack: true		
12:19:19.2	RedisStateMachine             	Decoded LatencyMeteredCommand [type=CLUSTER, output=StatusOutput [output=ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 10.6.35.174:6379@16379 master - 1594642755294 1594642752000 90 connected 10923-16383
fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 10.6.26.146:6379@16379 master - 1594642752287 1594642749280 86 connected 5461-10922
b4cb4eb98ca653e888d3b2ec931898ab7c97b867 10.6.18.196:6379@16379 myself,slave 9bc0e2cb46f424cac90d8816485d1a2728919765 0 1594642757000 89 connected
72641dc0aa44972fd1ffd31ad1342cbcfd01b4fb 10.6.37.184:6379@16379 slave ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 1594642755595 1594642748000 90 connected
9bc0e2cb46f424cac90d8816485d1a2728919765 10.6.13.243:6379@16379 master - 1594642753290 1594642747269 91 connected 0-5460
3da56d06b4a34c20ee560d3ed28a2679ba089a30 10.6.21.237:6379@16379 slave fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 0 1594642758307 86 connected
, error='null'], commandType=io.lettuce.core.cluster.topology.TimedAsyncCommand], empty stack: true		
12:19:30.3	ReconnectionHandler           	Reconnecting to Redis at 10.6.35.174:6379		
12:19:30.3	RoundRobinSocketAddressSupplie	Resolved SocketAddress 10.6.35.174:6379 using for Cluster node ff06a774b13ad88a63b39dcd0c9325caf3e4fa16		
12:19:40.3	ConnectionWatchdog            	Cannot reconnect to [10.6.35.174:6379]: connection timed out: /10.6.35.174:6379	io.netty.channel.ConnectTimeoutException: connection timed out: /10.6.35.174:6379



12:21:29.7	ClusterTopologyRefreshSchedule	ClusterTopologyRefreshTask requesting partitions from [RedisURI [host='10.6.13.250', port=6379], RedisURI [host='10.6.18.196', port=6379], RedisURI [host='10.6.21.237', port=6379], RedisURI [host='10.6.26.152', port=6379], RedisURI [host='10.6.35.174', port=6379], RedisURI [host='10.6.37.190', port=6379]]		
12:21:29.7	ConnectionWatchdog            	Reconnecting, last destination was 10.6.35.174:6379		
12:21:29.7	PooledClusterConnectionProvide	Resolved SocketAddress 10.6.35.174:6379 using for Cluster node at 10.6.35.174:6379		
12:21:29.7	ReconnectionHandler           	Reconnecting to Redis at 10.6.35.174:6379		
12:21:29.7	RedisClient                   	Connecting to Redis at 10.6.35.174:6379		
12:21:39.7	ClusterTopologyRefresh        	Unable to connect to [10.6.35.174:6379]: connection timed out: /10.6.35.174:6379	io.netty.channel.ConnectTimeoutException: connection timed out: /10.6.35.174:6379

12:21:39.7	CommandHandler                	[channel=0x4fd6bbfd, /10.6.21.236:38658 -> /10.6.35.180:6379, chid=0xef] channelActive() done		
12:21:39.7	CommandHandler                	[channel=0x4fd6bbfd, /10.6.21.236:38658 -> /10.6.35.180:6379, chid=0xef] channelActive()		
12:21:39.7	ConnectionWatchdog            	[channel=0x4fd6bbfd, /10.6.21.236:38658 -> /10.6.35.180:6379, last known addr=/10.6.35.180:6379] channelActive()		

12:21:39.7	ConnectionWatchdog            	[channel=0xad2ec2f2, /10.6.21.236:34340 -> /10.6.35.174:6379, last known addr=/10.6.35.174:6379] Reconnect attempt 7, delay 64ms		
12:21:39.7	ConnectionWatchdog            	[channel=0xad2ec2f2, /10.6.21.236:34340 -> /10.6.35.174:6379, last known addr=/10.6.35.174:6379] scheduleReconnect()		
12:21:39.7	ConnectionWatchdog            	Cannot reconnect to [10.6.35.174:6379]: connection timed out: /10.6.35.174:6379	io.netty.channel.ConnectTimeoutException: connection timed out: /10.6.35.174:6379

12:21:49.9	CommandHandler                	[channel=0x81316a92, /10.6.21.236:43210 -> redis-cluster-3.redis-cluster/10.6.35.180:6379, chid=0xf9] channelActive() done		
12:21:49.9	CommandHandler                	[channel=0x81316a92, /10.6.21.236:43210 -> redis-cluster-3.redis-cluster/10.6.35.180:6379, chid=0xf9] channelActive()		
12:21:49.9	CommandHandler                	[channel=0xe5af42f3, /10.6.21.236:43206 -> /10.6.35.180:6379, chid=0xf7] channelInactive() done		
12:21:49.9	CommandHandler                	[channel=0xe5af42f3, /10.6.21.236:43206 -> /10.6.35.180:6379, chid=0xf7] channelInactive()		
12:21:49.9	CommandHandler                	[channel=0xe5af42f3, /10.6.21.236:43206 -> /10.6.35.180:6379, chid=0xf7] channelUnregistered()		
12:21:49.9	ConnectionWatchdog            	[channel=0x81316a92, /10.6.21.236:43210 -> redis-cluster-3.redis-cluster/10.6.35.180:6379, last known addr=redis-cluster-3.redis-cluster/10.6.35.180:6379] channelActive()		
12:21:49.9	ConnectionWatchdog            	[channel=0x81316a92, /10.6.21.236:43210 -> redis-cluster-3.redis-cluster/10.6.35.180:6379, last known addr=redis-cluster-3.redis-cluster/10.6.35.180:6379] userEventTriggered(ctx, io.lettuce.core.ConnectionEvents$Activated@4bb7c560)		
12:21:49.9	ConnectionWatchdog            	[channel=0xe5af42f3, /10.6.21.236:43206 -> /10.6.35.180:6379, last known addr=/10.6.35.180:6379] channelInactive()		
12:21:49.9	ConnectionWatchdog            	[channel=0xe5af42f3, /10.6.21.236:43206 -> /10.6.35.180:6379, last known addr=/10.6.35.180:6379] Reconnect scheduling disabled		
12:21:49.9	ConnectionWatchdog            	Reconnecting, last destination was redis-cluster-3.redis-cluster:6379		
12:21:49.9	DefaultEndpoint               	[channel=0x81316a92, /10.6.21.236:43210 -> redis-cluster-3.redis-cluster/10.6.35.180:6379, epid=0x4] activateEndpointAndExecuteBufferedCommands 0 command(s) buffered		
12:21:49.9	DefaultEndpoint               	[channel=0x81316a92, /10.6.21.236:43210 -> redis-cluster-3.redis-cluster/10.6.35.180:6379, epid=0x4] activating endpoint		
12:21:49.9	DefaultEndpoint               	[channel=0x81316a92, /10.6.21.236:43210 -> redis-cluster-3.redis-cluster/10.6.35.180:6379, epid=0x4] flushCommands() Flushing 0 commands		
12:21:49.9	DefaultEndpoint               	[channel=0x81316a92, /10.6.21.236:43210 -> redis-cluster-3.redis-cluster/10.6.35.180:6379, epid=0x4] flushCommands()		
12:21:49.9	DefaultEndpoint               	[channel=0xe5af42f3, /10.6.21.236:43206 -> /10.6.35.180:6379, epid=0x9b] closeAsync()		
12:21:49.9	DefaultEndpoint               	[channel=0xe5af42f3, /10.6.21.236:43206 -> /10.6.35.180:6379, epid=0x9b] deactivating endpoint handler		
12:21:49.9	ReconnectionHandler           	Reconnected to redis-cluster-3.redis-cluster:6379, Channel channel=0x81316a92, /10.6.21.236:43210 -> redis-cluster-3.redis-cluster/10.6.35.180:6379		
12:21:49.9	ReconnectionHandler           	Reconnecting to Redis at redis-cluster-3.redis-cluster:6379		


12:22:00.0	RedisClient                   	Connecting to Redis at 10.6.35.180:6379: Success		
12:22:06.5	ConnectionWatchdog            	Cannot reconnect to [10.6.35.174:6379]: connection timed out: /10.6.35.174:6379	io.netty.channel.ConnectTimeoutException: connection timed out: /10.6.35.174:6379

12:33:44.8	RedisStateMachine             	Decoded LatencyMeteredCommand [type=CLUSTER, output=StatusOutput [output=ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 10.6.35.180:6379@16379 master - 0 1594643622000 90 connected 10923-16383
fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 10.6.26.152:6379@16379 master - 0 1594643622447 86 connected 5461-10922
b4cb4eb98ca653e888d3b2ec931898ab7c97b867 10.6.34.207:6379@16379 slave 9bc0e2cb46f424cac90d8816485d1a2728919765 0 1594643623450 91 connected
9bc0e2cb46f424cac90d8816485d1a2728919765 10.6.13.250:6379@16379 master - 0 1594643621445 91 connected 0-5460
72641dc0aa44972fd1ffd31ad1342cbcfd01b4fb 10.6.37.190:6379@16379 myself,slave ff06a774b13ad88a63b39dcd0c9325caf3e4fa16 0 1594643622000 84 connected
3da56d06b4a34c20ee560d3ed28a2679ba089a30 10.6.21.237:6379@16379 slave fbc6db4e984a9142efd0c5c7ab01b2f21abb0787 0 1594643624455 86 connected
, error='null'], commandType=io.lettuce.core.cluster.topology.TimedAsyncCommand], empty stack: true		
12:33:44.8	RoundRobinSocketAddressSupplie	Resolved SocketAddress 10.6.35.174:6379 using for Cluster node ff06a774b13ad88a63b39dcd0c9325caf3e4fa16		
12:33:54.8	ConnectionWatchdog            	Cannot reconnect to [10.6.35.174:6379]: connection timed out: /10.6.35.174:6379	io.netty.channel.ConnectTimeoutException: connection timed out: /10.6.35.174:6379

13:22:45.0      ConnectionWatchdog              Reconnecting, last destination was 10.6.35.174:6379

@mp911de
Collaborator

mp911de commented Jul 17, 2020

All DNS resolution is handled by SocketAddressResolver and DnsResolver. After looking into RoundRobinSocketAddressSupplier, it seems that DNS resolution isn't involved at all as RoundRobinSocketAddressSupplier is based on the initial Partitions object. Whenever a reconnect occurs, RoundRobinSocketAddressSupplier is asked to provide a new endpoint to connect to. If the Partitions change in between calls to RoundRobinSocketAddressSupplier.get(), RoundRobin is rebuilt.

For some reason, this doesn't happen in your case. If you are able to reproduce the issue, please step into RoundRobinSocketAddressSupplier.get() to capture the state of partitions and the inner state of RoundRobin to see where the mismatch stems from.
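
If a debugger isn't practical, a rough alternative (just a sketch; the wiring is up to you) is to periodically dump the client's Partitions view and compare it with the addresses that show up in the reconnect logs:

import io.lettuce.core.cluster.RedisClusterClient;
import io.lettuce.core.cluster.models.partitions.RedisClusterNode;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class PartitionsDumper {

    // Sketch: log the node-id -> URI mapping the client currently holds so it can be
    // compared with what RoundRobinSocketAddressSupplier resolves on reconnect.
    static void start(RedisClusterClient clusterClient) {
        Executors.newSingleThreadScheduledExecutor().scheduleAtFixedRate(() -> {
            for (RedisClusterNode node : clusterClient.getPartitions()) {
                System.out.printf("%s -> %s%n", node.getNodeId(), node.getUri());
            }
        }, 0, 30, TimeUnit.SECONDS);
    }
}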

@DamianFekete
Author

TL;DR: dynamicRefreshSources was not set to false


Thanks again @mp911de for the provided information.

I tried to reproduce the issue locally and I took the time to step into the code and understand it better.
I now assume that dynamicRefreshSources was not set to false as I (wrongly) said in my first post.

If dynamicRefreshSources=false then the initialUris are used instead of the topology (which contains only IP addresses and no DNS names).

Some notes:

  • Does it make sense to retry the connection if the topology already contains the node with another IP? I'm also asking myself what the Watchdog would do if the node were removed from the topology.

  • Reading the Javadoc for dynamicRefreshSources made me think it is a good idea to keep this enabled. I wonder if the Javadoc could be improved:

If set to false, only the initial seed nodes will be used as sources for topology discovery and the number of clients will be obtained only for the initial seed nodes. This can be useful when using Redis Cluster with many nodes.

Even after reading through the Lettuce Reference Guide I wouldn't have expected the Watchdog to keep retrying old IP addresses:

Note that enabling dynamic topology refresh sources uses node addresses reported by Redis CLUSTER NODES output which typically contains IP addresses.

  • I'll open a ticket with Spring Boot to be able to configure dynamicRefreshSources using the standard spring.redis.lettuce.cluster.* properties. (An interim customizer sketch is below.)
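
Until such a property exists, a customizer bean can apply the option. This is only a sketch under the assumption of Spring Boot 2.3 / Lettuce 5.3; the class and bean names are illustrative, and note that setting clientOptions this way replaces the options Spring Boot would otherwise build from the refresh properties:

import io.lettuce.core.cluster.ClusterClientOptions;
import io.lettuce.core.cluster.ClusterTopologyRefreshOptions;
import org.springframework.boot.autoconfigure.data.redis.LettuceClientConfigurationBuilderCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
class RedisClusterRefreshConfig {

    @Bean
    LettuceClientConfigurationBuilderCustomizer topologyRefreshCustomizer() {
        // Sketch: refresh topology only from the seed hostnames, not the IPs reported by CLUSTER NODES.
        return builder -> builder.clientOptions(ClusterClientOptions.builder()
                .topologyRefreshOptions(ClusterTopologyRefreshOptions.builder()
                        .dynamicRefreshSources(false)
                        .build())
                .build());
    }
}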

@1zg12

1zg12 commented May 12, 2022

I am facing the same exception as well, with an almost entirely default Lettuce configuration.
This only happens with a larger data set.

These are the configs:

redis:
  lettuce:
    pool:
      max-active: 10
      enabled: true

This works with a smaller data set and always fails on a larger data set.
