Description
Using redis-py==6.2.0. I have a 6-node Redis Cluster running in EKS (3 primaries / 3 replicas), and whenever one of the primary nodes goes down, my service raises TimeoutError despite the replica being available, having a sufficient Retry on the client, and using client-side load balancing with LoadBalancingStrategy.ROUND_ROBIN.

I've boiled it down to the following so I can better observe what's happening in the RedisCluster._internal_execute_command retry loop (note the client doesn't have a Retry on it in this example because I'm handling retries with the manual loop, but rest assured the running service has something like retry=Retry(backoff=NoBackoff(), retries=3) on it):
import os

from redis.cluster import LoadBalancingStrategy, RedisCluster

rc = RedisCluster(
    host=os.getenv('REDIS_HOST'),
    port=int(os.getenv('REDIS_PORT', 6379)),
    password=os.getenv('REDIS_PASSWORD'),
    load_balancing_strategy=LoadBalancingStrategy.ROUND_ROBIN,
    socket_connect_timeout=1.0,
    require_full_coverage=False,
    decode_responses=True,
)

rc.set("foo", "bar")  # True
slot = rc.determine_slot("GET", "foo")  # 12182
lbs = rc.load_balancing_strategy  # ROUND_ROBIN

for i in range(5):
    print(f"\nAttempt {i + 1}")
    try:
        # Resolve the node the same way the retry loop does: ask the read load
        # balancer for the next server index among the nodes serving this slot.
        primary_name = rc.nodes_manager.slots_cache[slot][0].name
        n_slots = len(rc.nodes_manager.slots_cache[slot])
        node_idx = rc.nodes_manager.read_load_balancer.get_server_index(primary_name, n_slots, lbs)
        node = rc.nodes_manager.slots_cache[slot][node_idx]
        print(f"idx: {node_idx} | node: {node.name} | type: {node.server_type}")
        print(repr(rc._execute_command(node, "GET", "foo")))
    except Exception as e:
        print(f"Exception: {e}")
With a healthy cluster, this will output:
Attempt 1
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'
Attempt 2
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'
Attempt 3
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'
Attempt 4
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'
Attempt 5
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'
If I kill the primary node in EKS (kubectl delete pod redis-node-3, where this was the 100.66.97.179 pod) and run the loop again, I get the following (until EKS gets redis-node-3 back up and running):
Attempt 1
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'
Attempt 2
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server
Attempt 3
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server
Attempt 4
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server
Attempt 5
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server
Basically, as soon as I get the TimeoutError from the primary node, the load balancer gets stuck: it stops bouncing between the primary and the replica and keeps trying the primary over and over.
If I instead kill the replica node in EKS, I get exactly what I'd expect, with the load balancer still round-robining between both nodes:
Attempt 1
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'
Attempt 2
idx: 1 | node: 100.66.106.241:6379 | type: replica
Exception: Timeout connecting to server
Attempt 3
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'
Attempt 4
idx: 1 | node: 100.66.106.241:6379 | type: replica
Exception: Timeout connecting to server
Attempt 5
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'
Activity
drewfustin commented on Jun 18, 2025
I tried with a 9-node cluster (3 primaries with 2 replicas each), swapped the load_balancing_strategy to LoadBalancingStrategy.ROUND_ROBIN_REPLICAS, and noticed that if the "first" replica got taken down (the one with the smallest server_index here), I ended up in the same retry loop without ever trying the "second" replica. That was the aha that led to understanding the real issue: it's not an error triggered only when a primary node goes down, it's an error that happens whenever the node with the smallest server_index matching the load balancer type goes down.

On ConnectionError or TimeoutError, the retry path calls RedisCluster.nodes_manager.initialize() → NodesManager.reset() → NodesManager.read_load_balancer.reset() → LoadBalancer.primary_to_idx.clear().

The reset() of the LoadBalancer makes the next retry attempt use server_index = 0 (bumped to server_index = 1 if replicas_only). If that was the node that failed, then even if the load balancer round-robins to the next index for the next retry, it gets reset back to 0 on the next ConnectionError/TimeoutError.
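To make the failure mode concrete, here's a minimal, self-contained model of that interaction (this is not the library's actual code; ToyLoadBalancer and its names are purely illustrative). Because the index map is cleared on every failure, the round-robin never advances past the dead node at index 0:

# Simplified model of the behavior described above (not redis-py's code):
# a per-primary round-robin index that gets cleared on every connection
# failure, so a dead node at index 0 is retried forever.
class ToyLoadBalancer:
    def __init__(self):
        self.primary_to_idx = {}

    def get_server_index(self, primary, list_size):
        idx = self.primary_to_idx.setdefault(primary, 0)
        self.primary_to_idx[primary] = (idx + 1) % list_size  # advance for next call
        return idx

    def reset(self):
        self.primary_to_idx.clear()  # what NodesManager.reset() ends up doing


lb = ToyLoadBalancer()
nodes = ["primary (down)", "replica (up)"]
for attempt in range(1, 5):
    idx = lb.get_server_index("100.66.97.179:6379", len(nodes))
    if idx == 0:
        # The dead primary sits at index 0: the TimeoutError triggers a
        # topology refresh, which resets the balancer back to index 0.
        lb.reset()
        print(f"Attempt {attempt}: {nodes[idx]} -> TimeoutError, balancer reset")
    else:
        print(f"Attempt {attempt}: {nodes[idx]} -> 'bar'")

Every attempt lands on index 0, which matches the stuck loop shown in the output above.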
petyaslavova commented on Jun 19, 2025
Hi @drewfustin, thank you for reporting this! We will investigate the issue soon.
LoadBalancer keyed on slot instead of primary node, not reset on NodesManager.initialize() #3683

drewfustin commented on Jun 19, 2025
Thanks @petyaslavova. I have a proposed fix for this issue in PR #3683. It requires a bit of rethinking of how the LoadBalancer works, but I think it makes sense. Plus, it seems to work.
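For illustration, here is a rough sketch of the direction the linked PR title describes (round-robin state keyed on the slot, and not reset by a topology refresh). This is my reading of the idea, not the actual code in #3683:

# Illustrative sketch only (not the code in #3683): key the round-robin state
# on the hash slot rather than the primary's name, and don't clear it from the
# topology-refresh path, so a retry after a failure moves on to the next node.
class SlotKeyedLoadBalancer:
    def __init__(self):
        self.slot_to_idx = {}

    def get_server_index(self, slot, list_size):
        idx = self.slot_to_idx.setdefault(slot, 0)
        self.slot_to_idx[slot] = (idx + 1) % list_size  # keep advancing across retries
        return idx

    # Intentionally no reset() hooked into NodesManager.initialize(): a
    # ConnectionError/TimeoutError no longer pins the next attempt back to index 0.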