
Round robin load balancing isn't working as expected if a primary node goes down in Redis cluster mode #3681

Open

Description

Using redis-py==6.2.0. I have a 6-node Redis Cluster running in EKS (3 primaries / 3 replicas). Whenever one of the primary nodes goes down, my service raises TimeoutError even though the replica is available, the client has a sufficient Retry configured, and client-side load balancing is set to LoadBalancingStrategy.ROUND_ROBIN.

I've boiled it down to the following so I can better observe what happens in the RedisCluster._internal_execute_command retry loop. Note that the client has no Retry in this example because the manual loop stands in for it; rest assured the actual service is configured with something like retry=Retry(backoff=NoBackoff(), retries=3):

import os
from redis.cluster import LoadBalancingStrategy, RedisCluster
rc = RedisCluster(
    host=os.getenv('REDIS_HOST'),
    port=int(os.getenv('REDIS_PORT', '6379')),  # env vars are strings; cast to int
    password=os.getenv('REDIS_PASSWORD'),
    load_balancing_strategy=LoadBalancingStrategy.ROUND_ROBIN,
    socket_connect_timeout=1.0,
    require_full_coverage=False,
    decode_responses=True,
)
rc.set("foo", "bar")  # True
slot = rc.determine_slot("GET", "foo")  # 12182
lbs = rc.load_balancing_strategy  # ROUND_ROBIN
for i in range(5):
    print(f"\nAttempt {i + 1}")
    try:
        primary_name = rc.nodes_manager.slots_cache[slot][0].name
        n_nodes = len(rc.nodes_manager.slots_cache[slot])  # nodes serving this slot (primary + replicas)
        node_idx = rc.nodes_manager.read_load_balancer.get_server_index(primary_name, n_nodes, lbs)
        node = rc.nodes_manager.slots_cache[slot][node_idx]
        print(f"idx: {node_idx} | node: {node.name} | type: {node.server_type}")
        print(repr(rc._execute_command(node, "GET", "foo")))  # prints 'bar' on success
    except Exception as e:
        print(f"Exception: {e}")

With a healthy cluster, this will output:

Attempt 1
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'

Attempt 2
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'

Attempt 3
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'

Attempt 4
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'

Attempt 5
idx: 0 | node: 100.66.97.179:6379 | type: primary
'bar'

If I kill the primary node in EKS (kubectl delete pod redis-node-3, where that was the 100.66.97.179 pod) and run the loop again, I get the following (until EKS brings redis-node-3 back up):

Attempt 1
idx: 1 | node: 100.66.106.241:6379 | type: replica
'bar'

Attempt 2
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server

Attempt 3
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server

Attempt 4
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server

Attempt 5
idx: 0 | node: 100.66.97.179:6379 | type: primary
Exception: Timeout connecting to server

Basically, as soon as the primary node raises TimeoutError, the load balancer gets stuck: it stops bouncing between the primary and the replica and instead retries the dead primary over and over.

If I instead kill the replica node in EKS, I get exactly what I'd expect: the load balancer still round-robins between both nodes:

Attempt 1
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'

Attempt 2
idx: 1 | node: 100.66.106.241:6379 | type: replica
Exception: Timeout connecting to server

Attempt 3
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'

Attempt 4
idx: 1 | node: 100.66.106.241:6379 | type: replica
Exception: Timeout connecting to server

Attempt 5
idx: 0 | node: 100.66.111.89:6379 | type: primary
'bar'

Activity

drewfustin (Author) commented on Jun 18, 2025

I tried a 9-node cluster (3 primaries with 2 replicas each) with load_balancing_strategy swapped to LoadBalancingStrategy.ROUND_ROBIN_REPLICAS, and noticed that if the "first" replica (the one with the smallest server_index) was taken down, I ended up in the same retry loop without the "second" replica ever being tried. That was the aha that led to the real issue: it isn't triggered only when a primary node goes down, it happens whenever the node with the smallest server_index matching the load balancer strategy goes down.

On ConnectionError or TimeoutError:

  • RedisCluster.nodes_manager.initialize()
  • NodesManager.reset()
  • NodesManager.read_load_balancer.reset()
  • LoadBalancer.primary_to_idx.clear()

The LoadBalancer's reset() makes the next retry attempt start at server_index = 0 (bumped to server_index = 1 if replicas_only). If that is the node that failed, then even though the load balancer would round-robin to the next index on the following attempt, the next ConnectionError/TimeoutError resets it straight back to 0. A minimal model of this failure mode is sketched below.
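
Here is a minimal, self-contained model of that failure mode (a sketch, not redis-py's actual code; ToyLoadBalancer and DOWN_IDX are made up for illustration):

class ToyLoadBalancer:
    """Round-robin over one slot's nodes, mimicking the reset behavior described above."""
    def __init__(self):
        self.primary_to_idx = {}

    def get_server_index(self, primary, n_nodes):
        idx = self.primary_to_idx.get(primary, 0)
        self.primary_to_idx[primary] = (idx + 1) % n_nodes  # advance the rotation
        return idx

    def reset(self):
        # This is what the chain above triggers on ConnectionError/TimeoutError.
        self.primary_to_idx.clear()

balancer = ToyLoadBalancer()
DOWN_IDX = 0  # index 0 (the primary) is the node that went down

for attempt in range(1, 5):
    idx = balancer.get_server_index("primary-1", 2)
    if idx == DOWN_IDX:
        balancer.reset()  # error handling reinitializes -> rotation snaps back to 0
        print(f"Attempt {attempt}: idx {idx} -> TimeoutError, balancer reset")
    else:
        print(f"Attempt {attempt}: idx {idx} -> 'bar'")

Once any attempt lands on index 0 and fails, every subsequent attempt starts from 0 again, which reproduces the stuck output above.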

petyaslavova (Collaborator) commented on Jun 19, 2025

Hi @drewfustin, thank you for reporting this! We will investigate the issue soon.

drewfustin (Author) commented on Jun 19, 2025

Thanks @petyaslavova. I have a proposed fix for this issue in PR #3683. It requires a bit of a rethink of how the LoadBalancer works, but I think it makes sense. Plus, it seems to work.
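
For illustration only (this is not necessarily what PR #3683 does), one direction is to let the rotation survive resets, so a retry advances past the failed node instead of snapping back to index 0:

class StickyRoundRobin:
    """Round-robin whose position survives cluster re-initialization (illustrative sketch)."""
    def __init__(self):
        self.primary_to_idx = {}

    def get_server_index(self, primary, n_nodes):
        idx = self.primary_to_idx.get(primary, 0) % n_nodes
        self.primary_to_idx[primary] = idx + 1
        return idx

    def reset(self):
        # Deliberately keep primary_to_idx: clearing it is what pinned
        # retries to index 0 when that node was the one that went down.
        pass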
