Skip to content

Possible bug: Redis cluster: Circuit breaker does not skip retries #227

@mbyio

Description

@mbyio

You might not consider this an actual bug, it is more of an optimization. Also, this might not be the right gem to fix this in.

Context

We are using Redis cluster, with these gems:

  • redis 5.3.0
  • redis-client 0.23.1
  • redis-clustering 5.3.0
  • redis-cluster-client 0.13.1

We are using Elasticache and Valkey 8.0.1

I have been testing edge cases in cluster mode, like scaling up and down, deleting arbitrary nodes, failover, etc. We are using Redis cluster for our Rails cache.

We have the following configuration:

reconnect_attempts: [rand(0.1..0.2), rand(0.2..0.3)],
circuit_breaker: {
  error_threshold: 3,
  error_threshold_timeout: 5,
  error_timeout: rand(1.0..2.0),
  success_threshold: 1,
}

My intent is for us to retry connection up to 2 times (so 3 total tries). If we failed all 3 times, then the circuit breaker should open for a while, so that we don't waste time making more attempts. This is both to let the server recover and also so errors don't cause latency spikes (which would then cause request queuing).

I setup Datadog tracing for redis, and I added a patch to also add tracing for Redis connect calls.

Problem

Thanks to the Datadog tracing, I found the circuit does correctly open when there is a spike in errors, so the Redis server is protected from a spike. However, in the Ruby code, it looks like we are still executing the sleep and retry process, even though each actual connection attempt is just immediately aborted by the circuit breaker.

When combined with redis cluster, where we then go back and try with a different node, this can end up causing the request to take a couple of seconds. I was surprised by this, because I expected the circuit breaker to let us "fail fast" until error_timeout expires.

I am not a Ruby expert so this might be totally wrong, but I think the problem might be here: https://github.com/redis-rb/redis-client/blob/master/lib/redis_client.rb#L716 We run raw_connection after each attempt, above where the circuit breaker is running, so they aren't communicating with each other. This results in us sleeping for the retry timeout even if we are guaranteed to fail because of the circuit breaker.

Edit: I think I meant that we run retry_connecting? after every attempt

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions