-
Notifications
You must be signed in to change notification settings - Fork 69
Description
You might not consider this an actual bug, it is more of an optimization. Also, this might not be the right gem to fix this in.
Context
We are using Redis cluster, with these gems:
- redis 5.3.0
- redis-client 0.23.1
- redis-clustering 5.3.0
- redis-cluster-client 0.13.1
We are using Elasticache and Valkey 8.0.1
I have been testing edge cases in cluster mode, like scaling up and down, deleting arbitrary nodes, failover, etc. We are using Redis cluster for our Rails cache.
We have the following configuration:
reconnect_attempts: [rand(0.1..0.2), rand(0.2..0.3)],
circuit_breaker: {
error_threshold: 3,
error_threshold_timeout: 5,
error_timeout: rand(1.0..2.0),
success_threshold: 1,
}
My intent is for us to retry connection up to 2 times (so 3 total tries). If we failed all 3 times, then the circuit breaker should open for a while, so that we don't waste time making more attempts. This is both to let the server recover and also so errors don't cause latency spikes (which would then cause request queuing).
I setup Datadog tracing for redis, and I added a patch to also add tracing for Redis connect calls.
Problem
Thanks to the Datadog tracing, I found the circuit does correctly open when there is a spike in errors, so the Redis server is protected from a spike. However, in the Ruby code, it looks like we are still executing the sleep and retry process, even though each actual connection attempt is just immediately aborted by the circuit breaker.
When combined with redis cluster, where we then go back and try with a different node, this can end up causing the request to take a couple of seconds. I was surprised by this, because I expected the circuit breaker to let us "fail fast" until error_timeout expires.
I am not a Ruby expert so this might be totally wrong, but I think the problem might be here: https://github.com/redis-rb/redis-client/blob/master/lib/redis_client.rb#L716 We run raw_connection after each attempt, above where the circuit breaker is running, so they aren't communicating with each other. This results in us sleeping for the retry timeout even if we are guaranteed to fail because of the circuit breaker.
Edit: I think I meant that we run retry_connecting? after every attempt