This repository has been archived by the owner on Mar 31, 2023. It is now read-only.
fix(clusterdown_wrapper): fix broken retry logic #8
There may be a better solution than what I've added here; we might want to consider using `backoff` instead. This change should at least bring the code in line with what it's trying to do (which might still be wrong).
The Problem:
We've been seeing a bunch of `RedisClusterException: Redis Cluster cannot be connected. Please provide at least one reachable node` errors. Following the stack trace, I was able to isolate what was happening.
For whatever reason, Redis sometimes sends us a `CLUSTERDOWN` response when we try to execute a command (using the `execute_command` function). The `execute_command` function has a wrapper that will retry up to 3 times on a `ClusterDownError`. However, `execute_command` catches this error and sets `self.refresh_table_asap = True` before raising it to the wrapper.
This means that the second time `execute_command` tries to send the command, it hits an `if self.refresh_table_asap:` block that forces it to rebuild the `node_manager`. When the `node_manager` fails to connect to any node in the cluster, it raises a `RedisClusterException`, which bypasses the wrapper and returns control to the calling program.
The Solution:
The solution I'm presenting in this PR is to create a new exception class, `ClusterUnreachableError`, for cases where we can't connect to the cluster, and to raise that instead of a `RedisClusterException`. I am also splitting `refresh_table_asap` into `moved` and `cluster_down` so it's clearer why we need to rebuild the `node_manager` mapping. When the cause is `cluster_down`, we can raise a `ClusterUnreachableError`, which allows the `@clusterdown_wrapper` to do its work.
This doesn't necessarily fix the error, but it ensures that the correct error is thrown and that we actually retry first.
I'm not entirely sure this will fix the root cause. It may be a good idea to replace this custom wrapper with `backoff`, to give the cluster or the network time to get back into a functional state.
Anyways, that's all. Happy to reimplement with a combination of these solutions, or with any additional ideas that pop up.
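For reference, the backoff idea mentioned above could look roughly like the sketch below. This is a hand-rolled, standard-library stand-in for what a backoff-style decorator does, not the `backoff` package's own API; `retry_with_backoff` and its parameters (`max_tries`, `base_delay`, `factor`, `sleep`) are hypothetical names chosen for illustration.

```python
import functools
import time


def retry_with_backoff(exceptions, max_tries=3, base_delay=0.1, factor=2.0,
                       sleep=time.sleep):
    """Retry on `exceptions`, waiting exponentially longer between attempts.

    `sleep` is injectable so the delay schedule can be checked without waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def inner(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_tries:
                        raise  # out of attempts: surface the last error
                    sleep(delay)     # give the cluster or network time to recover
                    delay *= factor  # wait longer before the next attempt
        return inner
    return decorator
```

Unlike the current wrapper, which retries immediately, this pauses between attempts, so a cluster that is briefly unavailable has a chance to recover before the next try. The delays here are placeholders and would need tuning against real failover times.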