This repository has been archived by the owner on Mar 31, 2023. It is now read-only.

fix(clusterdown_wrapper): fix broken retry logic #8

Merged · 6 commits · Jul 2, 2021

Conversation


@kylepad commented Jun 15, 2021

There may be a better solution than what I've added here; we might want to consider using backoff instead. This solution should at least bring the code in line with what it's trying to do (which might still be wrong).

The Problem:

We've been seeing a bunch of RedisClusterException: Redis Cluster cannot be connected. Please provide at least one reachable node errors. Following the stack trace, I was able to isolate what was happening.

For whatever reason, Redis sometimes sends us a CLUSTERDOWN response when we are trying to execute a command (via the execute_command function). The execute_command function has a wrapper that will retry up to 3 times on a ClusterDownError. However, execute_command catches this error and sets self.refresh_table_asap = True before re-raising it to the wrapper.

This means that the second time execute_command tries to send the command, it hits an if self.refresh_table_asap: block that forces it to rebuild the node_manager. When the node_manager fails to connect to any node in the cluster, it raises a RedisClusterException, which bypasses the wrapper and returns control to the calling program.
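The broken flow above can be illustrated with a minimal, self-contained sketch (names are simplified stand-ins, not the actual redis-py-cluster source): the wrapper only catches ClusterDownError, so the RedisClusterException raised on the second attempt escapes it after just one retry.

```python
class ClusterDownError(Exception):
    pass

class RedisClusterException(Exception):
    pass

def clusterdown_wrapper(func):
    """Retries up to 3 times, but only on ClusterDownError."""
    def inner(self, *args, **kwargs):
        for _ in range(3):
            try:
                return func(self, *args, **kwargs)
            except ClusterDownError:
                continue  # retry; any OTHER exception propagates immediately
        raise ClusterDownError("CLUSTERDOWN persisted after 3 retries")
    return inner

class FakeCluster:
    """Toy model of the client behaviour described above."""
    def __init__(self):
        self.refresh_table_asap = False
        self.attempts = 0

    @clusterdown_wrapper
    def execute_command(self, *args):
        self.attempts += 1
        if self.refresh_table_asap:
            # The node_manager rebuild fails to reach any node
            raise RedisClusterException(
                "Redis Cluster cannot be connected. "
                "Please provide at least one reachable node")
        # First attempt: the server answers CLUSTERDOWN
        self.refresh_table_asap = True
        raise ClusterDownError("CLUSTERDOWN The cluster is down")

cluster = FakeCluster()
try:
    cluster.execute_command("GET", "key")
except RedisClusterException as exc:
    # Escapes on attempt 2 of 3: the retry loop never finishes
    print(f"escaped after {cluster.attempts} attempts: {exc}")
```

Running this shows the wrapper giving up after only two attempts, because the second attempt raises an exception type it does not catch.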

The Solution:

The solution I'm presenting in this PR is to create a new exception class, ClusterUnreachableError, for cases where we can't connect to the cluster, and to raise that instead of a RedisClusterException. I am also splitting refresh_table_asap into moved and cluster_down so it's clearer why we need to rebuild the node_manager mapping. In the case where the cause is cluster_down, we can raise a ClusterUnreachableError, allowing the @clusterdown_wrapper to do its work.

This doesn't necessarily fix the error, but it ensures that the correct error is raised and that we actually retry first.
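A hypothetical sketch of what this fix could look like (not the PR's exact code), assuming ClusterUnreachableError subclasses ClusterDownError so the existing wrapper retries it, and refresh_table_asap is split into moved and cluster_down flags:

```python
class ClusterDownError(Exception):
    pass

class ClusterUnreachableError(ClusterDownError):
    """Raised when no node in the cluster can be reached."""

def clusterdown_wrapper(func):
    def inner(self, *args, **kwargs):
        last_exc = None
        for _ in range(3):
            try:
                return func(self, *args, **kwargs)
            except ClusterDownError as exc:  # includes ClusterUnreachableError
                last_exc = exc
        raise last_exc
    return inner

class FakeCluster:
    def __init__(self, recover_after=3):
        self.moved = False         # slot moved: rebuild mapping
        self.cluster_down = False  # CLUSTERDOWN seen: rebuild mapping
        self.attempts = 0
        self.recover_after = recover_after  # attempt on which the cluster heals

    def rebuild_node_manager(self):
        # While the cluster is still down, signal it with the retryable error
        # instead of a RedisClusterException that would escape the wrapper
        if self.attempts < self.recover_after:
            raise ClusterUnreachableError("no reachable node")
        self.cluster_down = False

    @clusterdown_wrapper
    def execute_command(self, *args):
        self.attempts += 1
        if self.cluster_down:
            self.rebuild_node_manager()
        if self.attempts < self.recover_after:
            self.cluster_down = True
            raise ClusterDownError("CLUSTERDOWN The cluster is down")
        return "OK"

cluster = FakeCluster(recover_after=3)
print(cluster.execute_command("GET", "key"))  # → OK, after two retried failures
```

Here the second attempt's failed rebuild raises ClusterUnreachableError, which the wrapper catches and retries, so the third attempt can succeed; if the cluster never recovers, the wrapper re-raises the last ClusterUnreachableError after exhausting its retries.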

I'm not entirely sure this will fix the root cause, it may be a good idea to replace this custom wrapper with backoff, to give the cluster or the network time to get back into a functional state.

Anyway, that's all. Happy to reimplement with a combination of these solutions, or with any additional ideas that pop up.

@TheKevJames
Member

Changes make sense to me; that should at least help with debugging! Just need to fix up the tests.

@kylepad force-pushed the khersey/redis_cluster_cannot_be_connected branch from 903d78e to 3832dbb on July 2, 2021 19:24
@kylepad merged commit a16c35d into master on Jul 2, 2021
@kylepad deleted the khersey/redis_cluster_cannot_be_connected branch on July 2, 2021 19:30