This repository has been archived by the owner on Mar 31, 2023. It is now read-only.

fix(clusterdown_wrapper): fix broken retry logic #8

Merged · 6 commits · Jul 2, 2021

Conversation


@kylepad commented Jun 15, 2021

There may be a better solution than what I've added here; we might want to consider using backoff instead. This solution should at least bring the code in line with what it's trying to do (which might still be wrong).

The Problem:

We've been seeing a bunch of RedisClusterException: Redis Cluster cannot be connected. Please provide at least one reachable node errors. Following the stack trace, I was able to isolate what was happening.

For whatever reason, Redis sometimes sends us a CLUSTERDOWN response when we are trying to execute a command (via the execute_command function). The execute_command function has a wrapper that will retry up to 3 times on a ClusterDownError. However, execute_command catches this error and sets self.refresh_table_asap = True before re-raising it to the wrapper.

This means that the second time execute_command tries to send the command, it hits an if self.refresh_table_asap: block that forces it to rebuild the node_manager. When the node_manager fails to connect to any node in the cluster, it raises a RedisClusterException, which bypasses the wrapper and returns control to the calling program.
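The broken flow above can be illustrated with a minimal, self-contained sketch (names are simplified stand-ins, not the actual redis-py-cluster source): the wrapper only catches ClusterDownError, so the RedisClusterException raised on the second attempt escapes it after just one retry.

```python
class ClusterDownError(Exception):
    pass

class RedisClusterException(Exception):
    pass

def clusterdown_wrapper(func):
    """Retries up to 3 times, but only on ClusterDownError."""
    def inner(self, *args, **kwargs):
        for _ in range(3):
            try:
                return func(self, *args, **kwargs)
            except ClusterDownError:
                continue  # retry; any OTHER exception propagates immediately
        raise ClusterDownError("CLUSTERDOWN persisted after 3 retries")
    return inner

class FakeCluster:
    """Toy model of the client behaviour described above."""
    def __init__(self):
        self.refresh_table_asap = False
        self.attempts = 0

    @clusterdown_wrapper
    def execute_command(self, *args):
        self.attempts += 1
        if self.refresh_table_asap:
            # The node_manager rebuild fails to reach any node
            raise RedisClusterException(
                "Redis Cluster cannot be connected. "
                "Please provide at least one reachable node")
        # First attempt: the server answers CLUSTERDOWN
        self.refresh_table_asap = True
        raise ClusterDownError("CLUSTERDOWN The cluster is down")

cluster = FakeCluster()
try:
    cluster.execute_command("GET", "key")
except RedisClusterException as exc:
    # Escapes on attempt 2 of 3: the retry loop never finishes
    print(f"escaped after {cluster.attempts} attempts: {exc}")
```

Running this shows the wrapper giving up after only two attempts, because the second attempt raises an exception type it does not catch.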

The Solution:

The solution I'm presenting in this PR is to create a new exception class, ClusterUnreachableError, for cases where we can't connect to the cluster, and to raise that instead of a RedisClusterException. I am also splitting refresh_table_asap into moved and cluster_down so it's clearer why we need to rebuild the node_manager mapping. In the case where the cause is cluster_down, we can raise a ClusterUnreachableError, allowing the @clusterdown_wrapper to do its work.

This doesn't necessarily fix the error, but it ensures that the correct error is raised and that we actually retry first.
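A hypothetical sketch of what this fix could look like (not the PR's exact code), assuming ClusterUnreachableError subclasses ClusterDownError so the existing wrapper retries it, and refresh_table_asap is split into moved and cluster_down flags:

```python
class ClusterDownError(Exception):
    pass

class ClusterUnreachableError(ClusterDownError):
    """Raised when no node in the cluster can be reached."""

def clusterdown_wrapper(func):
    def inner(self, *args, **kwargs):
        last_exc = None
        for _ in range(3):
            try:
                return func(self, *args, **kwargs)
            except ClusterDownError as exc:  # includes ClusterUnreachableError
                last_exc = exc
        raise last_exc
    return inner

class FakeCluster:
    def __init__(self, recover_after=3):
        self.moved = False         # slot moved: rebuild mapping
        self.cluster_down = False  # CLUSTERDOWN seen: rebuild mapping
        self.attempts = 0
        self.recover_after = recover_after  # attempt on which the cluster heals

    def rebuild_node_manager(self):
        # While the cluster is still down, signal it with the retryable error
        # instead of a RedisClusterException that would escape the wrapper
        if self.attempts < self.recover_after:
            raise ClusterUnreachableError("no reachable node")
        self.cluster_down = False

    @clusterdown_wrapper
    def execute_command(self, *args):
        self.attempts += 1
        if self.cluster_down:
            self.rebuild_node_manager()
        if self.attempts < self.recover_after:
            self.cluster_down = True
            raise ClusterDownError("CLUSTERDOWN The cluster is down")
        return "OK"

cluster = FakeCluster(recover_after=3)
print(cluster.execute_command("GET", "key"))  # → OK, after two retried failures
```

Here the second attempt's failed rebuild raises ClusterUnreachableError, which the wrapper catches and retries, so the third attempt can succeed; if the cluster never recovers, the wrapper re-raises the last ClusterUnreachableError after exhausting its retries.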

I'm not entirely sure this will fix the root cause, it may be a good idea to replace this custom wrapper with backoff, to give the cluster or the network time to get back into a functional state.

Anyway, that's all. Happy to reimplement with a combination of these solutions, or with any additional ideas that pop up.

@TheKevJames
Member

Changes make sense to me; that should at least help with debugging! Just need to fix up the tests.

@kylepad force-pushed the khersey/redis_cluster_cannot_be_connected branch from 903d78e to 3832dbb on July 2, 2021 19:24
@kylepad merged commit a16c35d into master on Jul 2, 2021
@kylepad deleted the khersey/redis_cluster_cannot_be_connected branch on July 2, 2021 19:30