
Inconsistent slot mapping #3776

Open
doyoubi opened this issue Jan 26, 2017 · 3 comments

doyoubi commented Jan 26, 2017

Hi:

As we've been using Redis Cluster this year, we have found several times that the slot mapping on each node could become inconsistent after fixing a failed slot migration. Migration sometimes failed because the machine serving Redis went down, or because the resharding script was forcibly stopped. After clearing the importing and migrating flags and waiting for a long time, the cluster still couldn't fix the inconsistency automatically.
We should have kept some of the broken clusters from production, but we didn't... Here I will give two ways to build an inconsistent cluster. Hopefully we will encounter this problem again so that we can find out what exactly happened to the broken clusters.
The versions we are using are 3.0.4 and 3.0.7.
This problem may be related to issues #3442 and #2969.

Inconsistent Cluster

How To Build It

At first we have a consistent cluster with three nodes A, B, and C. The important part here is that epoch A < epoch B < epoch C.

000000000000000000000000000000000000000b 127.0.0.1:6001 master - 0 1485312016448 2 connected 5462-10922
000000000000000000000000000000000000000a 127.0.0.1:6000 myself,master - 0 0 1 connected 0-5461
000000000000000000000000000000000000000c 127.0.0.1:6002 master - 0 1485312015433 3 connected 10923-16383
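
(For reference, a throwaway three-master test cluster like this can be created with the redis-trib.rb script shipped with the 3.0.x distribution; the ports below are just the ones used in this example, and redis-trib assigns config epochs at creation time, which should produce the increasing epochs shown above in listing order.)

$ ./redis-trib.rb create 127.0.0.1:6000 127.0.0.1:6001 127.0.0.1:6002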

Now use cluster setslot [slot] node to move slot 10922 from node B to node A. Note that we must change node B first since it has a larger epoch than node A.

redis-cli -p 6001 cluster setslot 10922 node 000000000000000000000000000000000000000a
redis-cli -p 6000 cluster setslot 10922 node 000000000000000000000000000000000000000a

Then you will find that even though both node A and node B now think slot 10922 belongs to node A, node C still insists that 10922 belongs to node B. No matter how long you wait, nothing changes.

$ redis-cli -p 6000 cluster nodes
000000000000000000000000000000000000000b 127.0.0.1:6001 master - 0 1485312916488 2 connected 5462-10921
000000000000000000000000000000000000000a 127.0.0.1:6000 myself,master - 0 0 1 connected 0-5461 10922
000000000000000000000000000000000000000c 127.0.0.1:6002 master - 0 1485312915478 3 connected 10923-16383

$ redis-cli -p 6001 cluster nodes
000000000000000000000000000000000000000a 127.0.0.1:6000 master - 0 1485312929939 1 connected 0-5461 10922
000000000000000000000000000000000000000b 127.0.0.1:6001 myself,master - 0 0 2 connected 5462-10921
000000000000000000000000000000000000000c 127.0.0.1:6002 master - 0 1485312928930 3 connected 10923-16383

$ redis-cli -p 6002 cluster nodes
000000000000000000000000000000000000000b 127.0.0.1:6001 master - 0 1485312940047 2 connected 5462-10922
000000000000000000000000000000000000000a 127.0.0.1:6000 master - 0 1485312941063 1 connected 0-5461
000000000000000000000000000000000000000c 127.0.0.1:6002 myself,master - 0 0 3 connected 10923-16383

Why

In the slot table of node C, slot 10922 is associated with node B and its epoch, which is 2 here. After node A gets slot 10922, it announces that to node C with its own epoch, 1, but node C rejects the update because of the lower epoch.
I'm not sure whether this is by design. It seems that Redis Cluster uses a unique epoch for each node and the consensus is based on these rules:

  • The cluster configuration should never change if the epoch doesn't change.
  • Each epoch is bound to at most one specific cluster configuration change.

The example above violates the first rule. Well, we should never use cluster setslot [slot] node casually. I think our slot inconsistency problem is mostly caused by sending setslot node to a node with a low epoch. However, the second rule is out of our control.
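
To make the first rule concrete, here is a hypothetical Python model of the epoch check (the real logic lives in the C function clusterUpdateSlotsConfigWith; this is only an illustration, not the actual source):

# Hypothetical model: a node decides whether to rebind a slot claimed by a sender.
# slots: slot -> owner node id; epochs: node id -> configEpoch
def maybe_rebind_slot(slots, epochs, slot, sender_id):
    owner = slots.get(slot)
    # An unassigned slot, or an owner whose configEpoch is lower than the
    # sender's, is rebound to the sender; otherwise the claim is ignored.
    if owner is None or epochs[owner] < epochs[sender_id]:
        slots[slot] = sender_id
        return True
    return False

# Node C's view in the example: B (epoch 2) owns 10922 and A announces with
# epoch 1, so 1 < 2 and C keeps the stale binding forever.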

Epoch Collision

In some rare cases, the same epoch in a cluster can correspond to multiple configuration changes. Here's an example. Let's say node A is importing a slot from node B. If setslot node runs on node A after the migration while, at the same time, node B is doing something that bumps its epoch, such as fixing an epoch collision or handling a setslot node of its own (even though it makes no sense to do that), both nodes may bump their epochs to the same value. Now, if the node id of node B is less than that of node A, node B bumps to a higher epoch and may spread it to the cluster before node A does. The result is the inconsistent state described above.
Actually I only reproduced it by sending cluster bumpepoch to node B to simulate this case, and I don't think it's the main cause. A sketch of what such a script might look like follows the commands below.

$ redis-cli -p 6000 cluster setslot 0 migrating 000000000000000000000000000000000000000b
$ redis-cli -p 6001 cluster setslot 0 importing 000000000000000000000000000000000000000a
$ python script_to_setslot_and_bumpepoch_at_the_same_time.py
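
The helper script is not included in this issue; the following is a hypothetical Python sketch of what it might do, assuming the redis-py client and the node IDs from this example. The exact commands and timing needed to trigger the collision are guesses.

# Hypothetical sketch of script_to_setslot_and_bumpepoch_at_the_same_time.py
import threading
import redis

A_ID = "000000000000000000000000000000000000000a"
B_ID = "000000000000000000000000000000000000000b"
node_a = redis.Redis(port=6000)  # source: slot 0 was marked MIGRATING above
node_b = redis.Redis(port=6001)  # destination: slot 0 was marked IMPORTING above

def finish_migration():
    # The importing node claims slot 0 and bumps its configEpoch in the process.
    node_b.execute_command("CLUSTER", "SETSLOT", 0, "NODE", B_ID)

def concurrent_bump():
    # Simulate the source node bumping its epoch at the same moment.
    node_a.execute_command("CLUSTER", "BUMPEPOCH")

threads = [threading.Thread(target=finish_migration),
           threading.Thread(target=concurrent_bump)]
for t in threads: t.start()
for t in threads: t.join()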

# Now from the POV of node C on port 6002, slot zero didn't transfer successfully
$ redis-cli -p 6000 cluster nodes
000000000000000000000000000000000000000b 127.0.0.1:6001 master - 0 1485329581910 5 connected 0 5462-10922
000000000000000000000000000000000000000a 127.0.0.1:6000 myself,master - 0 0 6 connected 1-5461
000000000000000000000000000000000000000c 127.0.0.1:6002 master - 0 1485329580901 3 connected 10923-16383
000000000000000000000000000000000000000d 127.0.0.1:6003 master - 0 1485329582923 4 connected

$ redis-cli -p 6001 cluster nodes
000000000000000000000000000000000000000b 127.0.0.1:6001 myself,master - 0 0 5 connected 0 5462-10922
000000000000000000000000000000000000000c 127.0.0.1:6002 master - 0 1485329671929 3 connected 10923-16383
000000000000000000000000000000000000000d 127.0.0.1:6003 master - 0 1485329673954 4 connected
000000000000000000000000000000000000000a 127.0.0.1:6000 master - 0 1485329672943 6 connected 1-5461

$ redis-cli -p 6002 cluster nodes
000000000000000000000000000000000000000c 127.0.0.1:6002 myself,master - 0 0 3 connected 10923-16383
000000000000000000000000000000000000000a 127.0.0.1:6000 master - 0 1485329688210 6 connected 0-5461
000000000000000000000000000000000000000d 127.0.0.1:6003 master - 0 1485329684170 4 connected
000000000000000000000000000000000000000b 127.0.0.1:6001 master - 0 1485329690223 5 connected 5462-10922

The Problem In Real Life

I don't have any corrupted cluster at hand now to analyse. Usually the broken cluster is produced by fixing slots after a resharding failure. We used our own fixing script, which clears the importing and migrating flags and restarts the migration even if only one of the importing or migrating flags is set. Most of the time it got the job done, because the gossip propagation in Redis may just be blocked by the importing flags in clusterUpdateSlotsConfigWith. But we did suffer from this slot inconsistency problem several times, and it's quite hard to fix by hand. Maybe it's a result of misusing cluster setslot node, or something went wrong before the migration that led to the epoch collision problem.

Anyway, I think it makes no sense to allow an inconsistency to exist forever in a cluster based on gossip. Is it possible to fix the slot inconsistency by deleting the slot binding when the former owner declares that it doesn't own the slot any more?
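
One possible manual workaround (a sketch only, not verified on every version): since node C keeps the stale binding only because the new owner's configEpoch is too low, bumping the epoch on the intended owner and then re-asserting the mapping on every master should let gossip converge. For the slot 10922 example above:

$ redis-cli -p 6000 cluster bumpepoch
$ redis-cli -p 6000 cluster setslot 10922 node 000000000000000000000000000000000000000a
$ redis-cli -p 6001 cluster setslot 10922 node 000000000000000000000000000000000000000a
$ redis-cli -p 6002 cluster setslot 10922 node 000000000000000000000000000000000000000a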


yurial commented May 21, 2018

Hi, I have the same problem on Redis Cluster 4.0.8. My cluster is configured as 19 shards (57 nodes total), with network latency of ~1.76 ms.
The problem was not reproduced by redis-trib, because redis-trib processes nodes one by one (it migrates all slots from one node, then all slots from the next node).
To reproduce this bug, I wrote a script (on error, an exception is raised):
migration method: https://pastebin.com/QZLpNyv7
migration script: https://pastebin.com/htvGJU5i
log tail (at the beginning of the log, the messages are the same): https://pastebin.com/fJJmGR2T
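
For readers who just want the shape of it, the standard per-slot migration sequence from the cluster spec looks roughly like the sketch below (redis-py; hosts, ports and the helper name are placeholders, and this is not the script from the links above):

# Rough sketch of moving one slot from src to dst following the cluster spec.
import redis

src = redis.Redis(host="127.0.0.1", port=7000)
dst = redis.Redis(host="127.0.0.1", port=7001)
src_id = src.execute_command("CLUSTER", "MYID").decode()
dst_id = dst.execute_command("CLUSTER", "MYID").decode()

def migrate_slot(slot, timeout_ms=5000, batch=100):
    # 1. Mark the slot importing on the destination and migrating on the source.
    dst.execute_command("CLUSTER", "SETSLOT", slot, "IMPORTING", src_id)
    src.execute_command("CLUSTER", "SETSLOT", slot, "MIGRATING", dst_id)
    # 2. Move the keys in batches until the slot is empty on the source.
    while True:
        keys = src.execute_command("CLUSTER", "GETKEYSINSLOT", slot, batch)
        if not keys:
            break
        src.execute_command("MIGRATE", "127.0.0.1", 7001, "", 0, timeout_ms,
                            "KEYS", *keys)
    # 3. Assign the slot to the destination: the importing node first (so it
    #    bumps its configEpoch), then the source, and ideally every master.
    dst.execute_command("CLUSTER", "SETSLOT", slot, "NODE", dst_id)
    src.execute_command("CLUSTER", "SETSLOT", slot, "NODE", dst_id)

migrate_slot(0)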

@djdawson3

Hi,

I've also seen this error with Redis version 3.2.8. I can confirm that this is caused by resharding failures and attempting to manually migrate slots after said failure.

@skyckp123

Hi, I'm facing a similar problem on Redis version 6.2. Is there any fix or workaround for this issue?
