Fix cluster inconsistent slots mappings after slot migrating and failover occur simultaneously #12336
Conversation
@zuiderkwast I don't think so, but I believe we observed this internally and someone might produce a fix soonish.
@madolson This PR is a fix to the problem. Do you have another fix in mind? @cyningsun Are you able to reproduce the problem consistently? It is always good to have a test. There is a test suite:

```
# Tests for many simultaneous migrations.
# TODO: Test is currently disabled until it is stabilized (fixing the test
# itself or real issues in Redis).
if {false} {

source "../tests/includes/init-tests.tcl"
source "../tests/includes/utils.tcl"

# TODO: This test currently runs without replicas, as failovers (which may
# happen on lower-end CI platforms) are still not handled properly by the
# cluster during slot migration (related to #6339).
```

Maybe this test suite can be enabled with your fix. Delete the `if {false}`.
hehe, yes, it does fix a problem. I think the problem is more pervasive though, and can happen after the slot migration completes and the epoch is bumped. I'm still trying to wrap my head around whether they are complementary fixes and we need both, or whether the one we are looking at internally solves this and we just need one.
Sorry @zuiderkwast, currently I can't reproduce this problem consistently. I wrote a demo and some scripts to simulate the entire process that happened in our production environment. It relies on several …; using my script, the probability of the problem occurring is around 10%. I did notice this test case. Just as @madolson said, only this specific issue has been fixed. It's uncertain whether there are more general issues that can be fixed once and for all. That's why I didn't complete these two TODOs. If suitable, I will.
Thanks @madolson! We observed a similar behavior internally. The main difference between the scenario that @cyningsun posted and what we observed is …

Here is an example scenario. The key detail is that the slot migration source interacts with Node A (a third observer node) before the slot migration target interacts with Node A. Here is the code for the above diagram: Not processing loss of slot ownership.txt

The gist of the fix is to process loss of slot ownership: when a node no longer claims owning a slot, we need to record that. We added an else block to https://github.com/redis/redis/blob/unstable/src/cluster.c#L2244-L2245. Will send a PR later today.
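For readers following the code: a minimal sketch of the kind of else branch described, assuming it sits in the loop of `clusterUpdateSlotsConfigWith()` that walks the sender's claimed slots (the helper `clusterMarkSlotUnclaimed` is a made-up name for illustration, not the actual patch):

```c
/* Sketch only, not the actual patch. Inside clusterUpdateSlotsConfigWith(),
 * while iterating the slots bitmap from the sender's message header. */
for (int j = 0; j < CLUSTER_SLOTS; j++) {
    if (bitmapTestBit(slots, j)) {
        /* Existing logic: the sender claims slot j; take it over if
         * senderConfigEpoch is newer than the current owner's epoch. */
    } else if (server.cluster->slots[j] == sender) {
        /* New branch per the idea above: we still credit slot j to the
         * sender, but the sender no longer claims it. Record this loss of
         * ownership instead of silently keeping the stale mapping. */
        clusterMarkSlotUnclaimed(j); /* hypothetical helper */
    }
}
```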
@madolson is right. My PR handles a SPOF during the slot migration, but this is a race condition between the entire cluster learning about the new slot owner and the old owner getting a …

Regarding the fix, I am thinking whether …
@PingXie our fix is similar. There is one catch though: if node 1 clears the ownership information of the slot, clients can temporarily see errors at the end of slot migration in the common case, until the target broadcasts the change in ownership. Not all clients will retry on these kinds of failures. We need to fix the uncommon race condition without causing errors in the common case. So it might be better to redirect clients even if we are not sure about the ownership change. When the client connects to the redirected node, that node might correct the client (one more redirect) or give the "Hash slot not served" failure. This kind of false information propagation is benign; no data is lost.

But we strictly don't want this node to tell other nodes that the source still owns the slot (when the source is no longer claiming it). If we don't prevent this, it leads to propagation of misinformation that can make the owner of the slot purge the slot, losing data. This is destructive.

We added a bitfield in cluster state to track the slots for which the owner is no longer claiming the slot. If a different node claims a slot that the previous owner is no longer claiming, we shouldn't try to update that node. One side effect of this behavior is that when a slot is deleted, this information is not propagated to all nodes. When …

I have opened a PR with the proposed change: #12344
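A minimal sketch of what such a bitfield could look like, assuming Redis's per-slot bitmap helpers; the field name and call sites are illustrative guesses, not necessarily what #12344 merged:

```c
/* Sketch: one bit per slot, set when the slot's recorded owner has stopped
 * claiming it (a "tombstone"). */
struct clusterState {
    /* ... existing fields ... */
    unsigned char owner_not_claiming_slot[CLUSTER_SLOTS / 8];
};

/* When the recorded owner of slot j stops claiming it: */
bitmapSetBit(server.cluster->owner_not_claiming_slot, j);

/* Before telling another node (e.g. via CLUSTERMSG_TYPE_UPDATE) that the old
 * owner still holds slot j, skip tombstoned slots so a withdrawn claim is
 * never propagated: */
if (bitmapTestBit(server.cluster->owner_not_claiming_slot, j)) {
    /* do not advertise the stale owner for this slot */
}
```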
Makes sense, @srgsanky. I like the "tombstone" idea. Worst case, clients get redirected a second time on …
@PingXie @srgsanky I'd like to add some considerations on why PR …

However, there is also a negative effect …
@cyningsun, skipping …
Is this PR obsolete now that #12344 is merged?
I do think it's obsolete, but it would be great if @cyningsun could either write a test or validate that it does solve their use case as well.
Issue resolved too. Thank you all for your help. 👍
@cyningsun Thank you for taking the effort to report the fix!
Background

Node A (Master) updates `clusterState` based on a `clusterMsg` from another node, Node B (Master). This can produce a slot mapping conflict between Node A and Node B when:

1. Slot S was migrated (`SETSLOT`) from Node B to Node C (Master).
2. Node B received the update from Node C. Its configEpoch collides with another node, Node D (Master). Node B's node id is smaller, so Node B bumps to a new configEpoch, which is the maximum in the cluster.
3. Node A received a `PONG` from Node B. In Node A's POV, Node B's new configEpoch is updated, but Slot S is not updated because it no longer belongs to Node B (the sender no longer claims it).
4. Node C (Master) was assigned this slot, but in Node A's POV the assignment to Node C fails because Node C's config epoch (`senderConfigEpoch`) is smaller than `server.cluster->slots[j]->configEpoch`.

As shown in the figure:
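The check that rejects Node C's claim lives in `clusterUpdateSlotsConfigWith()`. Simplified (details vary across versions), a claimed slot is only reassigned when the claimer's epoch is strictly greater than the epoch of the node we currently credit with the slot:

```c
/* Simplified from clusterUpdateSlotsConfigWith() in cluster.c. */
for (int j = 0; j < CLUSTER_SLOTS; j++) {
    if (!bitmapTestBit(slots, j)) continue;       /* sender doesn't claim slot j */
    if (server.cluster->slots[j] == sender) continue;

    if (server.cluster->slots[j] == NULL ||
        server.cluster->slots[j]->configEpoch < senderConfigEpoch)
    {
        /* Accept the claim: reassign slot j to the sender. */
        clusterDelSlot(j);
        clusterAddSlot(sender, j);
    }
    /* Otherwise the claim is ignored. In the scenario above, Slot S is still
     * credited to Node B, whose configEpoch was just bumped by collision
     * resolution, so Node C's smaller senderConfigEpoch loses here. */
}
```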
How to fix it

Skip updating the sender's configEpoch and slots while a migrated slot of the sender is still pending claim in our POV.

The configEpoch acts as the guard of the slots a node owns in our POV. If a migrated slot of the sender in our POV is still pending claim by a new master, the sender's config epoch in our POV cannot be updated separately. Otherwise, the slot would be guarded by the new config epoch of the previous master, an epoch that may have been bumped after the slot was already assigned to another master. The new master would then fail to claim the slot because its `senderConfigEpoch` is smaller than `server.cluster->slots[j]->configEpoch`.
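One possible shape of that guard, in the spirit of the description above (a sketch only; `senderHasUnclaimedMigratedSlot()` is a hypothetical helper, not code from this PR):

```c
/* Does our table still credit the sender with a slot that the sender itself
 * no longer claims, i.e. a migrated slot whose new owner we haven't heard
 * from yet? */
static int senderHasUnclaimedMigratedSlot(clusterNode *sender, unsigned char *claimed) {
    for (int j = 0; j < CLUSTER_SLOTS; j++) {
        if (server.cluster->slots[j] == sender && !bitmapTestBit(claimed, j))
            return 1;
    }
    return 0;
}

/* In the packet-processing path: skip adopting the sender's bumped
 * configEpoch in that case, so the stale slot mapping is not guarded by the
 * new epoch and the real new owner can still claim the slot later. */
if (senderConfigEpoch > sender->configEpoch &&
    !senderHasUnclaimedMigratedSlot(sender, hdr->myslots))
{
    sender->configEpoch = senderConfigEpoch;
}
```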
Example
Here is a cluster that has 5 shards, each shard with 2 replicas, shown below before the slot migration.
Operations

- Migrate slot 0 from 6ea4c2f8e7efe9180db83cf0adf1b055a557c74d 10.53.52.144:6379 to d69ae698cb4ec7af73e8c15eb64d6f8b6cde4f59 10.53.52.144:6378
- Failover: 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372
- Migrate slot 16383 from c013fefabf931b77a5488e4e0fe1e81db1eaf4aa 10.53.52.144:6375 to 0c92e0b02f839260b508a6fdf258539a26e96903 10.53.52.144:6376
Let's focus on how slot 16383 becomes conflicted in the cluster.

Timeline
1. Failover election for 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372, with `failover_auth_epoch = 9`.
2. Slot 0 is `SETSLOT` to d69ae698cb4ec7af73e8c15eb64d6f8b6cde4f59 10.53.52.144:6378.
3. Slot 16383 is `SETSLOT` to 0c92e0b02f839260b508a6fdf258539a26e96903 10.53.52.144:6376. After 0c92e0b02f839260b508a6fdf258539a26e96903 10.53.52.144:6376 receives the `PONG` sent by d69ae698cb4ec7af73e8c15eb64d6f8b6cde4f59 10.53.52.144:6378, its configEpoch and currentEpoch are 9+1 = 10.
4. The failover is authed to 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372.
5. The other nodes receive the `PONG` sent by 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372.
6. 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 receives the `PONG` sent by 0c92e0b02f839260b508a6fdf258539a26e96903 10.53.52.144:6376.
7. 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 resolves the configEpoch collision and sends a `PONG` message. The collision is between 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 and d69ae698cb4ec7af73e8c15eb64d6f8b6cde4f59 10.53.52.144:6378, and 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 has the smaller Node ID. Slot 0 and slot 16383 ownership on the other nodes is not affected, because `myslots` in the `clusterMsg` header of 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 does not contain those two slots.
8. 0c92e0b02f839260b508a6fdf258539a26e96903 10.53.52.144:6376 is reset by a `CLUSTERMSG_TYPE_UPDATE` message:
   - 0c92e0b02f839260b508a6fdf258539a26e96903 10.53.52.144:6376: slot 16383 ownership is reset. Its configEpoch (`senderConfigEpoch`) is 10, which is smaller than `server.cluster->slots[j]->configEpoch` 11, so the other nodes find it has an old configuration and send a `CLUSTERMSG_TYPE_UPDATE` message asking it to reset the ownership to 6372. This finally succeeds because, on 6376, slot 16383 is guarded by configEpoch 10, which is smaller than `senderConfigEpoch` (11; here it is not really the epoch of the "sender" of the message, but the epoch carried for the slot in the UPDATE).
   - 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372: slot 16383 ownership is not reset. 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 does not claim slot 16383, and its configEpoch (`senderConfigEpoch`) is 11, which is equal to `server.cluster->slots[j]->configEpoch` 11, so no `CLUSTERMSG_TYPE_UPDATE` message is sent. Also, no sender claiming slot 16383 has a configEpoch greater than 10, so the slot is never reassigned via `clusterUpdateSlotsConfigWith` through `CLUSTERMSG_TYPE_PING`/`PONG` messages.
9. Slot 0 finishes gossiping among the cluster.
Result

The cluster has an inconsistent slots mapping for slot 16383 between 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 and the other nodes.
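For reference, the configEpoch collision resolution that bumps 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 to configEpoch 11 in step 7 of the timeline works roughly as follows (paraphrased from `clusterHandleConfigEpochCollision()` in cluster.c; see the source for the exact code):

```c
/* Paraphrased sketch of clusterHandleConfigEpochCollision(). */
void clusterHandleConfigEpochCollision(clusterNode *sender) {
    /* Only two masters with exactly the same configEpoch collide. */
    if (sender->configEpoch != myself->configEpoch ||
        !nodeIsMaster(sender) || !nodeIsMaster(myself)) return;
    /* Only the node with the lexicographically smaller Node ID reacts. */
    if (memcmp(sender->name, myself->name, CLUSTER_NAMELEN) <= 0) return;
    /* Bump to the next epoch, making it the highest known in the cluster. */
    server.cluster->currentEpoch++;
    myself->configEpoch = server.cluster->currentEpoch;
}
```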