Fix cluster inconsistent slots mappings after slot migrating and failover occur simultaneously #12336
Conversation
@zuiderkwast I don't think so, but I believe we observed this internally and someone might produce a fix soonish.
@madolson This PR is a fix to the problem. Do you have another fix in mind? @cyningsun Are you able to reproduce the problem consistently? It is always good to have a test. There is a test suite:

```
# Tests for many simultaneous migrations.
# TODO: Test is currently disabled until it is stabilized (fixing the test
# itself or real issues in Redis).
if {false} {

source "../tests/includes/init-tests.tcl"
source "../tests/includes/utils.tcl"

# TODO: This test currently runs without replicas, as failovers (which may
# happen on lower-end CI platforms) are still not handled properly by the
# cluster during slot migration (related to #6339).
```

Maybe this test suite can be enabled with your fix. Delete the `if {false}`.
hehe, yes, it does fix a problem. I think the problem is more pervasive though, and can happen after the slot migration completes and the epoch is bumped. I'm still trying to wrap my head around whether they are complementary fixes and we need both, or whether the one we are looking at internally solves this and we just need one.
Sorry @zuiderkwast, currently I can't reproduce this problem consistently. I wrote a demo and some scripts to simulate the entire process that happened in our production environment. It relies on several …; using my script, the probability of the problem occurring is around 10%. I did notice this test case. Just as @madolson said, only this specific issue has been fixed. It's uncertain whether there are more general issues that can be fixed once and for all. That's why I didn't complete these two TODOs. If suitable, I will.
Thanks @madolson! We observed a similar behavior internally. The main difference between the scenario that @cyningsun posted and what we observed is …

Here is an example scenario. The key detail is that the slot migration source interacts with Node A (a third observer node) before the slot migration target interacts with Node A. Here is the code for the above diagram: Not processing loss of slot ownership.txt

The gist of the fix is to process loss of slot ownership: when a node no longer claims owning a slot, we need to record that. We added an else block to https://github.com/redis/redis/blob/unstable/src/cluster.c#L2244-L2245. Will send a PR later today.
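For readers following the code: a minimal sketch of the kind of else branch described, assuming it sits in the loop of `clusterUpdateSlotsConfigWith()` that walks the sender's claimed slots (the helper `clusterMarkSlotUnclaimed` is a made-up name for illustration, not the actual patch):

```c
/* Sketch only, not the actual patch. Inside clusterUpdateSlotsConfigWith(),
 * while iterating the slots bitmap from the sender's message header. */
for (int j = 0; j < CLUSTER_SLOTS; j++) {
    if (bitmapTestBit(slots, j)) {
        /* Existing logic: the sender claims slot j; take it over if
         * senderConfigEpoch is newer than the current owner's epoch. */
    } else if (server.cluster->slots[j] == sender) {
        /* New branch per the idea above: we still credit slot j to the
         * sender, but the sender no longer claims it. Record this loss of
         * ownership instead of silently keeping the stale mapping. */
        clusterMarkSlotUnclaimed(j); /* hypothetical helper */
    }
}
```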
@madolson is right. My PR handles a SPOF during the slot migration, but this is a race condition between the entire cluster learning about the new slot owner and the old owner getting a …

Regarding the fix, I am thinking whether …
@PingXie our fix is similar. There is one catch though: if node 1 clears the ownership information of the slot, clients can temporarily see errors at the end of slot migration in the common case, until the target broadcasts the change in ownership. Not all clients will retry on these kinds of failures. We need to fix the uncommon race condition without causing errors in the common case. So it might be better to redirect clients even if we are not sure about the ownership change. When the client connects to the redirected node, that node might correct the client (one more redirect) or give the "Hash slot not served" failure. This kind of false information propagation is benign; no data is lost.

But we strictly don't want this node to tell other nodes that the source still owns the slot (when the source is no longer claiming it). If we don't prevent this, it leads to propagation of misinformation that can make the owner of the slot purge the slot, losing data. This is destructive.

We added a bitfield in cluster state to track the slots for which the owner is no longer claiming the slot. If a different node claims a slot that the previous owner is no longer claiming, we shouldn't try to update that node. One side effect of this behavior is that when a slot is deleted, this information is not propagated to all nodes. When …

I have opened a PR with the proposed change: #12344
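A minimal sketch of what such a bitfield could look like, assuming Redis's per-slot bitmap helpers; the field name and call sites are illustrative guesses, not necessarily what #12344 merged:

```c
/* Sketch: one bit per slot, set when the slot's recorded owner has stopped
 * claiming it (a "tombstone"). */
struct clusterState {
    /* ... existing fields ... */
    unsigned char owner_not_claiming_slot[CLUSTER_SLOTS / 8];
};

/* When the recorded owner of slot j stops claiming it: */
bitmapSetBit(server.cluster->owner_not_claiming_slot, j);

/* Before telling another node (e.g. via CLUSTERMSG_TYPE_UPDATE) that the old
 * owner still holds slot j, skip tombstoned slots so a withdrawn claim is
 * never propagated: */
if (bitmapTestBit(server.cluster->owner_not_claiming_slot, j)) {
    /* do not advertise the stale owner for this slot */
}
```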
Makes sense, @srgsanky. I like the "tombstone" idea. Worst case, clients get redirected a second time on …
@PingXie @srgsanky I'd like to add some considerations on why PR …

However, there is also a negative effect …
@cyningsun, skipping …
Is this PR obsolete now that #12344 is merged?
I do think it's obsolete, but it would be great if @cyningsun could either write a test or validate that it does solve their use case as well.
Issue resolved too. Thank you all for your help. 👍
@cyningsun Thank you for taking the effort to report the fix!
Background

Node A (Master) updates `clusterState` based on a `clusterMsg` from another node, Node B (Master). This can produce a slot mapping conflict between Node A and Node B when:

1. Slot S was migrated (`SETSLOT`) from Node B to Node C (Master).
2. Node B received the update from Node C. Its configEpoch collides with another node, Node D (Master). Node B's node id is smaller, so Node B bumps to a new configEpoch, which is the maximum in the cluster.
3. Node A received a `PONG` from Node B. In Node A's POV, Node B's new configEpoch is updated, but Slot S is not updated because it no longer belongs to Node B (the sender no longer claims it).
4. Node C (Master) was assigned this slot, but in Node A's POV the assignment to Node C fails because Node C's config epoch (`senderConfigEpoch`) is smaller than `server.cluster->slots[j]->configEpoch`.

As shown in the figure:
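The check that rejects Node C's claim lives in `clusterUpdateSlotsConfigWith()`. Simplified (details vary across versions), a claimed slot is only reassigned when the claimer's epoch is strictly greater than the epoch of the node we currently credit with the slot:

```c
/* Simplified from clusterUpdateSlotsConfigWith() in cluster.c. */
for (int j = 0; j < CLUSTER_SLOTS; j++) {
    if (!bitmapTestBit(slots, j)) continue;       /* sender doesn't claim slot j */
    if (server.cluster->slots[j] == sender) continue;

    if (server.cluster->slots[j] == NULL ||
        server.cluster->slots[j]->configEpoch < senderConfigEpoch)
    {
        /* Accept the claim: reassign slot j to the sender. */
        clusterDelSlot(j);
        clusterAddSlot(sender, j);
    }
    /* Otherwise the claim is ignored. In the scenario above, Slot S is still
     * credited to Node B, whose configEpoch was just bumped by collision
     * resolution, so Node C's smaller senderConfigEpoch loses here. */
}
```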
How to fix it

Skip updating the sender's configEpoch and slots while a migrated slot of the sender is still pending claim in our POV.

The configEpoch acts as the guard of the slots a node owns in our POV. If a migrated slot of the sender in our POV is still pending claim by a new master, the sender's config epoch in our POV cannot be updated separately. Otherwise, the slot would be guarded by the new config epoch of the previous master, an epoch that may have been bumped after the slot was already assigned to another master. The new master would then fail to claim the slot because its `senderConfigEpoch` is smaller than `server.cluster->slots[j]->configEpoch`.
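One possible shape of that guard, in the spirit of the description above (a sketch only; `senderHasUnclaimedMigratedSlot()` is a hypothetical helper, not code from this PR):

```c
/* Does our table still credit the sender with a slot that the sender itself
 * no longer claims, i.e. a migrated slot whose new owner we haven't heard
 * from yet? */
static int senderHasUnclaimedMigratedSlot(clusterNode *sender, unsigned char *claimed) {
    for (int j = 0; j < CLUSTER_SLOTS; j++) {
        if (server.cluster->slots[j] == sender && !bitmapTestBit(claimed, j))
            return 1;
    }
    return 0;
}

/* In the packet-processing path: skip adopting the sender's bumped
 * configEpoch in that case, so the stale slot mapping is not guarded by the
 * new epoch and the real new owner can still claim the slot later. */
if (senderConfigEpoch > sender->configEpoch &&
    !senderHasUnclaimedMigratedSlot(sender, hdr->myslots))
{
    sender->configEpoch = senderConfigEpoch;
}
```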
Example
Here is a cluster that has 5 shards, each shard with 2 replicas, shown below before the slot migration.
Operations

- Migrate slot 0 from 6ea4c2f8e7efe9180db83cf0adf1b055a557c74d 10.53.52.144:6379 to d69ae698cb4ec7af73e8c15eb64d6f8b6cde4f59 10.53.52.144:6378
- Failover: 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372
- Migrate slot 16383 from c013fefabf931b77a5488e4e0fe1e81db1eaf4aa 10.53.52.144:6375 to 0c92e0b02f839260b508a6fdf258539a26e96903 10.53.52.144:6376
Let's focus on how slot 16383 becomes conflicted in the cluster.

Timeline
1. Failover election for 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372, with `failover_auth_epoch = 9`.
2. Slot 0 is `SETSLOT` to d69ae698cb4ec7af73e8c15eb64d6f8b6cde4f59 10.53.52.144:6378.
3. Slot 16383 is `SETSLOT` to 0c92e0b02f839260b508a6fdf258539a26e96903 10.53.52.144:6376. After 0c92e0b02f839260b508a6fdf258539a26e96903 10.53.52.144:6376 receives the `PONG` sent by d69ae698cb4ec7af73e8c15eb64d6f8b6cde4f59 10.53.52.144:6378, its configEpoch and currentEpoch are 9+1 = 10.
4. The failover is authed to 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372.
5. The other nodes receive the `PONG` sent by 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372.
6. 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 receives the `PONG` sent by 0c92e0b02f839260b508a6fdf258539a26e96903 10.53.52.144:6376.
7. 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 resolves the configEpoch collision and sends a `PONG` message. The collision is between 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 and d69ae698cb4ec7af73e8c15eb64d6f8b6cde4f59 10.53.52.144:6378, and 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 has the smaller Node ID. Slot 0 and slot 16383 ownership on the other nodes is not affected, because `myslots` in the `clusterMsg` header of 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 does not contain those two slots.
8. 0c92e0b02f839260b508a6fdf258539a26e96903 10.53.52.144:6376 is reset by a `CLUSTERMSG_TYPE_UPDATE` message:
   - 0c92e0b02f839260b508a6fdf258539a26e96903 10.53.52.144:6376: slot 16383 ownership is reset. Its configEpoch (`senderConfigEpoch`) is 10, which is smaller than `server.cluster->slots[j]->configEpoch` 11, so the other nodes find it has an old configuration and send a `CLUSTERMSG_TYPE_UPDATE` message asking it to reset the ownership to 6372. This finally succeeds because, on 6376, slot 16383 is guarded by configEpoch 10, which is smaller than `senderConfigEpoch` (11; here it is not really the epoch of the "sender" of the message, but the epoch carried for the slot in the UPDATE).
   - 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372: slot 16383 ownership is not reset. 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 does not claim slot 16383, and its configEpoch (`senderConfigEpoch`) is 11, which is equal to `server.cluster->slots[j]->configEpoch` 11, so no `CLUSTERMSG_TYPE_UPDATE` message is sent. Also, no sender claiming slot 16383 has a configEpoch greater than 10, so the slot is never reassigned via `clusterUpdateSlotsConfigWith` through `CLUSTERMSG_TYPE_PING`/`PONG` messages.
9. Slot 0 finishes gossiping among the cluster.
Result

The cluster has an inconsistent slots mapping for slot 16383 between 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 and the other nodes.
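For reference, the configEpoch collision resolution that bumps 2ddaf93f230facc9893bc95bf9234a3d580d0291 10.53.52.144:6372 to configEpoch 11 in step 7 of the timeline works roughly as follows (paraphrased from `clusterHandleConfigEpochCollision()` in cluster.c; see the source for the exact code):

```c
/* Paraphrased sketch of clusterHandleConfigEpochCollision(). */
void clusterHandleConfigEpochCollision(clusterNode *sender) {
    /* Only two masters with exactly the same configEpoch collide. */
    if (sender->configEpoch != myself->configEpoch ||
        !nodeIsMaster(sender) || !nodeIsMaster(myself)) return;
    /* Only the node with the lexicographically smaller Node ID reacts. */
    if (memcmp(sender->name, myself->name, CLUSTER_NAMELEN) <= 0) return;
    /* Bump to the next epoch, making it the highest known in the cluster. */
    server.cluster->currentEpoch++;
    myself->configEpoch = server.cluster->currentEpoch;
}
```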