Description
The current cluster design has trouble handling a master node crash during slot migration.
Some notes about the current design need to be mentioned first:
1. The importing flag and the migrating flag are local to the master node.
2. When gossip is used to propagate the slot distribution, the owner of a slot is the only source that can spread the information.
3. The epoch design can't carry enough information to resolve config conflicts between nodes from different 'slices'. Epochs are only suitable for resolving conflicts inside the same 'slice'.
More explanation of points 2 and 3:
Suppose we are migrating slot x from A to B and have called cluster setslot x node {B-id} on all master nodes (slave nodes reject this command). If B then crashes before pinging any of its slave nodes, a failover promotes one of those slaves. The new B will never learn that it owns slot x, because the old B was the single point of failure: it was the only node that could spread that information.
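To make the sequence concrete, here is a minimal toy model of the scenario, not Redis code. The `ToyNode` class, its methods, `promote`, and the slot number 42 are all hypothetical names chosen for illustration; the only behaviors modeled are the ones described above (slaves reject setslot, and only the owner of a slot advertises it).

```python
class ToyNode:
    """Hypothetical stand-in for a cluster node; not the real Redis structures."""

    def __init__(self, name, is_master):
        self.name = name
        self.is_master = is_master
        self.slot_owner = {}  # this node's local view: slot -> owner name

    def setslot(self, slot, owner):
        # Modeled on the description above: slave nodes reject the command.
        if not self.is_master:
            raise PermissionError("slave nodes reject setslot")
        self.slot_owner[slot] = owner

    def ping(self, peer, slot):
        # Only the owner of a slot advertises its ownership.
        if self.slot_owner.get(slot) == self.name:
            peer.slot_owner[slot] = self.name


def promote(slave, old_master_name):
    """Failover: the slave takes over the slots it believes its master owned."""
    slave.is_master = True
    for slot, owner in list(slave.slot_owner.items()):
        if owner == old_master_name:
            slave.slot_owner[slot] = slave.name


def run_scenario():
    slot = 42
    a = ToyNode("A", is_master=True)
    b = ToyNode("B", is_master=True)
    b1 = ToyNode("B1", is_master=False)  # slave of B
    for node in (a, b, b1):
        node.slot_owner[slot] = "A"  # before migration, everyone agrees on A

    # setslot is applied on every master; the slave never sees it.
    a.setslot(slot, "B")
    b.setslot(slot, "B")

    # B crashes here, before b.ping(b1, slot) ever runs.
    promote(b1, "B")

    # A believes B owns the slot; the new B still believes A owns it.
    return a.slot_owner[slot], b1.slot_owner[slot]
```

In this model `run_scenario()` ends with A pointing at B and the promoted slave still pointing at A, so no live node claims the slot.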
The epoch design is similar to the term in the Raft protocol: it is useful for leader election. I call a master node plus its slave nodes a slice. A conflict arises when, for example, node B thinks slot x belongs to node C while node A thinks slot x belongs to node A. When node A pings node B, node B notices the conflict. If C and A belong to the same slice, this is a conflict within the same slice; otherwise it is a conflict between different slices.
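The distinction between the two kinds of conflict can be sketched as a small classifier. This is only an illustration of the definition above; `Node`, `slice_of`, and `classify_conflict` are hypothetical names, and a slice is identified here by the name of its master.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    name: str
    master: Optional[str] = None  # None means this node is itself a master


def slice_of(node: Node) -> str:
    """A slice is a master plus its slaves; identify it by the master's name."""
    return node.master if node.master is not None else node.name


def classify_conflict(receiver_view: Node, sender_claim: Node) -> str:
    """Compare the owner the receiver believes in with the owner the sender claims."""
    if receiver_view.name == sender_claim.name:
        return "no conflict"
    if slice_of(receiver_view) == slice_of(sender_claim):
        return "same slice"
    return "different slices"
```

For example, with `A = Node("A")`, `A1 = Node("A1", master="A")` and `C = Node("C")`, `classify_conflict(A1, A)` is a same-slice conflict, while `classify_conflict(C, A)` is a conflict between different slices.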
A conflict between different slices can't be resolved simply by comparing epochs. Suppose we are migrating slot x from A to B and, just after we call cluster setslot x node {B-id} on node B, node A crashes. The new A still thinks it owns slot x (due to problem 1 mentioned above), so the conflict here is between two different slices. The new A may have a bigger epoch than B (after B bumps its epoch locally), or it may have a smaller one. But we know that the rightful owner of x is B, regardless of who has the bigger epoch. So the epoch-based conflict resolution algorithm is broken here.
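The problem can be shown with a simplified "higher epoch wins" rule. This is a sketch of the idea, not the actual cluster code; `resolve_by_epoch` and the epoch values are assumptions made up for the example.

```python
def resolve_by_epoch(current_owner, current_epoch, claimant, claimant_epoch):
    """Simplified rule: a claim replaces the current owner only if its epoch is strictly greater."""
    if claimant_epoch > current_epoch:
        return claimant
    return current_owner


# B's view after setslot: B owns slot x at epoch 6 (hypothetical value).
# The new A, promoted after the crash, still claims slot x.
outcome_if_new_a_is_ahead = resolve_by_epoch("B", 6, "new A", 7)
outcome_if_new_a_is_behind = resolve_by_epoch("B", 6, "new A", 5)
```

The first call hands the slot back to the new A and the second keeps it on B, although the migration state is identical in both cases. The outcome depends only on which side happens to hold the bigger epoch, which is exactly the information that does not determine the rightful owner.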