Consider the case: replicasets rs1, rs2, rs3. Rs1 sends a bucket to rs2. The first message, which is supposed to create the 'RECEIVING' bucket on rs2, is lost in the network. Rs1 gets a timeout, recreates the connection, checks rs2, does not find the bucket there, and activates it locally. The bucket is then sent to rs3. Now the old TCP message finally arrives at rs2, and the bucket becomes 'RECEIVING' there. Recovery on rs2 goes to rs1, does not find the bucket there (or finds it as 'GARBAGE'), and activates it locally. Now the bucket is active on both rs2 and rs3.
It was not reproduced anywhere, but can happen in theory.
A dumb solution, which probably would help in most cases: let recovery scan the whole cluster, not only the sender of a 'RECEIVING' bucket. Unfortunately, it does not really solve the issue: the bucket may turn out to be 'RECEIVING' on some other node too, and then it can't be determined which copy should be activated.
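A minimal sketch of that approach, just to illustrate its shape and its limitation. The `replicasets` table of netbox-like connections and the remote `vshard.storage.bucket_status` function are assumptions, not the actual vshard API:

```lua
-- Illustrative sketch: recovery polls every replicaset for the bucket
-- status instead of only the recorded sender. All names here
-- (replicasets, conn:call, vshard.storage.bucket_status) are assumptions.
local function find_bucket_statuses(replicasets, bucket_id)
    local statuses = {}
    for uuid, conn in pairs(replicasets) do
        -- Each call is assumed to return 'active', 'receiving',
        -- 'garbage', or nil when the bucket is unknown.
        local ok, status = pcall(conn.call, conn,
            'vshard.storage.bucket_status', {bucket_id})
        if ok then
            statuses[uuid] = status
        end
    end
    return statuses
end

-- The weakness: if two replicasets both report 'receiving', there is no
-- way to tell which copy is the real one, so this alone is not enough.
```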
A right solution: add bucket versioning. Each bucket gets a persistent counter called 'version'. To send a bucket, the sender increments its version and sends {'RECEIVING', new_version} to the destination node, which remembers that version.
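A rough sketch of the sender side under this scheme, assuming a `_bucket` tuple layout of {id, status, destination, version} and a hypothetical remote entry point `vshard.storage.bucket_recv_versioned`; this is not the actual vshard code:

```lua
-- Illustrative sender-side sketch. The named 'version' field and the
-- remote 'bucket_recv_versioned' function are assumptions.
local function bucket_send_versioned(bucket_id, dest_conn)
    local bucket = box.space._bucket:get{bucket_id}
    local new_version = (bucket.version or 0) + 1
    -- Persist the incremented version before anything goes to the network.
    box.space._bucket:update(bucket_id,
        {{'=', 'status', 'sending'}, {'=', 'version', new_version}})
    -- The first message tells the destination to create the bucket as
    -- RECEIVING and to remember the version it was sent with.
    dest_conn:call('vshard.storage.bucket_recv_versioned',
        {bucket_id, 'receiving', new_version})
    return new_version
end
```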
Recovery uses the versions. If a RECEIVING bucket is being recovered and recovery sees a bigger version for it on the sender node, then the bucket was already sent somewhere else and can be garbaged here, regardless of its status on the sender node. In all other cases recovery works as it does now.
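A sketch of that recovery rule, assuming both the local and the sender's `_bucket` records carry the version field; `recover_receiving_bucket_legacy` stands for the recovery logic as it is today and is a placeholder name:

```lua
-- Illustrative recovery sketch using versions. 'local_bucket' is the
-- RECEIVING record on this node, 'sender_bucket' is what the sender
-- replicaset returned for the same id; both layouts are assumptions.
local function recover_receiving_bucket(local_bucket, sender_bucket)
    if sender_bucket ~= nil and
       (sender_bucket.version or 0) > (local_bucket.version or 0) then
        -- The sender already re-sent the bucket elsewhere with a newer
        -- version, so this RECEIVING copy is stale and can be garbaged,
        -- regardless of the sender's current status.
        return 'garbage'
    end
    -- Otherwise fall back to the current recovery rules.
    return recover_receiving_bucket_legacy(local_bucket, sender_bucket)
end
```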
The problem here is that it requires an upgrade procedure for vshard (it was never needed before) to change the _bucket space format. Also, garbage buckets can never be fully removed from the _bucket space: their content is removed, but the record in _bucket can't be deleted, because each node should remember the last seen version of every bucket ever stored here. Perhaps 'ever' could mean several days, or until an admin manually forces collection of such bucket records.
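For illustration, the upgraded _bucket format could look roughly like this; the exact field set of the real _bucket space, and whether the new field would be nullable, are assumptions:

```lua
-- Illustrative upgraded _bucket format with a persistent version field.
-- Field names and types here are assumptions for the sake of the sketch.
box.space._bucket:format({
    {name = 'id', type = 'unsigned'},
    {name = 'status', type = 'string'},
    {name = 'destination', type = 'string', is_nullable = true},
    {name = 'version', type = 'unsigned', is_nullable = true},
})
-- A garbaged bucket keeps only this record (its content is dropped), so
-- the node still remembers the last version it has ever seen for it.
```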
Also, a stray TCP message may carry some space data and arrive in the middle of receiving the same bucket from some third node. That data must not be applied, and versions would help here as well.
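A receiver-side sketch of that check, assuming each data message carries the version it was sent with; function and field names are hypothetical:

```lua
-- Illustrative receiver-side check: data chunks for a bucket are applied
-- only if they carry the version the bucket is currently being received
-- with. The message layout ({space, tuple} rows) is an assumption.
local function bucket_apply_data(bucket_id, msg_version, rows)
    local bucket = box.space._bucket:get{bucket_id}
    if bucket == nil or bucket.status ~= 'receiving' or
       (bucket.version or 0) ~= msg_version then
        -- A stray message from an older transfer: drop it.
        return false, 'stale bucket data message'
    end
    for _, row in ipairs(rows) do
        box.space[row.space]:replace(row.tuple)
    end
    return true
end
```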