Stray TCP message with big delay may duplicate a bucket #214

Open

Gerold103 opened this issue Feb 24, 2020 · 0 comments
Consider the case: replicasets rs1, rs2, rs3. Rs1 sends a bucket to rs2. The first message, which is supposed to create a 'RECEIVING' bucket on rs2, gets stuck in the network. Rs1 hits a timeout, recreates the connection, goes to rs2, does not find the bucket there, and activates it locally. Then the bucket is sent to rs3. Now the old delayed TCP message finally arrives at rs2, and the bucket becomes 'RECEIVING' there. Recovery on rs2 will go to rs1, won't find the bucket there (or will find it 'GARBAGE'), and will activate it locally. Now the bucket is active on both rs2 and rs3.
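A minimal sketch of the race, using plain Lua tables to model each replicaset's view of the bucket status (the step ordering follows the description above; the variable names are illustrative only, not vshard internals):

```lua
-- Toy model of the race. Each replicaset's view of the bucket is just a
-- status string (nil = no record at all).
local rs1, rs2, rs3 = {status = 'ACTIVE'}, {}, {}

-- 1. rs1 starts sending the bucket to rs2; the 'RECEIVING' message is delayed.
local delayed_msg = {status = 'RECEIVING'}   -- stuck in the network

-- 2. rs1 times out, sees no bucket on rs2, re-activates it locally,
--    and then successfully sends it to rs3.
rs1.status = 'GARBAGE'                       -- after the successful send to rs3
rs3.status = 'ACTIVE'

-- 3. The delayed message finally arrives at rs2.
rs2.status = delayed_msg.status              -- 'RECEIVING'

-- 4. Recovery on rs2 asks rs1, finds no bucket (or 'GARBAGE'),
--    and activates the local copy.
if rs1.status == nil or rs1.status == 'GARBAGE' then
    rs2.status = 'ACTIVE'
end

print(rs2.status, rs3.status)                -- ACTIVE  ACTIVE: duplicate bucket
```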

It has not been reproduced anywhere, but it can happen in theory.

A dumb solution, which would probably help in most cases: let recovery scan the whole cluster, not only the sender of a 'RECEIVING' bucket. Unfortunately, it does not really solve the issue: the bucket may turn out to be 'RECEIVING' on some other node too, and then it cannot be determined which copy should be activated.

A proper solution: add bucket versioning. Each bucket gets a persistent counter called 'version'. To send a bucket, the sender increments its version and sends {'RECEIVING', new_version} to the destination node. The destination node remembers the version.
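A rough sketch of the send side under this scheme, with plain Lua tables standing in for the bucket records (the field names and message layout are assumptions for illustration, not the actual _bucket format):

```lua
-- Hypothetical per-bucket record on the sender: status plus a persistent
-- version counter that only ever grows.
local bucket = {id = 100, status = 'ACTIVE', version = 3}

-- To start sending, the sender bumps the version first, then ships the new
-- status together with the new version to the destination.
local function start_send(b)
    b.version = b.version + 1
    return {id = b.id, status = 'RECEIVING', version = b.version}
end

local msg = start_send(bucket)
-- The destination stores the received version alongside the status, so a
-- later, older message (with a smaller version) can be recognized as stray.
print(msg.id, msg.status, msg.version)  -- 100  RECEIVING  4
```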

Recovery uses the versions. If a 'RECEIVING' bucket is being recovered, and recovery sees that the bucket has a bigger version on the sender node, then it was already sent somewhere else and can be turned into garbage here, regardless of its status on the sender node. In all other cases recovery works as it does now.
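A sketch of the version-aware recovery decision for a local 'RECEIVING' bucket, again with hypothetical names; the fallback branch just stands for the recovery logic that exists today:

```lua
-- 'local_b' is this node's record, 'sender_b' is the record fetched from the
-- sender node (nil if the sender has no record for the bucket at all).
local function recover_receiving(local_b, sender_b)
    if sender_b ~= nil and sender_b.version > local_b.version then
        -- The sender moved on and re-sent the bucket elsewhere with a newer
        -- version; this local copy is stale and can be garbage collected.
        return 'GARBAGE'
    end
    -- Otherwise fall back to the current recovery behaviour (not shown here).
    return 'use_current_recovery'
end

-- Example: the delayed-message scenario from above. rs1 re-sent the bucket
-- with version 5, while rs2's stray copy still carries version 4.
print(recover_receiving({version = 4}, {version = 5, status = 'GARBAGE'}))
-- -> GARBAGE
```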

The problem here is that it requires an upgrade procedure for vshard (it was never needed before) to change the _bucket space format. Also, garbage buckets are never removed from the _bucket space: their content is removed, but a record can't be deleted from _bucket, because each node should remember the last seen version of every bucket ever stored on it. Perhaps 'ever' could mean several days, or until an admin manually forces collection of such bucket records.

Also, a stray TCP message may carry some space data and arrive in the middle of receiving the same bucket from some third node. That data should not be applied, and versions would help here as well.
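A sketch of guarding incoming bucket data with the version: a data chunk is applied only when its version matches the transfer the node is currently participating in, and older chunks are recognized as stray and dropped (names and chunk layout are illustrative only):

```lua
-- The bucket currently being received, with the version of the ongoing
-- transfer remembered from the 'RECEIVING' message.
local receiving = {id = 100, status = 'RECEIVING', version = 5}

local function apply_chunk(bucket, chunk)
    if chunk.version ~= bucket.version then
        return false  -- stray data from an older (or foreign) transfer
    end
    -- ... apply chunk.rows to the local spaces here ...
    return true
end

print(apply_chunk(receiving, {version = 4, rows = {}}))  -- false: dropped
print(apply_chunk(receiving, {version = 5, rows = {}}))  -- true: applied
```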
