Stray TCP message with big delay may duplicate a bucket #214

Open

Gerold103 opened this issue Feb 24, 2020 · 0 comments
Consider the case: replicasets rs1, rs2, rs3. Rs1 sends a bucket to rs2. The first message, which is supposed to create a 'RECEIVING' bucket on rs2, gets stuck in the network. Rs1 hits a timeout, recreates the connection, goes to rs2, does not find the bucket there, and activates it locally. Then the bucket is sent to rs3. Now the old delayed TCP message finally arrives at rs2, and the bucket becomes 'RECEIVING' there. Recovery on rs2 will go to rs1, won't find the bucket there (or will find it 'GARBAGE'), and will activate it locally. Now the bucket is active on both rs2 and rs3.
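A minimal sketch of the race, using plain Lua tables to model each replicaset's view of the bucket status (the step ordering follows the description above; the variable names are illustrative only, not vshard internals):

```lua
-- Toy model of the race. Each replicaset's view of the bucket is just a
-- status string (nil = no record at all).
local rs1, rs2, rs3 = {status = 'ACTIVE'}, {}, {}

-- 1. rs1 starts sending the bucket to rs2; the 'RECEIVING' message is delayed.
local delayed_msg = {status = 'RECEIVING'}   -- stuck in the network

-- 2. rs1 times out, sees no bucket on rs2, re-activates it locally,
--    and then successfully sends it to rs3.
rs1.status = 'GARBAGE'                       -- after the successful send to rs3
rs3.status = 'ACTIVE'

-- 3. The delayed message finally arrives at rs2.
rs2.status = delayed_msg.status              -- 'RECEIVING'

-- 4. Recovery on rs2 asks rs1, finds no bucket (or 'GARBAGE'),
--    and activates the local copy.
if rs1.status == nil or rs1.status == 'GARBAGE' then
    rs2.status = 'ACTIVE'
end

print(rs2.status, rs3.status)                -- ACTIVE  ACTIVE: duplicate bucket
```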

It has not been reproduced anywhere, but it can happen in theory.

A dumb solution, which would probably help in most cases: let recovery scan the whole cluster, not only the sender of a 'RECEIVING' bucket. Unfortunately, it does not really solve the issue: the bucket may turn out to be 'RECEIVING' on some other node too, and then it cannot be determined which copy should be activated.

A proper solution: add bucket versioning. Each bucket gets a persistent counter called 'version'. To send a bucket, the sender increments its version and sends {'RECEIVING', new_version} to the destination node. The destination node remembers the version.
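A rough sketch of the send side under this scheme, with plain Lua tables standing in for the bucket records (the field names and message layout are assumptions for illustration, not the actual _bucket format):

```lua
-- Hypothetical per-bucket record on the sender: status plus a persistent
-- version counter that only ever grows.
local bucket = {id = 100, status = 'ACTIVE', version = 3}

-- To start sending, the sender bumps the version first, then ships the new
-- status together with the new version to the destination.
local function start_send(b)
    b.version = b.version + 1
    return {id = b.id, status = 'RECEIVING', version = b.version}
end

local msg = start_send(bucket)
-- The destination stores the received version alongside the status, so a
-- later, older message (with a smaller version) can be recognized as stray.
print(msg.id, msg.status, msg.version)  -- 100  RECEIVING  4
```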

Recovery uses the versions. If a 'RECEIVING' bucket is being recovered, and recovery sees that the bucket has a bigger version on the sender node, then it was already sent somewhere else and can be turned into garbage here, regardless of its status on the sender node. In all other cases recovery works as it does now.
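A sketch of the version-aware recovery decision for a local 'RECEIVING' bucket, again with hypothetical names; the fallback branch just stands for the recovery logic that exists today:

```lua
-- 'local_b' is this node's record, 'sender_b' is the record fetched from the
-- sender node (nil if the sender has no record for the bucket at all).
local function recover_receiving(local_b, sender_b)
    if sender_b ~= nil and sender_b.version > local_b.version then
        -- The sender moved on and re-sent the bucket elsewhere with a newer
        -- version; this local copy is stale and can be garbage collected.
        return 'GARBAGE'
    end
    -- Otherwise fall back to the current recovery behaviour (not shown here).
    return 'use_current_recovery'
end

-- Example: the delayed-message scenario from above. rs1 re-sent the bucket
-- with version 5, while rs2's stray copy still carries version 4.
print(recover_receiving({version = 4}, {version = 5, status = 'GARBAGE'}))
-- -> GARBAGE
```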

The problem here is that it requires an upgrade procedure for vshard (it was never needed before) to change the _bucket space format. Also, garbage buckets are never removed from the _bucket space: their content is removed, but a record can't be deleted from _bucket, because each node should remember the last seen version of every bucket ever stored on it. Perhaps 'ever' could mean several days, or until an admin manually forces collection of such bucket records.

Also, a stray TCP message may carry some space data and arrive in the middle of receiving the same bucket from some third node. That data should not be applied, and versions would help here as well.
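A sketch of guarding incoming bucket data with the version: a data chunk is applied only when its version matches the transfer the node is currently participating in, and older chunks are recognized as stray and dropped (names and chunk layout are illustrative only):

```lua
-- The bucket currently being received, with the version of the ongoing
-- transfer remembered from the 'RECEIVING' message.
local receiving = {id = 100, status = 'RECEIVING', version = 5}

local function apply_chunk(bucket, chunk)
    if chunk.version ~= bucket.version then
        return false  -- stray data from an older (or foreign) transfer
    end
    -- ... apply chunk.rows to the local spaces here ...
    return true
end

print(apply_chunk(receiving, {version = 4, rows = {}}))  -- false: dropped
print(apply_chunk(receiving, {version = 5, rows = {}}))  -- true: applied
```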
