-
Notifications
You must be signed in to change notification settings - Fork 377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
box: apply deletion of replica from _cluster
on the deleted replica
#10114
base: master
Are you sure you want to change the base?
Conversation
c682fc6
to
fd00a73
Compare
}} | ||
cg.replica:start() | ||
-- Add the replica to the master's replicaset table: needed for testing | ||
-- behaviour or deleted replica after recovery. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-- behaviour or deleted replica after recovery. | |
-- the behaviour of the deleted replica after recovery. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
"replica as an anonymous replica" | ||
-- If the deleted replica is configured as anonymous during recovery, the | ||
-- master sees it as a non-anonymous replica in his replicaset table, and | ||
-- reject the subscription request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-- reject the subscription request. | |
-- rejects the subscription request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
end) | ||
|
||
-- Test that synchronous deletion from 3-member cluster with 1 disabled node and | ||
-- (the deleted node alive) works properly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-- (the deleted node alive) works properly. | |
-- the deleted node alive works properly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -0,0 +1,4 @@ | |||
## feature/replication | |||
|
|||
* The deletion of a replica from the `_cluster` space is now applied by the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* The deletion of a replica from the `_cluster` space is now applied by the | |
* A replica deleted from the `_cluster` space now applies its own deletion and does not try to rejoin |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
## feature/replication | ||
|
||
* The deletion of a replica from the `_cluster` space is now applied by the | ||
deleted replica (gh-10088). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
deleted replica (gh-10088). | |
(gh-10088). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
684618b
to
01e87f1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch! Good planing, I like how you managed to split the feature into independent small PRs.
01e87f1
to
b46fab4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch! Cool, the synchro system spaces are moving quick.
...lication-luatest/gh_10088_apply_deletion_of_replica_from_cluster_on_deleted_replica_test.lua
Outdated
Show resolved
Hide resolved
...lication-luatest/gh_10088_apply_deletion_of_replica_from_cluster_on_deleted_replica_test.lua
Show resolved
Hide resolved
b46fab4
to
c646d20
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the fixes! All good now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the patch. No objections)
c646d20
to
9ff6e08
Compare
Sorry, didn't have time today. Will take a closer look and merge tomorrow. |
78a1925
to
7be2395
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the patch. No objections
if (!replica->anon) | ||
tnt_raise(ClientError, ER_PROTOCOL, "Can't subscribe " | ||
"a previously deleted non-anonymous replica " | ||
"as an anonymous replica"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure this check is needed?
Why can't we allow to subscribe the replica as anonymous?
Besides, you need another check for !req.is_anon
: I just managed to:
- delete replica from _cluster
- issue
box.cfg{replication=""}
andbox.cfg{replication="old_replication"}
on it - and insert it into the just-deleted slot.
Is this expected?
I think that the replica should be able to subscribe anonymously after the _cluster deletion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot subscribe the deleted replica as anonymous, since the master node sees it as non-anonymous in his replicaset table, and AFAIC, transitioning it to anonymous (instead of deleting it from the replicaset table and subscribing it as a new anonymous replica) is not easy, since it will break a lot of invariants.
Why can't we allow to subscribe the replica as anonymous?
Besides, you need another check for
!req.is_anon
: I just managed to:
- delete replica from _cluster
- issue
box.cfg{replication=""}
andbox.cfg{replication="old_replication"}
on > it- and insert it into the just-deleted slot.
Is this expected?
Yes, this is exactly the expected behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this expected?
Also seems valid to me.
since it will break a lot of invariants.
The question here why it is forbidden to turn a normal replica into an anonymous one. Maybe @sergepetrenko knows the motivation. Moreover, you won't catch this situation of resubscription if replica object is deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I didn't explain my point good enough.
Basically what happened on steps 2 and 3, the deleted replica registered in the cluster once again.
Anyway, I thought about this some more and I don't see how we can forbid this.
This only happens if the user restarts replication manually, so this seems fine, indeed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question here why it is forbidden to turn a normal replica into an anonymous one. Maybe @sergepetrenko knows the motivation. Moreover, you won't catch this situation of resubscription if replica object is deleted.
This simply was never needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After oral discussion: this check is legit.
src/box/alter.cc
Outdated
diag_set(ClientError, ER_LOCAL_INSTANCE_ID_IS_READ_ONLY, | ||
(unsigned)instance_id); | ||
struct diag *diag = diag_get(); | ||
applier_kill(origin->applier, diag_last_error(diag)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for digging into this!
You're right, it's hard to fix this.
I think the correct behavior should be:
- ack the transaction which contains replica deletion from
_cluster
. - Do not ack any further transactions.
Otherwise we might end up in a situation where master commits replica deletion (2 out of 3 nodes agree, but the replica to be deleted hasn't seen the transaction yet), then master is free to add another replica in place of the old one and perform more transactions. Our old replica might receive its own deletion from cluster together with a bunch of new transactions and ack them wrongly. This can lead to a split-brain potentially.
Maybe you can stop acking everything at the moment cluster deletion is received by the replica and send a single special ack just for the transaction containing the deletion from an on_commit trigger?
Another option is to track lsns up to which a particular node has the right to ACK.
I mean, you perform _cluster deletion of a particular replica in a transaction which ends at LSN 8. Then txn_limbo may only count acks for transactions with lsn <= 8.
The second option has a problem that you have to distinguish acks not by replica ids of their senders, but by their UUIDs.
I'm summoning @Gerold103 and @Serpentian here to discuss
} | ||
diag_clear(diag); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't we also wanna stop the relay on the over nodes to the deleted replica? So that it becomes orphan and we can delete it in replica_clear_id
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say we want to stop relays on the deleted node to the others.
Because basically it has become "anonymous", and we don't allow to replicate from anonymous nodes.
The relays on the other nodes to this replica will be stopped just because the replica stops its appliers.
This patch series applies deletion of replica from
_cluster
on the deleted replica.Closes #10088