Disable split-brain checks for partially upgraded clusters #8996
Labels
2.10
Target is 2.10 and all newer release/master branches
bug
Something isn't working
qsync replication
Comments
sergepetrenko added the bug, qsync replication, and 2.10 labels on Aug 17, 2023
In order to reproduce, one has to issue some synchronous requests on 2.8, wait for their confirmation, issue Once
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue on Aug 22, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue on Aug 28, 2023
Our split-brain detection machinery relies among other things on all nodes tracking the synchro queue confirmed lsn. This tracking was only added together with the split-brain detection. Only the synchro queue owner tracked the confirmed lsn before. This means that after an upgrade all the replicas remember the latest confirmed lsn as 0, and any PROMOTE/DEMOTE request from the queue owner is treated as a split brain. Let's fix this and only enable split-brain detection once the schema version is updated. Thanks to the synchro queue freeze on restart, this can only happen after a new PROMOTE/DEMOTE entry is written by one of the nodes, and thus the correct confirmed lsn is propagated to all the cluster members. Closes tarantool#8996 NO_DOC=bugfix NO_TEST=hard to test, involves multiple versions
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue on Aug 28, 2023
Our split-brain detection machinery relies among other things on all nodes tracking the synchro queue confirmed lsn. This tracking was only added together with the split-brain detection. Only the synchro queue owner tracked the confirmed lsn before. This means that after an upgrade all the replicas remember the latest confirmed lsn as 0, and any PROMOTE/DEMOTE request from the queue owner is treated as a split brain. Let's fix this and only enable split-brain detection on the replica set once the schema version is updated. Thanks to the synchro queue freeze on restart, this can only happen after a new PROMOTE or DEMOTE entry is written by one of the nodes, and thus the correct confirmed lsn is propagated with this PROMOTE/DEMOTE to all the cluster members. Closes tarantool#8996 NO_DOC=bugfix NO_TEST=hard to test, involves multiple versions
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue on Sep 1, 2023
Our split-brain detection machinery relies among other things on all nodes tracking the synchro queue confirmed lsn. This tracking was only added together with the split-brain detection. Only the synchro queue owner tracked the confirmed lsn before. This means that after an upgrade all the replicas remember the latest confirmed lsn as 0, and any PROMOTE/DEMOTE request from the queue owner is treated as a split brain. Let's fix this and only enable split-brain detection on the replica set once the schema version is updated. Thanks to the synchro queue freeze on restart, this can only happen after a new PROMOTE or DEMOTE entry is written by one of the nodes, and thus the correct confirmed lsn is propagated with this PROMOTE/DEMOTE to all the cluster members. Closes tarantool#8996 NO_DOC=bugfix NO_TEST=hard to test, involves multiple versions
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue on Sep 15, 2023
Our split-brain detection machinery relies among other things on all nodes tracking the synchro queue confirmed lsn. This tracking was only added together with the split-brain detection. Only the synchro queue owner tracked the confirmed lsn before. This means that after an upgrade all the replicas remember the latest confirmed lsn as 0, and any PROMOTE/DEMOTE request from the queue owner is treated as a split brain. Let's fix this and only enable split-brain detection on the replica set once the schema version is updated. Thanks to the synchro queue freeze on restart, this can only happen after a new PROMOTE or DEMOTE entry is written by one of the nodes, and thus the correct confirmed lsn is propagated with this PROMOTE/DEMOTE to all the cluster members. Closes tarantool#8996 NO_DOC=bugfix NO_TEST=hard to test, involves multiple versions
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue on Sep 15, 2023
Our split-brain detection machinery relies among other things on all nodes tracking the synchro queue confirmed lsn. This tracking was only added together with the split-brain detection. Only the synchro queue owner tracked the confirmed lsn before. This means that after an upgrade all the replicas remember the latest confirmed lsn as 0, and any PROMOTE/DEMOTE request from the queue owner is treated as a split brain. Let's fix this and only enable split-brain detection on the replica set once the schema version is updated. Thanks to the synchro queue freeze on restart, this can only happen after a new PROMOTE or DEMOTE entry is written by one of the nodes, and thus the correct confirmed lsn is propagated with this PROMOTE/DEMOTE to all the cluster members. Closes tarantool#8996 NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue on Sep 27, 2023
Our split-brain detection machinery relies among other things on all nodes tracking the synchro queue confirmed lsn. This tracking was only added together with the split-brain detection. Only the synchro queue owner tracked the confirmed lsn before. This means that after an upgrade all the replicas remember the latest confirmed lsn as 0, and any PROMOTE/DEMOTE request from the queue owner is treated as a split brain. Let's fix this and only enable split-brain detection on the replica set once the schema version is updated. Thanks to the synchro queue freeze on restart, this can only happen after a new PROMOTE or DEMOTE entry is written by one of the nodes, and thus the correct confirmed lsn is propagated with this PROMOTE/DEMOTE to all the cluster members. Closes tarantool#8996 NO_DOC=bugfix
sergepetrenko added a commit that referenced this issue on Sep 28, 2023
Our split-brain detection machinery relies among other things on all nodes tracking the synchro queue confirmed lsn. This tracking was only added together with the split-brain detection. Only the synchro queue owner tracked the confirmed lsn before. This means that after an upgrade all the replicas remember the latest confirmed lsn as 0, and any PROMOTE/DEMOTE request from the queue owner is treated as a split brain. Let's fix this and only enable split-brain detection on the replica set once the schema version is updated. Thanks to the synchro queue freeze on restart, this can only happen after a new PROMOTE or DEMOTE entry is written by one of the nodes, and thus the correct confirmed lsn is propagated with this PROMOTE/DEMOTE to all the cluster members. Closes #8996 NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue on Sep 28, 2023
Our split-brain detection machinery relies among other things on all nodes tracking the synchro queue confirmed lsn. This tracking was only added together with the split-brain detection. Only the synchro queue owner tracked the confirmed lsn before. This means that after an upgrade all the replicas remember the latest confirmed lsn as 0, and any PROMOTE/DEMOTE request from the queue owner is treated as a split brain. Let's fix this and only enable split-brain detection on the replica set once the schema version is updated. Thanks to the synchro queue freeze on restart, this can only happen after a new PROMOTE or DEMOTE entry is written by one of the nodes, and thus the correct confirmed lsn is propagated with this PROMOTE/DEMOTE to all the cluster members. Closes tarantool#8996 NO_DOC=bugfix (cherry picked from commit a844bd3)
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue on Sep 28, 2023
Our split-brain detection machinery relies among other things on all nodes tracking the synchro queue confirmed lsn. This tracking was only added together with the split-brain detection. Only the synchro queue owner tracked the confirmed lsn before. This means that after an upgrade all the replicas remember the latest confirmed lsn as 0, and any PROMOTE/DEMOTE request from the queue owner is treated as a split brain. Let's fix this and only enable split-brain detection on the replica set once the schema version is updated. Thanks to the synchro queue freeze on restart, this can only happen after a new PROMOTE or DEMOTE entry is written by one of the nodes, and thus the correct confirmed lsn is propagated with this PROMOTE/DEMOTE to all the cluster members. Closes tarantool#8996 NO_DOC=bugfix NO_TEST=2.10 lacks box.schema.downgrade(), which is used in the test (cherry picked from commit a844bd3)
sergepetrenko added a commit that referenced this issue on Sep 29, 2023
Our split-brain detection machinery relies among other things on all nodes tracking the synchro queue confirmed lsn. This tracking was only added together with the split-brain detection. Only the synchro queue owner tracked the confirmed lsn before. This means that after an upgrade all the replicas remember the latest confirmed lsn as 0, and any PROMOTE/DEMOTE request from the queue owner is treated as a split brain. Let's fix this and only enable split-brain detection on the replica set once the schema version is updated. Thanks to the synchro queue freeze on restart, this can only happen after a new PROMOTE or DEMOTE entry is written by one of the nodes, and thus the correct confirmed lsn is propagated with this PROMOTE/DEMOTE to all the cluster members. Closes #8996 NO_DOC=bugfix (cherry picked from commit a844bd3)
sergepetrenko added a commit that referenced this issue on Sep 29, 2023
Our split-brain detection machinery relies among other things on all nodes tracking the synchro queue confirmed lsn. This tracking was only added together with the split-brain detection. Only the synchro queue owner tracked the confirmed lsn before. This means that after an upgrade all the replicas remember the latest confirmed lsn as 0, and any PROMOTE/DEMOTE request from the queue owner is treated as a split brain. Let's fix this and only enable split-brain detection on the replica set once the schema version is updated. Thanks to the synchro queue freeze on restart, this can only happen after a new PROMOTE or DEMOTE entry is written by one of the nodes, and thus the correct confirmed lsn is propagated with this PROMOTE/DEMOTE to all the cluster members. Closes #8996 NO_DOC=bugfix NO_TEST=2.10 lacks box.schema.downgrade(), which is used in the test (cherry picked from commit a844bd3)
Split-brain detection first appeared in version 2.10. To make it work, some changes were made to the synchro queue: notably, replicas started tracking confirmed_lsn (previously only the synchro queue owner tracked it).
This means that when a cluster is upgraded from, say, 2.8 to 2.10, all the replicas remember the last confirmed_lsn as 0, regardless of the actual value. The first upgraded node to issue a PROMOTE or DEMOTE will trigger an ER_SPLIT_BRAIN error, since it writes the request with the correct confirmed_lsn, not 0.
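To make the failure concrete, here is a toy Python model of it. This is only a sketch and not Tarantool's actual code: the `Replica` class, `SplitBrainError`, and the "incoming confirmed lsn must match the local one" rule are simplifying assumptions made up for illustration, and the lsn value 100 is arbitrary.

```python
from dataclasses import dataclass


class SplitBrainError(Exception):
    """Stand-in for ER_SPLIT_BRAIN in this toy model (hypothetical name)."""


@dataclass
class Replica:
    # After an upgrade from a version that did not track confirmed_lsn
    # on replicas, this value starts at 0 regardless of real history.
    confirmed_lsn: int = 0

    def apply_promote(self, promote_confirmed_lsn: int) -> None:
        # Simplified assumption: a PROMOTE/DEMOTE must agree with the
        # replica's own view of the last confirmed lsn.
        if promote_confirmed_lsn != self.confirmed_lsn:
            raise SplitBrainError(
                f"request has confirmed lsn {promote_confirmed_lsn}, "
                f"local confirmed lsn is {self.confirmed_lsn}"
            )


def promote_is_rejected(replica: Replica, promote_confirmed_lsn: int) -> bool:
    """Return True if the replica treats the request as a split brain."""
    try:
        replica.apply_promote(promote_confirmed_lsn)
    except SplitBrainError:
        return True
    return False


# A freshly upgraded replica believes confirmed_lsn == 0, while the queue
# owner sends a PROMOTE carrying the real value (100 here, arbitrary).
upgraded_replica = Replica()
rejected = promote_is_rejected(upgraded_replica, 100)
print("PROMOTE rejected as split brain:", rejected)
```

In this model a perfectly healthy queue owner is rejected only because the replica's starting value is stale, which is the false positive described above.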
Let's fix this by disabling split-brain checks in the cluster until the schema version is updated. Once it is updated, some node must have become writable by issuing a PROMOTE or DEMOTE that carries the correct confirmed_lsn. The schema version update tuple arrives only after this PROMOTE/DEMOTE, so every node that has the new schema version also has the correct confirmed_lsn.
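The proposed ordering argument can be sketched in the same toy style. Again, this is an illustration under stated assumptions rather than the real patch: `Replica`, `SplitBrainError`, `apply_schema_upgrade`, and the `CHECKS_ENABLED_SINCE` threshold are hypothetical names and values, and the mismatch rule is a simplification of whatever the real check does.

```python
from dataclasses import dataclass

# Hypothetical schema version from which the check is considered safe.
# The real threshold is not stated in this issue.
CHECKS_ENABLED_SINCE = (2, 10, 0)


class SplitBrainError(Exception):
    """Stand-in for ER_SPLIT_BRAIN in this toy model (hypothetical name)."""


@dataclass
class Replica:
    schema_version: tuple = (2, 8, 0)  # data still has the old schema
    confirmed_lsn: int = 0             # stale value after the upgrade

    def checks_enabled(self) -> bool:
        # Split-brain detection stays off until the schema is upgraded.
        return self.schema_version >= CHECKS_ENABLED_SINCE

    def apply_promote(self, promote_confirmed_lsn: int) -> None:
        if self.checks_enabled() and promote_confirmed_lsn != self.confirmed_lsn:
            raise SplitBrainError("confirmed lsn mismatch")
        # With checks off, the request simply propagates the correct value.
        self.confirmed_lsn = promote_confirmed_lsn

    def apply_schema_upgrade(self, version: tuple) -> None:
        # Arrives through replication only after the PROMOTE/DEMOTE that
        # made some node writable.
        self.schema_version = version


replica = Replica()
replica.apply_promote(100)                # accepted: checks are still off
replica.apply_schema_upgrade((2, 10, 0))  # checks turn on from here
replica.apply_promote(100)                # consistent request still passes

try:
    replica.apply_promote(50)             # inconsistent request is caught
    detected = False
except SplitBrainError:
    detected = True

print(replica.confirmed_lsn, replica.checks_enabled(), detected)
```

The point of the sketch is the ordering: by the time `checks_enabled()` becomes true, the replica has already processed a PROMOTE that replaced the stale 0, so the check no longer fires on a legitimate request.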