New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write term and vote atomically during box.ctl.promote()
#8497
Comments
Actually this bug stops any maintenance on cluster. Consider you have RAFT replicaset with some candidates (>1) and some voters. You need to shutdown server with current leader and to minimise downtime you better switch current leader to another available candidate. Execution |
Commit c9155ac ("raft: persist new term and vote separately") made the nodes persist new term and vote separately, using 2 WAL writes. Writing the term first is needed to flush all the ongoing transactions, so that the node's vclock is updated and can be checked against the candidate's vclock. Otherwise it could happen that the node persists a vote for some candidate only to find that it's vclock would actually become incomparable with the candidate's. Actually, this guard is not needed when checking a vote for self, because a node can always vote for self. Let's go back to writing term and vote together for self votes. Prerequisite tarantool#8497 NO_DOC=not user-visible NO_CHANGELOG=not user-visible NO_TEST=covered by the next commit
box.ctl.promote() was impplemented as follows: an instance bumps the term and marks itself a candidate, but doesn't vote for self immediately. Instead it relies on the machinery which makes a candidate vote for self as soon as it persists a new term. This differs from a normal election start due to leader timeout: there term and vote are bumped at once. Closes tarantool#8497 NO_DOC=bugfix
Commit c9155ac ("raft: persist new term and vote separately") made the nodes persist new term and vote separately, using 2 WAL writes. Writing the term first is needed to flush all the ongoing transactions, so that the node's vclock is updated and can be checked against the candidate's vclock. Otherwise it could happen that the node persists a vote for some candidate only to find that it's vclock would actually become incomparable with the candidate's. Actually, this guard is not needed when checking a vote for self, because a node can always vote for self. Let's go back to writing term and vote together for self votes. Prerequisite tarantool#8497 NO_DOC=not user-visible NO_CHANGELOG=not user-visible NO_TEST=covered by the next commit
box.ctl.promote() was impplemented as follows: an instance bumps the term and marks itself a candidate, but doesn't vote for self immediately. Instead it relies on the machinery which makes a candidate vote for self as soon as it persists a new term. This differs from a normal election start due to leader timeout: there term and vote are bumped at once. Closes tarantool#8497 NO_DOC=bugfix
Commit c9155ac ("raft: persist new term and vote separately") made the nodes persist new term and vote separately, using 2 WAL writes. Writing the term first is needed to flush all the ongoing transactions, so that the node's vclock is updated and can be checked against the candidate's vclock. Otherwise it could happen that the node persists a vote for some candidate only to find that it's vclock would actually become incomparable with the candidate's. Actually, this guard is not needed when checking a vote for self, because a node can always vote for self. Let's go back to writing term and vote together for self votes. Prerequisite tarantool#8497 NO_DOC=not user-visible NO_CHANGELOG=not user-visible NO_TEST=covered by the next commit
box.ctl.promote() was impplemented as follows: an instance bumps the term and marks itself a candidate, but doesn't vote for self immediately. Instead it relies on the machinery which makes a candidate vote for self as soon as it persists a new term. This differs from a normal election start due to leader timeout: there term and vote are bumped at once. Closes tarantool#8497 NO_DOC=bugfix
Commit c9155ac ("raft: persist new term and vote separately") made the nodes persist new term and vote separately, using 2 WAL writes. Writing the term first is needed to flush all the ongoing transactions, so that the node's vclock is updated and can be checked against the candidate's vclock. Otherwise it could happen that the node persists a vote for some candidate only to find that it's vclock would actually become incomparable with the candidate's. Actually, this guard is not needed when checking a vote for self, because a node can always vote for self. Let's go back to writing term and vote together for self votes. This change makes raft candidate persist term bump and vote for self in one WAL write instead of two, so all the tests which count WAL writes or expect 2 separate state updates for term and vote are rewritten. Prerequisite tarantool#8497 NO_DOC=not user-visible NO_CHANGELOG=not user-visible
box.ctl.promote() was impplemented as follows: an instance bumps the term and marks itself a candidate, but doesn't vote for self immediately. Instead it relies on the machinery which makes a candidate vote for self as soon as it persists a new term. This differs from a normal election start due to leader timeout: there term and vote are bumped at once. Let's bring promote into line with automatic elections. Closes tarantool#8497 NO_DOC=bugfix
Commit c9155ac ("raft: persist new term and vote separately") made the nodes persist new term and vote separately, using 2 WAL writes. Writing the term first is needed to flush all the ongoing transactions, so that the node's vclock is updated and can be checked against the candidate's vclock. Otherwise it could happen that the node persists a vote for some candidate only to find that it's vclock would actually become incomparable with the candidate's. Actually, this guard is not needed when checking a vote for self, because a node can always vote for self. Besides, splitting term bump and vote can lead to increased probability of split-vote. It may happen that a candidate bumps and broadcasts the new term without a vote, making other nodes vote for self. Let's go back to writing term and vote together for self votes. This change makes raft candidate persist term bump and vote for self in one WAL write instead of two, so all the tests which count WAL writes or expect 2 separate state updates for term and vote are rewritten. Prerequisite tarantool#8497 NO_DOC=not user-visible NO_CHANGELOG=not user-visible
box.ctl.promote() was impplemented as follows: an instance bumps the term and marks itself a candidate, but doesn't vote for self immediately. Instead it relies on the machinery which makes a candidate vote for self as soon as it persists a new term. This differs from a normal election start due to leader timeout: there term and vote are bumped at once. Besides, this increases probability of box.ctl.promote() resulting in other node getting elected: if a node first broadcasts a term without a vote, it is not considered a candidate, so other candidates might start elections and vote for themselves. Let's bring promote into line with automatic elections. Closes tarantool#8497 NO_DOC=bugfix
box.ctl.promote() was implemented as follows: an instance bumps the term and marks itself a candidate, but doesn't vote for self immediately. Instead it relies on the machinery which makes a candidate vote for self as soon as it persists a new term. This differs from a normal election start due to leader timeout: there term and vote are bumped at once. Besides, this increases probability of box.ctl.promote() resulting in other node getting elected: if a node first broadcasts a term without a vote, it is not considered a candidate, so other candidates might start elections and vote for themselves. Let's bring promote into line with automatic elections. Closes tarantool#8497 NO_DOC=bugfix
Co-authored-by: Andrey Aksenov <38073144+andreyaksenov@users.noreply.github.com>
Commit c9155ac ("raft: persist new term and vote separately") made the nodes persist new term and vote separately, using 2 WAL writes. Writing the term first is needed to flush all the ongoing transactions, so that the node's vclock is updated and can be checked against the candidate's vclock. Otherwise it could happen that the node persists a vote for some candidate only to find that it's vclock would actually become incomparable with the candidate's. Actually, this guard is not needed when checking a vote for self, because a node can always vote for self. Besides, splitting term bump and vote can lead to increased probability of split-vote. It may happen that a candidate bumps and broadcasts the new term without a vote, making other nodes vote for self. Let's go back to writing term and vote together for self votes. This change makes raft candidate persist term bump and vote for self in one WAL write instead of two, so all the tests which count WAL writes or expect 2 separate state updates for term and vote are rewritten. Prerequisite tarantool#8497 NO_DOC=not user-visible NO_CHANGELOG=not user-visible
box.ctl.promote() was implemented as follows: an instance bumps the term and marks itself a candidate, but doesn't vote for self immediately. Instead it relies on the machinery which makes a candidate vote for self as soon as it persists a new term. This differs from a normal election start due to leader timeout: there term and vote are bumped at once. Besides, this increases probability of box.ctl.promote() resulting in other node getting elected: if a node first broadcasts a term without a vote, it is not considered a candidate, so other candidates might start elections and vote for themselves. Let's bring promote into line with automatic elections. Closes tarantool#8497 NO_DOC=bugfix
Commit c9155ac ("raft: persist new term and vote separately") made the nodes persist new term and vote separately, using 2 WAL writes. Writing the term first is needed to flush all the ongoing transactions, so that the node's vclock is updated and can be checked against the candidate's vclock. Otherwise it could happen that the node persists a vote for some candidate only to find that it's vclock would actually become incomparable with the candidate's. Actually, this guard is not needed when checking a vote for self, because a node can always vote for self. Besides, splitting term bump and vote can lead to increased probability of split-vote. It may happen that a candidate bumps and broadcasts the new term without a vote, making other nodes vote for self. Let's go back to writing term and vote together for self votes. This change makes raft candidate persist term bump and vote for self in one WAL write instead of two, so all the tests which count WAL writes or expect 2 separate state updates for term and vote are rewritten. Prerequisite tarantool#8497 NO_DOC=not user-visible NO_CHANGELOG=not user-visible
box.ctl.promote() was implemented as follows: an instance bumps the term and marks itself a candidate, but doesn't vote for self immediately. Instead it relies on the machinery which makes a candidate vote for self as soon as it persists a new term. This differs from a normal election start due to leader timeout: there term and vote are bumped at once. Besides, this increases probability of box.ctl.promote() resulting in other node getting elected: if a node first broadcasts a term without a vote, it is not considered a candidate, so other candidates might start elections and vote for themselves. Let's bring promote into line with automatic elections. Closes tarantool#8497 NO_DOC=bugfix
Commit c9155ac ("raft: persist new term and vote separately") made the nodes persist new term and vote separately, using 2 WAL writes. Writing the term first is needed to flush all the ongoing transactions, so that the node's vclock is updated and can be checked against the candidate's vclock. Otherwise it could happen that the node persists a vote for some candidate only to find that it's vclock would actually become incomparable with the candidate's. Actually, this guard is not needed when checking a vote for self, because a node can always vote for self. Besides, splitting term bump and vote can lead to increased probability of split-vote. It may happen that a candidate bumps and broadcasts the new term without a vote, making other nodes vote for self. Let's go back to writing term and vote together for self votes. This change makes raft candidate persist term bump and vote for self in one WAL write instead of two, so all the tests which count WAL writes or expect 2 separate state updates for term and vote are rewritten. Prerequisite tarantool#8497 NO_DOC=not user-visible NO_CHANGELOG=not user-visible
box.ctl.promote() was implemented as follows: an instance bumps the term and marks itself a candidate, but doesn't vote for self immediately. Instead it relies on the machinery which makes a candidate vote for self as soon as it persists a new term. This differs from a normal election start due to leader timeout: there term and vote are bumped at once. Besides, this increases probability of box.ctl.promote() resulting in other node getting elected: if a node first broadcasts a term without a vote, it is not considered a candidate, so other candidates might start elections and vote for themselves. Let's bring promote into line with automatic elections. Closes tarantool#8497 NO_DOC=bugfix
Commit c9155ac ("raft: persist new term and vote separately") made the nodes persist new term and vote separately, using 2 WAL writes. Writing the term first is needed to flush all the ongoing transactions, so that the node's vclock is updated and can be checked against the candidate's vclock. Otherwise it could happen that the node persists a vote for some candidate only to find that it's vclock would actually become incomparable with the candidate's. Actually, this guard is not needed when checking a vote for self, because a node can always vote for self. Besides, splitting term bump and vote can lead to increased probability of split-vote. It may happen that a candidate bumps and broadcasts the new term without a vote, making other nodes vote for self. Let's go back to writing term and vote together for self votes. This change makes raft candidate persist term bump and vote for self in one WAL write instead of two, so all the tests which count WAL writes or expect 2 separate state updates for term and vote are rewritten. Prerequisite #8497 NO_DOC=not user-visible NO_CHANGELOG=not user-visible
box.ctl.promote() was implemented as follows: an instance bumps the term and marks itself a candidate, but doesn't vote for self immediately. Instead it relies on the machinery which makes a candidate vote for self as soon as it persists a new term. This differs from a normal election start due to leader timeout: there term and vote are bumped at once. Besides, this increases probability of box.ctl.promote() resulting in other node getting elected: if a node first broadcasts a term without a vote, it is not considered a candidate, so other candidates might start elections and vote for themselves. Let's bring promote into line with automatic elections. Closes #8497 NO_DOC=bugfix
Commit c9155ac ("raft: persist new term and vote separately") made the nodes persist new term and vote separately, using 2 WAL writes. Writing the term first is needed to flush all the ongoing transactions, so that the node's vclock is updated and can be checked against the candidate's vclock. Otherwise it could happen that the node persists a vote for some candidate only to find that it's vclock would actually become incomparable with the candidate's. Actually, this guard is not needed when checking a vote for self, because a node can always vote for self. Besides, splitting term bump and vote can lead to increased probability of split-vote. It may happen that a candidate bumps and broadcasts the new term without a vote, making other nodes vote for self. Let's go back to writing term and vote together for self votes. This change makes raft candidate persist term bump and vote for self in one WAL write instead of two, so all the tests which count WAL writes or expect 2 separate state updates for term and vote are rewritten. Prerequisite #8497 NO_DOC=not user-visible NO_CHANGELOG=not user-visible (cherry picked from commit 8a124e5)
box.ctl.promote() was implemented as follows: an instance bumps the term and marks itself a candidate, but doesn't vote for self immediately. Instead it relies on the machinery which makes a candidate vote for self as soon as it persists a new term. This differs from a normal election start due to leader timeout: there term and vote are bumped at once. Besides, this increases probability of box.ctl.promote() resulting in other node getting elected: if a node first broadcasts a term without a vote, it is not considered a candidate, so other candidates might start elections and vote for themselves. Let's bring promote into line with automatic elections. Closes #8497 NO_DOC=bugfix (cherry picked from commit 1737121)
Commit c9155ac ("raft: persist new term and vote separately") made the nodes persist new term and vote separately, using 2 WAL writes. Writing the term first is needed to flush all the ongoing transactions, so that the node's vclock is updated and can be checked against the candidate's vclock. Otherwise it could happen that the node persists a vote for some candidate only to find that it's vclock would actually become incomparable with the candidate's. Actually, this guard is not needed when checking a vote for self, because a node can always vote for self. Besides, splitting term bump and vote can lead to increased probability of split-vote. It may happen that a candidate bumps and broadcasts the new term without a vote, making other nodes vote for self. Let's go back to writing term and vote together for self votes. This change makes raft candidate persist term bump and vote for self in one WAL write instead of two, so all the tests which count WAL writes or expect 2 separate state updates for term and vote are rewritten. Prerequisite #8497 NO_DOC=not user-visible NO_CHANGELOG=not user-visible (cherry picked from commit 8a124e5)
box.ctl.promote() was implemented as follows: an instance bumps the term and marks itself a candidate, but doesn't vote for self immediately. Instead it relies on the machinery which makes a candidate vote for self as soon as it persists a new term. This differs from a normal election start due to leader timeout: there term and vote are bumped at once. Besides, this increases probability of box.ctl.promote() resulting in other node getting elected: if a node first broadcasts a term without a vote, it is not considered a candidate, so other candidates might start elections and vote for themselves. Let's bring promote into line with automatic elections. Closes #8497 NO_DOC=bugfix (cherry picked from commit 1737121)
Bug description
Reproduced on Tarantool 2.11.0-rc2, master (3.0.0-entrypoint-99-ge987427f6)
With
box.ctl.promote()
an instance votes for self only after writing the new term and broadcasting it to everyone.If other nodes in cluster are configured to be candidates, they might have time to bump the term accordingly and, seeing that noone is candidate (because original candidate's vote for self wasn't delivered yet), start their own elections. This may lead to split-votes and box.ctl.promote() failing to promote the desired node.
Steps to reproduce
Start 2 tarantool instances (built with debug) in interactive mode:
After the fourth step the leader might be either instance 1 or instance 2, but instance 2's log will definitely contain
2023-03-24 18:02:42.069 [25157] main/128/lua raft.c:494 E> ER_OLD_TERM: The term is outdated: old - 3, new - 4
and the term after these manipulations will be >= 4 (while the expected behaviour is to see that instance 2 is the leader in term 3).How the reproducer works: at step 4 we set an injection which will block the second wal write until
ERRINJ_WAL_DELAY
is set to false. After that we issuebox.ctl.promote()
, which will write the term successfully, but will block on writing a vote. After that we wait a bit and unblock the write. This way by the time the write's unblocked, the other node already votes for self, so the elections span one more round.Node 1 log at that moment:
The text was updated successfully, but these errors were encountered: