New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve election stability and reproducibility #8699
Improve election stability and reproducibility #8699
Conversation
e62a64b
to
7d85b33
Compare
7d90de1
to
bef4007
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for you patch. LGTM
box_cfg = cg.box_cfg, | ||
} | ||
cg.box_cfg.election_mode = 'voter' | ||
cg.server2 =cg.replica_set:build_and_add_server{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: space after '='
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! Fixed.
@@ -1211,6 +1211,7 @@ raft_promote(struct raft *raft) | |||
return; | |||
raft_sm_schedule_new_term(raft, raft->volatile_term + 1); | |||
raft_start_candidate(raft); | |||
raft_sm_schedule_new_vote(raft, raft->self, raft->vclock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't we use raft_sm_schedule_new_election
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really. We have to do raft_sm_schedule_new_term
before raft_start_candidate
for the sake of 'manual' nodes. Otherwise raft_sm_schedule_new_term
would restore a manual node to non-candidate again.
At the same time we have to do raft_sm_schedule_new_vote
after raft_start_candidate
(again, for the manual nodes).
On the other hand I can do this, because raft_start_candidate
is a no-op for a real candidate. I'm not sure it's worth it though. What do you think? cc: @Gerold103
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index f31cee5b1..e0b073b5c 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -959,8 +959,7 @@ raft_sm_schedule_new_election(struct raft *raft)
say_info("RAFT: begin new election round");
assert(raft->is_cfg_candidate);
/* Everyone is a follower until its vote for self is persisted. */
- raft_sm_schedule_new_term(raft, raft->volatile_term + 1);
- raft_sm_schedule_new_vote(raft, raft->self, raft->vclock);
+ raft_promote(raft);
}
static void
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for explanation, I see now, that raft_start_candidate
has to be exactly between term and vote when manual promote is called. And I would rather not apply change you proposed, as indeed raft_start_candidate
won't have any effect in automatic elections, and it would make the code even harder to understand, I'm on the side of explicit call of new_term
and new_vote
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I don't like this hunk either. Let's revert it then:
diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index e0b073b5c..f31cee5b1 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -959,7 +959,8 @@ raft_sm_schedule_new_election(struct raft *raft)
say_info("RAFT: begin new election round");
assert(raft->is_cfg_candidate);
/* Everyone is a follower until its vote for self is persisted. */
- raft_promote(raft);
+ raft_sm_schedule_new_term(raft, raft->volatile_term + 1);
+ raft_sm_schedule_new_vote(raft, raft->self, raft->vclock);
}
static void
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the patch! Just one comment.
@@ -1,4 +1,4 @@ | |||
## bugfix/replication | |||
|
|||
* Fixed possible failure to promote the desired node by `box.ctl.promote()` on | |||
* Fixed a possible failure to promote the desired node by `box.ctl.promote()` on |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change better be in the previous commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for noticing! Fixed.
bef4007
to
9e3521c
Compare
Commit c9155ac ("raft: persist new term and vote separately") made the nodes persist new term and vote separately, using 2 WAL writes. Writing the term first is needed to flush all the ongoing transactions, so that the node's vclock is updated and can be checked against the candidate's vclock. Otherwise it could happen that the node persists a vote for some candidate only to find that it's vclock would actually become incomparable with the candidate's. Actually, this guard is not needed when checking a vote for self, because a node can always vote for self. Besides, splitting term bump and vote can lead to increased probability of split-vote. It may happen that a candidate bumps and broadcasts the new term without a vote, making other nodes vote for self. Let's go back to writing term and vote together for self votes. This change makes raft candidate persist term bump and vote for self in one WAL write instead of two, so all the tests which count WAL writes or expect 2 separate state updates for term and vote are rewritten. Prerequisite tarantool#8497 NO_DOC=not user-visible NO_CHANGELOG=not user-visible
box.ctl.promote() was implemented as follows: an instance bumps the term and marks itself a candidate, but doesn't vote for self immediately. Instead it relies on the machinery which makes a candidate vote for self as soon as it persists a new term. This differs from a normal election start due to leader timeout: there term and vote are bumped at once. Besides, this increases probability of box.ctl.promote() resulting in other node getting elected: if a node first broadcasts a term without a vote, it is not considered a candidate, so other candidates might start elections and vote for themselves. Let's bring promote into line with automatic elections. Closes tarantool#8497 NO_DOC=bugfix
Due to a typo raft candidate counted a vote for another node as a vote for self in its split-vote detector. This could lead to spurious split-vote detection in cases when another node wins elections with a bare minimum of votes for it (exactly a quorum of votes). Closes tarantool#8698 NO_DOC=bugfix
9e3521c
to
1ec1f55
Compare
Fix a couple of bugs regarding election stability:
box.ctl.promote()
Closes #8497
Closes #8698