Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve election stability and reproducibility #8699

Merged
merged 3 commits into from
May 29, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
4 changes: 4 additions & 0 deletions changelogs/unreleased/gh-8497-promote-atomic.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
## bugfix/replication

* Fixed a possible failure to promote the desired node by `box.ctl.promote()` on
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change better be in the previous commit.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for noticing! Fixed.

a cluster with nodes configured with `election_mode = "candidate"` (gh-8497).
5 changes: 5 additions & 0 deletions changelogs/unreleased/gh-8698-spurious-split-vote.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,5 @@
## bugfix/replication

* Fixed nodes configured with `election_mode = 'candidate'` spuriously detecting
a split-vote when another candidate should win with exactly a quorum of votes
for it (gh-8698).
21 changes: 11 additions & 10 deletions src/lib/raft/raft.c
Original file line number Diff line number Diff line change
Expand Up @@ -740,6 +740,15 @@ raft_worker_handle_io(struct raft *raft)
assert(raft->volatile_term >= raft->term);
if (raft->volatile_vote == 0)
goto do_dump;
/*
* Skip self. When vote is issued, own vclock can be smaller,
* but that doesn't matter. Can always vote for self. Not having
* this special case still works if the node is configured as a
* candidate, but the node might log that it canceled a vote for
* self, which is confusing.
*/
if (raft->volatile_vote == raft->self)
goto do_dump_with_vote;
/*
* Vote and term bumps are persisted separately. This serves as
* a flush of all transactions going to WAL right now so as the
Expand All @@ -750,15 +759,6 @@ raft_worker_handle_io(struct raft *raft)
*/
if (raft->volatile_term > raft->term)
goto do_dump;
/*
* Skip self. When vote was issued, own vclock could be smaller,
* but that doesn't matter. Can always vote for self. Not having
* this special case still works if the node is configured as a
* candidate, but the node might log that it canceled a vote for
* self, which is confusing.
*/
if (raft->volatile_vote == raft->self)
goto do_dump_with_vote;
if (!raft_can_vote_for(raft, &raft->candidate_vclock)) {
say_info("RAFT: vote request for %u is canceled - the "
"vclock is not acceptable anymore",
Expand Down Expand Up @@ -935,7 +935,7 @@ raft_sm_schedule_new_vote(struct raft *raft, uint32_t candidate_id,
assert(!raft->votes[raft->self].did_vote);
raft->volatile_vote = candidate_id;
vclock_copy(&raft->candidate_vclock, candidate_vclock);
raft_add_vote(raft, raft->self, raft->self);
raft_add_vote(raft, raft->self, candidate_id);
raft_sm_pause_and_dump(raft);
/* Nothing visible is changed - no broadcast. */
}
Expand Down Expand Up @@ -1211,6 +1211,7 @@ raft_promote(struct raft *raft)
return;
raft_sm_schedule_new_term(raft, raft->volatile_term + 1);
raft_start_candidate(raft);
raft_sm_schedule_new_vote(raft, raft->self, raft->vclock);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we use raft_sm_schedule_new_election here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really. We have to do raft_sm_schedule_new_term before raft_start_candidate for the sake of 'manual' nodes. Otherwise raft_sm_schedule_new_term would restore a manual node to non-candidate again.
At the same time we have to do raft_sm_schedule_new_vote after raft_start_candidate (again, for the manual nodes).

On the other hand I can do this, because raft_start_candidate is a no-op for a real candidate. I'm not sure it's worth it though. What do you think? cc: @Gerold103

diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index f31cee5b1..e0b073b5c 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -959,8 +959,7 @@ raft_sm_schedule_new_election(struct raft *raft)
        say_info("RAFT: begin new election round");
        assert(raft->is_cfg_candidate);
        /* Everyone is a follower until its vote for self is persisted. */
-       raft_sm_schedule_new_term(raft, raft->volatile_term + 1);
-       raft_sm_schedule_new_vote(raft, raft->self, raft->vclock);
+       raft_promote(raft);
 }
 
 static void

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for explanation, I see now, that raft_start_candidate has to be exactly between term and vote when manual promote is called. And I would rather not apply change you proposed, as indeed raft_start_candidate won't have any effect in automatic elections, and it would make the code even harder to understand, I'm on the side of explicit call of new_term and new_vote here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I don't like this hunk either. Let's revert it then:

diff --git a/src/lib/raft/raft.c b/src/lib/raft/raft.c
index e0b073b5c..f31cee5b1 100644
--- a/src/lib/raft/raft.c
+++ b/src/lib/raft/raft.c
@@ -959,7 +959,8 @@ raft_sm_schedule_new_election(struct raft *raft)
        say_info("RAFT: begin new election round");
        assert(raft->is_cfg_candidate);
        /* Everyone is a follower until its vote for self is persisted. */
-       raft_promote(raft);
+       raft_sm_schedule_new_term(raft, raft->volatile_term + 1);
+       raft_sm_schedule_new_vote(raft, raft->self, raft->vclock);
 }
 
 static void

}

void
Expand Down
2 changes: 1 addition & 1 deletion test/replication-luatest/gh_6036_qsync_order_test.lua
Original file line number Diff line number Diff line change
Expand Up @@ -187,7 +187,7 @@ g.test_promote_order = function(cg)
cg.r1:exec(function() box.cfg{replication=""} end)
cg.r2:exec(function()
box.cfg{replication = {}}
box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 2)
box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1)
require('fiber').create(function() box.ctl.promote() end)
end)
t.helpers.retrying({}, function()
Expand Down
56 changes: 56 additions & 0 deletions test/replication-luatest/gh_8497_atomic_promote_test.lua
Original file line number Diff line number Diff line change
@@ -0,0 +1,56 @@
local t = require('luatest')
local server = require('luatest.server')
local replica_set = require('luatest.replica_set')

local g = t.group('gh-8497-atomic-promote')

g.before_each(function(cg)
t.tarantool.skip_if_not_debug()
cg.replica_set = replica_set:new({})
cg.box_cfg = {
replication_timeout = 0.1,
replication = {
server.build_listen_uri('server1', cg.replica_set.id),
server.build_listen_uri('server2', cg.replica_set.id),
},
}
cg.box_cfg.election_mode = 'candidate'
cg.server1 = cg.replica_set:build_and_add_server{
alias = 'server1',
box_cfg = cg.box_cfg,
}
cg.box_cfg.election_mode = 'voter'
cg.server2 = cg.replica_set:build_and_add_server{
alias = 'server2',
box_cfg = cg.box_cfg,
}
cg.replica_set:start()
cg.replica_set:wait_for_fullmesh()
cg.server1:wait_for_election_leader()
end)

g.after_each(function(cg)
cg.replica_set:drop()
end)

g.test_election_promote_finishes_in_one_term = function(cg)
cg.server2:update_box_cfg{election_mode = 'candidate'}
local term = cg.server1:get_election_term()
t.assert_equals(term, cg.server2:get_election_term(),
'The cluster is stable')
local ok, err = cg.server2:exec(function()
local fiber = require('fiber')
box.error.injection.set('ERRINJ_WAL_DELAY_COUNTDOWN', 1)
local fib = fiber.new(box.ctl.promote)
fib:set_joinable(true)
fiber.sleep(2 * box.cfg.replication_timeout)
box.error.injection.set('ERRINJ_WAL_DELAY', false)
return fib:join()
end)
t.assert_equals({ok, err}, {true, nil}, 'No error in promote')
cg.server2:wait_for_election_leader()
t.assert_equals(term + 1, cg.server1:get_election_term(),
'The term is bumped once')
t.assert_equals(term + 1, cg.server2:get_election_term(),
'The term is bumped once')
end
42 changes: 14 additions & 28 deletions test/replication/election_basic.result
Original file line number Diff line number Diff line change
Expand Up @@ -338,68 +338,63 @@ box.cfg{election_mode='candidate'}
| ---
| ...

test_run:wait_cond(function() return #election_tbl == 4 end)
test_run:wait_cond(function() return #election_tbl == 3 end)
| ---
| - true
| ...
assert(election_tbl[1].state == 'follower')
| ---
| - true
| ...
assert(election_tbl[2].state == 'follower')
| ---
| - true
| ...
assert(election_tbl[2].term > election_tbl[1].term)
| ---
| - true
| ...
-- Vote is visible here already, but it is volatile.
assert(election_tbl[2].vote == 1)
| ---
| - true
| ...
assert(election_tbl[3].state == 'candidate')
assert(election_tbl[2].state == 'candidate')
| ---
| - true
| ...
assert(election_tbl[3].vote == 1)
assert(election_tbl[2].vote == 1)
| ---
| - true
| ...
assert(election_tbl[4].state == 'leader')
assert(election_tbl[3].state == 'leader')
| ---
| - true
| ...

box.cfg{election_mode='voter'}
| ---
| ...
test_run:wait_cond(function() return #election_tbl == 5 end)
test_run:wait_cond(function() return #election_tbl == 4 end)
| ---
| - true
| ...
assert(election_tbl[5].state == 'follower')
assert(election_tbl[4].state == 'follower')
| ---
| - true
| ...

box.cfg{election_mode='off'}
| ---
| ...
test_run:wait_cond(function() return #election_tbl == 6 end)
test_run:wait_cond(function() return #election_tbl == 5 end)
| ---
| - true
| ...

box.cfg{election_mode='manual'}
| ---
| ...
test_run:wait_cond(function() return #election_tbl == 7 end)
test_run:wait_cond(function() return #election_tbl == 6 end)
| ---
| - true
| ...
assert(election_tbl[7].state == 'follower')
assert(election_tbl[6].state == 'follower')
| ---
| - true
| ...
Expand All @@ -408,32 +403,23 @@ box.ctl.promote()
| ---
| ...

test_run:wait_cond(function() return #election_tbl == 10 end)
| ---
| - true
| ...
assert(election_tbl[8].state == 'follower')
| ---
| - true
| ...
assert(election_tbl[8].term == election_tbl[7].term + 1)
test_run:wait_cond(function() return #election_tbl == 8 end)
| ---
| - true
| ...
-- Vote is visible here already, but it is volatile.
assert(election_tbl[8].vote == 1)
assert(election_tbl[7].state == 'candidate')
| ---
| - true
| ...
assert(election_tbl[9].state == 'candidate')
assert(election_tbl[7].term == election_tbl[6].term + 1)
| ---
| - true
| ...
assert(election_tbl[9].vote == 1)
assert(election_tbl[7].vote == 1)
| ---
| - true
| ...
assert(election_tbl[10].state == 'leader')
assert(election_tbl[8].state == 'leader')
| ---
| - true
| ...
Expand Down
33 changes: 14 additions & 19 deletions test/replication/election_basic.test.lua
Original file line number Diff line number Diff line change
Expand Up @@ -144,37 +144,32 @@ _ = box.ctl.on_election(trig)
box.cfg{replication_synchro_quorum=2}
box.cfg{election_mode='candidate'}

test_run:wait_cond(function() return #election_tbl == 4 end)
test_run:wait_cond(function() return #election_tbl == 3 end)
assert(election_tbl[1].state == 'follower')
assert(election_tbl[2].state == 'follower')
assert(election_tbl[2].term > election_tbl[1].term)
-- Vote is visible here already, but it is volatile.
assert(election_tbl[2].vote == 1)
assert(election_tbl[3].state == 'candidate')
assert(election_tbl[3].vote == 1)
assert(election_tbl[4].state == 'leader')
assert(election_tbl[2].state == 'candidate')
assert(election_tbl[2].vote == 1)
assert(election_tbl[3].state == 'leader')

box.cfg{election_mode='voter'}
test_run:wait_cond(function() return #election_tbl == 5 end)
assert(election_tbl[5].state == 'follower')
test_run:wait_cond(function() return #election_tbl == 4 end)
assert(election_tbl[4].state == 'follower')

box.cfg{election_mode='off'}
test_run:wait_cond(function() return #election_tbl == 6 end)
test_run:wait_cond(function() return #election_tbl == 5 end)

box.cfg{election_mode='manual'}
test_run:wait_cond(function() return #election_tbl == 7 end)
assert(election_tbl[7].state == 'follower')
test_run:wait_cond(function() return #election_tbl == 6 end)
assert(election_tbl[6].state == 'follower')

box.ctl.promote()

test_run:wait_cond(function() return #election_tbl == 10 end)
assert(election_tbl[8].state == 'follower')
assert(election_tbl[8].term == election_tbl[7].term + 1)
-- Vote is visible here already, but it is volatile.
assert(election_tbl[8].vote == 1)
assert(election_tbl[9].state == 'candidate')
assert(election_tbl[9].vote == 1)
assert(election_tbl[10].state == 'leader')
test_run:wait_cond(function() return #election_tbl == 8 end)
assert(election_tbl[7].state == 'candidate')
assert(election_tbl[7].term == election_tbl[6].term + 1)
assert(election_tbl[7].vote == 1)
assert(election_tbl[8].state == 'leader')

test_run:cmd('stop server replica')
test_run:cmd('delete server replica')
Expand Down