Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write term and vote atomically during box.ctl.promote() #8497

Closed
sergepetrenko opened this issue Mar 24, 2023 · 1 comment · Fixed by #8699
Closed

Write term and vote atomically during box.ctl.promote() #8497

sergepetrenko opened this issue Mar 24, 2023 · 1 comment · Fixed by #8699
Assignees
Labels
2.10 Target is 2.10 and all newer release/master branches bug Something isn't working raft RAFT protocol

Comments

@sergepetrenko
Copy link
Collaborator

sergepetrenko commented Mar 24, 2023

Bug description
Reproduced on Tarantool 2.11.0-rc2, master (3.0.0-entrypoint-99-ge987427f6)

With box.ctl.promote() an instance votes for self only after writing the new term and broadcasting it to everyone.
If other nodes in cluster are configured to be candidates, they might have time to bump the term accordingly and, seeing that noone is candidate (because original candidate's vote for self wasn't delivered yet), start their own elections. This may lead to split-votes and box.ctl.promote() failing to promote the desired node.

Steps to reproduce
Start 2 tarantool instances (built with debug) in interactive mode:

-- Instance 1
-- Step 1
box.cfg{listen=3301, replication={3301,3302}, election_mode='candidate'}

-- Step 3 (at this point instance 1 must be the leader in term 2)
box.schema.user.grant('guest', 'replication')

-- Instance 2
-- Step 2
box.cfg{listen=3302, replication={3301,3302}, election_mode='candidate', read_only=true}

-- Step 4
fiber = require('fiber')
box.error.injection.set("ERRINJ_WAL_DELAY_COUNTDOWN", 1)
fiber.new(box.ctl.promote)
fiber.sleep(0.3)
box.error.injection.set("ERRINJ_WAL_DELAY", false)

After the fourth step the leader might be either instance 1 or instance 2, but instance 2's log will definitely contain 2023-03-24 18:02:42.069 [25157] main/128/lua raft.c:494 E> ER_OLD_TERM: The term is outdated: old - 3, new - 4 and the term after these manipulations will be >= 4 (while the expected behaviour is to see that instance 2 is the leader in term 3).

How the reproducer works: at step 4 we set an injection which will block the second wal write until ERRINJ_WAL_DELAY is set to false. After that we issue box.ctl.promote(), which will write the term successfully, but will block on writing a vote. After that we wait a bit and unblock the write. This way by the time the write's unblocked, the other node already votes for self, so the elections span one more round.
Node 1 log at that moment:

2023-03-24 18:02:41.576 [25155] main/111/applier/3302 I> RAFT: message {term: 3, state: follower} from 2
2023-03-24 18:02:41.576 [25155] main/111/applier/3302 I> RAFT: received a newer term from 2
2023-03-24 18:02:41.576 [25155] main/111/applier/3302 I> RAFT: bump term to 3, follow
2023-03-24 18:02:41.577 [25155] main/116/raft_worker I> RAFT: persisted state {term: 3}
2023-03-24 18:02:41.577 [25155] main/116/raft_worker I> RAFT: vote for 1, follow
<here instance 1 finishes term bump before instance 2 sends a vote request, so instance 1 nominates self>
2023-03-24 18:02:41.577 [25155] main/116/raft_worker I> RAFT: persisted state {term: 3, vote: 1}
2023-03-24 18:02:41.577 [25155] main/116/raft_worker I> RAFT: enter candidate state with 1 self vote
<only then instance 2 manages to send a vote request for self>
2023-03-24 18:02:41.876 [25155] main/111/applier/3302 I> RAFT: message {term: 3, vote: 2, state: candidate, vclock: {1: 3}} from 2
2023-03-24 18:02:41.876 [25155] main/111/applier/3302 I> RAFT: split vote is discovered - {1: 1, 2: 1}, new term in 0.190000 sec
2023-03-24 18:02:41.876 [25155] main/111/applier/3302 I> RAFT: vote request is skipped - competing candidate
2023-03-24 18:02:42.067 [25155] main I> RAFT: begin new election round
2023-03-24 18:02:42.067 [25155] main I> RAFT: bump term to 4, follow
2023-03-24 18:02:42.067 [25155] main I> RAFT: vote for 1, follow
@sergepetrenko sergepetrenko added bug Something isn't working raft RAFT protocol labels Mar 24, 2023
@sergepetrenko sergepetrenko self-assigned this Mar 24, 2023
@ochaton
Copy link
Member

ochaton commented Apr 17, 2023

Actually this bug stops any maintenance on cluster. Consider you have RAFT replicaset with some candidates (>1) and some voters. You need to shutdown server with current leader and to minimise downtime you better switch current leader to another available candidate.

Execution box.ctl.promote() on candidate can elect any available candidate including current leader (with term increasing of course). It seems to be correct behaviour, but can't be used for daily maintenance operations :(

@sergos sergos added 2.11 Target is 2.11 and all newer release/master branches 2.10 Target is 2.10 and all newer release/master branches and removed 2.11 Target is 2.11 and all newer release/master branches labels May 17, 2023
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 19, 2023
Commit c9155ac ("raft: persist new term and vote separately") made
the nodes persist new term and vote separately, using 2 WAL writes.
Writing the term first is needed to flush all the ongoing transactions,
so that the node's vclock is updated and can be checked against the
candidate's vclock. Otherwise it could happen that the node persists a
vote for some candidate only to find that it's vclock would actually
become incomparable with the candidate's.

Actually, this guard is not needed when checking a vote for self,
because a node can always vote for self. Let's go back to writing term
and vote together for self votes.

Prerequisite tarantool#8497

NO_DOC=not user-visible
NO_CHANGELOG=not user-visible
NO_TEST=covered by the next commit
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 19, 2023
box.ctl.promote() was impplemented as follows: an instance bumps the
term and marks itself a candidate, but doesn't vote for self
immediately. Instead it relies on the machinery which makes a candidate
vote for self as soon as it persists a new term.

This differs from a normal election start due to leader timeout: there
term and vote are bumped at once.

Closes tarantool#8497

NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 19, 2023
Commit c9155ac ("raft: persist new term and vote separately") made
the nodes persist new term and vote separately, using 2 WAL writes.
Writing the term first is needed to flush all the ongoing transactions,
so that the node's vclock is updated and can be checked against the
candidate's vclock. Otherwise it could happen that the node persists a
vote for some candidate only to find that it's vclock would actually
become incomparable with the candidate's.

Actually, this guard is not needed when checking a vote for self,
because a node can always vote for self. Let's go back to writing term
and vote together for self votes.

Prerequisite tarantool#8497

NO_DOC=not user-visible
NO_CHANGELOG=not user-visible
NO_TEST=covered by the next commit
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 19, 2023
box.ctl.promote() was impplemented as follows: an instance bumps the
term and marks itself a candidate, but doesn't vote for self
immediately. Instead it relies on the machinery which makes a candidate
vote for self as soon as it persists a new term.

This differs from a normal election start due to leader timeout: there
term and vote are bumped at once.

Closes tarantool#8497

NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 24, 2023
Commit c9155ac ("raft: persist new term and vote separately") made
the nodes persist new term and vote separately, using 2 WAL writes.
Writing the term first is needed to flush all the ongoing transactions,
so that the node's vclock is updated and can be checked against the
candidate's vclock. Otherwise it could happen that the node persists a
vote for some candidate only to find that it's vclock would actually
become incomparable with the candidate's.

Actually, this guard is not needed when checking a vote for self,
because a node can always vote for self. Let's go back to writing term
and vote together for self votes.

Prerequisite tarantool#8497

NO_DOC=not user-visible
NO_CHANGELOG=not user-visible
NO_TEST=covered by the next commit
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 24, 2023
box.ctl.promote() was impplemented as follows: an instance bumps the
term and marks itself a candidate, but doesn't vote for self
immediately. Instead it relies on the machinery which makes a candidate
vote for self as soon as it persists a new term.

This differs from a normal election start due to leader timeout: there
term and vote are bumped at once.

Closes tarantool#8497

NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 24, 2023
Commit c9155ac ("raft: persist new term and vote separately") made
the nodes persist new term and vote separately, using 2 WAL writes.
Writing the term first is needed to flush all the ongoing transactions,
so that the node's vclock is updated and can be checked against the
candidate's vclock. Otherwise it could happen that the node persists a
vote for some candidate only to find that it's vclock would actually
become incomparable with the candidate's.

Actually, this guard is not needed when checking a vote for self,
because a node can always vote for self. Let's go back to writing term
and vote together for self votes.

This change makes raft candidate persist term bump and vote for self in
one WAL write instead of two, so all the tests which count WAL writes or
expect 2 separate state updates for term and vote are rewritten.

Prerequisite tarantool#8497

NO_DOC=not user-visible
NO_CHANGELOG=not user-visible
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 24, 2023
box.ctl.promote() was impplemented as follows: an instance bumps the
term and marks itself a candidate, but doesn't vote for self
immediately. Instead it relies on the machinery which makes a candidate
vote for self as soon as it persists a new term.

This differs from a normal election start due to leader timeout: there
term and vote are bumped at once.

Let's bring promote into line with automatic elections.

Closes tarantool#8497

NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 24, 2023
Commit c9155ac ("raft: persist new term and vote separately") made
the nodes persist new term and vote separately, using 2 WAL writes.
Writing the term first is needed to flush all the ongoing transactions,
so that the node's vclock is updated and can be checked against the
candidate's vclock. Otherwise it could happen that the node persists a
vote for some candidate only to find that it's vclock would actually
become incomparable with the candidate's.

Actually, this guard is not needed when checking a vote for self,
because a node can always vote for self. Besides, splitting term bump
and vote can lead to increased probability of split-vote. It may happen
that a candidate bumps and broadcasts the new term without a vote,
making other nodes vote for self. Let's go back to writing term and vote
together for self votes.

This change makes raft candidate persist term bump and vote for self in
one WAL write instead of two, so all the tests which count WAL writes or
expect 2 separate state updates for term and vote are rewritten.

Prerequisite tarantool#8497

NO_DOC=not user-visible
NO_CHANGELOG=not user-visible
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 24, 2023
box.ctl.promote() was impplemented as follows: an instance bumps the
term and marks itself a candidate, but doesn't vote for self
immediately. Instead it relies on the machinery which makes a candidate
vote for self as soon as it persists a new term.

This differs from a normal election start due to leader timeout: there
term and vote are bumped at once.

Besides, this increases probability of box.ctl.promote() resulting in
other node getting elected: if a node first broadcasts a term without a
vote, it is not considered a candidate, so other candidates might start
elections and vote for themselves.

Let's bring promote into line with automatic elections.

Closes tarantool#8497

NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 24, 2023
box.ctl.promote() was implemented as follows: an instance bumps the
term and marks itself a candidate, but doesn't vote for self
immediately. Instead it relies on the machinery which makes a candidate
vote for self as soon as it persists a new term.

This differs from a normal election start due to leader timeout: there
term and vote are bumped at once.

Besides, this increases probability of box.ctl.promote() resulting in
other node getting elected: if a node first broadcasts a term without a
vote, it is not considered a candidate, so other candidates might start
elections and vote for themselves.

Let's bring promote into line with automatic elections.

Closes tarantool#8497

NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 25, 2023
Co-authored-by: Andrey Aksenov <38073144+andreyaksenov@users.noreply.github.com>
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 25, 2023
Commit c9155ac ("raft: persist new term and vote separately") made
the nodes persist new term and vote separately, using 2 WAL writes.
Writing the term first is needed to flush all the ongoing transactions,
so that the node's vclock is updated and can be checked against the
candidate's vclock. Otherwise it could happen that the node persists a
vote for some candidate only to find that it's vclock would actually
become incomparable with the candidate's.

Actually, this guard is not needed when checking a vote for self,
because a node can always vote for self. Besides, splitting term bump
and vote can lead to increased probability of split-vote. It may happen
that a candidate bumps and broadcasts the new term without a vote,
making other nodes vote for self. Let's go back to writing term and vote
together for self votes.

This change makes raft candidate persist term bump and vote for self in
one WAL write instead of two, so all the tests which count WAL writes or
expect 2 separate state updates for term and vote are rewritten.

Prerequisite tarantool#8497

NO_DOC=not user-visible
NO_CHANGELOG=not user-visible
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 25, 2023
box.ctl.promote() was implemented as follows: an instance bumps the
term and marks itself a candidate, but doesn't vote for self
immediately. Instead it relies on the machinery which makes a candidate
vote for self as soon as it persists a new term.

This differs from a normal election start due to leader timeout: there
term and vote are bumped at once.

Besides, this increases probability of box.ctl.promote() resulting in
other node getting elected: if a node first broadcasts a term without a
vote, it is not considered a candidate, so other candidates might start
elections and vote for themselves.

Let's bring promote into line with automatic elections.

Closes tarantool#8497

NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 26, 2023
Commit c9155ac ("raft: persist new term and vote separately") made
the nodes persist new term and vote separately, using 2 WAL writes.
Writing the term first is needed to flush all the ongoing transactions,
so that the node's vclock is updated and can be checked against the
candidate's vclock. Otherwise it could happen that the node persists a
vote for some candidate only to find that it's vclock would actually
become incomparable with the candidate's.

Actually, this guard is not needed when checking a vote for self,
because a node can always vote for self. Besides, splitting term bump
and vote can lead to increased probability of split-vote. It may happen
that a candidate bumps and broadcasts the new term without a vote,
making other nodes vote for self. Let's go back to writing term and vote
together for self votes.

This change makes raft candidate persist term bump and vote for self in
one WAL write instead of two, so all the tests which count WAL writes or
expect 2 separate state updates for term and vote are rewritten.

Prerequisite tarantool#8497

NO_DOC=not user-visible
NO_CHANGELOG=not user-visible
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 26, 2023
box.ctl.promote() was implemented as follows: an instance bumps the
term and marks itself a candidate, but doesn't vote for self
immediately. Instead it relies on the machinery which makes a candidate
vote for self as soon as it persists a new term.

This differs from a normal election start due to leader timeout: there
term and vote are bumped at once.

Besides, this increases probability of box.ctl.promote() resulting in
other node getting elected: if a node first broadcasts a term without a
vote, it is not considered a candidate, so other candidates might start
elections and vote for themselves.

Let's bring promote into line with automatic elections.

Closes tarantool#8497

NO_DOC=bugfix
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 29, 2023
Commit c9155ac ("raft: persist new term and vote separately") made
the nodes persist new term and vote separately, using 2 WAL writes.
Writing the term first is needed to flush all the ongoing transactions,
so that the node's vclock is updated and can be checked against the
candidate's vclock. Otherwise it could happen that the node persists a
vote for some candidate only to find that it's vclock would actually
become incomparable with the candidate's.

Actually, this guard is not needed when checking a vote for self,
because a node can always vote for self. Besides, splitting term bump
and vote can lead to increased probability of split-vote. It may happen
that a candidate bumps and broadcasts the new term without a vote,
making other nodes vote for self. Let's go back to writing term and vote
together for self votes.

This change makes raft candidate persist term bump and vote for self in
one WAL write instead of two, so all the tests which count WAL writes or
expect 2 separate state updates for term and vote are rewritten.

Prerequisite tarantool#8497

NO_DOC=not user-visible
NO_CHANGELOG=not user-visible
sergepetrenko added a commit to sergepetrenko/tarantool that referenced this issue May 29, 2023
box.ctl.promote() was implemented as follows: an instance bumps the
term and marks itself a candidate, but doesn't vote for self
immediately. Instead it relies on the machinery which makes a candidate
vote for self as soon as it persists a new term.

This differs from a normal election start due to leader timeout: there
term and vote are bumped at once.

Besides, this increases probability of box.ctl.promote() resulting in
other node getting elected: if a node first broadcasts a term without a
vote, it is not considered a candidate, so other candidates might start
elections and vote for themselves.

Let's bring promote into line with automatic elections.

Closes tarantool#8497

NO_DOC=bugfix
sergepetrenko added a commit that referenced this issue May 29, 2023
Commit c9155ac ("raft: persist new term and vote separately") made
the nodes persist new term and vote separately, using 2 WAL writes.
Writing the term first is needed to flush all the ongoing transactions,
so that the node's vclock is updated and can be checked against the
candidate's vclock. Otherwise it could happen that the node persists a
vote for some candidate only to find that it's vclock would actually
become incomparable with the candidate's.

Actually, this guard is not needed when checking a vote for self,
because a node can always vote for self. Besides, splitting term bump
and vote can lead to increased probability of split-vote. It may happen
that a candidate bumps and broadcasts the new term without a vote,
making other nodes vote for self. Let's go back to writing term and vote
together for self votes.

This change makes raft candidate persist term bump and vote for self in
one WAL write instead of two, so all the tests which count WAL writes or
expect 2 separate state updates for term and vote are rewritten.

Prerequisite #8497

NO_DOC=not user-visible
NO_CHANGELOG=not user-visible
sergepetrenko added a commit that referenced this issue May 29, 2023
box.ctl.promote() was implemented as follows: an instance bumps the
term and marks itself a candidate, but doesn't vote for self
immediately. Instead it relies on the machinery which makes a candidate
vote for self as soon as it persists a new term.

This differs from a normal election start due to leader timeout: there
term and vote are bumped at once.

Besides, this increases probability of box.ctl.promote() resulting in
other node getting elected: if a node first broadcasts a term without a
vote, it is not considered a candidate, so other candidates might start
elections and vote for themselves.

Let's bring promote into line with automatic elections.

Closes #8497

NO_DOC=bugfix
sergepetrenko added a commit that referenced this issue May 29, 2023
Commit c9155ac ("raft: persist new term and vote separately") made
the nodes persist new term and vote separately, using 2 WAL writes.
Writing the term first is needed to flush all the ongoing transactions,
so that the node's vclock is updated and can be checked against the
candidate's vclock. Otherwise it could happen that the node persists a
vote for some candidate only to find that it's vclock would actually
become incomparable with the candidate's.

Actually, this guard is not needed when checking a vote for self,
because a node can always vote for self. Besides, splitting term bump
and vote can lead to increased probability of split-vote. It may happen
that a candidate bumps and broadcasts the new term without a vote,
making other nodes vote for self. Let's go back to writing term and vote
together for self votes.

This change makes raft candidate persist term bump and vote for self in
one WAL write instead of two, so all the tests which count WAL writes or
expect 2 separate state updates for term and vote are rewritten.

Prerequisite #8497

NO_DOC=not user-visible
NO_CHANGELOG=not user-visible

(cherry picked from commit 8a124e5)
sergepetrenko added a commit that referenced this issue May 29, 2023
box.ctl.promote() was implemented as follows: an instance bumps the
term and marks itself a candidate, but doesn't vote for self
immediately. Instead it relies on the machinery which makes a candidate
vote for self as soon as it persists a new term.

This differs from a normal election start due to leader timeout: there
term and vote are bumped at once.

Besides, this increases probability of box.ctl.promote() resulting in
other node getting elected: if a node first broadcasts a term without a
vote, it is not considered a candidate, so other candidates might start
elections and vote for themselves.

Let's bring promote into line with automatic elections.

Closes #8497

NO_DOC=bugfix

(cherry picked from commit 1737121)
sergepetrenko added a commit that referenced this issue May 29, 2023
Commit c9155ac ("raft: persist new term and vote separately") made
the nodes persist new term and vote separately, using 2 WAL writes.
Writing the term first is needed to flush all the ongoing transactions,
so that the node's vclock is updated and can be checked against the
candidate's vclock. Otherwise it could happen that the node persists a
vote for some candidate only to find that it's vclock would actually
become incomparable with the candidate's.

Actually, this guard is not needed when checking a vote for self,
because a node can always vote for self. Besides, splitting term bump
and vote can lead to increased probability of split-vote. It may happen
that a candidate bumps and broadcasts the new term without a vote,
making other nodes vote for self. Let's go back to writing term and vote
together for self votes.

This change makes raft candidate persist term bump and vote for self in
one WAL write instead of two, so all the tests which count WAL writes or
expect 2 separate state updates for term and vote are rewritten.

Prerequisite #8497

NO_DOC=not user-visible
NO_CHANGELOG=not user-visible

(cherry picked from commit 8a124e5)
sergepetrenko added a commit that referenced this issue May 29, 2023
box.ctl.promote() was implemented as follows: an instance bumps the
term and marks itself a candidate, but doesn't vote for self
immediately. Instead it relies on the machinery which makes a candidate
vote for self as soon as it persists a new term.

This differs from a normal election start due to leader timeout: there
term and vote are bumped at once.

Besides, this increases probability of box.ctl.promote() resulting in
other node getting elected: if a node first broadcasts a term without a
vote, it is not considered a candidate, so other candidates might start
elections and vote for themselves.

Let's bring promote into line with automatic elections.

Closes #8497

NO_DOC=bugfix

(cherry picked from commit 1737121)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.10 Target is 2.10 and all newer release/master branches bug Something isn't working raft RAFT protocol
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants