Fix a bug where the leader is never elected even if the majority of the members are alive. #442

Merged: kjnilsson merged 5 commits into rabbitmq:main from the fix-issue-439 branch on Jun 5, 2024

@sile (Contributor) commented May 17, 2024

Proposed Changes

This PR addresses the issue reported in #439.

In summary, #439 reports that when a cluster contains a candidate member and a pre_vote member, and the pre_vote member has a higher log index than the candidate, neither of them can ever be elected leader.
(This holds even if the cluster also contains up to N / 2 - 1 followers without election timers, where N is the cluster size.)

This PR adds a branch to ra_server:handle_candidate(#pre_vote_rpc{}, ...) to handle the case where the pre_vote member has a higher log index. With the new branch, a candidate that receives such a message transitions to the follower state.
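
The branch described above can be sketched roughly as follows. This is a simplified, self-contained illustration and not the actual patch: the module name, the record definitions, and the return shape are all assumptions made for the example, and ra's real records and handler signatures differ. It compares only the last log index, as the description does.

```erlang
-module(candidate_step_down_sketch).
-export([handle_candidate/2, demo/0]).

%% Hypothetical, simplified shapes. These are NOT ra's real records.
-record(pre_vote_rpc, {term = 0, candidate_id, last_log_index = 0}).
-record(state, {current_term = 0, last_log_index = 0}).

%% Returns {NextRaftState, State}.
handle_candidate(#pre_vote_rpc{last_log_index = TheirIdx},
                 #state{last_log_index = OurIdx} = State)
  when TheirIdx > OurIdx ->
    %% The pre-voter's log is ahead of ours, so we cannot win against it:
    %% step down to follower instead of staying a candidate.
    {follower, State};
handle_candidate(_Msg, State) ->
    %% Any other message leaves the candidate where it is in this sketch.
    {candidate, State}.

%% Small self-check: crashes with badmatch if the sketch misbehaves.
demo() ->
    State = #state{current_term = 3, last_log_index = 5},
    Ahead = #pre_vote_rpc{term = 3, candidate_id = repro_c, last_log_index = 7},
    Behind = Ahead#pre_vote_rpc{last_log_index = 4},
    {follower, State} = handle_candidate(Ahead, State),
    {candidate, State} = handle_candidate(Behind, State),
    ok.
```

Calling candidate_step_down_sketch:demo() after compiling the module exercises both clauses.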

I think this is somewhat ad hoc. However, since I don't know the ra code base well (especially the role of the pre_vote state), I kept the patch small to limit its impact.
Feel free to suggest better alternatives.

Closes #439.

FYI

After applying the reproduction patch from issue #439 to this PR branch, I got the following output:

(foo@localhost)1> repro:run().
# create cluster
* [repro_a] init
* [repro_c] init
* [repro_b] init
* [repro_a] state_enter: recover
* [repro_c] state_enter: recover
* [repro_b] state_enter: recover
* [repro_a] state_enter: recovered
* [repro_c] state_enter: recovered
* [repro_b] state_enter: recovered
* [repro_a] state_enter: follower
* [repro_c] state_enter: follower
* [repro_b] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_c] state_enter: candidate
* [repro_c] state_enter: leader
# Please wait 5 seconds...
# trigger election
ok
* [repro_a] state_enter: pre_vote
* [repro_a] state_enter: candidate
* [repro_c] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_a] state_enter: follower
* [repro_c] state_enter: candidate
* [repro_c] state_enter: leader  # new leader is elected
* [repro_a] state_enter: await_condition
* [repro_a] state_enter: follower

Checklist

Put an x in the boxes that apply. You can also fill these out after creating
the PR. If you're unsure about any of them, don't hesitate to ask on the
mailing list. We're here to help! This is simply a reminder of what we are
going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • All tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in related repositories

@michaelklishin (Member) commented

@sile thank you for your ongoing contributions! We will get to this PR in the next couple of weeks.

@kjnilsson (Contributor) left a comment

Higher states are only really exited when encountering a higher term, which this change goes against. In this case I think it should be sufficient for the candidate to simply reply to the pre-vote request (with a success), as this would allow the pre-voter to potentially proceed to the candidate state itself and eventually win the election.
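
The suggestion above might look something like the following sketch, in which the candidate keeps its state and answers the pre-vote request positively. Everything here is an assumption made for illustration (module name, record fields, the effect tuple); it is not ra's real API and does not reproduce the merged code.

```erlang
-module(prevote_reply_sketch).
-export([handle_candidate/2, demo/0]).

%% Hypothetical, simplified shapes. These are NOT ra's real records.
-record(pre_vote_rpc, {term = 0, candidate_id, token, last_log_index = 0}).
-record(pre_vote_result, {term = 0, token, vote_granted = false}).
-record(state, {current_term = 0, last_log_index = 0}).

%% Returns {NextRaftState, State, Effects}.
handle_candidate(#pre_vote_rpc{candidate_id = From,
                               token = Token,
                               last_log_index = TheirIdx},
                 #state{current_term = Term,
                        last_log_index = OurIdx} = State)
  when TheirIdx > OurIdx ->
    %% Stay a candidate (no higher term was seen), but grant the pre-vote
    %% so the pre-voter with the longer log can move on to candidate.
    Reply = #pre_vote_result{term = Term, token = Token, vote_granted = true},
    {candidate, State, [{reply, From, Reply}]};
handle_candidate(_Msg, State) ->
    %% Other messages produce no reply in this sketch.
    {candidate, State, []}.

%% Small self-check: crashes with badmatch if the sketch misbehaves.
demo() ->
    State = #state{current_term = 3, last_log_index = 5},
    Ahead = #pre_vote_rpc{term = 3, candidate_id = repro_c,
                          token = make_ref(), last_log_index = 7},
    Behind = Ahead#pre_vote_rpc{last_log_index = 4},
    {candidate, State, [{reply, repro_c, #pre_vote_result{vote_granted = true}}]} =
        handle_candidate(Ahead, State),
    {candidate, State, []} = handle_candidate(Behind, State),
    ok.
```

The point of this design is that the candidate never leaves its state without seeing a higher term; it only unblocks the pre-voter, which can then start a real election in a newer term.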

@sile (Contributor, Author) commented May 21, 2024

@kjnilsson That sounds reasonable. Thank you for your comment!

I have incorporated the change in commit e6bbd1a. Here are the current results of the repro function:

> repro:run().
# create cluster
* [repro_b] init
* [repro_c] init
* [repro_a] init
* [repro_b] state_enter: recover
* [repro_c] state_enter: recover
* [repro_a] state_enter: recover
* [repro_b] state_enter: recovered
* [repro_c] state_enter: recovered
* [repro_a] state_enter: recovered
* [repro_b] state_enter: follower
* [repro_c] state_enter: follower
* [repro_a] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_c] state_enter: candidate
* [repro_c] state_enter: leader
# Please wait 5 seconds...
# trigger election
ok
* [repro_a] state_enter: pre_vote
* [repro_a] state_enter: candidate
* [repro_c] state_enter: follower
* [repro_c] state_enter: pre_vote
* [repro_c] state_enter: candidate
* [repro_a] state_enter: follower  # Unlike before commit e6bbd1a, repro_a becomes follower only after repro_c becomes candidate.
* [repro_c] state_enter: leader
* [repro_a] state_enter: await_condition
* [repro_a] state_enter: follower

@kjnilsson kjnilsson merged commit d128c81 into rabbitmq:main Jun 5, 2024
8 of 10 checks passed
@sile sile deleted the fix-issue-439 branch June 5, 2024 07:50
@sile (Contributor, Author) commented Jun 5, 2024

Thank you for reviewing and merging this PR!

Successfully merging this pull request may close these issues.

There is a possibility that the leader is never elected even if the majority of the members are alive
4 participants