ceph: Persist expected mon endpoints immediately during mon failover #7884

Merged
merged 1 commit into from May 12, 2021

Conversation

travisn
Member

@travisn travisn commented May 11, 2021

Description of your changes:
After mon failover is initiated, there was a time window in which an operator restart could cause trouble: the new mon had started and joined quorum, but after the restart the operator did not believe that mon should be in quorum. The operator then mistakenly removed the extra mon prematurely, sometimes causing quorum to be lost if another mon was also down at the same time. If that mon did not come back online, the steps to recover quorum from the disaster recovery guide would need to be followed. Now the expected list of mons is updated immediately during mon failover, as soon as the operator has successfully created the new mon deployment, which closes the window in which restarting the operator can cause quorum loss.

This issue was most commonly hit during node drain: if node drains take longer than 10 minutes, the operator might be restarted at around the same time that mon failover is triggered.
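
The ordering described above can be illustrated with a minimal sketch. This is not the operator's actual code: `monStore`, `failoverMon`, `isExpected`, `errCreateFailed`, and the `createDeployment` callback are hypothetical names invented for the example, and an in-memory map stands in for wherever the expected mon endpoints are really persisted. It only shows the idea that the expected list is saved right after the new mon deployment is created, so that a freshly restarted operator reading the persisted state still treats the new mon as expected.

```go
package main

import (
	"errors"
	"sort"
)

// errCreateFailed is a hypothetical error used to simulate a failed
// deployment creation in this sketch.
var errCreateFailed = errors.New("failed to create mon deployment")

// monStore is a hypothetical stand-in for the persisted list of expected
// mon endpoints. An in-memory map is used here purely for illustration.
type monStore struct {
	endpoints map[string]string // mon name -> endpoint
}

// save persists a copy of the expected mons, replacing the previous state.
func (s *monStore) save(expected map[string]string) {
	s.endpoints = make(map[string]string, len(expected))
	for name, endpoint := range expected {
		s.endpoints[name] = endpoint
	}
}

// names returns the persisted mon names in sorted order.
func (s *monStore) names() []string {
	out := make([]string, 0, len(s.endpoints))
	for name := range s.endpoints {
		out = append(out, name)
	}
	sort.Strings(out)
	return out
}

// isExpected reports whether a mon is in the persisted expected list. A
// restarted operator would consult persisted state like this before
// deciding that a mon in quorum is "extra" and should be removed.
func isExpected(s *monStore, name string) bool {
	_, ok := s.endpoints[name]
	return ok
}

// failoverMon creates the replacement mon and, only if that succeeds,
// immediately persists the updated expected list. Nothing is persisted
// when the deployment could not be created.
func failoverMon(
	s *monStore,
	expected map[string]string,
	newName, newEndpoint string,
	createDeployment func(name string) error, // hypothetical callback
) error {
	if err := createDeployment(newName); err != nil {
		return err
	}
	expected[newName] = newEndpoint
	s.save(expected) // persist right away, before waiting for quorum
	return nil
}
```

With this ordering, an operator restart between creating the deployment and finishing the failover finds the new mon in the persisted list (`isExpected` returns true), so it has no reason to remove it.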

Which issue is resolved by this Pull Request:
Resolves #7797

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@travisn travisn requested a review from leseb May 11, 2021 21:17
Member

@leseb leseb left a comment

I think this can go away now since it's called by saveMonConfig():

// make sure we have the connection info generated so connections can happen
if err := WriteConnectionConfig(c.context, c.ClusterInfo); err != nil {
	return err
}

@travisn
Member Author

travisn commented May 12, 2021

I think this can go away now since it's called by saveMonConfig():

// make sure we have the connection info generated so connections can happen
if err := WriteConnectionConfig(c.context, c.ClusterInfo); err != nil {
	return err
}

Agreed, that is duplicated. I'll remove it.

After mon failover is initiated, there was a time window where if the operator
was restarted, the new mon is started and has joined quorum, but the operator
does not believe the mon should be in quorum after the operator restart.
The operator was mistakenly removing the extra mon prematurely, sometimes
causing quorum to be lost if another mon was also down at the same time.
If the mon does not come back online, steps to recover quorum would need
to be followed from the disaster guide. Now the expected list of mons
will be updated immediately during mon failover if the operator successfully
created the new mon deployment, thus removing the window where restarting
the operator can cause quorum loss.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
@travisn travisn merged commit 081e2fd into rook:master May 12, 2021
@travisn travisn deleted the mon-failover-window branch May 12, 2021 16:42
mergify bot added a commit that referenced this pull request May 12, 2021
ceph: Persist expected mon endpoints immediately during mon failover (backport #7884)
mergify bot added a commit that referenced this pull request May 12, 2021
ceph: Persist expected mon endpoints immediately during mon failover (backport #7884)
airshipbot pushed a commit to airshipit/treasuremap that referenced this pull request Jun 7, 2021
This PS is to update the rook operator image to v1.6.3.

Release notes:

https://github.com/rook/rook/releases/tag/v1.6.3

The storage team is specifically interested in:

rook/rook#7951
rook/rook#7884

Change-Id: Iea9479ccb6664d499e90cbad46a43912f7936530
Labels
ceph main ceph tag
Development

Successfully merging this pull request may close these issues.

Mon failover can cause mons to fall out of quorum if the operator is disrupted in the middle of the failover
3 participants