ceph: Persist expected mon endpoints immediately during mon failover #7884
Conversation
I think this can go away now since it's called by saveMonConfig()
See rook/pkg/operator/ceph/cluster/mon/mon.go, lines 505 to 508 at e06c36d:

// make sure we have the connection info generated so connections can happen
if err := WriteConnectionConfig(c.context, c.ClusterInfo); err != nil {
	return err
}
Agreed, that is duplicated; I'll remove it.
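For context, here is a minimal sketch of why the extra call is redundant, using stub types that stand in for the real Rook code: saveMonConfig() already regenerates the connection config after persisting the mon endpoints, so callers do not need their own WriteConnectionConfig call.

package mon

// Stub types standing in for the real Rook code; sketch only.
type Context struct{}
type ClusterInfo struct{}

type Cluster struct {
	context     *Context
	ClusterInfo *ClusterInfo
}

// WriteConnectionConfig stands in for the helper that writes the config and
// keyring that clients use to connect to the mons.
func WriteConnectionConfig(ctx *Context, info *ClusterInfo) error { return nil }

// saveMonConfig persists the expected mon endpoints and then regenerates the
// connection config, making a separate WriteConnectionConfig call in the
// caller unnecessary.
func (c *Cluster) saveMonConfig() error {
	// ... persist the mon endpoints (e.g., to the mon endpoints ConfigMap) ...

	// make sure we have the connection info generated so connections can happen
	if err := WriteConnectionConfig(c.context, c.ClusterInfo); err != nil {
		return err
	}
	return nil
}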
Force-pushed from a72a773 to 35a2ae2.
After mon failover is initiated, there was a time window where, if the operator was restarted, the new mon had started and joined quorum, but the operator did not believe the mon should be in quorum after the restart. The operator mistakenly removed the extra mon prematurely, sometimes causing quorum to be lost if another mon was also down at the same time. If the mon did not come back online, the steps to recover quorum in the disaster guide would need to be followed. Now the expected list of mons is updated immediately during mon failover once the operator has successfully created the new mon deployment, removing the window where restarting the operator can cause quorum loss.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>
Force-pushed from 35a2ae2 to bb4191a.
ceph: Persist expected mon endpoints immediately during mon failover (backport #7884)
This patch set updates the Rook operator image to v1.6.3. Release notes: https://github.com/rook/rook/releases/tag/v1.6.3. The storage team is specifically interested in rook/rook#7951 and rook/rook#7884. Change-Id: Iea9479ccb6664d499e90cbad46a43912f7936530
Description of your changes:
After mon failover is initiated, there was a time window where, if the operator was restarted, the new mon had started and joined quorum, but the operator did not believe the mon should be in quorum after the restart. The operator mistakenly removed the extra mon prematurely, sometimes causing quorum to be lost if another mon was also down at the same time. If the mon did not come back online, the steps to recover quorum in the disaster guide would need to be followed. Now the expected list of mons is updated immediately during mon failover once the operator has successfully created the new mon deployment, removing the window where restarting the operator can cause quorum loss. A sketch of the ordering change follows below.
This issue was most commonly hit during a node drain, where the operator might be restarted around the same time that mon failover is triggered if the drain takes longer than 10 minutes.
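The following is a simplified sketch of that ordering change, using hypothetical helper names rather than the actual Rook functions: the expected mon list is persisted as soon as the replacement deployment is created, before the failed mon is removed.

package mon

import "fmt"

// Illustrative stand-ins only; not the real Rook API.
type monSpec struct{ Name string }

type Cluster struct{}

func (c *Cluster) newMonSpec() monSpec             { return monSpec{Name: "mon-d"} }
func (c *Cluster) startDeployment(m monSpec) error { return nil }
func (c *Cluster) saveMonConfig() error            { return nil }
func (c *Cluster) removeMon(name string) error     { return nil }

// failoverMon sketches the fixed ordering. Previously the expected mon list
// was persisted only later in the failover, leaving a window where a
// restarted operator did not expect the new mon and removed it prematurely.
func (c *Cluster) failoverMon(name string) error {
	// Start the replacement mon.
	newMon := c.newMonSpec()
	if err := c.startDeployment(newMon); err != nil {
		return fmt.Errorf("failed to start new mon %s: %w", newMon.Name, err)
	}

	// Persist the expected mon endpoints immediately, so an operator restart
	// at this point still knows the new mon belongs in quorum.
	if err := c.saveMonConfig(); err != nil {
		return fmt.Errorf("failed to persist expected mons: %w", err)
	}

	// Only now remove the failed mon and its resources.
	return c.removeMon(name)
}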
Which issue is resolved by this Pull Request:
Resolves #7797
Checklist:
Code generation (make codegen) has been run to update object specifications, if necessary.