ceph: Persist expected mon endpoints immediately during mon failover #7884

travisn · 2021-05-11T21:17:04Z

Description of your changes:
After mon failover was initiated, there was a time window in which an operator restart could leave the cluster in an inconsistent state: the new mon had started and joined quorum, but the restarted operator did not expect that mon to be in quorum. The operator then mistakenly removed the "extra" mon prematurely, which could cause quorum loss if another mon was down at the same time. If that mon did not come back online, the steps to recover quorum from the disaster recovery guide had to be followed. Now the expected list of mons is updated immediately during mon failover, as soon as the operator has successfully created the new mon deployment, which closes the window in which restarting the operator can cause quorum loss.

This issue was most commonly hit during node drains: if a drain takes longer than 10 minutes, the operator may be restarted at about the same time that mon failover is triggered.
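To illustrate the race, here is a toy model of the failover window. It is only a sketch and not the operator's actual code: the types `store` and `cluster` and the functions `newCluster`, `startFailover`, and `reconcileAfterRestart` are hypothetical names invented for this example, and the persisted state is modeled as an in-memory map rather than whatever the operator really writes to.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// store stands in for state that survives an operator restart.
// Hypothetical: it only models "the expected list of mons".
type store struct {
	expectedMons map[string]bool
}

// cluster pairs the persisted expectations with the mons actually in quorum.
type cluster struct {
	persisted *store
	inQuorum  map[string]bool
}

// newCluster returns a healthy three-mon cluster (a, b, c).
func newCluster() *cluster {
	c := &cluster{
		persisted: &store{expectedMons: map[string]bool{}},
		inQuorum:  map[string]bool{},
	}
	for _, m := range []string{"a", "b", "c"} {
		c.persisted.expectedMons[m] = true
		c.inQuorum[m] = true
	}
	return c
}

// startFailover models creating the deployment for a replacement mon.
// With persistImmediately set, the expected list is saved as soon as the
// deployment exists, which is the behavior this change describes.
func (c *cluster) startFailover(newMon string, persistImmediately bool) error {
	if newMon == "" {
		return errors.New("mon name must not be empty")
	}
	// The deployment was created, so the mon starts and joins quorum.
	c.inQuorum[newMon] = true
	if persistImmediately {
		c.persisted.expectedMons[newMon] = true
	}
	return nil
}

// reconcileAfterRestart models an operator that restarted mid-failover.
// It knows only the persisted list and removes any mon it does not expect.
func (c *cluster) reconcileAfterRestart() []string {
	var removed []string
	for name := range c.inQuorum {
		if !c.persisted.expectedMons[name] {
			delete(c.inQuorum, name)
			removed = append(removed, name)
		}
	}
	sort.Strings(removed)
	return removed
}

func main() {
	for _, persist := range []bool{false, true} {
		c := newCluster()
		if err := c.startFailover("d", persist); err != nil {
			fmt.Println("failover failed:", err)
			continue
		}
		removed := c.reconcileAfterRestart()
		fmt.Printf("persistImmediately=%v removed=%v\n", persist, removed)
	}
}
```

In this model, the run without immediate persistence removes the new mon after the simulated restart, while the run with immediate persistence leaves it in quorum.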

Which issue is resolved by this Pull Request:
Resolves #7797

leseb

I think this can go away now since it's called by saveMonConfig():

rook/pkg/operator/ceph/cluster/mon/mon.go, lines 505 to 508 in e06c36d:

    // make sure we have the connection info generated so connections can happen
    if err := WriteConnectionConfig(c.context, c.ClusterInfo); err != nil {
        return err
    }

travisn · 2021-05-12T14:34:39Z

Agreed that is duplicated, I'll remove it

After mon failover was initiated, there was a time window in which an operator restart could leave the cluster in an inconsistent state: the new mon had started and joined quorum, but the restarted operator did not expect that mon to be in quorum. The operator then mistakenly removed the "extra" mon prematurely, which could cause quorum loss if another mon was down at the same time. If that mon did not come back online, the steps to recover quorum from the disaster recovery guide had to be followed. Now the expected list of mons is updated immediately during mon failover, as soon as the operator has successfully created the new mon deployment, which closes the window in which restarting the operator can cause quorum loss.

Signed-off-by: Travis Nielsen <tnielsen@redhat.com>

ceph: Persist expected mon endpoints immediately during mon failover (backport #7884)

This PS is to update the rook operator image to v1.6.3.

Release notes: https://github.com/rook/rook/releases/tag/v1.6.3

The storage team is specifically interested in: rook/rook#7951, rook/rook#7884

Change-Id: Iea9479ccb6664d499e90cbad46a43912f7936530

travisn added the ceph and backport-release-1.5 labels May 11, 2021

travisn requested a review from leseb May 11, 2021 21:17

sp98 approved these changes May 12, 2021

leseb requested changes May 12, 2021

travisn force-pushed the mon-failover-window branch from a72a773 to 35a2ae2 Compare May 12, 2021 14:36

leseb approved these changes May 12, 2021

travisn force-pushed the mon-failover-window branch from 35a2ae2 to bb4191a Compare May 12, 2021 15:44

travisn merged commit 081e2fd into rook:master May 12, 2021

travisn deleted the mon-failover-window branch May 12, 2021 16:42

This was referenced May 12, 2021

ceph: Persist expected mon endpoints immediately during mon failover (backport #7884) #7895

Merged

ceph: Persist expected mon endpoints immediately during mon failover (backport #7884) #7896

Merged

mergify bot added a commit that referenced this pull request May 12, 2021

Merge pull request #7896 from rook/mergify/bp/release-1.6/pr-7884

5dd2d38

ceph: Persist expected mon endpoints immediately during mon failover (backport #7884)

mergify bot added a commit that referenced this pull request May 12, 2021

Merge pull request #7895 from rook/mergify/bp/release-1.5/pr-7884

46300c4

ceph: Persist expected mon endpoints immediately during mon failover (backport #7884)
