Skip to content

Commit

Permalink
ceph: do not check ok-to-stop when OSDs are in CLBO
Browse files Browse the repository at this point in the history
This handles the scenario where the OSDs have been created but not yet
started due to a wrong CR configuration.
For instance, when OSDs are encrypted and Vault is used to store
encryption keys, if the KV version is incorrect during the cluster
initialization the OSDs will fail to start and stay in CLBO until the
CR is updated again with the correct KV version so that it can start.

For this scenario, if the CRUSH map has no host registered yet it's fair
to assume the initialization broke and we need to fix it. So when don't
need to call ok-to-stop since it will always fail and eventually force
pass but let's not wait for nothing.

Signed-off-by: Sébastien Han <seb@redhat.com>
  • Loading branch information
leseb committed Aug 24, 2021
1 parent 6582bac commit 0f02792
Showing 1 changed file with 12 additions and 2 deletions.
14 changes: 12 additions & 2 deletions pkg/daemon/ceph/client/upgrade.go
Original file line number Diff line number Diff line change
Expand Up @@ -37,7 +37,8 @@ const (
var (
// we don't perform any checks on these daemons
// they don't have any "ok-to-stop" command implemented
daemonNoCheck = []string{"mgr", "rgw", "rbd-mirror", "nfs", "fs-mirror"}
daemonNoCheck = []string{"mgr", "rgw", "rbd-mirror", "nfs", "fs-mirror"}
errNoHostInCRUSH = errors.New("no host in crush map yet?")
)

func getCephMonVersionString(context *clusterd.Context, clusterInfo *ClusterInfo) (string, error) {
Expand Down Expand Up @@ -311,7 +312,7 @@ func allOSDsSameHost(context *clusterd.Context, clusterInfo *ClusterInfo) (bool,

hostOsdNodes := len(hostOsdTree.Nodes)
if hostOsdNodes == 0 {
return false, errors.New("no host in crush map yet?")
return false, errNoHostInCRUSH
}

// If the number of OSD node is 1, chances are this is simple setup with all OSDs on it
Expand Down Expand Up @@ -401,6 +402,15 @@ func osdDoNothing(context *clusterd.Context, clusterInfo *ClusterInfo) bool {
// aio means all in one
aio, err := allOSDsSameHost(context, clusterInfo)
if err != nil {
// We return true so that we can continue without a retry and subsequently not test if the osd can be stopped
// This handles the scenario where the OSDs have been created but not yet started due to a wrong CR configuration
// For instance, when OSDs are encrypted and Vault is used to store encryption keys, if the KV version is incorrect during the cluster initialization
// the OSDs will fail to start and stay in CLBO until the CR is updated again with the correct KV version so that it can start
// For this scenario we don't need to go through the path where the check whether the OSD can be stopped or not, so it will always fail and make us wait for nothing
if errors.Is(err, errNoHostInCRUSH) {
logger.Warning("the CRUSH map has not 'host' entries so not performing ok-to-stop checks")
return true
}
logger.Warningf("failed to determine if all osds are running on the same host, performing upgrade check anyways. %v", err)
return false
}
Expand Down

0 comments on commit 0f02792

Please sign in to comment.