core: set blocking PDB even if no unhealthy PGs appear #13511 (Merged)

39 changes: 4 additions & 35 deletions pkg/operator/ceph/disruption/clusterdisruption/osd.go
@@ -338,24 +338,9 @@ func (r *ReconcileClusterDisruption) reconcilePDBsForOSDs(
 	}
 
 	switch {
-	// osd is down but pgs are active+clean
-	case osdDown && pgClean:
Member: If the PGs are clean, the intent is that the PDBs will again reset after some timeout. Perhaps 30s was just too short. What about a timeout of 5 or 10 minutes? @sp98 thoughts?

Contributor: case osdDown && pgClean:

  • This case handles the scenario where the OSD was down but the PGs are still clean.
  • This can happen if the OSD went down and it took some time for Ceph to update the PG down status, or for Rook to read the PG down status.
  • So we wait around 30 seconds after the OSD went down to confirm whether the PGs went down or not.
  • If the PGs didn't go down after 30 seconds, we reset the PDB config back to the normal state and allow more OSDs to be drained (see the sketch just below).
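
For illustration, here is a minimal, self-contained sketch of the timeout behavior described in the list above, using a plain string map in place of the PDB state ConfigMap and stdlib-only helpers with illustrative names; the actual removed code appears in the diff hunks further down.

package main

import (
	"fmt"
	"time"
)

// key name mirrors the removed code; the string value here is illustrative.
const drainingFailureDomainDurationKey = "draining-failure-domain-duration"

// lastDrainTimestamp parses the stored drain time, defaulting to "now" when the
// key has never been written (the same effect as the removed helper).
func lastDrainTimestamp(data map[string]string) (time.Time, error) {
	s, ok := data[drainingFailureDomainDurationKey]
	if !ok || s == "" {
		return time.Now(), nil
	}
	return time.Parse(time.RFC3339, s)
}

// shouldResetPDBs reports whether the blocking PDBs would have been reset under
// the old behavior: OSD down, PGs active+clean, and more than 30s elapsed.
func shouldResetPDBs(data map[string]string, osdDown, pgClean bool) (bool, error) {
	if !osdDown || !pgClean {
		return false, nil
	}
	last, err := lastDrainTimestamp(data)
	if err != nil {
		return false, fmt.Errorf("failed to parse last drain timestamp: %w", err)
	}
	return time.Since(last) > 30*time.Second, nil
}

func main() {
	data := map[string]string{
		drainingFailureDomainDurationKey: time.Now().Add(-time.Minute).Format(time.RFC3339),
	}
	reset, err := shouldResetPDBs(data, true, true)
	fmt.Println(reset, err) // true <nil>: the 30s window has passed, so the PDBs reset
}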

Contributor: So the whole idea of the 30 second time period is to give Rook enough time to read the pg status correctly.

sp98 (Contributor), Jan 8, 2024: If the OSD is down but the PGs are active+clean, why would we want to even enable the blocking PDBs? Shouldn't we allow the next OSD to drain, as the data is safe? @ushitora-anqou

Member: @sp98 If there are multiple OSDs on the node, the only way for a node to be drained is if the blocking PDBs are applied to other nodes or zones to allow this one to drain. If I'm understanding the timeout correctly, perhaps we need a longer timeout if there are multiple OSDs on a node, so the local node/zone can be allowed to drain.
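
For context, a rough sketch of what a "blocking" PDB amounts to in Kubernetes terms: a PodDisruptionBudget with maxUnavailable: 0 over the OSD pods of one failure domain, so evictions (and hence drains) of those pods are refused until the PDB is removed again. The names, labels, and namespace below are illustrative rather than Rook's exact ones.

package main

import (
	"context"
	"log"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		log.Fatal(err)
	}

	zero := intstr.FromInt(0)
	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{
			Name:      "rook-ceph-osd-zone-a", // illustrative name
			Namespace: "rook-ceph",            // illustrative namespace
		},
		Spec: policyv1.PodDisruptionBudgetSpec{
			// Select the OSD pods of one failure domain. With maxUnavailable: 0,
			// the eviction API refuses to evict any selected pod, so a drain of
			// that node/zone stays blocked until this PDB is deleted.
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{
					"app":  "rook-ceph-osd", // illustrative labels
					"zone": "zone-a",
				},
			},
			MaxUnavailable: &zero,
		},
	}

	if _, err := clientset.PolicyV1().PodDisruptionBudgets(pdb.Namespace).Create(context.TODO(), pdb, metav1.CreateOptions{}); err != nil {
		log.Fatal(err)
	}
}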

Contributor: @ushitora-anqou Thanks for the details. Can you please confirm the PDB behavior for the following scenarios as well?

  • The disk backing the OSD was removed.
  • The OSD deployment was deleted.

Contributor Author: [comment with detailed test results did not load]

sp98 (Contributor), Feb 5, 2024: Thanks for the detailed testing @ushitora-anqou. I really appreciate the effort you have put into testing this.

@travisn The PR looks good to me, so I'm approving it. I don't think there will be any regression due to this change (fingers crossed), but I'm still requesting your approval before we merge, just to ensure that I'm not missing anything.

travisn (Member), Feb 5, 2024: I understand the changes in this PR, but I want to be clear about the behavior change. If any OSD pod is down, we would now expect the blocking PDBs to be in place. Even if the PGs become active+clean, the blocking PDBs will remain enabled. This means that if an OSD is down due to a failed disk, that OSD would have to be either repaired or purged before the PDBs are reset again.

This behavior is simple and intuitive to me, but it is different from the previous behavior, which would allow the PDBs to be reset after some time even if an OSD disk fails and Ceph backfills its data to other OSDs.

Is that correct, or is any other clarification needed?

Contributor Author: It seems correct to me.

> it is different from the previous behavior that would allow the PDBs to be reset after some time even if an OSD disk fails and Ceph backfills its data to other OSDs.

Yes, it is. My PR claims that the previous behavior is problematic because the reset PDBs don't allow the OSDs that can be drained to actually be drained.
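
For reference, here is a condensed sketch (plain Go, not part of this diff) of the decision flow after this change, based on the behavior summarized in the thread above; the state map and helper functions are illustrative stand-ins, and Rook's own timeout/reset handling is omitted.

package main

import "fmt"

const drainingFailureDomainKey = "draining-failure-domain" // key name from the diff; value illustrative

func applyBlockingPDBs(failureDomain string) { fmt.Println("blocking PDBs set for", failureDomain) }
func resetToDefaultPDB()                     { fmt.Println("back to the single default OSD PDB") }

func reconcilePDBState(state map[string]string, osdDown, pgClean bool, drainingFailureDomain string) {
	if osdDown {
		// Post-change: record the draining failure domain whenever an OSD is
		// down, even if the PGs are active+clean.
		state[drainingFailureDomainKey] = drainingFailureDomain
	} else {
		// Only once no OSD is down does the recorded domain clear, allowing the
		// default PDB back (Rook's timeout handling around this is omitted).
		delete(state, drainingFailureDomainKey)
	}

	if state[drainingFailureDomainKey] != "" {
		// Post-change there is no "&& !pgClean" here, so clean PGs no longer
		// short-circuit the blocking PDBs.
		applyBlockingPDBs(state[drainingFailureDomainKey])
		return
	}
	resetToDefaultPDB()
}

func main() {
	state := map[string]string{}
	reconcilePDBState(state, true, true, "zone-a") // OSD down, PGs clean: PDBs still block
	reconcilePDBState(state, false, true, "")      // OSD back up: default PDB restored
}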

-		lastDrainTimeStamp, err := getLastDrainTimeStamp(pdbStateMap, drainingFailureDomainDurationKey)
-		if err != nil {
-			return reconcile.Result{}, errors.Wrapf(err, "failed to get last drain timestamp from the configmap %q", pdbStateMap.Name)
-		}
-		timeSinceOSDDown := time.Since(lastDrainTimeStamp)
-		if timeSinceOSDDown > 30*time.Second {
-			logger.Infof("osd is down in failure domain %q is down for the last %.2f minutes, but pgs are active+clean", drainingFailureDomain, timeSinceOSDDown.Minutes())
-			resetPDBConfig(pdbStateMap)
-		} else {
-			logger.Infof("osd is down in the failure domain %q, but pgs are active+clean. Requeuing in case pg status is not updated yet...", drainingFailureDomain)
-			return reconcile.Result{Requeue: true, RequeueAfter: 15 * time.Second}, nil
-		}

-	// osd is down and pgs are not healthy
-	case osdDown && !pgClean:
-		logger.Infof("osd is down in failure domain %q and pgs are not active+clean. pg health: %q", drainingFailureDomain, pgHealthMsg)
+	// osd is down
+	case osdDown:
+		logger.Infof("osd is down in failure domain %q. pg health: %q", drainingFailureDomain, pgHealthMsg)
 		currentlyDrainingFD, ok := pdbStateMap.Data[drainingFailureDomainKey]
 		if !ok || drainingFailureDomain != currentlyDrainingFD {
 			pdbStateMap.Data[drainingFailureDomainKey] = drainingFailureDomain
@@ -383,7 +368,7 @@ func (r *ReconcileClusterDisruption) reconcilePDBsForOSDs(
 		}
 	}
 
-	if pdbStateMap.Data[drainingFailureDomainKey] != "" && !pgClean {
+	if pdbStateMap.Data[drainingFailureDomainKey] != "" {
 		// delete default OSD pdb and create blocking OSD pdbs
 		err := r.handleActiveDrains(allFailureDomains, pdbStateMap.Data[drainingFailureDomainKey], failureDomainType, clusterInfo.Namespace, pgClean)
 		if err != nil {
@@ -646,22 +631,6 @@ func getPDBName(failureDomainType, failureDomainName string) string {
 	return k8sutil.TruncateNodeName(fmt.Sprintf("%s-%s-%s", osdPDBAppName, failureDomainType, "%s"), failureDomainName)
 }

-func getLastDrainTimeStamp(pdbStateMap *corev1.ConfigMap, key string) (time.Time, error) {
-	var err error
-	var lastDrainTimeStamp time.Time
-	lastDrainTimeStampString, ok := pdbStateMap.Data[key]
-	if !ok || len(lastDrainTimeStampString) == 0 {
-		return time.Now(), nil
-	} else {
-		lastDrainTimeStamp, err = time.Parse(time.RFC3339, pdbStateMap.Data[key])
-		if err != nil {
-			return time.Time{}, errors.Wrapf(err, "failed to parse timestamp %q", pdbStateMap.Data[key])
-		}
-	}
-
-	return lastDrainTimeStamp, nil
-}

 func (r *ReconcileClusterDisruption) getAllowedDisruptions(pdbName, namespace string) (int32, error) {
 	usePDBV1Beta1, err := k8sutil.UsePDBV1Beta1Version(r.context.ClusterdContext.Clientset)
 	if err != nil {