ceph: reconcile osd pdb if allowed disruption is 0 #8698

Merged
merged 1 commit into rook:master on Sep 15, 2021

Conversation

@sp98 sp98 (Contributor) commented Sep 13, 2021

Rook checks for down OSDs by checking the `ReadyReplicas` count in the OSD deployment. When an OSD pod goes into CrashLoopBackOff (CLBO) due to disk failure, there is a delay before this `ReadyReplicas` count becomes 0. The delay is very small but may result in Rook missing the OSD down event. As a result, no blocking PDBs will be created and only the default PDB, with an `AllowedDisruptions` count of 0, is available. This PR addresses that: the OSD PDB reconciler will be reconciled again if the `AllowedDisruptions` count in the main PDB is 0.

Signed-off-by: Santosh Pillai sapillai@redhat.com
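For illustration, here is a minimal sketch of the check described above, written as a fragment of the OSD PDB reconciler's Reconcile method. The identifiers `osdPDBAppName` and `getAllowedDisruptions` and the log messages mirror the review snippets below; the exact requeue result returned is an assumption, not necessarily the PR's diff.

```go
// Illustrative fragment only (assumed shape of the new check, not the exact diff).
allowedDisruptions, err := r.getAllowedDisruptions(osdPDBAppName, request.Namespace)
if err != nil {
	// The default PDB may legitimately be absent when blocking PDBs are in
	// place, so the error is only logged and the reconcile is skipped.
	logger.Debugf("failed to get allowed disruptions count from default osd pdb %q. Skipping reconcile. Error: %v", osdPDBAppName, err)
	return reconcile.Result{}, nil
}
if allowedDisruptions == 0 {
	// An OSD may already be down even though ReadyReplicas has not dropped
	// to 0 yet, so trigger another reconcile instead of waiting for the
	// next deployment event.
	logger.Info("reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0")
	return reconcile.Result{Requeue: true}, nil // assumed requeue result
}
return reconcile.Result{}, nil
```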

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@mergify mergify bot added the ceph (main ceph tag) label Sep 13, 2021
@sp98 sp98 marked this pull request as ready for review September 13, 2021 14:34
@sp98 sp98 requested a review from travisn September 13, 2021 14:34
allowedDisruptions, err := r.getAllowedDisruptions(osdPDBAppName, request.Namespace)
if err != nil {
	logger.Debugf("failed to get allowed disruptions count from default osd pdb %q. Skipping reconcile. Error: %v", osdPDBAppName, err)
	return reconcile.Result{}, nil
Member:

Shouldn't we return the error and requeue? It seems like a rare error condition.

Contributor (author):

The only possible error here is that the default PDB (osdPDBAppName) resource is not present, which would happen when blocking PDBs are present instead. Rook would not want to reconcile again in this situation, so I think it's safe to ignore this error.

Member:

How about if the getAllowedDisruptions() method returns the err directly instead of wrapping it, then here we could check for errors.IsNotFound(). Then it's obvious that we can safely ignore it if the pdb doesn't exist.
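Roughly, the suggested shape might be the following (a sketch under assumptions; `kerrors` stands for `k8s.io/apimachinery/pkg/api/errors`, and `errors.Wrapf` assumes `github.com/pkg/errors`):

```go
// Assumed caller once getAllowedDisruptions returns the raw API error
// instead of wrapping it.
allowedDisruptions, err := r.getAllowedDisruptions(osdPDBAppName, request.Namespace)
if err != nil {
	if kerrors.IsNotFound(err) {
		// The default PDB does not exist because blocking PDBs are in
		// place; it is safe to skip this check.
		return reconcile.Result{}, nil
	}
	return reconcile.Result{}, errors.Wrapf(err, "failed to get allowed disruptions count from default osd pdb %q", osdPDBAppName)
}
```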

}

if allowedDisruptions == 0 {
	logger.Info("reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0")
Member:

If an OSD is crashlooping, but the PGs are healthy because they have been backfilled, this will cause the pdb reconcile to continue every 30s until the OSD stops crashing, right? Or will this be skipped if the PGs are healthy?

Contributor (author):

When OSDs are crashlooping and PGs are not healthy, we won't see the default PDB (osdPDBAppName); only blocking PDBs would be there. This condition won't be hit because of the error in line 412.

It's a good point though. I'll test it a bit more and confirm.

Contributor (author) @sp98 sp98, Sep 14, 2021:

So the OSD goes into CLBO and the PGs become degraded. When the PGs become active again after re-balancing, we end up with the following:

Every 2.0s: oc get pdb -n rook-ceph                                     localhost.localdomain: Tue Sep 14 12:04:08 2021

NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mon-pdb   N/A             1                 1                     37m
rook-ceph-osd       N/A             1                 0                     112s

Observe that Allowed Disruptions is 0 because the OSD is still down. The PDB keeps reconciling because this condition is hit.

Now the user purges the OSD for the CLBO OSD pod. The PDBs move back to the desired state again:

$ oc get pdb -n rook-ceph 
NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mon-pdb   N/A             1                 1                     71m
rook-ceph-osd       N/A             1                 1                     36m

Member:

Ok, so the only way to return to the state of clean OSD PVCs is to remove/replace the crashing OSD. And from your link, the PDBs will continue reconciling every 15s? That might be worth looking into for a separate PR so it isn't so frequent.

Contributor (author):

Yes, that's true. The only way is to remove the crashed OSD. This behavior would have been the same even before this PR. I'll take a look (in a separate PR) at whether we can handle this situation better.

Member:

@sp98 please open up an issue to track this. Thanks
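As a rough, purely hypothetical idea of what such a follow-up could do (not part of this PR): back off to a longer fixed requeue interval while the crashing OSD is still present, so the reconcile does not fire every 15s.

```go
// Hypothetical follow-up sketch: requeue with a longer delay while the
// default OSD PDB still reports 0 allowed disruptions.
if allowedDisruptions == 0 {
	logger.Info("default osd pdb still has 0 allowed disruptions; requeueing with a longer delay")
	return reconcile.Result{RequeueAfter: 2 * time.Minute}, nil // interval is illustrative
}
```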

@sp98 sp98 requested review from leseb and travisn September 14, 2021 10:40

@leseb leseb merged commit bcae365 into rook:master Sep 15, 2021
travisn added a commit that referenced this pull request Sep 15, 2021
ceph: reconcile osd pdb if allowed disruption is 0 (backport #8698)
travisn added a commit that referenced this pull request Sep 15, 2021
ceph: reconcile osd pdb if allowed disruption is 0 (backport #8698)