ceph: reconcile osd pdb if allowed disruption is 0 #8698

Merged
merged 1 commit into rook:master on Sep 15, 2021

Conversation

@sp98 sp98 (Contributor) commented Sep 13, 2021

Rook checks for down OSDs by checking the `ReadyReplicas` count in the OSD deployment. When an OSD pod goes into CrashLoopBackOff (CLBO) due to disk failure, there is a delay before this `ReadyReplicas` count becomes 0. The delay is very small but may result in Rook missing the OSD down event. As a result, no blocking PDBs will be created and only the default PDB, with an `AllowedDisruptions` count of 0, is available. This PR addresses that: the OSD PDB reconciler will be reconciled again if the `AllowedDisruptions` count in the main PDB is 0.

Signed-off-by: Santosh Pillai sapillai@redhat.com
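For illustration, here is a minimal sketch of the check described above, written as a fragment of the OSD PDB reconciler's Reconcile method. The identifiers `osdPDBAppName` and `getAllowedDisruptions` and the log messages mirror the review snippets below; the exact requeue result returned is an assumption, not necessarily the PR's diff.

```go
// Illustrative fragment only (assumed shape of the new check, not the exact diff).
allowedDisruptions, err := r.getAllowedDisruptions(osdPDBAppName, request.Namespace)
if err != nil {
	// The default PDB may legitimately be absent when blocking PDBs are in
	// place, so the error is only logged and the reconcile is skipped.
	logger.Debugf("failed to get allowed disruptions count from default osd pdb %q. Skipping reconcile. Error: %v", osdPDBAppName, err)
	return reconcile.Result{}, nil
}
if allowedDisruptions == 0 {
	// An OSD may already be down even though ReadyReplicas has not dropped
	// to 0 yet, so trigger another reconcile instead of waiting for the
	// next deployment event.
	logger.Info("reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0")
	return reconcile.Result{Requeue: true}, nil // assumed requeue result
}
return reconcile.Result{}, nil
```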

Description of your changes:

Which issue is resolved by this Pull Request:
Resolves #

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.

@mergify mergify bot added the ceph (main ceph tag) label Sep 13, 2021
@sp98 sp98 marked this pull request as ready for review September 13, 2021 14:34
@sp98 sp98 requested a review from travisn September 13, 2021 14:34
allowedDisruptions, err := r.getAllowedDisruptions(osdPDBAppName, request.Namespace)
if err != nil {
	logger.Debugf("failed to get allowed disruptions count from default osd pdb %q. Skipping reconcile. Error: %v", osdPDBAppName, err)
	return reconcile.Result{}, nil
Member:

Shouldn't we return the error and requeue? It seems like a rare error condition.

Contributor (author):

The only possible error here is that the default PDB (osdPDBAppName) resource is not present, which would happen when blocking PDBs are present instead. Rook would not want to reconcile again in this situation, so I think it's safe to ignore this error.

Member:

How about if the getAllowedDisruptions() method returns the err directly instead of wrapping it, then here we could check for errors.IsNotFound(). Then it's obvious that we can safely ignore it if the pdb doesn't exist.
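Roughly, the suggested shape might be the following (a sketch under assumptions; `kerrors` stands for `k8s.io/apimachinery/pkg/api/errors`, and `errors.Wrapf` assumes `github.com/pkg/errors`):

```go
// Assumed caller once getAllowedDisruptions returns the raw API error
// instead of wrapping it.
allowedDisruptions, err := r.getAllowedDisruptions(osdPDBAppName, request.Namespace)
if err != nil {
	if kerrors.IsNotFound(err) {
		// The default PDB does not exist because blocking PDBs are in
		// place; it is safe to skip this check.
		return reconcile.Result{}, nil
	}
	return reconcile.Result{}, errors.Wrapf(err, "failed to get allowed disruptions count from default osd pdb %q", osdPDBAppName)
}
```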

}

if allowedDisruptions == 0 {
	logger.Info("reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0")
Member:

If an OSD is crashlooping, but the PGs are healthy because they have been backfilled, this will cause the pdb reconcile to continue every 30s until the OSD stops crashing, right? Or will this be skipped if the PGs are healthy?

Contributor (author):

When OSDs are crashlooping and PGs are not healthy, we won't see the default PDB (osdPDBAppName); only blocking PDBs would be there. This condition won't be hit because of the error in line 412.

It's a good point though. I'll test it a bit more and confirm.

Contributor (author) @sp98 sp98, Sep 14, 2021:

So the OSD goes into CLBO and the PGs become degraded. When the PGs become active again after re-balancing, we end up with the following:

Every 2.0s: oc get pdb -n rook-ceph                                     localhost.localdomain: Tue Sep 14 12:04:08 2021

NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mon-pdb   N/A             1                 1                     37m
rook-ceph-osd       N/A             1                 0                     112s

Observe that Allowed Disruptions is 0 because the OSD is still down. The PDB keeps reconciling because this condition is hit.

Now the user purges the OSD for the CLBO OSD pod. The PDBs move back to the desired state again:

$ oc get pdb -n rook-ceph 
NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mon-pdb   N/A             1                 1                     71m
rook-ceph-osd       N/A             1                 1                     36m

Member:

Ok, so the only way to return to the state of clean OSD PVCs is to remove/replace the crashing OSD. And from your link, the PDBs will continue reconciling every 15s? That might be worth looking into for a separate PR so it isn't so frequent.

Contributor (author):

Yes, that's true. The only way is to remove the crashed OSD. This behavior would have been the same even before this PR. I'll take a look (in a separate PR) at whether we can handle this situation better.

Member:

@sp98 please open up an issue to track this. Thanks
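As a rough, purely hypothetical idea of what such a follow-up could do (not part of this PR): back off to a longer fixed requeue interval while the crashing OSD is still present, so the reconcile does not fire every 15s.

```go
// Hypothetical follow-up sketch: requeue with a longer delay while the
// default OSD PDB still reports 0 allowed disruptions.
if allowedDisruptions == 0 {
	logger.Info("default osd pdb still has 0 allowed disruptions; requeueing with a longer delay")
	return reconcile.Result{RequeueAfter: 2 * time.Minute}, nil // interval is illustrative
}
```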

@sp98 sp98 requested review from leseb and travisn September 14, 2021 10:40

@leseb leseb merged commit bcae365 into rook:master Sep 15, 2021
travisn added a commit that referenced this pull request Sep 15, 2021
ceph: reconcile osd pdb if allowed disruption is 0 (backport #8698)
travisn added a commit that referenced this pull request Sep 15, 2021
ceph: reconcile osd pdb if allowed disruption is 0 (backport #8698)