ceph: reconcile osd pdb if allowed disruption is 0 #8698
Conversation
allowedDisruptions, err := r.getAllowedDisruptions(osdPDBAppName, request.Namespace)
if err != nil {
    logger.Debugf("failed to get allowed disruptions count from default osd pdb %q. Skipping reconcile. Error: %v", osdPDBAppName, err)
    return reconcile.Result{}, nil
Shouldn't we return the error and requeue? It seems like a rare error condition.
The only possible error here is that the default PDB (osdPDBAppName) resource is not present. That happens when blocking PDBs are present instead. Rook would not want to reconcile again in that situation, so I think it's safe to ignore this error.
How about if the getAllowedDisruptions() method returns the err directly instead of wrapping it? Then here we could check for errors.IsNotFound(), and it's obvious that we can safely ignore it if the PDB doesn't exist.
}
if allowedDisruptions == 0 {
    logger.Info("reconciling osd pdb reconciler as the allowed disruptions in default pdb is 0")
If an OSD is crashlooping, but the PGs are healthy because they have been backfilled, this will cause the pdb reconcile to continue every 30s until the OSD stops crashing, right? Or will this be skipped if the PGs are healthy?
When OSDs are crashlooping and PGs are not healthy, we won't see the default PDB (osdPDBAppName); only blocking PDBs would be there. This condition won't be hit because of the error in line 412.
It's a good point though. I'll test it a bit more and confirm.
So the OSD goes into CLBO and PGs become degraded. When the PGs become active again after re-balancing, we end up with the following:

Every 2.0s: oc get pdb -n rook-ceph          localhost.localdomain: Tue Sep 14 12:04:08 2021

NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mon-pdb   N/A             1                 1                     37m
rook-ceph-osd       N/A             1                 0                     112s

Observe that ALLOWED DISRUPTIONS is 0 because the OSD is still down. The PDB reconciles because this condition is hit.

Now the user purges the OSD backing the CLBO OSD pod, and the PDBs move back to the desired state again:

$ oc get pdb -n rook-ceph
NAME                MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
rook-ceph-mon-pdb   N/A             1                 1                     71m
rook-ceph-osd       N/A             1                 1                     36m
Ok, so the only way to return to the state of clean OSD PVCs is to remove/replace the crashing OSD. And from your link, the PDBs will continue reconciling every 15s? That might be worth looking into for a separate PR so it isn't so frequent.
Yes, that's true. The only way is to remove the crashed OSD. This behavior would have been the same even before this PR. I'll take a look (in a separate PR) at whether we can handle this situation better.
@sp98 please open up an issue to track this. Thanks
Rook checks for down OSDs by checking the `ReadyReplicas` count in the OSD deployment. When an OSD pod goes into CLBO due to disk failure, there is a delay before this `ReadyReplicas` count becomes 0. The delay is very small but may result in Rook missing the OSD down event. As a result, no blocking PDBs will be created and only the default PDB with an `AllowedDisruptions` count of 0 is available. This PR tries to solve this: the OSD PDB reconciler will be reconciled again if the `AllowedDisruptions` count in the main PDB is 0.

Signed-off-by: Santosh Pillai <sapillai@redhat.com>
ceph: reconcile osd pdb if allowed disruption is 0 (backport #8698)
Description of your changes:
Which issue is resolved by this Pull Request:
Resolves #
Checklist:
- `make codegen` has been run to update object specifications, if necessary.