osd: add option upgradeOSDRequiresHealthyPGs #14040
Conversation
force-pushed from d2e42c0 to 4c55120
I've tested locally, but the CI failed. Can someone run the failed job again? The error message shows a docker restart error, so the CI test environment may not be OK.
@mmaoyu done
@@ -38,6 +38,7 @@ Settings can be specified at the global level to apply to the cluster as a whole
If this value is empty, each pod will get an ephemeral directory to store their config files that is tied to the lifetime of the pod running on that node. More details can be found in the Kubernetes [empty dir docs](https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).
* `skipUpgradeChecks`: if set to true Rook won't perform any upgrade checks on Ceph daemons during an upgrade. Use this at **YOUR OWN RISK**, only if you know what you're doing. To understand Rook's upgrade process of Ceph, read the [upgrade doc](../../Upgrade/rook-upgrade.md#ceph-version-upgrades).
* `continueUpgradeAfterChecksEvenIfNotHealthy`: if set to true Rook will continue the OSD daemon upgrade process even if the PGs are not clean, or continue with the MDS upgrade even the file system is not healthy.
* `upgradeOSDRequiresHealthyPGs`: if set to true Rook will wait util the PGs status is clean before the OSD daemon upgrade process.
Suggested change:
- * `upgradeOSDRequiresHealthyPGs`: if set to true Rook will wait util the PGs status is clean before the OSD daemon upgrade process.
+ * `upgradeOSDRequiresHealthyPGs`: if set to true Rook will wait until the PGs status is clean before the OSD daemon upgrade process.
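As a rough illustration of the documented semantics, here is a minimal, self-contained Go sketch of how the two settings combine; the `clusterSpec` struct and the `shouldWaitForHealthyPGs` helper are hypothetical stand-ins for this example, not Rook code:

```go
package main

import "fmt"

// clusterSpec is a hypothetical stand-in mirroring only the two fields
// relevant to this PR; the real Rook cluster spec has many more settings.
type clusterSpec struct {
	SkipUpgradeChecks            bool
	UpgradeOSDRequiresHealthyPGs bool
}

// shouldWaitForHealthyPGs expresses the documented behavior: the PG-health
// gate applies only when upgrade checks are not skipped entirely.
func shouldWaitForHealthyPGs(spec clusterSpec) bool {
	return !spec.SkipUpgradeChecks && spec.UpgradeOSDRequiresHealthyPGs
}

func main() {
	spec := clusterSpec{SkipUpgradeChecks: false, UpgradeOSDRequiresHealthyPGs: true}
	// Prints true: the OSD upgrade should wait for clean PGs.
	fmt.Println(shouldWaitForHealthyPGs(spec))
}
```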
if !c.cluster.spec.SkipUpgradeChecks && c.cluster.spec.UpgradeOSDRequiresHealthyPGs {
	pgHealthMsg, pgClean, err := cephclient.IsClusterClean(c.cluster.context, c.cluster.clusterInfo, c.cluster.spec.DisruptionManagement.PGHealthyRegex)
	if err != nil {
		logger.Errorf("failed to check PGs status to update osd,will try it again later. %v", err)
Suggested change:
- logger.Errorf("failed to check PGs status to update osd,will try it again later. %v", err)
+ logger.Errorf("failed to check PGs status to update osd, will try it again later. %v", err)
deploy/examples/cluster.yaml (Outdated)
@@ -43,6 +43,10 @@ spec:
  # continue with the upgrade of an OSD even if its not ok to stop after the timeout. This timeout won't be applied if `skipUpgradeChecks` is `true`.
  # The default wait timeout is 10 minutes.
  waitTimeoutForHealthyOSDInMinutes: 10
  # Whether or not requires PGs are clean before an OSD upgrade. If set to `true` osd upgrade process won't start until PGs are healthy.
  # This configuration won't be applied if `skipUpgradeChecks` is `true`.
Suggested change:
- # This configuration won't be applied if `skipUpgradeChecks` is `true`.
+ # This configuration will be ignored if `skipUpgradeChecks` is `true`.
		return
	}
	if !pgClean {
		logger.Infof("waiting PGs to be healthy to update osd. PGs status: %q", pgHealthMsg)
Seems like there is no retry? Or where do we wait or retry?
Yes, it is actually a loop driven by the invoker. I've changed it.
force-pushed from 5c1b209 to 6bc42a7
CI failed with an error msg. Everything is done, please take a look at it again.
Just a small logging suggestion and a question...
@BlaineEXE could you also review?
logger.Infof("PGs are not healthy to update osd, will try updating it again later. PGs status: %q", pgHealthMsg) | ||
return | ||
} | ||
logger.Infof("PGs are healthy to update osd. %v", pgHealthMsg) |
logger.Infof("PGs are healthy to update osd. %v", pgHealthMsg) | |
logger.Infof("PGs are healthy to proceed updating OSDs. %v", pgHealthMsg) |
deploy/examples/cluster-test.yaml (Outdated)
skipUpgradeChecks: false
upgradeOSDRequiresHealthyPGs: false
We try to keep cluster-test.yaml as slim as possible. Since these are both defaults and unlikely to need modification for test clusters, I think removing these is best.
@@ -121,6 +121,11 @@ cephClusterSpec:
  # The default wait timeout is 10 minutes.
  waitTimeoutForHealthyOSDInMinutes: 10

  # Whether or not requires PGs are clean before an OSD upgrade. If set to `true` osd upgrade process won't start until PGs are healthy.
The first instance of OSD is capitalized, the second is not. We should strive for consistency in the docs. I see other places where this applies as well; please update those instances too.
Suggested change:
- # Whether or not requires PGs are clean before an OSD upgrade. If set to `true` osd upgrade process won't start until PGs are healthy.
+ # Whether or not requires PGs are clean before an OSD upgrade. If set to `true` OSD upgrade process won't start until PGs are healthy.
@@ -38,6 +38,7 @@ Settings can be specified at the global level to apply to the cluster as a whole
If this value is empty, each pod will get an ephemeral directory to store their config files that is tied to the lifetime of the pod running on that node. More details can be found in the Kubernetes [empty dir docs](https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).
* `skipUpgradeChecks`: if set to true Rook won't perform any upgrade checks on Ceph daemons during an upgrade. Use this at **YOUR OWN RISK**, only if you know what you're doing. To understand Rook's upgrade process of Ceph, read the [upgrade doc](../../Upgrade/rook-upgrade.md#ceph-version-upgrades).
* `continueUpgradeAfterChecksEvenIfNotHealthy`: if set to true Rook will continue the OSD daemon upgrade process even if the PGs are not clean, or continue with the MDS upgrade even the file system is not healthy.
* `upgradeOSDRequiresHealthyPGs`: if set to true osd upgrade process won't start until PGs are healthy.
Suggested change:
- * `upgradeOSDRequiresHealthyPGs`: if set to true osd upgrade process won't start until PGs are healthy.
+ * `upgradeOSDRequiresHealthyPGs`: if set to true OSD upgrade process won't start until PGs are healthy.
if !c.cluster.spec.SkipUpgradeChecks && c.cluster.spec.UpgradeOSDRequiresHealthyPGs {
	pgHealthMsg, pgClean, err := cephclient.IsClusterClean(c.cluster.context, c.cluster.clusterInfo, c.cluster.spec.DisruptionManagement.PGHealthyRegex)
	if err != nil {
		logger.Errorf("failed to check PGs status to update osd, will try updating it again later. %v", err)
If there is an error, Rook should return an error so that it is captured and reported to the controller. Either this function should return an error, or this logic should be moved to the routine that calls updateExistingOSDs().
Agreed, let's change this to a warning
Had a chat with Travis about this. This function doesn't return an error on purpose, so that can't be done. The given means of returning errors is through the provisionErrors parameter; however, that doesn't fit this case. And it does seem like this is the appropriate place in the code to do this check. However, I think logging errors should only be done when necessary. Our general strategy has been that if there is an error we can continue from, we report it as a warning, so please change this to logger.Warningf. For reference, OSDUpdateShouldCheckOkToStop() logs warnings when there are ceph cli errors as well, for this reason.
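To make the agreed-upon flow concrete, here is a small, self-contained Go sketch, not the actual Rook code: `okToUpdateOSDs`, the `pgChecker` type, and the plain `log` calls are illustrative stand-ins for the real controller routine, `cephclient.IsClusterClean`, and Rook's logger. A CLI failure is logged at warning level and the function returns early, so the calling reconcile loop simply retries on its next pass.

```go
package main

import "log"

// pgChecker stands in for cephclient.IsClusterClean: it reports a status
// message, whether the PGs are clean, and any CLI error.
type pgChecker func() (healthMsg string, clean bool, err error)

// okToUpdateOSDs returns true when the OSD update may proceed. A false
// return means "skip this pass"; the caller's reconcile loop retries later,
// which is why errors are only logged as warnings here.
func okToUpdateOSDs(skipUpgradeChecks, requireHealthyPGs bool, check pgChecker) bool {
	if skipUpgradeChecks || !requireHealthyPGs {
		return true // gate disabled, behave as before this change
	}
	msg, clean, err := check()
	if err != nil {
		log.Printf("warning: failed to check PG status before OSD update, will retry later: %v", err)
		return false
	}
	if !clean {
		log.Printf("PGs are not healthy to update OSDs, will retry later. PG status: %q", msg)
		return false
	}
	log.Printf("PGs are healthy to proceed updating OSDs. %v", msg)
	return true
}

func main() {
	// Simulate a cluster whose PGs are not yet clean.
	notClean := func() (string, bool, error) { return "HEALTH_WARN: 3 pgs degraded", false, nil }
	if !okToUpdateOSDs(false, true, notClean) {
		log.Println("OSD update deferred to the next reconcile")
	}
}
```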
Check if cluster PGs are healthy before OSD updates, which helps gain more confidence in the OSD upgrades.

Signed-off-by: mmaoyu <wupiaoyu61@gmail.com>
osd: add option upgradeOSDRequiresHealthyPGs (backport #14040)
operator: add option upgradeOSDRequiresHealthyPGs
Checks if cluster PGs are healthy before an OSD update, which helps gain more confidence in the OSD upgrades.
Resolves #13811
Checklist: