
osd: add option upgradeOSDRequiresHealthyPGs #14040

Merged
merged 1 commit into rook:master from the upgrade branch on Apr 11, 2024

Conversation

mmaoyu
Contributor

@mmaoyu mmaoyu commented Apr 8, 2024

operator: add option upgradeOSDRequiresHealthyPGs

Checks whether the cluster's PGs are healthy before an OSD update,
which helps gain more confidence in OSD upgrades.

Resolves #13811

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@mmaoyu mmaoyu marked this pull request as draft April 8, 2024 07:23
@mmaoyu mmaoyu force-pushed the upgrade branch 5 times, most recently from d2e42c0 to 4c55120 Compare April 8, 2024 08:32
@mmaoyu mmaoyu marked this pull request as ready for review April 8, 2024 09:08
@mmaoyu
Contributor Author

mmaoyu commented Apr 8, 2024

I've tested locally, but the CI failed. Can someone run the failed job again? The error message shows a Docker restart error; the CI test environment may not be OK.

@zhucan
Member

zhucan commented Apr 8, 2024

@mmaoyu done

@@ -38,6 +38,7 @@ Settings can be specified at the global level to apply to the cluster as a whole
If this value is empty, each pod will get an ephemeral directory to store their config files that is tied to the lifetime of the pod running on that node. More details can be found in the Kubernetes [empty dir docs](https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).
* `skipUpgradeChecks`: if set to true Rook won't perform any upgrade checks on Ceph daemons during an upgrade. Use this at **YOUR OWN RISK**, only if you know what you're doing. To understand Rook's upgrade process of Ceph, read the [upgrade doc](../../Upgrade/rook-upgrade.md#ceph-version-upgrades).
* `continueUpgradeAfterChecksEvenIfNotHealthy`: if set to true Rook will continue the OSD daemon upgrade process even if the PGs are not clean, or continue with the MDS upgrade even the file system is not healthy.
* `upgradeOSDRequiresHealthyPGs`: if set to true Rook will wait util the PGs status is clean before the OSD daemon upgrade process.
Contributor

Suggested change
* `upgradeOSDRequiresHealthyPGs`: if set to true Rook will wait util the PGs status is clean before the OSD daemon upgrade process.
* `upgradeOSDRequiresHealthyPGs`: if set to true Rook will wait until the PGs status is clean before the OSD daemon upgrade process.

if !c.cluster.spec.SkipUpgradeChecks && c.cluster.spec.UpgradeOSDRequiresHealthyPGs {
pgHealthMsg, pgClean, err := cephclient.IsClusterClean(c.cluster.context, c.cluster.clusterInfo, c.cluster.spec.DisruptionManagement.PGHealthyRegex)
if err != nil {
logger.Errorf("failed to check PGs status to update osd,will try it again later. %v", err)
Contributor

Suggested change
logger.Errorf("failed to check PGs status to update osd,will try it again later. %v", err)
logger.Errorf("failed to check PGs status to update osd, will try it again later. %v", err)

@@ -43,6 +43,10 @@ spec:
# continue with the upgrade of an OSD even if its not ok to stop after the timeout. This timeout won't be applied if `skipUpgradeChecks` is `true`.
# The default wait timeout is 10 minutes.
waitTimeoutForHealthyOSDInMinutes: 10
# Whether or not requires PGs are clean before an OSD upgrade. If set to `true` osd upgrade process won't start until PGs are healthy.
# This configuration won't be applied if `skipUpgradeChecks` is `true`.
Member

Suggested change
# This configuration won't be applied if `skipUpgradeChecks` is `true`.
# This configuration will be ignored if `skipUpgradeChecks` is `true`.

return
}
if !pgClean {
logger.Infof("waiting PGs to be healthy to update osd. PGs status: %q", pgHealthMsg)
Member

Seems like there is not a retry? Or where do we wait or retry?

Contributor Author

Yes, it is a loop by the invoker actually. I've changed it.
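To make the "loop by the invoker" point concrete, here is a minimal, self-contained Go sketch of that retry pattern. The names below (osdUpdater, the pending slice, the requeue delay) are illustrative assumptions rather than Rook's actual implementation; the point is only that an early return from the update function is retried by the routine that keeps calling it.

```go
// Minimal sketch of the "retry by the invoker" pattern discussed above.
// All names here (pending queue, requeue delay, the pgClean stub) are
// illustrative assumptions, not Rook's actual code.
package main

import (
	"fmt"
	"time"
)

type osdUpdater struct {
	pending []int // OSD IDs that still need to be updated
}

// updateExistingOSDs does not block waiting for healthy PGs; it simply
// returns early, leaving the pending OSDs queued for the next pass.
func (u *osdUpdater) updateExistingOSDs(pgClean bool) {
	if !pgClean {
		fmt.Println("PGs are not healthy to update OSDs, will try updating them again later")
		return
	}
	fmt.Printf("updating OSDs: %v\n", u.pending)
	u.pending = nil
}

func main() {
	u := &osdUpdater{pending: []int{0, 1, 2}}
	// The invoker loops until the queue drains, so the early return above is
	// effectively retried on every iteration.
	for attempt := 0; len(u.pending) > 0; attempt++ {
		u.updateExistingOSDs(attempt >= 2) // pretend PGs become clean on the third pass
		time.Sleep(10 * time.Millisecond)  // stand-in for the operator's requeue delay
	}
}
```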

@mmaoyu mmaoyu force-pushed the upgrade branch 4 times, most recently from 5c1b209 to 6bc42a7 Compare April 9, 2024 07:52
@mmaoyu
Contributor Author

mmaoyu commented Apr 9, 2024

CI failed with error msg: Connection refused. Maybe the network is unstable; please run the job again.

Everything is done, please take a look at it again.

Member

@travisn travisn left a comment

Just a small logging suggestion and a question...
@BlaineEXE could you also review?

logger.Infof("PGs are not healthy to update osd, will try updating it again later. PGs status: %q", pgHealthMsg)
return
}
logger.Infof("PGs are healthy to update osd. %v", pgHealthMsg)
Member

Suggested change
logger.Infof("PGs are healthy to update osd. %v", pgHealthMsg)
logger.Infof("PGs are healthy to proceed updating OSDs. %v", pgHealthMsg)

pkg/operator/ceph/cluster/osd/update.go (review thread resolved)
@travisn travisn dismissed subhamkrai’s stale review April 10, 2024 17:23

feedback addressed

Comment on lines 20 to 21
skipUpgradeChecks: false
upgradeOSDRequiresHealthyPGs: false
Member

We try to keep cluster-test.yaml as slim as possible. Since these are both defaults and unlikely to need modification for test clusters, I think removing these is best.

@@ -121,6 +121,11 @@ cephClusterSpec:
# The default wait timeout is 10 minutes.
waitTimeoutForHealthyOSDInMinutes: 10

# Whether or not requires PGs are clean before an OSD upgrade. If set to `true` osd upgrade process won't start until PGs are healthy.
Member

First instance of OSD is capitalized. Second instance is not. We should strive for consistency of docs. I see other places where this applies also. Please update other instances too.

Suggested change
# Whether or not requires PGs are clean before an OSD upgrade. If set to `true` osd upgrade process won't start until PGs are healthy.
# Whether or not requires PGs are clean before an OSD upgrade. If set to `true` OSD upgrade process won't start until PGs are healthy.

@@ -38,6 +38,7 @@ Settings can be specified at the global level to apply to the cluster as a whole
If this value is empty, each pod will get an ephemeral directory to store their config files that is tied to the lifetime of the pod running on that node. More details can be found in the Kubernetes [empty dir docs](https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).
* `skipUpgradeChecks`: if set to true Rook won't perform any upgrade checks on Ceph daemons during an upgrade. Use this at **YOUR OWN RISK**, only if you know what you're doing. To understand Rook's upgrade process of Ceph, read the [upgrade doc](../../Upgrade/rook-upgrade.md#ceph-version-upgrades).
* `continueUpgradeAfterChecksEvenIfNotHealthy`: if set to true Rook will continue the OSD daemon upgrade process even if the PGs are not clean, or continue with the MDS upgrade even the file system is not healthy.
* `upgradeOSDRequiresHealthyPGs`: if set to true osd upgrade process won't start until PGs are healthy.
Member

Suggested change
* `upgradeOSDRequiresHealthyPGs`: if set to true osd upgrade process won't start until PGs are healthy.
* `upgradeOSDRequiresHealthyPGs`: if set to true OSD upgrade process won't start until PGs are healthy.

if !c.cluster.spec.SkipUpgradeChecks && c.cluster.spec.UpgradeOSDRequiresHealthyPGs {
pgHealthMsg, pgClean, err := cephclient.IsClusterClean(c.cluster.context, c.cluster.clusterInfo, c.cluster.spec.DisruptionManagement.PGHealthyRegex)
if err != nil {
logger.Errorf("failed to check PGs status to update osd, will try updating it again later. %v", err)
Member

If there is an error, Rook should return an error so that it is captured and reported to the controller. Either this function should return an error, or this logic should be moved to the routine that calls updateExistingOSDs()

Member

Agreed, let's change this to a warning

Member

Had a chat with Travis about this. This function doesn't return an error on purpose, so that can't be done. The given means of returning errors is through the provisionErrors parameter; however, that serves a different purpose. It does seem like this is the appropriate place in the code to do this check. However, I think logging errors should only be done when necessary. Our general strategy has been that if there is an error we can try to continue from, we report it as a warning, so please change this to logger.Warningf. For reference, OSDUpdateShouldCheckOkToStop() logs warnings when there are ceph cli errors as well, for this reason.
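Pulling the review feedback together, here is a self-contained sketch of how the gate could look once the CLI failure is reported with logger.Warningf and the retry is left to the caller. The clusterSpec struct and the isClusterClean stub are stand-ins for Rook's cluster spec and cephclient.IsClusterClean, added only so the sketch compiles on its own; it is not the exact merged code.

```go
// Self-contained sketch of the PG gate after the feedback above: ceph CLI
// failures are logged as warnings and the function returns early, leaving the
// retry to the caller. The types below are stand-ins for Rook's internals.
package main

import "log"

type clusterSpec struct {
	SkipUpgradeChecks            bool
	UpgradeOSDRequiresHealthyPGs bool
	PGHealthyRegex               string
}

// isClusterCleanFunc mirrors the shape of cephclient.IsClusterClean as used in
// the hunk above: it returns a PG status message and whether PGs are clean.
type isClusterCleanFunc func(pgHealthyRegex string) (pgHealthMsg string, pgClean bool, err error)

// okToUpdateOSDs reports whether the OSD update should proceed now.
func okToUpdateOSDs(spec clusterSpec, isClusterClean isClusterCleanFunc) bool {
	if spec.SkipUpgradeChecks || !spec.UpgradeOSDRequiresHealthyPGs {
		return true // gate disabled, behave as before
	}
	pgHealthMsg, pgClean, err := isClusterClean(spec.PGHealthyRegex)
	if err != nil {
		// A warning, not an error: the caller will attempt the update again later.
		log.Printf("warning: failed to check PGs status to update OSDs, will try updating them again later. %v", err)
		return false
	}
	if !pgClean {
		log.Printf("PGs are not healthy to update OSDs, will try updating them again later. PGs status: %q", pgHealthMsg)
		return false
	}
	log.Printf("PGs are healthy to proceed updating OSDs. %v", pgHealthMsg)
	return true
}

func main() {
	spec := clusterSpec{UpgradeOSDRequiresHealthyPGs: true}
	clean := func(string) (string, bool, error) { return "all PGs are active+clean", true, nil }
	if okToUpdateOSDs(spec, clean) {
		log.Println("starting OSD updates")
	}
}
```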

Check whether cluster PGs are healthy before OSD updates,
which helps gain more confidence in OSD upgrades.

Signed-off-by: mmaoyu <wupiaoyu61@gmail.com>
@BlaineEXE BlaineEXE merged commit 8998b32 into rook:master Apr 11, 2024
51 of 53 checks passed
BlaineEXE added a commit that referenced this pull request Apr 11, 2024
osd: add option upgradeOSDRequiresHealthyPGs (backport #14040)
@mmaoyu mmaoyu deleted the upgrade branch April 12, 2024 01:48
Development

Successfully merging this pull request may close these issues.

RFE: options for OSD restart gate
5 participants