
osd: add option upgradeOSDRequiresHealthyPGs #14040

Merged
merged 1 commit into rook:master from the upgrade branch on Apr 11, 2024

Conversation

mmaoyu
Contributor

@mmaoyu mmaoyu commented Apr 8, 2024

operator: add option upgradeOSDRequiresHealthyPGs

Checks whether the cluster's PGs are healthy before an OSD update,
which helps gain more confidence in OSD upgrades.

Resolves #13811

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

@mmaoyu mmaoyu marked this pull request as draft April 8, 2024 07:23
@mmaoyu mmaoyu force-pushed the upgrade branch 5 times, most recently from d2e42c0 to 4c55120 Compare April 8, 2024 08:32
@mmaoyu mmaoyu marked this pull request as ready for review April 8, 2024 09:08
@mmaoyu
Contributor Author

mmaoyu commented Apr 8, 2024

I've tested locally, but the CI failed. Can someone run the failed job again? The error message shows a Docker restart error; the CI test environment may not be OK.

@zhucan
Member

zhucan commented Apr 8, 2024

@mmaoyu done

@@ -38,6 +38,7 @@ Settings can be specified at the global level to apply to the cluster as a whole
If this value is empty, each pod will get an ephemeral directory to store their config files that is tied to the lifetime of the pod running on that node. More details can be found in the Kubernetes [empty dir docs](https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).
* `skipUpgradeChecks`: if set to true Rook won't perform any upgrade checks on Ceph daemons during an upgrade. Use this at **YOUR OWN RISK**, only if you know what you're doing. To understand Rook's upgrade process of Ceph, read the [upgrade doc](../../Upgrade/rook-upgrade.md#ceph-version-upgrades).
* `continueUpgradeAfterChecksEvenIfNotHealthy`: if set to true Rook will continue the OSD daemon upgrade process even if the PGs are not clean, or continue with the MDS upgrade even the file system is not healthy.
* `upgradeOSDRequiresHealthyPGs`: if set to true Rook will wait util the PGs status is clean before the OSD daemon upgrade process.
Contributor

Suggested change
* `upgradeOSDRequiresHealthyPGs`: if set to true Rook will wait util the PGs status is clean before the OSD daemon upgrade process.
* `upgradeOSDRequiresHealthyPGs`: if set to true Rook will wait until the PGs status is clean before the OSD daemon upgrade process.

if !c.cluster.spec.SkipUpgradeChecks && c.cluster.spec.UpgradeOSDRequiresHealthyPGs {
pgHealthMsg, pgClean, err := cephclient.IsClusterClean(c.cluster.context, c.cluster.clusterInfo, c.cluster.spec.DisruptionManagement.PGHealthyRegex)
if err != nil {
logger.Errorf("failed to check PGs status to update osd,will try it again later. %v", err)
Contributor

Suggested change
logger.Errorf("failed to check PGs status to update osd,will try it again later. %v", err)
logger.Errorf("failed to check PGs status to update osd, will try it again later. %v", err)

@@ -43,6 +43,10 @@ spec:
# continue with the upgrade of an OSD even if its not ok to stop after the timeout. This timeout won't be applied if `skipUpgradeChecks` is `true`.
# The default wait timeout is 10 minutes.
waitTimeoutForHealthyOSDInMinutes: 10
# Whether or not requires PGs are clean before an OSD upgrade. If set to `true` osd upgrade process won't start until PGs are healthy.
# This configuration won't be applied if `skipUpgradeChecks` is `true`.
Member

Suggested change
# This configuration won't be applied if `skipUpgradeChecks` is `true`.
# This configuration will be ignored if `skipUpgradeChecks` is `true`.

return
}
if !pgClean {
logger.Infof("waiting PGs to be healthy to update osd. PGs status: %q", pgHealthMsg)
Member

Seems like there is not a retry? Or where do we wait or retry?

Contributor Author

Yes, it is a loop by the invoker actually. I've changed it.
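To make the "loop by the invoker" point concrete, here is a minimal, self-contained Go sketch of that retry pattern. The names below (osdUpdater, the pending slice, the requeue delay) are illustrative assumptions rather than Rook's actual implementation; the point is only that an early return from the update function is retried by the routine that keeps calling it.

```go
// Minimal sketch of the "retry by the invoker" pattern discussed above.
// All names here (pending queue, requeue delay, the pgClean stub) are
// illustrative assumptions, not Rook's actual code.
package main

import (
	"fmt"
	"time"
)

type osdUpdater struct {
	pending []int // OSD IDs that still need to be updated
}

// updateExistingOSDs does not block waiting for healthy PGs; it simply
// returns early, leaving the pending OSDs queued for the next pass.
func (u *osdUpdater) updateExistingOSDs(pgClean bool) {
	if !pgClean {
		fmt.Println("PGs are not healthy to update OSDs, will try updating them again later")
		return
	}
	fmt.Printf("updating OSDs: %v\n", u.pending)
	u.pending = nil
}

func main() {
	u := &osdUpdater{pending: []int{0, 1, 2}}
	// The invoker loops until the queue drains, so the early return above is
	// effectively retried on every iteration.
	for attempt := 0; len(u.pending) > 0; attempt++ {
		u.updateExistingOSDs(attempt >= 2) // pretend PGs become clean on the third pass
		time.Sleep(10 * time.Millisecond)  // stand-in for the operator's requeue delay
	}
}
```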

@mmaoyu mmaoyu force-pushed the upgrade branch 4 times, most recently from 5c1b209 to 6bc42a7 Compare April 9, 2024 07:52
@mmaoyu
Contributor Author

mmaoyu commented Apr 9, 2024

CI failed with error msg: Connection refused. Maybe the network is unstable; please run the job again.

Everything is done, please take a look at it again.

Member

@travisn travisn left a comment

Just a small logging suggestion and a question...
@BlaineEXE could you also review?

logger.Infof("PGs are not healthy to update osd, will try updating it again later. PGs status: %q", pgHealthMsg)
return
}
logger.Infof("PGs are healthy to update osd. %v", pgHealthMsg)
Member

Suggested change
logger.Infof("PGs are healthy to update osd. %v", pgHealthMsg)
logger.Infof("PGs are healthy to proceed updating OSDs. %v", pgHealthMsg)

pkg/operator/ceph/cluster/osd/update.go (review thread resolved)
@travisn travisn dismissed subhamkrai’s stale review April 10, 2024 17:23

feedback addressed

Comment on lines 20 to 21
skipUpgradeChecks: false
upgradeOSDRequiresHealthyPGs: false
Member

We try to keep cluster-test.yaml as slim as possible. Since these are both defaults and unlikely to need modification for test clusters, I think removing these is best.

@@ -121,6 +121,11 @@ cephClusterSpec:
# The default wait timeout is 10 minutes.
waitTimeoutForHealthyOSDInMinutes: 10

# Whether or not requires PGs are clean before an OSD upgrade. If set to `true` osd upgrade process won't start until PGs are healthy.
Member

First instance of OSD is capitalized. Second instance is not. We should strive for consistency of docs. I see other places where this applies also. Please update other instances too.

Suggested change
# Whether or not requires PGs are clean before an OSD upgrade. If set to `true` osd upgrade process won't start until PGs are healthy.
# Whether or not requires PGs are clean before an OSD upgrade. If set to `true` OSD upgrade process won't start until PGs are healthy.

@@ -38,6 +38,7 @@ Settings can be specified at the global level to apply to the cluster as a whole
If this value is empty, each pod will get an ephemeral directory to store their config files that is tied to the lifetime of the pod running on that node. More details can be found in the Kubernetes [empty dir docs](https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).
* `skipUpgradeChecks`: if set to true Rook won't perform any upgrade checks on Ceph daemons during an upgrade. Use this at **YOUR OWN RISK**, only if you know what you're doing. To understand Rook's upgrade process of Ceph, read the [upgrade doc](../../Upgrade/rook-upgrade.md#ceph-version-upgrades).
* `continueUpgradeAfterChecksEvenIfNotHealthy`: if set to true Rook will continue the OSD daemon upgrade process even if the PGs are not clean, or continue with the MDS upgrade even the file system is not healthy.
* `upgradeOSDRequiresHealthyPGs`: if set to true osd upgrade process won't start until PGs are healthy.
Member

Suggested change
* `upgradeOSDRequiresHealthyPGs`: if set to true osd upgrade process won't start until PGs are healthy.
* `upgradeOSDRequiresHealthyPGs`: if set to true OSD upgrade process won't start until PGs are healthy.

if !c.cluster.spec.SkipUpgradeChecks && c.cluster.spec.UpgradeOSDRequiresHealthyPGs {
pgHealthMsg, pgClean, err := cephclient.IsClusterClean(c.cluster.context, c.cluster.clusterInfo, c.cluster.spec.DisruptionManagement.PGHealthyRegex)
if err != nil {
logger.Errorf("failed to check PGs status to update osd, will try updating it again later. %v", err)
Member

If there is an error, Rook should return an error so that it is captured and reported to the controller. Either this function should return an error, or this logic should be moved to the routine that calls updateExistingOSDs()

Member

Agreed, let's change this to a warning

Member

Had a chat with Travis about this. This function doesn't return an error on purpose, so that can't be done. The given means of returning errors is through the provisionErrors parameter; however, that serves a different purpose. It does seem like this is the appropriate place in the code to do this check. However, I think logging errors should only be done when necessary. Our general strategy has been that if there is an error we can try to continue from, we report it as a warning, so please change this to logger.Warningf. For reference, OSDUpdateShouldCheckOkToStop() logs warnings when there are ceph cli errors as well, for this reason.
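Pulling the review feedback together, here is a self-contained sketch of how the gate could look once the CLI failure is reported with logger.Warningf and the retry is left to the caller. The clusterSpec struct and the isClusterClean stub are stand-ins for Rook's cluster spec and cephclient.IsClusterClean, added only so the sketch compiles on its own; it is not the exact merged code.

```go
// Self-contained sketch of the PG gate after the feedback above: ceph CLI
// failures are logged as warnings and the function returns early, leaving the
// retry to the caller. The types below are stand-ins for Rook's internals.
package main

import "log"

type clusterSpec struct {
	SkipUpgradeChecks            bool
	UpgradeOSDRequiresHealthyPGs bool
	PGHealthyRegex               string
}

// isClusterCleanFunc mirrors the shape of cephclient.IsClusterClean as used in
// the hunk above: it returns a PG status message and whether PGs are clean.
type isClusterCleanFunc func(pgHealthyRegex string) (pgHealthMsg string, pgClean bool, err error)

// okToUpdateOSDs reports whether the OSD update should proceed now.
func okToUpdateOSDs(spec clusterSpec, isClusterClean isClusterCleanFunc) bool {
	if spec.SkipUpgradeChecks || !spec.UpgradeOSDRequiresHealthyPGs {
		return true // gate disabled, behave as before
	}
	pgHealthMsg, pgClean, err := isClusterClean(spec.PGHealthyRegex)
	if err != nil {
		// A warning, not an error: the caller will attempt the update again later.
		log.Printf("warning: failed to check PGs status to update OSDs, will try updating them again later. %v", err)
		return false
	}
	if !pgClean {
		log.Printf("PGs are not healthy to update OSDs, will try updating them again later. PGs status: %q", pgHealthMsg)
		return false
	}
	log.Printf("PGs are healthy to proceed updating OSDs. %v", pgHealthMsg)
	return true
}

func main() {
	spec := clusterSpec{UpgradeOSDRequiresHealthyPGs: true}
	clean := func(string) (string, bool, error) { return "all PGs are active+clean", true, nil }
	if okToUpdateOSDs(spec, clean) {
		log.Println("starting OSD updates")
	}
}
```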

Check whether cluster PGs are healthy before OSD updates,
which helps gain more confidence in OSD upgrades.

Signed-off-by: mmaoyu <wupiaoyu61@gmail.com>
@BlaineEXE BlaineEXE merged commit 8998b32 into rook:master Apr 11, 2024
51 of 53 checks passed
BlaineEXE added a commit that referenced this pull request Apr 11, 2024
osd: add option upgradeOSDRequiresHealthyPGs (backport #14040)
@mmaoyu mmaoyu deleted the upgrade branch April 12, 2024 01:48
Development

Successfully merging this pull request may close these issues.

RFE: options for OSD restart gate
5 participants