osd: add option upgradeOSDRequiresHealthyPGs
Check if cluster PGs are healthy before OSD updates,
which helps gain more confidence in OSD upgrades.

Signed-off-by: mmaoyu <wupiaoyu61@gmail.com>
mmaoyu committed Apr 11, 2024
1 parent f0a793c commit 6d4afe2
Showing 9 changed files with 112 additions and 0 deletions.
1 change: 1 addition & 0 deletions Documentation/CRDs/Cluster/ceph-cluster-crd.md
@@ -38,6 +38,7 @@ Settings can be specified at the global level to apply to the cluster as a whole
If this value is empty, each pod will get an ephemeral directory to store their config files that is tied to the lifetime of the pod running on that node. More details can be found in the Kubernetes [empty dir docs](https://kubernetes.io/docs/concepts/storage/volumes/#emptydir).
* `skipUpgradeChecks`: if set to true Rook won't perform any upgrade checks on Ceph daemons during an upgrade. Use this at **YOUR OWN RISK**, only if you know what you're doing. To understand Rook's upgrade process of Ceph, read the [upgrade doc](../../Upgrade/rook-upgrade.md#ceph-version-upgrades).
* `continueUpgradeAfterChecksEvenIfNotHealthy`: if set to true, Rook will continue the OSD daemon upgrade process even if the PGs are not clean, or continue with the MDS upgrade even if the file system is not healthy.
* `upgradeOSDRequiresHealthyPGs`: if set to true, the OSD upgrade process won't start until the PGs are healthy.
* `dashboard`: Settings for the Ceph dashboard. To view the dashboard in your browser see the [dashboard guide](../../Storage-Configuration/Monitoring/ceph-dashboard.md).
* `enabled`: Whether to enable the dashboard to view cluster status
* `urlPrefix`: Allows to serve the dashboard under a subpath (useful when you are accessing the dashboard via a reverse proxy)
28 changes: 28 additions & 0 deletions Documentation/CRDs/specification.md
@@ -945,6 +945,20 @@ The default wait timeout is 10 minutes.</p>
</tr>
<tr>
<td>
<code>upgradeOSDRequiresHealthyPGs</code><br/>
<em>
bool
</em>
</td>
<td>
<em>(Optional)</em>
<p>UpgradeOSDRequiresHealthyPGs defines whether an OSD upgrade requires PGs to be clean. If set to <code>true</code>, the OSD upgrade process won&rsquo;t start until the PGs are healthy.
This configuration will be ignored if <code>skipUpgradeChecks</code> is <code>true</code>.
Default is false.</p>
</td>
</tr>
<tr>
<td>
<code>disruptionManagement</code><br/>
<em>
<a href="#ceph.rook.io/v1.DisruptionManagementSpec">
@@ -4418,6 +4432,20 @@ The default wait timeout is 10 minutes.</p>
</tr>
<tr>
<td>
<code>upgradeOSDRequiresHealthyPGs</code><br/>
<em>
bool
</em>
</td>
<td>
<em>(Optional)</em>
<p>UpgradeOSDRequiresHealthyPGs defines whether an OSD upgrade requires PGs to be clean. If set to <code>true</code>, the OSD upgrade process won&rsquo;t start until the PGs are healthy.
This configuration will be ignored if <code>skipUpgradeChecks</code> is <code>true</code>.
Default is false.</p>
</td>
</tr>
<tr>
<td>
<code>disruptionManagement</code><br/>
<em>
<a href="#ceph.rook.io/v1.DisruptionManagementSpec">
5 changes: 5 additions & 0 deletions deploy/charts/rook-ceph-cluster/values.yaml
@@ -121,6 +121,11 @@ cephClusterSpec:
# The default wait timeout is 10 minutes.
waitTimeoutForHealthyOSDInMinutes: 10

# Whether or not to require that PGs are clean before an OSD upgrade. If set to `true`, the OSD upgrade process won't start until the PGs are healthy.
# This configuration will be ignored if `skipUpgradeChecks` is `true`.
# Default is false.
upgradeOSDRequiresHealthyPGs: false

mon:
# Set the number of mons to be started. Generally recommended to be 3.
# For highest availability, an odd number of mons should be specified.
6 changes: 6 additions & 0 deletions deploy/charts/rook-ceph/templates/resources.yaml
@@ -5004,6 +5004,12 @@ spec:
type: object
type: array
type: object
upgradeOSDRequiresHealthyPGs:
description: |-
UpgradeOSDRequiresHealthyPGs defines whether an OSD upgrade requires PGs to be clean. If set to `true`, the OSD upgrade process won't start until the PGs are healthy.
This configuration will be ignored if `skipUpgradeChecks` is `true`.
Default is false.
type: boolean
waitTimeoutForHealthyOSDInMinutes:
description: |-
WaitTimeoutForHealthyOSDInMinutes defines the time the operator would wait before an OSD can be stopped for upgrade or restart.
4 changes: 4 additions & 0 deletions deploy/examples/cluster.yaml
@@ -43,6 +43,10 @@ spec:
# continue with the upgrade of an OSD even if it's not ok to stop after the timeout. This timeout won't be applied if `skipUpgradeChecks` is `true`.
# The default wait timeout is 10 minutes.
waitTimeoutForHealthyOSDInMinutes: 10
# Whether or not to require that PGs are clean before an OSD upgrade. If set to `true`, the OSD upgrade process won't start until the PGs are healthy.
# This configuration will be ignored if `skipUpgradeChecks` is `true`.
# Default is false.
upgradeOSDRequiresHealthyPGs: false
mon:
# Set the number of mons to be started. Generally recommended to be 3.
# For highest availability, an odd number of mons should be specified.
6 changes: 6 additions & 0 deletions deploy/examples/crds.yaml
@@ -5002,6 +5002,12 @@ spec:
type: object
type: array
type: object
upgradeOSDRequiresHealthyPGs:
description: |-
UpgradeOSDRequiresHealthyPGs defines whether an OSD upgrade requires PGs to be clean. If set to `true`, the OSD upgrade process won't start until the PGs are healthy.
This configuration will be ignored if `skipUpgradeChecks` is `true`.
Default is false.
type: boolean
waitTimeoutForHealthyOSDInMinutes:
description: |-
WaitTimeoutForHealthyOSDInMinutes defines the time the operator would wait before an OSD can be stopped for upgrade or restart.
6 changes: 6 additions & 0 deletions pkg/apis/ceph.rook.io/v1/types.go
@@ -162,6 +162,12 @@ type ClusterSpec struct {
// +optional
WaitTimeoutForHealthyOSDInMinutes time.Duration `json:"waitTimeoutForHealthyOSDInMinutes,omitempty"`

// UpgradeOSDRequiresHealthyPGs defines whether an OSD upgrade requires PGs to be clean. If set to `true`, the OSD upgrade process won't start until the PGs are healthy.
// This configuration will be ignored if `skipUpgradeChecks` is `true`.
// Default is false.
// +optional
UpgradeOSDRequiresHealthyPGs bool `json:"upgradeOSDRequiresHealthyPGs,omitempty"`

// A spec for configuring disruption management.
// +nullable
// +optional
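As an illustrative sketch (not part of this commit), the new field sits next to the other upgrade-related knobs on cephv1.ClusterSpec and could be set programmatically as below, assuming the usual github.com/rook/rook module path used elsewhere in this diff:

package main

import (
	"fmt"

	cephv1 "github.com/rook/rook/pkg/apis/ceph.rook.io/v1"
)

func main() {
	// Require clean PGs before any OSD update; SkipUpgradeChecks: true would bypass this gate.
	spec := cephv1.ClusterSpec{
		SkipUpgradeChecks:            false,
		UpgradeOSDRequiresHealthyPGs: true,
	}
	fmt.Println("gate OSD updates on healthy PGs:", spec.UpgradeOSDRequiresHealthyPGs)
}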
13 changes: 13 additions & 0 deletions pkg/operator/ceph/cluster/osd/update.go
@@ -79,6 +79,19 @@ func (c *updateConfig) updateExistingOSDs(errs *provisionErrors) {
if c.doneUpdating() {
return // no more OSDs to update
}
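// When upgradeOSDRequiresHealthyPGs is enabled (and upgrade checks are not skipped),
// do not start updating any OSD until the cluster reports clean PGs; otherwise return
// and let a later reconcile retry the update.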
if !c.cluster.spec.SkipUpgradeChecks && c.cluster.spec.UpgradeOSDRequiresHealthyPGs {
pgHealthMsg, pgClean, err := cephclient.IsClusterClean(c.cluster.context, c.cluster.clusterInfo, c.cluster.spec.DisruptionManagement.PGHealthyRegex)
if err != nil {
logger.Warningf("failed to check PGs status to update OSDs, will try updating it again later. %v", err)
return
}
if !pgClean {
logger.Infof("PGs are not healthy to update OSDs, will try updating it again later. PGs status: %q", pgHealthMsg)
return
}
logger.Infof("PGs are healthy to proceed updating OSDs. %v", pgHealthMsg)
}

osdIDQuery, _ := c.queue.Pop()

var osdIDs []int
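For clarity, the guard added above reduces to the following decision. shouldProceedWithOSDUpdates is a hypothetical standalone helper used only for illustration, not a function introduced by this patch:

package main

import "fmt"

// shouldProceedWithOSDUpdates mirrors the guard in updateExistingOSDs: the PG gate
// applies only when upgrade checks are not skipped and upgradeOSDRequiresHealthyPGs
// is set, and it passes only once the PGs are clean.
func shouldProceedWithOSDUpdates(skipUpgradeChecks, requireHealthyPGs, pgsClean bool) bool {
	if skipUpgradeChecks || !requireHealthyPGs {
		return true // gate disabled; same behavior as before this change
	}
	return pgsClean // gate enabled; wait until the PGs are clean
}

func main() {
	fmt.Println(shouldProceedWithOSDUpdates(false, true, false)) // false: PGs not clean yet, try again later
	fmt.Println(shouldProceedWithOSDUpdates(true, true, false))  // true: skipUpgradeChecks overrides the gate
}

Because updateExistingOSDs returns before popping anything from the update queue, the OSDs stay queued and the check is naturally retried on a later reconcile rather than failing the upgrade, which the new unit tests below exercise.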
43 changes: 43 additions & 0 deletions pkg/operator/ceph/cluster/osd/update_test.go
@@ -75,6 +75,8 @@ func Test_updateExistingOSDs(t *testing.T) {
updateInjectFailures k8sutil.Failures // return failures from mocked updateDeploymentAndWaitFunc
returnOkToStopIDs []int // return these IDs are ok-to-stop (or not ok to stop if empty)
forceUpgradeIfUnhealthy bool
requiresHealthyPGs bool
cephStatus string
)

// intermediates (created from inputs)
@@ -108,6 +110,7 @@
clusterInfo.OwnerInfo = cephclient.NewMinimumOwnerInfo(t)
spec := cephv1.ClusterSpec{
ContinueUpgradeAfterChecksEvenIfNotHealthy: forceUpgradeIfUnhealthy,
UpgradeOSDRequiresHealthyPGs: requiresHealthyPGs,
}
c = New(ctx, clusterInfo, spec, "rook/rook:master")
config := c.newProvisionConfig()
@@ -164,6 +167,9 @@
return cephclientfake.OSDDeviceClassOutput(args[3]), nil
}
}
if args[0] == "status" {
return cephStatus, nil
}
panic(fmt.Sprintf("unexpected command %q with args %v", command, args))
},
}
@@ -361,6 +367,43 @@
assert.Equal(t, 0, updateQueue.Len()) // the OSD should now have been removed from the queue
})

t.Run("PGs not clean to upgrade OSD", func(t *testing.T) {
clientset = fake.NewSimpleClientset()
updateQueue = newUpdateQueueWithIDs(2)
existingDeployments = newExistenceListWithIDs(2)
requiresHealthyPGs = true
cephStatus = unHealthyCephStatus
updateInjectFailures = k8sutil.Failures{}
doSetup()

osdToBeQueried = 2
updateConfig.updateExistingOSDs(errs)
assert.Zero(t, errs.len())
assert.ElementsMatch(t, deploymentsUpdated, []string{})
assert.Equal(t, 1, updateQueue.Len()) // the OSD should remain

})

t.Run("PGs clean to upgrade OSD", func(t *testing.T) {
clientset = fake.NewSimpleClientset()
updateQueue = newUpdateQueueWithIDs(0)
existingDeployments = newExistenceListWithIDs(0)
requiresHealthyPGs = true
cephStatus = healthyCephStatus
forceUpgradeIfUnhealthy = true // FORCE UPDATES
updateInjectFailures = k8sutil.Failures{}
doSetup()
addDeploymentOnNode("node0", 0)

osdToBeQueried = 0
returnOkToStopIDs = []int{0}
updateConfig.updateExistingOSDs(errs)
assert.Zero(t, errs.len())
assert.ElementsMatch(t, deploymentsUpdated, []string{deploymentName(0)})
assert.Equal(t, 0, updateQueue.Len()) // should be done with updates

})

t.Run("continueUpgradesAfterChecksEvenIfUnhealthy = true", func(t *testing.T) {
clientset = fake.NewSimpleClientset()
updateQueue = newUpdateQueueWithIDs(2)
