
Set deviceClasses to avoid replicated pool spreading PGs across all OSDs #2615

Merged: 2 commits into red-hat-storage:main from the deviceclass branch on Jun 11, 2024

Conversation

@malayparida2000 (Contributor) commented May 16, 2024

BZ-https://bugzilla.redhat.com/show_bug.cgi?id=2274175

Set the deviceClasses of the pools to avoid wrong data placement

Previously, when replica-1 was enabled on an existing cluster, the pool set its deviceClass to replicated, but no OSDs carried that deviceClass: OSDs created normally are assigned the ssd deviceClass when it is left blank, and it cannot be changed afterwards. To solve this we have to set the deviceClass for the pools intelligently, depending on whether replica-1 is enabled and on the number of deviceClasses found in the cephCluster CR status.

To do this we added a DefaultCephDeviceClass field to the StorageCluster status, which holds the default Ceph device class to be used for the pools. This value is determined from whether replica-1 is enabled and from the number of deviceClasses found in the CephCluster CR status, and is then set in the StorageCluster status.
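
To make the described logic concrete, here is a minimal Go sketch of how the default device class could be derived. The helper name determineDefaultCephDeviceClass, its signature, the exact fallback rules, and the example device class names are assumptions for illustration; they are not necessarily the code this PR adds under controllers/storagecluster.

```go
// Minimal sketch, assuming (not taken from the PR) these fallback rules:
//   - replica-1 disabled             -> return "" (no device class restriction)
//   - exactly one device class found -> use it, since all OSDs share it
//   - several device classes found   -> fall back to "ssd", the class that OSDs
//     created with a blank device class are assigned and cannot change later
package main

import "fmt"

// determineDefaultCephDeviceClass derives the value that would be stored in the
// StorageCluster status (the new DefaultCephDeviceClass field) and then applied
// to the pools' deviceClass.
func determineDefaultCephDeviceClass(replica1Enabled bool, foundDeviceClasses []string) string {
	if !replica1Enabled {
		return ""
	}
	if len(foundDeviceClasses) == 1 {
		return foundDeviceClasses[0]
	}
	return "ssd"
}

func main() {
	// Replica-1 enabled on a cluster whose CephCluster status reports "ssd"
	// plus two hypothetical replica-1 device classes.
	classes := []string{"ssd", "rack0", "rack1"}
	fmt.Println(determineDefaultCephDeviceClass(true, classes)) // prints: ssd
}
```

Whatever the exact rule, the key point of the change is that the derived value is persisted once in the StorageCluster status and reused for the pools, so a replicated pool can no longer request a device class that no OSD actually carries.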

@malayparida2000 (Contributor, Author)

The unit test failure should be fixed by this #2619

@openshift-ci bot added the lgtm and approved labels on May 31, 2024.
@iamniting (Member)

/hold

@openshift-ci bot added the do-not-merge/hold label on May 31, 2024.
@iamniting (Member) left a comment

Approved mistakenly

@openshift-ci bot removed the lgtm label on May 31, 2024.
@iamniting removed the approved and do-not-merge/hold labels on May 31, 2024.
@malayparida2000 force-pushed the deviceclass branch 6 times, most recently from 9152002 to 9831a85 on June 6, 2024 at 15:13.
Commit: Set the deviceClasses of the pools to avoid wrong data placement
This also includes the unit tests.

Signed-off-by: Malay Kumar Parida <mparida@redhat.com>
@travisn (Contributor) left a comment

/approve
/lgtm

@openshift-ci bot added the lgtm label on Jun 7, 2024.
@malayparida2000 (Contributor, Author)

/test ocs-operator-bundle-e2e-aws

@agarwal-mudit (Member)

/cherry-pick release-4.16

@openshift-cherrypick-robot

@agarwal-mudit: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.16


@malayparida2000 (Contributor, Author)

2024-06-10T07:34:24.303690686Z time="2024-06-10T07:34:24Z" level=info msg="creating bucket nb.1718003653158.makestoragegreatagain.com" sys=openshift-storage/noobaa
2024-06-10T07:34:24.669995832Z time="2024-06-10T07:34:24Z" level=error msg="got error when trying to create bucket nb.1718003653158.makestoragegreatagain.com. error: RequestError: send request failed\ncaused by: Put \"https://s3.us-east-2.amazonaws.com/nb.1718003653158.makestoragegreatagain.com\": tls: failed to verify certificate: x509: certificate signed by unknown authority" sys=openshift-storage/noobaa
2024-06-10T07:34:24.669995832Z time="2024-06-10T07:34:24Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
2024-06-10T07:34:24.670049595Z time="2024-06-10T07:34:24Z" level=warning msg="⏳ Temporary Error: RequestError: send request failed\ncaused by: Put \"https://s3.us-east-2.amazonaws.com/nb.1718003653158.makestoragegreatagain.com\": tls: failed to verify certificate: x509: certificate signed by unknown authority" sys=openshift-storage/noobaa
2024-06-10T07:34:24.678836665Z time="2024-06-10T07:34:24Z" level=info msg="UpdateStatus: Done generation 1" sys=openshift-storage/noobaa
2024-06-10T07:34:24.679320223Z time="2024-06-10T07:34:24Z" level=info msg="Update event detected for noobaa (openshift-storage), queuing Reconcile"

The NooBaa problem keeps failing the e2e test again and again.

@malayparida2000 (Contributor, Author)

/retest

1 similar comment
@malayparida2000 (Contributor, Author)

/retest

@malayparida2000 (Contributor, Author)

I did extensive testing of this PR with the following scenarios:
Case 1: Cluster was created with the old code, then upgraded to the new code
Case 2: Cluster was created & replica-1 was enabled on the old code, then upgraded to the new code
Case 3: Cluster was created with replica-1 on the old code, then upgraded to the new code
Case 4: Cluster is created & replica-1 is enabled, both on the new code
Case 5: Cluster is created from scratch with replica-1 on the new code
Here are the complete testing results: https://hackmd.io/@Yh4a4hAATcW2BNYBJVSx4w/ryLe_4kHR

Happy to state that the code works for all the above cases & solves the bug.

openshift-ci bot commented Jun 11, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iamniting, malayparida2000, travisn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label on Jun 11, 2024.
@openshift-merge-bot bot merged commit 0f9b7cd into red-hat-storage:main on Jun 11, 2024.
11 checks passed
@openshift-cherrypick-robot

@agarwal-mudit: #2615 failed to apply on top of branch "release-4.16":

Applying: Set the deviceClasses of the pools to avoid wrong data placement
Using index info to reconstruct a base tree...
M	api/v1/storagecluster_types.go
M	controllers/storagecluster/cephcluster.go
M	controllers/storagecluster/cephcluster_test.go
M	controllers/storagecluster/generate.go
Falling back to patching base and 3-way merge...
Auto-merging controllers/storagecluster/generate.go
Auto-merging controllers/storagecluster/cephcluster_test.go
CONFLICT (content): Merge conflict in controllers/storagecluster/cephcluster_test.go
Auto-merging controllers/storagecluster/cephcluster.go
Auto-merging api/v1/storagecluster_types.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Set the deviceClasses of the pools to avoid wrong data placement
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.16

