
Set deviceClasses to avoid replicated pool spreading PGs across all OSDs #2615

Merged: 2 commits into red-hat-storage:main from the deviceclass branch on Jun 11, 2024

Conversation

@malayparida2000 (Contributor) commented May 16, 2024

BZ-https://bugzilla.redhat.com/show_bug.cgi?id=2274175

Set the deviceClasses of the pools to avoid wrong data placement

Previously, when replica-1 was enabled on an existing cluster, the pool set its deviceClass to replicated, but no OSDs carried that deviceClass: OSDs created normally are assigned the ssd deviceClass when it is left blank, and it cannot be changed afterwards. To solve this we have to set the deviceClass for the pools intelligently, depending on whether replica-1 is enabled and on the number of deviceClasses found in the cephCluster CR status.

To do this we added a DefaultCephDeviceClass field to the StorageCluster status, which holds the default Ceph device class to be used for the pools. This value is determined from whether replica-1 is enabled and from the number of deviceClasses found in the CephCluster CR status, and is then set in the StorageCluster status.
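
To make the described logic concrete, here is a minimal Go sketch of how the default device class could be derived. The helper name determineDefaultCephDeviceClass, its signature, the exact fallback rules, and the example device class names are assumptions for illustration; they are not necessarily the code this PR adds under controllers/storagecluster.

```go
// Minimal sketch, assuming (not taken from the PR) these fallback rules:
//   - replica-1 disabled             -> return "" (no device class restriction)
//   - exactly one device class found -> use it, since all OSDs share it
//   - several device classes found   -> fall back to "ssd", the class that OSDs
//     created with a blank device class are assigned and cannot change later
package main

import "fmt"

// determineDefaultCephDeviceClass derives the value that would be stored in the
// StorageCluster status (the new DefaultCephDeviceClass field) and then applied
// to the pools' deviceClass.
func determineDefaultCephDeviceClass(replica1Enabled bool, foundDeviceClasses []string) string {
	if !replica1Enabled {
		return ""
	}
	if len(foundDeviceClasses) == 1 {
		return foundDeviceClasses[0]
	}
	return "ssd"
}

func main() {
	// Replica-1 enabled on a cluster whose CephCluster status reports "ssd"
	// plus two hypothetical replica-1 device classes.
	classes := []string{"ssd", "rack0", "rack1"}
	fmt.Println(determineDefaultCephDeviceClass(true, classes)) // prints: ssd
}
```

Whatever the exact rule, the key point of the change is that the derived value is persisted once in the StorageCluster status and reused for the pools, so a replicated pool can no longer request a device class that no OSD actually carries.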

@malayparida2000 (Contributor, Author)

The unit test failure should be fixed by this #2619

@openshift-ci bot added the lgtm and approved labels on May 31, 2024.
@iamniting (Member)

/hold

@openshift-ci bot added the do-not-merge/hold label on May 31, 2024.
@iamniting (Member) left a comment

Approved mistakenly

@openshift-ci bot removed the lgtm label on May 31, 2024.
@iamniting removed the approved and do-not-merge/hold labels on May 31, 2024.
@malayparida2000 force-pushed the deviceclass branch 6 times, most recently from 9152002 to 9831a85 on June 6, 2024 at 15:13.
Commit: Set the deviceClasses of the pools to avoid wrong data placement
This also includes the unit tests.

Signed-off-by: Malay Kumar Parida <mparida@redhat.com>
@travisn (Contributor) left a comment

/approve
/lgtm

@openshift-ci bot added the lgtm label on Jun 7, 2024.
@malayparida2000 (Contributor, Author)

/test ocs-operator-bundle-e2e-aws

@agarwal-mudit (Member)

/cherry-pick release-4.16

@openshift-cherrypick-robot

@agarwal-mudit: once the present PR merges, I will cherry-pick it on top of release-4.16 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.16


@malayparida2000 (Contributor, Author)

2024-06-10T07:34:24.303690686Z time="2024-06-10T07:34:24Z" level=info msg="creating bucket nb.1718003653158.makestoragegreatagain.com" sys=openshift-storage/noobaa
2024-06-10T07:34:24.669995832Z time="2024-06-10T07:34:24Z" level=error msg="got error when trying to create bucket nb.1718003653158.makestoragegreatagain.com. error: RequestError: send request failed\ncaused by: Put \"https://s3.us-east-2.amazonaws.com/nb.1718003653158.makestoragegreatagain.com\": tls: failed to verify certificate: x509: certificate signed by unknown authority" sys=openshift-storage/noobaa
2024-06-10T07:34:24.669995832Z time="2024-06-10T07:34:24Z" level=info msg="SetPhase: temporary error during phase \"Configuring\"" sys=openshift-storage/noobaa
2024-06-10T07:34:24.670049595Z time="2024-06-10T07:34:24Z" level=warning msg="⏳ Temporary Error: RequestError: send request failed\ncaused by: Put \"https://s3.us-east-2.amazonaws.com/nb.1718003653158.makestoragegreatagain.com\": tls: failed to verify certificate: x509: certificate signed by unknown authority" sys=openshift-storage/noobaa
2024-06-10T07:34:24.678836665Z time="2024-06-10T07:34:24Z" level=info msg="UpdateStatus: Done generation 1" sys=openshift-storage/noobaa
2024-06-10T07:34:24.679320223Z time="2024-06-10T07:34:24Z" level=info msg="Update event detected for noobaa (openshift-storage), queuing Reconcile"

The NooBaa problem keeps failing the e2e test again and again.

@malayparida2000 (Contributor, Author)

/retest

1 similar comment
@malayparida2000 (Contributor, Author)

/retest

@malayparida2000 (Contributor, Author)

I did extensive testing of this PR with the following scenarios:
Case 1: Cluster was created with the old code, then upgraded to the new code
Case 2: Cluster was created & replica-1 was enabled on the old code, then upgraded to the new code
Case 3: Cluster was created with replica-1 on the old code, then upgraded to the new code
Case 4: Cluster is created & replica-1 is enabled, both on the new code
Case 5: Cluster is created from scratch with replica-1 on the new code
Here are the complete testing results: https://hackmd.io/@Yh4a4hAATcW2BNYBJVSx4w/ryLe_4kHR

Happy to state that the code works for all the above cases & solves the bug.

openshift-ci bot commented Jun 11, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: iamniting, malayparida2000, travisn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci bot added the approved label on Jun 11, 2024.
@openshift-merge-bot bot merged commit 0f9b7cd into red-hat-storage:main on Jun 11, 2024.
11 checks passed
@openshift-cherrypick-robot

@agarwal-mudit: #2615 failed to apply on top of branch "release-4.16":

Applying: Set the deviceClasses of the pools to avoid wrong data placement
Using index info to reconstruct a base tree...
M	api/v1/storagecluster_types.go
M	controllers/storagecluster/cephcluster.go
M	controllers/storagecluster/cephcluster_test.go
M	controllers/storagecluster/generate.go
Falling back to patching base and 3-way merge...
Auto-merging controllers/storagecluster/generate.go
Auto-merging controllers/storagecluster/cephcluster_test.go
CONFLICT (content): Merge conflict in controllers/storagecluster/cephcluster_test.go
Auto-merging controllers/storagecluster/cephcluster.go
Auto-merging api/v1/storagecluster_types.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Set the deviceClasses of the pools to avoid wrong data placement
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-4.16

