
block creation of StorageClass until prerequisites are met #1224

Merged

Conversation

umangachapagain (Contributor) opened this pull request:

CephBlockPool waits for CephCluster to be ready.
Similarly, StorageClass needs to wait for CephBlockPool to be ready.

Same for CephFilesystem.
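The ordering described above can be sketched with a minimal, stdlib-only Go example (hypothetical types, not the real cephv1 API): a resource's Phase stays empty until the operator reports status, and a dependent StorageClass is only created once the backing resource reports "Ready".

```go
package main

import "fmt"

// Simplified stand-in for a Ceph custom resource: Phase is empty
// until the operator first reports status on it.
type cephResource struct {
	Kind  string
	Phase string
}

// canCreateStorageClass mirrors the PR's gate: a nil/empty status or a
// non-Ready phase means the StorageClass must wait.
func canCreateStorageClass(backing cephResource) bool {
	return backing.Phase == "Ready"
}

func main() {
	pool := cephResource{Kind: "CephBlockPool"}
	fmt.Println(canCreateStorageClass(pool)) // pool not yet Ready
	pool.Phase = "Ready"
	fmt.Println(canCreateStorageClass(pool)) // StorageClass may now be created
}
```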

@jarrpa (Member) left a comment:

Unless I'm missing something, this doesn't seem to take into account that either of CephBlockPool or CephFilesystem could be disabled. Would it make more sense to extend StorageClusterConfiguration to include a "required kind" or something so that we can just do the check on a per-SC basis here? https://github.com/openshift/ocs-operator/blob/master/controllers/storagecluster/storageclasses.go#L88-L93
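The "required kind" idea suggested here can be sketched as follows (illustrative names only, not the actual StorageClassConfiguration from ocs-operator): each StorageClass config records the CRD kind it depends on, so one generic loop can do the check per-SC instead of an if/else on names.

```go
package main

import "fmt"

// Hypothetical per-StorageClass configuration carrying the kind it
// depends on, e.g. "CephBlockPool" or "CephFilesystem".
type storageClassConfiguration struct {
	name         string
	requiredKind string
}

// pendingConfigs returns the names of StorageClasses whose required
// kind is not yet Ready; an empty requiredKind means no prerequisite.
func pendingConfigs(configs []storageClassConfiguration, ready map[string]bool) []string {
	var skipped []string
	for _, c := range configs {
		if c.requiredKind != "" && !ready[c.requiredKind] {
			skipped = append(skipped, c.name)
		}
	}
	return skipped
}

func main() {
	configs := []storageClassConfiguration{
		{name: "ocs-storagecluster-ceph-rbd", requiredKind: "CephBlockPool"},
		{name: "ocs-storagecluster-cephfs", requiredKind: "CephFilesystem"},
	}
	// CephBlockPool Ready, CephFilesystem not yet.
	ready := map[string]bool{"CephBlockPool": true}
	fmt.Println(pendingConfigs(configs, ready))
}
```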

Also, CI failures look valid. Check make ocs-operator-ci output.

@umangachapagain (Contributor, PR author) replied:

> Unless I'm missing something, this doesn't seem to take into account that either of CephBlockPool or CephFilesystem could be disabled. Would it make more sense to extend StorageClusterConfiguration to include a "required kind" or something so that we can just do the check on a per-SC basis here? https://github.com/openshift/ocs-operator/blob/master/controllers/storagecluster/storageclasses.go#L88-L93

Don't we always create both the StorageClasses?

I don't think a "required kind" is necessary since we know what is required.

> Also, CI failures look valid. Check make ocs-operator-ci output.

Yes, they are valid, and I've given up on fixing them: it would require refactoring the entire unit test framework (which is actually running integration tests internally). Unless someone has a better way to fix it.

@jarrpa (Member) commented on Jun 23, 2021:

>> Unless I'm missing something, this doesn't seem to take into account that either of CephBlockPool or CephFilesystem could be disabled. Would it make more sense to extend StorageClusterConfiguration to include a "required kind" or something so that we can just do the check on a per-SC basis here? https://github.com/openshift/ocs-operator/blob/master/controllers/storagecluster/storageclasses.go#L88-L93

> Don't we always create both the StorageClasses?

No. If there's no CephBlockPool created, it makes no sense to create a StorageClass pointing to something that doesn't exist.

> I don't think a "required kind" is necessary since we know what is required.

It would be a cleaner implementation: rather than having an if/else that reads the StorageClass name and checks for the appropriate Kind based on it, we could just read the same field for all configs. But... eh, you're right, it's not strictly necessary, so I won't block on it.

>> Also, CI failures look valid. Check make ocs-operator-ci output.

> Yes, they are valid and I give up on fixing it. It requires re-factoring entire unit test framework (which is actually running integration test internally). Unless someone has a better way to fix it.

The ocs-operator-ci fixes are straightforward linting problems, so please at least deal with that. :P

The e2e failures look like CI infra issues, so we just need to retest. That can wait until LGTM.

@umangachapagain force-pushed the storageclass-wait branch 2 times, most recently from da259e7 to 6fb50fc on June 28, 2021 at 13:38

@umangachapagain (Contributor, PR author) replied:

> Unless I'm missing something, this doesn't seem to take into account that either of CephBlockPool or CephFilesystem could be disabled. Would it make more sense to extend StorageClusterConfiguration to include a "required kind" or something so that we can just do the check on a per-SC basis here? https://github.com/openshift/ocs-operator/blob/master/controllers/storagecluster/storageclasses.go#L88-L93

Updated to check whether the particular StorageClass is enabled before trying to wait. But it seems StorageClass creation doesn't respect that setting anyway; that's a matter for a different PR.

> Also, CI failures look valid. Check make ocs-operator-ci output.

Fixed the annoying uppercase/lowercase issue.

@umangachapagain (Contributor, PR author) commented:

/retest

@jarrpa (Member) left a comment:

>> Unless I'm missing something, this doesn't seem to take into account that either of CephBlockPool or CephFilesystem could be disabled. Would it make more sense to extend StorageClusterConfiguration to include a "required kind" or something so that we can just do the check on a per-SC basis here? https://github.com/openshift/ocs-operator/blob/master/controllers/storagecluster/storageclasses.go#L88-L93

> Updated to check if particular StorageClass is enabled before trying to wait. But, it seems like StorageClass creation doesn't care about the setting anyway and that's a different PR.

Oh, it absolutely does! It's embedded in the StorageClassConfiguration: https://github.com/openshift/ocs-operator/blob/master/controllers/storagecluster/storageclasses.go#L81-L83. That said, I understand it may be difficult (and a bad idea) to try to generalize the logic of checking for valid resources. As such, I recommend just moving your existing logic into the new*StorageClassConfiguration() functions, into newStorageClassConfigurations(), or maybe even into createStorageClasses() itself. Don't block the creation of all StorageClasses if one fails, but do return err so we block further reconciliation of dependent resources.

>> Also, CI failures look valid. Check make ocs-operator-ci output.

> Fixed the annoying uppercase/lowercase issue.

You know I'm like this. 😛 Apologies nonetheless. 😂
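The "skip the one that isn't ready, but still return err" suggestion can be sketched like this (illustrative names, not the real ocs-operator functions): every StorageClass whose prerequisite is Ready gets created, and the skipped ones are collected into an error so the reconcile loop blocks dependent resources and retries.

```go
package main

import "fmt"

// createStorageClasses creates each wanted StorageClass whose
// prerequisite is Ready, skips the rest, and returns an error naming
// the skipped ones so reconciliation of dependents is blocked.
func createStorageClasses(ready map[string]bool, wanted []string) error {
	var skipped []string
	for _, sc := range wanted {
		if !ready[sc] {
			skipped = append(skipped, sc)
			continue // don't block the others on this one
		}
		// ... actual StorageClass creation would happen here ...
	}
	if len(skipped) > 0 {
		return fmt.Errorf("some StorageClasses %v were skipped while waiting for pre-requisites to be met", skipped)
	}
	return nil
}

func main() {
	// cephfs prerequisite not yet Ready, so it is skipped and reported.
	err := createStorageClasses(
		map[string]bool{"ceph-rbd": true},
		[]string{"ceph-rbd", "cephfs"},
	)
	fmt.Println(err)
}
```

This matches the error text that later shows up in the StorageCluster conditions during the CI debugging below.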

Review threads on controllers/storagecluster/storageclasses.go (outdated) and controllers/storagecluster/cephblockpools.go were resolved.
@jarrpa added the mvp (Required for the next minimum viable product) and priority/2-medium labels on Aug 26, 2021
@jarrpa (Member) left a comment:

/lgtm

@openshift-ci bot added the lgtm (Indicates that a PR is ready to be merged) label on Aug 26, 2021
@jarrpa (Member) commented on Aug 26, 2021:

Gonna run this one more time just to be sure....

/test ocs-operator-bundle-e2e-aws

@openshift-ci bot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label on Aug 26, 2021
@jarrpa (Member) commented on Aug 26, 2021:

/test ocs-operator-bundle-e2e-aws

```go
if cephFilesystem.Status == nil {
	return fmt.Errorf("cephFilesystem %q is not reporting status", key)
}
r.Log.Info("CephFilesystem %q is in phase %q", key, cephFilesystem.Status.Phase)
```
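Worth noting about the snippet above: logr-style loggers (such as r.Log here) take a constant message plus alternating key/value pairs, so printf verbs like %q in the message are not expanded. A toy, stdlib-only logger (not the real logr API) illustrating the key/value style the call should use:

```go
package main

import "fmt"

// info mimics how a logr-style logger renders a call: the message is
// kept as-is (no printf expansion) and keysAndValues are appended as
// alternating key=value pairs.
func info(msg string, keysAndValues ...interface{}) string {
	out := fmt.Sprintf("%q", msg)
	for i := 0; i+1 < len(keysAndValues); i += 2 {
		out += fmt.Sprintf(" %v=%q", keysAndValues[i], keysAndValues[i+1])
	}
	return out
}

func main() {
	// Key/value style instead of %q verbs in the message.
	fmt.Println(info("CephFilesystem is in phase",
		"cephFilesystem", "ocs-fs", "phase", "Ready"))
}
```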

@umangachapagain (Contributor, PR author) replied:

2 months of updates and these tests still won't pass. I'm outta here.

(throw-pc GIF)

@jarrpa (Member) replied:

You go with my blessing. 🙏 🙏 🙏

@jarrpa (Member) commented on Aug 26, 2021:

/hold

@jarrpa (Member) commented on Aug 26, 2021:

/lgtm cancel

@openshift-ci bot added the do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command) label on Aug 26, 2021
@openshift-ci bot removed the approved and lgtm labels on Aug 26, 2021
@jarrpa (Member) commented on Aug 27, 2021:

...test failed on RPC timeout. BECAUSE OF COURSE IT DID. 😠 Anyway, code looks solid, TYVM. 🙇

/lgtm
/hold cancel

@jarrpa added the approved and lgtm labels and removed the do-not-merge/hold label on Aug 27, 2021
@jarrpa (Member) commented on Aug 27, 2021:

/test ocs-operator-bundle-e2e-aws

@jarrpa (Member) commented on Aug 27, 2021:

/retest-required

Comment on lines +109 to +113
```go
if err != nil || cephFilesystem.Status == nil || cephFilesystem.Status.Phase != cephv1.ConditionType(util.PhaseReady) {
	r.Log.Info("Waiting for CephFilesystem to be Ready. Skip reconciling StorageClass",
		"CephFilesystem", klog.KRef(key.Name, key.Namespace),
		"StorageClass", klog.KRef("", sc.Name),
	)
```
@jarrpa (Member) commented:

StorageCluster Conditions show the following:

```
Last Heartbeat Time:   2021-08-27T18:33:40Z
Last Transition Time:  2021-08-27T18:12:08Z
Message:               Error while reconciling: some StorageClasses [test-storagecluster-cephfs] were skipped while waiting for pre-requisites to be met
Reason:                ReconcileFailed
Status:                False
Type:                  ReconcileComplete
```

ocs-operator logs show the following:

```
2021-08-27T18:33:40.740590363Z {"level":"info","ts":1630089220.7405756,"logger":"controllers.StorageCluster","msg":"Waiting for CephFilesystem to be Ready. Skip reconciling StorageClass","Request.Namespace":"openshift-storage","Request.Name":"test-storagecluster","CephFilesystem":"test-storagecluster-cephfilesystem/openshift-storage","StorageClass":"test-storagecluster-cephfs"}
2021-08-27T18:33:40.756182902Z {"level":"error","ts":1630089220.7561336,"logger":"controller-runtime.manager.controller.storagecluster","msg":"Reconciler error","reconciler group":"ocs.openshift.io","reconciler kind":"StorageCluster","name":"test-storagecluster","namespace":"openshift-storage","error":"some StorageClasses [test-storagecluster-cephfs] were skipped while waiting for pre-requisites to be met","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:253\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/src/github.com/openshift/ocs-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:214"}
```

Looking at the if conditions you have, I checked the CephFilesystem and indeed it has a nil Status: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ocs-operator/1224/pull-ci-openshift-ocs-operator-master-ocs-operator-bundle-e2e-aws/1431303805407858688/artifacts/ocs-operator-bundle-e2e-aws/ocs-must-gather/artifacts/ocs-must-gather/registry-build02-ci-openshift-org-ci-op-s10lq9fp-stable-sha256-2cabcf61a08025e75eba8b82626357ddb48a44bf609f7d1483242ad5d0b1f5f0/ceph/namespaces/openshift-storage/ceph.rook.io/cephfilesystems/test-storagecluster-cephfilesystem.yaml

The rook-ceph-operator logs don't show anything obvious to me, but maybe I'm missing something: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ocs-operator/1224/pull-ci-openshift-ocs-operator-master-ocs-operator-bundle-e2e-aws/1431303805407858688/artifacts/ocs-operator-bundle-e2e-aws/ocs-must-gather/artifacts/ocs-must-gather/registry-build02-ci-openshift-org-ci-op-s10lq9fp-stable-sha256-2cabcf61a08025e75eba8b82626357ddb48a44bf609f7d1483242ad5d0b1f5f0/namespaces/openshift-storage/pods/rook-ceph-operator-8c8468d9d-mwwvh/rook-ceph-operator/rook-ceph-operator/logs/current.log

All Pods show Ready and Running: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_ocs-operator/1224/pull-ci-openshift-ocs-operator-master-ocs-operator-bundle-e2e-aws/1431303805407858688/artifacts/ocs-operator-bundle-e2e-aws/ocs-must-gather/artifacts/ocs-must-gather/registry-build02-ci-openshift-org-ci-op-s10lq9fp-stable-sha256-2cabcf61a08025e75eba8b82626357ddb48a44bf609f7d1483242ad5d0b1f5f0/namespaces/openshift-storage/oc_output/pods_-owide

So... offhand, I'm stuck. @travisn any ideas?

A Contributor replied:

In a test cluster I don't see any status ever updated on the cephfilesystem either. You're just now relying on it for the first time, or perhaps there was a regression? Still looking in the code...

The same Contributor followed up:

Ok, there was a bug such that the filesystem status was not being updated if mirroring was not enabled on the filesystem. See rook/rook#8609
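The class of bug described here can be illustrated with a simplified sketch (not the actual rook code from rook/rook#8609): when the status write sits inside the mirroring branch, a filesystem created without mirroring is left with a nil Status, which is exactly the nil Status the prereq check then trips over.

```go
package main

import "fmt"

// Simplified stand-in for a CephFilesystem resource.
type filesystem struct {
	mirroringEnabled bool
	status           *string
}

// reconcileBuggy models the bug: status is only ever written inside
// the mirroring branch.
func reconcileBuggy(fs *filesystem) {
	if fs.mirroringEnabled {
		s := "Ready"
		fs.status = &s
	}
}

// reconcileFixed writes status regardless of whether mirroring is on.
func reconcileFixed(fs *filesystem) {
	s := "Ready"
	fs.status = &s
}

func main() {
	a := &filesystem{mirroringEnabled: false}
	reconcileBuggy(a)
	fmt.Println(a.status == nil) // nil Status, as seen in the CI must-gather

	b := &filesystem{mirroringEnabled: false}
	reconcileFixed(b)
	fmt.Println(b.status == nil)
}
```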

@openshift-bot commented:

/retest-required

Please review the full test history for this PR and help us cut down flakes.

@openshift-ci bot added the needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD) label on Aug 28, 2021
Signed-off-by: Umanga Chapagain <chapagainumanga@gmail.com>
Signed-off-by: Umanga Chapagain <chapagainumanga@gmail.com>
Rearranging resourceManager instances to preserve the order
in which they need to be processed. This helps provide a
visual cue about the sequence of tasks. It also reduces the
chance of deadlock from tasks waiting on each other to
complete.

Signed-off-by: Umanga Chapagain <chapagainumanga@gmail.com>
CephBlockPool and CephFilesystem updates were incorrectly
handled. This commit fixes it by updating only the spec and
not resetting the Status.

This also helps in unit testing.

Signed-off-by: Umanga Chapagain <chapagainumanga@gmail.com>
StorageClass should only be created after the CephBlockPool
and CephFilesystem are Ready. So, added a prereq check to
skip reconciling if prereq is not met.

Also, added a field to flag external storage clusters. This
is required to skip prereq checks on external mode clusters.

Signed-off-by: Umanga Chapagain <chapagainumanga@gmail.com>
@openshift-ci bot removed the needs-rebase and lgtm labels on Aug 30, 2021
Signed-off-by: Umanga Chapagain <chapagainumanga@gmail.com>
@jarrpa (Member) left a comment:

ONE. MORE. TIME.

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Aug 30, 2021
@jarrpa (Member) commented on Aug 30, 2021:

/override ci/prow/ci-index-dev-master-dependencies

openshift-ci bot commented on Aug 30, 2021:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jarrpa

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot commented on Aug 30, 2021:

@jarrpa: Overrode contexts on behalf of jarrpa: ci/prow/ci-index-dev-master-dependencies

In response to this:

/override ci/prow/ci-index-dev-master-dependencies

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-merge-robot openshift-merge-robot merged commit 45b1cd6 into red-hat-storage:master Aug 30, 2021
@umangachapagain umangachapagain deleted the storageclass-wait branch August 30, 2021 19:08
vavuthu added a commit to vavuthu/ocs-ci that referenced this pull request Aug 31, 2021
Issue: Due to PR red-hat-storage/ocs-operator#1224, creation of the
ocs-storagecluster-ceph-rbd storage class is taking more time.

Signed-off-by: vavuthu <vavuthu@redhat.com>