
Fix storagecluster deletion is stuck forever #1819

Conversation

malayparida2000
Contributor

Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2060897

Currently, during deletion of a storagecluster, the storageclasses are deleted before the cephcluster. The PVs then cannot be deleted, as they need the storageclasses for that, and the PVs in turn block the deletion of the PVCs, the CephBlockPool/CephFilesystem, and finally the cephcluster. This PR moves the deletion of the storageclasses to after the cephcluster deletion.

Signed-off-by: Malay Kumar Parida mparida@redhat.com
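
For illustration, a minimal sketch of the reordering described above, assuming a simple sequential uninstall runner (all names here are hypothetical, not the actual ocs-operator reconciler code):

```go
package main

import "fmt"

// step is a named uninstall action; hypothetical, for illustration only.
type step struct {
	name string
	run  func() error
}

// runDeletionSteps executes the uninstall steps in order, stopping at the
// first failure so the user can see which step is blocking.
func runDeletionSteps(steps []step) error {
	for _, s := range steps {
		if err := s.run(); err != nil {
			return fmt.Errorf("uninstall blocked at %q: %w", s.name, err)
		}
	}
	return nil
}

func main() {
	noop := func() error { return nil } // stand-in for the real delete logic
	// StorageClass deletion now runs last, after the CephCluster is gone,
	// so PVs can still be released by the provisioner during teardown.
	_ = runDeletionSteps([]step{
		{"delete CephBlockPool/CephFilesystem CRs", noop},
		{"delete CephCluster CR", noop},
		{"delete StorageClasses", noop}, // previously ran before the CephCluster
	})
}
```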

@openshift-ci
Contributor

openshift-ci bot commented Sep 17, 2022

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: malayparida2000
Once this PR has been reviewed and has the lgtm label, please assign nbalacha for approval by writing /assign @nbalacha in a comment. For more information see: The Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@Rakshith-R
Member

/hold

I'd prefer we add a check before each storageclass deletion for the existence of PV objects of the respective driver, and error out with the PV/PVC name so the user is aware of what to delete,
keeping the current order of deletion in place.
@malayparida2000
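
A minimal sketch of the suggested guard, assuming a controller-runtime client (the helper name and wiring are hypothetical, not existing ocs-operator code):

```go
package uninstall

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// ensureNoPVsForStorageClass refuses StorageClass deletion while PVs
// provisioned from it still exist, surfacing the blocking PV names.
// Matching here is by StorageClass name for simplicity; a per-driver
// variant could instead compare pv.Spec.CSI.Driver to the driver name.
func ensureNoPVsForStorageClass(ctx context.Context, c client.Client, scName string) error {
	var pvs corev1.PersistentVolumeList
	if err := c.List(ctx, &pvs); err != nil {
		return fmt.Errorf("failed to list PVs: %w", err)
	}
	var blocking []string
	for _, pv := range pvs.Items {
		if pv.Spec.StorageClassName == scName {
			blocking = append(blocking, pv.Name)
		}
	}
	if len(blocking) > 0 {
		// Erroring out with the PV names keeps the existing deletion
		// order and tells the user exactly what still has to go.
		return fmt.Errorf("cannot delete StorageClass %q: PVs still exist: %v", scName, blocking)
	}
	return nil
}
```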

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Sep 19, 2022
@malayparida2000
Contributor Author

/hold

I'd prefer we add a check before each storageclass deletion for the existence of PV objects of the respective driver, and error out with the PV/PVC name so the user is aware of what to delete, keeping the current order of deletion in place. @malayparida2000

Is there any particular reason why we can't alter the deletion order?

/cc @travisn, can you also please take a look?

@Rakshith-R
Member

/hold
I'd prefer we add a check before each storageclass deletion for the existence of PV objects of the respective driver, and error out with the PV/PVC name so the user is aware of what to delete, keeping the current order of deletion in place. @malayparida2000

Is there any particular reason why we can't alter the deletion order?

There can be instances where PVC creation starts after the cephcluster deletion step has completed. This will leave a stale PVC/PV pair with no backing cephcluster, and no corresponding SC once the storageclass deletion step completes.
The finalizer on the PV/PVC may still cause the same bug during uninstall.
We should avoid this ugly situation by implementing a check for PV existence before SC deletion.

@travisn
Contributor

travisn commented Sep 21, 2022

When the CephCluster CR is deleted, Rook will check for the existence of the PVs, so it won't continue with the deletion until those PVs are cleaned up. I believe the new ordering will work as we need it to.
@BlaineEXE Any concerns with the reordering?

@Rakshith-R
Member

When the CephCluster CR is deleted, Rook will check for the existence of the PVs, so it won't continue with the deletion until those PVs are cleaned up. I believe the new ordering will work as we need it to. @BlaineEXE Any concerns with the reordering?

  • Due to the current cleanup policy, with deletion of the cephcluster being triggered at the beginning, the PV check by Rook will not occur.
  • Moving SC deletion to after cluster deletion again risks PVC creation by a user in between these two steps.

The simplest option is the following.

I'd prefer we add a check before each storageclass deletion for the existence of PV objects of the respective driver, and error out with the PV/PVC name so the user is aware of what to delete, keeping the current order of deletion in place.

The error thrown by ocs-operator will state very clearly why deletion is blocked and what the user has to do.

@BlaineEXE
Contributor

But if we delete the StorageClass before the cluster is deleted, users won't be able to create new PVCs, and that can prevent the race condition you mention. If we wait to delete them until after, PVC deletions may or may not fail, and new PVCs may be created while the cluster is deleting. Or is there something else going on as well?

@travisn
Contributor

travisn commented Sep 22, 2022

I think the comment in the BZ here explains the needed approach. I don't actually think there is an ordering problem; the ordering looks correct before this PR. The problem must be in properly waiting during deletion before continuing. As described in that comment, we need to first delete the blockpool, objectstore, and filesystem CRs, and then wait for them to be deleted. Rook will not allow them to be deleted until all their PVs and buckets are deleted. Then the OCS operator can continue with the deletion of the storage classes and the CephCluster CR, which should no longer have any PVs to wait on.

As long as the block/object/file CRs are deleted first, we don't need to worry about a race condition of PVs being created before the storage class and CephCluster CR are deleted. If someone tries to create a PVC, the PV creation will fail because the block/object/file CRs are deleted.
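
A minimal delete-then-wait sketch of that approach, assuming a controller-runtime client (the helper is hypothetical; the real ocs-operator flow differs):

```go
package uninstall

import (
	"context"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// deleteAndWait issues the delete, then polls until the object is truly
// gone, since Rook's finalizer holds the CR until its dependent PVs and
// buckets have been cleaned up.
func deleteAndWait(ctx context.Context, c client.Client, obj client.Object) error {
	if err := c.Delete(ctx, obj); err != nil && !apierrors.IsNotFound(err) {
		return err
	}
	key := client.ObjectKeyFromObject(obj)
	// Poll every 5s, for up to 10m, until the CR's finalizer releases it.
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, 10*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			err := c.Get(ctx, key, obj)
			if apierrors.IsNotFound(err) {
				return true, nil // fully removed; safe to move to the next step
			}
			return false, err // still present (err == nil) or a real error
		})
}
```

Calling this for the CephBlockPool, CephObjectStore, and CephFilesystem CRs first, and only afterwards deleting the StorageClasses and the CephCluster CR, would match the ordering described above.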

@openshift-ci
Contributor

openshift-ci bot commented Jul 17, 2023

@malayparida2000: The following test failed. Say /retest to rerun all failed tests, or /retest-required to rerun all mandatory failed tests:

Test name: ci/prow/ocs-operator-bundle-e2e-aws
Commit: 459e650
Required: true
Rerun command: /test ocs-operator-bundle-e2e-aws

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@malayparida2000 malayparida2000 deleted the cluster_clean_uninstall branch September 19, 2023 08:56