csi: allow force disabling holder pods #13890

Merged
merged 1 commit into rook:master from the csi-disable-holder branch on Mar 14, 2024

Conversation

BlaineEXE (Member)

Add new CSI_DISABLE_HOLDER_PODS option for rook-ceph-operator. This option will disable holder pods when set to "true".

In the long term, Rook plans to deprecate the holder pods entirely. This new option will allow users to choose to migrate their clusters to non-holder clusters when they are ready and able, giving them time to gracefully migrate before the holders are permanently removed.

This option is set to "false" by default so that upgrading users don't have their CSI pods modified unexpectedly.
Example manifests are modified to set this value to true so that new clusters will not deploy holder pods.

Migrating users are provided with documentation to instruct them about the new requirements they need to satisfy to successfully remove holder pods, a procedure for migrating pods from holder to non-holder mounts, and a way to delete holder pods once they are no longer in use.

When users set CSI_DISABLE_HOLDER_PODS="true", the CSI controller will no longer deploy or update the holder pod DaemonSets, but it does not delete any existing DaemonSets. This allows already-attached PVCs to continue operating normally, with their network connections continuing to exist in the current holder pods. This is critical to avoid causing a cluster-wide storage outage.
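For illustration, a minimal sketch of how a user could opt in through the operator config ConfigMap follows. The ConfigMap name and namespace are taken from Rook's example manifests and may differ in a given install; other operator settings are omitted:

# Sketch only: opting in to disabling holder pods via the operator config.
# ConfigMap name/namespace follow Rook's example manifests; merge with existing settings.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-operator-config
  namespace: rook-ceph
data:
  # "true" disables holder pods; the default stays "false" so existing clusters
  # keep their current CSI pods until the admin chooses to migrate.
  CSI_DISABLE_HOLDER_PODS: "true"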

More info: #13055

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Comment on lines +231 to +244
// is holder enabled for this cluster?
thisHolderEnabled := (!csiHostNetworkEnabled || cluster.Spec.Network.IsMultus()) && !csiDisableHolders

// Do we have a multus cluster or csi host network disabled?
// If so deploy the plugin holder with the fsid attached
if thisHolderEnabled {
	logger.Debugf("cluster %q: deploying the ceph-csi plugin holder", cluster.Name)
	r.clustersWithHolder = append(r.clustersWithHolder, ClusterDetail{cluster: &cephClusters.Items[i], clusterInfo: clusterInfo})

	// holder pods are enabled globally if any cluster needs a holder pod
	holderEnabled = true
BlaineEXE (Member, Author)

Logic for manipulating the global holderEnabled config has moved here, up from spec.go.

Comment on lines -422 to -429
holderEnabled = !CSIParam.EnableCSIHostNetwork

for i := range r.clustersWithHolder {
	if r.clustersWithHolder[i].cluster.Spec.Network.IsMultus() {
		holderEnabled = true
		break
	}
}
BlaineEXE (Member, Author)

This logic has moved up into the parent controller.go logic here: https://github.com/rook/rook/pull/13890/files#r1515280100

@BlaineEXE force-pushed the csi-disable-holder branch 4 times, most recently from 25f0485 to 0fe41fa on March 8, 2024 22:55
@BlaineEXE marked this pull request as ready for review on March 8, 2024 22:55
@BlaineEXE (Member, Author)

I think I still have a few unit test failures that I need to resolve, and there are still some doc updates I need to make. However, I believe the code is good and the docs are mostly complete.

I have been unable to test CephFS due to the issue reported here: #13739, but the RBD-related testing I have done has gone well.

Note that testing on single-node Minikube is much more forgiving than testing on multi-node Minikube.

@BlaineEXE (Member, Author)

I think the fix to make sure these CI tests pass is to have them set CSI_DISABLE_HOLDER_PODS: "false".

PendingReleaseNotes.md — review thread resolved
@BlaineEXE force-pushed the csi-disable-holder branch 4 times, most recently from 716c50c to a387e59 on March 11, 2024 22:11
@Madhu-1 (Member) left a comment

Changes LGTM, left some comments.

Documentation/CRDs/Cluster/network-providers.md — 6 review threads, outdated and resolved
deploy/charts/rook-ceph/templates/configmap.yaml — outdated, resolved
pkg/operator/ceph/csi/cluster_config.go — outdated, resolved

// if holder pods were disabled, the controller needs to update the configmap for each
// cephcluster to remove the net namespace file path
err = SaveCSIDriverOptions(r.context.Clientset, cluster.Namespace, clusterInfo)
Member

SaveClusterConfig will not remove the netnamespace from the configmap (from my code reading — I could be wrong, but we need to check on this).

We have multiple functions acting on the same configmap in different ways, which can easily introduce bugs. We need to refactor this (not as part of this PR, but later for sure).

BlaineEXE (Member, Author)

I added updateNetNamespaces() to that function so I could use it here.


@@ -341,6 +343,7 @@ func updateCSIDriverOptions(curr, clusterKey string,
}
}

updateNetNamespaceFilePath(clusterKey, cc)
BlaineEXE (Member, Author)

Added to SaveClusterConfig here.
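To illustrate the intended end state: once the net namespace file path is cleared for a cluster, its entry in the CSI config ConfigMap would look roughly like the sketch below. The ConfigMap name, data key, and JSON field names here are assumptions based on Rook's and ceph-csi's usual CSI cluster-config conventions, not copied from this PR's diff:

# Illustration only: ConfigMap/key/field names are assumptions, not taken from this PR.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-csi-config
  namespace: rook-ceph
data:
  csi-cluster-config-json: |-
    [
      {
        "clusterID": "rook-ceph",
        "monitors": ["10.96.12.34:6789"],
        "rbd": { "netNamespaceFilePath": "" },
        "cephFS": { "netNamespaceFilePath": "" }
      }
    ]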


mergify bot commented Mar 12, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @BlaineEXE please rebase it. https://rook.io/docs/rook/latest/Contributing/development-flow/#updating-your-fork

Comment on lines 1515 to 1523
- name: allow holder pod deployment
  run: sed -i "s|CSI_DISABLE_HOLDER_PODS|# CSI_DISABLE_HOLDER_PODS|g" "deploy/examples/operator.yaml"

BlaineEXE (Member, Author)

For now, let's allow holder pods to keep the CI working. This will make sure legacy behavior is preserved. Let's follow up in a new PR to ensure the new behavior works in greenfield deployments, beyond the manual testing I've done.

If the scenario does not apply, skip ahead to the
[Disabling Holder Pods](#disabling-holder-pods) section below.

**Step 1**
BlaineEXE (Member, Author)

Added simple "Step #" headers in response to a note from Subham about having a hard time following.

@BlaineEXE force-pushed the csi-disable-holder branch 2 times, most recently from 2f22d2b to 30e301a on March 12, 2024 20:10
@subhamkrai (Contributor) left a comment

LGTM, my questions are answered. Thanks

@travisn (Member) left a comment

Just a minor suggestion on the upgrade guide

CSI "holder" pods are frequently reported objects of confusion and struggle in Rook. Because of
this, they are being deprecated and will be removed in Rook v1.16.

If there are any CephClusters that use `network.provider: "multus"`, or if the operator config
Member

The default is to use host networking, so we don't expect many users to be affected, right? Should we add a note that they would only be affected if they changed those default network settings?

Suggested change
If there are any CephClusters that use `network.provider: "multus"`, or if the operator config
If there are any CephClusters that use the non-default network setting `network.provider: "multus"`, or if the operator config

BlaineEXE (Member, Author)

Seems like a good note. Added 👍

@Madhu-1 (Member) left a comment

LGTM

# Deprecation note: Rook uses "holder" pods to allow CSI to connect to the multus public network
# without needing hosts to the network. Holder pods are being deprecated. See issue for details:
# https://github.com/rook/rook/issues/13055. New Rook deployments should set this to "true".
CSI_DISABLE_HOLDER_PODS: "true"
Member

Should we call it HOLDER_PODS, HOLDER_DAEMONSET, or just HOLDER?

BlaineEXE (Member, Author)

I thought about this and settled on "holder pods" (versus "daemonset" or just "holders"), since most users seem to refer to the pods themselves and take a pod-centric view.

Add new CSI_DISABLE_HOLDER_PODS option for rook-ceph-operator.
This option will disable holder pods when set to "true".

In the long term, Rook plans to deprecate the holder pods entirely.
This new option will allow users to choose to migrate their clusters to
non-holder clusters when they are ready and able, giving them time to
gracefully migrate before the holders are permanently removed.

This option is set to "false" by default so that upgrading users don't
have their CSI pods modified unexpectedly.
Example manifests are modified to set this value to true so that new
clusters will not deploy holder pods.

Migrating users are provided with documentation to instruct them about
the new requirements they need to satisfy to successfully remove holder
pods, a procedure for migrating pods from holder to non-holder mounts,
and a way to delete holder pods once they are no longer in use.

When users set CSI_DISABLE_HOLDER_PODS="true", the CSI controller will
no longer deploy or update the holder pod Daemonsets, but it does not
delete any existing Daemonsets. This allows already-attached PVCs to
continue operating normally with their network connection continuing to
exist in the current holder pod. This is critical to avoid causing
a cluster-wide storage outage.

More info: rook#13055

Signed-off-by: Blaine Gardner <blaine.gardner@ibm.com>
@BlaineEXE merged commit e41366b into rook:master on Mar 14, 2024
50 of 51 checks passed
@BlaineEXE deleted the csi-disable-holder branch on March 14, 2024 16:27
BlaineEXE added a commit to BlaineEXE/ocs-operator that referenced this pull request Mar 14, 2024
When deploying new StorageClusters, ocs-operator should apply the new
Rook operator config `CSI_REMOVE_HOLDER_PODS: "true"`. This reflects
the new config and default value that Rook specifies in example
manifests here: rook/rook#13890

Signed-off-by: Blaine Gardner <blaine.gardner@ibm.com>