csi: allow force disabling holder pods #13890

Merged
merged 1 commit into rook:master from the csi-disable-holder branch on Mar 14, 2024

Conversation

BlaineEXE (Member)

Add new CSI_DISABLE_HOLDER_PODS option for rook-ceph-operator. This option will disable holder pods when set to "true".

In the long term, Rook plans to deprecate the holder pods entirely. This new option will allow users to choose to migrate their clusters to non-holder clusters when they are ready and able, giving them time to gracefully migrate before the holders are permanently removed.

This option is set to "false" by default so that upgrading users don't have their CSI pods modified unexpectedly.
Example manifests are modified to set this value to true so that new clusters will not deploy holder pods.

Migrating users are provided with documentation to instruct them about the new requirements they need to satisfy to successfully remove holder pods, a procedure for migrating pods from holder to non-holder mounts, and a way to delete holder pods once they are no longer in use.

When users set CSI_DISABLE_HOLDER_PODS="true", the CSI controller will no longer deploy or update the holder pod DaemonSets, but it does not delete any existing DaemonSets. This allows already-attached PVCs to continue operating normally, with their network connections continuing to exist in the current holder pods. This is critical to avoid causing a cluster-wide storage outage.
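For illustration, a minimal sketch of how a user could opt in through the operator config ConfigMap follows. The ConfigMap name and namespace are taken from Rook's example manifests and may differ in a given install; other operator settings are omitted:

# Sketch only: opting in to disabling holder pods via the operator config.
# ConfigMap name/namespace follow Rook's example manifests; merge with existing settings.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-operator-config
  namespace: rook-ceph
data:
  # "true" disables holder pods; the default stays "false" so existing clusters
  # keep their current CSI pods until the admin chooses to migrate.
  CSI_DISABLE_HOLDER_PODS: "true"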

More info: #13055

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

Comment on lines +231 to +244
// is holder enabled for this cluster?
thisHolderEnabled := (!csiHostNetworkEnabled || cluster.Spec.Network.IsMultus()) && !csiDisableHolders

// Do we have a multus cluster or csi host network disabled?
// If so deploy the plugin holder with the fsid attached
if thisHolderEnabled {
	logger.Debugf("cluster %q: deploying the ceph-csi plugin holder", cluster.Name)
	r.clustersWithHolder = append(r.clustersWithHolder, ClusterDetail{cluster: &cephClusters.Items[i], clusterInfo: clusterInfo})

	// holder pods are enabled globally if any cluster needs a holder pod
	holderEnabled = true
BlaineEXE (Member, Author)

Logic for manipulating the global holderEnabled config has moved here, up from spec.go.

Comment on lines -422 to -429
holderEnabled = !CSIParam.EnableCSIHostNetwork

for i := range r.clustersWithHolder {
	if r.clustersWithHolder[i].cluster.Spec.Network.IsMultus() {
		holderEnabled = true
		break
	}
}
BlaineEXE (Member, Author)

This logic has moved up into the parent controller.go logic here: https://github.com/rook/rook/pull/13890/files#r1515280100

@BlaineEXE force-pushed the csi-disable-holder branch 4 times, most recently from 25f0485 to 0fe41fa on March 8, 2024 22:55
@BlaineEXE marked this pull request as ready for review on March 8, 2024 22:55
@BlaineEXE (Member, Author)

I think I still have a few unit test failures that I need to resolve, and there are still some doc updates I need to make. However, I believe the code is good and the docs are mostly complete.

I have been unable to test CephFS due to the issue reported here: #13739, but the RBD-related testing I have done has gone well.

Note that testing on single-node Minikube is much more forgiving than testing on multi-node Minikube.

@BlaineEXE (Member, Author)

I think the fix to make sure these CI tests pass is to have them set CSI_DISABLE_HOLDER_PODS: "false".

PendingReleaseNotes.md — review thread resolved
@BlaineEXE force-pushed the csi-disable-holder branch 4 times, most recently from 716c50c to a387e59 on March 11, 2024 22:11
@Madhu-1 (Member) left a comment

Changes LGTM, left some comments.

Documentation/CRDs/Cluster/network-providers.md — 6 review threads, outdated and resolved
deploy/charts/rook-ceph/templates/configmap.yaml — outdated, resolved
pkg/operator/ceph/csi/cluster_config.go — outdated, resolved

// if holder pods were disabled, the controller needs to update the configmap for each
// cephcluster to remove the net namespace file path
err = SaveCSIDriverOptions(r.context.Clientset, cluster.Namespace, clusterInfo)
Member

SaveClusterConfig will not remove the netnamespace from the configmap (from my code reading — I could be wrong, but we need to check on this).

We have multiple functions acting on the same configmap in different ways, which can easily introduce bugs. We need to refactor this (not as part of this PR, but later for sure).

BlaineEXE (Member, Author)

I added updateNetNamespaces() to that function so I could use it here.


@@ -341,6 +343,7 @@ func updateCSIDriverOptions(curr, clusterKey string,
}
}

updateNetNamespaceFilePath(clusterKey, cc)
BlaineEXE (Member, Author)

Added to SaveClusterConfig here.
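To illustrate the intended end state: once the net namespace file path is cleared for a cluster, its entry in the CSI config ConfigMap would look roughly like the sketch below. The ConfigMap name, data key, and JSON field names here are assumptions based on Rook's and ceph-csi's usual CSI cluster-config conventions, not copied from this PR's diff:

# Illustration only: ConfigMap/key/field names are assumptions, not taken from this PR.
apiVersion: v1
kind: ConfigMap
metadata:
  name: rook-ceph-csi-config
  namespace: rook-ceph
data:
  csi-cluster-config-json: |-
    [
      {
        "clusterID": "rook-ceph",
        "monitors": ["10.96.12.34:6789"],
        "rbd": { "netNamespaceFilePath": "" },
        "cephFS": { "netNamespaceFilePath": "" }
      }
    ]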


mergify bot commented Mar 12, 2024

This pull request has merge conflicts that must be resolved before it can be merged. @BlaineEXE please rebase it. https://rook.io/docs/rook/latest/Contributing/development-flow/#updating-your-fork

Comment on lines 1515 to 1523
- name: allow holder pod deployment
  run: sed -i "s|CSI_DISABLE_HOLDER_PODS|# CSI_DISABLE_HOLDER_PODS|g" "deploy/examples/operator.yaml"

BlaineEXE (Member, Author)

For now, let's allow holder pods to keep the CI working. This will make sure legacy behavior is preserved. Let's follow up in a new PR to ensure the new behavior works in greenfield deployments, beyond the manual testing I've done.

If the scenario does not apply, skip ahead to the
[Disabling Holder Pods](#disabling-holder-pods) section below.

**Step 1**
BlaineEXE (Member, Author)

Added simple "Step #" headers in response to a note from Subham about having a hard time following.

@BlaineEXE force-pushed the csi-disable-holder branch 2 times, most recently from 2f22d2b to 30e301a on March 12, 2024 20:10
@subhamkrai (Contributor) left a comment

LGTM, my questions are answered. Thanks

@travisn (Member) left a comment

Just a minor suggestion on the upgrade guide

CSI "holder" pods are frequently reported objects of confusion and struggle in Rook. Because of
this, they are being deprecated and will be removed in Rook v1.16.

If there are any CephClusters that use `network.provider: "multus"`, or if the operator config
Member

The default is to use host networking, so we don't expect many users to be affected, right? Should we add a note that they would only be affected if they changed those default network settings?

Suggested change
If there are any CephClusters that use `network.provider: "multus"`, or if the operator config
If there are any CephClusters that use the non-default network setting `network.provider: "multus"`, or if the operator config

BlaineEXE (Member, Author)

Seems like a good note. Added 👍

@Madhu-1 (Member) left a comment

LGTM

# Deprecation note: Rook uses "holder" pods to allow CSI to connect to the multus public network
# without needing hosts to the network. Holder pods are being deprecated. See issue for details:
# https://github.com/rook/rook/issues/13055. New Rook deployments should set this to "true".
CSI_DISABLE_HOLDER_PODS: "true"
Member

Should we call it HOLDER_PODS, HOLDER_DAEMONSET, or just HOLDER?

BlaineEXE (Member, Author)

I thought about this and settled on "holder pods" (versus "daemonset" or just "holders"), since most users seem to refer to the pods themselves and take a pod-centric view.

Add new CSI_DISABLE_HOLDER_PODS option for rook-ceph-operator.
This option will disable holder pods when set to "true".

In the long term, Rook plans to deprecate the holder pods entirely.
This new option will allow users to choose to migrate their clusters to
non-holder clusters when they are ready and able, giving them time to
gracefully migrate before the holders are permanently removed.

This option is set to "false" by default so that upgrading users don't
have their CSI pods modified unexpectedly.
Example manifests are modified to set this value to true so that new
clusters will not deploy holder pods.

Migrating users are provided with documentation to instruct them about
the new requirements they need to satisfy to successfully remove holder
pods, a procedure for migrating pods from holder to non-holder mounts,
and a way to delete holder pods once they are no longer in use.

When users set CSI_DISABLE_HOLDER_PODS="true", the CSI controller will
no longer deploy or update the holder pod Daemonsets, but it does not
delete any existing Daemonsets. This allows already-attached PVCs to
continue operating normally with their network connection continuing to
exist in the current holder pod. This is critical to avoid causing
a cluster-wide storage outage.

More info: rook#13055

Signed-off-by: Blaine Gardner <blaine.gardner@ibm.com>
@BlaineEXE merged commit e41366b into rook:master on Mar 14, 2024
50 of 51 checks passed
@BlaineEXE deleted the csi-disable-holder branch on March 14, 2024 16:27
BlaineEXE added a commit to BlaineEXE/ocs-operator that referenced this pull request Mar 14, 2024
When deploying new StorageClusters, ocs-operator should apply the new
Rook operator config `CSI_REMOVE_HOLDER_PODS: "true"`. This reflects
the new config and default value that Rook specifies in example
manifests here: rook/rook#13890

Signed-off-by: Blaine Gardner <blaine.gardner@ibm.com>