
csi: fix missing namespace bug in csi cluster config map #14154

Merged

Conversation


@BlaineEXE BlaineEXE commented May 2, 2024

Someone testing the Multus holder pod removal feature encountered an issue where the migration process failed to lead to a system state where PVCs could be created successfully.

The root cause was found to be a ceph csi config map wherein the primary CephCluster entry was lacking a value for the "namespace" field.

I observed this once in my development on the holder pod removal feature, but I was unable to reproduce it and assumed it was my own error. Since this has now been seen in a user environment, the error must be a race condition, and I am unable to determine the exact source of the bug.

I do not believe this bug would be present if the code that updates the CSI configmap were properly idempotent, but it has many conditions based on prior states, and I was unable to determine how to resolve this underlying implementation pattern issue.

Instead, I opted to separate the clusterKey parameter into two clear parts:

  1. clusterID for when clusterKey is used as an analogue for clusterID
  2. clusterNamespace for when clusterKey is used as an analogue for clusterNamespace

I added unit tests to ensure that SaveClusterConfig() can detect when the namespace is missing and, using the new clusterNamespace parameter, always knows what value to use when correcting the bug in already-installed systems.
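
To illustrate the repair behavior, here is a minimal, self-contained Go sketch. It uses a simplified entry type and a hypothetical fixMissingNamespace() helper, not the actual Rook types or the internals of SaveClusterConfig():

```go
package main

import (
	"encoding/json"
	"fmt"
)

// csiClusterConfigEntry is a pared-down stand-in for the real config map entry.
type csiClusterConfigEntry struct {
	ClusterID string `json:"clusterID"`
	Namespace string `json:"namespace"`
}

// fixMissingNamespace fills in an empty "namespace" field on the entry whose
// clusterID matches, which is the repair needed for already-installed systems.
func fixMissingNamespace(entries []csiClusterConfigEntry, clusterID, clusterNamespace string) []csiClusterConfigEntry {
	for i := range entries {
		if entries[i].ClusterID == clusterID && entries[i].Namespace == "" {
			entries[i].Namespace = clusterNamespace
		}
	}
	return entries
}

func main() {
	// Same shape as the buggy config map data shown further below: namespace is empty.
	raw := `[{"clusterID":"rook-ceph","namespace":""}]`
	var entries []csiClusterConfigEntry
	if err := json.Unmarshal([]byte(raw), &entries); err != nil {
		panic(err)
	}
	out, _ := json.Marshal(fixMissingNamespace(entries, "rook-ceph", "rook-ceph"))
	fmt.Println(string(out)) // [{"clusterID":"rook-ceph","namespace":"rook-ceph"}]
}
```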

I also verified that this update works when the function simultaneously removes netNamespaceFilePath entries, and that those entries are removed properly.

Finally, manual testing also verifies the change.


If users find that PVCs don't work after following the steps to remove Multus holder pods, they should upgrade to the upcoming Rook v1.14.3 to get this bug fix, which should resolve the issue. Users can determine whether they are affected by this bug at any time using this command:

kubectl -n rook-ceph get cm rook-ceph-csi-config -oyaml
apiVersion: v1
data:
  csi-cluster-config-json: '[{"clusterID":"rook-ceph","monitors":["10.104.66.98:6789"],"cephFS":{"netNamespaceFilePath":"","subvolumeGroup":"","kernelMountOptions":"","fuseMountOptions":""},"rbd":{"netNamespaceFilePath":"","radosNamespace":""},"nfs":{"netNamespaceFilePath":""},"readAffinity":{"enabled":false,"crushLocationLabels":["kubernetes.io/hostname","topology.kubernetes.io/region","topology.kubernetes.io/zone","topology.rook.io/chassis","topology.rook.io/rack","topology.rook.io/row","topology.rook.io/pdu","topology.rook.io/pod","topology.rook.io/room","topology.rook.io/datacenter"]},"namespace":""}]'
kind: ConfigMap
metadata:
  creationTimestamp: "2024-05-02T19:37:37Z"
  name: rook-ceph-csi-config
  namespace: rook-ceph
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: false
    controller: true
    kind: Deployment
    name: rook-ceph-operator
    uid: 7359139c-4149-4fba-8096-d77bfe018800
  resourceVersion: "2750"
  uid: a91c133b-5883-4745-9c0d-f2fc864b8b03

In the example output above, notice that the final entry in the config data shows "namespace":"". This means the bug is present in this cluster. Users should not follow the steps to disable holder pods until the issue is resolved. Users can manually resolve the issue by editing the configmap and inserting the CephCluster namespace into the value, like this: "namespace":"rook-ceph" (assuming the namespace is rook-ceph).

Checklist:

  • Commit Message Formatting: Commit titles and messages follow guidelines in the developer guide.
  • Reviewed the developer guide on Submitting a Pull Request
  • Pending release notes updated with breaking and/or notable changes for the next minor release.
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.

-centry.ClusterID = clusterKey
-centry.Namespace = newCsiClusterConfigEntry.Namespace
+centry.ClusterID = clusterID
+centry.Namespace = clusterNamespace
BlaineEXE (Member, author):

I can't confirm this, but I suspect that the root cause of this issue is that there is a place in the code where (currently or in the past) newCsiClusterConfigEntry.Namespace was empty, which led the CSI configmap to have an empty value for the field.

Then, because this value isn't modified or updated after the initial creation, the issue persisted forever.

This behavior is non-idempotent, and I want to stress that operator code should always be idempotent (and simple) with very few exceptions. Non-idempotent code in operators is prone to bugs like these.

The only exception should be when an idempotent implementation adds significant latency (more than a second) to reconcile loops.

I would suggest that the SaveClusterConfig() method should be reworked to focus on simple, idempotent logic.

I also notice that this function is doing two (or more) jobs, which also contributes to complexity and risk of bugs. I suggest splitting it into SaveClusterConfig() and RemoveClusterConfig() so that the add and remove behaviors don't add unnecessary test cases (and room for bugs).
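
As a rough sketch of what that split could look like, reusing the simplified csiClusterConfigEntry type from the sketch above (the function names are suggestions, not the current Rook API):

```go
// saveEntry adds or overwrites the entry for a cluster ID. Writing the full
// desired entry unconditionally makes the result independent of prior state.
func saveEntry(entries []csiClusterConfigEntry, desired csiClusterConfigEntry) []csiClusterConfigEntry {
	for i := range entries {
		if entries[i].ClusterID == desired.ClusterID {
			entries[i] = desired
			return entries
		}
	}
	return append(entries, desired)
}

// removeEntry handles deletion on its own, so the save path never needs
// remove-specific conditions (and vice versa).
func removeEntry(entries []csiClusterConfigEntry, clusterID string) []csiClusterConfigEntry {
	out := make([]csiClusterConfigEntry, 0, len(entries))
	for _, e := range entries {
		if e.ClusterID != clusterID {
			out = append(out, e)
		}
	}
	return out
}
```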

I also want to remind everyone that even though I believe I may be fixing the root cause by making this particular change, this won't resolve the issue in already-installed clusters.

Member:

I also want to remind everyone that even though I believe I may be fixing the root cause by making this particular change, this won't resolve the issue in already-installed clusters.

This won't be resolved by the change on lines 169-175 above?

BlaineEXE (Member, author):

Lines 169-175 fix the problem if it exists in an already-installed cluster, but they don't fix the root cause of the bug.

I think (but I'm not very certain) that this might be the root cause fix. However, this line's fix only applies when the entry is first created, so it won't do anything for brownfield clusters that have already experienced the error.

This is also helping to illustrate my point about creating proper idempotency in reconciliations. The code used here to create the first instance is different from the code used above to update existing instances. When there are two branch flows for create and update, any discrepancy between the two can lead to bugs like this.
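
To make the hazard concrete, here is a contrived Go illustration (not the actual Rook code) of how separate create and update branches can drift, again using the simplified entry type from the earlier sketch:

```go
// buggyReconcile shows the anti-pattern: the create branch sets Namespace, but
// the update branch never touches it, so an empty value written once in the
// past is never repaired on later reconciles.
func buggyReconcile(entries []csiClusterConfigEntry, clusterID, clusterNamespace string) []csiClusterConfigEntry {
	for i := range entries {
		if entries[i].ClusterID == clusterID {
			// update branch: Namespace is left as-is
			return entries
		}
	}
	// create branch: only this path sets Namespace
	return append(entries, csiClusterConfigEntry{ClusterID: clusterID, Namespace: clusterNamespace})
}
```

Computing the full desired entry once and writing it on both paths, as in the saveEntry sketch above, removes the possibility of this kind of drift.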

@BlaineEXE BlaineEXE marked this pull request as ready for review May 2, 2024 19:48

@@ -302,11 +310,9 @@ func updateCsiConfigMapOwnerRefs(ctx context.Context, namespace string, clientse
// SaveClusterConfig updates the config map used to provide ceph-csi with
// basic cluster configuration. The clusterNamespace and clusterInfo are
// used to determine what "cluster" in the config map will be updated and
// the clusterNamespace value is expected to match the clusterID
// value that is provided to ceph-csi uses in the storage class.
// The locker l is typically a mutex and is used to prevent the config
Member:

Independently of this PR... Is "locker l" obsolete? I don't see it here anymore. Or it must be referring to the package var configMutex. Maybe we can just remove this comment.

BlaineEXE (Member, author):

Yeah, that makes sense too 👍

BlaineEXE (Member, author):

I tested this locally by:

  1. install rook
  2. manually edit the rook-ceph-csi-config to remove the namespace value
  3. restart the rook operator
  4. observe that the namespace is re-added to the cluster config in the configmap

@BlaineEXE BlaineEXE force-pushed the csi-config-map-fix-empty-namespace branch from 61d13e7 to 4ff8812 on May 2, 2024 19:58
@@ -178,7 +186,7 @@ func updateCsiClusterConfig(curr, clusterKey string, newCsiClusterConfigEntry *C
}
}
for i, centry := range cc {
if centry.ClusterID == clusterKey {
Madhu-1 (Member) commented May 3, 2024:

@BlaineEXE I have a couple of other alternatives:

  • Can we move the above fix from 169 to 176 here? It ensures that it works for all clusters, even in the case where newCsiClusterConfigEntry is also nil.
  • We can have an empty-namespace check along with the clusterID check here, changing
    if cc[i].Namespace == clusterNamespace {
    to
    if cc[i].Namespace == clusterNamespace || (cc[i].Namespace == "" && cc[i].ClusterID == clusterNamespace) {
        ...
    }

I also think there might be an issue with the IsHolderEnabled() check in the updateNetNamespaceFilePath function. This is because other controllers use this function to update the configmap, and it depends on the holder variable set by the CSI controller. We can look at it later.

The current change looks good to me, but it involves more changes; we could also go with option 3, which is a one-line change. I will let you and @travisn decide on it.

BlaineEXE (Member, author):

I think I'd prefer to keep the "fix" separate since it can then log when it is fixing the condition. I'd like to be able to look for that log line in must-gathers from any future multus issues that users find.

As for your comment here...

I also think there might be an issue with the IsHolderEnabled() check in the updateNetNamespaceFilePath function. This is because other controllers use this function to update the configmap, and it depends on the holder variable set by the CSI controller. We can look at it later.

I tend to agree. Having IsHolderEnabled() use a global variable seems risky to me. I considered changing that when I implemented the multus changes several weeks back, but I found that it required more code changes than seemed reasonable, and I also didn't have high confidence that I wouldn't introduce new bugs by making the changes. Ultimately, since we are going to remove holder pods entirely from the code base in a couple releases, I decided that it was best to leave it as-is.

From a technical standpoint, I also don't believe users will see any race conditions related to the global variable, for reasons below.

When the CSI controller sees a change in multus/holder enable state, it updates the global variable before updating the csi config and before making pod changes. Unless I missed something (always possible), I don't believe there are race conditions in the CSI controller itself. Additionally, the rados namespace and subvolume reconcilers don't use the global variable when deciding to set the netns file path, so they should be immune to race conditions related to the global var.

The only case where I think a race condition could exist is if a user has multiple multus-enabled CephClusters. However, I don't think Rook actually supports that case because it can't apply multiple sets of network selection annotations to the CSI pods. If any users were trying to do that today, I think we would see a bug report from them.

All of that said, there are no guarantees that users won't see bugs, but I have done as much investigation and due diligence as possible to ensure that users won't have issues.

@@ -1112,7 +1112,9 @@ func (c *Cluster) saveMonConfig() error {
Monitors: monEndpoints,
},
}
if err := csi.SaveClusterConfig(c.context.Clientset, c.Namespace, c.ClusterInfo, csiConfigEntry); err != nil {

clusterId := c.Namespace // cluster id is same as cluster namespace for CephClusters
Member:

For a rados namespace it can be different...

BlaineEXE (Member, author):

This is the mon code, though. The rados namespace code uses a separate clusterID and namespace.

@BlaineEXE BlaineEXE merged commit 1af97d0 into rook:master May 3, 2024
53 checks passed
@BlaineEXE BlaineEXE deleted the csi-config-map-fix-empty-namespace branch May 3, 2024 18:18
mergify bot added a commit that referenced this pull request May 3, 2024
csi: fix missing namespace bug in csi cluster config map (backport #14154)