CSI: set pod anti affinity to provisioner pod #5462
Conversation
@travisn PTAL. If the approach looks good I will go ahead and make the required changes and test this out.
@Madhu-1 I'm thinking we should always set the RequiredDuringSchedulingIgnoredDuringExecution
and there is no need for it to be configurable. Any production cluster will always need at least three nodes anyway, and we're only requiring two nodes. And test clusters (minikube) can just have one provisioner instance stay in pending. It shouldn't hurt anything, right?
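The always-on anti-affinity described above would look roughly like this on the provisioner deployment's pod template (a sketch only; the `app` label value is an assumption for illustration and must match the labels the provisioner pods actually carry):

```yaml
# Sketch: required pod anti-affinity so no two provisioner pods land on
# the same node. The label selector value is assumed for illustration.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: csi-cephfsplugin-provisioner
              topologyKey: kubernetes.io/hostname
```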
Yes, it should not hurt anything; I will make the changes and update the PR. Thanks for the feedback @travisn
@travisn do we need an option to enable and disable this one?
I don't see a need to disable the anti-affinity. The only downside I see is that minikube will have a pod stay pending forever. IMO we can live with that in minikube.
pkg/operator/k8sutil/deployment.go
Outdated
_, err := clientset.AppsV1().Deployments(namespace).Create(dep)
if err != nil {
	if k8serrors.IsAlreadyExists(err) {
		_, err = clientset.AppsV1().Deployments(namespace).Update(dep)
		// deleting the deployment if it already exists to avoid issues
@travisn the pod will stay in a pending state if I don't delete and recreate the deployment; let me know if there is any other way to fix it without deleting and recreating.
[🎩︎]mrajanna@localhost rook $]kuberc get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
csi-cephfsplugin-fh5f4 3/3 Running 0 12m 192.168.121.14 worker0 <none> <none>
csi-cephfsplugin-provisioner-59b9cb4464-s6gvv 5/5 Running 0 12m 10.44.0.10 worker1 <none> <none>
csi-cephfsplugin-provisioner-59b9cb4464-w96ld 5/5 Running 0 11m 10.36.0.3 worker0 <none> <none>
csi-cephfsplugin-provisioner-6f8fd7854f-qpn6h 0/5 Pending 0 5m51s <none> <none> <none> <none>
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity/anti-affinity.
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity/anti-affinity.
This is when you were testing on a single-node cluster? Or did the CI hit an issue during upgrade? I'd rather not do something special for upgrade on a single-node cluster since we really don't need to support upgrades on these clusters. Perhaps a simple solution is for the operator to query the number of nodes in the cluster. If 1, set the replica of the deployment to 1 instead of 2. This won't be perfect, for example, if there are taints on nodes, but it will keep the scenario of single-node cluster simple and then we won't see pending pods either.
This is a multi-node cluster; you can check the node names for the pods in the list above. The pods were in a pending state even in a multi-node cluster.
Interesting, the deployment must be trying to bring up a new pod before deleting the old one, so the anti-affinity can't be satisfied. There may be a different upgrade setting on the deployment that would allow for old pods to be deleted before starting the new ones. But we may need to keep the delete-and-create strategy. In that case we would want to make sure we only update if there is a change in the pod spec. See this example.
Sure, I can add this check so it only does delete-and-create if there is a difference.
@travisn PTAL. If the changes look good I will do the final testing and remove the DNM and WIP labels from this PR.
pkg/operator/k8sutil/deployment.go
Outdated
if err != nil {
	return fmt.Errorf("failed to start %s deployment: %+v\n%+v", name, err, dep)
}
// Check whether the current deployment and the newly generated one are identical
patchResult, err := patch.DefaultPatchMaker.Calculate(currentDeployment, modifiedDeployment)
Let's make this a new method rather than affecting all callers of the k8sutil.CreateDeployment() method. Only the CSI pods need to worry about this. Other places that need to worry about comparing for the spec diff are already doing it. But by default everything else should rely on just updating the deployment and let K8s update if needed.
Currently CSI is the only consumer of this function; how about passing an extra argument? If required I can create a new function for this, but this function would be left unused.
Maybe just renaming it to CreateOrUpdateDeployment()
Instead of implementing the deletion and creation, it looks like the Recreate update strategy on the deployment may work for us. Then the pods from the previous spec will be deleted before the new ones are created.
Thanks for the pointer, this works as expected
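For reference, the Recreate strategy discussed above is a small change on the deployment spec (sketch only; the replica count is shown for context):

```yaml
# Sketch: Recreate kills all pods from the old ReplicaSet before creating
# new ones, so the required anti-affinity can be satisfied during an update.
spec:
  replicas: 2
  strategy:
    type: Recreate
```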
@@ -407,7 +407,9 @@ func StartCSIDrivers(namespace string, clientset kubernetes.Interface, ver *vers
	// apply resource request and limit to rbd provisioner containers
	applyResourcesToContainers(clientset, rbdProvisionerResource, &rbdProvisionerSTS.Spec.Template.Spec)
	k8sutil.SetOwnerRef(&rbdProvisionerSTS.ObjectMeta, ownerRef)
	err = k8sutil.CreateStatefulSet("csi-rbdplugin-provisioner", namespace, clientset, rbdProvisionerSTS)
When is the statefulset ever created? I've never seen it created instead of the deployment.
If the Kubernetes version is 1.13.x, the provisioner will be deployed as a statefulset.
If we are planning to remove support for kube 1.13.x, we can remove this as well. A lot of features are not supported in kube 1.13, for example resize, snapshot, clone, etc.
1.13 support isn't being removed from Rook yet, sounds good to keep it if needed, I just didn't remember when the statefulset was used. Would it really not work as a deployment in 1.13? Anyway, this question is really independent from this PR.
We could have used a deployment with replicas: 1; I need to check why we used a statefulset instead of a deployment.
One thing I can remember is that the suggestion or example from the sidecar was to use a statefulset, and later it moved to a deployment.
@ShyamsundarR any idea why it was a statefulset instead of a deployment?
Sometimes the kube scheduler will schedule both provisioner pods on the same node. It does not make sense to have both provisioner pods running on the same node, so we need to set the pod anti-affinity to make sure that no two provisioner pods run on the same node. Fixes: rook#5271 Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
To have parity with other functions; also, the client should be the first argument to the function. Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Updated the deployment strategy of the provisioner deployment to Recreate to avoid issues during update; the Recreate strategy will kill all existing pods before creating new ones. Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
@travisn ready for review, PTAL
For what it's worth, I just set up rook on a single-node cluster, and went down an hour-long rabbit hole trying to figure out why I have pending provisioners, assuming it was a problem.
If this is causing pain we should fix it, even if it doesn't affect the cluster health. Want to open an issue?
Description of your changes:
Sometimes the kube scheduler will schedule both provisioner pods on the same node. It doesn't make sense to have both provisioner pods running on the same node, so we need to set the pod anti-affinity to make sure that no two provisioner pods run on the same node.
Signed-off-by: Madhu Rajanna madhupr007@gmail.com
Which issue is resolved by this Pull Request:
Resolves #5271
Checklist:
make codegen has been run to update object specifications, if necessary.
[test ceph]