
CSI: set pod anti affinity to provisioner pod #5462

Merged
merged 3 commits on May 14, 2020

Conversation


@Madhu-1 Madhu-1 commented May 12, 2020

Description of your changes:
Sometimes the kube scheduler will schedule both provisioner pods on the same node. It doesn't make sense to have both provisioner pods running on the same node, so we need to set pod anti-affinity to make sure that no two provisioner pods run on the same node.

Signed-off-by: Madhu Rajanna madhupr007@gmail.com
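
For illustration, a minimal sketch of the kind of required pod anti-affinity this change adds, written against the Kubernetes API types; the helper name and the app label value are illustrative, not the exact code in this PR.

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// provisionerAntiAffinity (illustrative helper) builds a required anti-affinity
// so that no two pods carrying the given app label land on the same node.
func provisionerAntiAffinity(appLabel string) *corev1.Affinity {
	return &corev1.Affinity{
		PodAntiAffinity: &corev1.PodAntiAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: []corev1.PodAffinityTerm{
				{
					LabelSelector: &metav1.LabelSelector{
						MatchLabels: map[string]string{"app": appLabel},
					},
					// one provisioner pod per node for this label
					TopologyKey: "kubernetes.io/hostname",
				},
			},
		},
	}
}

It would then be set on the provisioner pod template, e.g. dep.Spec.Template.Spec.Affinity = provisionerAntiAffinity("csi-rbdplugin-provisioner"), where the label value is again illustrative.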

Which issue is resolved by this Pull Request:
Resolves #5271

Checklist:

  • CommitLint Bot: Commit titles and messages follow guidelines in the developer guide.
  • Skip Tests for Docs: Add the flag for skipping the build if this is only a documentation change. See here for the flag.
  • Skip Unrelated Tests: Add a flag to run tests for a specific storage provider. See test options.
  • Reviewed the developer guide on Submitting a Pull Request
  • Documentation has been updated, if necessary.
  • Unit tests have been added, if necessary.
  • Integration tests have been added, if necessary.
  • Pending release notes updated with breaking and/or notable changes, if necessary.
  • Upgrade from previous release is tested and upgrade user guide is updated, if necessary.
  • Code generation (make codegen) has been run to update object specifications, if necessary.
    [test ceph]

@Madhu-1 Madhu-1 added do-not-merge DO NOT MERGE :) WIP Work in Progress labels May 12, 2020

Madhu-1 commented May 12, 2020

@travisn PTAL. If the approach looks good, I will go ahead, make the required changes, and test this out.


@travisn travisn left a comment


@Madhu-1 I'm thinking we should always set the RequiredDuringSchedulingIgnoredDuringExecution and there is no need for it to be configurable. Any production cluster will always need at least three nodes anyway, and we're only requiring two nodes. And test clusters (minikube) can just have one provisioner instance stay in pending. It shouldn't hurt anything, right?


Madhu-1 commented May 12, 2020

Yes, it should not hurt anything. I will make the changes and update the PR. Thanks for the feedback @travisn


Madhu-1 commented May 12, 2020

@travisn do we need an option to enable and disable this one?


travisn commented May 12, 2020

@travisn do we need an option to enable and disable this one?

I don't see a need to disable the anti-affinity. The only downside I see is that minikube will have a pod stay pending forever. IMO we can live with that in minikube.

@Madhu-1 Madhu-1 changed the title [WIP] CSI: set pod anti affinity to provisioner pod CSI: set pod anti affinity to provisioner pod May 13, 2020
_, err := clientset.AppsV1().Deployments(namespace).Create(dep)
if err != nil {
if k8serrors.IsAlreadyExists(err) {
_, err = clientset.AppsV1().Deployments(namespace).Update(dep)
// delete the deployment if it already exists to avoid issues
Member Author

@travisn the pod will stay in a pending state if I don't delete and recreate the deployment. Let me know if there is any other way to fix it without deleting and recreating.

[🎩︎]mrajanna@localhost rook $]kuberc get po -owide
NAME                                            READY   STATUS      RESTARTS   AGE     IP               NODE      NOMINATED NODE   READINESS GATES
csi-cephfsplugin-fh5f4                          3/3     Running     0          12m     192.168.121.14   worker0   <none>           <none>
csi-cephfsplugin-provisioner-59b9cb4464-s6gvv   5/5     Running     0          12m     10.44.0.10       worker1   <none>           <none>
csi-cephfsplugin-provisioner-59b9cb4464-w96ld   5/5     Running     0          11m     10.36.0.3        worker0   <none>           <none>
csi-cephfsplugin-provisioner-6f8fd7854f-qpn6h   0/5     Pending     0          5m51s   <none>           <none>    <none>           <none>

 Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity/anti-affinity.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity/anti-affinity.

Member

This is when you were testing on a single-node cluster? Or did the CI hit an issue during upgrade? I'd rather not do something special for upgrade on a single-node cluster since we really don't need to support upgrades on these clusters. Perhaps a simple solution is for the operator to query the number of nodes in the cluster. If 1, set the replica of the deployment to 1 instead of 2. This won't be perfect, for example, if there are taints on nodes, but it will keep the scenario of single-node cluster simple and then we won't see pending pods either.
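
For illustration only, a rough sketch of that node-count idea (independent of what the PR ultimately does), assuming a client-go version where List takes a context; the function name is hypothetical and taints are deliberately ignored, as noted above.

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// provisionerReplicas (hypothetical helper) returns 1 replica on a single-node
// cluster and 2 otherwise, by counting the nodes in the cluster.
func provisionerReplicas(clientset kubernetes.Interface) (int32, error) {
	nodes, err := clientset.CoreV1().Nodes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		return 0, err
	}
	if len(nodes.Items) < 2 {
		return 1, nil
	}
	return 2, nil
}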

Member Author

This is a multi-node cluster; you can check the node names for the pods in the list above.
The pods were in a pending state even on a multi-node cluster.

Member

Interesting, the deployment must be trying to bring up a new pod before deleting the old one, so the anti-affinity can't be satisfied. There may be a different upgrade setting on the deployment that would allow for old pods to be deleted before starting the new ones. But we may need to keep the delete-and-create strategy. In that case we would want to make sure we only update if there is a change in the pod spec. See this example.

Member Author

Sure, I can add this check and make it delete-and-create only if there is a difference.
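
A rough sketch of that check, assuming the patch.DefaultPatchMaker helper that appears in this PR's diff is the banzaicloud k8s-objectmatcher patch package; the function name here is illustrative, not the PR's code.

import (
	"github.com/banzaicloud/k8s-objectmatcher/patch"
	appsv1 "k8s.io/api/apps/v1"
)

// deploymentChanged (illustrative helper) reports whether the newly generated
// deployment differs from the one currently running, so delete-and-create is
// only performed when there is a real spec difference.
func deploymentChanged(current, modified *appsv1.Deployment) (bool, error) {
	patchResult, err := patch.DefaultPatchMaker.Calculate(current, modified)
	if err != nil {
		return false, err
	}
	return !patchResult.IsEmpty(), nil
}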


Madhu-1 commented May 13, 2020

@travisn PTAL. If the changes look good, I will do final testing and remove the DNM and WIP labels from this PR.

if err != nil {
return fmt.Errorf("failed to start %s deployment: %+v\n%+v", name, err, dep)
// Check whether the current deployment and newly generated one are identical
patchResult, err := patch.DefaultPatchMaker.Calculate(currentDeployment, modifiedDeployment)
Member

Let's make this a new method rather than affecting all callers of the k8sutil.CreateDeployment() method. Only the CSI pods need to worry about this. Other places that need to worry about comparing for the spec diff are already doing it. But by default everything else should rely on just updating the deployment and let K8s update if needed.

Member Author

Currently CSI is the only consumer of this function; how about passing an extra argument? If required, I can create a new function for this, but then this function will be left unused.

Member

Maybe just renaming it to CreateOrUpdateDeployment()

Member

Instead of implementing the deletion and creation, it looks like the Recreate update strategy on the deployment may work for us. Then the pods from the previous spec will be removed before the new ones are created.

Member Author

Thanks for the pointer, this works as expected
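
For reference, a minimal sketch of switching a deployment to the Recreate update strategy with the Kubernetes API types; the helper name is illustrative, not the PR's exact code.

import appsv1 "k8s.io/api/apps/v1"

// setRecreateStrategy (illustrative helper): with the Recreate strategy,
// Kubernetes terminates all pods from the old ReplicaSet before creating pods
// for the new one, so the required anti-affinity can be satisfied during updates.
func setRecreateStrategy(dep *appsv1.Deployment) {
	dep.Spec.Strategy = appsv1.DeploymentStrategy{
		Type: appsv1.RecreateDeploymentStrategyType,
	}
}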


@@ -407,7 +407,9 @@ func StartCSIDrivers(namespace string, clientset kubernetes.Interface, ver *vers
// apply resource request and limit to rbd provisioner containers
applyResourcesToContainers(clientset, rbdProvisionerResource, &rbdProvisionerSTS.Spec.Template.Spec)
k8sutil.SetOwnerRef(&rbdProvisionerSTS.ObjectMeta, ownerRef)
err = k8sutil.CreateStatefulSet("csi-rbdplugin-provisioner", namespace, clientset, rbdProvisionerSTS)
Member

When is the statefulset ever created? I've never seen it created instead of the deployment.

Member Author

If the Kubernetes version is 1.13.x, the provisioner will be deployed as a statefulset.

Member Author

If we are planning to remove support for kube 1.13.x, we can remove it as well. A lot of features are not supported in kube 1.13, for example resize, snapshot, clone, etc.

Member

1.13 support isn't being removed from Rook yet, so it sounds good to keep it if needed; I just didn't remember when the statefulset was used. Would it really not work as a deployment in 1.13? Anyway, this question is really independent of this PR.

Member Author

We could have used a deployment with replicas: 1. I need to check why we used a statefulset rather than a deployment; one thing I can remember is that the suggestion or example from the sidecar was to use a statefulset, and it later moved to a deployment.

@ShyamsundarR any idea why it was a statefulset instead of a deployment?

Sometimes the kube scheduler will schedule the
provisioner pods on the same node. It does not make
sense to have both provisioner pods running on the same
node, so we need to set pod anti-affinity to make
sure that no provisioner pods run on the same
node.

Fixes: rook#5271

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
To have parity with other functions, and
also because the client should be the first argument to
the function.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Updated the deployment strategy of the provisioner
deployment to Recreate to avoid issues during
update; the Recreate strategy will kill all existing
pods before creating new ones.

Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
@Madhu-1 Madhu-1 removed WIP Work in Progress do-not-merge DO NOT MERGE :) labels May 14, 2020

Madhu-1 commented May 14, 2020

@travisn ready for review PTAL

@travisn travisn added the ceph main ceph tag label May 14, 2020

underyx commented Aug 9, 2020

For what it's worth, I just set up rook on a single-node cluster, and went down an hour-long rabbit hole trying to figure out why I have pending provisioners, assuming it was a problem.


travisn commented Aug 10, 2020

For what it's worth, I just set up rook on a single-node cluster, and went down an hour-long rabbit hole trying to figure out why I have pending provisioners, assuming it was a problem.

If this is causing pain we should fix it, even if it doesn't affect the cluster health. Want to open an issue?
