CSI: set pod anti affinity to provisioner pod #5462
Conversation
@travisn PTAL. If the approach looks good I will go ahead and make the required changes and test this out.
@Madhu-1 I'm thinking we should always set the RequiredDuringSchedulingIgnoredDuringExecution
and there is no need for it to be configurable. Any production cluster will always need at least three nodes anyway, and we're only requiring two nodes. And test clusters (minikube) can just have one provisioner instance stay in pending. It shouldn't hurt anything, right?
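The always-on anti-affinity described above would look roughly like this on the provisioner deployment's pod template (a sketch only; the `app` label value is an assumption for illustration and must match the labels the provisioner pods actually carry):

```yaml
# Sketch: required pod anti-affinity so no two provisioner pods land on
# the same node. The label selector value is assumed for illustration.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: csi-cephfsplugin-provisioner
              topologyKey: kubernetes.io/hostname
```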
Yes, it should not hurt anything; I will make the changes and update the PR. Thanks for the feedback @travisn
@travisn do we need an option to enable and disable this one?
I don't see a need to disable the anti-affinity. The only downside I see is that minikube will have a pod stay pending forever. IMO we can live with that in minikube.
pkg/operator/k8sutil/deployment.go
Outdated
_, err := clientset.AppsV1().Deployments(namespace).Create(dep)
if err != nil {
	if k8serrors.IsAlreadyExists(err) {
		_, err = clientset.AppsV1().Deployments(namespace).Update(dep)
		// deleting the deployment if it already exists to avoid issues
@travisn the pod will stay in a pending state if I don't delete and recreate the deployment; let me know if there is any other way to fix it without deleting and recreating.
[🎩︎]mrajanna@localhost rook $]kuberc get po -owide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
csi-cephfsplugin-fh5f4 3/3 Running 0 12m 192.168.121.14 worker0 <none> <none>
csi-cephfsplugin-provisioner-59b9cb4464-s6gvv 5/5 Running 0 12m 10.44.0.10 worker1 <none> <none>
csi-cephfsplugin-provisioner-59b9cb4464-w96ld 5/5 Running 0 11m 10.36.0.3 worker0 <none> <none>
csi-cephfsplugin-provisioner-6f8fd7854f-qpn6h 0/5 Pending 0 5m51s <none> <none> <none> <none>
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity/anti-affinity.
Warning FailedScheduling <unknown> default-scheduler 0/3 nodes are available: 1 node(s) had taints that the pod didn't tolerate, 2 node(s) didn't match pod affinity/anti-affinity.
This is when you were testing on a single-node cluster? Or did the CI hit an issue during upgrade? I'd rather not do something special for upgrade on a single-node cluster since we really don't need to support upgrades on these clusters. Perhaps a simple solution is for the operator to query the number of nodes in the cluster. If 1, set the replica of the deployment to 1 instead of 2. This won't be perfect, for example, if there are taints on nodes, but it will keep the scenario of single-node cluster simple and then we won't see pending pods either.
This is a multi-node cluster; you can check the node names for the pods in the list above. The pods were in a pending state even in a multi-node cluster.
Interesting, the deployment must be trying to bring up a new pod before deleting the old one, so the anti-affinity can't be satisfied. There may be a different upgrade setting on the deployment that would allow for old pods to be deleted before starting the new ones. But we may need to keep the delete-and-create strategy. In that case we would want to make sure we only update if there is a change in the pod spec. See this example.
Sure, I can add this check so it only does delete-and-create if there is a difference.
@travisn PTAL. If the changes look good I will do the final testing and remove the DNM and WIP labels from this PR.
pkg/operator/k8sutil/deployment.go
Outdated
if err != nil {
	return fmt.Errorf("failed to start %s deployment: %+v\n%+v", name, err, dep)
}
// Check whether the current deployment and the newly generated one are identical
patchResult, err := patch.DefaultPatchMaker.Calculate(currentDeployment, modifiedDeployment)
Let's make this a new method rather than affecting all callers of the k8sutil.CreateDeployment() method. Only the CSI pods need to worry about this. Other places that need to worry about comparing for the spec diff are already doing it. But by default everything else should rely on just updating the deployment and let K8s update if needed.
Currently CSI is the only consumer of this function; how about passing an extra argument? If required I can create a new function for this, but this function would be left unused.
Maybe just renaming it to CreateOrUpdateDeployment()
Instead of implementing the deletion and creation, it looks like the Recreate update strategy on the deployment may work for us. Then the pods from the previous spec will be deleted before the new ones are created.
Thanks for the pointer, this works as expected
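For reference, the Recreate strategy discussed above is a small change on the deployment spec (sketch only; the replica count is shown for context):

```yaml
# Sketch: Recreate kills all pods from the old ReplicaSet before creating
# new ones, so the required anti-affinity can be satisfied during an update.
spec:
  replicas: 2
  strategy:
    type: Recreate
```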
@@ -407,7 +407,9 @@ func StartCSIDrivers(namespace string, clientset kubernetes.Interface, ver *vers
	// apply resource request and limit to rbd provisioner containers
	applyResourcesToContainers(clientset, rbdProvisionerResource, &rbdProvisionerSTS.Spec.Template.Spec)
	k8sutil.SetOwnerRef(&rbdProvisionerSTS.ObjectMeta, ownerRef)
	err = k8sutil.CreateStatefulSet("csi-rbdplugin-provisioner", namespace, clientset, rbdProvisionerSTS)
When is the statefulset ever created? I've never seen it created instead of the deployment.
If the Kubernetes version is 1.13.x, the provisioner will be deployed as a statefulset.
If we are planning to remove support for kube 1.13.x, we can remove this as well. A lot of features are not supported in kube 1.13, for example resize, snapshot, clone, etc.
1.13 support isn't being removed from Rook yet, sounds good to keep it if needed, I just didn't remember when the statefulset was used. Would it really not work as a deployment in 1.13? Anyway, this question is really independent from this PR.
We could have used a deployment with replicas: 1; I need to check why we used a statefulset instead of a deployment.
One thing I can remember is that the suggestion or example from the sidecar was to use a statefulset, and later it moved to a deployment.
@ShyamsundarR any idea why it was a statefulset instead of a deployment?
Sometimes the kube scheduler will schedule both provisioner pods on the same node. It does not make sense to have both provisioner pods running on the same node, so we need to set the pod anti-affinity to make sure that no two provisioner pods run on the same node. Fixes: rook#5271 Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
To have parity with other functions; also, the client should be the first argument to the function. Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
Updated the deployment strategy of the provisioner deployment to Recreate to avoid issues during update; the Recreate strategy will kill all existing pods before creating new ones. Signed-off-by: Madhu Rajanna <madhupr007@gmail.com>
@travisn ready for review, PTAL
For what it's worth, I just set up rook on a single-node cluster, and went down an hour-long rabbit hole trying to figure out why I have pending provisioners, assuming it was a problem.
If this is causing pain we should fix it, even if it doesn't affect the cluster health. Want to open an issue?
Description of your changes:
Sometimes the kube scheduler will schedule both provisioner pods on the same node. It doesn't make sense to have both provisioner pods running on the same node, so we need to set the pod anti-affinity to make sure that no two provisioner pods run on the same node.
Signed-off-by: Madhu Rajanna madhupr007@gmail.com
Which issue is resolved by this Pull Request:
Resolves #5271
Checklist:
make codegen has been run to update object specifications, if necessary.
[test ceph]