This repository has been archived by the owner on Mar 22, 2024. It is now read-only.

Uninstalling CSI Driver orphans all pods using the csidriver #440

Closed
drewwells opened this issue Aug 9, 2023 · 10 comments

Comments

@drewwells
Contributor

Expanding on #330, spire still has an issue where uninstalling the CSI Driver prevents any pods using it from terminating. Since the Workload API socket is exposed via the csidriver, this is a widespread problem. We can't expect users to solve this themselves; we need a general approach to cleanly handle removing the csidriver.

Possible Solutions:

  1. A mutating webhook that sets ownerReferences on pods requesting the spiffe csidriver. The CSIDriver could not be uninstalled until dependent pods are deleted: https://github.com/kubernetes-csi/external-provisioner/blob/master/README.md#capacity-support

  2. CRDs have a similar dependency problem. Helm handles these by installing them in a hook; hook-created resources are not removed when helm uninstall runs. So uninstalling the spire csidriver would not delete the csidriver object (see the sketch below this list).
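
A minimal sketch of option 2, assuming the chart rendered the CSIDriver object as a Helm hook so that helm uninstall leaves it in place. The annotation choices and spec values below are illustrative, not taken from the chart:

```yaml
# Hypothetical sketch: render the CSIDriver object as a hook resource so it is
# not tracked as part of the release and therefore survives `helm uninstall`.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.spiffe.io
  annotations:
    "helm.sh/hook": "pre-install,pre-upgrade"
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  attachRequired: false
  podInfoOnMount: true        # spec values shown as typically used by the SPIFFE CSI driver
  fsGroupPolicy: None
  volumeLifecycleModes:
    - Ephemeral
```

The tradeoff is that the object is then left behind after uninstall and has to be cleaned up by hand, which is discussed further down the thread.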

@drewwells drewwells changed the title Termianting CSI Driver orphans all pods using the CSI Driver Terminating CSI Driver orphans all pods using the CSI Driver Aug 9, 2023
@drewwells drewwells changed the title Terminating CSI Driver orphans all pods using the CSI Driver Uninstalling CSI Driver orphans all pods using the csidriver Aug 9, 2023
@kfox1111
Contributor

kfox1111 commented Aug 9, 2023

This problem is common to all csi drivers, I think. It will likely require a conversation with the Kubernetes sig-storage team to really fix properly.

In the meantime, the simplest thing might be to add a note to the FAQ stating how to resolve the issue: reinstall the chart, remove all pods using the csi driver, and then remove the chart again.

The ownerReference approach, I think, would cause all pods using the driver to be deleted suddenly, but wouldn't touch the controlling objects such as statefulsets, so the pods would get recreated and race the deletion. Also, there may be some benefit to being able to completely remove the csi driver and re-add it while leaving existing pods alone; that does work properly.

As for orphaning the driver on uninstall, that could work too, but it seems undesirable. I'm not sure whether reinstalling the chart could reclaim management of the objects, or what effect it would have on upgrades.

This is a very tricky issue and I'm not convinced this project should be the one to solve it. It should probably be sig-storage, or maybe alternately spiffe-csi-driver, as it's a problem for static manifests too.

@mauriciopoppe

While workarounds aren't great, here's one: kubernetes-csi/external-attacher#463 (comment). The idea is that when you uninstall both your Deployment and the CSI Driver, you give the CSI driver some time to complete the teardown of volumes on the node before actually removing it.

I think in practice we always treat deploying CSI Drivers as a separate cluster-admin task. Uninstalling the CSI Driver usually means tearing down the node too, so we usually do kubectl drain node, which removes running workloads, and then we safely uninstall the CSI Driver.

@drewwells
Contributor Author

drewwells commented Aug 10, 2023

Path #2 may ultimately be what we go with. As the Helm docs mention w.r.t. CRDs:

There is no support at this time for upgrading or deleting CRDs using Helm.

We'd have a better time if we did not attempt to delete a csidriver until better hooks are built into k8s.

I'll make some time to investigate the ownerReference option to see if there's any viability there. Like @kfox1111 mentioned, it may work just like pods getting deleted when a deployment is deleted, or CRs being removed when a CRD is deleted.
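
For reference, a rough sketch of what such a webhook might inject, assuming pods that mount the csi.spiffe.io volume are given an ownerReference to the cluster-scoped CSIDriver object (the pod name is illustrative, and the UID would have to be looked up by the webhook at admission time):

```yaml
# Hypothetical result of the mutating webhook from option 1: the pod becomes a
# dependent of the CSIDriver object, tying its lifetime to it via garbage
# collection (with the recreation race noted above for controller-managed pods).
apiVersion: v1
kind: Pod
metadata:
  name: example-workload   # illustrative; any pod mounting the driver
  ownerReferences:
    - apiVersion: storage.k8s.io/v1
      kind: CSIDriver
      name: csi.spiffe.io
      uid: 00000000-0000-0000-0000-000000000000  # must be the live object's UID
      blockOwnerDeletion: true
      controller: false
# ...spec unchanged; the pod still mounts the Workload API via the csi volume.
```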

@kfox1111
Contributor

We discussed this at the weekly meeting and the consensus was that, unfortunately, this is not a problem this chart can solve. It's a Kubernetes issue, and a solution should come from that project. We'd be happy to adopt whatever solution they come up with.

@drewwells
Contributor Author

👎🏻 This is a major issue. Helm uninstall orphans all of the workloads using spire. I even offered two solutions. I'd file a PR to implement #2, but getting code merged into this project just isn't worth the time investment it requires: https://github.com/spiffe/helm-charts/blob/main/charts/spire/charts/spiffe-csi-driver/templates/spiffe-csi-driver.yaml

@kfox1111
Contributor

Suggestion number two is an incomplete recommendation. It's not just the CSIDriver object being missing; all the bits of the driver need to still be in place on every node where workloads exist so that kubelet can talk to it to unmount the volumes. Leaking the entire csi driver causes other uninstall issues, so it's a tradeoff over which problems are harder for the user to deal with: reinstalling spire to finish the workload removal, or cleaning up the mess after uninstalling spire and it leaving behind a bunch of objects.

We've mitigated the issue for now in the FAQ. The real solution probably involves a bunch of code above and beyond what a simple helm chart should provide, into the realm of a new Kubernetes controller. This repo is the wrong place for that. If it were created somewhere, we could use it.

@drewwells
Contributor Author

Can workloads mount the hostPath to communicate with spire-agent?

      - hostPath:
          path: /run/spire/agent-sockets
          type: DirectoryOrCreate
        name: spire-agent-socket-dir

@kfox1111
Contributor

No. Under PSS (and most PSP setups) that requires the privileged level, i.e. https://kubernetes.io/docs/concepts/security/pod-security-standards/#privileged. It should not be used for any regular user workload. OpenShift also doesn't allow it.

The whole reason for the CSI driver is to allow the socket to be used by pods without needing privileged access.
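
For context, the usual unprivileged pattern is a CSI ephemeral inline volume rather than a hostPath, roughly along these lines (pod name, image, and mount path are illustrative):

```yaml
# Hypothetical workload snippet: the Workload API socket is delivered by the
# csi.spiffe.io driver as a read-only inline volume, so the pod needs no
# hostPath and no privileged pod security level.
apiVersion: v1
kind: Pod
metadata:
  name: example-workload
spec:
  containers:
    - name: app
      image: registry.example/app:latest   # illustrative image
      volumeMounts:
        - name: spiffe-workload-api
          mountPath: /spiffe-workload-api
          readOnly: true
  volumes:
    - name: spiffe-workload-api
      csi:
        driver: csi.spiffe.io
        readOnly: true
```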

@drewwells
Contributor Author

Is this in reference to the namespace label in the install instructions?

@kfox1111
Contributor

Yeah, that's how PSS is configured. Prior to Kubernetes 1.25, it was often done with the PSP API instead, which involved Roles/RoleBindings but accomplished basically the same thing.

OpenShift, I think, has its own system, but it functions similarly to PSS from what I've heard.
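
For illustration, PSS is driven by namespace labels like these (the namespace name and the levels chosen here are illustrative, not copied from the install instructions):

```yaml
# Hypothetical namespace for the privileged pieces (spire-agent / CSI driver
# DaemonSets); regular workload namespaces should stay at a stricter level.
apiVersion: v1
kind: Namespace
metadata:
  name: spire-system   # illustrative name
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
```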
