This repository has been archived by the owner on Mar 22, 2024. It is now read-only.

Uninstalling CSI Driver orphans all pods using the csidriver #440

Closed
drewwells opened this issue Aug 9, 2023 · 10 comments

Comments

@drewwells
Contributor

Expanding on #330, spire still has an issue where uninstalling the CSI Driver prevents any pods using it from terminating. Since the Workload API socket is exposed via the csidriver, this is a widespread problem. We can't expect users to solve this themselves; we need a general approach to cleanly handle removing the csidriver.

Possible Solutions:

  1. A mutating webhook that sets ownerReferences on pods requesting the spiffe csidriver. The CSIDriver could not be uninstalled until dependent pods are deleted: https://github.com/kubernetes-csi/external-provisioner/blob/master/README.md#capacity-support

  2. CRDs have a similar dependency problem. Helm handles these by installing them in a hook; hook-created resources are not removed when helm uninstall runs. So uninstalling the spire csidriver would not delete the csidriver object (see the sketch below this list).
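
A minimal sketch of option 2, assuming the chart rendered the CSIDriver object as a Helm hook so that helm uninstall leaves it in place. The annotation choices and spec values below are illustrative, not taken from the chart:

```yaml
# Hypothetical sketch: render the CSIDriver object as a hook resource so it is
# not tracked as part of the release and therefore survives `helm uninstall`.
apiVersion: storage.k8s.io/v1
kind: CSIDriver
metadata:
  name: csi.spiffe.io
  annotations:
    "helm.sh/hook": "pre-install,pre-upgrade"
    "helm.sh/hook-delete-policy": before-hook-creation
spec:
  attachRequired: false
  podInfoOnMount: true        # spec values shown as typically used by the SPIFFE CSI driver
  fsGroupPolicy: None
  volumeLifecycleModes:
    - Ephemeral
```

The tradeoff is that the object is then left behind after uninstall and has to be cleaned up by hand, which is discussed further down the thread.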

@drewwells drewwells changed the title Termianting CSI Driver orphans all pods using the CSI Driver Terminating CSI Driver orphans all pods using the CSI Driver Aug 9, 2023
@drewwells drewwells changed the title Terminating CSI Driver orphans all pods using the CSI Driver Uninstalling CSI Driver orphans all pods using the csidriver Aug 9, 2023
@kfox1111
Contributor

kfox1111 commented Aug 9, 2023

This problem is common to all csi drivers, I think. It will likely require a conversation with the Kubernetes sig-storage team to really fix properly.

In the meantime, the simplest thing might be to add a note to the FAQ stating how to resolve the issue: reinstall the chart, remove all pods using the csi driver, and then remove the chart again.

The ownerReference approach, I think, would cause all pods using the driver to be deleted suddenly, but wouldn't touch the controlling objects such as statefulsets, so the pods would get recreated and race the deletion. Also, there may be some benefit to being able to completely remove the csi driver and re-add it while leaving existing pods alone; that does work properly.

As for orphaning the driver on uninstall, that could work too, but it seems undesirable. I'm not sure whether reinstalling the chart could reclaim management of the objects, or what effect it would have on upgrades.

This is a very tricky issue and I'm not convinced this project should be the one to solve it. It should probably be sig-storage, or maybe alternately spiffe-csi-driver, as it's a problem for static manifests too.

@mauriciopoppe

While workarounds aren't great, here's one: kubernetes-csi/external-attacher#463 (comment). The idea is that when you uninstall both your Deployment and the CSI Driver, you give the CSI driver some time to complete the teardown of volumes on the node before actually removing it.

I think in practice we always treat deploying CSI Drivers as a separate cluster-admin task. Uninstalling the CSI Driver usually means tearing down the node too, so we usually do kubectl drain node, which removes running workloads, and then we safely uninstall the CSI Driver.

@drewwells
Contributor Author

drewwells commented Aug 10, 2023

Path #2 may ultimately be what we go with. As the Helm docs mention w.r.t. CRDs:

There is no support at this time for upgrading or deleting CRDs using Helm.

We'd have a better time if we did not attempt to delete a csidriver until better hooks are built into k8s.

I'll make some time to investigate the ownerReference option to see if there's any viability there. Like @kfox1111 mentioned, it may work just like pods getting deleted when a deployment is deleted, or CRs being removed when a CRD is deleted.
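
For reference, a rough sketch of what such a webhook might inject, assuming pods that mount the csi.spiffe.io volume are given an ownerReference to the cluster-scoped CSIDriver object (the pod name is illustrative, and the UID would have to be looked up by the webhook at admission time):

```yaml
# Hypothetical result of the mutating webhook from option 1: the pod becomes a
# dependent of the CSIDriver object, tying its lifetime to it via garbage
# collection (with the recreation race noted above for controller-managed pods).
apiVersion: v1
kind: Pod
metadata:
  name: example-workload   # illustrative; any pod mounting the driver
  ownerReferences:
    - apiVersion: storage.k8s.io/v1
      kind: CSIDriver
      name: csi.spiffe.io
      uid: 00000000-0000-0000-0000-000000000000  # must be the live object's UID
      blockOwnerDeletion: true
      controller: false
# ...spec unchanged; the pod still mounts the Workload API via the csi volume.
```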

@kfox1111
Contributor

We discussed this at the weekly meeting and the consensus was that, unfortunately, this is not a problem this chart can solve. It's a Kubernetes issue, and a solution should come from that project. We'd be happy to adopt whatever solution they come up with.

@drewwells
Contributor Author

👎🏻 This is a major issue. Helm uninstall orphans all of the workloads using spire. I even offered two solutions. I'd file a PR to implement #2, but getting code merged into this project just isn't worth the time investment it requires: https://github.com/spiffe/helm-charts/blob/main/charts/spire/charts/spiffe-csi-driver/templates/spiffe-csi-driver.yaml

@kfox1111
Contributor

Suggestion number two is an incomplete recommendation. It's not just the CSIDriver object being missing; all the bits of the driver need to still be in place on every node where workloads exist so that kubelet can talk to it to unmount the volumes. Leaking the entire csi driver causes other uninstall issues, so it's a tradeoff over which problems are harder for the user to deal with: reinstalling spire to finish the workload removal, or cleaning up the mess after uninstalling spire and it leaving behind a bunch of objects.

We've mitigated the issue for now in the FAQ. The real solution probably involves a bunch of code above and beyond what a simple helm chart should provide, into the realm of a new Kubernetes controller. This repo is the wrong place for that. If it were created somewhere, we could use it.

@drewwells
Contributor Author

Can workloads mount the hostPath to communicate with spire-agent?

      - hostPath:
          path: /run/spire/agent-sockets
          type: DirectoryOrCreate
        name: spire-agent-socket-dir

@kfox1111
Contributor

No. Under PSS (and most PSP setups) that requires the privileged level, i.e. https://kubernetes.io/docs/concepts/security/pod-security-standards/#privileged. It should not be used for any regular user workload. OpenShift also doesn't allow it.

The whole reason for the CSI driver is to allow the socket to be used by pods without needing privileged access.
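
For context, the usual unprivileged pattern is a CSI ephemeral inline volume rather than a hostPath, roughly along these lines (pod name, image, and mount path are illustrative):

```yaml
# Hypothetical workload snippet: the Workload API socket is delivered by the
# csi.spiffe.io driver as a read-only inline volume, so the pod needs no
# hostPath and no privileged pod security level.
apiVersion: v1
kind: Pod
metadata:
  name: example-workload
spec:
  containers:
    - name: app
      image: registry.example/app:latest   # illustrative image
      volumeMounts:
        - name: spiffe-workload-api
          mountPath: /spiffe-workload-api
          readOnly: true
  volumes:
    - name: spiffe-workload-api
      csi:
        driver: csi.spiffe.io
        readOnly: true
```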

@drewwells
Contributor Author

Is this in reference to the namespace label in the install instructions?

@kfox1111
Contributor

Yeah, that's how PSS is configured. Prior to Kubernetes 1.25, it was often done with the PSP API instead, which involved Roles/RoleBindings but accomplished basically the same thing.

OpenShift, I think, has its own system, but it functions similarly to PSS from what I've heard.
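
For illustration, PSS is driven by namespace labels like these (the namespace name and the levels chosen here are illustrative, not copied from the install instructions):

```yaml
# Hypothetical namespace for the privileged pieces (spire-agent / CSI driver
# DaemonSets); regular workload namespaces should stay at a stricter level.
apiVersion: v1
kind: Namespace
metadata:
  name: spire-system   # illustrative name
  labels:
    pod-security.kubernetes.io/enforce: privileged
    pod-security.kubernetes.io/audit: privileged
    pod-security.kubernetes.io/warn: privileged
```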
