
Delete of spire-oidc might get stuck if the csi driver got removed first #330

Closed
marcofranssen opened this issue Jun 2, 2023 · 6 comments · Fixed by #353

Comments

@marcofranssen
Contributor

When doing a `helm uninstall`, in some situations the spire-oidc pod gets stuck in a `Terminating` state. This also prevents you from deleting the namespace.

This can happen even with a plain `helm uninstall`.

The best way to prevent this is to delete the oidc deployment first:

```shell
kubectl delete --namespace spire-server deployment spire-spiffe-oidc-discovery-provider
helm uninstall --namespace spire-server spire
kubectl delete ns spire-server
```

In case your namespace removal is already stuck, you can do the following.

Run this in a separate terminal:

```shell
kubectl proxy
```

Then run the following to clean up the namespace:

```shell
namespace=spire-system
kubectl get namespace "${namespace}" -o json | jq 'del(.spec.finalizers)' >tmp.json
curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json "http://127.0.0.1:8001/api/v1/namespaces/${namespace}/finalize"
```

We should probably add documentation to inform users of our chart, since this can basically happen with any pod/namespace that depends on the csi driver.

@kfox1111
Contributor

kfox1111 commented Jun 2, 2023

Any pod using a csi driver will behave badly when you pull the csi driver out from under it...

It won't just affect spire-spiffe-oidc-discovery-provider, but any workload using the regular spiffe csi driver.

If nested spire is being used, spire-server will malfunction if the upstream csi driver is pulled too.

@kfox1111
Contributor

kfox1111 commented Jun 2, 2023

Here is a command to find out if the driver is being used by any pod:

```shell
kubectl get pods --all-namespaces -o go-template='{{range .items}}{{$nn := printf "%s %s" .metadata.namespace .metadata.name}}{{range .spec.volumes}}{{if .csi}}{{if eq .csi.driver "csi.spiffe.io"}}{{printf "%s\n" $nn}}{{end}}{{end}}{{end}}{{end}}'
```

@kfox1111
Contributor

kfox1111 commented Jun 2, 2023

I don't know how safe it is to pull the kubernetes finalizer. There may be other things that get affected by that, not just the spiffe csi driver....

If the user gets into that state with their own workload, I think we should probably recommend reinstalling the csi driver so it can properly delete the other pods before removal, rather than recommend they remove finalizers.

As for the workloads within our chart, while I generally don't like helm hooks much, a pre-delete hook feels pretty safe and reasonable to remove the resources that will get deleted anyway, making sure they are gone before the csi driver gets removed. We need to test that carefully though to make sure it's safe in all conditions.
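
For illustration, a minimal sketch of what such a pre-delete hook could look like. The job name, image, and the spire-pre-delete service account are placeholders rather than anything the chart ships, and the service account would need permission to delete deployments in the namespace:

```yaml
# Sketch only: a pre-delete hook Job that removes the oidc deployment
# before the csi driver goes away. Names, image, and service account
# are assumptions for illustration.
apiVersion: batch/v1
kind: Job
metadata:
  name: spire-pre-delete-cleanup
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  template:
    spec:
      serviceAccountName: spire-pre-delete  # placeholder; needs delete on deployments
      restartPolicy: Never
      containers:
        - name: cleanup
          image: bitnami/kubectl            # any image shipping kubectl would do
          command:
            - kubectl
            - delete
            - deployment
            - spire-spiffe-oidc-discovery-provider
            - --namespace=spire-server
            - --ignore-not-found
            - --wait
```

Since the hook-succeeded delete policy only cleans the job up on success, a failed run stays around so its logs can be inspected.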

@kfox1111
Contributor

kfox1111 commented Jun 2, 2023

I just tried purposefully failing a pre-delete hook.

```shell
$ helm delete test
Error: pod hook-preinstall failed
```

It leaves everything in place and the job in an Error state, so the logs can be looked at.

So, I think we can have 3 jobs, weighted in the right order (see the sketch below):

  1. Delete spiffe-oidc-discovery-provider in the spiffe-oidc-discovery-provider chart.
  2. Delete spire-server, if nested spire, in the spire-server chart.
  3. Check for any pod still using the csi driver; fail without retry if any are still found. This will block the driver removal.
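
As a rough sketch of job 3, assuming helm.sh/hook-weight is used for the ordering (weights, names, image, and service account are placeholders; jsonpath is used instead of a go-template so the braces wouldn't collide with Helm's own templating inside a chart):

```yaml
# Sketch only: a weighted pre-delete hook that fails without retry
# while any pod still mounts the csi.spiffe.io driver, blocking removal.
apiVersion: batch/v1
kind: Job
metadata:
  name: spire-check-csi-in-use
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-weight": "3"  # after the delete jobs at weights 1 and 2
spec:
  backoffLimit: 0               # fail without retry
  template:
    spec:
      serviceAccountName: spire-pre-delete  # placeholder; needs cluster-wide pod list
      restartPolicy: Never
      containers:
        - name: check
          image: bitnami/kubectl            # placeholder image with kubectl
          command: ["/bin/sh", "-c"]
          args:
            - |
              drivers=$(kubectl get pods --all-namespaces \
                -o jsonpath='{.items[*].spec.volumes[*].csi.driver}')
              if echo "$drivers" | grep -q 'csi\.spiffe\.io'; then
                echo "pods still using csi.spiffe.io; refusing to remove the driver"
                exit 1
              fi
```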

@edwbuck
Contributor

edwbuck commented Jun 3, 2023

> I don't know how safe it is to pull the kubernetes finalizer. There may be other things that get affected by that, not just the spiffe csi driver....

It is generally not safe at all. The best-case scenario involves the admin being fully aware of all the tasks the finalizer would have performed, deciding which ones still need to be performed, and then performing them manually.

Part of this blocked one of my enterprise customers from using the CSI driver: they don't expose the ability to perform such administrative operations to the same developers they permit to deploy and remove Helm charts, and the authorized administrative team doesn't have the tolerance to perform this manual cleanup frequently (aka once).

> If the user gets into that state with their own workload, I think we should probably recommend reinstalling the csi driver so it can properly delete the other pods before removal, rather than recommend they remove finalizers.

The pre-delete hook could ensure the driver is installed, if values.yaml is configured to support it. The post-delete hook could ensure the driver is removed. This ordering should help, and will minimize future failed deployment removals.

> As for the workloads within our chart, while I generally don't like helm hooks much, a pre-delete hook feels pretty safe and reasonable to remove the resources that will get deleted anyway, making sure they are gone before the csi driver gets removed. We need to test that carefully though to make sure it's safe in all conditions.

I know the hooks aren't very elegant, but this is the use case that created them. I would not recommend removing the pods in pre-delete, as ensuring the CSI driver is installed/reinstalled in pre-delete keeps the "bulk" of the application removal located where it is expected.

Once the work is done here, it should generally become easier to expand the pattern to all of the other helm hooks. The CSI driver then becomes part of the pre-install hook, and is "ensured to be present" in the post-install hook. While some of these steps might seem like busy work, they make it far easier to handle Helm rollbacks, including rollbacks from uninstallation, or rollbacks of an installation that effectively remove the deployment.

Note: sub-chart ordering complicates this process significantly, because sub-charts are intended to manage independent application dependencies, not components of the same micro-service. Much like I wouldn't put a service into a sub-chart and its ConfigMap into a different sub-chart and then attempt to figure out how to get them ordered, the CSI driver should not be in a different sub-chart from its users.

@kfox1111
Contributor

kfox1111 commented Jun 3, 2023

I don't follow. What does csi install/reinstall have to do with deleting the chart? The issue described is with deleting the chart and helm's lack of ordering. Also, I don't think you can roll back a chart deletion? I could be wrong though.

I don't think subcharts have much effect on hooks either. All hooks of the same type are run at the same time regardless of nesting, unless you specify an explicit ordering, in which case you have control even over multiple subcharts via the hook weight annotation.
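
For reference, the explicit ordering mechanism is the helm.sh/hook-weight annotation: Helm gathers all hooks of a given type across the whole release, subcharts included, and runs them in ascending weight, waiting for each to complete. Illustrative fragments (job specs omitted):

```yaml
# In one subchart:
metadata:
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-weight": "1"  # lower weights run first
---
# In another subchart:
metadata:
  annotations:
    "helm.sh/hook": pre-delete
    "helm.sh/hook-weight": "2"  # runs after weight 1 completes
```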
