Clusters, apps/workloads can get stuck in deleting (waiting on foregroundDeletion) #11991

Closed
tfiduccia opened this Issue Mar 6, 2018 · 17 comments

Comments

@tfiduccia

tfiduccia commented Mar 6, 2018

Rancher versions: v2.0.0-alpha17

Steps to Reproduce:

  1. Create a custom cluster
  2. Try to add a couple of nodes running Ubuntu 16.04 from DigitalOcean
  3. When the nodes are stuck adding Kubernetes, delete the cluster

Results: Custom cluster is stuck in the removing state.

@StrongMonkey

Member

StrongMonkey commented Mar 8, 2018

We didn't fix the root cause of this garbage collector bug. Need to discuss with Darren, as this needs a change in k8s upstream.

@deniseschannon deniseschannon modified the milestones: v2.0 - Beta, v2.0 - GA Mar 21, 2018

@benyanke

benyanke commented Apr 9, 2018

Not sure what the current status of this is, but thought it would be worth mentioning that it still occurs for me.
[screenshot of the node stuck in the removing state]

How I got here:

  1. Added node 2 to the cluster
  2. Realized I configured the VM incorrectly, and deleted the VM
  3. AFTER deleting the VM, I deleted the node using the Rancher console
  4. The node has been stuck in the state shown in the screenshot above for 12+ hours (I assume it is stuck indefinitely).

Would it have been better for me to remove the node from Rancher before deleting the VM? Also, is there any way in the short term, before the root cause is fixed, to fix this without destroying and rebuilding the entire cluster?

@fkollmann

fkollmann commented May 2, 2018

How can I remove the node manually?

@yoke88

yoke88 commented May 4, 2018

Try the following procedure (see the kubectl sketch after this list):

  • kubectl delete node nodename
  • In the UI, select the node, click Edit, choose View in API, then click the Delete button to delete the node via the API.
  • Then clean up the node.
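
A minimal sketch of the kubectl step above, assuming the stuck node is named worker-1 (a placeholder); the remaining steps (deleting via the API view and cleaning the node) are done from the UI and on the host:

kubectl get nodes                # confirm the exact name of the stuck node
kubectl delete node worker-1     # remove the node object from Kubernetes
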
@jevin36

jevin36 commented Jun 6, 2018

@yoke88 that didn't help me, because the nodes weren't initialized in the first place and Kubernetes didn't know about them.

What I did on my single-node instance with the embedded etcd (not HA Rancher!!) was the following (see the sketch below):

  1. docker exec into the rancher container
  2. kill the rancher process
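
A rough sketch of those two steps, assuming the container is named rancher and that ps is available in the image (both assumptions), for a single-node install with embedded etcd only:

docker exec -it rancher sh       # 1. exec into the rancher container
ps aux | grep rancher            # inside the container: find the rancher process PID
kill <PID>                       # 2. kill the rancher process (replace <PID> with the value found above)
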
@stefanvangastel

stefanvangastel commented Jul 9, 2018

Having the same issue with deleting workloads from the Rancher 2.0.4 UI on a custom offline on-prem cluster:

Deployment does not have minimum availability; waiting on foregroundDeletion

Running kubectl delete deploy/faultyworkload -n mynamespace does remove the workload / deployment.

@antoniojtorres

antoniojtorres commented Jul 11, 2018

Happened to me as well. I couldn't get kubectl from the UI so I just power cycled the nodes. It's old timey, but sometimes it works!

@alena1108 alena1108 modified the milestones: v2.1, v2.0.7 Jul 17, 2018

@alena1108

Member

alena1108 commented Jul 17, 2018

@StrongMonkey the last time I saw it happening was on @tfiduccia's setup, for the cluster object. The k8s vendor version is v1.10.5. But it looks like it happens for both types of objects:

  • management cluster layer - CRDs created on the rancher side
  • user cluster layer - k8s "native" objects

Here is a snippet of the cluster object stuck on the finalizer:

{
    "apiVersion": "management.cattle.io/v3",
    "kind": "Cluster",
    "metadata": {
        "annotations": {
            "authz.management.cattle.io/creator-role-bindings": "{\"created\":[\"cluster-owner\"],\"required\":[\"cluster-owner\"]}",
            "field.cattle.io/creatorId": "user-6shtq",
            "lifecycle.cattle.io/create.cluster-agent-controller-cleanup": "true",
            "lifecycle.cattle.io/create.cluster-provisioner-controller": "true",
            "lifecycle.cattle.io/create.cluster-scoped-gc": "true"
        },
        "clusterName": "",
        "creationTimestamp": "2018-07-17T18:26:30Z",
        "deletionGracePeriodSeconds": 0,
        "deletionTimestamp": "2018-07-17T18:46:42Z",
        **"finalizers": [
            "foregroundDeletion"
        ],**
        "generation": 2,
        "name": "c-g2bt6",
        "namespace": "",
        "resourceVersion": "7058",
        "selfLink": "/apis/management.cattle.io/v3/clusters/c-g2bt6",
        "uid": "ed2a7092-89ee-11e8-8cc0-0242ac110002"
    },

The cluster doesn't seem to have any objects referencing it via OwnerReference, but we need to double-check that.

The current workaround is as follows (apply with caution, only when the object is really stuck, and after ensuring that all objects referencing the one being removed are actually gone); see the sketch after this list:

  • Connect to the cluster using kubectl (if the object is on the management layer, log in to the Rancher docker container and run kubectl there)
  • kubectl edit the stuck object
  • Remove the foregroundDeletion finalizer line and save the changes
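
A minimal sketch of this workaround, assuming the stuck object is the cluster c-g2bt6 from the snippet above (run kubectl inside the Rancher container for management-layer objects):

# interactive: delete the "- foregroundDeletion" line under metadata.finalizers, then save
kubectl edit cluster c-g2bt6

# non-interactive alternative: clear the finalizers with a merge patch (the same caution applies)
kubectl patch cluster c-g2bt6 --type=merge -p '{"metadata":{"finalizers":[]}}'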

@deniseschannon deniseschannon changed the title from Custom cluster stuck in deleting (waiting on foregroundDeletion) to Clusters, apps/workloads can get stuck in deleting (waiting on foregroundDeletion) Jul 18, 2018

@StrongMonkey

Member

StrongMonkey commented Jul 25, 2018

For anyone who has encountered this problem (cluster gets stuck deleting), can you do the following to help me identify the cause?

  1. Exec into rancher/rancher container.
  2. Run kubectl get cluster $clusterNameWhichGetStuck -o yaml.
  3. Run kubectl get clusterRole
    Can you paste the results here? (Please redact sensitive data.)

Also, try restarting the rancher/rancher container to see if the stuck resource goes away. (We are investigating a race condition in the garbage collector and want to confirm whether a restart resolves it.)
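
A sketch of those commands run from the host, assuming the container is named rancher and the stuck cluster ID is c-g2bt6 (both placeholders):

docker exec -it rancher kubectl get cluster c-g2bt6 -o yaml   # dump the stuck cluster object
docker exec -it rancher kubectl get clusterrole               # list the cluster roles
docker restart rancher                                        # then check whether the stuck resource goes away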

@miclefebvre

miclefebvre commented Jul 26, 2018

@StrongMonkey referring to this issue: #14760

Restarting the Rancher Server deleted the Catalog App but all pods created with the Catalog App stayed stuck.
I then saw that all the stuck pods were on the same worker node, so I removed the node and then all the pods were removed.

I'm sure the problem is not with my node because there were other pods on this node that were working perfectly.

@stieler-it

stieler-it commented Aug 7, 2018

I had a stuck node that was still displayed even though its pool wasn't visible any more. Restarting the Rancher server fixed it.

@cjellick cjellick assigned moelsayed and unassigned sangeethah Aug 10, 2018

@moelsayed

Member

moelsayed commented Aug 10, 2018

Tested using v2.0.7-rc5:

  • Deleted a cluster while nodes are being added.
  • Deleted nodes, and deleted the cluster while the nodes were being deleted.
  • Deleted a cluster after removing the nodes' VMs.

In all cases I was able to delete resources (clusters, nodes, workloads) successfully.

@timfallmk

timfallmk commented Aug 13, 2018

I have the node removal issue with v2.0.6. Updated to v2.0.7 and the issue remains.

@frekele

frekele commented Aug 13, 2018

I have a similar issue. #15039

@workXMH

workXMH commented Aug 14, 2018

I have the app removal issue with v2.0.6.

@moelsayed

Member

moelsayed commented Aug 14, 2018

@timfallmk Can you please provide steps to reproduce the issue after the upgrade?
@workXMH Can you please try with v2.0.7?

@aaronchilcott

aaronchilcott commented Sep 14, 2018

I experienced this issue myself, and I was able to fix the symptom (the cluster not disappearing from the Rancher GUI) by ensuring the cluster was completely removed from Kubernetes and then restarting the Rancher docker container, e.g. docker restart <rancher container name>.

I'm not sure whether this has been mentioned in previous comments; apologies if I'm doubling up.
