
Deleted node resurrected and delete button is then disabled #25242

Closed
IlyaSemenov opened this issue Feb 4, 2020 · 13 comments
@IlyaSemenov

IlyaSemenov commented Feb 4, 2020

What kind of request is this (question/bug/enhancement/feature request): bug?

Steps to reproduce (least amount of steps as possible):

  1. Create a cluster with 4 nodes: etcd+cp, etcd+cp, all, worker-only (not sure if the exact configuration matters)
  2. "Drain" and "delete" the worker-only node via the web UI.
  3. Wait until the node is removed from the cluster.
  4. Reboot the deleted node.

Result:

After step 3, the "deleted" node still runs a few rancher-related containers, namely:

root@kube-a-x0:~# docker ps
CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS              PORTS               NAMES
8a810e7b5575        rancher/rke-tools:v0.1.27            "nginx-proxy CP_HOST…"   3 days ago          Up 3 days                               nginx-proxy
d6854585f870        rancher/hyperkube:v1.13.5-rancher1   "/opt/rke-tools/entr…"   2 months ago        Up 8 weeks                              kube-proxy
5865fa53ac96        rancher/hyperkube:v1.13.5-rancher1   "/opt/rke-tools/entr…"   3 months ago        Up 8 weeks                              kubelet

After step 4 (rebooting the node):

  • The "deleted" worker-node runs 14 rancher-related docker containers.
  • It somehow puts itself back into the cluster.
  • The "delete" button is disabled in the web UI for that node.

[Screenshot: 2020-02-04 at 15:47:18]

Other details that may be helpful:

I am somewhat "okay" with the node failing to remove itself completely. I could trash the remaining containers myself to prevent it from rising from the dead, no big deal. But I can't even delete the node again anymore! The delete button is disabled for whatever reason.

Environment information

  • Rancher version (rancher/rancher/rancher/server image tag or shown bottom left in the UI): 2.3.3 (when the issue happened), 2.3.5 (now - I upgraded to see if anything changes)
  • Installation option (single install/HA): single

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): not sure what that means; the cluster was created from scratch with Rancher on bare metal machines.
  • Machine type (cloud/VM/metal) and specifications (CPU/memory): bare metal; the removed node is 4 CPU/16 GB, the Rancher server is 8 CPU/64 GB.
  • Kubernetes version (use kubectl version):
Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}
  • Docker version (use docker version):
Server:
 Engine:
  Version:          18.09.7
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.1
  Git commit:       2d0083d
  Built:            Wed Aug 14 19:41:23 2019
  OS/Arch:          linux/amd64
  Experimental:     false
@IlyaSemenov
Author

I was able to delete the node by clicking 3 dots menu > Open in API > Delete. Still no idea why the "Delete" button in the UI was disabled for that node while removing it via the API worked...
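
For reference, the "Open in API > Delete" action appears to boil down to a DELETE request against the v3 API. A minimal sketch, assuming a Rancher API token and treating the URL and node ID (usually <cluster-id>:<node-id>) as placeholders:

# list v3 node objects to find the ID of the stuck node
curl -s -u "token-xxxxx:<secret>" "https://rancher.example.com/v3/nodes"
# delete the stuck v3 node object directly
curl -s -u "token-xxxxx:<secret>" -X DELETE "https://rancher.example.com/v3/nodes/<cluster-id>:<node-id>"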

After that, 5 rancher containers were still running on the node (so it's a reproducible problem):

root@kube-a-x0:~# docker ps
CONTAINER ID        IMAGE                                COMMAND                  CREATED             STATUS              PORTS               NAMES
792173f7663c        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   9 minutes ago       Up 9 minutes                            flamboyant_cartwright
0b7cc84c3017        rancher/rancher-agent:v2.3.3         "run.sh --server htt…"   7 hours ago         Up 9 minutes                            trusting_murdock
8a810e7b5575        rancher/rke-tools:v0.1.27            "nginx-proxy CP_HOST…"   4 days ago          Up 9 minutes                            nginx-proxy
d6854585f870        rancher/hyperkube:v1.13.5-rancher1   "/opt/rke-tools/entr…"   2 months ago        Up 9 minutes                            kube-proxy
5865fa53ac96        rancher/hyperkube:v1.13.5-rancher1   "/opt/rke-tools/entr…"   3 months ago        Up 9 minutes                            kubelet

(the duplicate rancher-agent is perhaps another problem - I didn't notice when it appeared, but based on "9 minutes" it was apparently created after I ran /etc/init.d/docker stop and then start to see if disconnecting would make any difference)

I deleted all containers with docker ps -aq|xargs docker rm -f -v to prevent the node from re-registering.

@bmdepesa
Member

bmdepesa commented Feb 4, 2020

rancher/rancher:v2.3.3

I was able to reproduce this.

  • Deploy Rancher v2.3.3
  • Create a custom cluster (1 all roles, 1 worker node)
  • Drain & delete worker node from Rancher UI
  • Reboot worker node
  • Worker node rejoins itself to the cluster, but the Delete option is disabled and removed from the 3-dot menu
  • When the node rejoins, it has no role as seen in kubectl:
> kubectl get nodes
NAME              STATUS   ROLES                      AGE     VERSION
ip-172-31-3-94    Ready    <none>                     3m11s   v1.16.6
  • I attempted to add some labels and annotations to the node, but they did not save
  • A daemonset workload deployed its pod successfully to the node and appeared to work but I didn't proceed beyond that.
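
For what it's worth, the ROLES column in kubectl is derived from node-role.kubernetes.io/* labels, so the missing role can presumably be confirmed like this (node name taken from the output above):

kubectl get node ip-172-31-3-94 --show-labels
# a normal RKE worker would typically carry node-role.kubernetes.io/worker=true;
# the resurrected node shows no node-role.* label at all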

@paulohleal

rancher/rancher:v2.4.5

I was also able to reproduce this, but this time with an etcd node.

  • Deploy Rancher v2.4.5
  • Create a custom cluster (3 etcd, 1 control plane, 1 worker node)
  • Delete the etcd node from the Rancher UI
  • Reboot the etcd node
  • The etcd node rejoins itself to the cluster, but the Delete option is disabled and removed from the 3-dot menu
  • When the node rejoins the cluster it has no role (in the Rancher UI it shows as a worker node)
  • The etcd container is still running but keeps restarting
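
A quick way to watch the restart loop on the node itself, assuming the standard RKE container name "etcd":

docker ps -a --filter name=etcd --format '{{.Names}}\t{{.Status}}'   # shows "Restarting (...)" while it crash-loops
docker logs --tail 20 etcd                                           # the last lines usually hint at why it cannot rejoin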

@jinzishuai

jinzishuai commented Sep 10, 2020

@IlyaSemenov thank you very much for your workaround using the 3 dots menu > Open in API > Delete method.

I have almost exactly the same problem as you:

  • first the node went down and I was able to delete it via the Rancher UI
  • then the node got power cycled and came back online: the node shows up in Rancher again
  • since we already had a replacement node, we physically terminated the old node, but then realized that the Rancher UI no longer offers the simple delete button
  • the proposed workaround worked
  • also confirmed that both kubectl get node and etcdctl member list report a cluster without that node (checked roughly as in the sketch below)
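
A rough sketch of those checks, assuming an RKE cluster where etcd runs in the standard container named "etcd":

# from any machine with kubectl access to the cluster
kubectl get nodes                        # the deleted node should no longer be listed
# on one of the remaining etcd nodes
docker exec etcd etcdctl member list     # the deleted node should no longer be an etcd member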

@strajansebastian

This happened twice in our setup and we couldn't find an explanation or pattern for it until now.
I was thinking that the upgrade from v2.3 to v2.4.8 was causing all these problems.

Confirmed that the ... menu -> Open in API -> Delete method works.

@skaven81

We just experienced the same issue. The workaround using the API directly worked as well.

@Oats87
Contributor

Oats87 commented Dec 18, 2020

There is a technical reason as to why this happens.

We have a v3 (rancher) representation of the v1 (kubernetes) node object, and a controller that serves to "sync" the two objects with each other.

When you delete a node from the cluster but that node isn't properly cleaned (whether due to a Rancher bug or some other reason, e.g. split brain), it is still running a kubelet that can re-register with the Kubernetes cluster it belongs to, creating a new v1 node object, which in turn triggers the recreation of the v3 node object. The easiest way to tell that this has happened is if the node object has a prefix of machine- instead of the normal m-.

The actual code for this can be found here: https://github.com/rancher/rancher/blob/v2.3.8/pkg/controllers/user/nodesyncer/nodessyncer.go#L617
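
One way to see this is to list the v3 node objects from the Rancher management (local) cluster; a sketch, assuming the downstream cluster's ID is c-xxxxx and that the v3 objects live in the namespace named after it:

# run with a kubeconfig for the Rancher local cluster, not the downstream cluster
kubectl get nodes.management.cattle.io -n c-xxxxx
# a node that re-registered itself shows up with a "machine-" name prefix instead of the usual "m-"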

@skaven81

@Oats87 So as a workaround, for a node that gets deleted from the cluster, would nuking the kubelet container be sufficient to prevent this from happening, as it would prevent the kubelet from creating a new Node object?

Superficially, I don't think it's a particularly bad thing to have the Rancher UI "resurrect" a deleted node, particularly if the reason it's doing that is simply to stay in sync with the underlying K8s state. What is bad, is the fact that when this happens, it creates a Node (v3) object that can't be deleted in the UI. So whatever is auto-vivifying the Node (v3) object must be doing something differently from what happens when a node is added via the UI. I imagine if those two processes are reconciled, then the "resurrected" node would be deletable in the UI.

@Oats87
Contributor

Oats87 commented Dec 18, 2020

> @Oats87 So as a workaround, for a node that gets deleted from the cluster, would nuking the kubelet container be sufficient to prevent this from happening, as it would prevent the kubelet from creating a new Node object?
>
> Superficially, I don't think it's a particularly bad thing to have the Rancher UI "resurrect" a deleted node, particularly if the reason it's doing that is simply to stay in sync with the underlying K8s state. What is bad, is the fact that when this happens, it creates a Node (v3) object that can't be deleted in the UI. So whatever is auto-vivifying the Node (v3) object must be doing something differently from what happens when a node is added via the UI. I imagine if those two processes are reconciled, then the "resurrected" node would be deletable in the UI.

I'd actually take it a step further and "clean" the node; we have some documentation around that here: https://rancher.com/docs/rancher/v2.x/en/cluster-admin/cleaning-cluster-nodes/
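
An abbreviated sketch of that cleanup, run on the node itself (the full documented procedure also covers firewall rules, network interfaces, and leftover mounts):

# remove all Rancher/Kubernetes containers and their volumes
docker ps -aq | xargs docker rm -f -v
# unmount leftover kubelet tmpfs mounts before removing directories
for m in $(mount | grep tmpfs | grep '/var/lib/kubelet' | awk '{ print $3 }'); do umount "$m"; done
# remove the state directories (roughly the list from the cleanup docs)
rm -rf /etc/ceph /etc/cni /etc/kubernetes /opt/cni /opt/rke /run/secrets/kubernetes.io \
       /run/calico /run/flannel /var/lib/calico /var/lib/cni /var/lib/etcd /var/lib/kubelet \
       /var/lib/rancher /var/run/calico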

We have an open issue around this here: #26545

@daniol1618

I tried to delete the nodes with some external scripts; when that happened, Rancher got lost and I was not able to drain, cordon, or delete the nodes. The solution was to use the CLI: connect to the cluster and run kubectl delete node.
I then restarted the nodes, provisioned the components to them, and attached them once again.
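
For reference, roughly what that looks like from the downstream cluster (node name as shown by kubectl get nodes):

kubectl get nodes
kubectl drain <node-name> --ignore-daemonsets   # evict remaining pods first; add --force if needed
kubectl delete node <node-name>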

@livenson

livenson commented Feb 2, 2021

Just happened to me on 2.5.5; the solution was to use the API.

@bmdepesa
Member

bmdepesa commented Feb 2, 2021

Closing in favor of #26545

@bmdepesa bmdepesa closed this as completed Feb 2, 2021
@ghost

ghost commented Mar 7, 2022

Just hit the same issue.
Where is the 3-dot menu that contains an "Open in API" option?
