
[release-1.23] Cannot join nodes back after cluster-reset when not restoring from snapshot #2857

Closed
rancher-max opened this issue May 4, 2022 · 6 comments


@rancher-max
Contributor

Environmental Info:
RKE2 Version:

v1.23.5+rke2r1, v1.22.9+rke2r1 (and it seems any rke2 version that uses etcd 3.5.x).

Node(s) CPU architecture, OS, and Version:

any

Cluster Configuration:

Minimal configuration reproduced on: 2 servers
Initially found on: 3 servers, 1 agent

Describe the bug:

When performing a cluster-reset without the cluster-reset-restore-path flag and then trying to rejoin server nodes, the cluster ends up in a split-brain situation.

ETCD Info:

# Running the below command on both nodes returns the same info:
$ sudo ETCDCTL_API=3 etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379/ --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt member list -w table
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |           NAME            |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
| 3c0e71035ef2e3ca | started | ip-172-31-30-121-c419f9ca | https://172.31.30.121:2380/ | https://172.31.30.121:2379/ |      false |
| bdce3dd9310e7e8b | started | ip-172-31-17-205-589a092b | https://172.31.17.205:2380/ | https://172.31.17.205:2379/ |      false |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
# Running the below command on both nodes returns DIFFERENT info (note that doing this on 1.21.12 returns all node IPs correctly):
$ sudo ETCDCTL_API=3 etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt get /registry/services/endpoints/default/kubernetes
/registry/services/endpoints/default/kubernetes
# (value is a raw protobuf-encoded Endpoints object; non-printable bytes trimmed)
# metadata: kubernetes default, uid f1fc0fb6-ac1d-4c63-9aee-185f15e0d5a7
# labels: endpointslice.kubernetes.io/skip-mirror=true
# managedFields: kube-apiserver Update ... {"f:metadata":{"f:labels":{".":{},"f:endpointslice.kubernetes.io/skip-mirror":{}}},"f:subsets":{}}
# subsets: <REDACTED NODE IP OF THE NODE THE COMMAND IS BEING RUN FROM> https TCP
# Result of kubectl get nodes, run from the different nodes, clearly showing the split brain situation:
ubuntu@ip-172-31-30-121:~$ k get nodes
NAME               STATUS     ROLES                       AGE   VERSION
ip-172-31-17-205   NotReady   control-plane,etcd,master   14m   v1.23.6+rke2r2
ip-172-31-30-121   Ready      control-plane,etcd,master   16m   v1.23.6+rke2r2

ubuntu@ip-172-31-17-205:~$ k get nodes
NAME               STATUS     ROLES                       AGE   VERSION
ip-172-31-17-205   Ready      control-plane,etcd,master   14m   v1.23.6+rke2r2
ip-172-31-30-121   NotReady   control-plane,etcd,master   16m   v1.23.6+rke2r2
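
A quick way to confirm the divergence beyond the member list is to compare what each member reports about itself; a minimal diagnostic sketch, assuming the same RKE2 certificate paths as in the commands above:

# Run on each server node; if the datastores have diverged, the revision,
# raft term, and leader columns will not agree between the nodes even though
# 'member list' looks identical.
$ sudo ETCDCTL_API=3 etcdctl \
    --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
    --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
    --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
    --endpoints https://127.0.0.1:2379 \
    endpoint status -w table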

Steps To Reproduce:

  • Install rke2 and join a server node
  • Stop both server nodes: sudo systemctl stop rke2-server
  • On node1, run cluster-reset: sudo rke2 server --cluster-reset
  • After the command finishes, start rke2-server process on node1: sudo systemctl start rke2-server
  • On the other server node, delete the db directory as instructed: sudo rm -rf /var/lib/rancher/rke2/server/db
  • Start the rke2-server process on the other server node: sudo systemctl start rke2-server (the full sequence is consolidated in the sketch below)
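
The same reproduction steps, condensed into a sketch (node names and the two-server layout are taken from this report; paths and unit names assume a default RKE2 install):

# node1 and node2: stop the rke2-server service
$ sudo systemctl stop rke2-server

# node1: reset the cluster WITHOUT restoring a snapshot, then start rke2 again
$ sudo rke2 server --cluster-reset
$ sudo systemctl start rke2-server

# node2: wipe the local etcd db as instructed by the reset output, then rejoin
$ sudo rm -rf /var/lib/rancher/rke2/server/db
$ sudo systemctl start rke2-server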

Expected behavior:

The other server node should successfully rejoin the cluster.

Actual behavior:

The node does not successfully rejoin the cluster; instead, the servers end up in a split-brain state.

Additional context / logs:

@rancher-max rancher-max added this to the v1.23.7+rke2r1 milestone May 4, 2022
@rancher-max
Contributor Author

/backport v1.22.10+rke2r1

@rancher-max
Contributor Author

Confirmed this also continues to happen on v1.23.6+rke2r1 when adding the config param etcd-image: quay.io/coreos/etcd:v3.5.4.
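
For reference, a minimal sketch of how that parameter was set, assuming the default RKE2 config file location (/etc/rancher/rke2/config.yaml):

# /etc/rancher/rke2/config.yaml
etcd-image: quay.io/coreos/etcd:v3.5.4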

@rancher-max
Contributor Author

We found a workaround. If you end up in this split-brain situation, it is possible to recover by doing the following (in my case, I performed these steps after the steps listed in the issue):

  1. Take a snapshot on node1: sudo rke2 etcd-snapshot save
  2. Stop ALL the servers: sudo systemctl stop rke2-server
  3. Restore from the snapshot that was just taken: sudo rke2 server --cluster-reset --cluster-reset-restore-path="/var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-172-31-30-121-1651783213"
  4. Start rke2-server process on node1 after that completes: sudo systemctl start rke2-server
  5. Now, delete db directory from other server nodes and restart rke2-server process:
$ sudo rm -rf /var/lib/rancher/rke2/server/db
$ sudo systemctl start rke2-server

This cures the issue and all nodes are back in the cluster and running successfully.
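
The same workaround condensed into a sketch, assuming the default RKE2 snapshot directory; the snapshot file name depends on the node name and timestamp, so substitute the one just taken:

# node1: take an on-demand snapshot while the node is still up
$ sudo rke2 etcd-snapshot save

# ALL servers: stop rke2
$ sudo systemctl stop rke2-server

# node1: restore from the snapshot just taken (newest file in the snapshot dir)
$ ls -t /var/lib/rancher/rke2/server/db/snapshots/ | head -n 1
$ sudo rke2 server --cluster-reset \
    --cluster-reset-restore-path="/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>"
$ sudo systemctl start rke2-server

# other servers: wipe the local db and rejoin
$ sudo rm -rf /var/lib/rancher/rke2/server/db
$ sudo systemctl start rke2-server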

@Richardswe

Got something similar; maybe this will help you out if your cluster is not functioning properly after a cluster restore (using the same nodes, not new ones).
When performing a cluster-reset and starting the servers again, the cluster seems to have communication issues. I solved this by restarting the server nodes (virtual machines). We were able to reach the applications running on the agent node, but not the ones running on the server/master nodes.

Environmental Info:
RKE2 Version:

v1.23.7+rke2r2

Node(s) CPU architecture, OS, and Version:
openSUSE 15 SP3 / JeOS

Cluster Configuration:
3 servers
1 agent

@sandori01

I too have a problem similar to this. After restoring a snapshot on one of the masters, I could get the cluster working. However, the cluster didn't appear in Rancher cluster management. I am able to see the workloads and authenticate using the Rancher server, which means the cattle-cluster-agent can connect to Rancher.
rancher-system-agent, however, cannot connect:
Oct 11 14:04:27 master2 rancher-system-agent[1210]: time="2022-10-11T14:04:27Z" level=error msg="[K8s] Received secret that was nil"
Oct 11 14:04:32 master2 rancher-system-agent[1210]: time="2022-10-11T14:04:32Z" level=error msg="[K8s] Received secret that was nil"
Oct 11 14:04:37 master2 rancher-system-agent[1210]: time="2022-10-11T14:04:37Z" level=error msg="[K8s] Received secret that was nil"
fleet-default does not have machine plans or tokens for the cluster nodes, so rancher2_connection_info.json is obsolete.
I failed to find a way to reconnect rancher-system-agent.

Restarting a second master results in a single-node cluster. Clearly this is not what I want (after this I had to restore the etcd snapshot).

Environment:
RKE2 v1.23.7+rke2r2
Nodes: openSUSE MicroOS

Cluster: 3 masters (currently only 1 is working), 3 workers

@caroline-suse-rancher
Contributor

Closing as 1.23 is soon to be EOL, and there's a workaround above
