
[release-1.23] Cannot join nodes back after cluster-reset when not restoring from snapshot #2857

Closed
rancher-max opened this issue May 4, 2022 · 6 comments


@rancher-max
Contributor

Environmental Info:
RKE2 Version:

v1.23.5+rke2r1, v1.22.9+rke2r1 (and it seems any rke2 version that uses etcd 3.5.x).

Node(s) CPU architecture, OS, and Version:

any

Cluster Configuration:

Minimal configuration reproduced on: 2 servers
Initially found on: 3 servers, 1 agent

Describe the bug:

When performing a cluster-reset without the cluster-reset-restore-path flag and then trying to rejoin server nodes, the cluster ends up in a split-brain situation.

ETCD Info:

# Running the below command on both nodes returns the same info:
$ sudo ETCDCTL_API=3 etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379/ --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt member list -w table
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
|        ID        | STATUS  |           NAME            |         PEER ADDRS         |        CLIENT ADDRS        | IS LEARNER |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
| 3c0e71035ef2e3ca | started | ip-172-31-30-121-c419f9ca | https://172.31.30.121:2380/ | https://172.31.30.121:2379/ |      false |
| bdce3dd9310e7e8b | started | ip-172-31-17-205-589a092b | https://172.31.17.205:2380/ | https://172.31.17.205:2379/ |      false |
+------------------+---------+---------------------------+----------------------------+----------------------------+------------+
# Running the below command on both nodes returns DIFFERENT info (note that doing this on 1.21.12 returns all node IPs correctly):
$ sudo ETCDCTL_API=3 etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt get /registry/services/endpoints/default/kubernetes
/registry/services/endpoints/default/kubernetes
# (value is a raw protobuf-encoded Endpoints object; non-printable bytes trimmed)
# metadata: kubernetes default, uid f1fc0fb6-ac1d-4c63-9aee-185f15e0d5a7
# labels: endpointslice.kubernetes.io/skip-mirror=true
# managedFields: kube-apiserver Update ... {"f:metadata":{"f:labels":{".":{},"f:endpointslice.kubernetes.io/skip-mirror":{}}},"f:subsets":{}}
# subsets: <REDACTED NODE IP OF THE NODE THE COMMAND IS BEING RUN FROM> https TCP
# Result of kubectl get nodes, run from the different nodes, clearly showing the split brain situation:
ubuntu@ip-172-31-30-121:~$ k get nodes
NAME               STATUS     ROLES                       AGE   VERSION
ip-172-31-17-205   NotReady   control-plane,etcd,master   14m   v1.23.6+rke2r2
ip-172-31-30-121   Ready      control-plane,etcd,master   16m   v1.23.6+rke2r2

ubuntu@ip-172-31-17-205:~$ k get nodes
NAME               STATUS     ROLES                       AGE   VERSION
ip-172-31-17-205   Ready      control-plane,etcd,master   14m   v1.23.6+rke2r2
ip-172-31-30-121   NotReady   control-plane,etcd,master   16m   v1.23.6+rke2r2
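
A quick way to confirm the divergence beyond the member list is to compare what each member reports about itself; a minimal diagnostic sketch, assuming the same RKE2 certificate paths as in the commands above:

# Run on each server node; if the datastores have diverged, the revision,
# raft term, and leader columns will not agree between the nodes even though
# 'member list' looks identical.
$ sudo ETCDCTL_API=3 etcdctl \
    --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
    --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
    --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
    --endpoints https://127.0.0.1:2379 \
    endpoint status -w table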

Steps To Reproduce:

  • Install rke2 and join a server node
  • Stop both server nodes: sudo systemctl stop rke2-server
  • On node1, run cluster-reset: sudo rke2 server --cluster-reset
  • After the command finishes, start rke2-server process on node1: sudo systemctl start rke2-server
  • On the other server node, delete the db directory as instructed: sudo rm -rf /var/lib/rancher/rke2/server/db
  • Start the rke2-server process on the other server node: sudo systemctl start rke2-server (the full sequence is consolidated in the sketch below)
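
The same reproduction steps, condensed into a sketch (node names and the two-server layout are taken from this report; paths and unit names assume a default RKE2 install):

# node1 and node2: stop the rke2-server service
$ sudo systemctl stop rke2-server

# node1: reset the cluster WITHOUT restoring a snapshot, then start rke2 again
$ sudo rke2 server --cluster-reset
$ sudo systemctl start rke2-server

# node2: wipe the local etcd db as instructed by the reset output, then rejoin
$ sudo rm -rf /var/lib/rancher/rke2/server/db
$ sudo systemctl start rke2-server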

Expected behavior:

The other server node should successfully rejoin the cluster.

Actual behavior:

The node does not successfully rejoin the cluster; instead, the servers end up in a split-brain state.

Additional context / logs:

@rancher-max rancher-max added this to the v1.23.7+rke2r1 milestone May 4, 2022
@rancher-max
Contributor Author

/backport v1.22.10+rke2r1

@rancher-max
Contributor Author

Confirmed this also continues to happen on v1.23.6+rke2r1 when adding the config param etcd-image: quay.io/coreos/etcd:v3.5.4.
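
For reference, a minimal sketch of how that parameter was set, assuming the default RKE2 config file location (/etc/rancher/rke2/config.yaml):

# /etc/rancher/rke2/config.yaml
etcd-image: quay.io/coreos/etcd:v3.5.4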

@rancher-max
Contributor Author

We found a workaround. If you end up in this split-brain situation, it is possible to recover by doing the following (in my case, I performed these steps after the steps listed in the issue):

  1. Take a snapshot on node1: sudo rke2 etcd-snapshot save
  2. Stop ALL the servers: sudo systemctl stop rke2-server
  3. Restore from the snapshot that was just taken: sudo rke2 server --cluster-reset --cluster-reset-restore-path="/var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-172-31-30-121-1651783213"
  4. Start rke2-server process on node1 after that completes: sudo systemctl start rke2-server
  5. Now, delete db directory from other server nodes and restart rke2-server process:
$ sudo rm -rf /var/lib/rancher/rke2/server/db
$ sudo systemctl start rke2-server

This cures the issue and all nodes are back in the cluster and running successfully.
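
The same workaround condensed into a sketch, assuming the default RKE2 snapshot directory; the snapshot file name depends on the node name and timestamp, so substitute the one just taken:

# node1: take an on-demand snapshot while the node is still up
$ sudo rke2 etcd-snapshot save

# ALL servers: stop rke2
$ sudo systemctl stop rke2-server

# node1: restore from the snapshot just taken (newest file in the snapshot dir)
$ ls -t /var/lib/rancher/rke2/server/db/snapshots/ | head -n 1
$ sudo rke2 server --cluster-reset \
    --cluster-reset-restore-path="/var/lib/rancher/rke2/server/db/snapshots/<snapshot-name>"
$ sudo systemctl start rke2-server

# other servers: wipe the local db and rejoin
$ sudo rm -rf /var/lib/rancher/rke2/server/db
$ sudo systemctl start rke2-server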

@Richardswe

Got something similar; maybe this will help you out if your cluster is not functioning properly after a cluster restore (using the same nodes, not new ones).
When performing a cluster-reset and starting the servers again, the cluster seems to have communication issues. I solved this by restarting the server nodes (virtual machines). We were able to reach the applications running on the agent node, but not the ones running on the server/master nodes.

Environmental Info:
RKE2 Version:

v1.23.7+rke2r2

Node(s) CPU architecture, OS, and Version:
openSUSE 15 SP3 / JeOS

Cluster Configuration:
3 servers
1 agent

@sandori01

I too have a problem similar to this. After restoring a snapshot on one of the masters, I could get the cluster working. However, the cluster didn't appear in Rancher cluster management. I am able to see the workloads and authenticate using the Rancher server, which means the cattle-cluster-agent can connect to Rancher.
rancher-system-agent, however, cannot connect:
Oct 11 14:04:27 master2 rancher-system-agent[1210]: time="2022-10-11T14:04:27Z" level=error msg="[K8s] Received secret that was nil"
Oct 11 14:04:32 master2 rancher-system-agent[1210]: time="2022-10-11T14:04:32Z" level=error msg="[K8s] Received secret that was nil"
Oct 11 14:04:37 master2 rancher-system-agent[1210]: time="2022-10-11T14:04:37Z" level=error msg="[K8s] Received secret that was nil"
fleet-default does not have machine plans or tokens for the cluster nodes, so rancher2_connection_info.json is obsolete.
I failed to find a way to reconnect rancher-system-agent.

Restarting a second master results in a single-node cluster. Clearly this is not what I want (after this I had to restore the etcd snapshot).

Environment:
RKE2 v1.23.7+rke2r2
Nodes: openSUSE MicroOS

Cluster: 3 masters (currently only 1 is working), 3 workers

@caroline-suse-rancher
Contributor

Closing as 1.23 is soon to be EOL, and there's a workaround above
