[release-1.23] Cannot join nodes back after cluster-reset when not restoring from snapshot #2857
Comments
/backport v1.22.10+rke2r1
Confirmed this also continues to happen on v1.23.6+rke2r1 when adding the config param.
We found a workaround. If you are in this split-brain situation, it is possible to do the following (in my case, I performed these steps after the steps listed in the issue):
This cures the issue and all nodes are back in the cluster and running successfully.
Got something similar; maybe this will help you out if your cluster is not functioning properly after a cluster restore (using the same nodes, not new ones).
Environmental Info: v1.23.7+rke2r2
I too have a problem similar to this. After restoring a snapshot on one of the masters, I could get the cluster working. However, the cluster didn't appear in Rancher cluster management. I am able to see the workloads and authenticate using the Rancher server, so that means the cattle-cluster-agent can connect to Rancher. Restarting a second master results in a single-node cluster. Clearly this is not what I want (after this I had to restore the etcd snapshot).
Environment: Cluster: 3 masters (currently only 1 is working), 3 workers
Closing as 1.23 is soon to be EOL, and there's a workaround above.
Environmental Info:
RKE2 Version:
v1.23.5+rke2r1, v1.22.9+rke2r1 (and it seems any rke2 version that uses etcd 3.5.x).
Node(s) CPU architecture, OS, and Version:
any
Cluster Configuration:
Minimal configuration repro'ed on: 2 servers
Initially found on: 3 servers, 1 agent
Describe the bug:
When performing a cluster-reset without the cluster-reset-restore-path flag and then trying to rejoin server nodes, the cluster ends up in a split-brain situation.
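For context, cluster-reset-restore-path is normally passed together with --cluster-reset to restore a specific snapshot; a minimal sketch of both invocations follows (the snapshot filename is illustrative, and /var/lib/rancher/rke2/server/db/snapshots/ is the default snapshot directory on a server node):

# Reset AND restore from a snapshot (snapshot filename is illustrative):
sudo systemctl stop rke2-server
sudo rke2 server --cluster-reset --cluster-reset-restore-path=/var/lib/rancher/rke2/server/db/snapshots/etcd-snapshot-example
sudo systemctl start rke2-server

# The case in this issue: a plain reset with no snapshot restore.
sudo rke2 server --cluster-reset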
Steps To Reproduce:
sudo systemctl stop rke2-server
sudo rke2 server --cluster-reset
sudo systemctl start rke2-server
sudo rm -rf /var/lib/rancher/rke2/server/db
sudo systemctl start rke2-server
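For anyone reproducing this: the report does not say which server each command runs on. The per-node breakdown below is my assumption about the intended sequence (the node names server-1/server-2 are hypothetical), not something confirmed above.

# Assumed node assignment -- not stated in the original steps.

# On server-2 (the server that will later try to rejoin):
sudo systemctl stop rke2-server

# On server-1 (the server being reset, with no snapshot restore):
sudo rke2 server --cluster-reset
sudo systemctl start rke2-server

# Back on server-2: wipe the old etcd data directory, then start and rejoin server-1.
sudo rm -rf /var/lib/rancher/rke2/server/db
sudo systemctl start rke2-server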
Expected behavior:
The other server node should successfully rejoin the cluster.
Actual behavior:
The node does not successfully rejoin the cluster and instead ends up in a split-brain state.
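One way to observe the split brain from each side, assuming a default RKE2 install (the kubeconfig, binary, and etcd certificate paths below are assumptions based on the standard layout, not taken from this report): run the same commands on each server and compare the output.

# Assumed default RKE2 paths; run on EACH server and compare the results.
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
export PATH="$PATH:/var/lib/rancher/rke2/bin"

# Each side of a split brain typically reports a different set of Ready nodes.
kubectl get nodes -o wide

# etcd membership as seen by this server's etcd (static pod is assumed to be named etcd-<hostname>):
kubectl -n kube-system exec "etcd-$(hostname)" -- etcdctl \
  --endpoints https://127.0.0.1:2379 \
  --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
  --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
  --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
  member list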
Additional context / logs: