Etcd Backup and Restore Breaks The Cluster #6050
Comments
I cannot reproduce this. Please show the specific steps you are following to rejoin nodes 2 and 3 to the cluster after restoring the snapshot on node 1.
I'll write out all the steps. So the following commands were performed at server 1.
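The actual commands were not captured in this thread; assuming the standard flow from the RKE2 backup/restore docs, the snapshot step on server 1 would look roughly like this (the snapshot name is a hypothetical placeholder):

```shell
# On server 1: take an on-demand etcd snapshot (name is illustrative).
rke2 etcd-snapshot save --name pre-restore-test

# Snapshots are written under the default snapshot directory;
# list it to find the real path of the file just created.
ls -l /var/lib/rancher/rke2/server/db/snapshots/
```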
After saving the snapshot and getting its real path, this following commands were performed at all three servers:
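The shutdown step on all three servers is presumably the usual pair from the RKE2 docs, sketched here under that assumption:

```shell
# On servers 1, 2, and 3: stop the service, then kill any leftover
# rke2-managed processes and containers before restoring.
systemctl stop rke2-server
rke2-killall.sh
```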
After making sure that all servers are down I then performed a cluster reset with the real path on server 1
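A sketch of the reset step per the documented procedure; <SNAPSHOT_PATH> stands in for the real snapshot path noted earlier:

```shell
# On server 1 only: reset etcd to a single-member cluster and
# restore from the snapshot. <SNAPSHOT_PATH> is a placeholder.
rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=<SNAPSHOT_PATH>

# After the reset process exits, start the service normally.
systemctl start rke2-server
```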
Once server 1 is up and running the last commands were performed at server 2 and 3
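For servers 2 and 3, the guide's documented steps are to wipe the stale etcd data and restart, roughly:

```shell
# On servers 2 and 3: remove the old etcd database so the node
# rejoins the restored cluster instead of reusing its old state.
rm -rf /var/lib/rancher/rke2/server/db

systemctl start rke2-server
```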
I just followed the Etcd Backup and Restore guide. The only difference on servers 2 and 3 is these two extra lines in their config:
server: https://rke2.server1:9345
token: my-shared-secret
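For reference, those two lines would normally live in /etc/rancher/rke2/config.yaml on servers 2 and 3, roughly like this (values copied from the comment above):

```yaml
# /etc/rancher/rke2/config.yaml on servers 2 and 3
server: https://rke2.server1:9345
token: my-shared-secret
```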
OK, so where are the errors on servers 2 and 3? All I see from the brief bit of logs you posted is that they are going through the normal startup process.
Sorry, I should've put more emphasis on the startup process. Servers 2 and 3 are stuck in the startup process and never come up. I'll post the logs.
In addition to the logs from journalctl, also check the etcd and apiserver pod logs under /var/log/pods/.
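A quick way to pull those pod logs; the directory names under /var/log/pods/ include the namespace, pod name, and pod UID, hence the globs (paths are a sketch and may differ per node):

```shell
# Tail the most recent etcd and kube-apiserver container logs.
# Layout is /var/log/pods/<namespace>_<pod>_<uid>/<container>/*.log
tail -n 50 /var/log/pods/kube-system_etcd-*/etcd/*.log
tail -n 50 /var/log/pods/kube-system_kube-apiserver-*/kube-apiserver/*.log
```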
Here are the log files. These are also the etcd IDs in the cluster:
Here are the logs. I also found an older issue of yours, "Incompletely joined nodes are not removed from etcd cluster"; could this be related?
That issue is from about 4 years ago and is closed, so I would say no, it's not related.
Oh... I tried reading the logs but it didn't become any clearer. Any ideas?
The pod logs both end around
Hi, sorry for the late reply. I won't be able to get to it immediately, but I'll get back to you as soon as possible.
Hi! Sorry for a very, very late reply. In which versions has this bug been fixed?
It's been a few months and I don't recall what specifically I was referring to. If possible, please try with the June releases and let us know if you still run into the same issue.
Hello updating from |
Environmental Info:
rke2 version v1.27.10+rke2r1
go version go1.20.13
Node(s) CPU architecture, OS, and Version:
Linux x86_64 GNU/Linux
Cluster Configuration:
3 servers, 2 agents
Describe the bug:
Following the guide for Restoring a Snapshot to Existing Nodes breaks the cluster after restoration of snapshot from server 1.
Steps To Reproduce:
Check systemctl status rke2-server and kubectl get nodes.
Expected behavior:
Servers 1, 2, and 3 should be working the same as with the fresh install. Both systemctl and kubectl should show that the nodes are running.
Actual behavior:
Server 1 works as expected, but servers 2 and 3 are no longer working. systemctl shows that servers 2 and 3 are still starting, while kubectl shows that they are ready.
Additional context / logs:
Server 3 logs:
Server 2 has the same error messages.
Workaround:
To enable snapshot restoration, servers 2 and 3 are required to uninstall rke2 with the rke2-uninstall.sh script and then perform a fresh install. After the fresh install, servers 2 and 3 join server 1 again.
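The workaround above can be sketched as follows; the install command follows the standard RKE2 quick start, and the config values are copied from earlier in the thread (treat both as assumptions about this setup):

```shell
# On servers 2 and 3: wipe the node completely.
/usr/local/bin/rke2-uninstall.sh

# Fresh install of the server role.
curl -sfL https://get.rke2.io | sh -

# Point the node back at server 1 before starting.
mkdir -p /etc/rancher/rke2
cat > /etc/rancher/rke2/config.yaml <<'EOF'
server: https://rke2.server1:9345
token: my-shared-secret
EOF

systemctl enable --now rke2-server
```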