[BUG] Etcd restore does not work on an RKE2 cluster #42895
A similar issue was observed earlier - #40005
Need more logs. Do you have logs from the
I did not have the logs from
@vivek-shilimkar Since 2.8 Alpha-1 is available, can we retest the backup-restore scenario?
Testing on the latest version: also on 2.8 Alpha-2 I wasn't able to reproduce it. Additional note: on alpha I noticed that the fleet-agent of the downstream cluster was in CrashLoopBackOff due to hitting a nil map; that wasn't happening on head. Reason: the error doesn't happen on a single-node cluster.
Using a multi-node cluster I was able to get the error. At the beginning 2 etcd + 1 CP nodes were restored; one etcd node was stuck failing to connect to the server. rke2 journalctl:
After ~45 min the etcd node was able to reconcile itself. Looking into the nodes I noticed that /dev/root was almost full on all nodes instantiated with 16 GB of disk, which may be causing the problem. I'll retry with larger storage. Observation: I initially believed the 16 GB disk could have been the cause of the rke2 binary not being created, but even after updating to larger nodes the problem persists.
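In case it helps others reproduce this, a minimal sketch of the checks behind the disk-pressure theory above, run on each node (assumes a default systemd-based RKE2 install):

```sh
# Check whether the root filesystem is close to full (the /dev/root symptom above).
df -h /

# Follow the rke2-server unit logs while the restore is reconciling.
journalctl -u rke2-server -f
```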
The upgrade behavior is also strange. I created a new cluster using the 3-3-2 layout, all nodes with more storage. After doing the backup and the upgrade I noticed that all etcd nodes were missing the kubeconfig file /etc/rancher/rke2/rke2.yaml. Observation: after talking with Jake, this behavior is expected. After trying to restore the etcd snapshot, those nodes didn't start rke2-server. The error is not consistently reproducible: I tried a 1-1-1 cluster and there, during the upgrade, the CP node got stuck; rke2-server wasn't starting because the CA certificate wasn't authorized for the IP it was trying to use.
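For anyone checking the same thing, this is roughly how to verify on a server node whether that kubeconfig exists and whether the local API server answers (standard RKE2 path; kubectl location varies, e.g. /var/lib/rancher/rke2/bin/kubectl on RKE2 nodes):

```sh
# The kubeconfig RKE2 writes on server nodes; it was missing on the etcd nodes above.
ls -l /etc/rancher/rke2/rke2.yaml

# If present, point kubectl at it and check node status directly from the node.
kubectl --kubeconfig /etc/rancher/rke2/rke2.yaml get nodes
```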
Did some extra tries: when doing the restore before the Kubernetes update the same problem happens, but the nodes that fail are the worker ones.
Some extra information: this problem is related to restoring etcd on K8s 1.26.9 and 1.26.8 on a multi-node cluster. Doing just an etcd backup and restore (etcd only) is enough to reproduce it. The problem doesn't happen while restoring K8s 1.27. Next tests that I'll do today:
@felipe-colussi I tested this on v2.7.8 and a
@felipe-colussi I tested the following scenario on
Merged #43158. The PR fixes the problem where restores got stuck forever with "Waiting for probe: calico".

Even after this PR we still have the following known problems (while using RKE2 on 1.26.8, 1.26.9, 1.25.13)¹:

1. While doing etcd restores (etcd only or all 3), an etcd node gets stuck with:
2. While upgrading to 1.27.6 there is a chance that a worker node gets stuck with:
3. While restoring an etcd snapshot (etcd only or all 3) to 1.26.8 or 1.26.9² there is a chance of it getting stuck forever with:

¹ Probably also happens on 1.25.14, but that wasn't intensively tested.

To inspect a stuck node directly, see the sketch below.
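Not from the thread itself, but a hedged sketch of how one might look at what the static pods on a stuck RKE2 node are doing (paths are the defaults of an RKE2 install and may differ on your setup):

```sh
# RKE2 ships its own crictl; point it at RKE2's containerd via the bundled config.
/var/lib/rancher/rke2/bin/crictl \
  --config /var/lib/rancher/rke2/agent/etc/crictl.yaml ps -a

# Static pod manifests (etcd, kube-apiserver, ...) that the kubelet is acting on.
ls /var/lib/rancher/rke2/agent/pod-manifests
```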
I ran etcd snap/restore checks yesterday on a few Rancher server versions, in an effort to help determine the scope of the recent rke2 snap/restore failures, and the results are below: On
On
On
On Rancher
Note: the Calico issue on worker nodes is no longer encountered with Felipe's fix. In an effort to determine the frequency of
As a quick workaround for this issue, we found that restarting rke2-server after the restore resolves the problem by restarting the kubelet and containerd. It seems that the kubelet gets stuck restarting the kube-controller-manager pod after it exits, while mistakenly reporting that it is in a ready state, so as a workaround the plan can trigger a restart of rke2-server.service.
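For anyone hitting this before a fix ships, the manual version of that workaround is just a unit restart on the affected server node; a minimal sketch (systemd-based install assumed):

```sh
# Restarting rke2-server also restarts the embedded kubelet and containerd,
# which unsticks the kube-controller-manager static pod described above.
sudo systemctl restart rke2-server.service

# Confirm the unit and the static pods came back.
systemctl status rke2-server.service
journalctl -u rke2-server --since "5 minutes ago" | tail -n 20
```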
There is a report of the same behavior of static pods not starting, also on a rancher managed cluster, at rancher/rke2#4864. In this case I believe the issue was triggered by an upgrade to a newer patch release of RKE2, not a cluster restore. |
Validated the issue with Rancher 2.8.0-rc1
Issue Replication
Issue validation
Testing
Rancher Server Setup
d101c27
Information about the Cluster
1.27.5+rke2r1 to v1.26.8+rke2r1
RKE2

User Information
Describe the bug
[BUG] Etcd restore does not work on an RKE2 cluster
To Reproduce
1. Deploy a downstream RKE2 node driver cluster on a 1.26 RKE2 version
2. Take an etcd snapshot
3. Upgrade to a 1.27 RKE2 version
4. Restore to the snapshot taken previously using the "All options - config, k8s and etcd" option

Cluster is stuck in Updating state with the error:

[INFO ] configuring etcd node(s) rke2-backup-restore-etcd-5fb5f775c6x9hzcw-rwkwm: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unknown., waiting for probes: etcd
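A rough sketch of how to watch those node conditions while the cluster is stuck (the node name is the one from the error above; assumes a working kubeconfig for the downstream cluster):

```sh
# Node conditions reported as Unknown match the provisioning error above.
kubectl get nodes

# Drill into the stuck etcd node's condition block.
kubectl describe node rke2-backup-restore-etcd-5fb5f775c6x9hzcw-rwkwm
```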
rancher prov logs:
Note: upgrading from 1.26.8+rke2r1 to 1.27.5+rke2r1
works, but the restore to the snapshot taken on 1.26 fails.