Etcd Backup and Restore Breaks The Cluster #6050

Closed
tripunovic-uros opened this issue May 29, 2024 · 13 comments

Comments

@tripunovic-uros

tripunovic-uros commented May 29, 2024

Environmental Info:
rke2 version v1.27.10+rke2r1
go version go1.20.13

Node(s) CPU architecture, OS, and Version:
Linux x86_64 GNU/Linux

Cluster Configuration:
3 servers, 2 agents

Describe the bug:
Following the guide for Restoring a Snapshot to Existing Nodes breaks the cluster after restoring the snapshot on server 1.

Steps To Reproduce:

Expected behavior:
Servers 1, 2, and 3 should work the same as with a fresh install. Both systemctl and kubectl should show that the nodes are running.

Actual behavior:
Server 1 works as expected, but servers 2 and 3 are no longer working. systemctl shows that servers 2 and 3 are stuck activating (starting), while kubectl shows them as Ready.

Additional context / logs:

Server 3 logs:

● rke2-server.service - Rancher Kubernetes Engine v2 (server)
   Loaded: loaded (/usr/lib/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
   Active: activating (start) since Wed 2024-05-29 12:15:31 UTC; 6s ago
     Docs: https://github.com/rancher/rke2#readme
  Process: 2225358 ExecStopPost=/bin/sh -c systemd-cgls /system.slice/rke2-server.service | grep -Eo '[0-9]+ (containerd|kubelet)' | awk '{print $1}' | xargs -r kill (code=exited, status=0/SUCCESS)
  Process: 2225388 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
  Process: 2225387 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
  Process: 2225385 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
 Main PID: 2225389 (rke2)
    Tasks: 143
   Memory: 3.7G
   CGroup: /system.slice/rke2-server.service
           ├─2034936 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id 5cbfbfabff711632b861f8cf52dda8612e9baaea75b882633d1570b18311e16d -address /run/k3s/containerd/containerd.sock
           ├─2035505 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id 1d383817327dfa7c0eba90069fb71372cffb50d7a9814ee5cffddff2f692cf9b -address /run/k3s/containerd/containerd.sock
           ├─2036191 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id 970522cf848015042f90207648b5828a99cc8c6f05eda415c31ab4eeeeba46a3 -address /run/k3s/containerd/containerd.sock
           ├─2037957 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id e6d1c8b1f906e30ba1ea399708148a059a79ccbad2a51e133d5fdbb0c0c19270 -address /run/k3s/containerd/containerd.sock
           ├─2089307 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id d10be6db6fc10e293cd97d80ff0d6c56fe2f433f339f63a0ed57b6853eb409f3 -address /run/k3s/containerd/containerd.sock
           ├─2089524 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id a8c218833bb001ca324c2960a99f12c145911d385b4e9b49def0ebd46a88fd25 -address /run/k3s/containerd/containerd.sock
           ├─2090110 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id a9c9272842f8775c653e56548851a811749f94a300468b9eb2c779e6b5913667 -address /run/k3s/containerd/containerd.sock
           ├─2090116 /var/lib/rancher/rke2/data/v1.27.10-rke2r1-31de34f39de5/bin/containerd-shim-runc-v2 -namespace k8s.io -id 1bcf16f4d5ec83bd00802f06de80e384de04eda1c728f581e4fdd6a408c2f94b -address /run/k3s/containerd/containerd.sock
           ├─2225389 /usr/bin/rke2 server
           ├─2225421 containerd -c /var/lib/rancher/rke2/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/rke2/agent/containerd
           └─2225544 kubelet --volume-plugin-dir=/var/lib/kubelet/volumeplugins --file-check-frequency=5s --sync-frequency=30s --address=0.0.0.0 --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=systemd --client-ca-file=/var/lib/rancher/rke2/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --container-log-max-files=5 --container-log-max-size=10Mi --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --feature-gates=CloudDualStackNodeIPs=true --feature-gates=RotateKubeletServerCertificate=true --healthz-bind-address=127.0.0.1 --hostname-override=server3.rke2 --kubeconfig=/var/lib/rancher/rke2/agent/kubelet.kubeconfig --node-ip=192.168.0.83 --node-labels= --pod-infra-container-image=someurl/rancher/mirrored-pause:3.6 --pod-manifest-path=/var/lib/rancher/rke2/agent/pod-manifests --pod-max-pids=32768 --protect-kernel-defaults=true --read-only-port=0 --register-with-taints=node-role.kubernetes.io/etcd=true:NoExecute,node-role.kubernetes.io/control-plane:NoSchedule --resolv-conf=/etc/resolv.conf --rotate-certificates=true --serialize-image-pulls=false --streaming-connection-idle-timeout=5m --tls-cert-file=/var/lib/rancher/rke2/agent/serving-kubelet.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 --tls-private-key-file=/var/lib/rancher/rke2/agent/serving-kubelet.key

May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=error msg="Error encountered while importing /var/lib/rancher/rke2/agent/images/runtime-image.txt: failed to pull images from /var/lib/rancher/rke2/agent/images/runtime-image.txt: image \"someurl/rancher/rke2-runtime:v1.27.10-rke2r1\": not found"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=debug msg="Kubelet image credential provider bin directory check failed: stat /var/lib/rancher/credentialprovider/bin: no such file or directory" 
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=info msg="Running kubelet --address=0.0.0.0 --alsologtostderr=false --anonymous-auth=false --authentication-token-webhook=true --authorization-mode=Webhook --cgroup-driver=systemd --client-ca-file=/var/lib/rancher/rke2/agent/client-ca.crt --cloud-provider=external --cluster-dns=10.43.0.10 --cluster-domain=cluster.local --container-log-max-files=5 --container-log-max-size=10Mi --container-runtime-endpoint=unix:///run/k3s/containerd/containerd.sock --containerd=/run/k3s/containerd/containerd.sock --eviction-hard=imagefs.available<5%,nodefs.available<5% --eviction-minimum-reclaim=imagefs.available=10%,nodefs.available=10% --fail-swap-on=false --feature-gates=CloudDualStackNodeIPs=true --feature-gates=RotateKubeletServerCertificate=true --healthz-bind-address=127.0.0.1 --hostname-override=server3.rke2 --kubeconfig=/var/lib/rancher/rke2/agent/kubelet.kubeconfig --log-file=/var/lib/rancher/rke2/agent/logs/kubelet.log --log-file-max-size=50 --logtostderr=false --node-ip=192.168.0.83 --node-labels= --pod-infra-container-image=someurl/rancher/mirrored-pause:3.6 --pod-manifest-path=/var/lib/rancher/rke2/agent/pod-manifests --pod-max-pids=32768 --protect-kernel-defaults=true --read-only-port=0 --register-with-taints=node-role.kubernetes.io/etcd=true:NoExecute,node-role.kubernetes.io/control-plane:NoSchedule --resolv-conf=/etc/resolv.conf --rotate-certificates=true --serialize-image-pulls=false --stderrthreshold=FATAL --streaming-connection-idle-timeout=5m --tls-cert-file=/var/lib/rancher/rke2/agent/serving-kubelet.crt --tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256 --tls-private-key-file=/var/lib/rancher/rke2/agent/serving-kubelet.key"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=info msg="Connecting to proxy" url="wss://127.0.0.1:9345/v1-rke2/connect"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=debug msg="Tunnel server handing HTTP/1.1 GET request for /v1-rke2/connect from 127.0.0.1:58288"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=info msg="Handling backend connection request [server3.rke2]"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=info msg="Cluster-Http-Server 2024/05/29 12:15:37 http: TLS handshake error from 127.0.0.1:58290: remote error: tls: bad certificate"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=info msg="Annotations and labels have already set on node: server3.rke2"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=debug msg="Waiting for Ready condition to be updated for Kubelet Port assignment"
May 29 12:15:37 server3.rke2 rke2-server[2225389]: time="2024-05-29T12:15:37Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"

Server 2 has the same error messages.

Workaround:

To complete the snapshot restoration, rke2 has to be uninstalled on servers 2 and 3 with the rke2-uninstall.sh script, followed by a fresh install. After the fresh install, servers 2 and 3 join server 1 again.
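
A rough sketch of that workaround on servers 2 and 3, assuming an RPM-based install (the uninstall script lives under /usr/local/bin for tarball installs):

    # Remove the broken install entirely
    /usr/bin/rke2-uninstall.sh

    # Reinstall rke2 with the official install script
    curl -sfL https://get.rke2.io | sh -

    # Recreate /etc/rancher/rke2/config.yaml with the server and token lines,
    # then start the service so the node rejoins the restored server 1
    systemctl enable --now rke2-server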

@brandond
Member

I cannot reproduce this. Please show the specific steps you are following to rejoin nodes 2 and 3 to the cluster after restoring the snapshot on node 1.

@tripunovic-uros
Author

tripunovic-uros commented May 29, 2024

I'll write out all the steps. The following command was performed on server 1:

  1. rke2 etcd-snapshot save --name test
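
(Before shutting anything down, the snapshot's real on-disk path can be confirmed; a small sketch, assuming the default snapshot location:)

    # Snapshots are written under the server db directory by default; the file name
    # combines the --name value with the hostname and a timestamp
    ls -l /var/lib/rancher/rke2/server/db/snapshots/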

After saving the snapshot and getting its real path, the following commands were performed on all three servers:

  1. systemctl stop rke2-server
  2. rke2-killall.sh

After making sure that all servers were down, I performed a cluster reset with the snapshot's real path on server 1:

  1. rke2 server --cluster-reset --cluster-reset-restore-path=<PATH-TO-SNAPSHOT>
  2. systemctl start rke2-server
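
(A quick sanity check that server 1 actually came back before touching the other servers, assuming the default kubeconfig path:)

    export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
    /var/lib/rancher/rke2/bin/kubectl get nodes    # server 1 should report Ready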

Once server 1 was up and running, the last commands were performed on servers 2 and 3:

  1. rm -rf /var/lib/rancher/rke2/server/db
  2. systemctl start rke2-server
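
(The rejoin progress on servers 2 and 3 can be followed live while they start:)

    # Follow the rke2-server unit logs on the joining node
    journalctl -u rke2-server -f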

I just followed the Etcd Backup and Restore guide. The only difference is that servers 2 and 3 have these two extra lines in their config.yaml file:

server: https://rke2.server1:9345
token: my-shared-secret

@brandond
Member

Ok, so where are the errors on servers 2 and 3? All I see from the brief bit of logs you posted is that they are going through the normal startup process.

@tripunovic-uros
Author

tripunovic-uros commented May 29, 2024

Sorry, I should've put more emphasis on the startup process. Servers 2 and 3 are stuck in the startup process and never get up and running. I'll post journalctl logs of servers 2 and 3 tomorrow, but both servers are stuck in an infinite loop of trying to get the node ready.

@brandond
Member

brandond commented May 29, 2024

In addition to the logs from journalctl, also check the etcd and apiserver pod logs under /var/log/pods/
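
As a sketch of what that looks like on an rke2 server node (assuming the usual static pod layout under the kube-system namespace):

    # Pod log directories are named <namespace>_<pod>_<uid>
    ls /var/log/pods/ | grep -E 'etcd|kube-apiserver'

    # Tail the most recent etcd container log, for example
    tail -n 200 /var/log/pods/kube-system_etcd-*/etcd/*.log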

@tripunovic-uros
Author

tripunovic-uros commented May 30, 2024

Here are the log files. These are also the etcd member IDs in the cluster:

+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| https://192.168.0.82:2379 |  5094b2a4960282f |   3.5.9 |   25 MB |     false |      false |         5 |      32249 |              32249 |        |
| https://192.168.0.83:2379 | 76e572814f3fac47 |   3.5.9 |   25 MB |     false |      false |         5 |      32249 |              32249 |        |
| https://192.168.0.81:2379 | e8319108803585bc |   3.5.9 |   39 MB |      true |      false |         5 |      32249 |              32249 |        |
+---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
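
(For reference, a table like this can be produced from any server node that has an etcdctl binary available, since rke2 does not ship one by default; a sketch, assuming the standard rke2 etcd client certificate paths:)

    ETCDCTL_API=3 etcdctl \
      --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt \
      --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt \
      --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key \
      --endpoints https://192.168.0.81:2379,https://192.168.0.82:2379,https://192.168.0.83:2379 \
      endpoint status -w table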

Here are the logs:
kube-apiserver.txt
journalctl-server2.txt.zip
etcd-server2.txt

I also found an older issue of yours, "Incompletely joined nodes are not removed from etcd cluster". Could this be related?

@brandond
Member

That issue is from like 4 years ago and is closed, so I would say no, it's not related.

@tripunovic-uros
Author

tripunovic-uros commented May 30, 2024

Oh... I tried reading the logs but they didn't make things any clearer. Any ideas?

@brandond
Member

The pod logs both end around 2024-05-30T06:57:49 when etcd appears to have been cleanly shut down and the service stopped. The journald logs show that the service is frequently crashing on startup due to a bug that has been fixed in newer releases of rke2 - can you upgrade to the latest available release, and see if the issue persists?
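
A sketch of the upgrade on an RPM-based install like this one (the yum repo may need to be switched to the newer minor-version channel first; tarball installs can rerun the install script with INSTALL_RKE2_VERSION instead):

    # One server at a time: stop, update the package, start, verify
    systemctl stop rke2-server
    yum update -y rke2-server
    systemctl start rke2-server
    rke2 --version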

@tripunovic-uros
Author

Hi, sorry for the late reply. I won't be able to get to it immediately, but I'll get back to you as soon as possible.

@tripunovic-uros
Author

Hi!

Sorry for the very, very late reply. In which versions has this bug been fixed?

@brandond
Member

brandond commented Jul 5, 2024

It's been a few months and I don't recall what specifically I was referring to. If possible, please try with the June releases and let us know if you still run into the same issue.

@tripunovic-uros
Author

tripunovic-uros commented Jul 11, 2024

Hello, updating from rke2-server-1.27.10~rke2r1-0.el8.x86_64 to rke2-server-1.28.11~rke2r1-0.el8.x86_64 solved the issue! I haven't tried other patch versions of v1.27 or v1.28 to see when this bug was fixed. Thanks for the help!
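
(A quick way to confirm every node picked up the new version after the upgrade, for anyone landing here later:)

    export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
    /var/lib/rancher/rke2/bin/kubectl get nodes    # the VERSION column should show v1.28.11+rke2r1 on every node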
