
[Release-1.27] - no nodes have reconciled ETCDSnapshotFile resources, requeuing error on clusters with no etcd snapshots #4913

Closed
brandond opened this issue Oct 19, 2023 · 1 comment

@brandond (Contributor)

Backport fix for no nodes have reconciled ETCDSnapshotFile resources, requeuing error on clusters with no etcd snapshots

@mdrahman-suse (Contributor)

Validated with RC version v1.27.7-rc2+rke2r1

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-1031-aws x86_64)

Cluster Configuration:

Split roles: 1 etcd-only server, 1 control-plane (cp)-only server, 1 agent

Config.yaml:

# server1 (etcd only)
write-kubeconfig-mode: 644
token: <TOKEN>
node-external-ip: "<etcdonly-ip>"
node-name: etcdonly
disable-apiserver: true
disable-controller-manager: true
disable-scheduler: true
node-taint:
  - node-role.kubernetes.io/etcd:NoExecute

# server2 (cp only)
write-kubeconfig-mode: 644
token: <TOKEN>
node-external-ip: "<cponly-ip>"
node-name: cponly
server: "https://<etcdonly-ip>:9345"
disable-etcd: true
node-taint:
  - node-role.kubernetes.io/control-plane:NoSchedule

# agent1
server: "https://<etcdonly-ip>:9345"
token: <TOKEN>
node-external-ip: "<agent-ip>"
node-name: agent
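
A quick way to confirm the split roles took effect is to inspect the node taints once the cluster is up; this command is an illustrative addition, not part of the original report:

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key

Expected: the etcdonly node carries node-role.kubernetes.io/etcd and the cponly node carries node-role.kubernetes.io/control-plane.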

Testing Steps

  1. Copy config.yaml to each node:
$ sudo mkdir -p /etc/rancher/rke2 && sudo cp config.yaml /etc/rancher/rke2
  2. Install rke2 (see the install sketch after this list)
  3. Check that the cp-only server log does not contain the error:
level=error msg="error syncing '_reconcile_': handler managed-etcd-snapshots-controller: no nodes have reconciled ETCDSnapshotFile resources, requeuing"
  4. Ensure the cluster is up and running
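
For step 2, a minimal install sketch, assuming the standard get.rke2.io install script; INSTALL_RKE2_VERSION and INSTALL_RKE2_TYPE are the script's documented variables, though pinning to the RC under test here is illustrative:

# On the etcd-only and cp-only server nodes:
$ curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION="v1.27.7-rc2+rke2r1" sh -
$ sudo systemctl enable --now rke2-server.service

# On the agent node:
$ curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION="v1.27.7-rc2+rke2r1" INSTALL_RKE2_TYPE="agent" sh -
$ sudo systemctl enable --now rke2-agent.service

# Step 3 check: this grep should return no matches if the fix is in place
$ sudo journalctl -u rke2-server | grep 'no nodes have reconciled ETCDSnapshotFile'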

Replication Results:

  • Attempted to reproduce the issue with release-1.27 commit 2e777bd but was unable to observe the error
  • The cluster does not even come up properly on that commit

Validation Results:

  • rke2 version used for validation:
rke2 version v1.27.7-rc2+rke2r1 (5fdc8d79f890391e6ce3485a5b92fa4925398210)
go version go1.20.10 X:boringcrypto
  • Observed that the error message is NOT present:
$ sudo journalctl -u rke2-server | grep 'ETCDSnapshotFile'
Oct 24 16:51:21 ip-xxx-xx-14-205 rke2[1771]: time="2023-10-24T16:51:21Z" level=info msg="Starting k3s.cattle.io/v1, Kind=ETCDSnapshotFile controller"
ubuntu@ip-xxx-xx-14-205:~$
  • Cluster is up and running
$ kubectl get nodes,pods -A -o wide
NAME                                               STATUS   ROLES                  AGE    VERSION          INTERNAL-IP     EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
node/ip-xxx-xx-14-205.us-east-2.compute.internal   Ready    control-plane,master   102m   v1.27.7+rke2r1   xxx.xx.14.205   xx.xxx.13.209   Ubuntu 22.04.1 LTS   5.15.0-1019-aws   containerd://1.7.7-k3s1
node/ip-xxx-xx-2-103.us-east-2.compute.internal    Ready    etcd                   102m   v1.27.7+rke2r1   xxx.xx.2.103    xx.xxx.63.118   Ubuntu 22.04.1 LTS   5.15.0-1019-aws   containerd://1.7.7-k3s1
node/ip-xxx-xx-7-121.us-east-2.compute.internal    Ready    <none>                 102m   v1.27.7+rke2r1   xxx.xx.7.121    xx.xx.77.49     Ubuntu 22.04.1 LTS   5.15.0-1019-aws   containerd://1.7.7-k3s1

NAMESPACE     NAME                                                                       READY   STATUS      RESTARTS   AGE    IP              NODE                                          NOMINATED NODE   READINESS GATES
kube-system   pod/cloud-controller-manager-ip-xxx-xx-14-205.us-east-2.compute.internal   1/1     Running     0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/cloud-controller-manager-ip-xxx-xx-2-103.us-east-2.compute.internal    1/1     Running     0          102m   xxx.xx.2.103    ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>
kube-system   pod/etcd-ip-xxx-xx-2-103.us-east-2.compute.internal                        1/1     Running     0          102m   xxx.xx.2.103    ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-canal-jwjd9                                          0/1     Completed   0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/helm-install-rke2-coredns-lj6pr                                        0/1     Completed   0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/helm-install-rke2-ingress-nginx-gw2dw                                  0/1     Completed   0          102m   xx.xx.2.5       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-metrics-server-22wgg                                 0/1     Completed   0          102m   xx.xx.2.6       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-controller-crd-lgt6r                        0/1     Completed   0          102m   xx.xx.2.3       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-controller-q8xnm                            0/1     Completed   1          102m   xx.xx.2.4       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-validation-webhook-jb2xh                    0/1     Completed   0          102m   xx.xx.2.2       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/kube-apiserver-ip-xxx-xx-14-205.us-east-2.compute.internal             1/1     Running     0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/kube-controller-manager-ip-xxx-xx-14-205.us-east-2.compute.internal    1/1     Running     0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/kube-proxy-ip-xxx-xx-14-205.us-east-2.compute.internal                 1/1     Running     0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/kube-proxy-ip-xxx-xx-2-103.us-east-2.compute.internal                  1/1     Running     0          102m   xxx.xx.2.103    ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>
kube-system   pod/kube-proxy-ip-xxx-xx-7-121.us-east-2.compute.internal                  1/1     Running     0          101m   xxx.xx.7.121    ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/kube-scheduler-ip-xxx-xx-14-205.us-east-2.compute.internal             1/1     Running     0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-canal-czstx                                                       2/2     Running     0          101m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-canal-j2r4t                                                       2/2     Running     0          101m   xxx.xx.7.121    ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-canal-rf9d7                                                       2/2     Running     0          101m   xxx.xx.2.103    ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-coredns-rke2-coredns-autoscaler-6f97df447-rpzls                   1/1     Running     0          101m   xx.xx.1.3       ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-coredns-rke2-coredns-f6c9f9649-br5g2                              1/1     Running     0          101m   xx.xx.0.2       ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-coredns-rke2-coredns-f6c9f9649-j7hjk                              1/1     Running     0          101m   xx.xx.1.2       ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-ingress-nginx-controller-rxskf                                    1/1     Running     0          99m    xx.xx.2.10      ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-metrics-server-6d79d977db-7pjbt                                   1/1     Running     0          100m   xx.xx.2.7       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-snapshot-controller-7d6476d7cb-99vlk                              1/1     Running     0          100m   xx.xx.2.9       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-snapshot-validation-webhook-5649fbd66c-9sj2t                      1/1     Running     0          100m   xx.xx.1.4       ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>

NOTE: The issue was not observed on an all-roles cluster setup
