
[Release-1.27] - no nodes have reconciled ETCDSnapshotFile resources, requeuing error on clusters with no etcd snapshots #4913

Closed
brandond opened this issue Oct 19, 2023 · 1 comment

@brandond (Contributor)

Backport fix for no nodes have reconciled ETCDSnapshotFile resources, requeuing error on clusters with no etcd snapshots

@mdrahman-suse (Contributor)

Validated with RC version v1.27.7-rc2+rke2r1

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

Ubuntu 22.04.2 LTS (GNU/Linux 5.15.0-1031-aws x86_64)

Cluster Configuration:

Split roles: 1 etcd-only server, 1 control-plane (cp)-only server, 1 agent

Config.yaml:

# server1 (etcd only)
write-kubeconfig-mode: 644
token: <TOKEN>
node-external-ip: "<etcdonly-ip>"
node-name: etcdonly
disable-apiserver: true
disable-controller-manager: true
disable-scheduler: true
node-taint:
  - node-role.kubernetes.io/etcd:NoExecute

# server2 (cp only)
write-kubeconfig-mode: 644
token: <TOKEN>
node-external-ip: "<cponly-ip>"
node-name: cponly
server: "https://<etcdonly-ip>:9345"
disable-etcd: true
node-taint:
  - node-role.kubernetes.io/control-plane:NoSchedule

# agent1
server: "https://<etcdonly-ip>:9345"
token: <TOKEN>
node-external-ip: "<agent-ip>"
node-name: agent
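
A quick way to confirm the split roles took effect is to inspect the node taints once the cluster is up; this command is an illustrative addition, not part of the original report:

$ kubectl get nodes -o custom-columns=NAME:.metadata.name,TAINTS:.spec.taints[*].key

Expected: the etcdonly node carries node-role.kubernetes.io/etcd and the cponly node carries node-role.kubernetes.io/control-plane.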

Testing Steps

  1. Copy config.yaml to each node:
$ sudo mkdir -p /etc/rancher/rke2 && sudo cp config.yaml /etc/rancher/rke2
  2. Install rke2 (see the install sketch after this list)
  3. Check that the cp-only server log does not contain the error:
level=error msg="error syncing '_reconcile_': handler managed-etcd-snapshots-controller: no nodes have reconciled ETCDSnapshotFile resources, requeuing"
  4. Ensure the cluster is up and running
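
For step 2, a minimal install sketch, assuming the standard get.rke2.io install script; INSTALL_RKE2_VERSION and INSTALL_RKE2_TYPE are the script's documented variables, though pinning to the RC under test here is illustrative:

# On the etcd-only and cp-only server nodes:
$ curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION="v1.27.7-rc2+rke2r1" sh -
$ sudo systemctl enable --now rke2-server.service

# On the agent node:
$ curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_VERSION="v1.27.7-rc2+rke2r1" INSTALL_RKE2_TYPE="agent" sh -
$ sudo systemctl enable --now rke2-agent.service

# Step 3 check: this grep should return no matches if the fix is in place
$ sudo journalctl -u rke2-server | grep 'no nodes have reconciled ETCDSnapshotFile'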

Replication Results:

  • Attempted to reproduce the issue with release-1.27 commit 2e777bd but was unable to observe the error
  • The cluster does not even come up properly on that commit

Validation Results:

  • rke2 version used for validation:
rke2 version v1.27.7-rc2+rke2r1 (5fdc8d79f890391e6ce3485a5b92fa4925398210)
go version go1.20.10 X:boringcrypto
  • Observed that the error message is NOT present:
$ sudo journalctl -u rke2-server | grep 'ETCDSnapshotFile'
Oct 24 16:51:21 ip-xxx-xx-14-205 rke2[1771]: time="2023-10-24T16:51:21Z" level=info msg="Starting k3s.cattle.io/v1, Kind=ETCDSnapshotFile controller"
ubuntu@ip-xxx-xx-14-205:~$
  • Cluster is up and running
$ kubectl get nodes,pods -A -o wide
NAME                                               STATUS   ROLES                  AGE    VERSION          INTERNAL-IP     EXTERNAL-IP     OS-IMAGE             KERNEL-VERSION    CONTAINER-RUNTIME
node/ip-xxx-xx-14-205.us-east-2.compute.internal   Ready    control-plane,master   102m   v1.27.7+rke2r1   xxx.xx.14.205   xx.xxx.13.209   Ubuntu 22.04.1 LTS   5.15.0-1019-aws   containerd://1.7.7-k3s1
node/ip-xxx-xx-2-103.us-east-2.compute.internal    Ready    etcd                   102m   v1.27.7+rke2r1   xxx.xx.2.103    xx.xxx.63.118   Ubuntu 22.04.1 LTS   5.15.0-1019-aws   containerd://1.7.7-k3s1
node/ip-xxx-xx-7-121.us-east-2.compute.internal    Ready    <none>                 102m   v1.27.7+rke2r1   xxx.xx.7.121    xx.xx.77.49     Ubuntu 22.04.1 LTS   5.15.0-1019-aws   containerd://1.7.7-k3s1

NAMESPACE     NAME                                                                       READY   STATUS      RESTARTS   AGE    IP              NODE                                          NOMINATED NODE   READINESS GATES
kube-system   pod/cloud-controller-manager-ip-xxx-xx-14-205.us-east-2.compute.internal   1/1     Running     0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/cloud-controller-manager-ip-xxx-xx-2-103.us-east-2.compute.internal    1/1     Running     0          102m   xxx.xx.2.103    ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>
kube-system   pod/etcd-ip-xxx-xx-2-103.us-east-2.compute.internal                        1/1     Running     0          102m   xxx.xx.2.103    ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-canal-jwjd9                                          0/1     Completed   0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/helm-install-rke2-coredns-lj6pr                                        0/1     Completed   0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/helm-install-rke2-ingress-nginx-gw2dw                                  0/1     Completed   0          102m   xx.xx.2.5       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-metrics-server-22wgg                                 0/1     Completed   0          102m   xx.xx.2.6       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-controller-crd-lgt6r                        0/1     Completed   0          102m   xx.xx.2.3       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-controller-q8xnm                            0/1     Completed   1          102m   xx.xx.2.4       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/helm-install-rke2-snapshot-validation-webhook-jb2xh                    0/1     Completed   0          102m   xx.xx.2.2       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/kube-apiserver-ip-xxx-xx-14-205.us-east-2.compute.internal             1/1     Running     0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/kube-controller-manager-ip-xxx-xx-14-205.us-east-2.compute.internal    1/1     Running     0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/kube-proxy-ip-xxx-xx-14-205.us-east-2.compute.internal                 1/1     Running     0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/kube-proxy-ip-xxx-xx-2-103.us-east-2.compute.internal                  1/1     Running     0          102m   xxx.xx.2.103    ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>
kube-system   pod/kube-proxy-ip-xxx-xx-7-121.us-east-2.compute.internal                  1/1     Running     0          101m   xxx.xx.7.121    ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/kube-scheduler-ip-xxx-xx-14-205.us-east-2.compute.internal             1/1     Running     0          102m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-canal-czstx                                                       2/2     Running     0          101m   xxx.xx.14.205   ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-canal-j2r4t                                                       2/2     Running     0          101m   xxx.xx.7.121    ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-canal-rf9d7                                                       2/2     Running     0          101m   xxx.xx.2.103    ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-coredns-rke2-coredns-autoscaler-6f97df447-rpzls                   1/1     Running     0          101m   xx.xx.1.3       ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-coredns-rke2-coredns-f6c9f9649-br5g2                              1/1     Running     0          101m   xx.xx.0.2       ip-xxx-xx-14-205.us-east-2.compute.internal   <none>           <none>
kube-system   pod/rke2-coredns-rke2-coredns-f6c9f9649-j7hjk                              1/1     Running     0          101m   xx.xx.1.2       ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-ingress-nginx-controller-rxskf                                    1/1     Running     0          99m    xx.xx.2.10      ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-metrics-server-6d79d977db-7pjbt                                   1/1     Running     0          100m   xx.xx.2.7       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-snapshot-controller-7d6476d7cb-99vlk                              1/1     Running     0          100m   xx.xx.2.9       ip-xxx-xx-7-121.us-east-2.compute.internal    <none>           <none>
kube-system   pod/rke2-snapshot-validation-webhook-5649fbd66c-9sj2t                      1/1     Running     0          100m   xx.xx.1.4       ip-xxx-xx-2-103.us-east-2.compute.internal    <none>           <none>

NOTE: The issue was not observed on an all-roles cluster setup
