
Cluster unrecoverable after every power outage - nodes all say ready (even when off) #3560

rnic92 opened this issue Apr 19, 2024 · 1 comment

rnic92 commented Apr 19, 2024

RKE2 Version:
v1.29.1+rke2r1

Node(s) CPU architecture, OS, and Version:
10 HP servers, RHEL 8.9

I have a 10-node cluster running v1.29.1+rke2r1:
3 masters
7 workers

I had a power outage last night. When I checked the cluster this morning, every node says Ready, but no pods will schedule and about half of all containers are restarting.
The whole cluster ends up in a defunct state after any full-cluster outage (this has happened before, and that time I rebuilt the cluster from scratch).
I powered off all nodes again and started booting them up one at a time, beginning with master-1.

Master-1's kubelet failed to initialize. containerd started, but was only running 3 containers: scheduler, etcd, and proxy.
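
To see exactly what containerd is running on master-1 and why the kubelet won't come up, I can use the crictl binary that RKE2 ships and the rke2-server journal (the paths below are the RKE2 defaults, so adjust if yours differ):

export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps
journalctl -u rke2-server --since "1 hour ago" --no-pager | tail -n 200
tail -n 200 /var/lib/rancher/rke2/agent/logs/kubelet.log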

Booted master-2.

Master-1 and master-2 initialized and their kubelets started. Many more containers began to spawn.

Checked kubectl get nodes:
all 10 nodes are in a Ready state. I'm confused about how 8 nodes can be powered off and still Ready.
Waited two hours.
Same situation: 2 nodes powered on, 10 "Ready".

Powered on master-3.
It seems to come up, the kube-apiserver keeps responding, and kubectl has shown 10 Ready nodes all day.
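
As far as I understand, a node's Ready condition is only flipped to Unknown by the node-lifecycle controller in kube-controller-manager, so with the controller-manager unhealthy the stale Ready status from before the outage just sticks. A way to check whether the heartbeats behind those conditions are actually fresh (assuming working kubectl access; <node-name> is a placeholder for one of the powered-off workers):

kubectl get leases -n kube-node-lease
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}'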

kubectl get --raw='/readyz?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]etcd-readiness ok
[+]kms-providers ok
[+]informer-sync ok
[+]poststarthook/start-encryption-provider-config-automatic-reload ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/start-service-ip-repair-controllers ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-system-namespaces-controller ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
[+]poststarthook/start-legacy-token-tracking-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/apiservice-openapiv3-controller ok
[+]poststarthook/apiservice-discovery-controller ok
[+]shutdown ok
readyz check passed

I can't run k9s:
k9s
Error: [list watch] access denied on resource "default":"v1/pods"
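
The k9s error may just be a kubeconfig/RBAC problem on my side rather than the cluster itself, so a quick sanity check:

kubectl config current-context
kubectl auth can-i list pods -n default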

Nodes are reporting incorrectly.
Pods still won't delete or schedule.
Pods all report Running when they are in fact on powered-off nodes.
kube-scheduler and kube-controller-manager on master-3 are constantly in CrashLoopBackOff.

master-3 kube-scheduler logs:
E0419 20:18:33.716881       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0419 20:18:37.800008       1 leaderelection.go:260] successfully acquired lease kube-system/kube-scheduler
E0419 20:18:46.833219       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0419 20:18:51.830278       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": context deadline exceeded
I0419 20:18:51.830734       1 leaderelection.go:285] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0419 20:18:51.844243       1 server.go:242] "Leaderelection lost"
master-2 kube-scheduler logs:
0419 20:22:08.320200       1 leaderelection.go:369] Failed to update lock: the server was unable to return a response in the time allotted, but may still be processing the request (put leases.coordination.k8s.io kube-scheduler)
E0419 20:22:15.485692       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0419 20:22:22.823885       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0419 20:22:31.019808       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0419 20:22:40.294291       1 leaderelection.go:369] Failed to update lock: the server was unable to return a response in the time allotted, but may still be processing the request (put leases.coordination.k8s.io kube-scheduler)
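
All of those leader-election failures are 5-second timeouts on writes to the local apiserver at 127.0.0.1:6443, so the apiserver is accepting connections but responding too slowly. A rough way to reproduce the same slow call from a control-plane node (assuming kubectl is configured there):

time kubectl -n kube-system get lease kube-scheduler -o yaml --request-timeout=5s
time kubectl get --raw='/readyz' --request-timeout=5s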

etcdctl --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint status --write-out table --endpoints=$ETCDCTL_ENDPOINTS
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ip1-redacted | id1-redacted |   3.5.9 |   76 MB |     false |      false |         8 |   49748961 |           49748961 |        |
| ip2-redacted | id2-redacted |   3.5.9 |   76 MB |      true |      false |         8 |   49748961 |           49748961 |        |
| ip3-redacted | id3-redacted |   3.5.9 |   76 MB |      false |      false |         8 |   49748961 |           49748961 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
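
Since the members all agree on the raft index, the next things I can check with the same certificates are basic endpoint health and any standing alarms (e.g. NOSPACE):

etcdctl --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint health --endpoints=$ETCDCTL_ENDPOINTS
etcdctl --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key alarm list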

I don't know where else to look.

rnic92 commented Apr 19, 2024

k get po -n kube-system
NAME                                                    READY   STATUS             RESTARTS         AGE
cloud-controller-manager-ndsv01.muos.lab                1/1     Running            40 (93m ago)     38d
cloud-controller-manager-ndsv02.muos.lab                1/1     Running            3 (92m ago)      4h26m
cloud-controller-manager-ndsv03.muos.lab                0/1     CrashLoopBackOff   20 (2m14s ago)   38d
etcd-ndsv01.muos.lab                                    1/1     Running            2                38d
etcd-ndsv02.muos.lab                                    1/1     Running            1                4h26m
etcd-ndsv03.muos.lab                                    1/1     Running            2                38d
istio-cni-node-6wzcc                                    1/1     Running            1 (9h ago)       38d
istio-cni-node-885mh                                    1/1     Running            0                38d
istio-cni-node-cft4m                                    1/1     Running            1 (9h ago)       38d
istio-cni-node-g7chm                                    1/1     Running            1 (9h ago)       38d
istio-cni-node-hbhzq                                    1/1     Running            0                38d
istio-cni-node-nxxht                                    1/1     Running            1 (93m ago)      4h26m
istio-cni-node-qd5h8                                    1/1     Running            2 (48m ago)      38d
istio-cni-node-slww5                                    1/1     Running            1 (9h ago)       22d
istio-cni-node-t2d2s                                    1/1     Running            2 (135m ago)     22d
istio-cni-node-vk2mh                                    1/1     Running            1 (9h ago)       38d
kube-apiserver-ndsv01.muos.lab                          1/1     Running            5 (135m ago)     8h
kube-apiserver-ndsv02.muos.lab                          1/1     Running            1                4h25m
kube-apiserver-ndsv03.muos.lab                          1/1     Running            3                8h
kube-controller-manager-ndsv01.muos.lab                 1/1     Running            37 (94m ago)     38d
kube-controller-manager-ndsv02.muos.lab                 1/1     Running            3 (92m ago)      4h25m
kube-controller-manager-ndsv03.muos.lab                 0/1     CrashLoopBackOff   19 (3m31s ago)   38d
kube-proxy-ndar03.muos.lab                              1/1     Running            0                22d
kube-proxy-ndar04.muos.lab                              1/1     Running            0                22d
kube-proxy-ndsv01.muos.lab                              1/1     Running            1 (135m ago)     9h
kube-proxy-ndsv02.muos.lab                              1/1     Running            0                91m
kube-proxy-ndsv03.muos.lab                              1/1     Running            0                47m
kube-proxy-ndsv04.muos.lab                              1/1     Running            0                9h
kube-proxy-ndsv05.muos.lab                              1/1     Running            0                9h
kube-proxy-ndsv06.muos.lab                              1/1     Running            0                9h
kube-proxy-ndsv07.muos.lab                              1/1     Running            0                9h
kube-proxy-ndsv08.muos.lab                              1/1     Running            0                9h
kube-scheduler-ndsv01.muos.lab                          1/1     Running            5 (135m ago)     38d
kube-scheduler-ndsv02.muos.lab                          1/1     Running            2 (92m ago)      4h25m
kube-scheduler-ndsv03.muos.lab                          0/1     CrashLoopBackOff   18 (85s ago)     38d
rke2-canal-225vn                                        2/2     Running            2 (93m ago)      4h26m
rke2-canal-4snmm                                        2/2     Running            4 (48m ago)      38d
rke2-canal-78xl5                                        2/2     Running            4 (135m ago)     38d
rke2-canal-8f9sp                                        2/2     Running            2 (9h ago)       38d
rke2-canal-f2qd2                                        2/2     Running            2 (9h ago)       38d
rke2-canal-ksxm4                                        2/2     Running            0                38d
rke2-canal-qqtf9                                        2/2     Running            2 (9h ago)       38d
rke2-canal-t4fhv                                        2/2     Running            2 (9h ago)       38d
rke2-canal-ttm28                                        2/2     Running            2 (9h ago)       22d
rke2-canal-z98fz                                        2/2     Running            0                22d
rke2-coredns-rke2-coredns-9849d5ddb-6mgcw               1/1     Running            0                6h27m
rke2-coredns-rke2-coredns-9849d5ddb-9pzhr               1/1     Running            1 (9h ago)       36d
rke2-coredns-rke2-coredns-9849d5ddb-jzgdg               1/1     Running            2 (135m ago)     22d
rke2-coredns-rke2-coredns-9849d5ddb-x8g4n               1/1     Running            2 (48m ago)      38d
rke2-coredns-rke2-coredns-autoscaler-64b867c686-pr6g7   1/1     Running            2 (48m ago)      36d
rke2-metrics-server-544c8c66fc-28lrc                    1/1     Running            4 (47m ago)      36d
rke2-multus-46kml                                       1/1     Running            0                38d
rke2-multus-5p7fh                                       1/1     Running            1 (9h ago)       38d
rke2-multus-6wtmx                                       1/1     Running            3 (9h ago)       38d
rke2-multus-9jp5w                                       1/1     Running            2 (9h ago)       38d
rke2-multus-bxlbc                                       1/1     Running            2 (48m ago)      38d
rke2-multus-mh42p                                       1/1     Running            6 (9h ago)       38d
rke2-multus-qt4j2                                       1/1     Running            1 (93m ago)      4h26m
rke2-multus-v5d42                                       1/1     Running            0                22d
rke2-multus-wlpvc                                       1/1     Running            1 (9h ago)       38d
rke2-multus-wqvxx                                       1/1     Running            2 (135m ago)     38d
rke2-snapshot-controller-59cc9cd8f4-nb2bz               0/1     CrashLoopBackOff   54 (85s ago)     36d
rke2-snapshot-validation-webhook-54c5989b65-5xdps       1/1     Running            3 (47m ago)      36d
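
To get the exit reason for the crash-looping control-plane pods, the previous-container logs should help (using the pod names from the listing above):

kubectl logs -n kube-system kube-controller-manager-ndsv03.muos.lab --previous
kubectl logs -n kube-system kube-scheduler-ndsv03.muos.lab --previous
kubectl logs -n kube-system rke2-snapshot-controller-59cc9cd8f4-nb2bz --previous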
