
Cluster unrecoverable after every power outage - nodes all say ready (even when off) #3560

rnic92 opened this issue Apr 19, 2024 · 1 comment

rnic92 commented Apr 19, 2024

RKE2 Version:
v1.29.1+rke2r1

Node(s) CPU architecture, OS, and Version:
10 HP servers, RHEL 8.9

I have a 10-node cluster running v1.29.1+rke2r1:
3 masters
7 workers

I had a power outage last night. When I checked the cluster this morning, every node says Ready, but no pods will schedule and about half of all containers are restarting.
The whole cluster ends up in a defunct state after any full-cluster outage (this has happened before, and that time I rebuilt the cluster from scratch).
I powered off all nodes again and started booting them up one at a time, beginning with master-1.

Master-1's kubelet failed to initialize. containerd started, but was only running 3 containers: scheduler, etcd, and proxy.
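
To see exactly what containerd is running on master-1 and why the kubelet won't come up, I can use the crictl binary that RKE2 ships and the rke2-server journal (the paths below are the RKE2 defaults, so adjust if yours differ):

export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps
journalctl -u rke2-server --since "1 hour ago" --no-pager | tail -n 200
tail -n 200 /var/lib/rancher/rke2/agent/logs/kubelet.log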

Booted master-2.

Master-1 and master-2 initialized and their kubelets started. Many more containers began to spawn.

Checked kubectl get nodes:
all 10 nodes are in a Ready state. I'm confused about how 8 nodes can be powered off and still Ready.
Waited two hours.
Same situation: 2 nodes powered on, 10 "Ready".

Powered on master-3.
It seems to come up, the kube-apiserver keeps responding, and kubectl has shown 10 Ready nodes all day.
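
As far as I understand, a node's Ready condition is only flipped to Unknown by the node-lifecycle controller in kube-controller-manager, so with the controller-manager unhealthy the stale Ready status from before the outage just sticks. A way to check whether the heartbeats behind those conditions are actually fresh (assuming working kubectl access; <node-name> is a placeholder for one of the powered-off workers):

kubectl get leases -n kube-node-lease
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")].lastHeartbeatTime}'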

kubectl get --raw='/readyz?verbose'
[+]ping ok
[+]log ok
[+]etcd ok
[+]etcd-readiness ok
[+]kms-providers ok
[+]informer-sync ok
[+]poststarthook/start-encryption-provider-config-automatic-reload ok
[+]poststarthook/start-kube-apiserver-admission-initializer ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]poststarthook/priority-and-fairness-config-consumer ok
[+]poststarthook/priority-and-fairness-filter ok
[+]poststarthook/storage-object-count-tracker-hook ok
[+]poststarthook/start-apiextensions-informers ok
[+]poststarthook/start-apiextensions-controllers ok
[+]poststarthook/crd-informer-synced ok
[+]poststarthook/start-service-ip-repair-controllers ok
[+]poststarthook/rbac/bootstrap-roles ok
[+]poststarthook/scheduling/bootstrap-system-priority-classes ok
[+]poststarthook/priority-and-fairness-config-producer ok
[+]poststarthook/start-system-namespaces-controller ok
[+]poststarthook/bootstrap-controller ok
[+]poststarthook/start-cluster-authentication-info-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-controller ok
[+]poststarthook/start-kube-apiserver-identity-lease-garbage-collector ok
[+]poststarthook/start-legacy-token-tracking-controller ok
[+]poststarthook/aggregator-reload-proxy-client-cert ok
[+]poststarthook/start-kube-aggregator-informers ok
[+]poststarthook/apiservice-registration-controller ok
[+]poststarthook/apiservice-status-available-controller ok
[+]poststarthook/kube-apiserver-autoregistration ok
[+]autoregister-completion ok
[+]poststarthook/apiservice-openapi-controller ok
[+]poststarthook/apiservice-openapiv3-controller ok
[+]poststarthook/apiservice-discovery-controller ok
[+]shutdown ok
readyz check passed

I can't run k9s:
k9s
Error: [list watch] access denied on resource "default":"v1/pods"
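
The k9s error may just be a kubeconfig/RBAC problem on my side rather than the cluster itself, so a quick sanity check:

kubectl config current-context
kubectl auth can-i list pods -n default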

Nodes are reporting incorrectly.
Pods still won't delete or schedule.
Pods all report Running when they are in fact on powered-off nodes.
kube-scheduler and kube-controller-manager on master-3 are constantly in CrashLoopBackOff.

master-3 kube-scheduler logs:
E0419 20:18:33.716881       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
I0419 20:18:37.800008       1 leaderelection.go:260] successfully acquired lease kube-system/kube-scheduler
E0419 20:18:46.833219       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0419 20:18:51.830278       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": context deadline exceeded
I0419 20:18:51.830734       1 leaderelection.go:285] failed to renew lease kube-system/kube-scheduler: timed out waiting for the condition
E0419 20:18:51.844243       1 server.go:242] "Leaderelection lost"
master-2 kube-scheduler logs:
0419 20:22:08.320200       1 leaderelection.go:369] Failed to update lock: the server was unable to return a response in the time allotted, but may still be processing the request (put leases.coordination.k8s.io kube-scheduler)
E0419 20:22:15.485692       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0419 20:22:22.823885       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0419 20:22:31.019808       1 leaderelection.go:369] Failed to update lock: Put "https://127.0.0.1:6443/apis/coordination.k8s.io/v1/namespaces/kube-system/leases/kube-scheduler?timeout=5s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
E0419 20:22:40.294291       1 leaderelection.go:369] Failed to update lock: the server was unable to return a response in the time allotted, but may still be processing the request (put leases.coordination.k8s.io kube-scheduler)
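
All of those leader-election failures are 5-second timeouts on writes to the local apiserver at 127.0.0.1:6443, so the apiserver is accepting connections but responding too slowly. A rough way to reproduce the same slow call from a control-plane node (assuming kubectl is configured there):

time kubectl -n kube-system get lease kube-scheduler -o yaml --request-timeout=5s
time kubectl get --raw='/readyz' --request-timeout=5s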

etcdctl --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint status --write-out table --endpoints=$ETCDCTL_ENDPOINTS
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|        ENDPOINT         |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ip1-redacted | id1-redacted |   3.5.9 |   76 MB |     false |      false |         8 |   49748961 |           49748961 |        |
| ip2-redacted | id2-redacted |   3.5.9 |   76 MB |      true |      false |         8 |   49748961 |           49748961 |        |
| ip3-redacted | id3-redacted |   3.5.9 |   76 MB |      false |      false |         8 |   49748961 |           49748961 |        |
+-------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
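
Since the members all agree on the raft index, the next things I can check with the same certificates are basic endpoint health and any standing alarms (e.g. NOSPACE):

etcdctl --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key endpoint health --endpoints=$ETCDCTL_ENDPOINTS
etcdctl --cacert=/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt --cert=/var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key=/var/lib/rancher/rke2/server/tls/etcd/server-client.key alarm list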

I don't know where else to look.

rnic92 commented Apr 19, 2024

k get po -n kube-system
NAME                                                    READY   STATUS             RESTARTS         AGE
cloud-controller-manager-ndsv01.muos.lab                1/1     Running            40 (93m ago)     38d
cloud-controller-manager-ndsv02.muos.lab                1/1     Running            3 (92m ago)      4h26m
cloud-controller-manager-ndsv03.muos.lab                0/1     CrashLoopBackOff   20 (2m14s ago)   38d
etcd-ndsv01.muos.lab                                    1/1     Running            2                38d
etcd-ndsv02.muos.lab                                    1/1     Running            1                4h26m
etcd-ndsv03.muos.lab                                    1/1     Running            2                38d
istio-cni-node-6wzcc                                    1/1     Running            1 (9h ago)       38d
istio-cni-node-885mh                                    1/1     Running            0                38d
istio-cni-node-cft4m                                    1/1     Running            1 (9h ago)       38d
istio-cni-node-g7chm                                    1/1     Running            1 (9h ago)       38d
istio-cni-node-hbhzq                                    1/1     Running            0                38d
istio-cni-node-nxxht                                    1/1     Running            1 (93m ago)      4h26m
istio-cni-node-qd5h8                                    1/1     Running            2 (48m ago)      38d
istio-cni-node-slww5                                    1/1     Running            1 (9h ago)       22d
istio-cni-node-t2d2s                                    1/1     Running            2 (135m ago)     22d
istio-cni-node-vk2mh                                    1/1     Running            1 (9h ago)       38d
kube-apiserver-ndsv01.muos.lab                          1/1     Running            5 (135m ago)     8h
kube-apiserver-ndsv02.muos.lab                          1/1     Running            1                4h25m
kube-apiserver-ndsv03.muos.lab                          1/1     Running            3                8h
kube-controller-manager-ndsv01.muos.lab                 1/1     Running            37 (94m ago)     38d
kube-controller-manager-ndsv02.muos.lab                 1/1     Running            3 (92m ago)      4h25m
kube-controller-manager-ndsv03.muos.lab                 0/1     CrashLoopBackOff   19 (3m31s ago)   38d
kube-proxy-ndar03.muos.lab                              1/1     Running            0                22d
kube-proxy-ndar04.muos.lab                              1/1     Running            0                22d
kube-proxy-ndsv01.muos.lab                              1/1     Running            1 (135m ago)     9h
kube-proxy-ndsv02.muos.lab                              1/1     Running            0                91m
kube-proxy-ndsv03.muos.lab                              1/1     Running            0                47m
kube-proxy-ndsv04.muos.lab                              1/1     Running            0                9h
kube-proxy-ndsv05.muos.lab                              1/1     Running            0                9h
kube-proxy-ndsv06.muos.lab                              1/1     Running            0                9h
kube-proxy-ndsv07.muos.lab                              1/1     Running            0                9h
kube-proxy-ndsv08.muos.lab                              1/1     Running            0                9h
kube-scheduler-ndsv01.muos.lab                          1/1     Running            5 (135m ago)     38d
kube-scheduler-ndsv02.muos.lab                          1/1     Running            2 (92m ago)      4h25m
kube-scheduler-ndsv03.muos.lab                          0/1     CrashLoopBackOff   18 (85s ago)     38d
rke2-canal-225vn                                        2/2     Running            2 (93m ago)      4h26m
rke2-canal-4snmm                                        2/2     Running            4 (48m ago)      38d
rke2-canal-78xl5                                        2/2     Running            4 (135m ago)     38d
rke2-canal-8f9sp                                        2/2     Running            2 (9h ago)       38d
rke2-canal-f2qd2                                        2/2     Running            2 (9h ago)       38d
rke2-canal-ksxm4                                        2/2     Running            0                38d
rke2-canal-qqtf9                                        2/2     Running            2 (9h ago)       38d
rke2-canal-t4fhv                                        2/2     Running            2 (9h ago)       38d
rke2-canal-ttm28                                        2/2     Running            2 (9h ago)       22d
rke2-canal-z98fz                                        2/2     Running            0                22d
rke2-coredns-rke2-coredns-9849d5ddb-6mgcw               1/1     Running            0                6h27m
rke2-coredns-rke2-coredns-9849d5ddb-9pzhr               1/1     Running            1 (9h ago)       36d
rke2-coredns-rke2-coredns-9849d5ddb-jzgdg               1/1     Running            2 (135m ago)     22d
rke2-coredns-rke2-coredns-9849d5ddb-x8g4n               1/1     Running            2 (48m ago)      38d
rke2-coredns-rke2-coredns-autoscaler-64b867c686-pr6g7   1/1     Running            2 (48m ago)      36d
rke2-metrics-server-544c8c66fc-28lrc                    1/1     Running            4 (47m ago)      36d
rke2-multus-46kml                                       1/1     Running            0                38d
rke2-multus-5p7fh                                       1/1     Running            1 (9h ago)       38d
rke2-multus-6wtmx                                       1/1     Running            3 (9h ago)       38d
rke2-multus-9jp5w                                       1/1     Running            2 (9h ago)       38d
rke2-multus-bxlbc                                       1/1     Running            2 (48m ago)      38d
rke2-multus-mh42p                                       1/1     Running            6 (9h ago)       38d
rke2-multus-qt4j2                                       1/1     Running            1 (93m ago)      4h26m
rke2-multus-v5d42                                       1/1     Running            0                22d
rke2-multus-wlpvc                                       1/1     Running            1 (9h ago)       38d
rke2-multus-wqvxx                                       1/1     Running            2 (135m ago)     38d
rke2-snapshot-controller-59cc9cd8f4-nb2bz               0/1     CrashLoopBackOff   54 (85s ago)     36d
rke2-snapshot-validation-webhook-54c5989b65-5xdps       1/1     Running            3 (47m ago)      36d
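
To get the exit reason for the crash-looping control-plane pods, the previous-container logs should help (using the pod names from the listing above):

kubectl logs -n kube-system kube-controller-manager-ndsv03.muos.lab --previous
kubectl logs -n kube-system kube-scheduler-ndsv03.muos.lab --previous
kubectl logs -n kube-system rke2-snapshot-controller-59cc9cd8f4-nb2bz --previous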
