Cluster unrecoverable after every power outage - nodes all say ready (even when off) #3560
RKE2 Version:
v1.29.1+rke2r1
Node(s) CPU architecture, OS, and Version:
10 HP servers, RHEL 8.9
I have a 10-node cluster running v1.29.1+rke2r1:
3 masters
7 workers
I had a power outage last night. When I checked the nodes this morning, everything said Ready, but no pods would schedule and about half of all containers were restarting.
The whole cluster seems to end up in a defunct state after any full cluster outage (this has happened before; that time I rebuilt from scratch).
I powered off all nodes again, and started booting them up one at a time, starting with master-1.
Master-1's kubelet failed to initialize. Containerd was started but was only running 3 containers: scheduler, etcd, and proxy.
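For reference, a minimal diagnostic sketch for a server node in this state (the paths and socket below are RKE2's defaults; adjust if your install differs):

```sh
# List what RKE2's embedded containerd is actually running
# (RKE2 uses its own containerd socket, not the system default)
sudo /var/lib/rancher/rke2/bin/crictl \
  --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps -a

# See why the kubelet did not come up
sudo journalctl -u rke2-server --no-pager --since "1 hour ago"
sudo tail -n 200 /var/lib/rancher/rke2/agent/logs/kubelet.log
```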
Booted master-2.
Master-1 and master-2 initialized and their kubelets started. Many more containers began to spawn.
Checked kubectl get nodes:
All 10 nodes are in a Ready state. Confused how 8 nodes that are powered off still show Ready.
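A sketch of the kind of check that should show whether these heartbeats are stale (assuming kubectl against the surviving API server works; `<node-name>` is a placeholder). A node's Ready condition only flips to Unknown/NotReady when the node lifecycle controller in kube-controller-manager notices that the node's lease has stopped being renewed, so stale conditions point back at the control plane rather than the kubelets:

```sh
# Last heartbeat per node: kubelets renew a Lease in kube-node-lease while alive
kubectl get leases -n kube-node-lease

# Compare the Ready condition's lastHeartbeatTime against the current time
kubectl describe node <node-name> | grep -A 6 Conditions
```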
Waited two hours.
Same situation: 2 nodes powered on, 10 "Ready".
Powered on master-3.
It seems to come up; kube-apiserver is still responding and shows 10 Ready nodes all day.
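A hedged health-check sketch for the control plane at this point (the etcd pod name is a guess derived from the node name; match it against what kubectl actually lists in kube-system):

```sh
# API server's own readiness report, including its etcd check
kubectl get --raw='/readyz?verbose'

# etcd static pods and recent logs (pod names follow etcd-<node-name>)
kubectl -n kube-system get pods -o wide | grep etcd
kubectl -n kube-system logs etcd-master-1 --tail=50
```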
Can't run k9s:
k9s
Error: [list watch] access denied on resource "default":"v1/pods"
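A sketch for separating an RBAC problem from a kubeconfig problem (this assumes k9s and kubectl might be reading different kubeconfigs; the exported path is RKE2's default admin kubeconfig):

```sh
# Can the current kubeconfig's user actually list pods?
kubectl auth can-i list pods --namespace default
kubectl auth can-i --list --namespace default

# Point both tools at RKE2's admin kubeconfig to rule out a stale config
export KUBECONFIG=/etc/rancher/rke2/rke2.yaml
```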
Nodes are reporting incorrectly.
Pods still won't delete or schedule.
Pods are all reporting Running when they are in fact on powered-off nodes.
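A one-liner sketch to cross-check where those "Running" pods are scheduled; the status shown is just the last state each kubelet reported before the outage, and it only changes once the control plane marks the node unreachable:

```sh
# Group pods by node so the ones pinned to powered-off nodes stand out
kubectl get pods -A -o wide --sort-by=.spec.nodeName
```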
My kube-scheduler and kube-controller-manager on master-3 are constantly in CrashLoopBackOff.
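A hedged sketch for pulling the crash logs (static pod names follow <component>-<node-name>, so the master-3 names below are assumptions; the crictl path and socket are RKE2 defaults):

```sh
# Previous-container logs usually hold the actual crash reason
kubectl -n kube-system logs kube-controller-manager-master-3 --previous
kubectl -n kube-system logs kube-scheduler-master-3 --previous

# If the API server is too unhealthy, read the same containers locally on master-3
sudo /var/lib/rancher/rke2/bin/crictl \
  --runtime-endpoint unix:///run/k3s/containerd/containerd.sock \
  ps -a --name kube-controller-manager
```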
Don't know where else to look.