RKE2 fails to start if etcd member list cannot be retrieved during initial join #5557
Comments
We are adding a total of 27 images to the agent directory, which get seeded into containerd when RKE2 boots up.
Actually I'm wrong: v1.26.11+rke2r1 is from December, and we didn't make the executor changes until February, in v1.26.14+rke2r1, so that would not be related.
The problem is not that the images aren't getting imported quickly enough. On the first startup, RKE2 starts importing images at
The root cause here is that the joining node is not able to get the current etcd member list from an existing cluster member:
This results in etcd joining with a bad initial cluster member list, which prevents it from starting successfully on that startup, or on any subsequent start. This causes a cascading failure where the apiserver never comes up due to etcd not being up. The fact that it's still importing images when the apiserver failure is reported is a red herring. Please make sure that the etcd ports are open between all the cluster members, and that the nodes can reach each other via their private IP addresses.
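One way to verify the connectivity advice above is a plain TCP probe from the joining node to each existing member. The following is a minimal sketch, not something taken from RKE2 itself: the function name `unreachable_endpoints` is hypothetical, and ports 2379 and 2380 are assumed here because they are etcd's conventional client and peer ports. Adjust them if your deployment differs.

```python
import socket
from typing import Iterable

# Assumption: etcd's conventional client (2379) and peer (2380) ports.
ETCD_PORTS = (2379, 2380)


def unreachable_endpoints(
    peers: Iterable[str],
    ports: Iterable[int] = ETCD_PORTS,
    timeout: float = 3.0,
) -> list[tuple[str, int]]:
    """Return every (host, port) pair that does not accept a TCP connection."""
    port_list = tuple(ports)
    failures = []
    for host in peers:
        for port in port_list:
            try:
                # A completed TCP handshake only shows the port is open and
                # routable; it says nothing about etcd's own health.
                with socket.create_connection((host, port), timeout=timeout):
                    pass
            except OSError:
                failures.append((host, port))
    return failures


# Example (the private IPs are placeholders for the other cluster members):
# print(unreachable_endpoints(["10.0.0.11", "10.0.0.12"]))
```

Run it from the node that fails to join, passing the private IP addresses of the other servers. An empty list means every probed port accepted a connection; any entries returned point at a firewall or routing problem worth checking first.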
The fact that the
Thanks @brandond, appreciate your support!
Hi @brandond
Availability of the supervisor port (9345) is not a sufficient health check. RKE2 will listen on this port regardless of whether or not the node is ready. If all of your servers are control-plane+etcd, you could probably use reachability of ports 2379, 6443, and 9345 as an indicator that all components are running, if not healthy. For a proper health check you can make an authenticated request to the rke2 readyz endpoint:
Where
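The weaker, port-based indicator mentioned above (components running, not necessarily healthy) could be sketched roughly as follows. The port numbers come from the comment; the component labels and the names `COMPONENT_PORTS`, `listening_components`, and `all_running` are hypothetical and used only for illustration. This does not replace the authenticated readyz request.

```python
import socket
from typing import Mapping, Optional

# Ports named in the comment above; the labels are descriptive assumptions.
COMPONENT_PORTS = {
    "etcd": 2379,
    "kube-apiserver": 6443,
    "rke2-supervisor": 9345,
}


def listening_components(
    host: str,
    ports: Optional[Mapping[str, int]] = None,
    timeout: float = 2.0,
) -> dict[str, bool]:
    """Map each component name to whether its port accepts a TCP connection."""
    targets = COMPONENT_PORTS if ports is None else ports
    status = {}
    for name, port in targets.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                status[name] = True
        except OSError:
            status[name] = False
    return status


def all_running(status: Mapping[str, bool]) -> bool:
    """True only if every probed port is listening ("running", not "healthy")."""
    return all(status.values())
```

Because the supervisor port listens regardless of readiness, a `True` for every entry only suggests that the processes are up. Treat it as a coarse first check on servers that are all control-plane+etcd.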
Environmental Info:
RKE2 Version:
rke2 version v1.26.11+rke2r1 (7ee1cfc)
go version go1.20.11 X:boringcrypto
Node(s) CPU architecture, OS, and Version:
CPU: 32
OS: RHEL 8.6
Kernel: 4.18.0-372.91.1.el8_6.x86_64
Cluster Configuration: 3 server
Describe the bug:
RKE2 fails to boot up on one of the nodes. It runs for ~15 minutes and then fails.
Steps To Reproduce:
Create a 3-node cluster
Load the agent/images directory with a few images
Start RKE2 server
Installed RKE2:
Expected behavior:
RKE2 should be able to boot up and join the other nodes.
Actual behavior:
RKE2 is not booting up
Additional context / logs:
rke2.txt