
RKE2 fails to start if etcd member list cannot be retrieved during initial join #5557

Closed
mynktl opened this issue Mar 6, 2024 · 9 comments

Comments

mynktl commented Mar 6, 2024

Environmental Info:
RKE2 Version:
rke2 version v1.26.11+rke2r1 (7ee1cfc)
go version go1.20.11 X:boringcrypto

Node(s) CPU architecture, OS, and Version:
CPU: 32
OS: RHEL 8.6
Kernel: 4.18.0-372.91.1.el8_6.x86_64

Cluster Configuration: 3 server

Describe the bug:
RKE2 fails to start on one of the nodes. It runs for roughly 15 minutes and then fails.

Steps To Reproduce:

  • Create a 3-node cluster

  • Load the agent/images directory with a few images

  • Start RKE2 server

  • Installed RKE2:

Expected behavior:
RKE2 should be able to boot up and join the other nodes.

Actual behavior:
RKE2 does not boot up.

Additional context / logs:
rke2.txt

mynktl (Author) commented Mar 6, 2024

We are adding a total of 27 images to the agent/images directory, which get seeded into containerd on RKE2 boot-up.
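Roughly, the images are staged like this (a sketch; the directory is the standard RKE2 agent images location, and the tarball names are placeholders):

# Stage airgap image archives where RKE2 picks them up on startup.
sudo mkdir -p /var/lib/rancher/rke2/agent/images/
sudo cp ./airgap-images/*.tar /var/lib/rancher/rke2/agent/images/
# On the next start, rke2-server imports every archive in this directory
# into containerd before starting the kubelet.
sudo systemctl start rke2-server.service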

A comment from @brandond was marked as outdated.

A comment from @brandond was marked as off-topic.

@brandond brandond added this to the v1.29.3+rke2r1 milestone Mar 6, 2024
@brandond brandond changed the title RKE2 boot-up become flaky if agent have more images RKE2 fails to start if airgap images take more than 60 seconds to import Mar 6, 2024
brandond (Member) commented Mar 6, 2024

Actually I'm wrong - v1.26.11+rke2r1 is from December, and we didn't make the executor changes until February, in v1.26.14+rke2r1 - so that would not be related.

brandond (Member) commented Mar 6, 2024

The problem is not that the images aren't getting imported quickly enough. On the first startup, RKE2 starts importing images at Mar 05 03:37:56, and finishes importing all of the images by Mar 05 03:42:47 when the kubelet is started.

The root cause here is that the joining node is not able to get the current etcd member list from an existing cluster member:

Mar 05 03:42:50 server2 rke2[22013]: {"level":"warn","ts":"2024-03-05T03:42:50.057248Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d8a80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Mar 05 03:42:50 server2 rke2[22013]: time="2024-03-05T03:42:50Z" level=warning msg="Failed to get etcd MemberList for 4.246.140.77:59850: context deadline exceeded"
Mar 05 03:42:50 server2 rke2[22013]: time="2024-03-05T03:42:50Z" level=info msg="Starting etcd to join cluster with members [server2-34e005e7=https://10.0.1.9:2380]"

This results in etcd joining with a bad initial cluster member list, which prevents it from starting successfully on that startup, or on any subsequent start. This causes a cascading failure where the apiserver never comes up due to etcd not being up. The fact that it's still importing images when the apiserver failure is reported is a red herring.

Please make sure that the etcd ports are open between all the cluster members, and that the nodes can reach each other via their private IP addresses.
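As a quick sketch of how to verify this from the joining node (10.0.1.9 is the existing member's private IP from the log above; the ports are etcd client/peer, supervisor, and apiserver):

# Check that the existing server's RKE2/etcd ports are reachable from this node.
for port in 2379 2380 9345 6443; do
  timeout 3 bash -c "</dev/tcp/10.0.1.9/$port" && echo "port $port open" || echo "port $port closed/filtered"
done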

@brandond brandond closed this as completed Mar 6, 2024
brandond (Member) commented Mar 6, 2024

The fact that the Failed to get etcd MemberList for 4.246.140.77:59850 error is printed on this node suggests that it is attempting to get the member list from ITSELF, instead of from an existing cluster member. I see that this node is configured to join using https://sfdev5277747-cluster.infra-sf-ea.infra.uipath-dev.com:9345 as the server address. Is this perhaps an external load-balancer that includes this server in the backend pool? If you're using an external load-balancer as the fixed registration endpoint, you MUST ensure that the load-balancer does not send requests to pool members until the member is healthy. Otherwise you'll end up with cases like this, where it is trying to join itself, and gets stuck.

@brandond brandond changed the title RKE2 fails to start if airgap images take more than 60 seconds to import RKE2 fails to start if etcd member list cannot be retrieved during initial join Mar 6, 2024
rajivml commented Mar 8, 2024

Thanks @brandond, appreciate your support!

mynktl (Author) commented Mar 8, 2024

Hi @brandond
Thanks for looking into this.
We do have a health probe defined for port 9345 at the load-balancer to route traffic destined for port 9345. It seems it is forwarding the request back to the same node, since RKE2 brings this port up early. Should we change the health probe to use a different port, i.e. the etcd one?

brandond (Member) commented Mar 8, 2024

Availability of the supervisor port (9345) is not a sufficient health check. RKE2 listens on this port regardless of whether or not the node is ready. If all of your servers are control-plane+etcd, you could probably use reachability of ports 2379, 6443, and 9345 as an indicator that all components are running, if not healthy. For a proper health check you can make an authenticated request to the rke2 readyz endpoint:

curl -ksf https://node:token@172.17.0.8:9345/v1-rke2/readyz

Where token is the agent join token for this cluster.
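For example, a minimal probe script along those lines (a sketch: it assumes it runs on a server node where the join token is readable at the default RKE2 location, and it reuses the node IP from the curl above):

#!/bin/sh
# Read the cluster join token and query the rke2 readyz endpoint,
# mirroring the curl command above but with -u instead of URL-embedded credentials.
TOKEN=$(cat /var/lib/rancher/rke2/server/node-token)
if curl -ksf -u "node:${TOKEN}" "https://172.17.0.8:9345/v1-rke2/readyz" >/dev/null; then
  echo "ready"
else
  echo "not ready"
  exit 1
fi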
