
RKE2 fails to start if etcd member list cannot be retrieved during initial join #5557

Closed
mynktl opened this issue Mar 6, 2024 · 9 comments

Comments

mynktl commented Mar 6, 2024

Environmental Info:
RKE2 Version:
rke2 version v1.26.11+rke2r1 (7ee1cfc)
go version go1.20.11 X:boringcrypto

Node(s) CPU architecture, OS, and Version:
CPU: 32
OS: RHEL 8.6
Kernel: 4.18.0-372.91.1.el8_6.x86_64

Cluster Configuration: 3 server

Describe the bug:
RKE2 fails to start on one of the nodes. It runs for roughly 15 minutes and then fails.

Steps To Reproduce:

  • Create a 3-node cluster

  • Load the agent/images directory with a few images

  • Start RKE2 server

  • Installed RKE2:

Expected behavior:
RKE2 should be able to boot up and join the other nodes.

Actual behavior:
RKE2 does not boot up.

Additional context / logs:
rke2.txt

mynktl (Author) commented Mar 6, 2024

We are adding a total of 27 images to the agent/images directory, which get seeded into containerd on RKE2 boot-up.
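Roughly, the images are staged like this (a sketch; the directory is the standard RKE2 agent images location, and the tarball names are placeholders):

# Stage airgap image archives where RKE2 picks them up on startup.
sudo mkdir -p /var/lib/rancher/rke2/agent/images/
sudo cp ./airgap-images/*.tar /var/lib/rancher/rke2/agent/images/
# On the next start, rke2-server imports every archive in this directory
# into containerd before starting the kubelet.
sudo systemctl start rke2-server.service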

A comment from @brandond was marked as outdated.

A comment from @brandond was marked as off-topic.

@brandond brandond added this to the v1.29.3+rke2r1 milestone Mar 6, 2024
@brandond brandond changed the title RKE2 boot-up become flaky if agent have more images RKE2 fails to start if airgap images take more than 60 seconds to import Mar 6, 2024
brandond (Member) commented Mar 6, 2024

Actually I'm wrong - v1.26.11+rke2r1 is from December, and we didn't make the executor changes until February, in v1.26.14+rke2r1 - so that would not be related.

brandond (Member) commented Mar 6, 2024

The problem is not that the images aren't getting imported quickly enough. On the first startup, RKE2 starts importing images at Mar 05 03:37:56, and finishes importing all of the images by Mar 05 03:42:47 when the kubelet is started.

The root cause here is that the joining node is not able to get the current etcd member list from an existing cluster member:

Mar 05 03:42:50 server2 rke2[22013]: {"level":"warn","ts":"2024-03-05T03:42:50.057248Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0006d8a80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: connect: connection refused\""}
Mar 05 03:42:50 server2 rke2[22013]: time="2024-03-05T03:42:50Z" level=warning msg="Failed to get etcd MemberList for 4.246.140.77:59850: context deadline exceeded"
Mar 05 03:42:50 server2 rke2[22013]: time="2024-03-05T03:42:50Z" level=info msg="Starting etcd to join cluster with members [server2-34e005e7=https://10.0.1.9:2380]"

This results in etcd joining with a bad initial cluster member list, which prevents it from starting successfully on that startup, or on any subsequent start. This causes a cascading failure where the apiserver never comes up due to etcd not being up. The fact that it's still importing images when the apiserver failure is reported is a red herring.

Please make sure that the etcd ports are open between all the cluster members, and that the nodes can reach each other via their private IP addresses.
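As a quick sketch of how to verify this from the joining node (10.0.1.9 is the existing member's private IP from the log above; the ports are etcd client/peer, supervisor, and apiserver):

# Check that the existing server's RKE2/etcd ports are reachable from this node.
for port in 2379 2380 9345 6443; do
  timeout 3 bash -c "</dev/tcp/10.0.1.9/$port" && echo "port $port open" || echo "port $port closed/filtered"
done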

@brandond brandond closed this as completed Mar 6, 2024
brandond (Member) commented Mar 6, 2024

The fact that the Failed to get etcd MemberList for 4.246.140.77:59850 error is printed on this node suggests that it is attempting to get the member list from ITSELF, instead of from an existing cluster member. I see that this node is configured to join using https://sfdev5277747-cluster.infra-sf-ea.infra.uipath-dev.com:9345 as the server address. Is this perhaps an external load-balancer that includes this server in the backend pool? If you're using an external load-balancer as the fixed registration endpoint, you MUST ensure that the load-balancer does not send requests to pool members until the member is healthy. Otherwise you'll end up with cases like this, where it is trying to join itself, and gets stuck.

@brandond brandond changed the title RKE2 fails to start if airgap images take more than 60 seconds to import RKE2 fails to start if etcd member list cannot be retrieved during initial join Mar 6, 2024
rajivml commented Mar 8, 2024

Thanks @brandond, appreciate your support!

mynktl (Author) commented Mar 8, 2024

Hi @brandond
Thanks for looking into this.
We do have a health probe defined for port 9345 at the load-balancer to route traffic destined for port 9345. It seems it is forwarding the request back to the same node, since RKE2 brings this port up early. Should we change the health probe to use a different port, i.e. the etcd one?

brandond (Member) commented Mar 8, 2024

Availability of the supervisor port (9345) is not a sufficient health check. RKE2 listens on this port regardless of whether or not the node is ready. If all of your servers are control-plane+etcd, you could probably use reachability of ports 2379, 6443, and 9345 as an indicator that all components are running, if not healthy. For a proper health check you can make an authenticated request to the rke2 readyz endpoint:

curl -ksf https://node:token@172.17.0.8:9345/v1-rke2/readyz

Where token is the agent join token for this cluster.
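For example, a minimal probe script along those lines (a sketch: it assumes it runs on a server node where the join token is readable at the default RKE2 location, and it reuses the node IP from the curl above):

#!/bin/sh
# Read the cluster join token and query the rke2 readyz endpoint,
# mirroring the curl command above but with -u instead of URL-embedded credentials.
TOKEN=$(cat /var/lib/rancher/rke2/server/node-token)
if curl -ksf -u "node:${TOKEN}" "https://172.17.0.8:9345/v1-rke2/readyz" >/dev/null; then
  echo "ready"
else
  echo "not ready"
  exit 1
fi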
