Unable to scale an RKE1 cluster from 1 to 3 etcd nodes or nodes with etcd role #43356

Open
susesgartner opened this issue Oct 31, 2023 · 4 comments
Labels: area/provisioning-rke1, kind/bug, release-note, status/release-note-added, team/hostbusters

Comments

@susesgartner
Contributor

susesgartner commented Oct 31, 2023

Rancher Server Setup

  • Rancher version: 2.8.0-rc3
  • Installation option: Helm

Information about the Cluster

  • Kubernetes version: 1.27
  • Cluster Type: Downstream EC2 cluster
  • Node Setup: 1 node, all roles

User Information

  • User role: Admin

Describe the bug
Cluster hangs when attempting to scale from 1 to 3 nodes.

To Reproduce

  1. Create an RKE1 cluster with 1 node all roles
  2. Wait for the cluster to finish provisioning
  3. Edit config and scale the cluster to 3 nodes all roles

Result
Cluster hangs with nodes in the registering state (I left them in that state overnight).

Expected Result
Cluster scales properly with no issues.

Screenshots
(screenshot)
Provisioning log for one of those downstream clusters:
(screenshot)

Additional context
I was unable to reproduce this issue on a fresh Rancher install until I created/deleted several clusters.

@susesgartner susesgartner added kind/bug Issues that are defects reported by users or that we know have reached a real release status/release-blocker labels Oct 31, 2023
@susesgartner susesgartner added this to the v2.8.x milestone Oct 31, 2023
@snasovich snasovich added the team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support label Oct 31, 2023
@snasovich snasovich modified the milestones: v2.8.x, v2.8.0 Oct 31, 2023
@sowmyav27 sowmyav27 changed the title Cluster hangs when trying to scale an RKE1 cluster from 1 to 3 nodes Unable to scale an RKE1 cluster from 1 to 3 etcd nodes or nodes with etcd role Nov 1, 2023
@sowmyav27
Contributor

sowmyav27 commented Nov 1, 2023

Validated a few use cases

2.8-head - 9bf6631

On 2.8-head - 9bf6631, I was able to reproduce on the first attempt on a fresh install of Rancher.

Use cases that failed:

  • 1 etcd, 1 cp and 1 worker node --> After cluster is active, Scale up etcd to 3.
  • 1 node all roles --> After cluster is active, scale up to 3 nodes all roles
  • 3 nodes all roles --> After cluster is active, scale up to 5 nodes all roles
  • Warning in Rancher logs:
W1101 04:41:27.201143      38 logging.go:59] [core] [Channel #229 SubChannel #230] grpc: addrConn.createTransport failed to connect to {Addr: "<>:2379", ServerName: "<>", }. Err: connection error: desc = "transport: authentication handshake failed: dial tcp <>:2379: connect: connection refused"
  • Provisioning logs from Rancher:
[INFO ] [/etc/kubernetes/audit-policy.yaml] Successfully deployed audit policy file to Cluster control nodes
[INFO ] [reconcile] Reconciling cluster state
[INFO ] [reconcile] Check etcd hosts to be deleted
[INFO ] [reconcile] Check etcd hosts to be added

Stops after this ^

Repeated scenario 2 with k8s v1.26.8 (the k8s version in 2.7.5 and 2.7.9); I see the errors below, but the cluster eventually came up in less than 10 minutes.

  • Cluster has an error: Failed to reconcile etcd plane: Failed to add etcd member [etcd-sowmya-final-3] to etcd cluster
  • Rancher logs:
2023/11/01 05:36:43 [INFO] cluster [c-qhfrr] provisioning: [add/etcd] Adding member [etcd-sowmya-final-3] to etcd cluster
{"level":"warn","ts":"2023-11-01T05:36:44.401009Z","logger":"etcd-client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00f5c1340/172.31.37.245:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
{"level":"warn","ts":"2023-11-01T05:36:45.229915Z","logger":"etcd-client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00f5c1880/172.31.38.13:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}

On 2.7.5

Use cases that passed (these failed on 2.8-head; all 3 clusters on the same k8s version, 1.26.8):

  • 1 etcd, 1 cp and 1 worker node --> After cluster is active, Scale up etcd to 3.
  • 1 node all roles --> After cluster is active, scale up to 3 nodes all roles
  • 3 nodes all roles --> After cluster is active, scale up to 5 nodes all roles

For the 2nd scenario ^, there were transient error messages, but the cluster eventually came up Active.

  • Errors/warnings seen:
{"level":"warn","ts":"2023-11-01T04:49:59.943Z","logger":"etcd-client","caller":"v3@v3.5.4/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001173c00/172.31.40.67:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}

Failed to reconcile etcd plane: Failed to add etcd member [etcd-sowmya-279-two-3] to etcd cluster

  • And
[Failed to create [rke-log-cleaner] container on host [35.90.235.132]: Failed to create Docker container [rke-log-cleaner] on host [35.90.235.132]: Error response from daemon: Conflict. The container name "/rke-log-cleaner" is already in use by container "b7d8e4cc608c0014612dd3ba24fdb9b62a1cc97ae24d5fb4c5f7e22029beb879". You have to remove (or rename) that container to be able to reuse that name.]

On v2.7.9

Use cases that passed (all 3 clusters, same k8s version, 1.26.8):

  • 1 etcd, 1 cp and 1 worker node --> After cluster is active, Scale up etcd to 3.
  • 3 nodes all roles --> After cluster is active, scale up to 5 nodes all roles

There were transient errors, but the cluster eventually came up Active (in less than 10 minutes).

  • Errors for these ^ scenarios
  • Failed to reconcile etcd plane: Failed to add etcd member [etcd-sowmya-test-issue-2] to etcd cluster
  • And
  • [Error response from daemon: removal of container rke-log-cleaner is already in progress]

However, the 2nd scenario from 2.7.5/2.8-head also failed on 2.7.9.

  • 1 node all roles --> After cluster is active, scale up to 3 nodes all roles --> Failed
  • Rancher logs:
2023-11-01T05:30:46.601158191Z 2023/11/01 05:30:46 [INFO] cluster [c-59wbb] provisioning: [add/etcd] Adding member [etcd-sowmya-test-issue-2] to etcd cluster
2023-11-01T05:30:46.627550715Z {"level":"warn","ts":"2023-11-01T05:30:46.627Z","logger":"etcd-client","caller":"v3@v3.5.4/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001565340/172.31.3.60:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
{"level":"warn","ts":"2023-11-01T05:30:46.640Z","logger":"etcd-client","caller":"v3@v3.5.4/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc004eab340/172.31.0.18:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
2023-11-01T05:30:46.641671534Z 2023/11/01 05:30:46 [INFO] kontainerdriver rancherkubernetesengine stopped

@snasovich
Collaborator

There were a lot of conversations on this issue internally, and the TL;DR is that this issue was reproduced with multiple RKE1 versions on Rancher versions going back to at least 2.7.5 (and it is likely applicable to much earlier versions).

As such, it's no longer considered a release blocker for 2.8.0 since we're past code freeze for that release. The focus of the engineering team is to come up with recommendations for both preventative and reactive workarounds for this issue and to call them out in the release notes.

FYI @Jono-SUSE-Rancher
Heads-up @rancher/docs for adding this to release notes.

@kinarashah
Member

kinarashah commented Nov 3, 2023

Issue
Scaling up etcd nodes fails and the cluster hangs in the waiting state. Nodes are stuck waiting to register with Kubernetes.

Affected Versions
The issue is not always reproducible.
RKE - v1.3.3
Rancher - v2.6.3

Root cause
RKE checks the new etcd node's membership by looking at the peerURLs of the other etcd nodes, which can fail if it selects another etcd node that has not been added yet. The issue is specific to Kubernetes >= 1.22, which uses the etcd v3 client: the v2 client errors out instantly, whereas the v3 client keeps retrying indefinitely.

v2:

Failed to Add etcd member [xxx-xx-xx-xx] from host: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp xxx.xx.xx.xx:2379: connect: connection refused

v3:

[INFO] cluster [c-wkp4m] provisioning: [reconcile] Check etcd hosts to be added
[core] [Channel #25 SubChannel #26] grpc: addrConn.createTransport failed to connect to {Addr: "xxx.xx.xx.xx:2379", ServerName: "xxx.xx.xx.xx", }. Err: connection error: desc = "transport: authentication handshake failed: dial tcp xxx.xx.xx.xx:2379: connect: connection refused"
[core] [Channel #25 SubChannel #26] grpc: addrConn.createTransport failed to connect to {Addr: "xxx.xx.xx.xx:2379", ServerName: "xxx.xx.xx.xx", }. Err: connection error: desc = "transport: authentication handshake failed: dial tcp xxx.xx.xx.xx:2379: connect: connection refused"
[core] [Channel #25 SubChannel #26] grpc: addrConn.createTransport failed to connect to {Addr: "xxx.xx.xx.xx:2379", ServerName: "xxx.xx.xx.xx", }. Err: connection error: desc = "transport: authentication handshake failed: dial tcp xxx.xx.xx.xx:2379: connect: connection refused"
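
For illustration, here is a minimal Go sketch of the v3-client behavior described above, written against the upstream etcd clientv3 package. It is not RKE's actual code, and the endpoint, peer URL, and timeouts are made-up values: if the chosen endpoint belongs to an etcd node that is not actually serving yet, the gRPC dial keeps hitting "connection refused" and the client retries instead of failing fast, which matches the hung "Check etcd hosts to be added" step.

package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    // Hypothetical addresses for illustration only.
    endpoint := "https://172.31.0.10:2379"   // etcd member chosen for the membership call
    newPeerURL := "https://172.31.0.20:2380" // peer URL of the node being added

    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{endpoint},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        fmt.Println("client create failed:", err)
        return
    }
    defer cli.Close()

    // Bounding the call with a deadline turns "retries forever" into an error
    // the caller can surface and retry later; without it, the gRPC layer keeps
    // retrying refused connections as shown in the v3 log lines above.
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    if _, err := cli.MemberAdd(ctx, []string{newPeerURL}); err != nil {
        fmt.Println("member add failed:", err)
    }
}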

Workaround
Active cluster:

  • Add one etcd node, wait for the cluster to be active and then repeat the process

Cluster stuck in waiting state:

  • Delete the stuck etcd nodes
  • Restart the Rancher leader pod; this step is required to terminate the gRPC goroutine. Find the leader pod by looking for leaderIdentity in the cattle-controllers configmap: kubectl -n kube-system get configmap cattle-controllers (see the command sketch below)
  • Wait for the stuck etcd nodes to be removed
  • Add one etcd node, wait for the cluster to be active and then repeat the process

Note: an etcd restore does not work as a workaround; restarting Rancher is required to terminate the hung request, and then adding nodes one by one works.
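
A minimal command sketch of the stuck-cluster workaround above, assuming Rancher was installed via Helm so its pods run in the cattle-system namespace of the local cluster; the grep pattern and the <leader-pod-name> placeholder are illustrative, not exact output:

# Find the current Rancher leader recorded in the cattle-controllers configmap
# (the leader identity shows up in the configmap's leader-election annotation).
kubectl -n kube-system get configmap cattle-controllers -o yaml | grep -i identity

# Restart the leader by deleting that pod; this terminates the hung gRPC request.
kubectl -n cattle-system delete pod <leader-pod-name>

# Then add etcd nodes back one at a time, waiting for the cluster to go Active
# between additions.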

@kinarashah
Member

Have a draft PR up, but need to sync changes from the v1.5 branch to the v1.6 branch first before opening the final PR for the v1.6 branch, since this issue is for 2.9-Next: rancher/rke#3536

@jiaqiluo jiaqiluo removed their assignment Mar 26, 2024