Unable to scale an RKE1 cluster from 1 to 3 etcd nodes or nodes with etcd role #43356

Open
susesgartner opened this issue Oct 31, 2023 · 4 comments
Labels: area/provisioning-rke1, kind/bug, release-note, status/release-note-added, team/hostbusters

Comments

@susesgartner
Contributor

susesgartner commented Oct 31, 2023

Rancher Server Setup

  • Rancher version: 2.8.0-rc3
  • Installation option: Helm

Information about the Cluster

  • Kubernetes version: 1.27
  • Cluster Type: Downstream EC2 cluster
  • Node Setup: 1 node, all roles

User Information

  • User role: Admin

Describe the bug
Cluster hangs when attempting to scale from 1 to 3 nodes.

To Reproduce

  1. Create an RKE1 cluster with 1 node all roles
  2. Wait for the cluster to finish provisioning
  3. Edit config and scale the cluster to 3 nodes all roles

Result
Cluster hangs with nodes in the registering state (I left them in that state overnight).

Expected Result
Cluster scales properly with no issues.

Screenshots
(screenshot)
Provisioning log for one of those downstream clusters:
(screenshot)

Additional context
I was unable to reproduce this issue on a fresh Rancher install until I created/deleted several clusters.

@susesgartner susesgartner added kind/bug Issues that are defects reported by users or that we know have reached a real release status/release-blocker labels Oct 31, 2023
@susesgartner susesgartner added this to the v2.8.x milestone Oct 31, 2023
@snasovich snasovich added the team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support label Oct 31, 2023
@snasovich snasovich modified the milestones: v2.8.x, v2.8.0 Oct 31, 2023
@sowmyav27 sowmyav27 changed the title Cluster hangs when trying to scale an RKE1 cluster from 1 to 3 nodes Unable to scale an RKE1 cluster from 1 to 3 etcd nodes or nodes with etcd role Nov 1, 2023
@sowmyav27
Contributor

sowmyav27 commented Nov 1, 2023

Validated a few use cases

2.8-head - 9bf6631

On 2.8-head - 9bf6631, I was able to reproduce on the first attempt on a fresh install of Rancher.

Use cases that failed:

  • 1 etcd, 1 cp and 1 worker node --> After cluster is active, Scale up etcd to 3.
  • 1 node all roles --> After cluster is active, scale up to 3 nodes all roles
  • 3 nodes all roles --> After cluster is active, scale up to 5 nodes all roles
  • Warning in Rancher logs:
W1101 04:41:27.201143      38 logging.go:59] [core] [Channel #229 SubChannel #230] grpc: addrConn.createTransport failed to connect to {Addr: "<>:2379", ServerName: "<>", }. Err: connection error: desc = "transport: authentication handshake failed: dial tcp <>:2379: connect: connection refused"
  • Provisioning logs from Rancher:
[INFO ] [/etc/kubernetes/audit-policy.yaml] Successfully deployed audit policy file to Cluster control nodes
[INFO ] [reconcile] Reconciling cluster state
[INFO ] [reconcile] Check etcd hosts to be deleted
[INFO ] [reconcile] Check etcd hosts to be added

Stops after this ^

Repeated scenario 2 with k8s v1.26.8 (the k8s version in 2.7.5 and 2.7.9); I see the errors below, but the cluster eventually came up in less than 10 minutes.

  • Cluster has an error: Failed to reconcile etcd plane: Failed to add etcd member [etcd-sowmya-final-3] to etcd cluster
  • Rancher logs:
2023/11/01 05:36:43 [INFO] cluster [c-qhfrr] provisioning: [add/etcd] Adding member [etcd-sowmya-final-3] to etcd cluster
{"level":"warn","ts":"2023-11-01T05:36:44.401009Z","logger":"etcd-client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00f5c1340/172.31.37.245:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
{"level":"warn","ts":"2023-11-01T05:36:45.229915Z","logger":"etcd-client","caller":"v3@v3.5.9/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc00f5c1880/172.31.38.13:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}

On 2.7.5

Use cases that passed (these failed on 2.8-head; all 3 clusters on the same k8s version, 1.26.8):

  • 1 etcd, 1 cp and 1 worker node --> After cluster is active, Scale up etcd to 3.
  • 1 node all roles --> After cluster is active, scale up to 3 nodes all roles
  • 3 nodes all roles --> After cluster is active, scale up to 5 nodes all roles

For the 2nd scenario ^, there were transient error messages, but the cluster eventually came up Active.

  • Errors/warnings seen:
{"level":"warn","ts":"2023-11-01T04:49:59.943Z","logger":"etcd-client","caller":"v3@v3.5.4/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001173c00/172.31.40.67:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}

Failed to reconcile etcd plane: Failed to add etcd member [etcd-sowmya-279-two-3] to etcd cluster

  • And
[Failed to create [rke-log-cleaner] container on host [35.90.235.132]: Failed to create Docker container [rke-log-cleaner] on host [35.90.235.132]: Error response from daemon: Conflict. The container name "/rke-log-cleaner" is already in use by container "b7d8e4cc608c0014612dd3ba24fdb9b62a1cc97ae24d5fb4c5f7e22029beb879". You have to remove (or rename) that container to be able to reuse that name.]

On v2.7.9

Use cases that passed (all 3 clusters, same k8s version, 1.26.8):

  • 1 etcd, 1 cp and 1 worker node --> After cluster is active, Scale up etcd to 3.
  • 3 nodes all roles --> After cluster is active, scale up to 5 nodes all roles

There were transient errors, but the cluster eventually came up Active (in less than 10 minutes).

  • Errors for these ^ scenarios
  • Failed to reconcile etcd plane: Failed to add etcd member [etcd-sowmya-test-issue-2] to etcd cluster
  • And
  • [Error response from daemon: removal of container rke-log-cleaner is already in progress]

However, the 2nd scenario from 2.7.5/2.8-head also failed on 2.7.9.

  • 1 node all roles --> After cluster is active, scale up to 3 nodes all roles --> Failed
  • Rancher logs:
2023-11-01T05:30:46.601158191Z 2023/11/01 05:30:46 [INFO] cluster [c-59wbb] provisioning: [add/etcd] Adding member [etcd-sowmya-test-issue-2] to etcd cluster
2023-11-01T05:30:46.627550715Z {"level":"warn","ts":"2023-11-01T05:30:46.627Z","logger":"etcd-client","caller":"v3@v3.5.4/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc001565340/172.31.3.60:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
{"level":"warn","ts":"2023-11-01T05:30:46.640Z","logger":"etcd-client","caller":"v3@v3.5.4/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc004eab340/172.31.0.18:2379","attempt":0,"error":"rpc error: code = Unavailable desc = etcdserver: unhealthy cluster"}
2023-11-01T05:30:46.641671534Z 2023/11/01 05:30:46 [INFO] kontainerdriver rancherkubernetesengine stopped

@snasovich
Collaborator

There were a lot of conversations on this issue internally, and the TL;DR is that this issue was reproduced with multiple RKE1 versions on Rancher versions going back to at least 2.7.5 (and it is likely applicable to much earlier versions).

As such, it's no longer considered a release blocker for 2.8.0 since we're past code freeze for that release. The focus of the engineering team is to come up with recommendations for both preventative and reactive workarounds for this issue and to call them out in the release notes.

FYI @Jono-SUSE-Rancher
Heads-up @rancher/docs for adding this to release notes.

@kinarashah
Member

kinarashah commented Nov 3, 2023

Issue
Scaling up etcd nodes fails and the cluster hangs in the waiting state. Nodes are stuck waiting to register with Kubernetes.

Affected Versions
The issue is not always reproducible.
RKE - v1.3.3
Rancher - v2.6.3

Root cause
RKE checks the new etcd node's membership by looking at the peerURLs of the other etcd nodes, which can fail if it selects another etcd node that has not been added yet. The issue is specific to Kubernetes >= 1.22, which uses the etcd v3 client: the v2 client errors out instantly, whereas the v3 client keeps retrying indefinitely.

v2:

Failed to Add etcd member [xxx-xx-xx-xx] from host: client: etcd cluster is unavailable or misconfigured; error #0: dial tcp xxx.xx.xx.xx:2379: connect: connection refused

v3:

[INFO] cluster [c-wkp4m] provisioning: [reconcile] Check etcd hosts to be added
[core] [Channel #25 SubChannel #26] grpc: addrConn.createTransport failed to connect to {Addr: "xxx.xx.xx.xx:2379", ServerName: "xxx.xx.xx.xx", }. Err: connection error: desc = "transport: authentication handshake failed: dial tcp xxx.xx.xx.xx:2379: connect: connection refused"
[core] [Channel #25 SubChannel #26] grpc: addrConn.createTransport failed to connect to {Addr: "xxx.xx.xx.xx:2379", ServerName: "xxx.xx.xx.xx", }. Err: connection error: desc = "transport: authentication handshake failed: dial tcp xxx.xx.xx.xx:2379: connect: connection refused"
[core] [Channel #25 SubChannel #26] grpc: addrConn.createTransport failed to connect to {Addr: "xxx.xx.xx.xx:2379", ServerName: "xxx.xx.xx.xx", }. Err: connection error: desc = "transport: authentication handshake failed: dial tcp xxx.xx.xx.xx:2379: connect: connection refused"
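
For illustration, here is a minimal Go sketch of the v3-client behavior described above, written against the upstream etcd clientv3 package. It is not RKE's actual code, and the endpoint, peer URL, and timeouts are made-up values: if the chosen endpoint belongs to an etcd node that is not actually serving yet, the gRPC dial keeps hitting "connection refused" and the client retries instead of failing fast, which matches the hung "Check etcd hosts to be added" step.

package main

import (
    "context"
    "fmt"
    "time"

    clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
    // Hypothetical addresses for illustration only.
    endpoint := "https://172.31.0.10:2379"   // etcd member chosen for the membership call
    newPeerURL := "https://172.31.0.20:2380" // peer URL of the node being added

    cli, err := clientv3.New(clientv3.Config{
        Endpoints:   []string{endpoint},
        DialTimeout: 5 * time.Second,
    })
    if err != nil {
        fmt.Println("client create failed:", err)
        return
    }
    defer cli.Close()

    // Bounding the call with a deadline turns "retries forever" into an error
    // the caller can surface and retry later; without it, the gRPC layer keeps
    // retrying refused connections as shown in the v3 log lines above.
    ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
    defer cancel()

    if _, err := cli.MemberAdd(ctx, []string{newPeerURL}); err != nil {
        fmt.Println("member add failed:", err)
    }
}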

Workaround
Active cluster:

  • Add one etcd node, wait for the cluster to be active and then repeat the process

Cluster stuck in waiting state:

  • Delete the stuck etcd nodes
  • Restart the Rancher leader pod; this step is required to terminate the gRPC goroutine. Find the leader pod by looking for leaderIdentity in the cattle-controllers configmap: kubectl -n kube-system get configmap cattle-controllers (see the command sketch below)
  • Wait for the stuck etcd nodes to be removed
  • Add one etcd node, wait for the cluster to be active and then repeat the process

Note: an etcd restore does not work as a workaround; restarting Rancher is required to terminate the hung request, and then adding nodes one by one works.
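
A minimal command sketch of the stuck-cluster workaround above, assuming Rancher was installed via Helm so its pods run in the cattle-system namespace of the local cluster; the grep pattern and the <leader-pod-name> placeholder are illustrative, not exact output:

# Find the current Rancher leader recorded in the cattle-controllers configmap
# (the leader identity shows up in the configmap's leader-election annotation).
kubectl -n kube-system get configmap cattle-controllers -o yaml | grep -i identity

# Restart the leader by deleting that pod; this terminates the hung gRPC request.
kubectl -n cattle-system delete pod <leader-pod-name>

# Then add etcd nodes back one at a time, waiting for the cluster to go Active
# between additions.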

@kinarashah
Member

Have a draft PR up, but need to sync changes from the v1.5 branch to the v1.6 branch first before opening the final PR for the v1.6 branch, since this issue is for 2.9-Next: rancher/rke#3536

@jiaqiluo jiaqiluo removed their assignment Mar 26, 2024