
Cannot provision RKE2 node driver cluster #36939

Closed
sowmyav27 opened this issue Mar 17, 2022 · 9 comments
Assignees
Labels
area/capr/rke2 RKE2 Provisioning issues involving CAPR kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement QA/XS regression release-note Note this issue in the milestone's release notes status/blocker team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support
Milestone

Comments

@sowmyav27
Contributor

sowmyav27 commented Mar 17, 2022

Rancher Server Setup

  • Rancher version: 2.6-head commit id: 5bb08b3
  • Installation option (Docker install/Helm Chart): docker install

Information about the Cluster

  • Kubernetes version: v1.22.7+rke2r2 and v1.23.4+rke2r2
  • Cluster Type (Local/Downstream): Node driver EC2 cluster

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) admin

Describe the bug
Cannot provision RKE2 node driver cluster

To Reproduce

  • Deploy a node driver RKE2 cluster
  • The control plane and worker nodes fail to come up in the cluster.
  • The cluster is stuck in the provisioning state.

[Screenshot: Screen Shot 2022-03-17 at 12 03 21 PM]

  • journalctl -u rancher-system-agent on the control plane node:
Mar 17 06:20:54 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:20:54Z" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-schedule>
Mar 17 06:20:54 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:20:54Z" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
Mar 17 06:20:59 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:20:59Z" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-man>
Mar 17 06:20:59 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:20:59Z" level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
Mar 17 06:20:59 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:20:59Z" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-schedule>
Mar 17 06:20:59 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:20:59Z" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
Mar 17 06:21:04 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:21:04Z" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-man>

  • journalctl -u rke2-server on the control plane node:
Mar 17 06:21:39 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rke2[1597]: time="2022-03-17T06:21:39Z" level=info msg="Waiting for API server to become available"
Mar 17 06:21:39 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rke2[1597]: time="2022-03-17T06:21:39Z" level=info msg="Waiting for API server to become available"
Mar 17 06:21:42 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rke2[1597]: time="2022-03-17T06:21:42Z" level=info msg="Connecting to proxy" url="wss://127.0.0.1:9345/v1-rke2/connect"
Mar 17 06:21:42 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rke2[1597]: time="2022-03-17T06:21:42Z" level=error msg="Failed to connect to proxy" error="unexpected EOF"
Mar 17 06:21:42 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rke2[1597]: time="2022-03-17T06:21:42Z" level=error msg="Remotedialer proxy error" error="unexpected EOF"
Mar 17 06:21:47 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rke2[1597]: time="2022-03-17T06:21:47Z" level=info msg="Connecting to proxy" url="wss://127.0.0.1:9345/v1-rke2/connect"
  • On the worker node, journalctl -u rancher-system-agent:
-- Logs begin at Thu 2022-03-17 06:16:39 UTC, end at Thu 2022-03-17 06:42:56 UTC. --
Mar 17 06:16:58 sow-rke2-upgrade-pool3-b1caa322-5v7xn systemd[1]: Started Rancher System Agent.
Mar 17 06:16:58 sow-rke2-upgrade-pool3-b1caa322-5v7xn rancher-system-agent[1383]: time="2022-03-17T06:16:58Z" level=info msg="Rancher System Agent version v0.2.3 (00181cd) is starting"
Mar 17 06:16:58 sow-rke2-upgrade-pool3-b1caa322-5v7xn rancher-system-agent[1383]: time="2022-03-17T06:16:58Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Mar 17 06:16:58 sow-rke2-upgrade-pool3-b1caa322-5v7xn rancher-system-agent[1383]: time="2022-03-17T06:16:58Z" level=info msg="Starting remote watch of plans"
Mar 17 06:16:59 sow-rke2-upgrade-pool3-b1caa322-5v7xn rancher-system-agent[1383]: time="2022-03-17T06:16:59Z" level=info msg="Starting /v1, Kind=Secret controller"
  • On the worker node, journalctl -u rke2-server:
-- Logs begin at Thu 2022-03-17 06:16:39 UTC, end at Thu 2022-03-17 06:42:56 UTC. --
-- No entries --
  • In another cluster with multus/canal as the network config, all the nodes are stuck and have not come up Active.

[Screenshot: Screen Shot 2022-03-17 at 12 07 03 PM]
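The node-level checks above can be collected in one pass with a small helper. This is a hypothetical sketch, not part of the report: the unit names (rancher-system-agent, rke2-server) come from the logs above, and it assumes a systemd host.

```shell
# Hypothetical helper, assuming a systemd host; unit names taken from this report.
# Prints the tail of each relevant service log, or a notice if journalctl is absent.
inspect_node_logs() {
  for unit in rancher-system-agent rke2-server; do
    echo "=== $unit ==="
    if command -v journalctl >/dev/null 2>&1; then
      # --no-pager keeps output scriptable; a missing unit just yields no entries
      journalctl -u "$unit" --no-pager 2>/dev/null | tail -n 20
    else
      echo "journalctl not available on this host"
    fi
  done
}

inspect_node_logs
```

Run on each stuck node (control plane and worker) to capture both service logs at once.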

@sowmyav27 sowmyav27 self-assigned this Mar 17, 2022
@sowmyav27 sowmyav27 added kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support regression status/blocker labels Mar 17, 2022
@sowmyav27 sowmyav27 added this to the v2.6.4 milestone Mar 17, 2022
@thedadams
Contributor

I was unable to reproduce the issue where the cluster provisioning would stop with CNI calico. However, I was able to consistently reproduce the problem when multus,calico or multus,canal was selected. That is a problem being fixed in RKE2: rancher/rke2#2646
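For context, the failing combinations correspond to a dual-CNI entry in the RKE2 configuration. A minimal sketch of /etc/rancher/rke2/config.yaml for a standalone install is below (Rancher-provisioned clusters set this through the cluster configuration instead of this file):

```yaml
# Sketch of the CNI selection that reproduced the failure (multus with canal).
# The equivalent single-line form is: cni: multus,canal
cni:
  - multus
  - canal
```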

@snasovich
Collaborator

Since this requires a new RKE2 release, and we're not incorporating the March 2022 releases of RKE2 into 2.6.4 due to timing, this will have to be release-noted as a known issue in 2.6.4.

@snasovich snasovich added the release-note Note this issue in the milestone's release notes label Mar 18, 2022
@sowmyav27
Contributor Author

sowmyav27 commented Mar 20, 2022

@thedadams @snasovich I hit the original issue on an RKE2 cluster on 1.23 k8s version - 3 etcd, 2 cp and 3 worker nodes on 2.6-head commit id: 024e6fa. The first RKE2 cluster in the setup does not come up Active, but the second one deployed comes up Active

@thedadams
Contributor

@sowmyav27 I was able to successfully provision a cluster with 3 etcd, 2 cp and 3 worker nodes with v1.22.7+rke2r2 on 9146220.

I double-checked and was not able to provision a cluster with the same types of nodes with v1.23.4+rke2r2.

@sowmyav27
Contributor Author

On 2.6-head commit id: 0b44573

Cluster config - 3 etcd, 2 cp, 3 worker nodes

  • Cluster with default values and k8s v1.22.7+rke2r2 comes up Active
  • Cluster with default values and k8s v1.23.4+rke2r2 does NOT come up Active.

@snasovich
Collaborator

Since this issue only affects 1.23, which will be marked as "experimental", it is fine to just release-note this and have the fix available in the 2.6.5 release.
From the conversation above, the impact of the issue is limited to 1.23 when there are multiple etcd-only nodes.
Leaving this in the 2.6.4 milestone for inclusion in the release notes; it will then be moved to 2.6.5.

@Sahota1225 Sahota1225 modified the milestones: v2.6.4, v2.6.5 Mar 24, 2022
@snasovich snasovich added the area/capr/rke2 RKE2 Provisioning issues involving CAPR label Apr 1, 2022
@thedadams
Contributor

The relevant RKE2 versions have been added to KDM. This is ready to test.

@timhaneunsoo

Test Environment:

Rancher version: v2.6-head 851ff2f
Rancher cluster type: HA
Docker version: 20.10

Downstream cluster type: RKE2 EC2 Node driver


Testing:

Tested this issue with the following steps:

  1. Deploy a node driver RKE2 cluster
  2. Check the cluster status

Result - Fail

1.22 and 1.23 clusters for both standard and admin users show the same results as described in this ticket.

@timhaneunsoo

Test Environment:

Rancher version: v2.6-head b8cbc1a
Rancher cluster type: HA
Docker version: 20.10

Downstream cluster type: RKE2 EC2 Node driver


Testing:

Tested this issue with the following steps:

  1. Deploy a node driver RKE2 cluster
  2. Check the cluster status

Result - Pass

1.22 and 1.23 clusters for both standard and admin users are now provisioning successfully and coming up Active.

@zube zube bot closed this as completed Apr 18, 2022