
Cannot provision RKE2 node driver cluster #36939

Closed
sowmyav27 opened this issue Mar 17, 2022 · 9 comments
Assignees
Labels
area/capr/rke2 RKE2 Provisioning issues involving CAPR kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement QA/XS regression release-note Note this issue in the milestone's release notes status/blocker team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support
Milestone

Comments

@sowmyav27
Contributor

sowmyav27 commented Mar 17, 2022

Rancher Server Setup

  • Rancher version: 2.6-head commit id: 5bb08b3
  • Installation option (Docker install/Helm Chart): docker install

Information about the Cluster

  • Kubernetes version: v1.22.7+rke2r2 and v1.23.4+rke2r2
  • Cluster Type (Local/Downstream): Node driver EC2 cluster

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) admin

Describe the bug
Cannot provision RKE2 node driver cluster

To Reproduce

  • Deploy a node driver RKE2 cluster
  • The control plane and worker nodes fail to come up in the cluster.
  • The cluster is stuck in the provisioning state.

[Screenshot: Screen Shot 2022-03-17 at 12 03 21 PM]

  • journalctl -u rancher-system-agent on the control plane node:
Mar 17 06:20:54 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:20:54Z" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-schedule>
Mar 17 06:20:54 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:20:54Z" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
Mar 17 06:20:59 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:20:59Z" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-man>
Mar 17 06:20:59 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:20:59Z" level=error msg="error while appending ca cert to pool for probe kube-controller-manager"
Mar 17 06:20:59 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:20:59Z" level=error msg="error loading CA cert for probe (kube-scheduler) /var/lib/rancher/rke2/server/tls/kube-scheduler/kube-schedule>
Mar 17 06:20:59 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:20:59Z" level=error msg="error while appending ca cert to pool for probe kube-scheduler"
Mar 17 06:21:04 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rancher-system-agent[1376]: time="2022-03-17T06:21:04Z" level=error msg="error loading CA cert for probe (kube-controller-manager) /var/lib/rancher/rke2/server/tls/kube-controller-man>

  • journalctl -u rke2-server on the control plane node:
Mar 17 06:21:39 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rke2[1597]: time="2022-03-17T06:21:39Z" level=info msg="Waiting for API server to become available"
Mar 17 06:21:39 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rke2[1597]: time="2022-03-17T06:21:39Z" level=info msg="Waiting for API server to become available"
Mar 17 06:21:42 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rke2[1597]: time="2022-03-17T06:21:42Z" level=info msg="Connecting to proxy" url="wss://127.0.0.1:9345/v1-rke2/connect"
Mar 17 06:21:42 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rke2[1597]: time="2022-03-17T06:21:42Z" level=error msg="Failed to connect to proxy" error="unexpected EOF"
Mar 17 06:21:42 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rke2[1597]: time="2022-03-17T06:21:42Z" level=error msg="Remotedialer proxy error" error="unexpected EOF"
Mar 17 06:21:47 sow-rke2-upgrade-pool2-ebe3d5a6-klf27 rke2[1597]: time="2022-03-17T06:21:47Z" level=info msg="Connecting to proxy" url="wss://127.0.0.1:9345/v1-rke2/connect"
  • On the worker node, journalctl -u rancher-system-agent:
-- Logs begin at Thu 2022-03-17 06:16:39 UTC, end at Thu 2022-03-17 06:42:56 UTC. --
Mar 17 06:16:58 sow-rke2-upgrade-pool3-b1caa322-5v7xn systemd[1]: Started Rancher System Agent.
Mar 17 06:16:58 sow-rke2-upgrade-pool3-b1caa322-5v7xn rancher-system-agent[1383]: time="2022-03-17T06:16:58Z" level=info msg="Rancher System Agent version v0.2.3 (00181cd) is starting"
Mar 17 06:16:58 sow-rke2-upgrade-pool3-b1caa322-5v7xn rancher-system-agent[1383]: time="2022-03-17T06:16:58Z" level=info msg="Using directory /var/lib/rancher/agent/work for work"
Mar 17 06:16:58 sow-rke2-upgrade-pool3-b1caa322-5v7xn rancher-system-agent[1383]: time="2022-03-17T06:16:58Z" level=info msg="Starting remote watch of plans"
Mar 17 06:16:59 sow-rke2-upgrade-pool3-b1caa322-5v7xn rancher-system-agent[1383]: time="2022-03-17T06:16:59Z" level=info msg="Starting /v1, Kind=Secret controller"
  • On the worker node, journalctl -u rke2-server:
-- Logs begin at Thu 2022-03-17 06:16:39 UTC, end at Thu 2022-03-17 06:42:56 UTC. --
-- No entries --
  • In another cluster with multus/canal as the network config, all the nodes are stuck and have not come up Active.

[Screenshot: Screen Shot 2022-03-17 at 12 07 03 PM]
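The node-level checks above can be collected in one pass with a small helper. This is a hypothetical sketch, not part of the report: the unit names (rancher-system-agent, rke2-server) come from the logs above, and it assumes a systemd host.

```shell
# Hypothetical helper, assuming a systemd host; unit names taken from this report.
# Prints the tail of each relevant service log, or a notice if journalctl is absent.
inspect_node_logs() {
  for unit in rancher-system-agent rke2-server; do
    echo "=== $unit ==="
    if command -v journalctl >/dev/null 2>&1; then
      # --no-pager keeps output scriptable; a missing unit just yields no entries
      journalctl -u "$unit" --no-pager 2>/dev/null | tail -n 20
    else
      echo "journalctl not available on this host"
    fi
  done
}

inspect_node_logs
```

Run on each stuck node (control plane and worker) to capture both service logs at once.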

@sowmyav27 sowmyav27 self-assigned this Mar 17, 2022
@sowmyav27 sowmyav27 added kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support regression status/blocker labels Mar 17, 2022
@sowmyav27 sowmyav27 added this to the v2.6.4 milestone Mar 17, 2022
@thedadams
Contributor

I was unable to reproduce the issue where the cluster provisioning would stop with CNI calico. However, I was able to consistently reproduce the problem when multus,calico or multus,canal was selected. That is a problem being fixed in RKE2: rancher/rke2#2646
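For context, the failing combinations correspond to a dual-CNI entry in the RKE2 configuration. A minimal sketch of /etc/rancher/rke2/config.yaml for a standalone install is below (Rancher-provisioned clusters set this through the cluster configuration instead of this file):

```yaml
# Sketch of the CNI selection that reproduced the failure (multus with canal).
# The equivalent single-line form is: cni: multus,canal
cni:
  - multus
  - canal
```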

@snasovich
Collaborator

Since this requires a new RKE2 release, and we're not incorporating the March 2022 releases of RKE2 into 2.6.4 due to timing, this will have to be release-noted as a known issue in 2.6.4.

@snasovich snasovich added the release-note Note this issue in the milestone's release notes label Mar 18, 2022
@sowmyav27
Contributor Author

sowmyav27 commented Mar 20, 2022

@thedadams @snasovich I hit the original issue on an RKE2 cluster on 1.23 k8s version - 3 etcd, 2 cp and 3 worker nodes on 2.6-head commit id: 024e6fa. The first RKE2 cluster in the setup does not come up Active, but the second one deployed comes up Active

@thedadams
Contributor

@sowmyav27 I was able to successfully provision a cluster with 3 etcd, 2 cp and 3 worker nodes with v1.22.7+rke2r2 on 9146220.

I double-checked and was not able to provision a cluster with the same types of nodes with v1.23.4+rke2r2.

@sowmyav27
Contributor Author

On 2.6-head commit id: 0b44573

Cluster config - 3 etcd, 2 cp, 3 worker nodes

  • Cluster with default values and k8s v1.22.7+rke2r2 comes up Active
  • Cluster with default values and k8s v1.23.4+rke2r2 does NOT come up Active.

@snasovich
Collaborator

Since this issue only affects 1.23, which will be marked as "experimental", it is fine to just release-note this and have the fix available in the 2.6.5 release.
From the conversation above, the impact of the issue is limited to 1.23 when there are multiple etcd-only nodes.
Leaving this in the 2.6.4 milestone for inclusion in the release notes; it will then be moved to 2.6.5.

@Sahota1225 Sahota1225 modified the milestones: v2.6.4, v2.6.5 Mar 24, 2022
@snasovich snasovich added the area/capr/rke2 RKE2 Provisioning issues involving CAPR label Apr 1, 2022
@thedadams
Contributor

The relevant RKE2 versions have been added to KDM. This is ready to test.

@timhaneunsoo

Test Environment:

Rancher version: v2.6-head 851ff2f
Rancher cluster type: HA
Docker version: 20.10

Downstream cluster type: RKE2 EC2 Node driver


Testing:

Tested this issue with the following steps:

  1. Deploy a node driver RKE2 cluster
  2. Check the cluster status

Result - Fail

1.22 and 1.23 clusters for both standard and admin users show the same results as described in this ticket.

@timhaneunsoo

Test Environment:

Rancher version: v2.6-head b8cbc1a
Rancher cluster type: HA
Docker version: 20.10

Downstream cluster type: RKE2 EC2 Node driver


Testing:

Tested this issue with the following steps:

  1. Deploy a node driver RKE2 cluster
  2. Check the cluster status

Result - Pass

1.22 and 1.23 clusters for both standard and admin users are now provisioning successfully and coming up Active.

@zube zube bot closed this as completed Apr 18, 2022