
Unable to provision RKE2 clusters using Amazon cloud provider with k8s 1.22 #35618

Closed
davidnuzik opened this issue Nov 19, 2021 · 13 comments
Assignees
Labels
area/capr/rke2 RKE2 Provisioning issues involving CAPR dependency-rke2 Indicates that the rancher issue has a dependency to an RKE2 issue kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement release-note Note this issue in the milestone's release notes team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support
Milestone

Comments

@davidnuzik
Contributor

Summary:

I'm unable to provision any EC2 RKE2 clusters with the Amazon cloud provider option enabled.


Environment:
Rancher version: v2.6-head b736007 (also seen on 9740d7a build from earlier today) 11/19/2021
Rancher cluster type: single-node docker install and HA helm install both attempted

Downstream cluster type: RKE2 w/ ec2 node driver and amazon cloud provider
Downstream K8s version: v1.21.6+rke2r1

Steps to Reproduce:

  1. Provision an RKE2 cluster with the ec2 node driver. Set up the Amazon cloud provider.
  2. The cluster is stuck in the provisioning state forever, on the step "waiting on probes: calico".
  3. Try provisioning an ec2 RKE2 cluster without the cloud provider. This works.

Expected Result:
I expected to be able to provision an RKE2 cluster using the Amazon cloud provider.

Actual Result:
I cannot provision an RKE2 cluster using the Amazon cloud provider. This was working as recently as yesterday.

Additional Info:
I attempted to do my due diligence. I didn't spot anything of use in the Rancher server logs. When I downloaded the SSH key to remote into the node and investigate, the SSH key did not work (and I don't think I made any mistakes using the Rancher-provided key).

@davidnuzik davidnuzik added the kind/bug-qa Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement label Nov 19, 2021
@davidnuzik davidnuzik added this to the v2.6.3 milestone Nov 19, 2021
@davidnuzik davidnuzik self-assigned this Nov 19, 2021
@sowmyav27 sowmyav27 added the area/capr/rke2 RKE2 Provisioning issues involving CAPR label Nov 19, 2021
@zube zube bot removed the [zube]: Next Up label Nov 20, 2021
@zube zube bot removed the [zube]: Working label Nov 22, 2021
@thedadams thedadams added the dependency-rke2 Indicates that the rancher issue has a dependency to an RKE2 issue label Nov 23, 2021
@snasovich snasovich added the release-note Note this issue in the milestone's release notes label Nov 23, 2021
@thedadams
Contributor

RKE2 PR: rancher/rke2#2163

@thedadams
Contributor

It was decided that the best way to fix this issue was through RKE2 (the PR is linked).

There is a workaround for this issue: set the environment variable CATTLE_MACHINE_PROVISION_IMAGE to rancher/machine:v0.15.0-rancher72 on the Rancher server. That image does not set the hostname on the EC2 nodes, which allows provisioning to continue. This can be used until the RKE2 fix is available.
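For an HA Helm install, the workaround can be expressed as a values snippet. This is a sketch that assumes the Rancher chart's `extraEnv` field; adjust to however you pass environment variables to the Rancher deployment:

```yaml
# Sketch: workaround values for the rancher Helm chart.
# Assumes the chart's `extraEnv` convention for injecting env vars.
extraEnv:
  - name: CATTLE_MACHINE_PROVISION_IMAGE
    value: rancher/machine:v0.15.0-rancher72
```

For a single-node docker install, the equivalent is passing `-e CATTLE_MACHINE_PROVISION_IMAGE=rancher/machine:v0.15.0-rancher72` to `docker run`.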

NOTE: this is only for RKE2 clusters provisioned from Rancher with the AWS cloud provider enabled. All other provisioning is unaffected.

@snasovich
Collaborator

@thedadams , thank you. Depending on which RKE2 version the fix goes into and whether it's included in the 2.6.3 release's KDM, we may want to add this workaround to the release notes, hence the release-note label.

@rancher-max
Contributor

Gave this a quick check using KDM branch dev-v2.6 on Rancher v2.6.3-rc3. I was able to provision a v1.21.7-rc2+rke2r1 cluster using the AWS cloud provider.

After SSHing into a node, I see the following config:

$ sudo cat /etc/rancher/rke2/config.yaml.d/50-rancher.yaml
{
  "advertise-address": "<redacted>",
  "agent-token": "<redacted>",
  "cloud-provider-name": "aws",
  "cni": "calico",
  "disable-kube-proxy": false,
  "etcd-expose-metrics": false,
  "etcd-snapshot-retention": 5,
  "etcd-snapshot-schedule-cron": "0 */5 * * *",
  "node-ip": [
    "<redacted>"
  ],
  "node-label": [
    "rke.cattle.io/machine=7e6f0cba-7972-437f-b6d1-66a6fb1f94cb"
  ],
  "private-registry": "/etc/rancher/rke2/registries.yaml",
  "protect-kernel-defaults": false,
  "tls-san": [
    "<redacted>"
  ],
  "token": "<redacted>"
}
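As a quick sanity check (a standalone sketch, not part of any Rancher tooling), the rendered config can be parsed to confirm the cloud-provider key made it in. The file above is written as JSON, which is also valid YAML, so the standard library is enough:

```python
import json

# Trimmed sample of the rendered 50-rancher.yaml shown above
# (sensitive values redacted, most keys omitted).
rendered = """
{
  "cloud-provider-name": "aws",
  "cni": "calico",
  "disable-kube-proxy": false
}
"""

config = json.loads(rendered)

# With the AWS cloud provider enabled, this key should be present and set.
assert config["cloud-provider-name"] == "aws"
print("cloud-provider-name:", config["cloud-provider-name"])
```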

The cluster is healthy in the rancher UI and running successfully. I'll leave this open so that we can validate with v1.22 when that is functional as well.

@davidnuzik
Contributor Author

(Waiting for #35683 so we can test 1.22.x)

@sowmyav27
Contributor

Moving it out of 2.6.3 since 1.22 is experimental in this release, and the 1.21 RKE2 version works.

@sowmyav27 sowmyav27 removed this from the v2.6.3 milestone Dec 10, 2021
@zube zube bot removed the [zube]: QA Working label Feb 2, 2022
@davidnuzik
Contributor Author

Thanks @thedadams I just updated my prior comment too, but sounds like you already know the issue. Let me know if there is anything I can do; appreciate it.

@thedadams
Contributor

This issue is waiting on an RKE2 release. For the record, I am the one doing this work (and did the earlier work that didn't fix it the first time).

@snasovich
Collaborator

Given the timing of the issues linked above, the expectation is that these fixes will be available in the February k8s patch releases of RKE2. No additional changes are believed to be needed on the Rancher side.

@Auston-Ivison-Suse

Auston-Ivison-Suse commented Mar 8, 2022

Validation Setup

  • Rancher version: v2.6-head (3df76e4)
  • Downstream clusters: rke2, k8s version: v1.21.10+rke2r1 & v1.22.7+rke2r1

Validation Steps
Step 1. Provision an RKE2 cluster with the ec2 node driver. Set up the Amazon cloud provider.

I used the following permissions for the IAM profile:

  • AmazonEC2FullAccess
  • IAMFullAccess
  • AutoScalingFullAccess
  • ElasticLoadBalancingFullAccess
  • AmazonKeyspacesReadOnlyAccess
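The policy list above can be turned into attach commands mechanically. A small sketch (the role name `rke2-node-role` is hypothetical, for illustration only):

```python
# Build the AWS managed-policy ARNs for the IAM profile permissions
# listed above. "rke2-node-role" is a hypothetical role name.
POLICIES = [
    "AmazonEC2FullAccess",
    "IAMFullAccess",
    "AutoScalingFullAccess",
    "ElasticLoadBalancingFullAccess",
    "AmazonKeyspacesReadOnlyAccess",
]

arns = [f"arn:aws:iam::aws:policy/{name}" for name in POLICIES]

# Print one attach command per policy for the hypothetical role.
for arn in arns:
    print(f"aws iam attach-role-policy --role-name rke2-node-role --policy-arn {arn}")
```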

Step 2. Select k8s version.

Step 3. Provision cluster.

Result

The cluster provisions successfully with the AWS cloud provider selected.
