
KEv2 clusters imported using the Rancher client have their config improperly rewritten #36128

Closed
thedadams opened this issue Jan 13, 2022 · 7 comments
Labels: area/aks area/cli area/eks area/gke area/terraform feature/kev2 internal QA/S team/hostbusters

thedadams (Contributor) commented Jan 13, 2022

Rancher Server Setup

  • Rancher version: v2.6-head
  • Installation option (Docker install/Helm Chart): any
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): any
  • Proxy/Cert Details: any

Information about the Cluster

  • Kubernetes version:
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): imported KEv2

Describe the bug
If a KEv2 cluster is imported using the Rancher client (for example, via terraform), the config is improperly applied. For instance, importing an EKS cluster with terraform while omitting node groups entirely (which is the correct configuration for an import) ends up deleting all of the node groups from the EKS cluster once it is imported into Rancher.

To Reproduce
Use the terraform provider to import an EKS cluster. Do not specify any node groups in the terraform config.

Result
All the node groups of the EKS cluster will be deleted after the EKS cluster is imported.

Expected Result
The cluster should be imported and not updated at all in EKS.

Additional context
The underlying issue here is that the slice fields in the KEv2 structs carry the struct tag norman:pointer. This is incorrect because a slice is already a nil-able, pointer-like type, so the tag is unnecessary.
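
For illustration only, here is a minimal, hypothetical Go sketch of the pattern (the struct and field names are stand-ins, not the actual operator source). It shows the kind of slice field that carried the tag and why the tag is redundant:

package main

import "fmt"

type NodeGroup struct {
    Name string `json:"name"`
}

type ClusterConfigSpec struct {
    // Before the fix the field carried an extra tag:
    //   NodeGroups []NodeGroup `json:"nodeGroups" norman:"pointer"`
    // A slice can already be nil, so the pointer treatment adds nothing.
    NodeGroups []NodeGroup `json:"nodeGroups"`
}

func main() {
    var spec ClusterConfigSpec
    fmt.Println(spec.NodeGroups == nil) // true: an unset slice field is nil, not []
}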

To fix this:

  1. All slice types in the three KEv2 operators need to have this struct tag removed.
  2. The dependencies for these three operators should be bumped in Rancher.
  3. Run go generate in Rancher.

After Rancher is updated, the terraform provider and the CLI should also be updated.

rancher/terraform-provider-rancher2#800

SURE-3842
SURE-3833

a-blender (Contributor) commented Feb 11, 2022

I fixed this issue with direction from @thedadams.

If a node_group was not defined in Terraform, Rancher received an empty NodeGroups array in the ClusterConfigSpec and reconciled that state by removing the node groups in the KEv2 provider.
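
As a standalone illustration (not Rancher or Norman code), the distinction the fix relies on can be seen with plain Go JSON marshaling: a nil slice serializes to null, which Rancher treats as "leave the node groups alone", while an empty slice serializes to [], which reads as "reconcile down to zero node groups":

package main

import (
    "encoding/json"
    "fmt"
)

type spec struct {
    NodeGroups []string `json:"nodeGroups"`
}

func main() {
    unset, _ := json.Marshal(spec{})                       // field left unset -> nil slice
    empty, _ := json.Marshal(spec{NodeGroups: []string{}}) // field set to an empty list
    fmt.Println(string(unset)) // {"nodeGroups":null}
    fmt.Println(string(empty)) // {"nodeGroups":[]}
}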

I opened three PRs to fix each KEv2 operator, and three more PRs to fix code in the terraform provider and bump the operator and CRD versions in rancher/rancher and rancher/charts.

Fixes

Integration PRs

a-blender (Contributor) commented Feb 11, 2022

Testing template

Root cause

Unnecessary norman:pointer struct tags on the NodeGroups field in the ClusterConfigSpec for each KEv2 provider.

What was fixed, or what changes have occurred

  • norman:pointer struct tags removed from each KEv2 operator
  • Updated code in terraform provider
  • KEv2 operator versions bumped in rancher/rancher and rancher/charts

Areas or cases that should be tested

Test importing an EKS and AKS cluster into Rancher using terraform (GKE didn't actually have the bug but did receive code updates)

  1. Create EKS cluster in AWS
  2. Create terraform config, don't specify node_group
  3. Import cluster into Rancher
  4. Check in EKS if your node groups have been deleted or not. They should still be active!
  5. Repeat for AKS

EKS terraform config

terraform {
  required_providers {
    rancher2 = {
      source = "rancher/rancher2"
      version = "1.0.0"
    }
  }
}

provider "rancher2" {
  api_url    = "<Rancher URL>"
  token_key = "<Rancher API bearer token>"
  insecure = true
}

resource "rancher2_cluster" "bar" {
  name = "ablender-eks-test"
  description = "Terraform EKS cluster"
  eks_config_v2 {
    cloud_credential_id = "cattle-global-data:<cloud credential token>"
    region = "<region>"
    imported = true
  }
}

AKS terraform config

terraform {
  required_providers {
    rancher2 = {
      source = "rancher/rancher2"
      version = "1.0.0"
    }
  }
}

provider "rancher2" {
  api_url    = "<Rancher URL>"
  token_key = "<Rancher API bearer token>"
  insecure = true
}

resource "rancher2_cloud_credential" "aks-cluster" {
  name = "aks-cluster"
  azure_credential_config {
    client_id = "<Client ID"
    client_secret = "<Client secret value>"
    subscription_id = "<Subscription ID"
  }
}

What areas could experience regressions?

KEv2 provider import and cluster provisioning.

Are the repro steps accurate/minimal?

Yes.

sowmyav27 (Contributor) commented

Test cases to validate:

  • Import EKS v2 into Rancher using TF
  • Import EKS v2 in Rancher using Rancher UI
- Create an EKS cluster `cluster-1` in the AWS console (latest k8s version). Add node groups to this cluster in the AWS console.
- Import this cluster in Rancher by navigating to Cluster Management --> Import --> EKS
- Add in the credentials, select the right region where EKS is created in AWS console.
- Select the cluster from the dropdown - `cluster-1`
- Click on Create/Save
- Verify the cluster comes up Active and that node groups/nodes are available on the cluster.
- Edit cluster
- Add another nodegroup
- Save changes made
- Wait for the cluster to come back up Active.
- Verify in Rancher that all node groups, including the existing ones and the newly added one, are available.
- Verify in AWS that all node groups are available.
  • Provision an EKS v2 cluster from Rancher UI

timhaneunsoo commented Feb 23, 2022

Test Environment:

Rancher version: v2.6-head
Rancher cluster type: HA
Docker version: 20.10

Downstream cluster type: EKS v2 cluster


Testing:

Tested this issue with the following scenarios:

  • Import EKS v2 into Rancher using TF
  • Import EKS v2 in Rancher using Rancher UI
  • Provision an EKS v2 cluster from Rancher UI

Result

  • Import EKS v2 into Rancher using TF - Low Pass
    The cluster gets added into Rancher but is stuck on Waiting
    [screenshot: Screen Shot 2022-02-22 at 7.13.54 PM.png]

  • Import EKS v2 in Rancher using Rancher UI - Low Pass
    The cluster gets added into Rancher but is stuck on Waiting
    [screenshot: Screen Shot 2022-02-22 at 7.14.28 PM.png]

  • Provision an EKS v2 cluster from Rancher UI - Pass

a-blender (Contributor) commented Mar 1, 2022

@timhaneunsoo I was unable to reproduce the specific error that you showed above. I did, however, test importing an EKS cluster on v2.6-head running on a DO node, and ran into an intermittent issue that I've seen before where the UI hangs with Waiting for API to be available.

This is my terraform config

terraform {
  required_providers {
    rancher2 = {
      source = "terraform.example.com/local/rancher2"
      version = "1.0.0"
    }
  }
}

provider "rancher2" {
  api_url    = "<DO node ip>"
  token_key = "<Rancher API bearer token>"
  insecure = true
}

resource "rancher2_cluster" "bar" {
  name = "ablender-eks-test"
  description = "Terraform EKS cluster"
  eks_config_v2 {
    cloud_credential_id = "cattle-global-data:<Rancher AWS cloud credential>"
    region = "us-east-2"
    imported = true
  }
}

Here's the result of trying to import EKS ablender-eks-test

image

The mgmt cluster shows NodeGroups: null in the eksConfig which is correct.

image

This means node groups in EKS will not get deleted on import. The node group for my cluster is also still active. @kinarashah and I believe this issue is fixed. I will open a new GitHub issue for the error I saw on 2.6-head.

This issue can be retested.

a-blender (Contributor) commented Mar 1, 2022

Here's the new GitHub issue #36700 that blocked my EKS import when I replicated the QA tests. To verify the fix for this issue, check the mgmt cluster yaml in the Rancher UI and verify that NodeGroups: null. If this is set and your node group still exists in AWS, the node deletion bug is fixed.

timhaneunsoo commented

Test Environment:

Rancher version: v2.6-head
Rancher cluster type: HA
Docker version: 20.10

Downstream cluster type: EKS v2 cluster


Testing:

Retested this issue with the following scenarios:

  • Import EKS v2 into Rancher using TF
  • Import EKS v2 in Rancher using Rancher UI
  • Provision an EKS v2 cluster from Rancher UI (Pass from previous testing)

Result

  • Import EKS v2 into Rancher using TF - Pass
    Node groups are set to null and the node group is still preserved after importing into Rancher.
    [screenshot: Screen Shot 2022-03-03 at 1.52.11 PM.png]

  • Import EKS v2 in Rancher using Rancher UI - Pass
    Node groups are set to null and the node group is still preserved after importing into Rancher.
    [screenshot: Screen Shot 2022-03-02 at 6.49.30 PM.png]

  • Provision an EKS v2 cluster from Rancher UI - Pass (from previous testing)
