
[Backport v2.6] [BUG] RKE2 Downstream Clusters Not Coming Up After Rancher Migration #40300

Closed
rancherbot opened this issue Jan 25, 2023 · 9 comments
Labels: area/backup-recover, kind/bug, priority/0, regression, release-note, status/release-blocker, team/area3, team/hostbusters, team/rke2

@rancherbot (Collaborator)

This is a backport issue for #40080, automatically created via rancherbot by @eliyamlevy

Original issue description:

Rancher Server Setup

  • Rancher version: v2.6.9 && v2.7.0
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2
  • Proxy/Cert Details: byo-valid

Information about the Cluster

  • Kubernetes version: v1.24.8+rke2r1
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): AWS

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom): Admin

Describe the bug
When migrating Rancher servers from one HA installation to another using Rancher Backups and Restore, the RKE2 downstream cluster does not come back up. The status is stuck at Updating with the following message: Configuring bootstrap node(s) <redacted>: waiting for plan to be applied

The RKE2 version I used for both Rancher versions is v1.24.8+rke2r1. After changing the version to the default RKE2 version for each Rancher version (v1.24.4+rke2r1 and v1.24.6+rke2r1, respectively) and redeploying all workloads, the RKE2 cluster on the v2.6.9 instance came up and was Active.

The v2.7.0 RKE2 cluster is still showing the same status and message.

To Reproduce

  1. Deploy a Rancher HA instance on v2.6.9 and another on v2.7.0
  2. Create an AWS RKE1 downstream cluster (3 workers, 1 control plane, 1 etcd) using v1.24.8 as the RKE version
  3. Create an AWS RKE2 downstream cluster (3 workers, 2 control plane, 3 etcd) using v1.24.8+rke2r1 as the RKE version
  4. Wait for the downstream clusters to be Active
  5. Install the Rancher Backups chart on your local cluster
  6. Create a backup in your preferred storage (I use an AWS S3 bucket); a sample Backup resource is sketched after this list
  7. Bring up a new HA and point the load balancer to the new HA
  8. Use the backup to restore onto the new HA
  9. Install Rancher with the same version as the original HA
  10. Go to Cluster Management
  11. Check cluster statuses
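
For reference, the backup in step 6 can also be created declaratively. Below is a minimal sketch of a Backup resource for the rancher-backup operator; the secret name, namespace, and bucket details are illustrative placeholders, not values from the original report:

apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: migration-backup
spec:
  resourceSetName: rancher-resource-set
  storageLocation:
    s3:
      credentialSecretName: s3-creds
      credentialSecretNamespace: cattle-resources-system
      bucketName: <bucket-name>
      folder: <folder>
      region: <region>
      endpoint: s3.<region>.amazonaws.com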

Result
The RKE2 downstream cluster did not come back to the Active status after the restore/migration

Expected Result

The RKE2 downstream cluster comes back to Active status after the restore/migration

Screenshots

Showing the Status and Message: [screenshot]

Machine Pool: [screenshot]

@nickwsuse (Contributor)

UPDATE:

The fix that was implemented didn't appear to work, and upon further investigation this seems to be a bug in the rancher-system-agent.

It seems like anything that leads to rancher-system-agent being invoked to apply a change to the nodes has a high chance of hitting this; it doesn't matter whether it's a restore or just a minor config change.

Currently waiting on the correct engineering team to start looking into it.

@zube zube bot added the team/rke2 label Feb 21, 2023
@zube zube bot assigned brandond Feb 21, 2023
@snasovich snasovich changed the title [Backport ] [BUG] RKE2 Downstream Clusters Not Coming Up After Rancher Migration [Backport v2.6] [BUG] RKE2 Downstream Clusters Not Coming Up After Rancher Migration Feb 21, 2023
@snasovich snasovich added the team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support label Feb 21, 2023
@snasovich (Collaborator)

@eliyamlevy , duplicating the offline conversation: please roll back rancher/backup-restore-operator#296, as we don't want to back up *machine-plan-token secrets (or any secrets of type kubernetes.io/service-account-token). These are service account tokens that won't be valid if restored on a new cluster, because they are tied to a service account ID that will be different on the new cluster.
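
As an aside (not part of the fix itself), here is a minimal sketch of how to see which secrets fall into that category on the local cluster; the fleet-default namespace below is an assumption for machine-plan secrets of provisioned RKE2 clusters:

# List all service-account token secrets across namespaces; these are tied to
# service account IDs on the source cluster and should not be backed up/restored
kubectl get secrets -A --field-selector type=kubernetes.io/service-account-token

# The machine-plan token secrets follow the *machine-plan-token naming pattern,
# e.g. (assumed) in the fleet-default namespace for provisioned RKE2 clusters
kubectl get secrets -n fleet-default | grep machine-plan-token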

@nickwsuse (Contributor)

Testing update:

Rancher 2.6.9; rke2 v1.24.9+rke2r2
    rancher/system-upgrade-controller:v0.9.1
    Post-Migration Status: Updating - Configuring bootstrap node(s): waiting for plan to be applied
    
    NOTE: After about two hours, the cluster DID come up and is in an Active status with no immediately noticeable errors.
Rancher 2.6.9; rke2 v1.23.15+rke2r1
    rancher/system-upgrade-controller:v0.9.1
    Post-Migration Status: Updating - Configuring bootstrap node(s): waiting for plan to be applied

    NOTE: After about two hours, the cluster DID come up and is in an Active status with no immediately noticeable errors.
Rancher 2.6.5; rke2 (1.23) (In-progress, will update comment with results when I have them)
    rancher/system-upgrade-controller:
    Post-Migration Status: 

It seems even the RKE1 clusters are taking a lot longer to come up after migration than they did previously. Typically it took around 10 minutes, but now it's taking an hour or more for the clusters to come up.

Some more screenshots:

Cluster Management Page - Over an hour after migration (no clusters have come back up) [screenshot]

Cluster Management Page - Over two hours after migration (all clusters have come back up) [screenshot]

Cluster Details Page - Rancher 2.6.9; rke2 v1.24.9+rke2r2 (etcd node is reconciling) [screenshot]

Cluster Details Page - Rancher 2.6.9; rke2 v1.23.15+rke2r1 (etcd node is reconciling) [screenshot]

@snasovich (Collaborator)

@nickwsuse , thank you for the update. RKE1 clusters taking longer than usual to come up certainly sounds like a separate issue though.

@Oats87 (Contributor)

Oats87 commented Feb 22, 2023

Root cause identified, see: #40080 (comment)

The actual fix is still pending, though, as there are multiple ways to resolve the issue.

@nickwsuse (Contributor)

Reopening as I still see the same issue on v2.6-head Commit ID: 240554b

Admittedly this test was a bit different than usual: I couldn't get the Jenkins job that builds HAs to work on v2.6-head, so I had to start on v2.6.9 and then upgrade the Rancher instance to v2.6-head.

Once on v2.6-head, I created two clusters - one RKE1 cluster and one RKE2 cluster, both via AWS node provisioning. While those were provisioning, I installed Rancher Backups v2.1.5-rc2 and waited for the cluster provisioning to finish. Once finished, I took a backup and then followed the migration process I have outlined in Confluence (here, for those curious).

Once the migration was complete, I waited for the clusters to come back up, but the RKE2 cluster started to show the same errors and messages as before the fix, i.e. Configuring bootstrap node(s): waiting for plan to be applied, and it is once again an etcd node that is reconciling.

Screenshots

Cluster Management - Still waiting for plan to be applied after five hours [screenshot]

Etcd Still Reconciling [screenshot]

@eliyamlevy (Contributor)

eliyamlevy commented Feb 27, 2023

@snasovich Regarding the request to roll back rancher/backup-restore-operator#296 (so that *machine-plan-token secrets and other kubernetes.io/service-account-token secrets are no longer backed up): reverted in the most recent RCs for 2.6 and 2.7.

@Jono-SUSE-Rancher Jono-SUSE-Rancher added the release-note Note this issue in the milestone's release notes label Feb 28, 2023
@snasovich (Collaborator)

This should be addressed by the same fix done for #40742; that's why it was moved to "To Test" at the same time.

@nickwsuse (Contributor)

Verified on v2.6-head ID: 5e33172

The RKE1 and RKE2 clusters both came back to Active status within 10 minutes of the restore being completed.

The RKE1 cluster was waiting for the cluster agent to be connected for about a minute and then was active at about 5 minutes.

The RKE2 cluster was showing an error with the webhooks, but while doing live debugging with @Oats87 he said that's somewhat expected since the webhooks are being reinstalled, or something along those lines. Once the webhooks were installed and connected, the UI showed a message that wasn't very clear, as it just listed the names of the worker nodes; looking at the provisioning logs, it appears the nodes were being configured. Once configured, the cluster came back to the Active state at about 7 minutes.

For transparency and clarity, listed below are the steps I took to actually migrate from HA1 to HA2

Pre-Migration:

  1. Set up an HA (HA1)
  2. On that HA cluster, provision an RKE2 EKS cluster (3 workers, 2 control plane, 3 etcd) and one k3s EKS cluster (3 workers, 2 control plane, 3 etcd)
  3. While those are provisioning, install the Rancher Backups chart (a helm install sketch follows this list)
  4. Once the downstream clusters are done provisioning, create a backup using the Rancher Backups chart
  5. Migrate from HA1 -> HA2 using the backup that was just taken as the restore point for HA2

Migration Steps

  1. Launch 3 EC2 instances
  2. Use rke up --config config.yml, where config.yml consists of the settings below:
# set path to the ssh private key so rke can ssh into each node for provisioning
ssh_key_path: <redacted>
# kubernetes_version is not required; if null, rke uses the latest version
kubernetes_version:
# NODES:
# For Amazon the user is ubuntu; for DigitalOcean the user is root
# Internal IPs are used for node-to-node traffic with the lowest latency
# An all-in-one (a "one-box") is all three roles on one node
nodes:
  - address: <EXTERNAL_IP_1>
    internal_address: <INTERNAL_IP_1>
    user: ubuntu
    role: [etcd, controlplane, worker]
  - address: <EXTERNAL_IP_2>
    internal_address: <INTERNAL_IP_2>
    user: ubuntu
    role: [etcd, controlplane, worker]
  - address: <EXTERNAL_IP_3>
    internal_address: <INTERNAL_IP_3>
    user: ubuntu
    role: [etcd, controlplane, worker]
  3. Run export KUBECONFIG=kube_config_<name_of_config_file>.yaml
  4. Apply the credentials of the S3 bucket where the backup is stored
  5. Install the Rancher Backup CRD and chart with the version that was used on the original HA
  6. Terminate the EC2 instances for HA1
  7. Deregister the HA1 instances in the target group(s) for the NLB of HA1
  8. Register the HA2 instances in the target group(s) for the NLB of HA1
  9. Wait for the health checks to come back as healthy
  10. Create a migration Restore resource yaml file with the contents below:
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-migration
spec:
  backupFilename: <backupFile>
  prune: false
  storageLocation:
    s3:
      credentialSecretName: 
      credentialSecretNamespace: 
      bucketName: 
      folder: 
      region: 
      endpoint: 
  11. Either install cert-manager or set up the byo-valid TLS secret
  12. Install Rancher (a sketch of applying the restore and installing Rancher follows this list)
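
To make steps 10-12 concrete, here is a minimal sketch of applying the Restore above and then installing Rancher. The secret name, namespaces, release names, and hostname are illustrative assumptions, not values from the original run:

# S3 credentials referenced by spec.storageLocation.s3.credentialSecretName (step 4);
# the rancher-backup operator expects accessKey/secretKey entries
kubectl create secret generic s3-creds \
  -n cattle-resources-system \
  --from-literal=accessKey=<AWS_ACCESS_KEY_ID> \
  --from-literal=secretKey=<AWS_SECRET_ACCESS_KEY>

# Apply the Restore resource (step 10) and watch until it completes
kubectl apply -f restore-migration.yaml
kubectl get restore restore-migration -w

# Install Rancher with the same chart version as HA1 (step 12); the hostname must
# match the DNS name served by the NLB that was repointed to HA2
helm repo add rancher-stable https://releases.rancher.com/server-charts/stable
helm install rancher rancher-stable/rancher \
  -n cattle-system --create-namespace \
  --version <original-rancher-version> \
  --set hostname=<rancher-hostname>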

@zube zube bot removed the [zube]: Done label Jun 2, 2023