
[Backport v2.6] [BUG] RKE2 Downstream Clusters Not Coming Up After Rancher Migration #40300

Closed
rancherbot opened this issue Jan 25, 2023 · 9 comments
Labels: area/backup-recover, kind/bug, priority/0, regression, release-note, status/release-blocker, team/area3, team/hostbusters, team/rke2

@rancherbot (Collaborator)

This is a backport issue for #40080, automatically created via rancherbot by @eliyamlevy

Original issue description:

Rancher Server Setup

  • Rancher version: v2.6.9 && v2.7.0
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): RKE2
  • Proxy/Cert Details: byo-valid

Information about the Cluster

  • Kubernetes version: v1.24.8+rke2r1
  • Cluster Type (Local/Downstream): Downstream
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider): AWS

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom): Admin

Describe the bug
When migrating Rancher servers from one HA installation to another using Rancher Backups and Restore, the RKE2 downstream cluster does not come back up. The status is stuck at Updating with the following message: Configuring bootstrap node(s) <redacted>: waiting for plan to be applied

The RKE2 version I used for both Rancher versions is v1.24.8+rke2r1. After changing the version to the default RKE2 version for each Rancher version (v1.24.4+rke2r1 and v1.24.6+rke2r1, respectively) and redeploying all workloads, the RKE2 cluster on the v2.6.9 instance came up and was Active.

The v2.7.0 RKE2 cluster is still showing the same status and message.

To Reproduce

  1. Deploy a Rancher HA instance on v2.6.9 and another on v2.7.0
  2. Create an AWS RKE1 downstream cluster (3 workers, 1 control plane, 1 etcd) using v1.24.8 as the RKE version
  3. Create an AWS RKE2 downstream cluster (3 workers, 2 control plane, 3 etcd) using v1.24.8+rke2r1 as the RKE version
  4. Wait for the downstream clusters to be Active
  5. Install the Rancher Backups chart on your local cluster
  6. Create a backup in your preferred storage (I use an AWS S3 bucket); a sample Backup resource is sketched after this list
  7. Bring up a new HA and point the load balancer to the new HA
  8. Use the backup to restore onto the new HA
  9. Install Rancher with the same version as the original HA
  10. Go to Cluster Management
  11. Check cluster statuses
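
For reference, the backup in step 6 can also be created declaratively. Below is a minimal sketch of a Backup resource for the rancher-backup operator; the secret name, namespace, and bucket details are illustrative placeholders, not values from the original report:

apiVersion: resources.cattle.io/v1
kind: Backup
metadata:
  name: migration-backup
spec:
  resourceSetName: rancher-resource-set
  storageLocation:
    s3:
      credentialSecretName: s3-creds
      credentialSecretNamespace: cattle-resources-system
      bucketName: <bucket-name>
      folder: <folder>
      region: <region>
      endpoint: s3.<region>.amazonaws.com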

Result
The RKE2 downstream cluster did not come back to the Active status after the restore/migration

Expected Result

The RKE2 downstream cluster comes back to Active status after the restore/migration

Screenshots

Showing the Status and Message: [screenshot]

Machine Pool: [screenshot]

@nickwsuse (Contributor)

UPDATE:

The fix that was implemented didn't appear to work, and upon further investigation this seems to be a bug in the rancher-system-agent.

It seems like anything that leads to rancher-system-agent being invoked to apply a change to the nodes has a high chance of hitting this; it doesn't matter whether it's a restore or just a minor config change.

Currently waiting on the correct engineering team to start looking into it.

@zube zube bot added the team/rke2 label Feb 21, 2023
@zube zube bot assigned brandond Feb 21, 2023
@snasovich snasovich changed the title [Backport ] [BUG] RKE2 Downstream Clusters Not Coming Up After Rancher Migration [Backport v2.6] [BUG] RKE2 Downstream Clusters Not Coming Up After Rancher Migration Feb 21, 2023
@snasovich snasovich added the team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support label Feb 21, 2023
@snasovich (Collaborator)

@eliyamlevy , duplicating the offline conversation: please roll back rancher/backup-restore-operator#296, as we don't want to back up *machine-plan-token secrets (or any secrets of type kubernetes.io/service-account-token). These are service account tokens that won't be valid if restored on a new cluster, because they are tied to a service account ID that will be different on the new cluster.
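
As an aside (not part of the fix itself), here is a minimal sketch of how to see which secrets fall into that category on the local cluster; the fleet-default namespace below is an assumption for machine-plan secrets of provisioned RKE2 clusters:

# List all service-account token secrets across namespaces; these are tied to
# service account IDs on the source cluster and should not be backed up/restored
kubectl get secrets -A --field-selector type=kubernetes.io/service-account-token

# The machine-plan token secrets follow the *machine-plan-token naming pattern,
# e.g. (assumed) in the fleet-default namespace for provisioned RKE2 clusters
kubectl get secrets -n fleet-default | grep machine-plan-token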

@nickwsuse (Contributor)

Testing update:

Rancher 2.6.9; rke2 v1.24.9+rke2r2
    rancher/system-upgrade-controller:v0.9.1
    Post-Migration Status: Updating - Configuring bootstrap node(s): waiting for plan to be applied
    
    NOTE: After about two hours, the cluster DID come up and is in an Active status with no immediately noticeable errors.
Rancher 2.6.9; rke2 v1.23.15+rke2r1
    rancher/system-upgrade-controller:v0.9.1
    Post-Migration Status: Updating - Configuring bootstrap node(s): waiting for plan to be applied

    NOTE: After about two hours, the cluster DID come up and is in an Active status with no immediately noticeable errors.
Rancher 2.6.5; rke2 (1.23) (In-progress, will update comment with results when I have them)
    rancher/system-upgrade-controller:
    Post-Migration Status: 

It seems even the RKE1 clusters are taking a lot longer to come up after migration than they did previously. Typically it took around 10 minutes, but now it's taking an hour or more for the clusters to come up.

Some more screenshots:

Cluster Management Page - Over an hour after migration (no clusters have come back up) [screenshot]

Cluster Management Page - Over two hours after migration (all clusters have come back up) [screenshot]

Cluster Details Page - Rancher 2.6.9; rke2 v1.24.9+rke2r2 (etcd node is reconciling) [screenshot]

Cluster Details Page - Rancher 2.6.9; rke2 v1.23.15+rke2r1 (etcd node is reconciling) [screenshot]

@snasovich (Collaborator)

@nickwsuse , thank you for the update. RKE1 clusters taking longer than usual to come up certainly sounds like a separate issue though.

@Oats87 (Contributor)

Oats87 commented Feb 22, 2023

Root cause identified, see: #40080 (comment)

The actual fix is still pending, though, as there are multiple ways to resolve the issue.

@nickwsuse (Contributor)

Reopening as I still see the same issue on v2.6-head Commit ID: 240554b

Admittedly this test was a bit different than usual: I couldn't get the Jenkins job that builds HAs to work on v2.6-head, so I had to start on v2.6.9 and then upgrade the Rancher instance to v2.6-head.

Once on v2.6-head, I created two clusters - one RKE1 cluster and one RKE2 cluster, both via AWS node provisioning. While those were provisioning, I installed Rancher Backups v2.1.5-rc2 and waited for the cluster provisioning to finish. Once finished, I took a backup and then followed the migration process I have outlined in Confluence (here, for those curious).

Once the migration was complete, I waited for the clusters to come back up, but the RKE2 cluster started to show the same errors and messages as before the fix, i.e. Configuring bootstrap node(s): waiting for plan to be applied, and it is once again an etcd node that is reconciling.

Screenshots

Cluster Management - Still waiting for plan to be applied after five hours [screenshot]

Etcd Still Reconciling [screenshot]

@eliyamlevy (Contributor)

eliyamlevy commented Feb 27, 2023

@snasovich Regarding the request to roll back rancher/backup-restore-operator#296 (so that *machine-plan-token secrets and other kubernetes.io/service-account-token secrets are no longer backed up): reverted in the most recent RCs for 2.6 and 2.7.

@Jono-SUSE-Rancher Jono-SUSE-Rancher added the release-note Note this issue in the milestone's release notes label Feb 28, 2023
@snasovich (Collaborator)

This should be addressed by the same fix done for #40742; that's why it was moved to "To Test" at the same time.

@nickwsuse (Contributor)

Verified on v2.6-head ID: 5e33172

The RKE1 and RKE2 clusters both came back to Active status within 10 minutes of the restore being completed.

The RKE1 cluster was waiting for the cluster agent to be connected for about a minute and then was active at about 5 minutes.

The RKE2 cluster was showing an error with the webhooks, but while doing live debugging with @Oats87 he said that's somewhat expected since the webhooks are being reinstalled, or something along those lines. Once the webhooks were installed and connected, the UI showed a message that wasn't very clear, as it just listed the names of the worker nodes; looking at the provisioning logs, it appears the nodes were being configured. Once configured, the cluster came back to the Active state at about 7 minutes.

For transparency and clarity, listed below are the steps I took to actually migrate from HA1 to HA2

Pre-Migration:

  1. Set up an HA (HA1)
  2. On that HA cluster, provision an RKE2 EKS cluster (3 workers, 2 control plane, 3 etcd) and one k3s EKS cluster (3 workers, 2 control plane, 3 etcd)
  3. While those are provisioning, install the Rancher Backups chart (a helm install sketch follows this list)
  4. Once the downstream clusters are done provisioning, create a backup using the Rancher Backups chart
  5. Migrate from HA1 -> HA2 using the backup that was just taken as the restore point for HA2

Migration Steps

  1. Launch 3 EC2 instances
  2. Use rke up --config config.yml, where config.yml consists of the settings below:
# set path to the ssh private key so rke can ssh into each node for provisioning
ssh_key_path: <redacted>
# kubernetes_version is not required; if null, rke uses the latest version
kubernetes_version:
# NODES:
# For Amazon the user is ubuntu; for DigitalOcean the user is root
# Internal IPs are used for node-to-node traffic with the lowest latency
# An all-in-one (a "one-box") is all three roles on one node
nodes:
  - address: <EXTERNAL_IP_1>
    internal_address: <INTERNAL_IP_1>
    user: ubuntu
    role: [etcd, controlplane, worker]
  - address: <EXTERNAL_IP_2>
    internal_address: <INTERNAL_IP_2>
    user: ubuntu
    role: [etcd, controlplane, worker]
  - address: <EXTERNAL_IP_3>
    internal_address: <INTERNAL_IP_3>
    user: ubuntu
    role: [etcd, controlplane, worker]
  3. Run export KUBECONFIG=kube_config_<name_of_config_file>.yaml
  4. Apply the credentials of the S3 bucket where the backup is stored
  5. Install the Rancher Backup CRD and chart with the version that was used on the original HA
  6. Terminate the EC2 instances for HA1
  7. Deregister the HA1 instances in the target group(s) for the NLB of HA1
  8. Register the HA2 instances in the target group(s) for the NLB of HA1
  9. Wait for the health checks to come back as healthy
  10. Create a migration Restore resource yaml file with the contents below:
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-migration
spec:
  backupFilename: <backupFile>
  prune: false
  storageLocation:
    s3:
      credentialSecretName: 
      credentialSecretNamespace: 
      bucketName: 
      folder: 
      region: 
      endpoint: 
  11. Either install cert-manager or set up the byo-valid TLS secret
  12. Install Rancher (a sketch of applying the restore and installing Rancher follows this list)
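
To make steps 10-12 concrete, here is a minimal sketch of applying the Restore above and then installing Rancher. The secret name, namespaces, release names, and hostname are illustrative assumptions, not values from the original run:

# S3 credentials referenced by spec.storageLocation.s3.credentialSecretName (step 4);
# the rancher-backup operator expects accessKey/secretKey entries
kubectl create secret generic s3-creds \
  -n cattle-resources-system \
  --from-literal=accessKey=<AWS_ACCESS_KEY_ID> \
  --from-literal=secretKey=<AWS_SECRET_ACCESS_KEY>

# Apply the Restore resource (step 10) and watch until it completes
kubectl apply -f restore-migration.yaml
kubectl get restore restore-migration -w

# Install Rancher with the same chart version as HA1 (step 12); the hostname must
# match the DNS name served by the NLB that was repointed to HA2
helm repo add rancher-stable https://releases.rancher.com/server-charts/stable
helm install rancher rancher-stable/rancher \
  -n cattle-system --create-namespace \
  --version <original-rancher-version> \
  --set hostname=<rancher-hostname>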

@zube zube bot removed the [zube]: Done label Jun 2, 2023