[Backport v2.6] [BUG] RKE2 Downstream Clusters Not Coming Up After Rancher Migration #40300
Comments
UPDATE: The fix that was implemented didn't appear to work, and upon further investigation this seems to be a bug in rancher-system-agent. Anything that causes rancher-system-agent to be invoked to apply a change to the nodes has a high chance of hitting this, whether it's a restore or just a minor config change. Currently waiting on the correct engineering team to start looking into it.
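A minimal sketch for checking what rancher-system-agent is doing on an affected node, assuming the standard `rancher-system-agent` systemd unit and the default agent paths Rancher provisions:

```bash
# On an affected downstream node, as root.

# Confirm the agent service is up
systemctl status rancher-system-agent

# Follow the agent logs to see whether a plan is received
# and applied, or whether the agent errors out
journalctl -u rancher-system-agent -f

# Default locations for the agent's connection info and
# working directory (assumed standard paths)
ls /etc/rancher/agent /var/lib/rancher/agent
```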
@eliyamlevy, duplicating offline conversation: please roll back rancher/backup-restore-operator#296, as we don't want to backup
@nickwsuse, thank you for the update. RKE1 clusters taking longer than usual to come up certainly sounds like a separate issue, though.
Root cause identified; see #40080 (comment). The actual fix is pending, though, as there are multiple ways to resolve the issue.
Reopening as I still see the same issue. Admittedly this test was a bit different than usual, as I couldn't get the Jenkins job that builds HAs to work. Once the migration was complete, I waited for the clusters to come back up, but the RKE2 cluster started to show the same errors and messages as before the fix, i.e. `waiting for plan to be applied`.

Screenshots
Cluster Management - Still waiting for plan to be applied after five hours
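A hedged sketch of inspecting the stuck provisioning state from the local (Rancher) cluster; `fleet-default` is the default namespace for v2 provisioning objects, and `<cluster-name>` is a placeholder:

```bash
# Run with the local/management cluster's kubeconfig.

# High-level provisioning status of the downstream cluster
kubectl get clusters.provisioning.cattle.io -n fleet-default

# Per-node machine objects; a node stuck on its plan
# typically shows as not Ready here
kubectl get machines.cluster.x-k8s.io -n fleet-default

# Full conditions, including the
# "waiting for plan to be applied" message
kubectl describe clusters.provisioning.cattle.io <cluster-name> -n fleet-default
```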
@snasovich reverted in most recent RCs for
This should be addressed by the same fix done for #40742; that's why it was moved to "To Test" at the same time.
The RKE1 and RKE2 clusters both came back to `Active`. The RKE1 cluster was waiting for the cluster agent to connect for about a minute and was active at about 5 minutes. The RKE2 cluster was showing an error with the webhooks, but when doing live debugging with @Oats87 he said that's somewhat expected, since the webhooks are being reinstalled (a sketch for verifying this follows the steps below). Once the webhooks were installed and connected, the UI showed a message that wasn't very clear, as it just listed the names of the worker nodes; looking at the provisioning logs, it was configuring the nodes. Once configured, the cluster came back to the `Active` status.

For transparency and clarity, listed below are the steps I took to actually migrate from HA1 to HA2.

Pre-Migration:
Migration Steps
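Because the transient webhook errors mentioned above look similar to the original failure, a small sketch (assuming the standard `rancher-webhook` deployment in `cattle-system`) for confirming the webhook has finished reinstalling on the downstream cluster:

```bash
# Run with the downstream cluster's kubeconfig.

# rancher-webhook is reinstalled after a restore; wait for the
# rollout to finish before judging the cluster's real state
kubectl -n cattle-system rollout status deploy/rancher-webhook

# Recent logs, to separate expected reinstall noise
# from genuine errors
kubectl -n cattle-system logs deploy/rancher-webhook --tail=50
```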
This is a backport issue for #40080, automatically created via rancherbot by @eliyamlevy
Original issue description:
Rancher Server Setup
Information about the Cluster
User Information
Describe the bug
When migrating Rancher servers from one HA to another HA using Rancher Backups and Restore, the RKE2 downstream cluster is not coming back up. The status is stuck at `Updating` with the following message: `Configuring bootstrap node(s) <redacted>: waiting for plan to be applied`.

The RKE2 version I used for both Rancher versions is `v1.24.8+rke2r1`. After changing the version to the default RKE2 version for each Rancher version (`v1.24.4+rke2r1` and `v1.24.6+rke2r1` respectively) and redeploying all workloads, the RKE2 cluster for the v2.6.9 instance came up and was `Active`. The v2.7.0 RKE2 cluster is still showing the same status and message.
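The migration itself goes through the backup-restore-operator. A minimal sketch of the `Restore` resource applied on the new HA; the backup filename here is a hypothetical example, and `prune: false` follows the documented migration guidance:

```bash
# Applied on the new (target) Rancher HA cluster.
kubectl apply -f - <<EOF
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: migration-restore
spec:
  # Hypothetical example filename, not the one from this test
  backupFilename: rancher-backup-example.tar.gz
  # Migration docs use prune: false so resources that exist only
  # on the new cluster are left in place
  prune: false
EOF
```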
To Reproduce

1. Provision an RKE1 downstream cluster with `v1.24.8` as the RKE version
2. Provision an RKE2 downstream cluster with `v1.24.8+rke2r1` as the RKE version
3. Wait for both clusters to reach `Active`
4. Migrate the Rancher server to a new HA using Rancher Backups and Restore, then watch the clusters come back up (see the sketch below)
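While reproducing, the point where the RKE2 cluster gets stuck can be watched from the new HA. A hedged sketch, assuming the default `fleet-default` namespace and Rancher's machine-plan secrets:

```bash
# Watch the downstream cluster status during/after the restore
kubectl get clusters.provisioning.cattle.io -n fleet-default -w

# Each node's desired plan lives in a machine-plan secret;
# comparing it with rancher-system-agent logs on the node shows
# whether the plan was ever delivered and applied
kubectl get secrets -n fleet-default -o name | grep machine-plan
```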
Result

The RKE2 downstream cluster did not come back to the `Active` status after the restore/migration.

Expected Result

The RKE2 downstream cluster comes back to the `Active` status after the restore/migration.

Screenshots
Showing the Status and Message:
Machine Pool: