
RKE2 Cluster is stuck in provisioning state after an upgrade to 2.6-head and rollback to 2.6.3 #36859

Closed
sowmyav27 opened this issue Mar 11, 2022 · 6 comments

Labels: area/capr/rke2 (RKE2 Provisioning issues involving CAPR), kind/bug-qa (Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement), regression, team/hostbusters (The team that is responsible for provisioning/managing downstream clusters + K8s version support)

@sowmyav27 (Contributor) commented Mar 11, 2022

Rancher Server Setup

  • Rancher version: 2.6-head commit id: bab14a
  • Installation option (Docker install/Helm Chart): docker install

Information about the Cluster

  • Kubernetes version: v1.21.9+rke2r1
  • Cluster Type (Local/Downstream): Downstream; 3 etcd, 2 control plane, and 3 worker nodes

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) admin

Describe the bug
RKE2 Cluster is stuck in provisioning state after an upgrade to 2.6-head and rollback to 2.6.3

To Reproduce

  • Deploy a 3 etcd, 2 control plane, and 3 worker node RKE2 cluster on v1.21.9+rke2r1
  • Deploy resources on the downstream cluster
  • Upgrade Rancher to 2.6-head
  • The cluster is in an error state for some time before going back to Active, and the nodes start getting deleted and reprovisioned (bug logged separately)
  • Noticed that the nodes were deleted in parallel: 1 etcd, 1 control plane, and 1 worker node were deleted and reprovisioned at the same time
  • Existing resources on the cluster worked fine, and new resources could be deployed
  • Roll back Rancher to 2.6.3 (via the Docker install restore path per https://rancher.com/docs/rancher/v2.6/en/backups/docker-installs/docker-restores/)
  • The cluster is stuck in the provisioning state

[Screenshot: cluster stuck in the provisioning state, 2022-03-12]

Result

  • Rancher logs:
x6cb: Node "sowmya-rke2-263-pool3-02226a03-6vtgx" not found
2022/03/11 23:43:06 [ERROR] Unable to retrieve Node status: error retrieving node sowmya-rke2-263-pool3-02226a03-g56wd for machine fleet-default/sowmya-rke2-263-pool3-778dd4448d-cz8fx: Node "sowmya-rke2-263-pool3-02226a03-g56wd" not found
2022/03/11 23:43:06 [ERROR] Unable to retrieve Node status: error retrieving node sowmya-rke2-263-pool3-02226a03-tmwmj for machine fleet-default/sowmya-rke2-263-pool3-778dd4448d-rc2qh: Node "sowmya-rke2-263-pool3-02226a03-tmwmj" not found
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool1-5b8559c8dc-md9x4" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool3-778dd4448d-2x6cb" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool2-78d868bf74-jbrg6" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool1-5b8559c8dc-sh9ff" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool2-78d868bf74-mm5m6" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool3-778dd4448d-rc2qh" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool3-778dd4448d-cz8fx" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool1-5b8559c8dc-kkk5w" in namespace "fleet-default": cannot find node with matching ProviderID
  • The Rancher logs are flooded with the errors above (see the sketch below for how the providerID mismatch behind them can be checked)
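
For context on what these errors mean: each CAPI Machine object in the fleet-default namespace records a providerID, and the reconciler fails because no Node in the downstream cluster carries a matching providerID anymore; the rolled-back Rancher state still references the pre-upgrade machines. Below is a minimal, untested sketch of how that mismatch could be confirmed with the Kubernetes Python client. The kubeconfig context names are placeholders, and the Machine API version may differ depending on the embedded cluster-api release.

```python
# Untested sketch (Kubernetes Python client): cross-check the providerID on each
# CAPI Machine in the Rancher local cluster against the Nodes of the downstream
# cluster. Context names are placeholders; cluster.x-k8s.io/v1beta1 is assumed.
from kubernetes import client, config

local = config.new_client_from_config(context="rancher-local")          # placeholder context
downstream = config.new_client_from_config(context="downstream-rke2")   # placeholder context

machines = client.CustomObjectsApi(local).list_namespaced_custom_object(
    group="cluster.x-k8s.io", version="v1beta1",
    namespace="fleet-default", plural="machines")

# providerIDs actually present on the downstream cluster's Nodes
node_provider_ids = {
    n.spec.provider_id
    for n in client.CoreV1Api(downstream).list_node().items
    if n.spec.provider_id
}

for m in machines["items"]:
    pid = m["spec"].get("providerID")
    match = "ok" if pid in node_provider_ids else "NO MATCHING NODE"
    print(f'{m["metadata"]["name"]}: providerID={pid} -> {match}')
```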

Expected:

  • I am not sure that replacing the nodes on a Rancher server upgrade is the best option, as mentioned here, since on a rollback the cluster is broken.
  • Also, the Rancher logs should not be flooded with these error messages.
sowmyav27 self-assigned this Mar 11, 2022
sowmyav27 added the kind/bug-qa, team/hostbusters, and area/capr/rke2 labels Mar 11, 2022
sowmyav27 added this to the v2.6.4 milestone Mar 11, 2022
@snasovich (Collaborator) commented Mar 14, 2022

@thedadams to look into it time permitting; the fix may be on the backup-restore operator side.
Edit: The backup-restore operator approach is not applicable because the backup/restore was done via the Docker install path, per https://rancher.com/docs/rancher/v2.6/en/backups/docker-installs/docker-restores/

@snasovich (Collaborator) commented Mar 15, 2022

Moving this one to 2.6.5 to properly address it there. As of now the plan is to treat this as a known issue and release-note it.
Edit: Keeping the milestone at 2.6.4 so it is visible for the release notes.

snasovich modified the milestones: v2.6.5, v2.6.4 Mar 15, 2022
snasovich added the release-note label and removed the status/release-blocker label Mar 15, 2022
snasovich modified the milestones: v2.6.4, v2.6.5 Mar 16, 2022
@thedadams (Contributor) commented

Since so many new features were added to RKE2 provisioning between v2.6.3 and v2.6.4, and RKE2 provisioning is in Tech Preview, it was deemed acceptable for RKE2 clusters to be re-provisioned on upgrade from v2.6.3 to v2.6.4.

Because of this, new machines are provisioned for RKE2 clusters on upgrade. This means that rolling back to v2.6.3 after the upgrade will cause this issue: the nodes that existed when Rancher was at v2.6.3 no longer exist.

Manual steps can likely be applied to the cluster and machine objects on rollback to get the cluster to reconcile with the new nodes; a rough starting point is sketched below. However, this has not been tested thoroughly.
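
An untested illustration of where such manual inspection could start, using the Kubernetes Python client against the Rancher local cluster: dump each Machine's phase and conditions in fleet-default so the stale pre-upgrade objects can be identified before any manual edits are attempted. The API version (cluster.x-k8s.io/v1beta1) is an assumption and may differ by release.

```python
# Untested sketch (Kubernetes Python client): list each CAPI Machine's phase and
# conditions in fleet-default to spot stale pre-upgrade objects before editing.
from kubernetes import client, config

config.load_kube_config()  # assumes the kubeconfig points at the Rancher local cluster
api = client.CustomObjectsApi()

machines = api.list_namespaced_custom_object(
    group="cluster.x-k8s.io", version="v1beta1",
    namespace="fleet-default", plural="machines")

for m in machines["items"]:
    status = m.get("status", {})
    print(m["metadata"]["name"], "phase:", status.get("phase"))
    for cond in status.get("conditions", []):
        print("   ", cond.get("type"), cond.get("status"), cond.get("message", ""))
```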

@slickwarren (Contributor) commented

notes:

  • this is specific to RKE2-provisioned clusters (2.5.x is not affected, even if RKE2 clusters exist there)
  • actionable item: we need a documented way to (manually) roll back from 2.6.4+ to 2.6.3 or earlier
  • if this isn't done already, we should have a clear alert for customers upgrading from 2.6.3 or earlier that they will need manual steps if, for whatever reason, they roll back from 2.6.4+

@snasovich (Collaborator) commented

This will (should) not be an issue for the 2.6.4 -> 2.6.5 upgrade because we are not adding anything extra to machines or deployments (which is what triggers reprovisioning).
So, we will address this issue if we need to introduce new fields in later releases.
Meanwhile, this behavior has been release-noted in 2.6.4, which should be sufficient.

snasovich modified the milestones: v2.6.5, v2.6.x Apr 14, 2022
snasovich removed the release-note label Apr 14, 2022
@Jono-SUSE-Rancher (Contributor) commented

This has since been resolved with the improvements made to provisioning. We can re-open or re-create this issue if we see something similar, since we are a long way from a v2.6 upgrade followed by a rollback to 2.6.3 happening now.
