
RKE2 Cluster is stuck in provisioning state after an upgrade to 2.6-head and rollback to 2.6.3 #36859

Closed
sowmyav27 opened this issue Mar 11, 2022 · 6 comments

Labels: area/capr/rke2 (RKE2 Provisioning issues involving CAPR), kind/bug-qa (Issues that have not yet hit a real release. Bugs introduced by a new feature or enhancement), regression, team/hostbusters (The team that is responsible for provisioning/managing downstream clusters + K8s version support)

@sowmyav27 (Contributor) commented Mar 11, 2022

Rancher Server Setup

  • Rancher version: 2.6-head commit id: bab14a
  • Installation option (Docker install/Helm Chart): docker install

Information about the Cluster

  • Kubernetes version: v1.21.9+rke2r1
  • Cluster Type (Local/Downstream): Downstream; 3 etcd, 2 control plane, and 3 worker nodes

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom) admin

Describe the bug
RKE2 Cluster is stuck in provisioning state after an upgrade to 2.6-head and rollback to 2.6.3

To Reproduce

  • Deploy a 3 etcd, 2 control plane, and 3 worker node RKE2 cluster on v1.21.9+rke2r1
  • Deploy resources on the downstream cluster
  • Upgrade Rancher to 2.6-head
  • The cluster is in an error state for some time before going back to Active, and the nodes start getting deleted and reprovisioned (bug logged separately)
  • Noticed that the nodes were deleted in parallel: 1 etcd, 1 control plane, and 1 worker node were deleted and reprovisioned at the same time
  • Existing resources on the cluster worked fine, and new resources could be deployed
  • Roll back Rancher to 2.6.3 (via the Docker install restore path per https://rancher.com/docs/rancher/v2.6/en/backups/docker-installs/docker-restores/)
  • The cluster is stuck in the provisioning state

[Screenshot: cluster stuck in the provisioning state, 2022-03-12]

Result

  • Rancher logs:
x6cb: Node "sowmya-rke2-263-pool3-02226a03-6vtgx" not found
2022/03/11 23:43:06 [ERROR] Unable to retrieve Node status: error retrieving node sowmya-rke2-263-pool3-02226a03-g56wd for machine fleet-default/sowmya-rke2-263-pool3-778dd4448d-cz8fx: Node "sowmya-rke2-263-pool3-02226a03-g56wd" not found
2022/03/11 23:43:06 [ERROR] Unable to retrieve Node status: error retrieving node sowmya-rke2-263-pool3-02226a03-tmwmj for machine fleet-default/sowmya-rke2-263-pool3-778dd4448d-rc2qh: Node "sowmya-rke2-263-pool3-02226a03-tmwmj" not found
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool1-5b8559c8dc-md9x4" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool3-778dd4448d-2x6cb" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool2-78d868bf74-jbrg6" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool1-5b8559c8dc-sh9ff" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool2-78d868bf74-mm5m6" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool3-778dd4448d-rc2qh" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool3-778dd4448d-cz8fx" in namespace "fleet-default": cannot find node with matching ProviderID
2022/03/11 23:43:14 [ERROR] Reconciler error: no matching Node for Machine "sowmya-rke2-263-pool1-5b8559c8dc-kkk5w" in namespace "fleet-default": cannot find node with matching ProviderID
  • The Rancher logs are flooded with the errors above (see the sketch below for how the providerID mismatch behind them can be checked)
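
For context on what these errors mean: each CAPI Machine object in the fleet-default namespace records a providerID, and the reconciler fails because no Node in the downstream cluster carries a matching providerID anymore; the rolled-back Rancher state still references the pre-upgrade machines. Below is a minimal, untested sketch of how that mismatch could be confirmed with the Kubernetes Python client. The kubeconfig context names are placeholders, and the Machine API version may differ depending on the embedded cluster-api release.

```python
# Untested sketch (Kubernetes Python client): cross-check the providerID on each
# CAPI Machine in the Rancher local cluster against the Nodes of the downstream
# cluster. Context names are placeholders; cluster.x-k8s.io/v1beta1 is assumed.
from kubernetes import client, config

local = config.new_client_from_config(context="rancher-local")          # placeholder context
downstream = config.new_client_from_config(context="downstream-rke2")   # placeholder context

machines = client.CustomObjectsApi(local).list_namespaced_custom_object(
    group="cluster.x-k8s.io", version="v1beta1",
    namespace="fleet-default", plural="machines")

# providerIDs actually present on the downstream cluster's Nodes
node_provider_ids = {
    n.spec.provider_id
    for n in client.CoreV1Api(downstream).list_node().items
    if n.spec.provider_id
}

for m in machines["items"]:
    pid = m["spec"].get("providerID")
    match = "ok" if pid in node_provider_ids else "NO MATCHING NODE"
    print(f'{m["metadata"]["name"]}: providerID={pid} -> {match}')
```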

Expected:

  • I am not sure that replacing the nodes on a Rancher server upgrade is the best option, as mentioned here, since on a rollback the cluster is broken.
  • Also, the Rancher logs should not be flooded with these error messages.
sowmyav27 self-assigned this Mar 11, 2022
sowmyav27 added the kind/bug-qa, team/hostbusters, and area/capr/rke2 labels Mar 11, 2022
sowmyav27 added this to the v2.6.4 milestone Mar 11, 2022
@snasovich (Collaborator) commented Mar 14, 2022

@thedadams to look into it time permitting; the fix may be on the backup-restore operator side.
Edit: The backup-restore operator approach is not applicable because the backup/restore was done via the Docker install path, per https://rancher.com/docs/rancher/v2.6/en/backups/docker-installs/docker-restores/

@snasovich (Collaborator) commented Mar 15, 2022

Moving this one to 2.6.5 to properly address it there. As of now the plan is to treat this as a known issue and release-note it.
Edit: Keeping the milestone at 2.6.4 so it is visible for the release notes.

snasovich modified the milestones: v2.6.5, v2.6.4 Mar 15, 2022
snasovich added the release-note label and removed the status/release-blocker label Mar 15, 2022
snasovich modified the milestones: v2.6.4, v2.6.5 Mar 16, 2022
@thedadams (Contributor) commented

Since so many new features were added to RKE2 provisioning between v2.6.3 and v2.6.4, and RKE2 provisioning is in Tech Preview, it was deemed acceptable for RKE2 clusters to be re-provisioned on upgrade from v2.6.3 to v2.6.4.

Because of this, new machines are provisioned for RKE2 clusters on upgrade. This means that rolling back to v2.6.3 after the upgrade will cause this issue: the nodes that existed when Rancher was at v2.6.3 no longer exist.

Manual steps can likely be applied to the cluster and machine objects on rollback to get the cluster to reconcile with the new nodes; a rough starting point is sketched below. However, this has not been tested thoroughly.
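
An untested illustration of where such manual inspection could start, using the Kubernetes Python client against the Rancher local cluster: dump each Machine's phase and conditions in fleet-default so the stale pre-upgrade objects can be identified before any manual edits are attempted. The API version (cluster.x-k8s.io/v1beta1) is an assumption and may differ by release.

```python
# Untested sketch (Kubernetes Python client): list each CAPI Machine's phase and
# conditions in fleet-default to spot stale pre-upgrade objects before editing.
from kubernetes import client, config

config.load_kube_config()  # assumes the kubeconfig points at the Rancher local cluster
api = client.CustomObjectsApi()

machines = api.list_namespaced_custom_object(
    group="cluster.x-k8s.io", version="v1beta1",
    namespace="fleet-default", plural="machines")

for m in machines["items"]:
    status = m.get("status", {})
    print(m["metadata"]["name"], "phase:", status.get("phase"))
    for cond in status.get("conditions", []):
        print("   ", cond.get("type"), cond.get("status"), cond.get("message", ""))
```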

@slickwarren (Contributor) commented

notes:

  • this is specific to RKE2-provisioned clusters (2.5.x is not affected, even if RKE2 clusters exist there)
  • actionable item: we need a documented way to (manually) roll back from 2.6.4+ to 2.6.3 or earlier
  • if this isn't done already, we should have a clear alert for customers upgrading from 2.6.3 or earlier that they will need manual steps if, for whatever reason, they roll back from 2.6.4+

@snasovich (Collaborator) commented

This will (should) not be an issue for the 2.6.4 -> 2.6.5 upgrade because we are not adding anything extra to machines or deployments (which is what triggers reprovisioning).
So, we will address this issue if we need to introduce new fields in later releases.
Meanwhile, this behavior has been release-noted in 2.6.4, which should be sufficient.

snasovich modified the milestones: v2.6.5, v2.6.x Apr 14, 2022
snasovich removed the release-note label Apr 14, 2022
@Jono-SUSE-Rancher (Contributor) commented

This has since been resolved with the improvements made to provisioning. We can re-open or re-create this issue if we see something similar, since we are a long way from a v2.6 upgrade followed by a rollback to 2.6.3 happening now.
