Rancher backup | Can't rollback (restore) Rancher from 2.6-head to 2.6.3 #36803

Closed
sgapanovich opened this issue Mar 8, 2022 · 15 comments

Labels: feature/charts-backup-restore, kind/bug-qa, release-note, status/release-blocker, team/area3

Comments

@sgapanovich commented Mar 8, 2022

Rancher Server Setup

  • Rancher version: 2.6.3
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): rke1
  • Proxy/Cert Details: self-signed

Describe the bug

The Rancher server does not come up active when rolling back (restoring) Rancher to a backup taken on 2.6.3.

To Reproduce

Official docs: https://rancher.com/docs/rancher/v2.6/en/installation/install-rancher-on-k8s/rollbacks/

  1. Create Rancher on 2.6.3
  2. Save the kubeconfig for the local cluster (you will need it later, when the Rancher UI is not available)
  3. Install Rancher Backup (you need an S3 bucket ready and a secret to access it)
  4. Take a backup
  5. Upgrade Rancher to 2.6-head (3f9109c)
  6. After Rancher is upgraded, scale the Rancher deployment down to 0
  7. Create a Restore YAML file with your bucket info and the secret to access your bucket (see the Rancher docs above for more info; a hedged example follows this list), then run kubectl create -f <file>
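
A minimal sketch of step 7, assuming the backup-restore operator's resources.cattle.io/v1 Restore resource; the credential secret, bucket details, and backup filename are placeholders, and the restore name simply matches the one visible in the error log below:

# Sketch only: secret, bucket, region, endpoint, and backup filename are placeholders
cat <<EOF | kubectl create -f -
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-migration
spec:
  backupFilename: rancher-backup-2.6.3.tar.gz
  storageLocation:
    s3:
      credentialSecretName: s3-creds
      credentialSecretNamespace: cattle-resources-system
      bucketName: rancher-backups
      folder: rancher
      region: us-east-1
      endpoint: s3.us-east-1.amazonaws.com
EOF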

Result
The restore never completes, and the following error is logged in the rancher-backup pod (kubectl logs rancher-backup-6495d4976b-896b5 -n cattle-resources-system):

ERRO[2022/03/08 16:49:39] Error restoring CRDs restoreCRDs: restoreResource: err updating resource CustomResourceDefinition.apiextensions.k8s.io "machinedeployments.cluster.x-k8s.io" is invalid: status.storedVersions[1]: Invalid value: "v1beta1": must appear in spec.versions
ERRO[2022/03/08 16:49:39] error syncing 'restore-migration': handler restore: error restoring CRDs, check logs for exact error, requeuing

apiVersion differences between 2.6.3 and 2.6-head (diff screenshot attached in the original issue)
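
Not part of the original report, but the version skew can also be inspected directly on an affected CRD, for example:

# Versions defined/served by the CRD spec
kubectl get crd machinedeployments.cluster.x-k8s.io -o jsonpath='{.spec.versions[*].name}{"\n"}'

# Versions recorded as stored in etcd; on 2.6-head this includes v1beta1,
# which the 2.6.3 CRD spec does not define, hence the error above
kubectl get crd machinedeployments.cluster.x-k8s.io -o jsonpath='{.status.storedVersions}{"\n"}'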

Expected Result

Rancher restores to the state captured in the backup

sgapanovich added the feature/charts-backup-restore and kind/bug-qa labels Mar 8, 2022
sgapanovich added this to the v2.6.4 milestone Mar 8, 2022
@SheilaghM

@superseb - Please look at this. Can this be fixed for 2.6.4 or do we need to push to 2.6.5?

@superseb commented Mar 11, 2022

This is also seen when deploying v2.6.3, updating the image to v2.6-head, and then going back to v2.6.3 (just a rollback, without a restore):

2022/03/10 18:14:11 [FATAL] failed to update clusters.cluster.x-k8s.io apiextensions.k8s.io/v1, Kind=CustomResourceDefinition for  clusters.cluster.x-k8s.io: CustomResourceDefinition.apiextensions.k8s.io "clusters.cluster.x-k8s.io" is invalid: status.storedVersions[1]: Invalid value: "v1beta1": must appear in spec.versions

This seems to be related to the newer CRD versions introduced by bumping cluster-api from v0.4.4 to v1.0.2 in aaac1df, and the accompanying v1beta1 changes.

So far it seems to be caused by the CRDs that were updated with a new version (v1beta1): the rollback to v2.6.3 (or the restore) wants to apply the CRDs from the previous version, which do not define v1beta1, while status.storedVersions still lists v1beta1; that version is only available in the newer lib (the v2.6.4 prereleases), so the attempt to merge/modify the CRD is rejected.

Some options that were thought of:

  • Delete/recreate the CRDs: this can't happen, as we have resources created from these CRDs (both for rollback and restore). For a restore when migrating to a new cluster we don't have this issue, since there are no existing CRDs, but for any other scenario we need to change the existing CRDs
  • Automate the needed modification to the CRDs on Rancher startup and in backup-restore-operator; as Rancher only starts after the restore, we can't have it only in Rancher, it also needs to exist in backup-restore-operator

Some code examples:

https://github.com/oam-dev/oam-crd-migration/blob/70d9093fb4f8315df422b53bc0b46fe87d62a6a8/remove/remove.go#L27

https://github.com/rohankumardubey/cert-manager/blob/b1180c59ad588e73ac25b0d70a86661cf7c180e1/cmd/ctl/pkg/upgrade/migrateapiversion/migrator.go#L199


For testing purposes, I've used this to get the CRDs into a state where a rollback would succeed (haven't tested with restore, but it should be the same):

# Make sure you have `kubectl proxy` active

curl -d '[{ "op": "replace", "path":"/status/storedVersions", "value": ["v1alpha4"] }]' \
  -H "Content-Type: application/json-patch+json" \
  -X PATCH \
 http://localhost:8001/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/clusters.cluster.x-k8s.io/status

curl -d '[{ "op": "replace", "path":"/status/storedVersions", "value": ["v1alpha4"] }]' \
  -H "Content-Type: application/json-patch+json" \
  -X PATCH \
 http://localhost:8001/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/machinedeployments.cluster.x-k8s.io/status

curl -d '[{ "op": "replace", "path":"/status/storedVersions", "value": ["v1alpha4"] }]' \
  -H "Content-Type: application/json-patch+json" \
  -X PATCH \
 http://localhost:8001/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/machinehealthchecks.cluster.x-k8s.io/status

curl -d '[{ "op": "replace", "path":"/status/storedVersions", "value": ["v1alpha4"] }]' \
  -H "Content-Type: application/json-patch+json" \
  -X PATCH \
 http://localhost:8001/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/machines.cluster.x-k8s.io/status

curl -d '[{ "op": "replace", "path":"/status/storedVersions", "value": ["v1alpha4"] }]' \
  -H "Content-Type: application/json-patch+json" \
  -X PATCH \
 http://localhost:8001/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/machinesets.cluster.x-k8s.io/status
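
The same patch can be applied to all five cluster-api CRDs in one go; a compact equivalent of the commands above (same endpoint and payload, just looped):

for crd in clusters machinedeployments machinehealthchecks machines machinesets; do
  curl -d '[{ "op": "replace", "path":"/status/storedVersions", "value": ["v1alpha4"] }]' \
    -H "Content-Type: application/json-patch+json" \
    -X PATCH \
    "http://localhost:8001/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/${crd}.cluster.x-k8s.io/status"
done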

While we are discussing/figuring out what we can do next, can QA validate the workaround above?

@thedadams
Copy link
Contributor

It seems that this issue only affects HA Rancher.

I followed these steps to upgrade the Docker install of Rancher and then followed these steps to roll back the Docker install of Rancher, and both were successful.
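
For context, a rough sketch of the single-node Docker upgrade/rollback flow those docs describe, which carries /var/lib/rancher between versions via a data container (container names, ports, and tags below are illustrative, not from the original comment):

# Stop the running server and snapshot its data into a data container
docker stop rancher-server
docker create --volumes-from rancher-server --name rancher-data rancher/rancher:v2.6.3

# Archive /var/lib/rancher so a later rollback can restore it
docker run --volumes-from rancher-data -v "$PWD:/backup" busybox \
  tar zcvf /backup/rancher-data-backup-v2.6.3.tar.gz /var/lib/rancher

# Start the new (or, when rolling back, the previous) version from that data
docker run -d --volumes-from rancher-data \
  --restart=unless-stopped -p 80:80 -p 443:443 --privileged \
  rancher/rancher:v2.6-head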

@jiaqiluo

@superseb's comment explains the issue and cause pretty well. However, the proposed fix (patching the CRDs) does not work, so we have to (manually) clean up the cluster before restoring the backup.

We have made two scripts to make this easier: one for cleaning up the cluster, and another for checking whether any Rancher-related resources remain in the cluster. They are available here: https://github.com/rancherlabs/support-tools/tree/master/cleanup-rancher-k8s-resources
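
A rough usage sketch, assuming the scripts are run against the local (Rancher management) cluster; verify.sh appears later in this thread, while the cleanup script's exact filename is an assumption and should be checked in the repository:

git clone https://github.com/rancherlabs/support-tools.git
cd support-tools/cleanup-rancher-k8s-resources

# Point kubectl at the local cluster Rancher was installed into
export KUBECONFIG=/path/to/local-cluster.yaml

# List the Rancher-related resources that are still present
./verify.sh

# Remove Rancher-created namespaces, CRDs, webhooks, etc. (assumed filename)
./cleanup.sh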

As a result, we need to update the instructions for restoring a Rancher backup in the docs to add the new requirement.
https://rancher.com/docs/rancher/v2.6/en/backups/restoring-rancher/

Also, add it to the release notes.

@Jono-SUSE-Rancher

Thanks for the update, Jack. @jtravee - please make sure this gets release-noted with the information Jack provided regarding the scripts. As for the ticket, we will move it to 2.6.5 once you've release-noted it.

@MKlimuszka

There are no fixes going into rancher or rancher-backup-restore.
The cleanup script is ready and has been passed to QA.

@anupama2501

The cleanup script was previously hanging on namespaces that have finalizers.

@superseb provided a new fix: rancherlabs/support-tools#160

@anupama2501

Reopening the ticket:

Rancher upgraded from 2.5.12 to 2.6-head (85d6925)

  • Created 5 downstream clusters on v2.5.12 - 2 RKE1 node driver clusters with RKE1 templates, 1 RKE1 custom cluster, and 1 RKE1 cluster with no templates.
  • Created a few workloads and ingresses on these clusters.
  • Upgraded Rancher to v2.6-head.
  • Ran the cleanup script on the local cluster.
  • Verified after the script completed successfully that all the namespaces, CRDs, etc. Rancher created were removed, but saw that the following were not cleaned up:
 ./verify.sh
NAME                 DATA   AGE
cattle-controllers   0      54m
namespace/fleet-system

@jiaqiluo

The fix has been added to the above-mentioned PR.

@anupama2501 commented Mar 25, 2022

Validated the issue by:
upgrading Rancher from 2.6.3 to 2.6.4-rc13 and rolling back to 2.6.3
upgrading Rancher from 2.5.12 to 2.6.4-rc13 and rolling back to 2.5.12

  1. Created 5 downstream clusters: 1 with 5 nodes (3 worker, 1 etcd, 1 cp), 2 with 3 nodes (1 etcd, 1 cp, 1 worker), and 2 with 3 nodes (1 etcd+cp, 2 worker)
  2. Ran pre-upgrade checks by creating a few workloads and ingresses
  3. Installed the rancher-backup charts: 2.1.0 for 2.6.3 and 1.2.100+up1.2.1 for 2.5.12, respectively (a hedged install sketch follows this list)
  4. Upgraded Rancher to 2.6.4-rc13 and ran post-upgrade checks by validating the previously created workloads and creating additional workloads and ingresses
  5. Ran the verify script, which lists everything that needs to be cleaned up
  6. Ran the cleanup script and verified that all the namespaces, CRDs, etc. created by Rancher were cleaned up. The script took approximately 45 minutes to complete.
  7. Verified with the verify script that there were no remaining namespaces, configmaps, CRDs, etc.
  8. Followed the docs https://rancher.com/docs/rancher/v2.6/en/backups/migrating-rancher/, excluding step 3 from the docs.
  9. Installed Rancher 2.6.3 and 2.5.12 respectively on the clusters.
  10. Ran pre-upgrade checks and verified the tests passed.
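
For reference, a hedged sketch of step 3's chart install for the 2.6.3 setup, assuming the rancher-charts Helm repository; the chart version matches the one named in step 3, and the release names/namespace follow the usual rancher-backup layout:

helm repo add rancher-charts https://charts.rancher.io
helm repo update

# CRD chart first, then the operator chart
helm install rancher-backup-crd rancher-charts/rancher-backup-crd \
  -n cattle-resources-system --create-namespace --version 2.1.0
helm install rancher-backup rancher-charts/rancher-backup \
  -n cattle-resources-system --version 2.1.0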

@anupama2501 commented Mar 25, 2022

Keeping this issue open as the PR for the standalone cleanup and verify scripts is not merged.

@jiaqiluo

Moving this issue to to-test because the linked PR is merged.

@anupama2501

Closing the issue; validations are noted here: #36803 (comment)

@andruwa13

kubectl delete -n fleet-default $(kubectl get clusterregistrations -n fleet-default -o name)
