Rancher backup | Can't rollback (restore) Rancher from 2.6-head to 2.6.3 #36803

Closed
sgapanovich opened this issue Mar 8, 2022 · 15 comments

Labels: feature/charts-backup-restore, kind/bug-qa, release-note, status/release-blocker, team/area3

Comments

@sgapanovich commented Mar 8, 2022

Rancher Server Setup

  • Rancher version: 2.6.3
  • Installation option (Docker install/Helm Chart): Helm
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): rke1
  • Proxy/Cert Details: self-signed

Describe the bug

The Rancher server does not come up active when rolling back (restoring) Rancher to a backup taken on 2.6.3.

To Reproduce

Official docs: https://rancher.com/docs/rancher/v2.6/en/installation/install-rancher-on-k8s/rollbacks/

  1. Create Rancher on 2.6.3
  2. Save the kubeconfig for the local cluster (you will need it later, when the Rancher UI is not available)
  3. Install Rancher Backup (you need an S3 bucket ready and a secret to access it)
  4. Take a backup
  5. Upgrade Rancher to 2.6-head (3f9109c)
  6. After Rancher is upgraded, scale the Rancher deployment down to 0
  7. Create a Restore YAML file with your bucket info and the secret to access your bucket (see the Rancher docs above for more info; a hedged example follows this list), then run kubectl create -f <file>
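
A minimal sketch of step 7, assuming the backup-restore operator's resources.cattle.io/v1 Restore resource; the credential secret, bucket details, and backup filename are placeholders, and the restore name simply matches the one visible in the error log below:

# Sketch only: secret, bucket, region, endpoint, and backup filename are placeholders
cat <<EOF | kubectl create -f -
apiVersion: resources.cattle.io/v1
kind: Restore
metadata:
  name: restore-migration
spec:
  backupFilename: rancher-backup-2.6.3.tar.gz
  storageLocation:
    s3:
      credentialSecretName: s3-creds
      credentialSecretNamespace: cattle-resources-system
      bucketName: rancher-backups
      folder: rancher
      region: us-east-1
      endpoint: s3.us-east-1.amazonaws.com
EOF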

Result
The restore never completes, and the following error is logged in the rancher-backup pod (kubectl logs rancher-backup-6495d4976b-896b5 -n cattle-resources-system):

ERRO[2022/03/08 16:49:39] Error restoring CRDs restoreCRDs: restoreResource: err updating resource CustomResourceDefinition.apiextensions.k8s.io "machinedeployments.cluster.x-k8s.io" is invalid: status.storedVersions[1]: Invalid value: "v1beta1": must appear in spec.versions
ERRO[2022/03/08 16:49:39] error syncing 'restore-migration': handler restore: error restoring CRDs, check logs for exact error, requeuing

apiVersion differences between 2.6.3 and 2.6-head (diff screenshot attached in the original issue)
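
Not part of the original report, but the version skew can also be inspected directly on an affected CRD, for example:

# Versions defined/served by the CRD spec
kubectl get crd machinedeployments.cluster.x-k8s.io -o jsonpath='{.spec.versions[*].name}{"\n"}'

# Versions recorded as stored in etcd; on 2.6-head this includes v1beta1,
# which the 2.6.3 CRD spec does not define, hence the error above
kubectl get crd machinedeployments.cluster.x-k8s.io -o jsonpath='{.status.storedVersions}{"\n"}'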

Expected Result

Rancher restores to the state captured in the backup

sgapanovich added the feature/charts-backup-restore and kind/bug-qa labels Mar 8, 2022
sgapanovich added this to the v2.6.4 milestone Mar 8, 2022
@SheilaghM

@superseb - Please look at this. Can this be fixed for 2.6.4 or do we need to push to 2.6.5?

@superseb commented Mar 11, 2022

This is also seen when deploying v2.6.3, updating the image to v2.6-head, and then going back to v2.6.3 (just a rollback, without a restore):

2022/03/10 18:14:11 [FATAL] failed to update clusters.cluster.x-k8s.io apiextensions.k8s.io/v1, Kind=CustomResourceDefinition for  clusters.cluster.x-k8s.io: CustomResourceDefinition.apiextensions.k8s.io "clusters.cluster.x-k8s.io" is invalid: status.storedVersions[1]: Invalid value: "v1beta1": must appear in spec.versions

This seems to be related to the newer CRD versions introduced by bumping cluster-api from v0.4.4 to v1.0.2 in aaac1df, and the accompanying v1beta1 changes.

So far it seems to be caused by the CRDs that were updated with a new version (v1beta1): the rollback to v2.6.3 (or the restore) wants to apply the CRDs from the previous version, which do not define v1beta1, while status.storedVersions still lists v1beta1; that version is only available in the newer lib (the v2.6.4 prereleases), so the attempt to merge/modify the CRD is rejected.

Some options that were thought of:

  • Delete/recreate the CRDs: this can't happen, as we have resources created from these CRDs (both for rollback and restore). For a restore when migrating to a new cluster we don't have this issue, since there are no existing CRDs, but for any other scenario we need to change the existing CRDs
  • Automate the needed modification to the CRDs on Rancher startup and in backup-restore-operator; as Rancher only starts after the restore, we can't have it only in Rancher, it also needs to exist in backup-restore-operator

Some code examples:

https://github.com/oam-dev/oam-crd-migration/blob/70d9093fb4f8315df422b53bc0b46fe87d62a6a8/remove/remove.go#L27

https://github.com/rohankumardubey/cert-manager/blob/b1180c59ad588e73ac25b0d70a86661cf7c180e1/cmd/ctl/pkg/upgrade/migrateapiversion/migrator.go#L199


For testing purposes, I've used this to get the CRDs into a state where a rollback would succeed (haven't tested with restore, but it should be the same):

# Make sure you have `kubectl proxy` active

curl -d '[{ "op": "replace", "path":"/status/storedVersions", "value": ["v1alpha4"] }]' \
  -H "Content-Type: application/json-patch+json" \
  -X PATCH \
 http://localhost:8001/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/clusters.cluster.x-k8s.io/status

curl -d '[{ "op": "replace", "path":"/status/storedVersions", "value": ["v1alpha4"] }]' \
  -H "Content-Type: application/json-patch+json" \
  -X PATCH \
 http://localhost:8001/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/machinedeployments.cluster.x-k8s.io/status

curl -d '[{ "op": "replace", "path":"/status/storedVersions", "value": ["v1alpha4"] }]' \
  -H "Content-Type: application/json-patch+json" \
  -X PATCH \
 http://localhost:8001/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/machinehealthchecks.cluster.x-k8s.io/status

curl -d '[{ "op": "replace", "path":"/status/storedVersions", "value": ["v1alpha4"] }]' \
  -H "Content-Type: application/json-patch+json" \
  -X PATCH \
 http://localhost:8001/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/machines.cluster.x-k8s.io/status

curl -d '[{ "op": "replace", "path":"/status/storedVersions", "value": ["v1alpha4"] }]' \
  -H "Content-Type: application/json-patch+json" \
  -X PATCH \
 http://localhost:8001/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/machinesets.cluster.x-k8s.io/status
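
The same patch can be applied to all five cluster-api CRDs in one go; a compact equivalent of the commands above (same endpoint and payload, just looped):

for crd in clusters machinedeployments machinehealthchecks machines machinesets; do
  curl -d '[{ "op": "replace", "path":"/status/storedVersions", "value": ["v1alpha4"] }]' \
    -H "Content-Type: application/json-patch+json" \
    -X PATCH \
    "http://localhost:8001/apis/apiextensions.k8s.io/v1beta1/customresourcedefinitions/${crd}.cluster.x-k8s.io/status"
done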

While we are discussing/figuring out what we can do next, can QA validate the workaround above?

@thedadams
Copy link
Contributor

It seems that this issue only affects HA Rancher.

I followed these steps to upgrade the Docker install of Rancher and then followed these steps to roll back the Docker install of Rancher, and both were successful.
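
For context, a rough sketch of the single-node Docker upgrade/rollback flow those docs describe, which carries /var/lib/rancher between versions via a data container (container names, ports, and tags below are illustrative, not from the original comment):

# Stop the running server and snapshot its data into a data container
docker stop rancher-server
docker create --volumes-from rancher-server --name rancher-data rancher/rancher:v2.6.3

# Archive /var/lib/rancher so a later rollback can restore it
docker run --volumes-from rancher-data -v "$PWD:/backup" busybox \
  tar zcvf /backup/rancher-data-backup-v2.6.3.tar.gz /var/lib/rancher

# Start the new (or, when rolling back, the previous) version from that data
docker run -d --volumes-from rancher-data \
  --restart=unless-stopped -p 80:80 -p 443:443 --privileged \
  rancher/rancher:v2.6-head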

@jiaqiluo

@superseb's comment explains the issue and cause pretty well. However, the proposed fix (patching the CRDs) does not work, so we have to (manually) clean up the cluster before restoring the backup.

We have made two scripts to make this easier: one for cleaning up the cluster, and another for checking whether any Rancher-related resources remain in the cluster. They are available here: https://github.com/rancherlabs/support-tools/tree/master/cleanup-rancher-k8s-resources
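
A rough usage sketch, assuming the scripts are run against the local (Rancher management) cluster; verify.sh appears later in this thread, while the cleanup script's exact filename is an assumption and should be checked in the repository:

git clone https://github.com/rancherlabs/support-tools.git
cd support-tools/cleanup-rancher-k8s-resources

# Point kubectl at the local cluster Rancher was installed into
export KUBECONFIG=/path/to/local-cluster.yaml

# List the Rancher-related resources that are still present
./verify.sh

# Remove Rancher-created namespaces, CRDs, webhooks, etc. (assumed filename)
./cleanup.sh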

As a result, we need to update the instructions for restoring a Rancher backup in the docs to add the new requirement.
https://rancher.com/docs/rancher/v2.6/en/backups/restoring-rancher/

Also, add it to the release notes.

@Jono-SUSE-Rancher

Thanks for the update, Jack. @jtravee - please make sure this gets release-noted with the information Jack provided regarding the scripts. As for the ticket, we will move it to 2.6.5 once you've release-noted it.

@MKlimuszka

There are no fixes going into rancher or rancher-backup-restore.
The cleanup script is ready and has been passed to QA.

@anupama2501

The cleanup script was previously hanging on namespaces that have finalizers.

@superseb provided a new fix: rancherlabs/support-tools#160

@anupama2501

Reopening the ticket:

Rancher upgraded from 2.5.12 to 2.6-head (85d6925)

  • Created 5 downstream clusters on v2.5.12 - 2 RKE1 node driver clusters with RKE1 templates, 1 RKE1 custom cluster, and 1 RKE1 cluster with no templates.
  • Created a few workloads and ingresses on these clusters.
  • Upgraded Rancher to v2.6-head.
  • Ran the cleanup script on the local cluster.
  • Verified after the script completed successfully that all the namespaces, CRDs, etc. Rancher created were removed, but saw that the following were not cleaned up:
 ./verify.sh
NAME                 DATA   AGE
cattle-controllers   0      54m
namespace/fleet-system

@jiaqiluo

The fix has been added to the above-mentioned PR.

@anupama2501 commented Mar 25, 2022

Validated the issue by:
upgrading Rancher from 2.6.3 to 2.6.4-rc13 and rolling back to 2.6.3
upgrading Rancher from 2.5.12 to 2.6.4-rc13 and rolling back to 2.5.12

  1. Created 5 downstream clusters: 1 with 5 nodes (3 worker, 1 etcd, 1 cp), 2 with 3 nodes (1 etcd, 1 cp, 1 worker), and 2 with 3 nodes (1 etcd+cp, 2 worker)
  2. Ran pre-upgrade checks by creating a few workloads and ingresses
  3. Installed the rancher-backup charts: 2.1.0 for 2.6.3 and 1.2.100+up1.2.1 for 2.5.12, respectively (a hedged install sketch follows this list)
  4. Upgraded Rancher to 2.6.4-rc13 and ran post-upgrade checks by validating the previously created workloads and creating additional workloads and ingresses
  5. Ran the verify script, which lists everything that needs to be cleaned up
  6. Ran the cleanup script and verified that all the namespaces, CRDs, etc. created by Rancher were cleaned up. The script took approximately 45 minutes to complete.
  7. Verified with the verify script that there were no remaining namespaces, configmaps, CRDs, etc.
  8. Followed the docs https://rancher.com/docs/rancher/v2.6/en/backups/migrating-rancher/, excluding step 3 from the docs.
  9. Installed Rancher 2.6.3 and 2.5.12 respectively on the clusters.
  10. Ran pre-upgrade checks and verified the tests passed.
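
For reference, a hedged sketch of step 3's chart install for the 2.6.3 setup, assuming the rancher-charts Helm repository; the chart version matches the one named in step 3, and the release names/namespace follow the usual rancher-backup layout:

helm repo add rancher-charts https://charts.rancher.io
helm repo update

# CRD chart first, then the operator chart
helm install rancher-backup-crd rancher-charts/rancher-backup-crd \
  -n cattle-resources-system --create-namespace --version 2.1.0
helm install rancher-backup rancher-charts/rancher-backup \
  -n cattle-resources-system --version 2.1.0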

@anupama2501 commented Mar 25, 2022

Keeping this issue open as the PR for the standalone cleanup and verify scripts is not merged.

@jiaqiluo

Moving this issue to to-test because the linked PR is merged.

@anupama2501

Closing the issue; validations are noted here: #36803 (comment)

@andruwa13

kubectl delete -n fleet-default $(kubectl get clusterregistrations -n fleet-default -o name)
