
Cluster migration: User is not able to take a backup after a restore #24

Closed

sowmyav27 opened this issue Sep 4, 2020 · 5 comments

@sowmyav27

On master-head - commit id: 5d5ef3f8f and backup-restore tag: v0.0.1-rc9

  • Take a backup into S3.
  • Deploy Rancher in an HA setup in an EKS cluster.
  • Deploy the backup-restore app.
  • Restore from the backup.
  • Rancher is restored successfully.
  • Create another backup by creating a Backup CR (see the sketch after this list).
  • No backup is taken.
  • Error seen in the backup-restore-operator logs: E0904 03:36:51.157283 1 reflector.go:178] pkg/mod/k8s.io/client-go@v0.18.0/tools/cache/reflector.go:125: Failed to list *v1.Backup: Unauthorized
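For reference, a minimal Backup CR for that step might look like the sketch below. The resource name, credential secret, and S3 details are placeholders for this setup, not values from the report; field names are based on the operator's Backup CRD and should be checked against the chart docs.

    apiVersion: resources.cattle.io/v1
    kind: Backup
    metadata:
      name: s3-backup-after-restore               # hypothetical name
    spec:
      resourceSetName: rancher-resource-set
      storageLocation:
        s3:
          credentialSecretName: s3-creds          # placeholder secret holding the S3 credentials
          credentialSecretNamespace: default
          bucketName: rancher-backups             # placeholder bucket
          folder: rancher
          region: us-west-2
          endpoint: s3.us-west-2.amazonaws.com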
@sowmyav27 sowmyav27 added this to the v2.5 milestone Sep 4, 2020
@mrajashree mrajashree changed the title User is not able to take a backup after a restore Cluster migration: User is not able to take a backup after a restore Sep 4, 2020
@mrajashree
Contributor

mrajashree commented Sep 6, 2020

I think this will be solved by including the backup-restore operator CRDs in the resourceSet. Will check this and update the resourceSet accordingly.
(Update: this is not needed.)

@mrajashree
Contributor

mrajashree commented Sep 13, 2020

The main reason behind the "Unauthorized" errors is the service account tied to the operator pod.
We configure the operator pod to use the serviceaccount that has the cluster-admin role. When this service account is created, Kubernetes also creates a secret (the service account token) associated with it and mounts it into the pod. During a restore, since prune is enabled by default, this secret gets deleted.
So if we restore with prune=false we shouldn't see this error, but that leads to the duplicate "Default" and "System" projects issue.
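One way to see the secret in question (namespace assumed to match a default chart install; pod and service account names need to be filled in):

    # Find the operator pod and the service account it runs as
    kubectl -n cattle-resources-system get pods
    kubectl -n cattle-resources-system get pod <operator-pod-name> -o jsonpath='{.spec.serviceAccountName}'
    # The service account's token secret is what a prune=true restore ends up deleting
    kubectl -n cattle-resources-system get serviceaccount <service-account-name> -o yaml
    kubectl -n cattle-resources-system get secrets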

@mrajashree
Contributor

mrajashree commented Sep 14, 2020

The following steps should be used for restoring to a new cluster for the DR use case; they ensure the operator pod retains its serviceaccount and the associated secret.

  1. Install the backup-restore-operator on the new cluster using the Helm CLI.
  2. Restore from the backup AND set prune=false.
  3. The restore also brings in the secret associated with the Helm release of Rancher from cluster 1, so run helm upgrade instead of helm install to bring up Rancher (see the check after this list).
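A quick way to confirm the restored release is present before upgrading (release and namespace names assumed to match the standard Rancher HA install used here):

    helm list -n cattle-system               # the "rancher" release restored from cluster 1 should show up here
    helm history rancher -n cattle-system    # release history carried over via the restored Helm secret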

Discussed this offline with @cloudnautique, and there is no need to bring up Rancher first on the new cluster and then launch the operator from the dashboard. If we're restoring from a backup, it makes sense for the operator to bring up the entire setup. Will test these steps once again.

@mrajashree
Contributor

mrajashree commented Sep 14, 2020

Steps

  1. helm install backup-restore-operator-crd rancherchart/backup-restore-operator-crd -n cattle-resources-system --create-namespace
  2. helm install backup-restore-operator rancherchart/backup-restore-operator -n cattle-resources-system
  3. kubectl apply -f migrationResource.yaml, where prune=false (see the sketch after this list)
    Helm 3 stores chart release info as a secret, so the rancher chart release from cluster 1 is stored as a secret in the cattle-system namespace, which gets backed up and recreated on the new cluster by the restore. So there is no need to reinstall Rancher; we just need to upgrade it.
  4. (If needed, also follow the steps to install cert-manager from the Rancher HA install docs.)
  5. helm upgrade rancher rancher-alpha/rancher --version 2.5.0-alpha1 --namespace cattle-system --set hostname= --set rancherImageTag=master-head --set webhook.enabled=false
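A sketch of what migrationResource.yaml could contain, assuming the backup file and S3 location used for cluster 1; field names follow the operator's Restore CRD, and the placeholders must be replaced with real values:

    apiVersion: resources.cattle.io/v1
    kind: Restore
    metadata:
      name: restore-migration                       # hypothetical name
    spec:
      backupFilename: <backup-file-from-cluster-1>  # placeholder: the backup produced on cluster 1
      prune: false                                  # the key setting for this migration flow
      storageLocation:
        s3:
          credentialSecretName: s3-creds            # placeholder
          credentialSecretNamespace: default
          bucketName: rancher-backups               # placeholder
          folder: rancher
          region: us-west-2
          endpoint: s3.us-west-2.amazonaws.com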

This should work with the above steps. Moving to test, as no actual change is needed in the operator or the chart.

@sowmyav27
Author

Verified on master-head - commit id: ad697207

  • Deploy a Rancher HA setup.
  • Deploy a couple of user clusters.
  • Deploy the backup-restore chart/app in the local cluster.
  • Take a backup b1, which is saved.
  • Delete the local cluster nodes for this HA setup.
  • Deploy a new RKE cluster (3 nodes, all roles). Add these nodes to the target groups/load balancer.
  • Install the backup-restore-operator chart on the new cluster using the Helm CLI:
helm repo add rancherchart https://charts.rancher.io
helm repo update
helm install backup-restore-operator-crd rancherchart/backup-restore-operator-crd -n cattle-resources-system --create-namespace
helm install backup-restore-operator rancherchart/backup-restore-operator -n cattle-resources-system
