
Prevent multiple restores from happening in parallel #9

Closed
sowmyav27 opened this issue Aug 29, 2020 · 4 comments

Comments

@sowmyav27

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):

  • on 2.5.0-alpha1
  • Deploy custom charts for enabling backup MCM
  • Deploy resourceset and create a backup
  • Currently, if a restore fails and is stuck in the "In Progress" state, the user is able to create another restore CR.

Expected Result:
User must be prevented from creating multiple restore CRs

Other details that may be helpful:

Environment information

  • Rancher version (rancher/rancher/rancher/server image tag or shown bottom left in the UI): 2.5.0-alpha1
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): HA rke cluster
  • Kubernetes version (use kubectl version): 1.18
@sowmyav27 sowmyav27 changed the title Prevent 2 restores from happening simultaneously Prevent multiple restores from happening in parallel Aug 29, 2020
@maggieliu maggieliu transferred this issue from rancher/rancher Sep 2, 2020
@maggieliu maggieliu added this to the v2.5 milestone Sep 2, 2020
@mrajashree
Contributor

We don't need to prevent this. The bug causing parallel restores to fail is fixed in #5

@sowmyav27
Author

sowmyav27 commented Sep 26, 2020

On master-head - commit id: 27000be7e

  • Deploy the backup restore app with incorrect S3 details (bucket name)
  • s3 - has some backup files - file1, file2
  • Deploy a couple of restore CRs --> to restore from file1 and file2
  • Restore CRs will be in Error state because S3 details are incorrect in the default storage target
  • Upgrade the backup restore app and provide the correct S3 bucket name.
  • 2 restores happen in parallel
  • Logs from restore-backup-operator:
INFO[2020/09/26 23:18:56] Scaling down controllerRef apps/v1/deployments/rancher to 0 
ERRO[2020/09/26 23:18:56] error syncing 'ssss': handler backups: resourcesets.resources.cattle.io "rancher-resource-sets" not found, requeuing 
INFO[2020/09/26 23:18:56] Starting to restore CRDs for restore CR restore-sx6f9 
INFO[2020/09/26 23:18:56] restoreResource: Restoring dynamicschemas.management.cattle.io of type apiextensions.k8s.io/v1beta1, Resource=customresourcedefinitions 
INFO[2020/09/26 23:18:56] Processing backup back                       
INFO[2020/09/26 23:18:56] For backup CR back, filename: back-cf9d746d-551b-490a-b801-0bb85ab9a0be-2020-09-26T23-18-56Z 
INFO[2020/09/26 23:18:56] Temporary backup path for storing all contents for backup CR back is /tmp/back-cf9d746d-551b-490a-b801-0bb85ab9a0be-2020-09-26T23-18-56Z331815839 
INFO[2020/09/26 23:18:56] Using resourceSet rancher-resource-setdd for gathering resources for backup CR back 
INFO[2020/09/26 23:18:57] Successfully restored dynamicschemas.management.cattle.io 
INFO[2020/09/26 23:18:57] restoreResource: Restoring projectalerts.management.cattle.io of type apiextensions.k8s.io/v1beta1, Resource=customresourcedefinitions 
INFO[2020/09/26 23:18:57] Gathering resources for groupVersion: rbac.authorization.k8s.io/v1 
ERRO[2020/09/26 23:18:57] error syncing 'back': handler backups: resourcesets.resources.cattle.io "rancher-resource-setdd" not found, requeuing 
INFO[2020/09/26 23:18:57] resource kind clusterrolebindings, matched regex ^clusterrolebindings$ 
INFO[2020/09/26 23:18:57] Successfully restored projectalerts.management.cattle.io 
INFO[2020/09/26 23:18:57] restoreResource: Restoring globaldnses.management.cattle.io of type apiextensions.k8s.io/v1beta1, Resource=customresourcedefinitions 
INFO[2020/09/26 23:18:57] Processing controllerRef apps/v1/deployments/rancher 
INFO[2020/09/26 23:18:57] Successfully restored globaldnses.management.cattle.io 
INFO[2020/09/26 23:18:57] restoreResource: Restoring globalroles.management.cattle.io of type apiextensions.k8s.io/v1beta1, Resource=customresourcedefinitions 
INFO[2020/09/26 23:18:57] Scaling down controllerRef apps/v1/deployments/rancher to 0 
INFO[2020/09/26 23:18:57] Starting to restore CRDs for restore CR restore-tv6v8 

Expected:
multiple restores must NOT happen in parallel

@mrajashree
Contributor

mrajashree commented Oct 1, 2020

We can't prevent the creation of a Restore CR while another restore CR is failing, because relying on the state of other CRs can lead to bugs. So even if more than one Restore CR is created at the same time, the fix ensures only one is processed at any time: whichever restore first acquires a lock/lease starts getting processed. In the meantime, the second (parallel) restore will show this error in the logs:

restore <restoreName:restoreUID> is in progress

This only means that this restore will be processed after the current restore is done processing.
Fix available with chart version v1.0.100
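
For illustration, here is a minimal sketch of the lock/lease idea described above (not the operator's actual code). It assumes a client-go clientset and a Lease named restore-controller in the operator's namespace, which is the name that shows up in the verification logs further down; the package and function names and the holder format <restoreName>:<restoreUID> are made up for this example. If the Lease already exists, the handler returns an error so the controller requeues the restore, matching the "is in progress, requeuing" messages below.

```go
package restorelock

import (
	"context"
	"fmt"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// leaseName matches the Lease seen in the operator logs in this thread.
const leaseName = "restore-controller"

// tryAcquireRestoreLease attempts to create the Lease that serializes restores.
// holder is an identifier such as "<restoreName>:<restoreUID>". If another
// restore already holds the Lease, an error is returned so the caller can
// requeue this restore.
func tryAcquireRestoreLease(ctx context.Context, client kubernetes.Interface, namespace, holder string) error {
	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{Name: leaseName, Namespace: namespace},
		Spec:       coordinationv1.LeaseSpec{HolderIdentity: &holder},
	}
	_, err := client.CoordinationV1().Leases(namespace).Create(ctx, lease, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		// Another restore holds the lock; report its holder and requeue.
		existing, getErr := client.CoordinationV1().Leases(namespace).Get(ctx, leaseName, metav1.GetOptions{})
		if getErr == nil && existing.Spec.HolderIdentity != nil {
			return fmt.Errorf("restore %s is in progress", *existing.Spec.HolderIdentity)
		}
		return fmt.Errorf("another restore is in progress")
	}
	return err
}

// releaseRestoreLease deletes the Lease once the current restore finishes,
// allowing the next queued restore to acquire it.
func releaseRestoreLease(ctx context.Context, client kubernetes.Interface, namespace string) error {
	return client.CoordinationV1().Leases(namespace).Delete(ctx, leaseName, metav1.DeleteOptions{})
}
```

Serializing on a Lease instead of inspecting the status of other Restore CRs avoids relying on CR state, which is the concern raised above: whichever restore creates the Lease first gets processed, and the rest keep requeuing until it is released.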

@sowmyav27
Author

sowmyav27 commented Oct 1, 2020

On Master-head - commit id: cb9e8a105c

  • Deploy rancher-backup app
  • Take 2 backups b1 and b2
  • Edit the rancher-backup app and set an invalid folder value
  • Deploy 4 restore CRs
  • The restore CRs will be stuck in Error state
  • Edit the rancher-backup app, change the folder to the right value, and save the changes.
  • The restores kick in
  • operator logs:
INFO[2020/10/01 17:48:52] Processing Restore CR restore-8n5t7          
INFO[2020/10/01 17:48:52] Restoring from backup backup-02-ec4e885e-ee41-4ec7-8326-a35d24101b8b-2020-10-01T17-37-59Z.tar.gz 
ERRO[2020/10/01 17:48:52] error syncing 'restore-bs2lz': handler restore: leases.coordination.k8s.io "restore-controller" already exists, requeuing 
INFO[2020/10/01 17:48:52] invoking set s3 service client                s3-accessKey=<> s3-bucketName=<> s3-endpoint=s3.us-east-2.amazonaws.com s3-endpoint-ca= s3-folder= s3-region=
ERRO[2020/10/01 17:48:52] error syncing 'restore-tp84l': handler restore: restore restore-8n5t7:aef48efe-1ee2-448c-8b79-3f153dd1847c is in progress, requeuing 
ERRO[2020/10/01 17:48:52] error syncing 'restore-9gtp4': handler restore: restore restore-8n5t7:aef48efe-1ee2-448c-8b79-3f153dd1847c is in progress, requeuing 
ERRO[2020/10/01 17:48:52] error syncing 'restore-bs2lz': handler restore: restore restore-8n5t7:aef48efe-1ee2-448c-8b79-3f153dd1847c is in progress, requeuing
  • Noticed these errors in the logs. While one restore is in progress, the others are seen waiting.
  • Say res1 is in progress; res2, res3, and res4 are in a waiting state.
  • When res1 has completed, res3 is picked up and res2, res4 remain in a waiting state.
  • No two restores happen in parallel.
  • Rancher server comes up successfully.
