
Prevent multiple restores from happening in parallel #9

Closed
sowmyav27 opened this issue Aug 29, 2020 · 4 comments

Comments

@sowmyav27

What kind of request is this (question/bug/enhancement/feature request): bug

Steps to reproduce (least amount of steps as possible):

  • on 2.5.0-alpha1
  • Deploy custom charts for enabling backup MCM
  • Deploy resourceset and create a backup
  • Currently, if a restore fails and is stuck in the "In Progress" state, the user is able to create another restore CR.

Expected Result:
User must be prevented from creating multiple restore CRs

Other details that may be helpful:

Environment information

  • Rancher version (rancher/rancher/rancher/server image tag or shown bottom left in the UI): 2.5.0-alpha1
  • Installation option (single install/HA): HA

Cluster information

  • Cluster type (Hosted/Infrastructure Provider/Custom/Imported): HA rke cluster
  • Kubernetes version (use kubectl version): 1.18
@sowmyav27 sowmyav27 changed the title Prevent 2 restores from happening simultaneously Prevent multiple restores from happening in parallel Aug 29, 2020
@maggieliu maggieliu transferred this issue from rancher/rancher Sep 2, 2020
@maggieliu maggieliu added this to the v2.5 milestone Sep 2, 2020
@mrajashree
Contributor

We don't need to prevent this. The bug causing parallel restores to fail is fixed in #5

@sowmyav27
Author

sowmyav27 commented Sep 26, 2020

On master-head - commit id: 27000be7e

  • Deploy the backup restore app with incorrect S3 details (bucket name)
  • s3 - has some backup files - file1, file2
  • Deploy a couple of restore CRs --> to restore from file1 and file2
  • Restore CRs will be in Error state because S3 details are incorrect in the default storage target
  • Upgrade the backup restore app and provide the correct S3 bucket name.
  • 2 restores happen in parallel
  • Logs from restore-backup-operator:
INFO[2020/09/26 23:18:56] Scaling down controllerRef apps/v1/deployments/rancher to 0 
ERRO[2020/09/26 23:18:56] error syncing 'ssss': handler backups: resourcesets.resources.cattle.io "rancher-resource-sets" not found, requeuing 
INFO[2020/09/26 23:18:56] Starting to restore CRDs for restore CR restore-sx6f9 
INFO[2020/09/26 23:18:56] restoreResource: Restoring dynamicschemas.management.cattle.io of type apiextensions.k8s.io/v1beta1, Resource=customresourcedefinitions 
INFO[2020/09/26 23:18:56] Processing backup back                       
INFO[2020/09/26 23:18:56] For backup CR back, filename: back-cf9d746d-551b-490a-b801-0bb85ab9a0be-2020-09-26T23-18-56Z 
INFO[2020/09/26 23:18:56] Temporary backup path for storing all contents for backup CR back is /tmp/back-cf9d746d-551b-490a-b801-0bb85ab9a0be-2020-09-26T23-18-56Z331815839 
INFO[2020/09/26 23:18:56] Using resourceSet rancher-resource-setdd for gathering resources for backup CR back 
INFO[2020/09/26 23:18:57] Successfully restored dynamicschemas.management.cattle.io 
INFO[2020/09/26 23:18:57] restoreResource: Restoring projectalerts.management.cattle.io of type apiextensions.k8s.io/v1beta1, Resource=customresourcedefinitions 
INFO[2020/09/26 23:18:57] Gathering resources for groupVersion: rbac.authorization.k8s.io/v1 
ERRO[2020/09/26 23:18:57] error syncing 'back': handler backups: resourcesets.resources.cattle.io "rancher-resource-setdd" not found, requeuing 
INFO[2020/09/26 23:18:57] resource kind clusterrolebindings, matched regex ^clusterrolebindings$ 
INFO[2020/09/26 23:18:57] Successfully restored projectalerts.management.cattle.io 
INFO[2020/09/26 23:18:57] restoreResource: Restoring globaldnses.management.cattle.io of type apiextensions.k8s.io/v1beta1, Resource=customresourcedefinitions 
INFO[2020/09/26 23:18:57] Processing controllerRef apps/v1/deployments/rancher 
INFO[2020/09/26 23:18:57] Successfully restored globaldnses.management.cattle.io 
INFO[2020/09/26 23:18:57] restoreResource: Restoring globalroles.management.cattle.io of type apiextensions.k8s.io/v1beta1, Resource=customresourcedefinitions 
INFO[2020/09/26 23:18:57] Scaling down controllerRef apps/v1/deployments/rancher to 0 
INFO[2020/09/26 23:18:57] Starting to restore CRDs for restore CR restore-tv6v8 

Expected:
multiple restores must NOT happen in parallel

@mrajashree
Contributor

mrajashree commented Oct 1, 2020

We can't prevent the creation of a Restore CR while another restore CR is failing, because relying on the state of other CRs can lead to bugs. So even if more than one Restore CR is created at the same time, the fix ensures only one is processed at any time: whichever restore first acquires a lock/lease starts getting processed. In the meantime, the second (parallel) restore will show this error in the logs:

restore <restoreName:restoreUID> is in progress

This only means that this restore will be processed after the current restore is done processing.
Fix available with chart version v1.0.100
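
For illustration, here is a minimal sketch of the lock/lease idea described above (not the operator's actual code). It assumes a client-go clientset and a Lease named restore-controller in the operator's namespace, which is the name that shows up in the verification logs further down; the package and function names and the holder format <restoreName>:<restoreUID> are made up for this example. If the Lease already exists, the handler returns an error so the controller requeues the restore, matching the "is in progress, requeuing" messages below.

```go
package restorelock

import (
	"context"
	"fmt"

	coordinationv1 "k8s.io/api/coordination/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// leaseName matches the Lease seen in the operator logs in this thread.
const leaseName = "restore-controller"

// tryAcquireRestoreLease attempts to create the Lease that serializes restores.
// holder is an identifier such as "<restoreName>:<restoreUID>". If another
// restore already holds the Lease, an error is returned so the caller can
// requeue this restore.
func tryAcquireRestoreLease(ctx context.Context, client kubernetes.Interface, namespace, holder string) error {
	lease := &coordinationv1.Lease{
		ObjectMeta: metav1.ObjectMeta{Name: leaseName, Namespace: namespace},
		Spec:       coordinationv1.LeaseSpec{HolderIdentity: &holder},
	}
	_, err := client.CoordinationV1().Leases(namespace).Create(ctx, lease, metav1.CreateOptions{})
	if apierrors.IsAlreadyExists(err) {
		// Another restore holds the lock; report its holder and requeue.
		existing, getErr := client.CoordinationV1().Leases(namespace).Get(ctx, leaseName, metav1.GetOptions{})
		if getErr == nil && existing.Spec.HolderIdentity != nil {
			return fmt.Errorf("restore %s is in progress", *existing.Spec.HolderIdentity)
		}
		return fmt.Errorf("another restore is in progress")
	}
	return err
}

// releaseRestoreLease deletes the Lease once the current restore finishes,
// allowing the next queued restore to acquire it.
func releaseRestoreLease(ctx context.Context, client kubernetes.Interface, namespace string) error {
	return client.CoordinationV1().Leases(namespace).Delete(ctx, leaseName, metav1.DeleteOptions{})
}
```

Serializing on a Lease instead of inspecting the status of other Restore CRs avoids relying on CR state, which is the concern raised above: whichever restore creates the Lease first gets processed, and the rest keep requeuing until it is released.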

@sowmyav27
Author

sowmyav27 commented Oct 1, 2020

On Master-head - commit id: cb9e8a105c

  • Deploy rancher-backup app
  • Take 2 backups b1 and b2
  • Edit the rancher-backup app and set an invalid folder value
  • Deploy 4 restore CRs
  • The restore CRs will be stuck in Error state
  • Edit the rancher-backup app, change the folder to the right value, and save the changes.
  • The restores kick in
  • operator logs:
INFO[2020/10/01 17:48:52] Processing Restore CR restore-8n5t7          
INFO[2020/10/01 17:48:52] Restoring from backup backup-02-ec4e885e-ee41-4ec7-8326-a35d24101b8b-2020-10-01T17-37-59Z.tar.gz 
ERRO[2020/10/01 17:48:52] error syncing 'restore-bs2lz': handler restore: leases.coordination.k8s.io "restore-controller" already exists, requeuing 
INFO[2020/10/01 17:48:52] invoking set s3 service client                s3-accessKey=<> s3-bucketName=<> s3-endpoint=s3.us-east-2.amazonaws.com s3-endpoint-ca= s3-folder= s3-region=
ERRO[2020/10/01 17:48:52] error syncing 'restore-tp84l': handler restore: restore restore-8n5t7:aef48efe-1ee2-448c-8b79-3f153dd1847c is in progress, requeuing 
ERRO[2020/10/01 17:48:52] error syncing 'restore-9gtp4': handler restore: restore restore-8n5t7:aef48efe-1ee2-448c-8b79-3f153dd1847c is in progress, requeuing 
ERRO[2020/10/01 17:48:52] error syncing 'restore-bs2lz': handler restore: restore restore-8n5t7:aef48efe-1ee2-448c-8b79-3f153dd1847c is in progress, requeuing
  • Noticed these errors in the logs. While one restore is in progress, the others are seen waiting.
  • Say res1 is in progress; res2, res3, and res4 are in a waiting state.
  • When res1 has completed, res3 is picked up and res2, res4 remain in a waiting state.
  • No two restores happen in parallel.
  • Rancher server comes up successfully.
