DataMover - Failed to complete restore, a few datadownloads are stuck in 'InProgress' status #6733

Closed
duduvaa opened this issue Aug 31, 2023 · 5 comments


duduvaa commented Aug 31, 2023

What steps did you take and what happened:

Running a datamover restore of 100 PVs. A few 'datadownloads' resources are stuck in 'InProgress' status even though the total bytes are already done. The restore CR status remains 'WaitingForPluginOperations' until the 4-hour timeout is reached.
100 PVCs were created: 98 are 'Bound' and 2 are 'Pending'.

Steps & Commands:
datadownloadsissue.tar.gz

  1. Create a namespace with 100 PVs (+pods). Each PV size is 6GB, with 2GB of data in use.
  2. Run the datamover backup:
    ./velero backup create dm-backup-100pods-2gb --include-namespaces perf-busy-data-cephrbd-100pods-2gb --data-mover "velero" --snapshot-move-data=true -nupstream-velero
    Backup completed in 12:05 minutes.
  3. Delete the namespace, volumesnapshot & volumesnapshotcontent.
  4. Run the restore:
    ./velero restore create dm-restore-100pods-2gb --from-backup dm-backup-100pods-2gb -nupstream-velero
    98 restores completed.
    2 restores failed (2 datadownloads resources are stuck in 'InProgress' status); the monitoring sketch below shows how their status was tracked.
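
For reference, a monitoring loop along these lines (a sketch only; the exact script that produced dm-data-debug.log is not attached) can record the datadownloads status over time:

    # Hypothetical monitoring loop: append a timestamped snapshot of all
    # datadownloads in the Velero namespace once per minute.
    while true; do
      date -u
      oc get datadownloads -nupstream-velero
      sleep 60
    done >> dm-data-debug.log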

What did you expect to happen:
Restore all pods & PVs successfully

The following information will help us better understand what's going on:

./velero debug --restore dm-restore-100pods-2gb -nupstream-velero
2023/08/31 11:05:54 Collecting velero resources in namespace: upstream-velero
2023/08/31 11:05:55 Collecting velero deployment logs in namespace: upstream-velero
2023/08/31 11:05:58 Collecting log and information for restore: dm-restore-100pods-2gb
2023/08/31 11:05:58 Generated debug information bundle: bundle-2023-08-31-11-05-54.tar.gz

DEBUG folder (attached) includes:

  1. node-agent logs
  2. bundle-2023-08-31-11-05-54.tar.gz (velero debug command)
  3. dm-data-debug.log: datadownloads progress during the test.
  4. datadownloads_issue.log: progress of the 2 problematic datadownloads.
  5. test_monitor.txt: restore CR & 2 datadownloads resources status

DM-Data folder includes all datadownloads yaml files (output of the oc get datadownloads {NAME} -nupstream-velero -oyaml command); a collection sketch follows.
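
The per-resource yaml dumps in DM-Data can be reproduced with a loop of this shape (a sketch, not the exact commands used for the attachment):

    # Hypothetical collection loop: dump every datadownload as yaml into DM-Data/.
    mkdir -p DM-Data
    for dd in $(oc get datadownloads -nupstream-velero -o name); do
      oc get "$dd" -nupstream-velero -oyaml > "DM-Data/$(basename "$dd").yaml"
    done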

Anything else you would like to add:

I ran the same test with 50, 80, 90, 100 and 120 PVs per namespace.
50 & 80 PVs - completed successfully.
90, 100 & 120 PVs - restore failed.

Environment:

  • Velero version: main (Velero-1.12), last commit:
    commit 30e54b0 (HEAD -> main, origin/main, origin/HEAD)
    Author: Daniel Jiang jiangd@vmware.com
    Date: Wed Aug 16 15:45:00 2023 +0800

  • Velero features (use velero client config get features):

    ./velero client config get features

    features:

  • Kubernetes version (use kubectl version):

    oc version

Client Version: 4.12.9
Kustomize Version: v4.5.7
Server Version: 4.12.9
Kubernetes Version: v1.25.7+eab9cc9

  • Cloud provider or hardware configuration:
    OCP running over BM servers
    3 masters & 6 workers nodes

    oc get nodes

NAME STATUS ROLES AGE VERSION
master-0 Ready control-plane,master 148d v1.25.7+eab9cc9
master-1 Ready control-plane,master 148d v1.25.7+eab9cc9
master-2 Ready control-plane,master 148d v1.25.7+eab9cc9
worker000-r640 Ready worker 148d v1.25.7+eab9cc9
worker001-r640 Ready worker 148d v1.25.7+eab9cc9
worker002-r640 Ready worker 148d v1.25.7+eab9cc9
worker003-r640 Ready worker 148d v1.25.7+eab9cc9
worker004-r640 Ready worker 148d v1.25.7+eab9cc9
worker005-r640 Ready worker 148d v1.25.7+eab9cc9

  • OS (e.g. from /etc/os-release):
    Red Hat Enterprise Linux CoreOS 412.86.202303211731-0
    Part of OpenShift 4.12, RHCOS is a Kubernetes native operating system

cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="412.86.202303211731-0"
VERSION_ID="4.12"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 412.86.202303211731-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.12/"
BUG_REPORT_URL="https://access.redhat.com/labs/rhir/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.12"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.12"
OPENSHIFT_VERSION="4.12"
RHEL_VERSION="8.6"
OSTREE_VERSION="412.86.202303211731-0"

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
Lyndon-Li self-assigned this Sep 1, 2023
Lyndon-Li added this to the v1.12 milestone Sep 1, 2023
@Lyndon-Li (Contributor) commented:

The error is as below:
time="2023-08-31T06:47:22Z" level=error msg="Failed to rebind PV to target PVC on completion" controller=DataDownload datadownload=dm-restore-100pods-2gb-rmgrg error="error to wait restore PV bound, restore PV pvc-ac7cf4cb-b58d-48b4-a3df-6f8931b17150: error to wait for bound of PV: pv has been bound by unexpected pvc upstream-velero/dm-restore-100pods-2gb-gjwrd" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/util/kube/pvc_pv.go:287" error.function=github.com/vmware-tanzu/velero/pkg/util/kube.WaitPVBound.func1 logSource="pkg/controller/data_download_controller.go:293"

The restore PV is bound by another restore PVC instead of the target PVC.
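
When this error shows up, the PV's claimRef reveals which PVC actually claimed it. For example, using a standard jsonpath query (the PV name is taken from the error above):

    # Show the namespace/name of the PVC that currently claims the restored PV.
    oc get pv pvc-ac7cf4cb-b58d-48b4-a3df-6f8931b17150 \
      -o jsonpath='{.spec.claimRef.namespace}/{.spec.claimRef.name}{"\n"}'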


duduvaa commented Sep 1, 2023

Thanks @Lyndon-Li

According to the above error, the two problematic datadownloads resources are:
dm-restore-100pods-2gb-rmgrg
dm-restore-100pods-2gb-gjwrd

Output from DEBUG/datadownloads_issue.log:
Thu Aug 31 06:47:04 UTC 2023
dm-restore-100pods-2gb-rmgrg InProgress 21s 1353830747 2119680347 default 21s worker000-r640
Thu Aug 31 06:48:05 UTC 2023
dm-restore-100pods-2gb-gjwrd Prepared 44s default 44s worker002-r640
dm-restore-100pods-2gb-rmgrg InProgress 81s 2119680347 2119680347 default 81s worker000-r640

Thu Aug 31 07:03:06 UTC 2023
dm-restore-100pods-2gb-gjwrd InProgress 15m 2119680347 2119680347 default 15m worker002-r640
dm-restore-100pods-2gb-rmgrg InProgress 16m 2119680347 2119680347 default 16m worker000-r640
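
For completeness, the phase and byte counters of a stuck datadownload can also be read from its status directly (a sketch, assuming the status exposes the progress fields behind the columns shown above):

    # Print phase plus bytesDone/totalBytes for one of the stuck datadownloads.
    oc get datadownload dm-restore-100pods-2gb-rmgrg -nupstream-velero \
      -o jsonpath='{.status.phase} {.status.progress.bytesDone}/{.status.progress.totalBytes}{"\n"}'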

@Lyndon-Li (Contributor) commented:

@duduvaa
The second error (for dm-restore-100pods-2gb-gjwrd) is a consequence of the first error (for dm-restore-100pods-2gb-rmgrg):

  • As you can see above, one restorePVC was wrongly bound to the PV, which caused DD dm-restore-100pods-2gb-rmgrg to fail.
  • At the end of handling the failed DD dm-restore-100pods-2gb-rmgrg, the controller deleted all of the intermediate resources, including the wrongly bound restorePV pvc-ac7cf4cb-b58d-48b4-a3df-6f8931b17150.
  • The restorePVC upstream-velero/dm-restore-100pods-2gb-gjwrd that caused the wrong binding was in fact the correct one for DD dm-restore-100pods-2gb-gjwrd; in other words, PVC upstream-velero/dm-restore-100pods-2gb-gjwrd was bound to PV pvc-ac7cf4cb-b58d-48b4-a3df-6f8931b17150.
  • Since the first DD had already deleted that PV, the second DD was bound to fail (see the check sketched below).
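
A way to confirm this binding from the cluster side (a sketch; the intermediate restorePVC/PV may already have been removed by the controller's cleanup):

    # Which PV is the second restorePVC bound to? Per the analysis above it should be
    # the PV that the first DD's cleanup deleted (pvc-ac7cf4cb-b58d-48b4-a3df-6f8931b17150).
    oc get pvc dm-restore-100pods-2gb-gjwrd -nupstream-velero \
      -o jsonpath='{.spec.volumeName}{"\n"}'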

@Lyndon-Li (Contributor) commented:

Fixed by #6738 and #6743


duduvaa commented Sep 6, 2023

Bug verified OK.
All 100 PVs were created and the restore ended with 'Completed' status.

Thank you @Lyndon-Li for the quick fix.
