Kopia - Failed to restore more than 11 PVs, Error: "Unable to load snapshot {ID}: snapshot not found" #6748
Comments
@duduvaa You don't see the same problem for data mover, right?
@Lyndon-Li, right.
Here are all the errors I found in the attached logs, 28 in total:
@duduvaa "snapshotID" should be one of the snapshot IDs list above. BTW, please use Kopia 0.13 binary and before running |
@Lyndon-Li, none of the "snapshotID"s exists in the "manifest list" output. The output below is from the successfully restored namespaces.

Namespace "perf-busy-data-cephrbd-20pods-2gb" - 11 IDs, matching the succeeded restores:
./kopia manifest list
010a01993b0934f019805f6ae347d426 213 2023-08-31 15:59:59 UTC type:maintenance

Namespace "perf-busy-data-cephrbd-30pods-2gb" - 11 IDs, matching the succeeded restores:
./kopia manifest list
561b1374fded9774751ebbd35cba46ac 213 2023-08-31 16:14:50 UTC type:maintenance
@duduvaa
./kopia snapshot list
No snapshots found. Pass --all to show snapshots from all users/hosts.

./kopia snapshot list --all
The problem was caused by misuse of Kopia's default retention policy. As a result, only the 10 latest snapshots per source are preserved, because Kopia's default retention policy keeps 10. Velero has its own retention policy at the backup level, so the snapshots saved in the Kopia repo should follow the backup retention policy instead of Kopia's default. This is a regression due to code changes in 1.12. Both fs-backup and data mover are affected.
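For reference, the retention limit in question can be inspected and overridden with Kopia's policy commands; a minimal sketch (shown only to illustrate the --keep-latest knob against the same repository as above, not as the actual Velero fix, which lands in Velero's repository code):

./kopia policy show --global                   # default global policy keeps the latest 10 snapshots per source
./kopia policy set --global --keep-latest 100  # example override so Kopia does not prune snapshots Velero still needs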
1st namespace:
kopia-backup-20pods-2gb   Completed   0   0   2023-08-31 15:59:52 +0000 UTC   25d   default

./kopia repository connect s3 --bucket="oadp-bucket" --access-key=Cvaz1RguhpAyuAYnAETQ --secret-access-key=+BSgIgkn0EvUT+kSeXcH8u4xo6MoHzSVRPG3R5js --disable-tls --endpoint="minio-minio.apps.vlan611.rdu2.scalelab.redhat.com" --prefix=velero/kopia/perf-busy-data-cephrbd-20pods-2gb/
./kopia manifest list
./kopia snapshot list --all

2nd namespace:
kopia-backup-30pods-2gb   Completed   0   0   2023-08-31 16:14:38 +0000 UTC   25d   default

./kopia repository connect s3 --bucket="oadp-bucket" --access-key=Cvaz1RguhpAyuAYnAETQ --secret-access-key=+BSgIgkn0EvUT+kSeXcH8u4xo6MoHzSVRPG3R5js --disable-tls --endpoint="minio-minio.apps.vlan611.rdu2.scalelab.redhat.com" --prefix=velero/kopia/perf-busy-data-cephrbd-30pods-2gb
./kopia manifest list
./kopia snapshot list --all
Bug verification failed. Restored only 12 out of 20 Pods. Errors:
The problem is on the backup side, so you need to create new backups and then run the restore.
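For example, a fresh verification cycle would look like the following (the "-retest" names are hypothetical; the flags are the same ones used in the original report below):

./velero backup create kopia-backup-30pods-2gb-retest --include-namespaces perf-busy-data-cephrbd-30pods-2gb --default-volumes-to-fs-backup --snapshot-volumes=false -nupstream-velero
./velero restore create kopia-restore-30pods-2gb-retest --from-backup kopia-backup-30pods-2gb-retest -nupstream-velero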
Re-ran backup & restore cycles - the restore still failed.
Image version: release-1.12-dev

Velero error:
pod volume restore failed: data path restore failed: Failed to run kopia restore: Unable to load snapshot 299e8a649026d1cbb122cee437cea184: snapshot not found

Backups status:
NAME   STATUS   ERRORS   WARNINGS   CREATED   EXPIRES   STORAGE LOCATION   SELECTOR

Restores status:
NAME   BACKUP   STATUS   STARTED   COMPLETED   ERRORS   WARNINGS   CREATED   SELECTOR

kopia-restore-30pods-2gb.tar.gz
Looks like the number of failed PVRs has reduced significantly. @duduvaa Could you run the same kopia commands as before?
20pods:
kopia manifest list
kopia snapshot list --all

30pods:
kopia manifest list
kopia snapshot list --all
@duduvaa
Bug verified OK. @Lyndon-Li, thanks for the quick fix.
What steps did you take and what happened:
Ran Kopia backup & restore:
Created namespaces with 5, 10, 20 & 30 Pods & PVs.
All backups finished with 'Completed' status.
While running the restores:
Namespaces with 5 & 10 PVs: 'Completed' status; all Pods in 'Running' status.
Namespaces with 20 & 30 PVs: 'PartiallyFailed' status; only 11 Pods in 'Running' status, the rest in 'Init:0/1' status.
All PVCs are in 'Bound' status.
Steps & Commands:
./velero backup create kopia-backup-5pods-2gb --include-namespaces perf-busy-data-cephrbd-5pods-2gb --default-volumes-to-fs-backup --snapshot-volumes=false -nupstream-velero
./velero restore create kopia-restore-5pods-2gb --from-backup kopia-backup-5pods-2gb -nupstream-velero
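The same pair of commands was repeated for the 10-, 20- and 30-Pod namespaces; a rough sketch of the full cycle, assuming the naming pattern above and waiting for each backup to complete before restoring:

for n in 5 10 20 30; do
  ./velero backup create kopia-backup-${n}pods-2gb --include-namespaces perf-busy-data-cephrbd-${n}pods-2gb --default-volumes-to-fs-backup --snapshot-volumes=false -nupstream-velero
  # wait for the backup to reach 'Completed' before creating the restore
  ./velero restore create kopia-restore-${n}pods-2gb --from-backup kopia-backup-${n}pods-2gb -nupstream-velero
done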
Restore of namespaces with 5 & 10 PVs - Completed.
Restore of namespaces with 20 & 30 PVs - PartiallyFailed.
NAME BACKUP STATUS STARTED COMPLETED ERRORS WARNINGS CREATED SELECTOR
kopia-restore-10pods-2gb kopia-backup-10pods-2gb Completed 2023-09-03 08:07:42 +0000 UTC 2023-09-03 08:13:24 +0000 UTC 0 10 2023-09-03 08:07:42 +0000 UTC
kopia-restore-20pods-2gb kopia-backup-20pods-2gb PartiallyFailed 2023-09-03 08:24:37 +0000 UTC 2023-09-03 08:31:16 +0000 UTC 9 10 2023-09-03 08:24:37 +0000 UTC
kopia-restore-30pods-2gb kopia-backup-30pods-2gb PartiallyFailed 2023-09-03 08:33:22 +0000 UTC 2023-09-03 08:36:23 +0000 UTC 19 10 2023-09-03 08:33:22 +0000 UTC
kopia-restore-5pods-2gb kopia-backup-5pods-2gb Completed 2023-09-03 08:01:39 +0000 UTC 2023-09-03 08:06:15 +0000 UTC 0 10 2023-09-03 08:01:39 +0000 UTC
Errors:
pod volume restore failed: data path restore failed: Failed to run kopia restore: Unable to load snapshot 12f5d4eaa6f8f7dcd1c7c330adc1a07f: snapshot not found
kopia-restore-20pods-2gb-describe.txt: pod volume restore failed: data path restore failed: Failed to run kopia restore: Unable to load snapshot f29e0d7bd0e89a0e690505032f9f5184: snapshot not found
kopia-restore-20pods-2gb-describe.txt: pod volume restore failed: data path restore failed: Failed to run kopia restore: Unable to load snapshot 176ef4dc665a8bd8375ef78a6f08c318: snapshot not found
time="2023-09-03T08:25:02Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="pod volume restore failed: data path restore failed: Failed to run kopia restore: Unable to load snapshot 12f5d4eaa6f8f7dcd1c7c330adc1a07f: snapshot not found" logSource="pkg/restore/restore.go:1731" restore=upstream-velero/kopia-restore-20pods-2gb
kopia-restore-20pods-2gb.log:time="2023-09-03T08:25:03Z" level=error msg="unable to successfully complete pod volume restores of pod's volumes" error="pod volume restore failed: data path restore failed: Failed to run kopia restore: Unable to load snapshot f29e0d7bd0e89a0e690505032f9f5184: snapshot not found" logSource="pkg/restore/restore.go:1731" restore=upstream-velero/kopia-restore-20pods-2gb
What did you expect to happen:
Restore all pods & PVs successfully
The following information will help us better understand what's going on:
Attached Velero debug of all 4 restores.
kopia-restore-5pods-2gb.tar.gz
kopia-restore-30pods-2gb.tar.gz
kopia-restore-20pods-2gb.tar.gz
kopia-restore-10pods-2gb.tar.gz
Anything else you would like to add:
Pod with 'Init:0/1' status:
oc describe pod busy-data-rbd-30pods-2gb-16-b77d9cd98-rb6fl -nperf-busy-data-cephrbd-30pods-2gb
Init Containers:
restore-wait:
Container ID: cri-o://68f159b3a97d59aed6f7cc2c12417041bc8a96bf41432f00c9927d3dcae9e5ef
Image: velero/velero-restore-helper:main
Image ID: docker.io/velero/velero-restore-helper@sha256:2ddad48ce9bbbbc965bf8a27d70cb9c303aa31c10f0ce06eab9a72f7399adaf7
Port:
Host Port:
Command:
/velero-restore-helper
Args:
3e44e282-2e2a-4082-87fc-52f13f5afbfe
State: Running
Started: Sun, 03 Sep 2023 08:33:53 +0000
Ready: False
Restart Count: 0
Limits:
cpu: 100m
memory: 128Mi
Requests:
cpu: 100m
memory: 128Mi
Environment:
POD_NAMESPACE: perf-busy-data-cephrbd-30pods-2gb (v1:metadata.namespace)
POD_NAME: busy-data-rbd-30pods-2gb-16-b77d9cd98-rb6fl (v1:metadata.name)
Mounts:
/restores/vol-0 from vol-0 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-xwftf (ro)
oc logs pod/busy-data-rbd-30pods-2gb-1-c945fcd9d-dqkqv -c restore-wait -nperf-busy-data-cephrbd-30pods-2gb
Not found: /restores/vol-0/.velero/3e44e282-2e2a-4082-87fc-52f13f5afbfe
Not found: /restores/vol-0/.velero/3e44e282-2e2a-4082-87fc-52f13f5afbfe
Not found: /restores/vol-0/.velero/3e44e282-2e2a-4082-87fc-52f13f5afbfe
Not found: /restores/vol-0/.velero/3e44e282-2e2a-4082-87fc-52f13f5afbfe
Not found: /restores/vol-0/.velero/3e44e282-2e2a-4082-87fc-52f13f5afbfe
Not found: /restores/vol-0/.velero/3e44e282-2e2a-4082-87fc-52f13f5afbfe
oc logs pod/busy-data-rbd-30pods-2gb-1-c945fcd9d-dqkqv -c restore-wait -nperf-busy-data-cephrbd-30pods-2gb | grep -c "Not found"
20883
oc logs pod/busy-data-rbd-30pods-2gb-1-c945fcd9d-dqkqv -c restore-wait -nperf-busy-data-cephrbd-30pods-2gb > /root/restore-wait
restore-wait.gz
Pod with 'Running' status:
oc describe pod busy-data-rbd-30pods-2gb-7-66fbb475f6-d2v5l -nperf-busy-data-cephrbd-30pods-2gb
Init Containers:
restore-wait:
Container ID: cri-o://fc1ed2b6adf59ee6395cd420ffcb19e3b64ab054d5cf644bf063d7638f1fd376
Image: velero/velero-restore-helper:main
Image ID: docker.io/velero/velero-restore-helper@sha256:3c01a8c5efcc7374481b21f19dfe949724ca5149801af549dd64e2b3a2cc2d95
Port:
Host Port:
Command:
/velero-restore-helper
Args:
3e44e282-2e2a-4082-87fc-52f13f5afbfe
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sun, 03 Sep 2023 08:34:11 +0000
Finished: Sun, 03 Sep 2023 08:34:42 +0000
Ready: True
Restart Count: 0
Limits:
cpu: 100m
memory: 128Mi
Requests:
cpu: 100m
memory: 128Mi
Environment:
POD_NAMESPACE: perf-busy-data-cephrbd-30pods-2gb (v1:metadata.namespace)
POD_NAME: busy-data-rbd-30pods-2gb-7-66fbb475f6-d2v5l (v1:metadata.name)
Mounts:
/restores/vol-0 from vol-0 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-mhv2f (ro)
oc logs pod/busy-data-rbd-30pods-2gb-7-66fbb475f6-d2v5l -c restore-wait -nperf-busy-data-cephrbd-30pods-2gb
Not found: /restores/vol-0/.velero/3e44e282-2e2a-4082-87fc-52f13f5afbfe
Not found: /restores/vol-0/.velero/3e44e282-2e2a-4082-87fc-52f13f5afbfe
Not found: /restores/vol-0/.velero/3e44e282-2e2a-4082-87fc-52f13f5afbfe
Not found: /restores/vol-0/.velero/3e44e282-2e2a-4082-87fc-52f13f5afbfe
Not found: /restores/vol-0/.velero/3e44e282-2e2a-4082-87fc-52f13f5afbfe
Found /restores/vol-0/.velero/3e44e282-2e2a-4082-87fc-52f13f5afbfe
All restic restores are done
unlinkat /restores/vol-0/.velero/3e44e282-2e2a-4082-87fc-52f13f5afbfe: permission denied
oc logs pod/busy-data-rbd-30pods-2gb-7-66fbb475f6-d2v5l -c restore-wait -nperf-busy-data-cephrbd-30pods-2gb | grep -c "Not found"
30
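For context, the restore-wait init container just polls each restored volume for a per-restore marker file written by Velero's node agent and then cleans it up. A rough, hypothetical shell approximation of the behaviour seen in the logs above (not Velero's actual Go implementation):

RESTORE_UID=3e44e282-2e2a-4082-87fc-52f13f5afbfe
until [ -f "/restores/vol-0/.velero/${RESTORE_UID}" ]; do
  echo "Not found: /restores/vol-0/.velero/${RESTORE_UID}"    # the stuck Init:0/1 pods never leave this loop
  sleep 1
done
echo "Found /restores/vol-0/.velero/${RESTORE_UID}"
echo "All restic restores are done"
rm -rf "/restores/vol-0/.velero"                              # this cleanup step is what hit "permission denied" above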
Environment:
Velero version: main (Velero 1.12), latest commit:
commit 30e54b0 (HEAD -> main, origin/main, origin/HEAD)
Author: Daniel Jiang jiangd@vmware.com
Date: Wed Aug 16 15:45:00 2023 +0800
Velero features (use velero client config get features):
./velero client config get features
features:
Kubernetes version (use kubectl version):
oc version
Client Version: 4.12.9
Kustomize Version: v4.5.7
Server Version: 4.12.9
Kubernetes Version: v1.25.7+eab9cc9
OCP running on bare-metal servers: 3 master & 6 worker nodes.
oc get nodes
NAME STATUS ROLES AGE VERSION
master-0 Ready control-plane,master 148d v1.25.7+eab9cc9
master-1 Ready control-plane,master 148d v1.25.7+eab9cc9
master-2 Ready control-plane,master 148d v1.25.7+eab9cc9
worker000-r640 Ready worker 148d v1.25.7+eab9cc9
worker001-r640 Ready worker 148d v1.25.7+eab9cc9
worker002-r640 Ready worker 148d v1.25.7+eab9cc9
worker003-r640 Ready worker 148d v1.25.7+eab9cc9
worker004-r640 Ready worker 148d v1.25.7+eab9cc9
worker005-r640 Ready worker 148d v1.25.7+eab9cc9
OS (e.g. from /etc/os-release):
Red Hat Enterprise Linux CoreOS 412.86.202303211731-0
Part of OpenShift 4.12; RHCOS is a Kubernetes-native operating system.
cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="412.86.202303211731-0"
VERSION_ID="4.12"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 412.86.202303211731-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.12/"
BUG_REPORT_URL="https://access.redhat.com/labs/rhir/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.12"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.12"
OPENSHIFT_VERSION="4.12"
RHEL_VERSION="8.6"
OSTREE_VERSION="412.86.202303211731-0"