
Restore fails to complete with error on AzureFile volumes #3027

Closed
sivaramsk opened this issue Oct 21, 2020 · 14 comments · Fixed by #3051
Labels
Area/Cloud/Azure · Bug · Needs info (Waiting for information) · Needs investigation · Restic (Relates to the restic integration)

Comments


sivaramsk commented Oct 21, 2020

What steps did you take and what happened:

  1. Installed MinIO with the Azure gateway.
  2. Installed Velero with the AWS plugin pointing at the MinIO endpoint, along with the --use-restic and --default-volumes-to-restic flags (a sketch of the install command is shown after the error output below).
  3. Triggered a backup (excluding the velero and kube-system namespaces). The backup was successful and the podvolumebackup resources did not show any errors for the volumes I wanted to back up (for example, "volume: datadir"). The podvolumebackup output is attached to the gist.
  4. The restore of the above backup got stuck after a point. The podvolumerestore resources had the following error for the volumes using the azurefile storage class. The podvolumerestore output is attached to the gist.
message: |-
        error restoring volume: error running restic restore, cmd=restic restore --repo=s3:http://minio-server.minio.svc:9000/velero/restic/org2-net --password-file=/tmp/velero-restic-credentials-org2-net122979599 --cache-dir=/scratch/.cache/restic 54f54e51 --target=., stdout=, stderr=: error getting snapshot size: error running command, stderr=Fatal: error loading snapshot: no matching ID found
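
A minimal sketch of the install command from step 2 (the plugin version, bucket name, and credentials file are assumptions; the s3Url matches the MinIO endpoint in the error above):

    velero install \
      --provider aws \
      --plugins velero/velero-plugin-for-aws:v1.1.0 \
      --bucket velero \
      --secret-file ./credentials-velero \
      --use-volume-snapshots=false \
      --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio-server.minio.svc:9000 \
      --use-restic \
      --default-volumes-to-restic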

What did you expect to happen:
The restore to complete successfully.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

https://gist.github.com/sivaramsk/62ce61275335de296989378482f5aeb4

Anything else you would like to add:
The storage class was patched with the nouser_xattr mount option before the backup.
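
A minimal sketch of such a patch (the storage class name azurefile is a placeholder, and it assumes the storage class already has a mountOptions array):

    kubectl patch storageclass/azurefile \
      --type json \
      --patch '[{"op":"add","path":"/mountOptions/-","value":"nouser_xattr"}]'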

Environment:

  • Velero version (use velero version):
    Client:
    Version: v1.5.1
    Git commit: -
    Server:
    Version: v1.5.1

  • Velero features (use velero client config get features): features:

  • Kubernetes version (use kubectl version):
    Client Version: 1.18.8
    Server Version: 1.17.11

  • Kubernetes installer & version: AKS, created with terraform aks modules

  • Cloud provider or hardware configuration: AKS

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@dsu-igeek dsu-igeek self-assigned this Oct 21, 2020
@dsu-igeek dsu-igeek added the 'Needs info (Waiting for information)' label Oct 21, 2020
@dsu-igeek (Contributor)

This error is in the backup-logs.txt:

lookup minio-server.minio.svc: no such host

There's some issue with resolving the MinIO service. Can you verify that this is the correct service name and that it is resolvable from the Velero deployment?
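
For example, DNS resolution can be checked from inside the cluster with something like the following (the busybox image and tag are assumptions):

    kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
      nslookup minio-server.minio.svc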

@dsu-igeek dsu-igeek removed their assignment Oct 21, 2020

sivaramsk commented Oct 21, 2020

I confirmed the address minio-server.minio.svc is reachable from a different pod. I see the same error in the backup logs, but the backup is at least partially successful and I can see the files in my blob container, so I assumed the error was momentary and cleared up later. Also, the actual error in the restore is "error loading snapshot: no matching ID found", which means the MinIO server is at least reachable, right?

@dsu-igeek (Contributor)

OK, so that appears to be a Kubernetes DNS resolution issue. Please see if you can get a successful restore. Closing for now; if the problem continues, please re-open and we can help you troubleshoot it.


sivaramsk commented Oct 21, 2020

As I commented, this is not a DNS issue. The actual error in the restore logs is "error loading snapshot: no matching ID found". The address minio-server.minio.svc is reachable within the cluster, and I assume the error log in the backup is transient; also, this is not a backup issue.


zubron commented Oct 21, 2020

The same error is being reported in #2691 and #2956. It's an error reported by restic. I'm investigating #2691 further and will update here if I find anything I think is relevant.

@zubron zubron self-assigned this Oct 21, 2020

zubron commented Oct 21, 2020

Hi @sivaramsk. I'd like to check that the snapshot being restored exists in your storage location; that will help us determine why this error is being triggered.

Can you check the path /restic/ordorg-net/snapshots/ in your storage location for a file beginning with the snapshot ID d4c5e980? That is the snapshot that restic is attempting to restore but can't find. Thanks!

@zubron zubron reopened this Oct 21, 2020
@zubron zubron removed their assignment Oct 21, 2020

sivaramsk commented Oct 22, 2020

I can confirm the snapshot file does exist at /restic/ordorg-net/snapshots. Please see the attached image. Let me know if any other information would help with this issue.

Screenshot 2020-10-22 at 6 51 57 AM


sivaramsk commented Oct 22, 2020

One thing I did not mention in the original description: after the backup completed, I cleared the resources running in the cluster (deleted the deployments, statefulsets, PVs, PVCs, etc.) and then did the restore in the same Kubernetes cluster. Would the restic snapshot continue to work even after the source data volume is deleted?

@zubron zubron added the 'Restic (Relates to the restic integration)' and 'Needs investigation' labels and removed the 'Needs info (Waiting for information)' label Oct 22, 2020

zubron commented Oct 22, 2020

Hi @sivaramsk. Thanks for providing that information and extra context! Removing those resources should not be an issue as the restic backup should be in your configured backup storage location.

@zubron zubron self-assigned this Oct 22, 2020
@ashish-amarnath (Member)

@zubron I am not sure where you got the snapshot ID d4c5e980 from.
The original error in the issue is
https://gist.github.com/sivaramsk/62ce61275335de296989378482f5aeb4#file-podvolumerestore-txt-L146-L148

error restoring volume: error running restic restore, cmd=restic restore --repo=s3:http://minio-server.minio.svc:9000/velero/restic/org2-net --password-file=/tmp/velero-restic-credentials-org2-net122979599 --cache-dir=/scratch/.cache/restic 54f54e51 --target=., stdout=, stderr=: error getting snapshot size: error running command, stderr=Fatal: error loading snapshot: no matching ID found

The snapshot ID causing the restore to fail is 54f54e51.
@sivaramsk Can you please navigate to /restic/org2-net/snapshots/ and look for a snapshot with ID 54f54e51?
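
If you have CLI access to the bucket, something along these lines should show whether that snapshot object exists (a sketch; it assumes the AWS CLI is run from somewhere that can resolve the in-cluster MinIO endpoint, and the bucket name velero is taken from the repo URL in the error):

    aws --endpoint-url http://minio-server.minio.svc:9000 \
      s3 ls s3://velero/restic/org2-net/snapshots/ | grep 54f54e51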

@ashish-amarnath ashish-amarnath added the 'Needs info (Waiting for information)' label and removed the 'Info Received' label Oct 29, 2020
@sivaramsk (Author)

Unfortunately I don't have the setup as of now, but will try to reproduce it again and update.


zubron commented Oct 29, 2020

@ashish-amarnath You're right, I messed up there, sorry 😞 There were a number of restores that failed in this particular setup and I don't know how I managed to copy the details for a pod volume restore that succeeded 🤦‍♀️

Looking through the list of PodVolumeRestores again, though, I noticed something suspicious for all the snapshots that couldn't be found (82e47c5a, 54f54e51, and a99dd83b): they are being used to restore the correct pod and volume name but in the wrong namespace.

For example:

Snapshot ID 54f54e51 was created for the volume certificates for pod peer0-0 in namespace org1-net. It is successfully used to restore that pod volume in that namespace; however, it is also used to attempt to restore the same volume name for the same pod name in the other namespace (org2-net/peer0-0, volume certificates), which is the failed restore that we see. That restore should be using snapshot ID fca27581.

This is the case for all the failed restores showing the no matching ID found error. The snapshot IDs in question are being used to restore a pod volume in the wrong namespace.

Snapshot ID 82e47c5a is used twice for two pod volume restores but in the restore where it is being used correctly to restore its associated pod volume it fails for a different reason.

This is definitely a bug in Velero: it is attempting to use the wrong snapshot ID for certain PodVolumeRestores.
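
One way to spot this pattern is to list the PodVolumeRestores with the pod namespace, pod name, volume, and snapshot ID side by side (a sketch; the column paths assume the PodVolumeRestore spec layout in Velero v1.5):

    kubectl -n velero get podvolumerestores -o custom-columns=NAME:.metadata.name,POD_NS:.spec.pod.namespace,POD:.spec.pod.name,VOLUME:.spec.volume,SNAPSHOT:.spec.snapshotID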

@zubron zubron added the Bug label Oct 29, 2020

zubron commented Oct 29, 2020

I took another look at this and found where it is failing.

We call GetVolumeBackupsForPod from RestorePodVolumes. When we call that function, we pass in all the pod volume backups for a restore. GetVolumeBackupsForPod only checks that the pod names match, not that the namespaces match. This means that, depending on the order of the PodVolumeBackups passed into GetVolumeBackupsForPod, we may choose the incorrect volumes for one of the pods that share the same name.

In the case where there are pods with the same name but differently named volumes, the restic restore will fail because it will attempt to mount, in the restic init container, all volumes that match a given pod name, even if they belong to a pod in a different namespace: https://gist.github.com/zubron/dce63c10b5aea026f988b8a14d2934c5


ashish-amarnath commented Nov 2, 2020

@zubron Thanks for debugging this. I've put up PR #3051 with the fix, PTAL.

@ashish-amarnath ashish-amarnath self-assigned this Nov 2, 2020
@ashish-amarnath ashish-amarnath added this to the v1.6.0 milestone Nov 2, 2020
@nrb nrb added this to To do in v1.6.0 Nov 2, 2020
@ashish-amarnath ashish-amarnath moved this from To do to In progress in v1.6.0 Nov 3, 2020
@nrb nrb closed this as completed in #3051 Nov 10, 2020
v1.6.0 automation moved this from In progress to Done Nov 10, 2020