
Restore fails to complete with error on AzureFile volumes #3027

Closed
sivaramsk opened this issue Oct 21, 2020 · 14 comments · Fixed by #3051
Labels
Area/Cloud/Azure · Bug · Needs info (Waiting for information) · Needs investigation · Restic (Relates to the restic integration)

Comments


sivaramsk commented Oct 21, 2020

What steps did you take and what happened:

  1. Installed MinIO with the Azure gateway.
  2. Installed Velero with the AWS plugin pointing at the MinIO endpoint, along with the --use-restic and --default-volumes-to-restic flags (a sketch of the install command is shown after the error output below).
  3. Triggered a backup (excluding the velero and kube-system namespaces). The backup was successful and the podvolumebackup resources did not show any errors for the volumes I wanted to back up (for example, "volume: datadir"). The podvolumebackup output is attached to the gist.
  4. The restore of the above backup got stuck after a point. The podvolumerestore resources had the following error for the volumes using the azurefile storage class. The podvolumerestore output is attached to the gist.
message: |-
        error restoring volume: error running restic restore, cmd=restic restore --repo=s3:http://minio-server.minio.svc:9000/velero/restic/org2-net --password-file=/tmp/velero-restic-credentials-org2-net122979599 --cache-dir=/scratch/.cache/restic 54f54e51 --target=., stdout=, stderr=: error getting snapshot size: error running command, stderr=Fatal: error loading snapshot: no matching ID found
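
A minimal sketch of the install command from step 2 (the plugin version, bucket name, and credentials file are assumptions; the s3Url matches the MinIO endpoint in the error above):

    velero install \
      --provider aws \
      --plugins velero/velero-plugin-for-aws:v1.1.0 \
      --bucket velero \
      --secret-file ./credentials-velero \
      --use-volume-snapshots=false \
      --backup-location-config region=minio,s3ForcePathStyle="true",s3Url=http://minio-server.minio.svc:9000 \
      --use-restic \
      --default-volumes-to-restic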

What did you expect to happen:
The restore to complete successfully.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

https://gist.github.com/sivaramsk/62ce61275335de296989378482f5aeb4

Anything else you would like to add:
The storage class was patched with the nouser_xattr mount option before the backup.
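
A minimal sketch of such a patch (the storage class name azurefile is a placeholder, and it assumes the storage class already has a mountOptions array):

    kubectl patch storageclass/azurefile \
      --type json \
      --patch '[{"op":"add","path":"/mountOptions/-","value":"nouser_xattr"}]'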

Environment:

  • Velero version (use velero version):
    Client:
    Version: v1.5.1
    Git commit: -
    Server:
    Version: v1.5.1

  • Velero features (use velero client config get features): features:

  • Kubernetes version (use kubectl version):
    Client Version: 1.18.8
    Server Version: 1.17.11

  • Kubernetes installer & version: AKS, created with terraform aks modules

  • Cloud provider or hardware configuration: AKS

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@dsu-igeek dsu-igeek self-assigned this Oct 21, 2020
@dsu-igeek dsu-igeek added the 'Needs info (Waiting for information)' label Oct 21, 2020
@dsu-igeek (Contributor)

This error is in the backup-logs.txt:

lookup minio-server.minio.svc: no such host

There's some issue with resolving the MinIO service. Can you verify that this is the correct service name and that it is resolvable from the Velero deployment?
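
For example, DNS resolution can be checked from inside the cluster with something like the following (the busybox image and tag are assumptions):

    kubectl run -it --rm dns-test --image=busybox:1.28 --restart=Never -- \
      nslookup minio-server.minio.svc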

@dsu-igeek dsu-igeek removed their assignment Oct 21, 2020

sivaramsk commented Oct 21, 2020

I confirmed the address minio-server.minio.svc is reachable from a different pod. I see the same error in the backup logs, but the backup is at least partially successful and I can see the files in my blob container, so I assumed the error was momentary and cleared up later. Also, the actual error in the restore is "error loading snapshot: no matching ID found", which means the MinIO server is at least reachable, right?

@dsu-igeek (Contributor)

OK, so that appears to be a Kubernetes DNS resolution issue. Please see if you can get a successful restore. Closing for now; if the problem continues, please re-open and we can help you troubleshoot it.


sivaramsk commented Oct 21, 2020

As I commented, this is not a DNS issue. The actual error in the restore logs is "error loading snapshot: no matching ID found". The address minio-server.minio.svc is reachable within the cluster, and I assume the error log in the backup is transient; also, this is not a backup issue.


zubron commented Oct 21, 2020

The same error is being reported in #2691 and #2956. It's an error reported by restic. I'm investigating #2691 further and will update here if I find anything I think is relevant.

@zubron zubron self-assigned this Oct 21, 2020

zubron commented Oct 21, 2020

Hi @sivaramsk. I'd like to check that the snapshot being restored exists in your storage location; that will help us determine why this error is being triggered.

Can you check the path /restic/ordorg-net/snapshots/ in your storage location for a file beginning with the snapshot ID d4c5e980? That is the snapshot that restic is attempting to restore but can't find. Thanks!

@zubron zubron reopened this Oct 21, 2020
@zubron zubron removed their assignment Oct 21, 2020

sivaramsk commented Oct 22, 2020

I can confirm the snapshot file does exist at /restic/ordorg-net/snapshots. Please see the attached image. Let me know if any other information would help with this issue.

Screenshot 2020-10-22 at 6 51 57 AM


sivaramsk commented Oct 22, 2020

One thing I did not mention in the original description: after the backup completed, I cleared the resources running in the cluster (deleted the deployments, statefulsets, PVs, PVCs, etc.) and then did the restore in the same Kubernetes cluster. Would the restic snapshot continue to work even after the source data volume is deleted?

@zubron zubron added the 'Restic (Relates to the restic integration)' and 'Needs investigation' labels and removed the 'Needs info (Waiting for information)' label Oct 22, 2020

zubron commented Oct 22, 2020

Hi @sivaramsk. Thanks for providing that information and extra context! Removing those resources should not be an issue as the restic backup should be in your configured backup storage location.

@zubron zubron self-assigned this Oct 22, 2020
@ashish-amarnath (Member)

@zubron I am not sure where you got the snapshot ID d4c5e980 from.
The original error in the issue is
https://gist.github.com/sivaramsk/62ce61275335de296989378482f5aeb4#file-podvolumerestore-txt-L146-L148

error restoring volume: error running restic restore, cmd=restic restore --repo=s3:http://minio-server.minio.svc:9000/velero/restic/org2-net --password-file=/tmp/velero-restic-credentials-org2-net122979599 --cache-dir=/scratch/.cache/restic 54f54e51 --target=., stdout=, stderr=: error getting snapshot size: error running command, stderr=Fatal: error loading snapshot: no matching ID found

The snapshot ID causing the restore to fail is 54f54e51.
@sivaramsk Can you please navigate to /restic/org2-net/snapshots/ and look for a snapshot with ID 54f54e51?
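
If you have CLI access to the bucket, something along these lines should show whether that snapshot object exists (a sketch; it assumes the AWS CLI is run from somewhere that can resolve the in-cluster MinIO endpoint, and the bucket name velero is taken from the repo URL in the error):

    aws --endpoint-url http://minio-server.minio.svc:9000 \
      s3 ls s3://velero/restic/org2-net/snapshots/ | grep 54f54e51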

@ashish-amarnath ashish-amarnath added the 'Needs info (Waiting for information)' label and removed the 'Info Received' label Oct 29, 2020
@sivaramsk (Author)

Unfortunately I don't have the setup as of now, but will try to reproduce it again and update.


zubron commented Oct 29, 2020

@ashish-amarnath You're right, I messed up there, sorry 😞 There were a number of restores that failed in this particular setup and I don't know how I managed to copy the details for a pod volume restore that succeeded 🤦‍♀️

Looking through the list of PodVolumeRestores again, though, I noticed something suspicious for all the snapshots that couldn't be found (82e47c5a, 54f54e51, and a99dd83b): they are being used to restore the correct pod and volume name but in the wrong namespace.

For example:

Snapshot ID 54f54e51 was created for the volume certificates for pod peer0-0 in namespace org1-net. It is successfully used to restore that pod volume in that namespace; however, it is also used to attempt to restore the same volume name for the same pod name in the other namespace (org2-net/peer0-0, volume certificates), which is the failed restore that we see. That restore should be using snapshot ID fca27581.

This is the case for all the failed restores showing the no matching ID found error. The snapshot IDs in question are being used to restore a pod volume in the wrong namespace.

Snapshot ID 82e47c5a is used twice for two pod volume restores but in the restore where it is being used correctly to restore its associated pod volume it fails for a different reason.

This is definitely a bug in Velero: it is attempting to use the wrong snapshot ID for certain PodVolumeRestores.
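
One way to spot this pattern is to list the PodVolumeRestores with the pod namespace, pod name, volume, and snapshot ID side by side (a sketch; the column paths assume the PodVolumeRestore spec layout in Velero v1.5):

    kubectl -n velero get podvolumerestores -o custom-columns=NAME:.metadata.name,POD_NS:.spec.pod.namespace,POD:.spec.pod.name,VOLUME:.spec.volume,SNAPSHOT:.spec.snapshotID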

@zubron zubron added the Bug label Oct 29, 2020

zubron commented Oct 29, 2020

I took another look at this and found where it is failing.

We call GetVolumeBackupsForPod from RestorePodVolumes. When we call that function, we pass in all the pod volume backups for a restore. GetVolumeBackupsForPod only checks that the pod names match, not that the namespaces match. This means that, depending on the order of the PodVolumeBackups passed into GetVolumeBackupsForPod, we may choose the incorrect volumes for one of the pods that share the same name.

In the case where there are pods with the same name but differently named volumes, the restic restore will fail because it will attempt to mount, in the restic init container, all volumes that match a given pod name, even if they belong to a pod in a different namespace: https://gist.github.com/zubron/dce63c10b5aea026f988b8a14d2934c5


ashish-amarnath commented Nov 2, 2020

@zubron Thanks for debugging this. I've put up PR #3051 with the fix, PTAL.

@ashish-amarnath ashish-amarnath self-assigned this Nov 2, 2020
@ashish-amarnath ashish-amarnath added this to the v1.6.0 milestone Nov 2, 2020
@nrb nrb added this to To do in v1.6.0 Nov 2, 2020
@ashish-amarnath ashish-amarnath moved this from To do to In progress in v1.6.0 Nov 3, 2020
@nrb nrb closed this as completed in #3051 Nov 10, 2020
v1.6.0 automation moved this from In progress to Done Nov 10, 2020