PartiallyFailed hit when restoring 2 namespaces with restic plugin in PKS env #2691

Closed
luwang-vmware opened this issue Jul 7, 2020 · 7 comments · Fixed by #3051

@luwang-vmware

luwang-vmware commented Jul 7, 2020

What steps did you take and what happened:
This issue is reliably reproducible.

  1. Create two namespaces, named ns-1g-1 and ns-20g-1.
  2. In each namespace, create a StatefulSet with a PV: one with a 1G PV and the other with a 20G PV.
  3. Write data to the 1G PV and the 20G PV separately.
  4. Annotate the two pods:
kubectl --context=pks-st-2 -n ns-1g-1 annotate pod/postgres-wl-0 backup.velero.io/backup-volumes=postgredb
kubectl --context=pks-st-2 -n ns-20g-1 annotate pod/postgres-wl-0 backup.velero.io/backup-volumes=postgredb
  5. Create the backup; it succeeds:
    velero --kubecontext=pks-st-2 backup create 1g-20g-backup --include-namespaces ns-1g-1,ns-20g-1
root@6c6fe943453d:~/velero-v1.4.0-linux-amd64/post/scale-testing# velero --kubecontext=pks-st-2  backup describe 1g-20g-backup --details
Name:         1g-20g-backup
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.17.5+vmware.1
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=17

Phase:  Completed

Namespaces:
  Included:  ns-1g-1, ns-20g-1
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1

Started:    2020-07-06 10:52:06 +0000 UTC
Completed:  2020-07-06 10:58:48 +0000 UTC

Expiration:  2020-08-05 10:52:06 +0000 UTC

Total items to be backed up:  42
Items backed up:              42

Resource List:
  apps/v1/ControllerRevision:
    - ns-1g-1/postgres-wl-64dbb9bc58
    - ns-20g-1/postgres-wl-64dbb9bc58
  apps/v1/StatefulSet:
    - ns-1g-1/postgres-wl
    - ns-20g-1/postgres-wl
  v1/ConfigMap:
    - ns-1g-1/postgres-config
    - ns-20g-1/postgres-config
  v1/Endpoints:
    - ns-1g-1/postgres
    - ns-20g-1/postgres
  v1/Event:
    - ns-1g-1/postgredb-postgres-wl-0.161f222e554d82e8
    - ns-1g-1/postgres-wl-0.161f222e3c86e6df
    - ns-1g-1/postgres-wl-0.161f222e3d377dea
    - ns-1g-1/postgres-wl-0.161f222eadd16565
    - ns-1g-1/postgres-wl-0.161f222ec8b25f83
    - ns-1g-1/postgres-wl-0.161f2232ea55931e
    - ns-1g-1/postgres-wl-0.161f2232ef6b7dbe
    - ns-1g-1/postgres-wl-0.161f2232fae56755
    - ns-1g-1/postgres-wl.161f222e3c37d11b
    - ns-1g-1/postgres-wl.161f222e3c8b9763
    - ns-20g-1/postgredb-postgres-wl-0.161f223b124de0e8
    - ns-20g-1/postgres-wl-0.161f223af892aba9
    - ns-20g-1/postgres-wl-0.161f223af9976d8d
    - ns-20g-1/postgres-wl-0.161f223b40eeb4d0
    - ns-20g-1/postgres-wl-0.161f223b5afb2680
    - ns-20g-1/postgres-wl-0.161f223fe59a01a7
    - ns-20g-1/postgres-wl-0.161f223fed9ec730
    - ns-20g-1/postgres-wl-0.161f223ff808a105
    - ns-20g-1/postgres-wl.161f223af84b3e6b
    - ns-20g-1/postgres-wl.161f223af8968619
  v1/Namespace:
    - ns-1g-1
    - ns-20g-1
  v1/PersistentVolume:
    - pvc-a80a895e-b7d1-4ac2-beca-1c753990f971
    - pvc-df29ba02-e565-472d-b4b1-01a638ae5cdf
  v1/PersistentVolumeClaim:
    - ns-1g-1/postgredb-postgres-wl-0
    - ns-20g-1/postgredb-postgres-wl-0
  v1/Pod:
    - ns-1g-1/postgres-wl-0
    - ns-20g-1/postgres-wl-0
  v1/Secret:
    - ns-1g-1/default-token-lfnkw
    - ns-20g-1/default-token-59dh2
  v1/Service:
    - ns-1g-1/postgres
    - ns-20g-1/postgres
  v1/ServiceAccount:
    - ns-1g-1/default
    - ns-20g-1/default

Velero-Native Snapshots: <none included>

Restic Backups:
  Completed:
    ns-1g-1/postgres-wl-0: postgredb
    ns-20g-1/postgres-wl-0: postgredb
  6. Delete namespaces ns-1g-1 and ns-20g-1.
  7. Restore from the backup; it hits PartiallyFailed:
    velero --kubecontext=pks-st-2 restore create 1g-20g-restore --from-backup 1g-20g-backup
root@6c6fe943453d:~/velero-v1.4.0-linux-amd64/post/scale-testing# velero --kubecontext=pks-st-2 restore describe 1g-20g-restore --details
Name:         1g-20g-restore
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  PartiallyFailed (run 'velero restore logs 1g-20g-restore' for more information)

Errors:
  Velero:  pod volume restore failed: error restoring volume: error running restic restore, cmd=restic restore --repo=s3:http://20.20.233.44:9000/velero/restic/ns-20g-1 --password-file=/tmp/velero-restic-credentials-ns-20g-1153599904 --cache-dir=/scratch/.cache/restic 6654cb34 --target=., stdout=, stderr=: error getting snapshot size: error running command, stderr=Fatal: error loading snapshot: no matching ID found
: exit status 1
  Cluster:    <none>
  Namespaces: <none>

Backup:  1g-20g-backup

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Restic Restores:
  Completed:
    ns-1g-1/postgres-wl-0: postgredb
  Failed:
    ns-20g-1/postgres-wl-0: postgredb

What did you expect to happen:
The restore should succeed.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)
Logs are uploaded here: https://gist.github.com/luwang-vmware/9bef01478e11ade628fa53f6b98f8261

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version):
    1.4.0
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
    1.17.5
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
    VCP
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@carlisia
Contributor

Hi @luwang-vmware, I'm curious whether you have tried to back up each namespace individually, and if so, how that went.

In the meantime, I'm marking this as something to investigate. It seems we might have to look through the code and piece together what the source of this problem could be.

@luwang-vmware
Author

Thanks for following up, @carlisia. I have backed up each namespace individually and it worked as expected. The reported issue is easy to reproduce. Below is my YAML file for reference. Please let me know if there is anything else I need to provide.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: postgres-sc
provisioner: kubernetes.io/vsphere-volume
parameters:
    datastore: vsanDatastore
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-config
  labels:
    app: postgres
data:
  POSTGRES_DB: db
  POSTGRES_USER: admin
  POSTGRES_PASSWORD: admin
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  ports:
  - port: 5432
    name: postgres
  clusterIP: None
  selector:
    app: postgres
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-wl
spec:
  serviceName: "postgres"
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:12.3
        envFrom:
          - configMapRef:
              name: postgres-config
        ports:
        - containerPort: 5432
          name: postgredb
        volumeMounts:
        - name: postgredb
          mountPath: /var/lib/postgresql/data
          subPath: postgres
  volumeClaimTemplates:
  - metadata:
      name: postgredb
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: postgres-sc
      resources:
        requests:
          storage: 1Gi

@nrb
Contributor

nrb commented Aug 11, 2020

@luwang-vmware In Velero v1.4.2, we increased the restic timeout as well as the default CPU/memory for the restic daemonset. This should help with larger data sets on restic. We arrived at the new values by testing with datasets of 100GB.

We've also recently added documentation for modifying the Velero and restic CPU/memory requests and limits so that restic operations can be given more resources and complete in a more timely manner for large volumes.

You can see more information about the new defaults and adjusting the values in our docs.

Can you please retry with v1.4.2 and let us know if this resolves your issue?
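
For reference, a minimal sketch of raising the restic daemonset resources with kubectl, assuming a default install (daemonset named restic in the velero namespace); the values below are illustrative, not recommendations:

# Illustrative requests/limits only; tune them for your data size.
kubectl patch daemonset restic -n velero --patch \
'{"spec":{"template":{"spec":{"containers":[{"name":"restic","resources":{"requests":{"cpu":"500m","memory":"512Mi"},"limits":{"cpu":"1000m","memory":"1024Mi"}}}]}}}}'

# The server-side restic timeout can also be raised, e.g. by adding
# --restic-timeout=240m to the velero server args (kubectl -n velero edit deployment/velero).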

@luwang-vmware
Author

Thanks @nrb. I will give Velero v1.4.2 a try. One question: what rule or principle should I follow when using velero+restic to back up a large PV, or multiple PVs (in the same namespace), in a single velero backup command? More specifically, your team tested a 100G PV with the restic memory request set to 512M (limited to 1G); if I want to use restic to back up a 1T or 2T PV, or 10 x 100G PVs, how should I calculate the memory resources restic needs?

@zubron
Contributor

zubron commented Oct 21, 2020

Hi @luwang-vmware. We don't have guidelines for calculating the amount of memory required for restic backups. Were you able to retry with v1.4.2 or the latest release (v1.5.2) with adjusted timeouts or memory limits?

@zubron
Contributor

zubron commented Oct 21, 2020

Hi @luwang-vmware! If you still have access to the environment where this issue happened, I'd like to check that the snapshot that is being restored exists in your storage location. That will help us in determining why this error is being triggered.

Can you verify that the snapshot ID of your PodVolumeRestore is 6654cb34? You can check this by running:

kubectl -n velero get podvolumerestores -lvelero.io/restore-name=1g-20g-restore -ojson | jq .items[].spec

and checking the snapshotID field.

If it matches, can you check the path /restic/ns-20g-1/snapshots/ in your storage location for a file beginning with the snapshot ID 6654cb34?

This is the snapshot that restic is attempting to restore but can't find. Thanks!
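
If it's easier, one way to list that path directly, sketched here assuming the MinIO endpoint and bucket shown in the restore error (bucket velero, prefix restic/ns-20g-1) and an aws CLI configured with the same credentials as your credentials-velero file:

# Look for an object whose name starts with the snapshot ID 6654cb34.
aws --endpoint-url http://20.20.233.44:9000 s3 ls s3://velero/restic/ns-20g-1/snapshots/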

@zubron added the "Needs info (waiting for information)" label on Oct 21, 2020
@zubron removed their assignment on Oct 21, 2020
@nrb added the "Bug" label on Oct 29, 2020
@nrb closed this as completed in #3051 on Nov 10, 2020
@SVronskiy

Hello!
The same problem here: restoring a PV from a backup with several namespaces fails with the error "error loading snapshot: no matching ID found". Restoring from a backup with one namespace works fine. Upgrading to 1.5.2 doesn't help at all. The restic initContainers have cpu: 100m, memory: 128Mi.
Thanks to @carlisia !
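
In case it helps, a sketch of raising those restore helper init container resources, assuming the velero.io/plugin-config ConfigMap mechanism described in the Velero restic docs (names and values are illustrative; check the docs for your version for the exact supported keys):

apiVersion: v1
kind: ConfigMap
metadata:
  # must live in Velero's own namespace
  name: restic-restore-action-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    velero.io/restic: RestoreItemAction
data:
  # illustrative resource values for the restore helper init container
  cpuRequest: 500m
  memRequest: 512Mi
  cpuLimit: "1"
  memLimit: 1Gi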
