PartiallyFailed hit when restoring 2 namespaces with restic plugin in PKS env #2691

Closed
luwang-vmware opened this issue Jul 7, 2020 · 7 comments · Fixed by #3051

@luwang-vmware

luwang-vmware commented Jul 7, 2020

What steps did you take and what happened:
This issue is reliably reproducible.

  1. Create two namespaces, named ns-1g-1 and ns-20g-1.
  2. In each namespace, create a StatefulSet with a PV: one with a 1G PV and the other with a 20G PV.
  3. Write data to the 1G PV and the 20G PV separately.
  4. Annotate the two pods:
kubectl --context=pks-st-2 -n ns-1g-1 annotate pod/postgres-wl-0 backup.velero.io/backup-volumes=postgredb
kubectl --context=pks-st-2 -n ns-20g-1 annotate pod/postgres-wl-0 backup.velero.io/backup-volumes=postgredb
  5. Create the backup; it succeeds:
    velero --kubecontext=pks-st-2 backup create 1g-20g-backup --include-namespaces ns-1g-1,ns-20g-1
root@6c6fe943453d:~/velero-v1.4.0-linux-amd64/post/scale-testing# velero --kubecontext=pks-st-2  backup describe 1g-20g-backup --details
Name:         1g-20g-backup
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.17.5+vmware.1
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=17

Phase:  Completed

Namespaces:
  Included:  ns-1g-1, ns-20g-1
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1

Started:    2020-07-06 10:52:06 +0000 UTC
Completed:  2020-07-06 10:58:48 +0000 UTC

Expiration:  2020-08-05 10:52:06 +0000 UTC

Total items to be backed up:  42
Items backed up:              42

Resource List:
  apps/v1/ControllerRevision:
    - ns-1g-1/postgres-wl-64dbb9bc58
    - ns-20g-1/postgres-wl-64dbb9bc58
  apps/v1/StatefulSet:
    - ns-1g-1/postgres-wl
    - ns-20g-1/postgres-wl
  v1/ConfigMap:
    - ns-1g-1/postgres-config
    - ns-20g-1/postgres-config
  v1/Endpoints:
    - ns-1g-1/postgres
    - ns-20g-1/postgres
  v1/Event:
    - ns-1g-1/postgredb-postgres-wl-0.161f222e554d82e8
    - ns-1g-1/postgres-wl-0.161f222e3c86e6df
    - ns-1g-1/postgres-wl-0.161f222e3d377dea
    - ns-1g-1/postgres-wl-0.161f222eadd16565
    - ns-1g-1/postgres-wl-0.161f222ec8b25f83
    - ns-1g-1/postgres-wl-0.161f2232ea55931e
    - ns-1g-1/postgres-wl-0.161f2232ef6b7dbe
    - ns-1g-1/postgres-wl-0.161f2232fae56755
    - ns-1g-1/postgres-wl.161f222e3c37d11b
    - ns-1g-1/postgres-wl.161f222e3c8b9763
    - ns-20g-1/postgredb-postgres-wl-0.161f223b124de0e8
    - ns-20g-1/postgres-wl-0.161f223af892aba9
    - ns-20g-1/postgres-wl-0.161f223af9976d8d
    - ns-20g-1/postgres-wl-0.161f223b40eeb4d0
    - ns-20g-1/postgres-wl-0.161f223b5afb2680
    - ns-20g-1/postgres-wl-0.161f223fe59a01a7
    - ns-20g-1/postgres-wl-0.161f223fed9ec730
    - ns-20g-1/postgres-wl-0.161f223ff808a105
    - ns-20g-1/postgres-wl.161f223af84b3e6b
    - ns-20g-1/postgres-wl.161f223af8968619
  v1/Namespace:
    - ns-1g-1
    - ns-20g-1
  v1/PersistentVolume:
    - pvc-a80a895e-b7d1-4ac2-beca-1c753990f971
    - pvc-df29ba02-e565-472d-b4b1-01a638ae5cdf
  v1/PersistentVolumeClaim:
    - ns-1g-1/postgredb-postgres-wl-0
    - ns-20g-1/postgredb-postgres-wl-0
  v1/Pod:
    - ns-1g-1/postgres-wl-0
    - ns-20g-1/postgres-wl-0
  v1/Secret:
    - ns-1g-1/default-token-lfnkw
    - ns-20g-1/default-token-59dh2
  v1/Service:
    - ns-1g-1/postgres
    - ns-20g-1/postgres
  v1/ServiceAccount:
    - ns-1g-1/default
    - ns-20g-1/default

Velero-Native Snapshots: <none included>

Restic Backups:
  Completed:
    ns-1g-1/postgres-wl-0: postgredb
    ns-20g-1/postgres-wl-0: postgredb
  6. Delete namespaces ns-1g-1 and ns-20g-1.
  7. Restore from the backup; it hits PartiallyFailed:
    velero --kubecontext=pks-st-2 restore create 1g-20g-restore --from-backup 1g-20g-backup
root@6c6fe943453d:~/velero-v1.4.0-linux-amd64/post/scale-testing# velero --kubecontext=pks-st-2 restore describe 1g-20g-restore --details
Name:         1g-20g-restore
Namespace:    velero
Labels:       <none>
Annotations:  <none>

Phase:  PartiallyFailed (run 'velero restore logs 1g-20g-restore' for more information)

Errors:
  Velero:  pod volume restore failed: error restoring volume: error running restic restore, cmd=restic restore --repo=s3:http://20.20.233.44:9000/velero/restic/ns-20g-1 --password-file=/tmp/velero-restic-credentials-ns-20g-1153599904 --cache-dir=/scratch/.cache/restic 6654cb34 --target=., stdout=, stderr=: error getting snapshot size: error running command, stderr=Fatal: error loading snapshot: no matching ID found
: exit status 1
  Cluster:    <none>
  Namespaces: <none>

Backup:  1g-20g-backup

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Restic Restores:
  Completed:
    ns-1g-1/postgres-wl-0: postgredb
  Failed:
    ns-20g-1/postgres-wl-0: postgredb

What did you expect to happen:
The restore should succeed.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)
Logs are uploaded here: https://gist.github.com/luwang-vmware/9bef01478e11ade628fa53f6b98f8261

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version):
    1.4.0
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
    1.17.5
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
    VCP
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@carlisia
Contributor

Hi @luwang-vmware, I'm curious whether you have tried to back up each namespace individually, and if so, how that went.

In the meantime, I'm marking this as something to investigate. It seems we might have to look through the code and piece together what the source of this problem could be.

@luwang-vmware
Author

Thanks for following up, @carlisia. I have backed up each namespace individually and it worked as expected. The reported issue is easy to reproduce. Below is my YAML file for reference. Please let me know if there is anything else I need to provide.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: postgres-sc
provisioner: kubernetes.io/vsphere-volume
parameters:
    datastore: vsanDatastore
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-config
  labels:
    app: postgres
data:
  POSTGRES_DB: db
  POSTGRES_USER: admin
  POSTGRES_PASSWORD: admin
---
apiVersion: v1
kind: Service
metadata:
  name: postgres
  labels:
    app: postgres
spec:
  ports:
  - port: 5432
    name: postgres
  clusterIP: None
  selector:
    app: postgres
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres-wl
spec:
  serviceName: "postgres"
  replicas: 1
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      containers:
      - name: postgres
        image: postgres:12.3
        envFrom:
          - configMapRef:
              name: postgres-config
        ports:
        - containerPort: 5432
          name: postgredb
        volumeMounts:
        - name: postgredb
          mountPath: /var/lib/postgresql/data
          subPath: postgres
  volumeClaimTemplates:
  - metadata:
      name: postgredb
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: postgres-sc
      resources:
        requests:
          storage: 1Gi

@nrb
Contributor

nrb commented Aug 11, 2020

@luwang-vmware In Velero v1.4.2, we increased the restic timeout as well as the default CPU/memory for the restic daemonset. This should help with larger data sets on restic. We arrived at the new values by testing with datasets of 100GB.

We've also recently added documentation for modifying the Velero and restic CPU/memory requests and limits so that restic operations can be given more resources and complete in a more timely manner for large volumes.

You can see more information about the new defaults and adjusting the values in our docs.

Can you please retry with v1.4.2 and let us know if this resolves your issue?
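
For reference, a minimal sketch of raising the restic daemonset resources with kubectl, assuming a default install (daemonset named restic in the velero namespace); the values below are illustrative, not recommendations:

# Illustrative requests/limits only; tune them for your data size.
kubectl patch daemonset restic -n velero --patch \
'{"spec":{"template":{"spec":{"containers":[{"name":"restic","resources":{"requests":{"cpu":"500m","memory":"512Mi"},"limits":{"cpu":"1000m","memory":"1024Mi"}}}]}}}}'

# The server-side restic timeout can also be raised, e.g. by adding
# --restic-timeout=240m to the velero server args (kubectl -n velero edit deployment/velero).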

@luwang-vmware
Author

Thanks @nrb. I will give Velero v1.4.2 a try. One question: what rule or principle should I follow when using velero+restic to back up a large PV, or multiple PVs (in the same namespace), in a single velero backup command? More specifically, your team tested a 100G PV with the restic memory request set to 512M (limited to 1G); if I want to use restic to back up a 1T or 2T PV, or 10 x 100G PVs, how should I calculate the memory resources restic needs?

@zubron
Contributor

zubron commented Oct 21, 2020

Hi @luwang-vmware. We don't have guidelines for calculating the amount of memory required for restic backups. Were you able to retry with v1.4.2 or the latest release (v1.5.2) with adjusted timeouts or memory limits?

@zubron
Contributor

zubron commented Oct 21, 2020

Hi @luwang-vmware! If you still have access to the environment where this issue happened, I'd like to check that the snapshot that is being restored exists in your storage location. That will help us in determining why this error is being triggered.

Can you verify that the snapshot ID of your PodVolumeRestore is 6654cb34? You can check this by running:

kubectl -n velero get podvolumerestores -lvelero.io/restore-name=1g-20g-restore -ojson | jq .items[].spec

and checking the snapshotID field.

If it matches, can you check the path /restic/ns-20g-1/snapshots/ in your storage location for a file beginning with the snapshot ID 6654cb34?

This is the snapshot that restic is attempting to restore but can't find. Thanks!
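
If it's easier, one way to list that path directly, sketched here assuming the MinIO endpoint and bucket shown in the restore error (bucket velero, prefix restic/ns-20g-1) and an aws CLI configured with the same credentials as your credentials-velero file:

# Look for an object whose name starts with the snapshot ID 6654cb34.
aws --endpoint-url http://20.20.233.44:9000 s3 ls s3://velero/restic/ns-20g-1/snapshots/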

@zubron added the "Needs info (waiting for information)" label on Oct 21, 2020
@zubron removed their assignment on Oct 21, 2020
@nrb added the "Bug" label on Oct 29, 2020
@nrb closed this as completed in #3051 on Nov 10, 2020
@SVronskiy

Hello!
The same problem here: restoring a PV from a backup with several namespaces fails with the error "error loading snapshot: no matching ID found". Restoring from a backup with one namespace works fine. Upgrading to 1.5.2 doesn't help at all. The restic initContainers have cpu: 100m, memory: 128Mi.
Thanks to @carlisia !
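
In case it helps, a sketch of raising those restore helper init container resources, assuming the velero.io/plugin-config ConfigMap mechanism described in the Velero restic docs (names and values are illustrative; check the docs for your version for the exact supported keys):

apiVersion: v1
kind: ConfigMap
metadata:
  # must live in Velero's own namespace
  name: restic-restore-action-config
  namespace: velero
  labels:
    velero.io/plugin-config: ""
    velero.io/restic: RestoreItemAction
data:
  # illustrative resource values for the restore helper init container
  cpuRequest: 500m
  memRequest: 512Mi
  cpuLimit: "1"
  memLimit: 1Gi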
