
Default-volumes-to-restic Stuck #2967

Closed
whitepiratebaku opened this issue Sep 23, 2020 · 24 comments
Labels
Needs info Waiting for information Needs investigation Restic Relates to the restic integration Volumes Relating to volume backup and restore

Comments

@whitepiratebaku

What steps did you take and what happened:
[A clear and concise description of what the bug is, and what commands you ran.]
I installed Velero with --use-restic (version 1.4.2) and was able to create volume backups with the opt-in approach by annotating pods. I was looking forward to the 1.5.2 release so I could use the opt-out method. I have upgraded Velero; it works normally when I create regular backups, but when I use the --default-volumes-to-restic option during backup creation the backup gets stuck "InProgress". Describe shows 765 total items, but the backed-up count is stuck at 35. The data in the volumes is only a few megabytes.
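For context, here is a minimal sketch of the two approaches being compared (pod, namespace, and backup names are placeholders):

$ # opt-in (what worked on 1.4.2): annotate each pod with the volumes to back up
$ kubectl -n <namespace> annotate pod/<pod-name> backup.velero.io/backup-volumes=<volume-name>
$ velero backup create opt-in-backup

$ # opt-out (1.5+): back up all pod volumes with restic unless they are excluded
$ velero backup create opt-out-backup --default-volumes-to-restic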

What did you expect to happen:
I expected it to back up the volumes.

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)
#2966 describe shows this
Name: full-volume21
Namespace: velero
Labels: velero.io/storage-location=default
Annotations: velero.io/source-cluster-k8s-gitversion: v1.16.3
velero.io/source-cluster-k8s-major-version: 1
velero.io/source-cluster-k8s-minor-version: 16
API Version: velero.io/v1
Kind: Backup
Metadata:
Creation Timestamp: 2020-09-23T11:55:00Z
Generation: 10
Resource Version: 1311467
Self Link: /apis/velero.io/v1/namespaces/velero/backups/full-volume21
UID: 5a609312-3fff-4d68-91d4-0b82c302ccb2
Spec:
Default Volumes To Restic: true
Hooks:
Included Namespaces:
*
Storage Location: default
Ttl: 720h0m0s
Status:
Expiration: 2020-10-23T11:55:00Z
Format Version: 1.1.0
Phase: InProgress
Progress:
Items Backed Up: 35
Total Items: 765
Start Timestamp: 2020-09-23T11:55:00Z
Version: 1
Events:

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Velero version (use velero version):
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version):
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@nrb
Contributor

nrb commented Sep 24, 2020

What does the output of kubectl logs deploy/velero -n velero show?

Also, can you supply the output of kubectl get podvolumebackups.velero.io -n velero? That should show the PodVolumeBackups that are being attempted, which is part of Velero's restic process.

@nrb nrb added the Needs info Waiting for information label Sep 24, 2020
@whitepiratebaku
Author

whitepiratebaku commented Sep 24, 2020

Note: since last time I have deleted the old backup and created a new one; its name is full-volume200.

Every time, the backup gets stuck at the 35th of the total items. I am going to share the logs around this 35th item.

Velero logs:

time="2020-09-24T17:56:26Z" level=info msg="Executing podAction" backup=velero/full-volume200 cmd=/velero logSource="pkg/backup/pod_action.go:51" pluginName=velero
time="2020-09-24T17:56:26Z" level=info msg="Done executing podAction" backup=velero/full-volume200 cmd=/velero logSource="pkg/backup/pod_action.go:77" pluginName=velero
time="2020-09-24T17:56:26Z" level=info msg="Backed up 34 items out of an estimated total of 980 (estimate will change throughout the backup)" backup=velero/full-volume200 logSource="pkg/ba
ckup/backup.go:418" name=kube-proxy-xbgzc namespace=kube-system progress= resource=pods
time="2020-09-24T17:56:26Z" level=info msg="Processing item" backup=velero/full-volume200 logSource="pkg/backup/backup.go:378" name=kube-scheduler-new-kube-master namespace=kube-system pro
gress= resource=pods
time="2020-09-24T17:56:26Z" level=info msg="Backing up item" backup=velero/full-volume200 logSource="pkg/backup/item_backupper.go:121" name=kube-scheduler-new-kube-master namespace=kube-sy
stem resource=pods
time="2020-09-24T17:56:26Z" level=info msg="Executing custom action" backup=velero/full-volume200 logSource="pkg/backup/item_backupper.go:327" name=kube-scheduler-new-kube-master namespace
=kube-system resource=pods
time="2020-09-24T17:56:26Z" level=info msg="Executing podAction" backup=velero/full-volume200 cmd=/velero logSource="pkg/backup/pod_action.go:51" pluginName=velero
time="2020-09-24T17:56:26Z" level=info msg="Done executing podAction" backup=velero/full-volume200 cmd=/velero logSource="pkg/backup/pod_action.go:77" pluginName=velero
time="2020-09-24T17:56:26Z" level=info msg="Backed up 35 items out of an estimated total of 980 (estimate will change throughout the backup)" backup=velero/full-volume200 logSource="pkg/ba
ckup/backup.go:418" name=kube-scheduler-new-kube-master namespace=kube-system progress= resource=pods
time="2020-09-24T17:56:26Z" level=info msg="Processing item" backup=velero/full-volume200 logSource="pkg/backup/backup.go:378" name=dashboard-metrics-scraper-c79c65bb7-6v26b namespace=kube
rnetes-dashboard progress= resource=pods
time="2020-09-24T17:56:26Z" level=info msg="Backing up item" backup=velero/full-volume200 logSource="pkg/backup/item_backupper.go:121" name=dashboard-metrics-scraper-c79c65bb7-6v26b namesp
ace=kubernetes-dashboard resource=pods
time="2020-09-24T17:56:26Z" level=info msg="Executing custom action" backup=velero/full-volume200 logSource="pkg/backup/item_backupper.go:327" name=dashboard-metrics-scraper-c79c65bb7-6v26
b namespace=kubernetes-dashboard resource=pods
time="2020-09-24T17:56:26Z" level=info msg="Executing podAction" backup=velero/full-volume200 cmd=/velero logSource="pkg/backup/pod_action.go:51" pluginName=velero
time="2020-09-24T17:56:26Z" level=info msg="Done executing podAction" backup=velero/full-volume200 cmd=/velero logSource="pkg/backup/pod_action.go:77" pluginName=velero
time="2020-09-24T17:56:42Z" level=info msg="Found 1 backups in the backup location that do not exist in the cluster and need to be synced" backupLocation=default controller=backup-sync log
Source="pkg/controller/backup_sync_controller.go:197"
time="2020-09-24T17:56:42Z" level=info msg="Attempting to sync backup into cluster" backup=full-volume1.2 backupLocation=default controller=backup-sync logSource="pkg/controller/backup_syn
c_controller.go:205"
time="2020-09-24T17:56:42Z" level=info msg="Successfully synced backup into cluster" backup=full-volume1.2 backupLocation=default controller=backup-sync logSource="pkg/controller/backup_sy
nc_controller.go:235"
time="2020-09-24T17:56:42Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstorageloc
ation logSource="pkg/controller/backupstoragelocation_controller.go:58"
time="2020-09-24T17:56:42Z" level=info msg="Checking for existing backup locations ready to be verified; there needs to be at least 1 backup location available" controller=backupstorageloc
ation logSource="pkg/controller/backupstoragelocation_controller.go:58"
time="2020-09-24T17:56:42Z" level=info msg="No backup locations were ready to be verified" controller=backupstoragelocation logSource="pkg/controller/backupstoragelocation_controller.go:12:

Output of kubectl get podvolumebackups -n velero:

kubectl get podvolumebackups -n velero
NAME                   AGE
full-volume1.2-8tdhw   12m
full-volume1.2-ckx7h   12m
full-volume1.2-mj8lz   12m
full-volume1.2-pfvj5   12m
full-volume1.2-v82pd   12m
full-volume1.2-zfqhp   12m
full-volume200-5sd7j   13m
full-volume200-8fm9l   13m
full-volume200-bj6wm   12m
full-volume200-chrgj   12m
full-volume200-hx4kh   13m
full-volume200-llvfs   13m
full-volume200-xqqtb   13m

@whitepiratebaku
Author

whitepiratebaku commented Sep 24, 2020

When I use --include-namespaces and specify the namespace where the PVCs are located, everything works fine.

velero create backup full-volume-test --default-volumes-to-restic --include-namespaces=test
Works fine

velero create backup full
Works fine

velero create backup full-volume-full --default-volumes-to-restic
Stuck InProgress, every time on the 35th item.

Note: only the test namespace has two PVCs; no other namespace contains a PVC. Also, only two PVs exist, and they are bound to those PVCs.

Edit: I have checked the namespaces; the problem is related to the kubernetes-dashboard namespace. When I use --default-volumes-to-restic with kubernetes-dashboard, it is stuck InProgress with 0 items backed up out of 40.
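For anyone trying to reproduce, the narrowed-down case amounts to roughly this (backup name is arbitrary):

$ velero backup create dashboard-only --default-volumes-to-restic --include-namespaces=kubernetes-dashboard
$ velero backup describe dashboard-only --details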

@nrb
Contributor

nrb commented Sep 24, 2020

For the backup that was stuck, can you run velero backup describe <backup-name> --details and provide that output?

This should report the pod volumes that Velero tried to back up with restic, and their current status.

Thanks for the summary of the different commands you tried, that's helpful in trying to narrow down this issue.

@nrb nrb added the Restic Relates to the restic integration label Sep 24, 2020
@nrb
Contributor

nrb commented Sep 24, 2020

One more thought I had - are the pod volumes you're trying to back up still annotated?

@whitepiratebaku
Author

whitepiratebaku commented Sep 25, 2020

No, they do not have annotations.

Output of "velero describe backup full-volume200 --details"
Name: full-volume200
Namespace: velero
Labels: velero.io/storage-location=default
Annotations: velero.io/source-cluster-k8s-gitversion=v1.16.3
velero.io/source-cluster-k8s-major-version=1
velero.io/source-cluster-k8s-minor-version=16

Phase: InProgress

Errors: 0
Warnings: 0

Namespaces:
Included: kubernetes-dashboard
Excluded:

Resources:
Included: *
Excluded:
Cluster-scoped: auto

Label selector:

Storage Location: default

Velero-Native Snapshot PVs: auto

TTL: 720h0m0s

Hooks:

Backup Format Version: 1.1.0

Started: 2020-09-24 23:49:45 +0400 +04
Completed: <n/a>

Expiration: 2020-10-24 23:49:45 +0400 +04

Estimated total items to be backed up: 40
Items backed up so far: 0

Resource List:

Velero-Native Snapshots:

Restic Backups:
New:
kubernetes-dashboard/dashboard-metrics-scraper-c79c65bb7-rt8tz: tmp-volume

Trimmed output of "kubectl get deploy dashboard-metrics-scraper -o yaml -n kubernetes-dashboard"

volumeMounts:
- mountPath: /tmp
  name: tmp-volume
volumes:
- emptyDir: {}
  name: tmp-volume

So it is clear this is something related to that 'tmp-volume', yes?

Note: there is no /bin/sh or /bin/bash available in the "dashboard-metrics-scraper" pods; does restic need either of them?
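If the hang really is tied to that emptyDir, one possible workaround (assuming the Velero 1.5 opt-out exclusion annotation is available) is to exclude the volume from restic on the pod template, for example:

$ # hypothetical exclusion of tmp-volume when --default-volumes-to-restic is used
$ kubectl -n kubernetes-dashboard patch deploy dashboard-metrics-scraper --type=merge \
    -p '{"spec":{"template":{"metadata":{"annotations":{"backup.velero.io/backup-volumes-excludes":"tmp-volume"}}}}}'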

@nrb
Contributor

nrb commented Sep 25, 2020

No, restic shouldn't need a shell inside the pod to work. It gets at the volume mounts via the node's hostPath.

You said that the backup works without error when you run without the default-volumes-to-restic option; by "work" here, do you mean that the restic data is present? If so, that would imply to me that the pod volumes are annotated.

Also, another thing to check here: the change to default-volumes-to-restic is not solely on the client-side. The velero server command also needs the --default-volumes-to-restic argument added in order to enable the behavior. On an upgrade, you would have to use kubectl edit deploy/velero -n velero.

I double-checked our documentation, and it appears that this is something we forgot to add. It's mentioned for new installs, but we should also mention it for upgrades.
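For the upgrade case, a sketch of what that edit amounts to (this assumes the velero server is the first container in the Deployment):

$ # append the server-side flag without opening an editor
$ kubectl -n velero patch deploy/velero --type=json \
    -p '[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--default-volumes-to-restic"}]'

And to check whether a pod volume is still opted in via annotation:

$ kubectl -n <namespace> get pod <pod-name> \
    -o jsonpath='{.metadata.annotations.backup\.velero\.io/backup-volumes}'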

@whitepiratebaku
Author

If I do kubectl edit deploy/velero -n velero and add "--default-volumes-to-restic", will this flag apply to all backups by default? I want to use --default-volumes-to-restic only for some backups, not all.

@nrb
Contributor

nrb commented Sep 30, 2020

It will apply to all backups, yes. It's required to make the client-side flag work right now, as the design goal was to make sure users who were using 100% restic had an easier path.

There may be changes we have to make to accommodate mixed use cases with this flag, but for the moment the solution is to use the opt-in approach rather than the default-volumes-to-restic flag.

@carlisia carlisia added Needs investigation Volumes Relating to volume backup and restore labels Oct 22, 2020
@ashish-amarnath
Member

ashish-amarnath commented Oct 29, 2020

@whitepiratebaku You can also pass the --default-volumes-to-restic flag to the velero backup create command to enable this feature on a per-backup basis, without passing the same flag to velero server.
Please refer to https://velero.io/docs/v1.5/restic/#using-the-opt-out-approach

@ashish-amarnath
Member

@nrb

Also, another thing to check here: the change to default-volumes-to-restic is not solely on the client-side. The velero server command also needs the --default-volumes-to-restic argument added in order to enable the behavior. On an upgrade, you would have to use kubectl edit deploy/velero -n velero.

This is not true.
When using the Velero v1.5 container image and v1.5 client binaries, users have the option of enabling default-volumes-to-restic for all Velero backups by providing this flag to velero server. However, it is possible to use this feature on a per-backup basis by passing the same flag to the velero backup create command.

The DefaultVolumesToRestic field in the BackupSpec is set from the CLI, in the velero backup create command, and can be used to override the flag value set on the server. If it is not overridden from the CLI, the flag value set on the server is copied into the backup spec field:

if request.Spec.DefaultVolumesToRestic == nil {
	request.Spec.DefaultVolumesToRestic = &c.defaultVolumesToRestic
}
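Concretely, a per-backup run without touching the server flag would look like this (backup names are arbitrary; the explicit =false form assumes the CLI accepts a boolean value for this flag):

$ # enable restic for all volumes in just this backup
$ velero backup create restic-all-volumes --default-volumes-to-restic

$ # if the server default were enabled, a single backup could still opt out
$ velero backup create no-restic-default --default-volumes-to-restic=false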

Can you also please clarify what changes you were thinking of wrt

There may be changes we have to make to accommodate mixed use cases with this flag, but for the moment the solution is to use the opt-in approach rather than the default-volumes-to-restic flag.

@ashish-amarnath
Member

@whitepiratebaku Would you be open to trying out a development build of Velero, with added instrumentation and debug messages to troubleshoot this further?

@ashish-amarnath
Member

You can also run this command to identify the pod volume that is probably not making progress.

$ kubectl -n velero get podvolumebackups -ojson | jq '.items[] |"\(.spec.pod.namespace)/\(.spec.pod.name) \(.status.phase) \(.status.progress)"'

Please run this command in a loop for about 30 minutes to watch progress. Here is a handy script you can use:

$ for i in {1..1800}; do 
    kubectl -n velero get podvolumebackups -ojson | jq '.items[] |"\(.spec.pod.namespace)/\(.spec.pod.name) \(.status.phase) \(.status.progress)"' > pvb-progress-$i.txt;
    sleep 1;
done

Further, you can also use the restic metrics to identify whether there is a particular node in your cluster where restic backups may be slowing the backup to a halt.
If you have Prometheus and Grafana installed in your cluster, you can inspect the restic_pod_volume_backup_dequeue_count and restic_pod_volume_backup_enqueue_count metrics for each node. If these metrics are skewed for a particular node, then inspect the restic_restic_operation_latency_seconds gauge metric to confirm that restic operations on that node have been running slow.

If you don't have Prometheus and Grafana installed in your cluster, you can dump all the restic metrics from the restic daemonset into a file and share that with us for troubleshooting.

To do this:

  1. Set up port-forwarding on the restic daemonset in a terminal window:
    $ kubectl -n velero port-forward ds/restic 8085:8085
  2. While the port-forward is running in one terminal, use curl in a different terminal to dump all the metrics into a file:
    $ curl http://localhost:8085/metrics | grep -E '^restic_' > restic-metrics-2967.txt
    and share the restic-metrics-2967.txt file.

NOTE: If kubectl -n velero port-forward ds/restic 8085:8085 doesn't collect metrics from all the restic daemonset pods, please repeat steps 1 and 2 above for every restic pod.
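If you do need to hit each restic pod individually, a rough loop along these lines should work (it assumes the restic pods carry the name=restic label and that local port 8085 is free):

$ for pod in $(kubectl -n velero get pods -l name=restic -o name); do
    kubectl -n velero port-forward "$pod" 8085:8085 &
    pf=$!; sleep 2;
    curl -s http://localhost:8085/metrics | grep -E '^restic_' > "restic-metrics-${pod##*/}.txt";
    kill "$pf"; wait "$pf" 2>/dev/null;
  done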

@carlisia
Contributor

This issue has been inactive for a while so I'm closing it.

@talha0324

Hey, I'm facing the same behavior on Velero 1.5.3. Did anyone find a working solution?

@andyg54321

andyg54321 commented Mar 22, 2021

We have also run into this in 1.5.3. It seems to be related to the projected volume type: projected volumes seem to fail, and some hang. Perhaps this volume type should be ignored like the other unsupported volume types.

@muzakh

muzakh commented Apr 22, 2021

I have the same issue where --default-volumes-to-restic gets stuck when I use it with velero create backup bkp-name --include-namespaces namespace-name.

It gets stuck forever and never completes. I am using Velero 1.6. I even tried passing the same --default-volumes-to-restic parameter to the velero install command along with the --use-restic flag, but it failed with an unknown flag error.

Has anyone witnessed the same behaviour?

I am trying this locally with a single-node k8s cluster. Without the --default-volumes-to-restic flag the backup succeeds, but it doesn't back up the contents of the pod volumes.

@irizzant

I have the same issue since I updated to 1.6.

@QcFe

QcFe commented May 18, 2021

I'm having the same issue; it seems to get stuck on kube-system/cilium-gtg9n: hubble-tls, or at least that's how it looks to me.

Current setup:

  • fake cluster made of 2 virtual machines set up with kubeadm
  • cilium as CNI provider
  • minio running on a third vm
  • ceph-rook example/testing setup with a volume on ceph-rook-block (that gets backed up correctly it seems)

Here's the full description of the backup:

netlab@xubuntu-base:~/velero_tests/velero_release/velero-v1.6.0-linux-amd64$ ./velero backup describe fullbktest --details
Name:         fullbktest
Namespace:    velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.19.11
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=19

Phase:  InProgress

Errors:    0
Warnings:  0

Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Storage Location:  default

Velero-Native Snapshot PVs:  auto

TTL:  720h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2021-05-18 21:48:10 +0000 UTC
Completed:  <n/a>

Expiration:  2021-06-17 21:48:10 +0000 UTC

Estimated total items to be backed up:  575
Items backed up so far:                 3

Resource List:  <backup resource list not found>

Velero-Native Snapshots: <none included>

Restic Backups:
  Completed:
    default/nginx-deployment-66689547d-d4wqj: nginx-logs
  New:
    kube-system/cilium-gtg9n: hubble-tls

@ibot3

ibot3 commented Sep 26, 2021

I have the same issue with 1.6.3.

@talha0324

@ibot3 if you can use CSI it would be a great relief

@ibot3

ibot3 commented Sep 26, 2021

The problem occurs as soon as kube-system is in the included namespaces.

I can't use CSI, because my provider does not support snapshots.

@jochbru

jochbru commented Oct 13, 2021

Same issue here; it only occurs when kube-system is included.

@timbuchwaldt

This occurs when daemonset pods are missing on nodes that run pods to be backed up; adding tolerations fixed it for me.
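For reference, a rough sketch of that fix (this broad toleration lets the restic daemonset schedule onto tainted nodes such as control-plane nodes; note that a merge patch like this replaces any tolerations already set, so tighten it to the specific taints in your cluster):

$ kubectl -n velero patch ds/restic --type=merge \
    -p '{"spec":{"template":{"spec":{"tolerations":[{"operator":"Exists"}]}}}}'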
