Velero losing track of backups #7268

Closed
weakcamel opened this issue Jan 3, 2024 · 14 comments
Labels: Area/Storage/Minio (backend storage is MinIO), Needs info (waiting for information)


weakcamel commented Jan 3, 2024




What steps did you take and what happened:

I installed Velero on a Kubernetes (K3s, to be specific) cluster using the latest Helm chart.

It's installed in a non-default namespace, sm-opensource. Backups are made to an S3 bucket (on MinIO, to be specific).

After performing a couple of Velero / kubectl operations, Velero seems to lose track of existing backups. I've encountered this in several scenarios; a combination of a successful and a failed backup usually (if not always) triggers it.

One of the scenarios that seems to trigger the issue:

  • Create a new namespace for the test and a deployment:
$ kubectl create namespace velero-test
namespace/velero-test created
$ kubectl create -f https://k8s.io/examples/application/deployment.yaml -n velero-test
deployment.apps/nginx-deployment created
$ kubectl get all -n velero-test
NAME                                    READY   STATUS    RESTARTS   AGE
pod/nginx-deployment-86dcfdf4c6-m9dpb   1/1     Running   0          3s
pod/nginx-deployment-86dcfdf4c6-mtbjh   1/1     Running   0          3s

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nginx-deployment   2/2     2            2           3s

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/nginx-deployment-86dcfdf4c6   2         2         2       3s
  • Make two backups:
$ velero client config get
namespace: sm-opensource
$ velero backup-location get
NAME              PROVIDER   BUCKET/PREFIX                           PHASE       LAST VALIDATED                  ACCESS MODE   DEFAULT
cluster-backups   aws        velero-backups/velero-backups/k3s-lab   Available   2024-01-03 09:49:51 +0000 GMT   ReadWrite     true
$ velero backup create foo1
Backup request "foo1" submitted successfully.
Run `velero backup describe foo1` or `velero backup logs foo1` for more details.
$ velero backup get foo1
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo1   Completed   0        0          2024-01-03 09:49:59 +0000 GMT   29d       cluster-backups    <none>
$ velero backup create foo2
Backup request "foo2" submitted successfully.
Run `velero backup describe foo2` or `velero backup logs foo2` for more details.
$ velero backup get
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo1   Completed   0        0          2024-01-03 09:49:59 +0000 GMT   29d       cluster-backups    <none>
foo2   Completed   0        0          2024-01-03 09:50:17 +0000 GMT   29d       cluster-backups    <none>
  • Delete the resources and check the backups:
$ kubectl delete deployments.apps -n velero-test nginx-deployment
deployment.apps "nginx-deployment" deleted
$ kubectl delete namespaces velero-test
namespace "velero-test" deleted
$ velero backup get
$
  • Moreover, when I try to create a new backup with the same name, it fails, yet I can see the original backup's log again (and it's clear of any errors):
$ velero backup create foo1
Backup request "foo1" submitted successfully.
Run `velero backup describe foo1` or `velero backup logs foo1` for more details.
$ velero backup get
NAME   STATUS   ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo1   Failed   0        0          2024-01-03 09:56:56 +0000 GMT   29d       cluster-backups    <none>

$ velero backup describe foo1
Name:         foo1
Namespace:    sm-opensource
Labels:       velero.io/storage-location=cluster-backups
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.28.2+k3s1
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=28

Phase:  Failed (run `velero backup logs foo1` for more information)


Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Or label selector:  <none>

Storage Location:  cluster-backups

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          false
Data Mover:                  velero

TTL:  720h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-01-03 09:56:56 +0000 GMT
Completed:  2024-01-03 09:56:56 +0000 GMT

Expiration:  2024-02-02 09:56:56 +0000 GMT

Velero-Native Snapshots: <none included>

  • Output from velero backup logs foo1 --insecure-skip-tls-verify:

https://gist.github.com/weakcamel/bc75ac6d2de9ec0525e79294ed222a82


What did you expect to happen:

I expected a list of backups I can work with.

The following information will help us better understand what's going on:

Output from kubectl logs deployments/sm-opensource-velero -n sm-opensource:

https://gist.github.com/weakcamel/ab0756cf00e68b576543fb3280862b16

Anything else you would like to add:

Environment:

  • Velero version (use velero version): v1.12.2
$ velero version
Client:
	Version: v1.12.2
	Git commit: -
Server:
	Version: v1.12.2
  • Velero features (use velero client config get features):
$ velero client config get features
features: <NOT SET>
  • Kubernetes version (use kubectl version):
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.2+k3s1
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:

vSphere VMs

  • OS (e.g. from /etc/os-release):

Ubuntu 22.04 for the server (K3s deployment), macOS 13 for the Velero client

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"

weakcamel commented Jan 3, 2024

Since the scenario wasn't very logical to me, I experimented a little more; the problem simply happens some time after the backup, with no action needed on my part:

$ velero backup create foo4
Backup request "foo4" submitted successfully.
Run `velero backup describe foo4` or `velero backup logs foo4` for more details.
$ velero backup get
NAME   STATUS       ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   InProgress   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
$ velero backup get
NAME   STATUS       ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   InProgress   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
$ velero backup get
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   Completed   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
$ velero backup get
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   Completed   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
$ while true; do date; velero backup get; sleep 5; done
Wed  3 Jan 2024 10:54:37 GMT
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   Completed   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
Wed  3 Jan 2024 10:54:42 GMT
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   Completed   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
Wed  3 Jan 2024 10:54:47 GMT
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   Completed   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
Wed  3 Jan 2024 10:54:52 GMT
Wed  3 Jan 2024 10:54:57 GMT
Wed  3 Jan 2024 10:55:03 GMT
Wed  3 Jan 2024 10:55:08 GMT
^C

Edit: What I'm seeing in the Velero logs just before this happens is:

$ velero backup get
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo8   Completed   0        0          2024-01-03 11:34:02 +0000 GMT   29d       cluster-backups    <none>

$ kubectl logs deployments/sm-opensource-velero -f
...
# silence for several seconds...

time="2024-01-03T11:34:42Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=sm-opensource/cluster-backups controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:152"
time="2024-01-03T11:34:42Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=sm-opensource/cluster-backups controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:137"
^C

$ velero backup get
$

@weakcamel (Author):

After enabling debug logs, it appears that:

  • Velero runs the AWS plugin to sync the state of the cluster with the state of S3
  • the listing operation fails and the plugin reports that there are no remote backups
  • the backup sync controller then deletes the local Backup resources as "orphaned" and loses track of them

The backups are actually there in S3, uploaded quite successfully.
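(Side note for anyone reproducing: I believe debug logging can be turned on via the Helm chart's logLevel value; the release name below is assumed from the deployment name, so adjust to your setup:

$ helm upgrade sm-opensource vmware-tanzu/velero -n sm-opensource --reuse-values --set logLevel=debug
)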

Debug log:

time="2024-01-03T12:10:21Z" level=debug msg="waiting for RPC address" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" path=/plugins/velero-plugin-for-aws
time="2024-01-03T12:10:21Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v1.RestoreList"
time="2024-01-03T12:10:21Z" level=debug msg="no resources, skip" logSource="pkg/util/kube/periodical_enqueue_source.go:77" resource="*v1.RestoreList"
time="2024-01-03T12:10:21Z" level=debug msg="Setting log level to DEBUG" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.12.0/pkg/plugin/framework/server.go:242" pluginName=velero-plugin-for-aws
time="2024-01-03T12:10:21Z" level=debug msg="using plugin" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" version=2
time="2024-01-03T12:10:21Z" level=debug msg="plugin address" address=/tmp/plugin4058150283 backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" network=unix pluginName=velero-plugin-for-aws
time="2024-01-03T12:10:21Z" level=debug msg="waiting for stdio data" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" pluginName=stdio
time="2024-01-03T12:10:21Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=sm-opensource/cluster-backups controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:152"
time="2024-01-03T12:10:21Z" level=debug msg="Setting log level to DEBUG" backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.12.0/pkg/plugin/framework/server.go:242" pluginName=velero-plugin-for-aws
time="2024-01-03T12:10:21Z" level=debug msg="plugin address" address=/tmp/plugin695811238 backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" network=unix pluginName=velero-plugin-for-aws
time="2024-01-03T12:10:21Z" level=debug msg="using plugin" backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" version=2
time="2024-01-03T12:10:21Z" level=debug msg="waiting for stdio data" backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" pluginName=stdio
time="2024-01-03T12:10:21Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=sm-opensource/cluster-backups controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:137"
time="2024-01-03T12:10:21Z" level=debug msg="Got backups from backup store" backupCount=0 backupLocation=sm-opensource/cluster-backups controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:111"
time="2024-01-03T12:10:21Z" level=debug msg="Got backups from cluster" backupCount=1 backupLocation=sm-opensource/cluster-backups controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:124"
time="2024-01-03T12:10:21Z" level=debug msg="No backups found in the backup location that need to be synced into the cluster" backupLocation=sm-opensource/cluster-backups controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:137"
time="2024-01-03T12:10:21Z" level=debug msg="received EOF, stopping recv loop" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location err="rpc error: code = Unavailable desc = error reading from server: EOF" logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" pluginName=stdio
time="2024-01-03T12:10:21Z" level=debug msg="Deleted orphaned backup from cluster" backup=foo1 backupLocation=sm-opensource/cluster-backups controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:341"
time="2024-01-03T12:10:21Z" level=debug msg="Getting backup" backuprequest=sm-opensource/foo1 controller=backup logSource="pkg/controller/backup_controller.go:212"
time="2024-01-03T12:10:21Z" level=debug msg="backup not found" backuprequest=sm-opensource/foo1 controller=backup logSource="pkg/controller/backup_controller.go:218"
time="2024-01-03T12:10:21Z" level=debug msg="Getting Backup" backup=sm-opensource/foo1 controller=backup-finalizer logSource="pkg/controller/backup_finalizer_controller.go:90"
time="2024-01-03T12:10:21Z" level=debug msg="Unable to find Backup" backup=sm-opensource/foo1 controller=backup-finalizer logSource="pkg/controller/backup_finalizer_controller.go:94"
time="2024-01-03T12:10:21Z" level=debug msg="plugin process exited" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" path=/plugins/velero-plugin-for-aws pid=81
time="2024-01-03T12:10:21Z" level=debug msg="plugin exited" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75"
time="2024-01-03T12:10:21Z" level=debug msg="received EOF, stopping recv loop" backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync err="rpc error: code = Unavailable desc = error reading from server: EOF" logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" pluginName=stdio
time="2024-01-03T12:10:21Z" level=debug msg="plugin process exited" backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" path=/plugins/velero-plugin-for-aws pid=80
time="2024-01-03T12:10:21Z" level=debug msg="plugin exited" backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75"

@blackpiglet (Contributor):

The backup sync controller deletes the backups; by default it runs once per minute.
This shouldn't happen in the normal case.
Please check whether multiple Velero instances exist in the same Kubernetes cluster.
It looks like there is a conflict between different Velero servers.
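For example, a quick way to check across all namespaces (the label selector assumes the standard Helm chart labels; the plain grep works regardless of labels):

$ kubectl get deployments -A -l app.kubernetes.io/name=velero
$ kubectl get deployments -A | grep -i velero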

blackpiglet added the Needs info (waiting for information) label on Jan 4, 2024
blackpiglet self-assigned this on Jan 4, 2024

weakcamel commented Jan 4, 2024

Hello,

@blackpiglet thanks for the suggestion; alas, that's not it. There's only one Velero pod present.

$ kubectl get pods -A | grep velero
velero-test     nginx-deployment-86dcfdf4c6-pmfns                                1/1     Running     0          22h
velero-test     nginx-deployment-86dcfdf4c6-drmpf                                1/1     Running     0          22h
sm-opensource   sm-opensource-velero-747b7bcd8c-cwvsm                            1/1     Running     0          16h
$

velero-test is just a namespace I use to test backup/restore with (nginx instances).

I wonder: are the connection parameters exactly the same for backup upload and backup sync?

@weakcamel (Author):

Phew! I think I found where the problem came from: config.s3ForcePathStyle=true was missing from my configuration (I'm using MinIO as the storage location).

I built my config following https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/values.yaml, which doesn't say what this parameter means. Since uploads worked fine, I assumed it was safe to leave it at the default.

After stumbling across https://velero.io/docs/main/contributions/minio/#set-up-server, however, I noticed that the example does set this value to true (although without explaining why), which made me look further and find https://github.com/vmware-tanzu/velero-plugin-for-aws/blob/main/backupstoragelocation.md
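For reference, the relevant part of my Helm values now looks roughly like this (a sketch assuming the current chart's backupStorageLocation list layout; the MinIO URL is a placeholder, and region: minio follows the Velero MinIO docs):

configuration:
  backupStorageLocation:
    - name: cluster-backups
      provider: aws
      bucket: velero-backups
      prefix: velero-backups/k3s-lab
      default: true
      config:
        region: minio
        s3Url: https://minio.example.net:9000
        s3ForcePathStyle: "true"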

I'd like to suggest 2 things:

@blackpiglet (Contributor):

Which version of velero-plugin-for-aws are you using?

I found a MinIO issue reporting that AWS SDK version 2 cannot list objects correctly against MinIO:
minio/minio#12027
And the velero-plugin-for-aws plugin recently bumped the AWS SDK to version 2 on the main branch.

@weakcamel (Author):

I used 1.8.0 initially and tried 1.8.2 afterwards.

blackpiglet added the Area/Storage/Minio (backend storage is MinIO) label on Jan 5, 2024

blackpiglet commented Jan 5, 2024

I will update the documentation to give more information about this setting.

But I don't think we can align the upload and download behavior for this setting.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html
Path-style and virtual-hosted-style addressing determine how the data is served; they are not related to how the data is uploaded.
The issue arises because the MinIO server only supports path-style addressing by default, while the AWS SDK used by the Velero AWS plugin queries the data using the virtual-hosted style unless the force-path-style parameter is set.
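Concretely, for the bucket in this issue the two styles would look like this (minio.example.net stands in for the real endpoint):

# virtual-hosted style (what the SDK tries by default):
https://velero-backups.minio.example.net/velero-backups/k3s-lab/backups/foo1/...
# path style (what MinIO serves by default; forced via s3ForcePathStyle):
https://minio.example.net/velero-backups/velero-backups/k3s-lab/backups/foo1/...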

We can resolve this in two ways.


weakcamel commented Jan 5, 2024

That makes sense; if there's no logic in the AWS plugin to handle the different path styles (it's all in the SDK), then indeed not much can be done about it other than good documentation.

Thank you!

@blackpiglet (Contributor):

Documentation PR created: #7279.


vvanouytsel commented Jan 8, 2024

Interesting, I have a similar issue using the Velero plugin for AWS. I've tried both version 1.8.0 and 1.8.2 and get similar behaviour. The backup is created, but it never completes: the Velero logs indicate that the backup's status field could not be updated because the backup does not exist.

The Backup resource indeed does not exist; however, the backup data is available in the S3 bucket.
The Velero controller also logs that there is a backup in the S3 bucket that is not in the cluster and attempts to sync it, but without any result.

Edit:
It seems that my issue is somehow related to our ArgoCD. Whenever I create a syncWindow that prevents ArgoCD from reverting altered manifests back to their original state, creating a backup works without any issues.
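For context, this is roughly the kind of deny window I mean (a sketch against the ArgoCD AppProject API; the project name, application name, and schedule are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default
  namespace: argocd
spec:
  syncWindows:
    # deny automated syncs for the Velero app during the backup window,
    # so ArgoCD does not revert resources while a backup is running
    - kind: deny
      schedule: '0 2 * * *'   # cron format: daily at 02:00
      duration: 1h
      applications:
        - velero
      manualSync: true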

@blackpiglet (Contributor):

Yeah, there was a similar issue related to ArgoCD too.
vmware-tanzu/helm-charts#503

@blackpiglet (Contributor):

The related PR is merged. Closing the issue.

@weakcamel (Author):

Fantastic, many thanks; the docs now explain this very well.
