Velero losing track of backups #7268

Closed
weakcamel opened this issue Jan 3, 2024 · 14 comments
Labels: Area/Storage/Minio (backend storage is MinIO), Needs info (waiting for information)


weakcamel commented Jan 3, 2024




What steps did you take and what happened:

I installed Velero on a Kubernetes (K3s, to be specific) cluster using the latest Helm chart.

It's installed in a non-default namespace, sm-opensource. Backups are made to an S3 bucket (on MinIO, to be specific).

After performing a couple of Velero / kubectl operations, Velero seems to lose track of existing backups. I've encountered this in several scenarios; a combination of a successful and a failed backup usually (if not always) triggers it.

One of the scenarios that seems to trigger the issue:

  • Create a new namespace for the test and a deployment:
$ kubectl create namespace velero-test
namespace/velero-test created
$ kubectl create -f https://k8s.io/examples/application/deployment.yaml -n velero-test
deployment.apps/nginx-deployment created
$ kubectl get all -n velero-test
NAME                                    READY   STATUS    RESTARTS   AGE
pod/nginx-deployment-86dcfdf4c6-m9dpb   1/1     Running   0          3s
pod/nginx-deployment-86dcfdf4c6-mtbjh   1/1     Running   0          3s

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/nginx-deployment   2/2     2            2           3s

NAME                                          DESIRED   CURRENT   READY   AGE
replicaset.apps/nginx-deployment-86dcfdf4c6   2         2         2       3s
  • Make two backups:
$ velero client config get
namespace: sm-opensource
$ velero backup-location get
NAME              PROVIDER   BUCKET/PREFIX                           PHASE       LAST VALIDATED                  ACCESS MODE   DEFAULT
cluster-backups   aws        velero-backups/velero-backups/k3s-lab   Available   2024-01-03 09:49:51 +0000 GMT   ReadWrite     true
$ velero backup create foo1
Backup request "foo1" submitted successfully.
Run `velero backup describe foo1` or `velero backup logs foo1` for more details.
$ velero backup get foo1
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo1   Completed   0        0          2024-01-03 09:49:59 +0000 GMT   29d       cluster-backups    <none>
$ velero backup create foo2
Backup request "foo2" submitted successfully.
Run `velero backup describe foo2` or `velero backup logs foo2` for more details.
$ velero backup get
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo1   Completed   0        0          2024-01-03 09:49:59 +0000 GMT   29d       cluster-backups    <none>
foo2   Completed   0        0          2024-01-03 09:50:17 +0000 GMT   29d       cluster-backups    <none>
  • Delete the resources and check the backups:
$ kubectl delete deployments.apps -n velero-test nginx-deployment
deployment.apps "nginx-deployment" deleted
$ kubectl delete namespaces velero-test
namespace "velero-test" deleted
$ velero backup get
$
  • Moreover, when I try to create a new backup with the same name, it fails, yet I can see the original backup's log again (and it's clear of any errors):
$ velero backup create foo1
Backup request "foo1" submitted successfully.
Run `velero backup describe foo1` or `velero backup logs foo1` for more details.
$ velero backup get
NAME   STATUS   ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo1   Failed   0        0          2024-01-03 09:56:56 +0000 GMT   29d       cluster-backups    <none>

$ velero backup describe foo1
Name:         foo1
Namespace:    sm-opensource
Labels:       velero.io/storage-location=cluster-backups
Annotations:  velero.io/resource-timeout=10m0s
              velero.io/source-cluster-k8s-gitversion=v1.28.2+k3s1
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=28

Phase:  Failed (run `velero backup logs foo1` for more information)


Namespaces:
  Included:  *
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        <none>
  Cluster-scoped:  auto

Label selector:  <none>

Or label selector:  <none>

Storage Location:  cluster-backups

Velero-Native Snapshot PVs:  auto
Snapshot Move Data:          false
Data Mover:                  velero

TTL:  720h0m0s

CSISnapshotTimeout:    10m0s
ItemOperationTimeout:  4h0m0s

Hooks:  <none>

Backup Format Version:  1.1.0

Started:    2024-01-03 09:56:56 +0000 GMT
Completed:  2024-01-03 09:56:56 +0000 GMT

Expiration:  2024-02-02 09:56:56 +0000 GMT

Velero-Native Snapshots: <none included>

  • Output from velero backup logs foo1 --insecure-skip-tls-verify:

https://gist.github.com/weakcamel/bc75ac6d2de9ec0525e79294ed222a82


What did you expect to happen:

I expected a list of backups I can work with.

The following information will help us better understand what's going on:

Output from kubectl logs deployments/sm-opensource-velero -n sm-opensource:

https://gist.github.com/weakcamel/ab0756cf00e68b576543fb3280862b16

Anything else you would like to add:

Environment:

  • Velero version (use velero version): v1.12.2
$ velero version
Client:
	Version: v1.12.2
	Git commit: -
Server:
	Version: v1.12.2
  • Velero features (use velero client config get features):
$ velero client config get features
features: <NOT SET>
  • Kubernetes version (use kubectl version):
Client Version: v1.28.2
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.2+k3s1
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:

vSphere VMs

  • OS (e.g. from /etc/os-release):

Ubuntu 22.04 for the server (K3s deployment), macOS 13 for the Velero client

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"

weakcamel commented Jan 3, 2024

Since the scenario wasn't very logical to me, I experimented a little more; the problem simply happens some time after the backup, with no action needed on my part:

$ velero backup create foo4
Backup request "foo4" submitted successfully.
Run `velero backup describe foo4` or `velero backup logs foo4` for more details.
$ velero backup get
NAME   STATUS       ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   InProgress   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
$ velero backup get
NAME   STATUS       ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   InProgress   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
$ velero backup get
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   Completed   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
$ velero backup get
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   Completed   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
$ while true; do date; velero backup get; sleep 5; done
Wed  3 Jan 2024 10:54:37 GMT
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   Completed   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
Wed  3 Jan 2024 10:54:42 GMT
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   Completed   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
Wed  3 Jan 2024 10:54:47 GMT
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo4   Completed   0        0          2024-01-03 10:54:21 +0000 GMT   29d       cluster-backups    <none>
Wed  3 Jan 2024 10:54:52 GMT
Wed  3 Jan 2024 10:54:57 GMT
Wed  3 Jan 2024 10:55:03 GMT
Wed  3 Jan 2024 10:55:08 GMT
^C

Edit: What I'm seeing in the Velero logs just before this happens is:

$ velero backup get
NAME   STATUS      ERRORS   WARNINGS   CREATED                         EXPIRES   STORAGE LOCATION   SELECTOR
foo8   Completed   0        0          2024-01-03 11:34:02 +0000 GMT   29d       cluster-backups    <none>

$ kubectl logs deployments/sm-opensource-velero -f
...
# silence for several seconds...

time="2024-01-03T11:34:42Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=sm-opensource/cluster-backups controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:152"
time="2024-01-03T11:34:42Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=sm-opensource/cluster-backups controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:137"
^C

$ velero backup get
$

@weakcamel (Author):

After enabling debug logs, it appears that:

  • Velero runs the AWS plugin to sync the state of the cluster with the state of S3
  • the listing operation fails and the plugin reports that there are no remote backups
  • the backup sync controller then deletes the local Backup resources as "orphaned" and loses track of them

The backups are actually there in S3, uploaded quite successfully.
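(Side note for anyone reproducing: I believe debug logging can be turned on via the Helm chart's logLevel value; the release name below is assumed from the deployment name, so adjust to your setup:

$ helm upgrade sm-opensource vmware-tanzu/velero -n sm-opensource --reuse-values --set logLevel=debug
)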

Debug log:

time="2024-01-03T12:10:21Z" level=debug msg="waiting for RPC address" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" path=/plugins/velero-plugin-for-aws
time="2024-01-03T12:10:21Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="*v1.RestoreList"
time="2024-01-03T12:10:21Z" level=debug msg="no resources, skip" logSource="pkg/util/kube/periodical_enqueue_source.go:77" resource="*v1.RestoreList"
time="2024-01-03T12:10:21Z" level=debug msg="Setting log level to DEBUG" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.12.0/pkg/plugin/framework/server.go:242" pluginName=velero-plugin-for-aws
time="2024-01-03T12:10:21Z" level=debug msg="using plugin" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" version=2
time="2024-01-03T12:10:21Z" level=debug msg="plugin address" address=/tmp/plugin4058150283 backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" network=unix pluginName=velero-plugin-for-aws
time="2024-01-03T12:10:21Z" level=debug msg="waiting for stdio data" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" pluginName=stdio
time="2024-01-03T12:10:21Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=sm-opensource/cluster-backups controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:152"
time="2024-01-03T12:10:21Z" level=debug msg="Setting log level to DEBUG" backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync logSource="/go/pkg/mod/github.com/vmware-tanzu/velero@v1.12.0/pkg/plugin/framework/server.go:242" pluginName=velero-plugin-for-aws
time="2024-01-03T12:10:21Z" level=debug msg="plugin address" address=/tmp/plugin695811238 backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" network=unix pluginName=velero-plugin-for-aws
time="2024-01-03T12:10:21Z" level=debug msg="using plugin" backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" version=2
time="2024-01-03T12:10:21Z" level=debug msg="waiting for stdio data" backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" pluginName=stdio
time="2024-01-03T12:10:21Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=sm-opensource/cluster-backups controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:137"
time="2024-01-03T12:10:21Z" level=debug msg="Got backups from backup store" backupCount=0 backupLocation=sm-opensource/cluster-backups controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:111"
time="2024-01-03T12:10:21Z" level=debug msg="Got backups from cluster" backupCount=1 backupLocation=sm-opensource/cluster-backups controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:124"
time="2024-01-03T12:10:21Z" level=debug msg="No backups found in the backup location that need to be synced into the cluster" backupLocation=sm-opensource/cluster-backups controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:137"
time="2024-01-03T12:10:21Z" level=debug msg="received EOF, stopping recv loop" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location err="rpc error: code = Unavailable desc = error reading from server: EOF" logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" pluginName=stdio
time="2024-01-03T12:10:21Z" level=debug msg="Deleted orphaned backup from cluster" backup=foo1 backupLocation=sm-opensource/cluster-backups controller=backup-sync logSource="pkg/controller/backup_sync_controller.go:341"
time="2024-01-03T12:10:21Z" level=debug msg="Getting backup" backuprequest=sm-opensource/foo1 controller=backup logSource="pkg/controller/backup_controller.go:212"
time="2024-01-03T12:10:21Z" level=debug msg="backup not found" backuprequest=sm-opensource/foo1 controller=backup logSource="pkg/controller/backup_controller.go:218"
time="2024-01-03T12:10:21Z" level=debug msg="Getting Backup" backup=sm-opensource/foo1 controller=backup-finalizer logSource="pkg/controller/backup_finalizer_controller.go:90"
time="2024-01-03T12:10:21Z" level=debug msg="Unable to find Backup" backup=sm-opensource/foo1 controller=backup-finalizer logSource="pkg/controller/backup_finalizer_controller.go:94"
time="2024-01-03T12:10:21Z" level=debug msg="plugin process exited" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" path=/plugins/velero-plugin-for-aws pid=81
time="2024-01-03T12:10:21Z" level=debug msg="plugin exited" backup-storage-location=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-storage-location logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75"
time="2024-01-03T12:10:21Z" level=debug msg="received EOF, stopping recv loop" backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync err="rpc error: code = Unavailable desc = error reading from server: EOF" logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" pluginName=stdio
time="2024-01-03T12:10:21Z" level=debug msg="plugin process exited" backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75" path=/plugins/velero-plugin-for-aws pid=80
time="2024-01-03T12:10:21Z" level=debug msg="plugin exited" backupLocation=sm-opensource/cluster-backups cmd=/plugins/velero-plugin-for-aws controller=backup-sync logSource="pkg/plugin/clientmgmt/process/logrus_adapter.go:75"

@blackpiglet (Contributor):

The backup sync controller deletes the backups; by default it runs once per minute.
This shouldn't happen in the normal case.
Please check whether multiple Velero instances exist in the same Kubernetes cluster.
It looks like there is a conflict between different Velero servers.
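For example, a quick way to check across all namespaces (the label selector assumes the standard Helm chart labels; the plain grep works regardless of labels):

$ kubectl get deployments -A -l app.kubernetes.io/name=velero
$ kubectl get deployments -A | grep -i velero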

blackpiglet added the Needs info (waiting for information) label on Jan 4, 2024
blackpiglet self-assigned this on Jan 4, 2024

weakcamel commented Jan 4, 2024

Hello,

@blackpiglet thanks for the suggestion; alas, that's not it. There's only one Velero pod present.

$ kubectl get pods -A | grep velero
velero-test     nginx-deployment-86dcfdf4c6-pmfns                                1/1     Running     0          22h
velero-test     nginx-deployment-86dcfdf4c6-drmpf                                1/1     Running     0          22h
sm-opensource   sm-opensource-velero-747b7bcd8c-cwvsm                            1/1     Running     0          16h
$

velero-test is just a namespace I use to test backup/restore with (nginx instances).

I wonder: are the connection parameters exactly the same for backup upload and backup sync?

@weakcamel (Author):

Phew! I think I found where the problem came from: config.s3ForcePathStyle=true was missing from my configuration (I'm using MinIO as the storage location).

I built my config following https://github.com/vmware-tanzu/helm-charts/blob/main/charts/velero/values.yaml, which doesn't say what this parameter means. Since uploads worked fine, I assumed it was safe to leave it at the default.

After stumbling across https://velero.io/docs/main/contributions/minio/#set-up-server, however, I noticed that the example does set this value to true (although without explaining why), which made me look further and find https://github.com/vmware-tanzu/velero-plugin-for-aws/blob/main/backupstoragelocation.md
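For reference, the relevant part of my Helm values now looks roughly like this (a sketch assuming the current chart's backupStorageLocation list layout; the MinIO URL is a placeholder, and region: minio follows the Velero MinIO docs):

configuration:
  backupStorageLocation:
    - name: cluster-backups
      provider: aws
      bucket: velero-backups
      prefix: velero-backups/k3s-lab
      default: true
      config:
        region: minio
        s3Url: https://minio.example.net:9000
        s3ForcePathStyle: "true"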

I'd like to suggest 2 things:

@blackpiglet (Contributor):

Which version of velero-plugin-for-aws are you using?

I found a MinIO issue reporting that AWS SDK version 2 cannot list objects correctly against MinIO:
minio/minio#12027
And the velero-plugin-for-aws plugin recently bumped the AWS SDK to version 2 on the main branch.

@weakcamel (Author):

I used 1.8.0 initially and tried 1.8.2 afterwards.

blackpiglet added the Area/Storage/Minio (backend storage is MinIO) label on Jan 5, 2024

blackpiglet commented Jan 5, 2024

I will update the documentation to give more information about this setting.

But I don't think we can align the upload and download behavior for this setting.
https://docs.aws.amazon.com/AmazonS3/latest/userguide/VirtualHosting.html
Path-style and virtual-hosted-style addressing determine how the data is served; they are not related to how the data is uploaded.
The issue arises because the MinIO server only supports path-style addressing by default, while the AWS SDK used by the Velero AWS plugin queries the data using the virtual-hosted style unless the force-path-style parameter is set.
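Concretely, for the bucket in this issue the two styles would look like this (minio.example.net stands in for the real endpoint):

# virtual-hosted style (what the SDK tries by default):
https://velero-backups.minio.example.net/velero-backups/k3s-lab/backups/foo1/...
# path style (what MinIO serves by default; forced via s3ForcePathStyle):
https://minio.example.net/velero-backups/velero-backups/k3s-lab/backups/foo1/...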

We can resolve this in two ways.


weakcamel commented Jan 5, 2024

That makes sense; if there's no logic in the AWS plugin to handle the different path styles (it's all in the SDK), then indeed not much can be done about it other than good documentation.

Thank you!

@blackpiglet (Contributor):

Documentation PR created: #7279.


vvanouytsel commented Jan 8, 2024

Interesting, I have a similar issue using the Velero plugin for AWS. I've tried both version 1.8.0 and 1.8.2 and get similar behaviour. The backup is created, but it never completes: the Velero logs indicate that the backup's status field could not be updated because the backup does not exist.

The Backup resource indeed does not exist; however, the backup data is available in the S3 bucket.
The Velero controller also logs that there is a backup in the S3 bucket that is not in the cluster and attempts to sync it, but without any result.

Edit:
It seems that my issue is somehow related to our ArgoCD. Whenever I create a syncWindow that prevents ArgoCD from reverting altered manifests back to their original state, creating a backup works without any issues.
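For context, this is roughly the kind of deny window I mean (a sketch against the ArgoCD AppProject API; the project name, application name, and schedule are placeholders):

apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: default
  namespace: argocd
spec:
  syncWindows:
    # deny automated syncs for the Velero app during the backup window,
    # so ArgoCD does not revert resources while a backup is running
    - kind: deny
      schedule: '0 2 * * *'   # cron format: daily at 02:00
      duration: 1h
      applications:
        - velero
      manualSync: true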

@blackpiglet (Contributor):

Yeah, there was a similar issue related to ArgoCD too.
vmware-tanzu/helm-charts#503

@blackpiglet (Contributor):

The related PR is merged. Closing the issue.

@weakcamel (Author):

Fantastic, many thanks; the docs now explain this very well.
