High memory usage and OOM killed during maintenance tasks #7510

Open · hbollon opened this issue Mar 8, 2024 · 11 comments

hbollon commented Mar 8, 2024

Hello team, we are using Velero on a new on-premise Kubernetes platform (K3s) to back up some of our mounted PVCs with the file-system backup (FSB) feature. We have deployed Velero using the Helm chart.
We use the Kopia uploader so that we can rely on a .kopiaignore file to exclude certain paths from backups.
Backup storage is on Scaleway Object Storage; the bucket holds about ~850GB of backup data (38,723 files).

The first backup succeeds, but after that the Velero pod starts to crash-loop due to OOM kills during maintenance tasks (we have configured a 6GB memory limit for this Velero pod, which should be more than sufficient, no?).
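
For reference, here is a minimal sketch of the relevant Helm values; the key names (configuration.uploaderType and the top-level resources block) are my assumption about the vmware-tanzu/velero chart and should be checked against the chart version in use (v5.4.1 here):

    # Sketch only: value names assumed, not copied from the actual deployment
    configuration:
      uploaderType: kopia   # file-system backups go through the Kopia uploader
    resources:              # applies to the Velero server deployment, which runs repo maintenance in 1.13
      requests:
        cpu: 500m
        memory: 1Gi         # request is illustrative; only the 6Gi limit is stated in this issue
      limits:
        cpu: '1'
        memory: 6Gi         # the limit described above, still OOM-killed during maintenance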

The last logs I have before the OOM:

velero time="2024-03-08T08:42:15Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=backups/scaleway-default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:141"
velero time="2024-03-08T08:42:15Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=backups/scaleway-default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:126"
velero time="2024-03-08T08:42:17Z" level=warning msg="active indexes [xn0_ca15f0c8c09bc81fb191052050ec1965-sbcd43b3a959b33fa126-c1] deletion watermark 2024-03-07 03:23:10 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
velero time="2024-03-08T08:42:18Z" level=warning msg="active indexes [xn0_ca15f0c8c09bc81fb191052050ec1965-sbcd43b3a959b33fa126-c1] deletion watermark 2024-03-07 03:23:10 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
velero time="2024-03-08T08:42:19Z" level=info msg="Running quick maintenance..." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
velero time="2024-03-08T08:42:19Z" level=info msg="Running quick maintenance..." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
velero time="2024-03-08T08:42:19Z" level=warning msg="active indexes [xn0_ca15f0c8c09bc81fb191052050ec1965-sbcd43b3a959b33fa126-c1] deletion watermark 2024-03-07 03:23:10 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
velero time="2024-03-08T08:42:19Z" level=info msg="Finished quick maintenance." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
velero time="2024-03-08T08:42:19Z" level=info msg="Finished quick maintenance." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
velero time="2024-03-08T08:42:20Z" level=info msg="Running maintenance on backup repository" backupRepo=backups/xxx-production-scaleway-default-kopia-lsz8r logSource="pkg/controller/backup_repository_controller.go:285"
velero time="2024-03-08T08:42:21Z" level=warning msg="active indexes [xn0_00f547b5bbe0d3c63853d13cb06dc432-s69be3da756471a94126-c1 [lot of others indexes...] ] deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error

I tried to give as much context and information as possible, but if you need any other details don't hesitate to ping me; this is quite an urgent issue for us...

What did you expect to happen:

I don't think it's normal for Velero to consume this much memory within just a minute during maintenance tasks.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help.

bundle-2024-03-08-09-47-09.tar.gz

Environment:

  • Velero version (use velero version): v1.13.0
  • Velero features (use velero client config get features):
  • Velero chart version: v5.4.1
  • Kubernetes version (use kubectl version): v1.28.3
  • Cloud provider or hardware configuration: On-premise using K3S
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"

kaovilai (Contributor) commented Mar 8, 2024

Duplicate of #7291

hbollon (Author) commented Mar 8, 2024

I don't think this is really a duplicate 🤔
I already follow that feature request, which seems like a good idea.
But in my case I don't think it's normal behavior that Velero can't keep running during maintenance tasks even with a 10GB memory limit... even if this is quite a demanding operation.
It's currently a breaking behavior that prevents us from backing up our PVCs. With this issue I'm looking for hints on why Velero uses so much memory in our setup (if that is indeed abnormal), or for a workaround to mitigate the problem so we can run our backups.

Lyndon-Li (Contributor) commented

@hbollon
Not sure whether a live session is allowed on your side; we need to check several aspects of the repo, and a live session would be more efficient. If it is not allowed, please let us know and we will look for other ways to troubleshoot.

Moreover, please keep the data in the repo; we may need it for troubleshooting and fix verification, since not every environment can reproduce the issue.

hbollon (Author) commented Mar 11, 2024

Hello @Lyndon-Li
A live session is doable on our side, as long as we don't dig into the data present on the PVCs / storage bucket 😉
You can reach me on the Kubernetes Slack to organise it.

thedarkside commented

Observing this on one of our project clusters as well. The Velero pod keeps crashing with OOM. Version 1.13.0.

We have roughly 1TB of files, mostly images 5-15MB in size, so there are many files.

Current resources configured:

          resources:
            limits:
              cpu: '1'
              memory: 512Mi
            requests:
              cpu: 500m
              memory: 128Mi

Will try increasing.
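
For illustration, bumping them might look something like the sketch below; the actual values applied are not stated in the thread, so these numbers are hypothetical:

          resources:
            limits:
              cpu: '1'
              memory: 4Gi    # hypothetical increase from 512Mi; pick a value based on observed peak usage
            requests:
              cpu: 500m
              memory: 1Gi    # hypothetical increase from 128Mi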

thedarkside commented

Yep, bumping those solved it for now!

contributorr commented May 6, 2024

@Lyndon-Li @hbollon any updates on this issue? I'm experiencing a similar problem in a specific environment, where the Velero pod crashes with a 2GB memory limit but somehow works with 4GB. On the other hand, in multiple (even bigger) environments there's no need to raise the memory limit above 1GB. Is this specific to 1.13.0, and is there any chance it's fixed in 1.13.1/2? Thanks

Lyndon-Li (Contributor) commented

@hbollon @contributorr
There are multiple memory usage improvements in 1.14, which integrates the latest Kopia release. Velero 1.14 will have an RC next week; you can try the RC release and let us know the result.
The improvements should help with the problems we identified in @hbollon's environment.

@contributorr Please note that not all memory usage is irrational; depending on the state of the file system (e.g., more files, smaller files), one environment may need more memory than another.

Lyndon-Li (Contributor) commented

Is this specific to 1.13.0 - any chance it's fixed in 1.13.1/2?

No, it is not specific to 1.13. The improvements will only be in 1.14.

Lyndon-Li (Contributor) commented

@hbollon @contributorr
1.14 RC is ready: https://github.com/vmware-tanzu/velero/releases/tag/v1.14.0-rc.1. You can try it and let us know whether it improves your cases.

Lyndon-Li (Contributor) commented Jun 5, 2024

The problem in @hbollon's environment has been reproduced locally. Here are the details:

  1. The problem happens at the repo connection stage, so the high memory usage can occur for most operations, i.e., backup, restore, and maintenance.
  2. In order to control fragmentation of the repo index blobs, the Kopia repo performs index compaction periodically. In Velero 1.13 and earlier, this is done during repo connection.
  3. Index compaction needs to load all the indexes into memory, so it may consume a huge amount of memory. Memory usage is linearly correlated with the number of indexes.
  4. According to the local test, it takes up to 16GB of memory for index blobs of 750MB in size (if the files are small and random enough, this corresponds to around 21 million files); see the rough estimate below.
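
As a rough back-of-the-envelope reading of those figures (an estimate derived from the numbers above, not a measured profile):

    \frac{16\ \text{GB}}{750\ \text{MB}} \approx 21
    \quad\Rightarrow\quad
    \text{peak RAM} \approx 21 \times \text{index blob size},
    \qquad
    \frac{16\ \text{GB}}{21 \times 10^{6}\ \text{files}} \approx 760\ \text{bytes of RAM per indexed file}.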

1.14 (which integrates Kopia 0.17) doesn't solve this problem completely, but it does improve things:

  1. Kopia 0.17 performs index compaction only during maintenance, so backup and restore are not affected.
  2. Kopia 0.17 compacts indexes for one epoch per maintenance run, which makes the problem less likely to happen.
  3. Velero 1.14 has moved repo maintenance into a dedicated job, so backups/restores done by the Velero server and node-agent are not affected even when the problem happens.

The problem still happens with 1.14 when a huge number of indexes is generated in one backup or in consecutive backups within a short time (e.g., 24 hours). So there will be follow-up fixes after 1.14. The plan is to find a way to reduce the number of indexes compacted in each run, so that memory usage stays bounded.
