High memory usage and OOM killed during maintenance tasks #7510

Open · hbollon opened this issue Mar 8, 2024 · 11 comments

hbollon commented Mar 8, 2024

Hello team, we are using Velero on a new on-premise Kubernetes platform (K3s) to back up some of our mounted PVCs with the file-system backup (FSB) feature. We have deployed Velero using the Helm chart.
We use the Kopia uploader so that we can rely on a .kopiaignore file to exclude certain paths from backups.
Backup storage is on Scaleway Object Storage; the bucket holds about ~850GB of backup data (38,723 files).

The first backup succeeds, but after that the Velero pod starts to crash-loop due to OOM kills during maintenance tasks (we have configured a 6GB memory limit for this Velero pod, which should be more than sufficient, no?).
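
For reference, here is a minimal sketch of the relevant Helm values; the key names (configuration.uploaderType and the top-level resources block) are my assumption about the vmware-tanzu/velero chart and should be checked against the chart version in use (v5.4.1 here):

    # Sketch only: value names assumed, not copied from the actual deployment
    configuration:
      uploaderType: kopia   # file-system backups go through the Kopia uploader
    resources:              # applies to the Velero server deployment, which runs repo maintenance in 1.13
      requests:
        cpu: 500m
        memory: 1Gi         # request is illustrative; only the 6Gi limit is stated in this issue
      limits:
        cpu: '1'
        memory: 6Gi         # the limit described above, still OOM-killed during maintenance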

The last logs I have before the OOM:

velero time="2024-03-08T08:42:15Z" level=info msg="Validating BackupStorageLocation" backup-storage-location=backups/scaleway-default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:141"
velero time="2024-03-08T08:42:15Z" level=info msg="BackupStorageLocations is valid, marking as available" backup-storage-location=backups/scaleway-default controller=backup-storage-location logSource="pkg/controller/backup_storage_location_controller.go:126"
velero time="2024-03-08T08:42:17Z" level=warning msg="active indexes [xn0_ca15f0c8c09bc81fb191052050ec1965-sbcd43b3a959b33fa126-c1] deletion watermark 2024-03-07 03:23:10 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
velero time="2024-03-08T08:42:18Z" level=warning msg="active indexes [xn0_ca15f0c8c09bc81fb191052050ec1965-sbcd43b3a959b33fa126-c1] deletion watermark 2024-03-07 03:23:10 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
velero time="2024-03-08T08:42:19Z" level=info msg="Running quick maintenance..." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
velero time="2024-03-08T08:42:19Z" level=info msg="Running quick maintenance..." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
velero time="2024-03-08T08:42:19Z" level=warning msg="active indexes [xn0_ca15f0c8c09bc81fb191052050ec1965-sbcd43b3a959b33fa126-c1] deletion watermark 2024-03-07 03:23:10 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error
velero time="2024-03-08T08:42:19Z" level=info msg="Finished quick maintenance." logModule=kopia/maintenance logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
velero time="2024-03-08T08:42:19Z" level=info msg="Finished quick maintenance." logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:94" logger name="[shared-manager]"
velero time="2024-03-08T08:42:20Z" level=info msg="Running maintenance on backup repository" backupRepo=backups/xxx-production-scaleway-default-kopia-lsz8r logSource="pkg/controller/backup_repository_controller.go:285"
velero time="2024-03-08T08:42:21Z" level=warning msg="active indexes [xn0_00f547b5bbe0d3c63853d13cb06dc432-s69be3da756471a94126-c1 [lot of others indexes...] ] deletion watermark 0001-01-01 00:00:00 +0000 UTC" logModule=kopia/kopia/format logSource="pkg/kopia/kopia_log.go:101" logger name="[index-blob-manager]" sublevel=error

I tried to give as much context and information as possible, but if you need any other details don't hesitate to ping me; this is quite an urgent issue for us...

What did you expect to happen:

I don't think it's normal for Velero to consume this much memory within just a minute during maintenance tasks.

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help.

bundle-2024-03-08-09-47-09.tar.gz

Environment:

  • Velero version (use velero version): v1.13.0
  • Velero features (use velero client config get features):
  • Velero chart version: v5.4.1
  • Kubernetes version (use kubectl version): v1.28.3
  • Cloud provider or hardware configuration: On-premise using K3S
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" at the top right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"

kaovilai (Contributor) commented Mar 8, 2024

Duplicate of #7291

hbollon (Author) commented Mar 8, 2024

I don't think this is really a duplicate 🤔
I already follow that feature request, which seems like a good idea.
But in my case I don't think it's normal behavior that Velero can't keep running during maintenance tasks even with a 10GB memory limit... even if this is quite a demanding operation.
It's currently a breaking behavior that prevents us from backing up our PVCs. With this issue I'm looking for hints on why Velero uses so much memory in our setup (if that is indeed abnormal), or for a workaround to mitigate the problem so we can run our backups.

Lyndon-Li (Contributor) commented

@hbollon
Not sure whether a live session is allowed on your side; we need to check several aspects of the repo, and a live session would be more efficient. If it is not allowed, please let us know and we will look for other ways to troubleshoot.

Moreover, please keep the data in the repo; we may need it for troubleshooting and fix verification, since not every environment can reproduce the issue.

hbollon (Author) commented Mar 11, 2024

Hello @Lyndon-Li
A live session is doable on our side, as long as we don't dig into the data present on the PVCs / storage bucket 😉
You can reach me on the Kubernetes Slack to organise it.

thedarkside commented

Observing this on one of our project clusters as well. The Velero pod keeps crashing with OOM. Version 1.13.0.

We have roughly 1TB of files, mostly images 5-15MB in size, so there are many files.

Current resources configured:

          resources:
            limits:
              cpu: '1'
              memory: 512Mi
            requests:
              cpu: 500m
              memory: 128Mi

Will try increasing.
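
For illustration, bumping them might look something like the sketch below; the actual values applied are not stated in the thread, so these numbers are hypothetical:

          resources:
            limits:
              cpu: '1'
              memory: 4Gi    # hypothetical increase from 512Mi; pick a value based on observed peak usage
            requests:
              cpu: 500m
              memory: 1Gi    # hypothetical increase from 128Mi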

thedarkside commented

Yep, bumping those solved it for now!

contributorr commented May 6, 2024

@Lyndon-Li @hbollon any updates on this issue? I'm experiencing a similar problem in a specific environment, where the Velero pod crashes with a 2GB memory limit but somehow works with 4GB. On the other hand, in multiple (even bigger) environments there's no need to raise the memory limit above 1GB. Is this specific to 1.13.0, and is there any chance it's fixed in 1.13.1/2? Thanks

Lyndon-Li (Contributor) commented

@hbollon @contributorr
There are multiple memory usage improvements in 1.14, which integrates the latest Kopia release. Velero 1.14 will have an RC next week; you can try the RC release and let us know the result.
The improvements should help with the problems we identified in @hbollon's environment.

@contributorr Please note that not all memory usage is irrational; depending on the state of the file system (e.g., more files, smaller files), one environment may need more memory than another.

Lyndon-Li (Contributor) commented

Is this specific to 1.13.0 - any chance it's fixed in 1.13.1/2?

No, it is not specific to 1.13. The improvements will only be in 1.14.

Lyndon-Li (Contributor) commented

@hbollon @contributorr
1.14 RC is ready: https://github.com/vmware-tanzu/velero/releases/tag/v1.14.0-rc.1. You can try it and let us know whether it improves your cases.

Lyndon-Li (Contributor) commented Jun 5, 2024

The problem in @hbollon's environment has been reproduced locally. Here are the details:

  1. The problem happens at the repo connection stage, so the high memory usage can occur for most operations, i.e., backup, restore, and maintenance.
  2. In order to control fragmentation of the repo index blobs, the Kopia repo performs index compaction periodically. In Velero 1.13 and earlier, this is done during repo connection.
  3. Index compaction needs to load all the indexes into memory, so it may consume a huge amount of memory. Memory usage is linearly correlated with the number of indexes.
  4. According to the local test, it takes up to 16GB of memory for index blobs of 750MB in size (if the files are small and random enough, this corresponds to around 21 million files); see the rough estimate below.
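
As a rough back-of-the-envelope reading of those figures (an estimate derived from the numbers above, not a measured profile):

    \frac{16\ \text{GB}}{750\ \text{MB}} \approx 21
    \quad\Rightarrow\quad
    \text{peak RAM} \approx 21 \times \text{index blob size},
    \qquad
    \frac{16\ \text{GB}}{21 \times 10^{6}\ \text{files}} \approx 760\ \text{bytes of RAM per indexed file}.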

1.14 (which integrates Kopia 0.17) doesn't solve this problem completely, but it does improve things:

  1. Kopia 0.17 performs index compaction only during maintenance, so backup and restore are not affected.
  2. Kopia 0.17 compacts indexes for one epoch per maintenance run, which makes the problem less likely to happen.
  3. Velero 1.14 has moved repo maintenance into a dedicated job, so backups/restores done by the Velero server and node-agent are not affected even when the problem happens.

The problem still happens with 1.14 when a huge number of indexes is generated in one backup or in consecutive backups within a short time (e.g., 24 hours). So there will be follow-up fixes after 1.14. The plan is to find a way to reduce the number of indexes compacted in each run, so that memory usage stays bounded.
