
velero backup of cluster resources results in a crash of velero pod #3539

Open
mfiedi opened this issue Mar 8, 2021 · 1 comment
mfiedi commented Mar 8, 2021

What steps did you take and what happened:
I'm trying to run a velero backup request for all cluster-scoped resources in my OpenShift cluster. The command I'm using is:
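(Note: the command itself was lost from this report. Based on the backup name and namespace appearing in the log below, it was of roughly the following form; treat the flags as a reconstruction rather than a verbatim copy:)

velero backup create test-mfiedler-debug \
  --include-cluster-resources=true \
  --namespace spp-velero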

Soon after the backup command is issued, I see a crash of the velero pod in namespace spp-velero. I enabled debug logging in the velero deployment and can see that the backup hangs while trying to retrieve the events from the cluster. When I query the number of events, I see that there are 214558. That's an incredibly high number, and honestly I have no idea where they come from. Here is a snippet from the velero log right before the restart:

time="2021-03-08T10:01:47Z" level=info msg="Getting items for resource" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:165" resource=nodes
time="2021-03-08T10:01:47Z" level=info msg="Listing items" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:291" namespace= resource=nodes
time="2021-03-08T10:01:47Z" level=info msg="Retrieved 6 items" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:297" namespace= resource=nodes
time="2021-03-08T10:01:47Z" level=info msg="Getting items for resource" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:165" resource=podtemplates
time="2021-03-08T10:01:47Z" level=info msg="Listing items" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:291" namespace= resource=podtemplates
time="2021-03-08T10:01:47Z" level=info msg="Retrieved 0 items" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:297" namespace= resource=podtemplates
time="2021-03-08T10:01:47Z" level=info msg="Getting items for resource" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:165" resource=services
time="2021-03-08T10:01:47Z" level=info msg="Listing items" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:291" namespace= resource=services
time="2021-03-08T10:01:48Z" level=info msg="Retrieved 77 items" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:297" namespace= resource=services
time="2021-03-08T10:01:48Z" level=info msg="Getting items for resource" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:165" resource=replicationcontrollers
time="2021-03-08T10:01:48Z" level=info msg="Listing items" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:291" namespace= resource=replicationcontrollers
time="2021-03-08T10:01:48Z" level=info msg="Retrieved 0 items" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:297" namespace= resource=replicationcontrollers
time="2021-03-08T10:01:48Z" level=info msg="Getting items for resource" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:165" resource=limitranges
time="2021-03-08T10:01:48Z" level=info msg="Listing items" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:291" namespace= resource=limitranges
time="2021-03-08T10:01:48Z" level=info msg="Retrieved 0 items" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:297" namespace= resource=limitranges
time="2021-03-08T10:01:48Z" level=info msg="Getting items for resource" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:165" resource=secrets
time="2021-03-08T10:01:48Z" level=info msg="Listing items" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:291" namespace= resource=secrets
time="2021-03-08T10:01:50Z" level=info msg="Retrieved 1307 items" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:297" namespace= resource=secrets
time="2021-03-08T10:01:50Z" level=info msg="Getting items for resource" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:165" resource=events
time="2021-03-08T10:01:50Z" level=info msg="Listing items" backup=spp-velero/test-mfiedler-debug group=v1 logSource="pkg/backup/item_collector.go:291" namespace= resource=events

In an attempt to solve this, I raised the resource settings for the velero deployment to the following:

velero_resource_allocation:
  limits:
    cpu: '1'
    memory: 2Gi
  requests:
    cpu: 500m
    memory: 1Gi
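(These values are from our installer configuration. The equivalent ad-hoc change to an already-running deployment would be something like:)

kubectl set resources deployment/velero -n spp-velero \
  --limits=cpu=1,memory=2Gi \
  --requests=cpu=500m,memory=1Gi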

That did not solve the issue: the velero pod still crashes, and the backup remains in progress forever, never coming to an end.

I have now deleted all events and restarted the backup, and it works fine. The question remains: why does the backup crash when that many events exist?
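(For completeness, the event cleanup was done with a command along these lines; on older kubectl versions the deletion may have to be run per namespace. For anyone who only needs the backup to go through, excluding events from the backup should be a less destructive alternative:)

kubectl delete events --all --all-namespaces

velero backup create <backupname> -n spp-velero \
  --include-cluster-resources=true \
  --exclude-resources events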

What did you expect to happen:
Backup completes successfully

The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero

  • refer to attached log

  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml

velero backup describe sppbackup-k8s-b7fb5ea4-17f2-4c3c-980e-a74f1b654a0b -n spp-velero
Name: sppbackup-k8s-b7fb5ea4-17f2-4c3c-980e-a74f1b654a0b
Namespace: spp-velero
Labels:       velero.io/storage-location=default
Annotations:  velero.io/source-cluster-k8s-gitversion=v1.18.3+fa69cae
              velero.io/source-cluster-k8s-major-version=1
              velero.io/source-cluster-k8s-minor-version=18+

Phase: InProgress

Errors: 0
Warnings: 0

Namespaces:
  Included:
  Excluded:

Resources:
  Included: *
  Excluded:
  Cluster-scoped: included

Label selector:

Storage Location: default

Velero-Native Snapshot PVs: auto

TTL: 61320h0m0s

Hooks:

Backup Format Version: 1.1.0

Started: 2021-03-08 12:11:49 +0100 CET
Completed: <n/a>

Expiration: 2028-03-06 12:11:49 +0100 CET

Velero-Native Snapshots:

  • velero backup logs <backupname>

velero backup logs sppbackup-k8s-b7fb5ea4-17f2-4c3c-980e-a74f1b654a0b -n spp-velero
Logs for backup "sppbackup-k8s-b7fb5ea4-17f2-4c3c-980e-a74f1b654a0b" are not available until it's finished processing. Please wait until the backup has a phase of Completed or Failed and try again.

  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:
[Miscellaneous information that will assist in solving the issue.]

Environment:

  • Velero version (use velero version):
    Client:
    Version: v1.5.2
    Git commit: e115e5a
    Server:
    Version: v1.5.2-konveyor

  • Velero features (use velero client config get features):
    features:

  • Kubernetes version (use kubectl version):
    Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.2-0-g52c56ce", GitCommit:"b66f2d3a6893be729f1b8660309a59c6e0b69196", GitTreeState:"clean", BuildDate:"2020-08-10T04:49:09Z", GoVersion:"go1.13.4", Compiler:"gc", Platform:"linux/amd64"}
    Server Version: version.Info{Major:"1", Minor:"18+", GitVersion:"v1.18.3+fa69cae", GitCommit:"fa69cae", GitTreeState:"clean", BuildDate:"2020-12-14T23:03:06Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

  • Kubernetes installer & version:

  • Cloud provider or hardware configuration:

  • OS (e.g. from /etc/os-release):
    NAME="Red Hat Enterprise Linux Server"
    VERSION="7.9 (Maipo)"
    ID="rhel"
    ID_LIKE="fedora"
    VARIANT="Server"
    VARIANT_ID="server"
    VERSION_ID="7.9"
    PRETTY_NAME="Red Hat Enterprise Linux Server 7.9 (Maipo)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:redhat:enterprise_linux:7.9:GA:server"
    HOME_URL="https://www.redhat.com/"
    BUG_REPORT_URL="https://bugzilla.redhat.com/"

    REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 7"
    REDHAT_BUGZILLA_PRODUCT_VERSION=7.9
    REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
    REDHAT_SUPPORT_PRODUCT_VERSION="7.9"

Vote on this issue!

This is an invitation to the Velero community to vote on issues. You can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"