
Kopia backup failed on 5k pods in a single namespace, exiting with "podvolumebackups already exists, the server was not able to generate a unique name for the object" #6807

Closed
TzahiAshkenazi opened this issue Sep 12, 2023 · 14 comments · Fixed by #6830

TzahiAshkenazi commented Sep 12, 2023

What steps did you take and what happened:
Running a Kopia backup.
Created a single namespace with 5000 pods & PVCs (PVC size 32MB, storage class = cephfs).
The backup CR failed with phase "PartiallyFailed".

Steps & Commands:

  1. Create a single namespace with 5000 pods & PVs (each PV is 32MB without data, created using BusyBox), sc=cephfs.

  2. Create the backup CR:
    cr_name: backup-kopia-busybox-perf-single-5000-pods-cephfs
    ./velero backup create backup-kopia-busybox-perf-single-5000-pods-cephfs --include-namespaces busybox-perf-single-ns-5000-pods --default-volumes-to-fs-backup=true --snapshot-volumes=false -n velero-1-12

  3. The backup CR failed with "PartiallyFailed" after 2:09:33 (HH:MM:SS).

Errors :

time="2023-09-11T12:46:42Z" level=error msg="Error backing up item" backup=velero-1-12/backup-kopia-busybox-perf-single-5000-pods-cephfs error="podvolumebackups.velero.io \"backup-kopia-busybox-perf-single-5000-pods-cephfs-9nctb\" already exists, the server was not able to generate a unique name for the object" logSource="pkg/backup/backup.go:448" name=busybox-perf-single-ns-5000-pods-3126
time="2023-09-11T13:44:22Z" level=info msg="1 errors encountered backup up item" backup=velero-1-12/backup-kopia-busybox-perf-single-5000-pods-cephfs logSource="pkg/backup/backup.go:444" name=busybox-perf-single-ns-5000-pods-674
time="2023-09-11T13:44:22Z" level=error msg="Error backing up item" backup=velero-1-12/backup-kopia-busybox-perf-single-5000-pods-cephfs error="podvolumebackups.velero.io \"backup-kopia-busybox-perf-single-5000-pods-cephfs-nt9nb\" already exists, the server was not able to generate a unique name for the object" logSource="pkg/backup/backup.go:448" name=busybox-perf-single-ns-5000-pods-674
time="2023-09-11T13:52:31Z" level=info msg="1 errors encountered backup up item" backup=velero-1-12/backup-kopia-busybox-perf-single-5000-pods-cephfs logSource="pkg/backup/backup.go:444" name=busybox-perf-single-ns-5000-pods-962
time="2023-09-11T13:52:31Z" level=error msg="Error backing up item" backup=velero-1-12/backup-kopia-busybox-perf-single-5000-pods-cephfs error="podvolumebackups.velero.io \"backup-kopia-busybox-perf-single-5000-pods-cephfs-6rhqd\" already exists, the server was not able to generate a unique name for the object" logSource="pkg/backup/backup.go:448" name=busybox-perf-single-ns-5000-pods-962

What did you expect to happen:
The backup CR should complete successfully.

Anything else you would like to add:
Attached is a link to the likely root cause:
https://github.com/kubernetes/apiserver/blame/master/pkg/storage/names/generate.go#L45

Environment:

  • Version: release-1.12-dev

  • features:

  • Kubernetes version : Kustomize Version: v4.5.4

  • Kubernetes installer & version:

  • Cloud provider or hardware configuration:
    OCP running on bare-metal servers
    3 master & 12 worker nodes

  • OS (e.g. from /etc/os-release):

    NAME="Red Hat Enterprise Linux CoreOS"
    ID="rhcos"
    ID_LIKE="rhel fedora"
    VERSION="411.86.202306021408-0"
    VERSION_ID="4.11"
    PLATFORM_ID="platform:el8"
    PRETTY_NAME="Red Hat Enterprise Linux CoreOS 411.86.202306021408-0 (Ootpa)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
    HOME_URL="https://www.redhat.com/"
    DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.11/"
    BUG_REPORT_URL="https://bugzilla.redhat.com/"
    REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
    REDHAT_BUGZILLA_PRODUCT_VERSION="4.11"
    REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
    REDHAT_SUPPORT_PRODUCT_VERSION="4.11"
    OPENSHIFT_VERSION="4.11"
    RHEL_VERSION="8.6"
    OSTREE_VERSION="411.86.202306021408-0"

P.S.
I couldn't run the "velero debug" command since I re-deployed Velero for new test cases. If the attached logs are missing any info that is needed, I can reproduce the issue again.

kopia_logs.tar.gz

Vote on this issue!

This is an invitation to the Velero community to vote on issues. You can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"

@sseago sseago self-assigned this Sep 12, 2023

sseago commented Sep 12, 2023

I'm going to take this one on. From initial analysis, it looks like the root cause is the use of generateName in cases where a large number of resources are created with the same root string. The GenerateName func doesn't check uniqueness; it just generates the name field from the root plus a random 5-char suffix. If the resulting name already exists, the API server returns an error. The caller needs to check IsAlreadyExists and try again if true. The fix should be to replace our Create calls that use GenerateName with a CreateWithRetries func which will retry a small number of times (probably 3 to 5) in the GenerateName case before returning an error. I'll submit a PR for this.
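
For reference, here is a minimal sketch of what the linked generate.go does (constants mirror the upstream values): the suffix is purely random and no uniqueness check is performed, so on a clash the caller simply gets back an "already exists" error.

package main

import (
	"fmt"

	utilrand "k8s.io/apimachinery/pkg/util/rand"
)

// Constants mirror k8s.io/apiserver/pkg/storage/names/generate.go.
const (
	maxNameLength          = 63
	randomLength           = 5
	maxGeneratedNameLength = maxNameLength - randomLength
)

// generateName reproduces the apiserver's SimpleNameGenerator behavior:
// truncate the stub if needed and append a 5-character random suffix,
// with no check that the resulting name is unused.
func generateName(base string) string {
	if len(base) > maxGeneratedNameLength {
		base = base[:maxGeneratedNameLength]
	}
	return fmt.Sprintf("%s%s", base, utilrand.String(randomLength))
}

func main() {
	fmt.Println(generateName("backup-kopia-busybox-perf-single-5000-pods-cephfs-"))
}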


sseago commented Sep 12, 2023

I've seen this same error in the past when generating test data from a bash script, creating objects from YAML using generateName via the CLI.

sseago added a commit to sseago/velero that referenced this issue Sep 13, 2023
When creating resources with generateName, apimachinery
does not guarantee uniqueness when it appends the random
suffix to the generateName stub, so if it fails with
already exists error, we need to retry.

Signed-off-by: Scott Seago <sseago@redhat.com>
@kaovilai

related to kubernetes/kubernetes#115489

@reasonerjt

I don't think we need to retry whenever a resource is created.

Could you clarify whether this is highly reproducible?


sseago commented Sep 19, 2023

@reasonerjt The only time we need to retry is when we set generateName and don't set name. This is only done in a limited number of places in Velero, all of which should be handled in the associated PR. The underlying issue is that the code which processes generateName appends a random string to the name but does not check whether a resource already exists with the combined name. If there's a clash, it returns an "already exists" error. This is documented behavior, and retry is the recommended response to this error. Beyond a certain number of resources, clashes like this become quite likely. The client-go retry package makes this relatively straightforward, providing a util func that allows a certain number of automatic retries in response to a particular error type.
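
As a rough illustration of that approach (not the exact code from the PR; the helper name, the package placement, the use of a controller-runtime client, and the default backoff are assumptions), the wrapper could look like:

package podvolume

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createWithGenerateNameRetry re-issues the Create whenever the API server
// reports that the generated name already exists; retry.DefaultRetry allows
// up to 5 attempts, and any other error is returned immediately.
func createWithGenerateNameRetry(ctx context.Context, c client.Client, obj client.Object) error {
	return retry.OnError(retry.DefaultRetry, apierrors.IsAlreadyExists, func() error {
		return c.Create(ctx, obj)
	})
}

Because the object's name field is left empty and only generateName is set, each retried Create lets the server roll a fresh suffix, so a handful of quick retries is enough given the low per-attempt collision probability.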


qiuming-best commented Sep 20, 2023

I've done a test of the String(n int) string function; below are the statistics from one run:
5 char for 1k calls, repeated: 0, repeat rate: 0.00%
5 char for 5k calls, repeated: 1, repeat rate: 0.02%
5 char for 10k calls, repeated: 5, repeat rate: 0.05%
5 char for 100k calls, repeated: 315, repeat rate: 0.32%
5 char for 1000k calls, repeated: 34012, repeat rate: 3.40%

increasing the length of random alphanumeric string:
6 char for 1k calls, repeated: 0, repeat rate: 0.00%
6 char for 5k calls, repeated: 0, repeat rate: 0.00%
6 char for 10k calls, repeated: 0, repeat rate: 0.00%
6 char for 100k calls, repeated: 6, repeat rate: 0.01%
6 char for 1000k calls, repeated: 1330, repeat rate: 0.13%

Increasing the length of the random alphanumeric string has the most obvious effect, and retrying when a collision happens also works.

In our real-world scenario, the number of resources created with the same prefix within a namespace stays well below 1 million; at 100k calls the repeat rate above is 0.32%. If we retry a colliding create once, the probability that both attempts collide is roughly 0.32% * 0.32% ≈ 0.001%, so even creating over 100,000 resources in one namespace would be expected to leave only about one name still clashing after the retry. This probability is extremely low.
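
For what it's worth, a collision count like the one above can be reproduced with a small sketch against the same utilrand.String helper (call counts and suffix lengths are illustrative; exact numbers vary per run):

package main

import (
	"fmt"

	utilrand "k8s.io/apimachinery/pkg/util/rand"
)

// countRepeats returns how many of n generated suffixes of the given length
// collide with a suffix generated earlier in the same run.
func countRepeats(n, length int) int {
	seen := make(map[string]struct{}, n)
	repeats := 0
	for i := 0; i < n; i++ {
		s := utilrand.String(length)
		if _, ok := seen[s]; ok {
			repeats++
			continue
		}
		seen[s] = struct{}{}
	}
	return repeats
}

func main() {
	for _, length := range []int{5, 6} {
		for _, n := range []int{1000, 5000, 10000, 100000, 1000000} {
			r := countRepeats(n, length)
			fmt.Printf("%d char for %dk calls, repeated: %d, repeat rate: %.2f%%\n",
				length, n/1000, r, 100*float64(r)/float64(n))
		}
	}
}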

@TzahiAshkenazi

Update:
Please note that this bug also occurs on the Kopia restore operation.


[root@f07-h27-000-r640 velero]# ./velero  restore describe restore-kopia-busybox-perf-single-5000-pods-rbd  -nvelero-1-12
Name:         restore-kopia-busybox-perf-single-5000-pods-rbd
Namespace:    velero-1-12
Labels:       <none>
Annotations:  <none>

Phase:                       PartiallyFailed (run 'velero restore logs restore-kopia-busybox-perf-single-5000-pods-rbd' for more information)
Total items to be restored:  40022
Items restored:              40022

Started:    2023-09-19 08:50:47 +0000 UTC
Completed:  2023-09-19 19:00:28 +0000 UTC
Errors:
  Velero:   podvolumerestores.velero.io "restore-kopia-busybox-perf-single-5000-pods-rbd-dxx7r" already exists, the server was not able to generate a unique name for the object
            podvolumerestores.velero.io "restore-kopia-busybox-perf-single-5000-pods-rbd-5xmxg" already exists, the server was not able to generate a unique name for the object
  Cluster:    <none>
  Namespaces: <none>

Backup:  backup-kopia-busybox-perf-single-5000-pods-rbd

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Existing Resource Policy:   <none>
ItemOperationTimeout:       4h0m0s

Preserve Service NodePorts:  auto


qiuming-best commented Sep 20, 2023

With 5 chars and 5k calls (repeated: 1, repeat rate: 0.02%), a repeat in the randomly generated PVR name is basically expected to occur with 5k pods.


sseago commented Sep 20, 2023

So in the above restore example, we're seeing that this is common enough to happen twice in a single restore.

@TzahiAshkenazi note that my PR should handle the restore case too. I've basically taken every instance where Velero relies on GenerateName and used the client-go/util/retry func in all cases to retry up to 5 times when the generated name is repeated.

@TzahiAshkenazi

thanks for the update @sseago


TzahiAshkenazi commented Oct 8, 2023

@sseago
When can we expect this fix upstream (in which Velero version)?
Thanks


sseago commented Oct 9, 2023

@TzahiAshkenazi I just updated the fix PR to resolve a conflict with recent commits on main. At this point, the PR has one approval, so as soon as a second maintainer approves it, it will be merged (and thus on-target for Velero 1.13 inclusion)

@shubham-pampattiwar @reasonerjt Can we also consider this for 1.12.1? The bug is currently impacting our performance testing against 1.12. I can submit a cherry-pick PR against release-1.12 once the PR against main merges.

@shubham-pampattiwar

+1 for velero 1.12.1

shubham-pampattiwar added a commit that referenced this issue Oct 11, 2023
issue #6807: Retry failed create when using generateName
Lyndon-Li added a commit that referenced this issue Oct 16, 2023
[release-1.12]: CP issue #6807: Retry failed create when using generateName
sseago added a commit to openshift/velero that referenced this issue May 23, 2024