
Kopia backup failed on 5k pods in a single namespace, exiting with "podvolumebackups already exists, the server was not able to generate a unique name for the object" #6807

Closed
TzahiAshkenazi opened this issue Sep 12, 2023 · 14 comments · Fixed by #6830

TzahiAshkenazi commented Sep 12, 2023

What steps did you take and what happened:
Running a Kopia backup.
Created a single namespace with 5000 pods & PVCs (PVC size 32MB, storage class = cephfs).
The backup CR failed with phase "PartiallyFailed".

Steps & Commands:

  1. Create a single namespace with 5000 pods & PVs (each PV is 32MB without data, created using BusyBox), sc=cephfs.

  2. Create the backup CR:
    cr_name: backup-kopia-busybox-perf-single-5000-pods-cephfs
    ./velero backup create backup-kopia-busybox-perf-single-5000-pods-cephfs --include-namespaces busybox-perf-single-ns-5000-pods --default-volumes-to-fs-backup=true --snapshot-volumes=false -n velero-1-12

  3. The backup CR failed with "PartiallyFailed" after 2:09:33 (HH:MM:SS).

Errors :

time="2023-09-11T12:46:42Z" level=error msg="Error backing up item" backup=velero-1-12/backup-kopia-busybox-perf-single-5000-pods-cephfs error="podvolumebackups.velero.io \"backup-kopia-busybox-perf-single-5000-pods-cephfs-9nctb\" already exists, the server was not able to generate a unique name for the object" logSource="pkg/backup/backup.go:448" name=busybox-perf-single-ns-5000-pods-3126
time="2023-09-11T13:44:22Z" level=info msg="1 errors encountered backup up item" backup=velero-1-12/backup-kopia-busybox-perf-single-5000-pods-cephfs logSource="pkg/backup/backup.go:444" name=busybox-perf-single-ns-5000-pods-674
time="2023-09-11T13:44:22Z" level=error msg="Error backing up item" backup=velero-1-12/backup-kopia-busybox-perf-single-5000-pods-cephfs error="podvolumebackups.velero.io \"backup-kopia-busybox-perf-single-5000-pods-cephfs-nt9nb\" already exists, the server was not able to generate a unique name for the object" logSource="pkg/backup/backup.go:448" name=busybox-perf-single-ns-5000-pods-674
time="2023-09-11T13:52:31Z" level=info msg="1 errors encountered backup up item" backup=velero-1-12/backup-kopia-busybox-perf-single-5000-pods-cephfs logSource="pkg/backup/backup.go:444" name=busybox-perf-single-ns-5000-pods-962
time="2023-09-11T13:52:31Z" level=error msg="Error backing up item" backup=velero-1-12/backup-kopia-busybox-perf-single-5000-pods-cephfs error="podvolumebackups.velero.io \"backup-kopia-busybox-perf-single-5000-pods-cephfs-6rhqd\" already exists, the server was not able to generate a unique name for the object" logSource="pkg/backup/backup.go:448" name=busybox-perf-single-ns-5000-pods-962

What did you expect to happen:
The backup CR should complete successfully.

Anything else you would like to add:
Attached is a link to the likely root cause:
https://github.com/kubernetes/apiserver/blame/master/pkg/storage/names/generate.go#L45

Environment:

  • Version: release-1.12-dev

  • features:

  • Kubernetes version : Kustomize Version: v4.5.4

  • Kubernetes installer & version:

  • Cloud provider or hardware configuration:
    OCP running on bare-metal servers
    3 master & 12 worker nodes

  • OS (e.g. from /etc/os-release):

    NAME="Red Hat Enterprise Linux CoreOS"
    ID="rhcos"
    ID_LIKE="rhel fedora"
    VERSION="411.86.202306021408-0"
    VERSION_ID="4.11"
    PLATFORM_ID="platform:el8"
    PRETTY_NAME="Red Hat Enterprise Linux CoreOS 411.86.202306021408-0 (Ootpa)"
    ANSI_COLOR="0;31"
    CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
    HOME_URL="https://www.redhat.com/"
    DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.11/"
    BUG_REPORT_URL="https://bugzilla.redhat.com/"
    REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
    REDHAT_BUGZILLA_PRODUCT_VERSION="4.11"
    REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
    REDHAT_SUPPORT_PRODUCT_VERSION="4.11"
    OPENSHIFT_VERSION="4.11"
    RHEL_VERSION="8.6"
    OSTREE_VERSION="411.86.202306021408-0"

P.S.
I couldn't run the "velero debug" command since I re-deployed Velero for new test cases. If the attached logs are missing any info that is needed, I can reproduce the issue again.

kopia_logs.tar.gz

Vote on this issue!

This is an invitation to the Velero community to vote on issues. You can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"

@sseago sseago self-assigned this Sep 12, 2023

sseago commented Sep 12, 2023

I'm going to take this one on. From initial analysis, it looks like the root cause is the use of generateName in cases where a large number of resources are created with the same root string. The GenerateName func doesn't check uniqueness; it just generates the name field from the root plus a random 5-char suffix. If the resulting name already exists, the API server returns an error. The caller needs to check IsAlreadyExists and try again if true. The fix should be to replace our Create calls that use GenerateName with a CreateWithRetries func which will retry a small number of times (probably 3 to 5) in the GenerateName case before returning an error. I'll submit a PR for this.
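
For reference, here is a minimal sketch of what the linked generate.go does (constants mirror the upstream values): the suffix is purely random and no uniqueness check is performed, so on a clash the caller simply gets back an "already exists" error.

package main

import (
	"fmt"

	utilrand "k8s.io/apimachinery/pkg/util/rand"
)

// Constants mirror k8s.io/apiserver/pkg/storage/names/generate.go.
const (
	maxNameLength          = 63
	randomLength           = 5
	maxGeneratedNameLength = maxNameLength - randomLength
)

// generateName reproduces the apiserver's SimpleNameGenerator behavior:
// truncate the stub if needed and append a 5-character random suffix,
// with no check that the resulting name is unused.
func generateName(base string) string {
	if len(base) > maxGeneratedNameLength {
		base = base[:maxGeneratedNameLength]
	}
	return fmt.Sprintf("%s%s", base, utilrand.String(randomLength))
}

func main() {
	fmt.Println(generateName("backup-kopia-busybox-perf-single-5000-pods-cephfs-"))
}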


sseago commented Sep 12, 2023

I've seen this same error in the past when generating test data from a bash script, creating objects from YAML using generateName via the CLI.

sseago added a commit to sseago/velero that referenced this issue Sep 13, 2023
When creating resources with generateName, apimachinery
does not guarantee uniqueness when it appends the random
suffix to the generateName stub, so if it fails with
already exists error, we need to retry.

Signed-off-by: Scott Seago <sseago@redhat.com>
@kaovilai

related to kubernetes/kubernetes#115489

@reasonerjt

I don't think we need to retry whenever a resource is created.

Could you clarify whether this is highly reproducible?


sseago commented Sep 19, 2023

@reasonerjt The only time we need to retry is when we set generateName and don't set name. This is only done in a limited number of places in Velero, all of which should be handled in the associated PR. The underlying issue is that the code which processes generateName appends a random string to the name but does not check whether a resource already exists with the combined name. If there's a clash, it returns an "already exists" error. This is documented behavior, and retry is the recommended response to this error. Beyond a certain number of resources, clashes like this become quite likely. The client-go retry package makes this relatively straightforward, providing a util func that allows a certain number of automatic retries in response to a particular error type.
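
As a rough illustration of that approach (not the exact code from the PR; the helper name, the package placement, the use of a controller-runtime client, and the default backoff are assumptions), the wrapper could look like:

package podvolume

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// createWithGenerateNameRetry re-issues the Create whenever the API server
// reports that the generated name already exists; retry.DefaultRetry allows
// up to 5 attempts, and any other error is returned immediately.
func createWithGenerateNameRetry(ctx context.Context, c client.Client, obj client.Object) error {
	return retry.OnError(retry.DefaultRetry, apierrors.IsAlreadyExists, func() error {
		return c.Create(ctx, obj)
	})
}

Because the object's name field is left empty and only generateName is set, each retried Create lets the server roll a fresh suffix, so a handful of quick retries is enough given the low per-attempt collision probability.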


qiuming-best commented Sep 20, 2023

I've done a test of the String(n int) string function; below are the statistics from one run:
5 char for 1k calls, repeated: 0, repeat rate: 0.00%
5 char for 5k calls, repeated: 1, repeat rate: 0.02%
5 char for 10k calls, repeated: 5, repeat rate: 0.05%
5 char for 100k calls, repeated: 315, repeat rate: 0.32%
5 char for 1000k calls, repeated: 34012, repeat rate: 3.40%

increasing the length of random alphanumeric string:
6 char for 1k calls, repeated: 0, repeat rate: 0.00%
6 char for 5k calls, repeated: 0, repeat rate: 0.00%
6 char for 10k calls, repeated: 0, repeat rate: 0.00%
6 char for 100k calls, repeated: 6, repeat rate: 0.01%
6 char for 1000k calls, repeated: 1330, repeat rate: 0.13%

Increasing the length of the random alphanumeric string has the most obvious effect, and retrying when a collision happens also works.

In our real-world scenario, the number of resources created with the same prefix within a namespace stays well below 1 million; at 100k calls the repeat rate above is 0.32%. If we retry a colliding create once, the probability that both attempts collide is roughly 0.32% * 0.32% ≈ 0.001%, so even creating over 100,000 resources in one namespace would be expected to leave only about one name still clashing after the retry. This probability is extremely low.
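
For what it's worth, a collision count like the one above can be reproduced with a small sketch against the same utilrand.String helper (call counts and suffix lengths are illustrative; exact numbers vary per run):

package main

import (
	"fmt"

	utilrand "k8s.io/apimachinery/pkg/util/rand"
)

// countRepeats returns how many of n generated suffixes of the given length
// collide with a suffix generated earlier in the same run.
func countRepeats(n, length int) int {
	seen := make(map[string]struct{}, n)
	repeats := 0
	for i := 0; i < n; i++ {
		s := utilrand.String(length)
		if _, ok := seen[s]; ok {
			repeats++
			continue
		}
		seen[s] = struct{}{}
	}
	return repeats
}

func main() {
	for _, length := range []int{5, 6} {
		for _, n := range []int{1000, 5000, 10000, 100000, 1000000} {
			r := countRepeats(n, length)
			fmt.Printf("%d char for %dk calls, repeated: %d, repeat rate: %.2f%%\n",
				length, n/1000, r, 100*float64(r)/float64(n))
		}
	}
}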

@TzahiAshkenazi

Update:
Please note that this bug also occurs on the Kopia restore operation.


[root@f07-h27-000-r640 velero]# ./velero  restore describe restore-kopia-busybox-perf-single-5000-pods-rbd  -nvelero-1-12
Name:         restore-kopia-busybox-perf-single-5000-pods-rbd
Namespace:    velero-1-12
Labels:       <none>
Annotations:  <none>

Phase:                       PartiallyFailed (run 'velero restore logs restore-kopia-busybox-perf-single-5000-pods-rbd' for more information)
Total items to be restored:  40022
Items restored:              40022

Started:    2023-09-19 08:50:47 +0000 UTC
Completed:  2023-09-19 19:00:28 +0000 UTC
Errors:
  Velero:   podvolumerestores.velero.io "restore-kopia-busybox-perf-single-5000-pods-rbd-dxx7r" already exists, the server was not able to generate a unique name for the object
            podvolumerestores.velero.io "restore-kopia-busybox-perf-single-5000-pods-rbd-5xmxg" already exists, the server was not able to generate a unique name for the object
  Cluster:    <none>
  Namespaces: <none>

Backup:  backup-kopia-busybox-perf-single-5000-pods-rbd

Namespaces:
  Included:  all namespaces found in the backup
  Excluded:  <none>

Resources:
  Included:        *
  Excluded:        nodes, events, events.events.k8s.io, backups.velero.io, restores.velero.io, resticrepositories.velero.io, csinodes.storage.k8s.io, volumeattachments.storage.k8s.io, backuprepositories.velero.io
  Cluster-scoped:  auto

Namespace mappings:  <none>

Label selector:  <none>

Restore PVs:  auto

Existing Resource Policy:   <none>
ItemOperationTimeout:       4h0m0s

Preserve Service NodePorts:  auto


qiuming-best commented Sep 20, 2023

With 5 chars and 5k calls (repeated: 1, repeat rate: 0.02%), a repeat in the randomly generated PVR name is basically expected to occur with 5k pods.


sseago commented Sep 20, 2023

So in the above restore example, we're seeing that this is common enough to happen twice in a single restore.

@TzahiAshkenazi note that my PR should handle the restore case too. I've basically taken every instance where Velero relies on GenerateName and used the client-go/util/retry func in all cases to retry up to 5 times when the generated name is repeated.

@TzahiAshkenazi

thanks for the update @sseago


TzahiAshkenazi commented Oct 8, 2023

@sseago
When can we expect this fix upstream (in which Velero version)?
Thanks


sseago commented Oct 9, 2023

@TzahiAshkenazi I just updated the fix PR to resolve a conflict with recent commits on main. At this point, the PR has one approval, so as soon as a second maintainer approves it, it will be merged (and thus on-target for Velero 1.13 inclusion)

@shubham-pampattiwar @reasonerjt Can we also consider this for 1.12.1? The bug is currently impacting our performance testing against 1.12. I can submit a cherry-pick PR against release-1.12 once the PR against main merges.

@shubham-pampattiwar

+1 for velero 1.12.1

shubham-pampattiwar added a commit that referenced this issue Oct 11, 2023
issue #6807: Retry failed create when using generateName
Lyndon-Li added a commit that referenced this issue Oct 16, 2023
[release-1.12]: CP issue #6807: Retry failed create when using generateName
sseago added a commit to openshift/velero that referenced this issue May 23, 2024