CORE-1643 rptest: Increase backoff interval for GCS #17545

Merged 1 commit on Apr 2, 2024

@Lazin Lazin commented Apr 2, 2024

The timing stress test performs many tiered storage (TS) operations in a short time, and in some cases it has to upload the manifest frequently (more often than once per second). Manifest uploads are driven by segment uploads and retention. Normally we upload less often (at most once every 60s), even if we're writing to the partition constantly. Under local storage pressure, however, the ntp-archiver is forced to upload the manifest more frequently. Local storage pressure means that local storage wants to evict some data but can't do so until the manifest is uploaded and the clean offset moves forward.

The timing stress test introduces local storage pressure and therefore uploads manifests frequently. GCS may throttle us when we try to re-upload the manifest faster than once per second. With the default initial backoff of 100ms, that is exactly what we do once throttling kicks in: Redpanda receives a SlowDown response and retries after 100ms, then 200ms, 400ms, etc. It never uploads the manifest and the test fails.
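
To make the retry timing concrete, here is a minimal sketch of the doubling schedule described above. It is not Redpanda's retry code: the function names are hypothetical, and it assumes plain doubling with no jitter and no cap. It counts how many retries land less than one second after the previous attempt, which is the rate at which GCS may throttle.

```python
from typing import List


def backoff_schedule(initial_ms: int, retries: int) -> List[int]:
    """Delay in ms before each retry, doubling every time (no jitter, no cap)."""
    return [initial_ms * 2**i for i in range(retries)]


def retries_closer_than(initial_ms: int, retries: int, min_gap_ms: int = 1000) -> int:
    """Count retries issued less than min_gap_ms after the previous attempt."""
    return sum(1 for delay in backoff_schedule(initial_ms, retries) if delay < min_gap_ms)


print(backoff_schedule(100, 5))      # [100, 200, 400, 800, 1600]
print(backoff_schedule(1000, 5))     # [1000, 2000, 4000, 8000, 16000]
print(retries_closer_than(100, 5))   # 4
print(retries_closer_than(1000, 5))  # 0
```

Under these assumptions, a 100ms initial backoff issues four consecutive retries faster than once per second, while a 1000ms initial backoff issues none.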

This fix increases the initial backoff to 1000ms if the test is running on GCS.
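
The idea can be sketched roughly as follows. This is an illustration, not the code in the commit: the helper name, the backend string, and the config key `cloud_storage_initial_backoff_ms` are assumptions made for the example.

```python
DEFAULT_INITIAL_BACKOFF_MS = 100  # default value quoted in this description
GCS_INITIAL_BACKOFF_MS = 1000     # value this fix uses when running on GCS


def extra_cluster_config(backend: str) -> dict:
    """Hypothetical helper: cluster config overrides for the given storage backend."""
    overrides = {}
    if backend.lower() == "gcs":
        # Assumed property name; only GCS gets the larger initial backoff.
        overrides["cloud_storage_initial_backoff_ms"] = GCS_INITIAL_BACKOFF_MS
    return overrides


print(extra_cluster_config("gcs"))  # {'cloud_storage_initial_backoff_ms': 1000}
print(extra_cluster_config("s3"))   # {}
```

Other backends keep the default, so the change is scoped to the environment where the throttling was observed.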

Fixes #15488

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

  • none

@piyushredpanda piyushredpanda merged commit 114bd23 into redpanda-data:dev Apr 2, 2024
16 checks passed
@dotnwat dotnwat changed the title rptest: Increase backoff interval for GCS CORE-1643 rptest: Increase backoff interval for GCS Apr 19, 2024