CORE-1643 rptest: Increase backoff interval for GCS #17545

Lazin · 2024-04-02T16:10:26Z

The timing_stress_test performs a lot of TS operations in a short time, and in some cases it has to upload the manifest frequently (more often than once per second). Manifest uploads are driven by segment uploads and retention. Normally we upload less often (once per 60s or less), even when the partition is being written to constantly. Under local storage pressure, however, the ntp-archiver is forced to upload the manifest more frequently. Local storage pressure means that local storage wants to evict some data but cannot do so until the manifest is uploaded and the clean offset is moved forward.

The timing stress test introduces local storage pressure and therefore uploads manifests frequently. GCS may throttle us when we try to re-upload the manifest more often than once per second. With the default initial backoff of 100ms, this is exactly what we end up doing once throttling is applied: Redpanda receives a SlowDown response and retries after 100ms, then after 200ms, 400ms, and so on. It never uploads the manifest, and the test fails.
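To make the retry timing concrete, here is a minimal sketch of the doubling schedule described above. It assumes plain exponential backoff with no jitter and a throttling threshold of one upload per second; the function names are hypothetical and are not taken from Redpanda's code.

```python
def backoff_schedule(initial_ms: int, retries: int) -> list[int]:
    """Delay before each retry, doubling every time (no jitter assumed)."""
    return [initial_ms * 2**i for i in range(retries)]


def too_fast_retries(initial_ms: int, retries: int, min_interval_ms: int = 1000) -> list[int]:
    """Retry delays shorter than the assumed once-per-second limit."""
    return [d for d in backoff_schedule(initial_ms, retries) if d < min_interval_ms]


# Default initial backoff: the first four retries arrive less than 1s apart.
default_fast = too_fast_retries(100, 5)

# Increased initial backoff: no retry arrives sooner than 1s after the previous one.
gcs_fast = too_fast_retries(1000, 5)
```

Under these assumptions, a 100ms initial backoff produces several retries that themselves fall inside the throttled window, while a 1000ms initial backoff produces none.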

This fix increases the initial backoff to 1000ms if the test is running on GCS.
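The shape of such a change might look like the sketch below. This is an illustration only, not the actual diff: the helper name `backoff_overrides` and the storage-type strings are hypothetical, and the property name `cloud_storage_initial_backoff_ms` is assumed to be the setting that controls the initial backoff.

```python
def backoff_overrides(storage_type: str) -> dict[str, int]:
    """Return extra config overrides for the given cloud storage backend.

    Hypothetical helper: only GCS gets the larger initial backoff,
    every other backend keeps the default.
    """
    if storage_type.upper() == "GCS":
        # Assumed property name; 1000ms keeps retries at or above 1s apart.
        return {"cloud_storage_initial_backoff_ms": 1000}
    return {}


# A test could merge the overrides into its existing extra config.
extra_conf = {"some_existing_setting": 1}  # placeholder entry for illustration
extra_conf.update(backoff_overrides("GCS"))
```

Keeping the override conditional leaves the behavior on other backends unchanged, which matches the intent of only adjusting the test when it runs on GCS.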

Fixes #15488

Backports Required

Release Notes

none

Lazin requested a review from andijcr April 2, 2024 16:10

Lazin force-pushed the ci-fix/15488 branch from e959306 to 0edfe13 Compare April 2, 2024 16:12

Lazin force-pushed the ci-fix/15488 branch from 0edfe13 to 3445b91 Compare April 2, 2024 16:16

piyushredpanda requested a review from andrwng April 2, 2024 17:40

andrwng approved these changes Apr 2, 2024

piyushredpanda merged commit 114bd23 into redpanda-data:dev Apr 2, 2024
16 checks passed

dotnwat changed the title ~~rptest: Increase backoff interval for GCS~~ CORE-1643 rptest: Increase backoff interval for GCS Apr 19, 2024
