Fix 429 rate limiting in run-cloud CLI polling#10275
Merged
bnavetta approved these changes on May 6, 2026
The run-cloud CLI command polls the agent status API every 1 second for up to 80 seconds, which can trigger 429 rate limiting from the server. A 429 response was treated as a fatal error, terminating the CLI even though the agent run itself was healthy and still running.

Changes:
- Increase poll interval from 1s to 3s to reduce baseline request rate
- Wrap each status poll in with_bounded_retry so transient HTTP errors (429, 5xx) are retried with exponential backoff instead of immediately killing the CLI. Reuses the existing retry helper from agent_sdk::retry.
- Permanent errors (403, 404, etc.) still fail immediately

Co-Authored-By: Oz <oz-agent@warp.dev>
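To illustrate the idea of wrapping a single status poll in a bounded retry, here is a minimal, synchronous sketch. It is not the `with_bounded_retry` helper from `agent_sdk::retry`, whose signature is not shown in this PR. `HttpError`, `is_transient`, and `retry_with_backoff` are hypothetical names, and the initial delay and cap are parameters chosen for illustration.

```rust
use std::time::Duration;

/// Hypothetical error type: only the HTTP status matters for this sketch.
#[derive(Debug, Clone, PartialEq)]
pub struct HttpError {
    pub status: u16,
}

/// Assumed classification, following the commit message: 429 and 5xx are transient.
pub fn is_transient(err: &HttpError) -> bool {
    err.status == 429 || (500..=599).contains(&err.status)
}

/// Calls `op` until it succeeds, fails with a permanent error, or has been
/// retried `max_retries` times. The delay doubles after each transient failure
/// and is capped at `max_delay`. `sleep` is injected so callers (and tests)
/// control how waiting happens.
pub fn retry_with_backoff<T>(
    mut op: impl FnMut() -> Result<T, HttpError>,
    mut sleep: impl FnMut(Duration),
    max_retries: u32,
    initial_delay: Duration,
    max_delay: Duration,
) -> Result<T, HttpError> {
    let mut delay = initial_delay;
    let mut retries = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            // Transient failure with retry budget left: wait, then try again.
            Err(err) if is_transient(&err) && retries < max_retries => {
                sleep(delay);
                retries += 1;
                delay = (delay * 2).min(max_delay);
            }
            // Permanent error, or retry budget exhausted: surface it.
            Err(err) => return Err(err),
        }
    }
}
```

In a real CLI the `sleep` argument would typically be `std::thread::sleep` or an async timer; passing a closure keeps the sketch testable without actually waiting.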
trungtai1805 pushed a commit to trungtai1805/warp that referenced this pull request on May 9, 2026
## Description

Fix 429 rate limiting errors in the `oz agent run-cloud` CLI command.

The CLI polls the agent status API every 1 second for up to 80 seconds, which triggers 429 (Too Many Requests) responses from the server. A 429 response was treated as a fatal error, terminating the CLI even though the agent run itself was healthy and running fine on the worker.

### Root Cause

- `TASK_STATUS_POLL_INTERVAL` was set to 1 second, generating up to 80 API calls per `run-cloud` invocation
- The polling loop in `poll_run_until_joinable_session` treated _any_ error from `get_ambient_agent_task` as fatal, immediately yielding the error and terminating the stream

### Changes

- **Increase poll interval from 1s to 3s** to reduce baseline request rate (~27 polls instead of ~80 over the 80s timeout window)
- **Add retry-with-backoff for transient HTTP errors** (429, 5xx) instead of treating them as fatal. Up to 5 retries with escalating backoff (2s, 4s, 8s, 15s, 15s). Uses the existing `is_transient_http_error` classifier.
- Permanent errors (403, 404, etc.) still fail immediately
- Consecutive transient error counter resets on any successful poll

## Linked Issue

- [ ] The linked issue is labeled `ready-to-spec` or `ready-to-implement`.
- [ ] Where appropriate, screenshots or a short video of the implementation are included below (especially for user-visible or UI changes).

## Testing

Added 3 new unit tests in `spawn_tests.rs`:

- `poll_retries_transient_429_errors`: verifies transient 429 errors are retried and polling resumes successfully
- `poll_fails_on_permanent_http_error`: verifies 403 errors fail immediately without retry
- `poll_gives_up_after_max_transient_retries`: verifies the retry limit is respected and the error is eventually surfaced

All 15 spawn tests pass. Verified that `cargo fmt`, `cargo clippy`, and `cargo check` pass.
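The figures quoted under Changes (about 27 polls instead of 80, and a 2s, 4s, 8s, 15s, 15s backoff) and the counter-reset behavior can be checked with a small synchronous sketch. This is not the PR's code: `poll_until_ready`, `backoff_for_retry`, `polls_in_window`, and `is_transient_status` are hypothetical names, the doubling-with-cap formula is just one way to reproduce the listed schedule, and the real `is_transient_http_error` classifier may behave differently.

```rust
use std::time::Duration;

// Values quoted in the description above.
const POLL_INTERVAL: Duration = Duration::from_secs(3);
const MAX_TRANSIENT_RETRIES: u32 = 5;
const MAX_BACKOFF_SECS: u64 = 15;

/// Assumed classification: 429 and 5xx are transient, everything else is permanent.
pub fn is_transient_status(status: u16) -> bool {
    status == 429 || (500..600).contains(&status)
}

/// Delay before the `retry`-th consecutive retry (1-based): 2s, 4s, 8s, 15s, 15s.
pub fn backoff_for_retry(retry: u32) -> Duration {
    Duration::from_secs(2u64.saturating_pow(retry).min(MAX_BACKOFF_SECS))
}

/// Number of polls that fit in `window` at a fixed, non-zero `interval`, rounded up.
pub fn polls_in_window(window: Duration, interval: Duration) -> u128 {
    let (w, i) = (window.as_millis(), interval.as_millis());
    (w + i - 1) / i
}

/// Polls until `poll` returns `Ok(Some(_))`, a permanent error occurs, the
/// transient retry budget is exhausted, or `max_polls` attempts have been made.
/// `Ok(None)` from `poll` means "not joinable yet"; `Err(status)` is an HTTP error.
pub fn poll_until_ready<T>(
    mut poll: impl FnMut() -> Result<Option<T>, u16>,
    mut sleep: impl FnMut(Duration),
    max_polls: u128,
) -> Result<Option<T>, u16> {
    let mut consecutive_transient = 0u32;
    for _ in 0..max_polls {
        match poll() {
            Ok(Some(value)) => return Ok(Some(value)),
            Ok(None) => {
                // Any successful poll resets the transient error counter.
                consecutive_transient = 0;
                sleep(POLL_INTERVAL);
            }
            Err(status)
                if is_transient_status(status)
                    && consecutive_transient < MAX_TRANSIENT_RETRIES =>
            {
                consecutive_transient += 1;
                sleep(backoff_for_retry(consecutive_transient));
            }
            // Permanent error, or too many transient errors in a row.
            Err(status) => return Err(status),
        }
    }
    Ok(None) // Attempt budget used up without reaching a joinable state.
}
```

For example, `polls_in_window(Duration::from_secs(80), Duration::from_secs(3))` evaluates to 27, matching the "~27 polls" figure. In this sketch retries count against `max_polls`, which is a simplification; a time-based deadline would be closer to the 80s timeout window described above.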
## Agent Mode

- [x] Warp Agent Mode - This PR was created via Warp's AI Agent Mode

<!-- CHANGELOG-BUG-FIX: Fixed 429 rate-limiting errors when using `oz agent run-cloud` CLI by reducing poll frequency and adding transient error retry with backoff -->

_Conversation: https://staging.warp.dev/conversation/30be443b-0ba4-4c5d-a4ac-3827833c47db_
_Run: https://oz.staging.warp.dev/runs/019dfe3b-6baa-7a98-ba88-1a1becf50c2b_

_This PR was generated with [Oz](https://warp.dev/oz)._

Co-authored-by: Oz <oz-agent@warp.dev>