Skip to content

Fix 429 rate limiting in run-cloud CLI polling#10275

Merged
ianhodge merged 1 commit into
masterfrom
fix/run-cloud-429-retry
May 6, 2026
Merged

Fix 429 rate limiting in run-cloud CLI polling#10275
ianhodge merged 1 commit into
masterfrom
fix/run-cloud-429-retry

Conversation

@ianhodge
Copy link
Copy Markdown
Member

@ianhodge ianhodge commented May 6, 2026

Description

Fix 429 rate limiting errors in oz agent run-cloud CLI command.

The CLI polls the agent status API every 1 second for up to 80 seconds, which triggers 429 (Too Many Requests) responses from the server. When a 429 is received, it was treated as a fatal error, terminating the CLI even though the agent run itself was healthy and running fine on the worker.

Root Cause

  • TASK_STATUS_POLL_INTERVAL was set to 1 second, generating up to 80 API calls per run-cloud invocation
  • The polling loop in poll_run_until_joinable_session treated any error from get_ambient_agent_task as fatal, immediately yielding the error and terminating the stream

Changes

  • Increase poll interval from 1s to 3s to reduce baseline request rate (~27 polls instead of ~80 over the 80s timeout window)
  • Add retry-with-backoff for transient HTTP errors (429, 5xx) instead of treating them as fatal. Up to 5 retries with escalating backoff (2s, 4s, 8s, 15s, 15s). Uses the existing is_transient_http_error classifier.
  • Permanent errors (403, 404, etc.) still fail immediately
  • Consecutive transient error counter resets on any successful poll

Linked Issue

  • The linked issue is labeled ready-to-spec or ready-to-implement.
  • Where appropriate, screenshots or a short video of the implementation are included below (especially for user-visible or UI changes).

Testing

Added 3 new unit tests in spawn_tests.rs:

  • poll_retries_transient_429_errors — verifies transient 429 errors are retried and polling resumes successfully
  • poll_fails_on_permanent_http_error — verifies 403 errors fail immediately without retry
  • poll_gives_up_after_max_transient_retries — verifies the retry limit is respected and the error is eventually surfaced

All 15 spawn tests pass. Verified cargo fmt, cargo clippy, and cargo check pass.

Agent Mode

  • Warp Agent Mode - This PR was created via Warp's AI Agent Mode

Conversation: https://staging.warp.dev/conversation/30be443b-0ba4-4c5d-a4ac-3827833c47db
Run: https://oz.staging.warp.dev/runs/019dfe3b-6baa-7a98-ba88-1a1becf50c2b
This PR was generated with Oz.

@cla-bot cla-bot Bot added the cla-signed label May 6, 2026
@ianhodge ianhodge requested a review from bnavetta May 6, 2026 17:20
@ianhodge ianhodge marked this pull request as ready for review May 6, 2026 17:20
@oz-for-oss
Copy link
Copy Markdown
Contributor

oz-for-oss Bot commented May 6, 2026

@ianhodge

I ran into an unexpected error while working on this.

Powered by Oz

@ianhodge ianhodge force-pushed the fix/run-cloud-429-retry branch from a26b3da to a3178ae Compare May 6, 2026 17:42
The run-cloud CLI command polls the agent status API every 1 second for up
to 80 seconds, which can trigger 429 rate limiting from the server. When a
429 is received, it was treated as a fatal error, terminating the CLI even
though the agent run itself was healthy and running.

Changes:
- Increase poll interval from 1s to 3s to reduce baseline request rate
- Wrap each status poll in with_bounded_retry so transient HTTP errors
  (429, 5xx) are retried with exponential backoff instead of immediately
  killing the CLI. Reuses the existing retry helper from agent_sdk::retry.
- Permanent errors (403, 404, etc.) still fail immediately

Co-Authored-By: Oz <oz-agent@warp.dev>
@ianhodge ianhodge force-pushed the fix/run-cloud-429-retry branch from a3178ae to f879f9d Compare May 6, 2026 18:49
@ianhodge ianhodge enabled auto-merge (squash) May 6, 2026 18:54
@ianhodge ianhodge merged commit 5ae13bd into master May 6, 2026
24 checks passed
@ianhodge ianhodge deleted the fix/run-cloud-429-retry branch May 6, 2026 19:06
trungtai1805 pushed a commit to trungtai1805/warp that referenced this pull request May 9, 2026
## Description

Fix 429 rate limiting errors in `oz agent run-cloud` CLI command.

The CLI polls the agent status API every 1 second for up to 80 seconds,
which triggers 429 (Too Many Requests) responses from the server. When a
429 is received, it was treated as a fatal error, terminating the CLI
even though the agent run itself was healthy and running fine on the
worker.

### Root Cause
- `TASK_STATUS_POLL_INTERVAL` was set to 1 second, generating up to 80
API calls per `run-cloud` invocation
- The polling loop in `poll_run_until_joinable_session` treated _any_
error from `get_ambient_agent_task` as fatal, immediately yielding the
error and terminating the stream

### Changes
- **Increase poll interval from 1s to 3s** to reduce baseline request
rate (~27 polls instead of ~80 over the 80s timeout window)
- **Add retry-with-backoff for transient HTTP errors** (429, 5xx)
instead of treating them as fatal. Up to 5 retries with escalating
backoff (2s, 4s, 8s, 15s, 15s). Uses the existing
`is_transient_http_error` classifier.
- Permanent errors (403, 404, etc.) still fail immediately
- Consecutive transient error counter resets on any successful poll

## Linked Issue

- [ ] The linked issue is labeled `ready-to-spec` or
`ready-to-implement`.
- [ ] Where appropriate, screenshots or a short video of the
implementation are included below (especially for user-visible or UI
changes).

## Testing

Added 3 new unit tests in `spawn_tests.rs`:
- `poll_retries_transient_429_errors` — verifies transient 429 errors
are retried and polling resumes successfully
- `poll_fails_on_permanent_http_error` — verifies 403 errors fail
immediately without retry
- `poll_gives_up_after_max_transient_retries` — verifies the retry limit
is respected and the error is eventually surfaced

All 15 spawn tests pass. Verified `cargo fmt`, `cargo clippy`, and
`cargo check` pass.

## Agent Mode
- [x] Warp Agent Mode - This PR was created via Warp's AI Agent Mode

<!--
CHANGELOG-BUG-FIX: Fixed 429 rate-limiting errors when using `oz agent
run-cloud` CLI by reducing poll frequency and adding transient error
retry with backoff
-->

_Conversation:
https://staging.warp.dev/conversation/30be443b-0ba4-4c5d-a4ac-3827833c47db_
_Run:
https://oz.staging.warp.dev/runs/019dfe3b-6baa-7a98-ba88-1a1becf50c2b_
_This PR was generated with [Oz](https://warp.dev/oz)._

Co-authored-by: Oz <oz-agent@warp.dev>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants