fix(endpoint): retry EndpointJob.wait() on transient httpx errors by deanq · Pull Request #340 · runpod/flash

deanq · 2026-05-25T03:06:39Z

Summary

EndpointJob.wait() previously called self.status() with zero exception handling, so a single transient httpx.RemoteProtocolError (or any transport/timeout failure) on the Runpod /v2/{id}/status/{job_id} poll aborted the whole wait — even though the underlying job was still healthy. Cold starts (model download, vLLM compile, CUDA graph capture) make this very visible: one dropped poll fails a five-minute wait that was nearly complete.
The polling loop now catches httpx.TransportError and httpx.TimeoutException, logs at debug, applies the existing exponential backoff, and continues. It re-raises only when the user-supplied timeout deadline is exceeded (still TimeoutError), or when _POLL_MAX_CONSECUTIVE_ERRORS (5) consecutive failures hit — so genuinely dead endpoints still fail loud. The counter resets on any successful poll.
httpx.HTTPStatusError (4xx auth/config bugs from raise_for_status) is intentionally NOT caught — it propagates immediately.
The user-space _wait_resilient workaround in flash-examples/02_ml_inference/02_vllm_chat/vllm_chat.py is now obsolete; cleanup of that file is intentionally out of scope for this PR.

Test plan

make quality-check passes (all tests + lint/format, coverage 85.45%).
New unit tests in tests/unit/test_endpoint_client.py::TestEndpointJobWaitTransientErrors:
- transient RemoteProtocolError once, then COMPLETED — wait() returns normally (2 polls).
- persistent RemoteProtocolError — wait() re-raises after _POLL_MAX_CONSECUTIVE_ERRORS polls.
- error / success / error burst / success / COMPLETED — counter resets, wait() completes.
- HTTPStatusError(401) is NOT swallowed; re-raised on first call.
- RemoteProtocolError forever + timeout=0.1 — wait() raises TimeoutError, not the httpx error.
Manual smoke against an endpoint with cold workers (vLLM): await job.wait() survives mid-poll TCP drops instead of aborting.

EndpointJob.wait() previously aborted on a single httpx.RemoteProtocolError (or any other transient transport/timeout failure) raised by the Runpod /v2/{id}/status/{job_id} poll, even though the underlying job was still healthy. Multi-minute cold starts amplify this: one dropped poll fails a five-minute wait that was nearly complete. Catch httpx.TransportError and httpx.TimeoutException inside the polling loop, log at debug, apply the existing exponential backoff, and continue. Re-raise only when: - the user-supplied timeout deadline is exceeded (TimeoutError), or - _POLL_MAX_CONSECUTIVE_ERRORS (5) consecutive failures hit, so dead endpoints still fail loud. The counter resets on any successful poll. httpx.HTTPStatusError (4xx auth/config bugs) is intentionally NOT caught — it propagates immediately. Refs AE-3154.

promptless · 2026-05-25T03:11:37Z

Promptless prepared a documentation update related to this change.

Triggered by runpod/flash PR #340

Documents that the Flash SDK's EndpointJob.wait() method now automatically retries transient network errors (connection drops, timeouts, protocol errors) with exponential backoff, making it resilient during cold starts. HTTP status errors like 401 and 404 fail immediately without retry.

Review: Document EndpointJob.wait() retry behavior for transient errors

Copilot

Pull request overview

This PR improves the resilience of EndpointJob.wait() polling by tolerating transient httpx transport/timeouts during job status checks, rather than aborting the entire wait on a single dropped connection.

Changes:

Add transient-error retry handling to EndpointJob.wait() with exponential backoff and a maximum consecutive-error threshold.
Introduce _POLL_MAX_CONSECUTIVE_ERRORS to cap tolerated consecutive transient failures.
Add unit tests covering transient error retry, threshold behavior, counter reset, and HTTPStatusError propagation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File	Description
`src/runpod_flash/endpoint.py`	Adds retry/backoff logic in `EndpointJob.wait()` for transient `httpx` transport/timeout errors with a consecutive-error threshold.
`tests/unit/test_endpoint_client.py`	Adds unit tests validating retry behavior and error propagation for `EndpointJob.wait()`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

The previous test passed `timeout=0.1` against the default `_POLL_INITIAL_INTERVAL=0.25`, so `wait()` raised `TimeoutError` from its pre-sleep deadline guard before ever calling `status()`. The retry path was never exercised — the test only validated the pre-sleep guard. Apply the `fast_poll` fixture and lift the consecutive-error threshold above the number of retries the deadline allows, so multiple httpx errors are actually suppressed before the deadline trips. Assert `_api_get.call_count >= 2` to lock in that the retry path runs. Surfaced by Copilot review on AE-3154.

runpod-Henrik

Henrik's AI-Powered Bug Finder — PR #340 Review

Verdict: PASS WITH NITS

Targeted, minimal fix for a real cold-start failure mode. Catching (httpx.TransportError, httpx.TimeoutException) around self.status(), escalating after _POLL_MAX_CONSECUTIVE_ERRORS consecutive failures, and resetting on success is the right shape. HTTPStatusError deliberately not caught — auth/config bugs still fail loud, as documented. Counter reset on success prevents cumulative noise. Tests cover all five paths (one-shot transient, threshold escalation, burst with reset, HTTPStatusError pass-through, deadline still authoritative). CI green on 3.10/3.11/3.12.

1. Issue: user-supplied `timeout=` can be exceeded by ~30s per retried poll

User scenario: a caller writes await job.wait(timeout=10) expecting the call to return within 10 seconds. A worker is mid-cold-start and the proxy is dropping connections. Each await self.status() hangs against _api_get's default timeout=30.0 (endpoint.py:898) before raising httpx.ReadTimeout. The new code catches that, increments the counter, and loops back — but the deadline check at endpoint.py:144,150 happens before asyncio.sleep, never around the status() call. So if the deadline trips while _api_get is blocked in its 30-second httpx timeout, the user waits the full 30s past their declared deadline before the next loop iteration finally raises TimeoutError.

Concretely: wait(timeout=10) against a hung endpoint can realistically take 30–40 seconds wall-clock to raise.

Two cheap mitigations:

Tighten _api_get's effective timeout for the wait() poll path by passing a bound (e.g., min of remaining deadline and 5s) when called from wait().
Or wrap the status() call in asyncio.wait_for(self.status(), timeout=remaining_deadline) so the deadline is authoritative even mid-call.

Pre-existing concern in shape, but this PR makes it worse: previously a single httpx timeout bailed immediately; now we suppress it and re-enter the wait. Worth at least a docstring note that `timeout` is best-effort and can be exceeded by one poll's httpx timeout.

2. Nit: `(httpx.TransportError, httpx.TimeoutException)` is redundant

`httpx.TimeoutException` is a subclass of `httpx.TransportError` (`HTTPError → RequestError → TransportError → TimeoutException` in httpx's exception tree). Catching both has no effect over catching just `TransportError`. Either drop `TimeoutException` from the tuple, or add a comment noting the redundancy is intentional for documentation purposes.

3. Question: should `httpx.DecodingError` be retried too?

`DecodingError` is a `RequestError` but not a `TransportError`. If the Runpod API returns a malformed response (corrupt gzip, partial JSON during a server hiccup), the current catch wouldn't swallow it and `wait()` would abort. That's defensible — "the server is returning garbage" feels different from "the connection dropped" — but worth confirming whether you've actually seen this shape in the wild, or it's just hypothetical. If hypothetical, leave it strict.

4. Nit: retry events are logged at DEBUG

`log.debug("transient httpx error polling job %s ...")` — operators on the default `LOG_LEVEL=INFO` (or `WARNING`) won't see why a `wait()` is taking longer than expected. When someone files "my wait() hung for 90 seconds before failing," the only diagnostic is to turn debug logging on and reproduce.

Consider logging the first retry at INFO ("transient error polling job %s, retrying") and subsequent at DEBUG. Or expose `consecutive_errors` / cumulative retry count on `EndpointJob` so callers can introspect.

Not a blocker — debug is the conservative choice — but worth weighing against operability.

5. Question: `_wait_resilient` cleanup in `flash-examples` — tracked?

PR description notes the user-space workaround in `flash-examples/02_ml_inference/02_vllm_chat/vllm_chat.py` is now obsolete and cleanup is intentionally deferred. Is there a tracking ticket / follow-up PR open against flash-examples? If not, this kind of dangling cruft tends to live forever — would be worth filing one before merge so it doesn't get lost.

Nits

`tests/unit/test_endpoint_client.py:212` — `fast_poll` fixture is great. Worth applying it to `test_wait_timeout_raises` too (currently at line ~199 in the unchanged section) so existing timeout tests don't sit on a real 0.3s sleep.
`endpoint.py:144,150` — the deadline check is duplicated before and after `asyncio.sleep`. Pre-existing, but with the new retry path it might be cleaner to extract a `_check_deadline()` helper and call it once per loop top. Out of scope for this PR.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

Address Henrik QA feedback on PR #340: - Wrap status() call in asyncio.wait_for with remaining deadline so the user-supplied timeout is authoritative even when _api_get's httpx timeout would otherwise exceed it. - Drop redundant httpx.TimeoutException from except tuple (subclass of TransportError). - Log first transient retry at INFO so operators see why wait() is taking longer; keep subsequent retries at DEBUG to avoid spam.

deanq requested a review from Copilot May 25, 2026 03:09

Copilot started reviewing on behalf of deanq May 25, 2026 03:09 View session

Copilot AI reviewed May 25, 2026

View reviewed changes

Comment thread tests/unit/test_endpoint_client.py Outdated

deanq requested review from KAJdev, jhcipar and runpod-Henrik May 25, 2026 04:27

runpod-Henrik reviewed May 26, 2026

View reviewed changes

KAJdev approved these changes May 27, 2026

View reviewed changes

deanq and others added 3 commits May 27, 2026 12:16

Merge branch 'main' into deanq/ae-3154-endpointjob-wait-retry

303c89c

Merge branch 'main' into deanq/ae-3154-endpointjob-wait-retry

11d2485

deanq merged commit c126357 into main May 27, 2026
4 checks passed

deanq deleted the deanq/ae-3154-endpointjob-wait-retry branch May 27, 2026 20:25

runpod-release-please-bot Bot mentioned this pull request May 27, 2026

chore: release 1.17.0 #343

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(endpoint): retry EndpointJob.wait() on transient httpx errors#340

fix(endpoint): retry EndpointJob.wait() on transient httpx errors#340
deanq merged 5 commits into
mainfrom
deanq/ae-3154-endpointjob-wait-retry

deanq commented May 25, 2026

Uh oh!

promptless Bot commented May 25, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

runpod-Henrik left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

deanq commented May 25, 2026

Summary

Test plan

Uh oh!

promptless Bot commented May 25, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

runpod-Henrik left a comment

Choose a reason for hiding this comment

Henrik's AI-Powered Bug Finder — PR #340 Review

1. Issue: user-supplied timeout= can be exceeded by ~30s per retried poll

2. Nit: `(httpx.TransportError, httpx.TimeoutException)` is redundant

3. Question: should `httpx.DecodingError` be retried too?

4. Nit: retry events are logged at DEBUG

5. Question: `_wait_resilient` cleanup in `flash-examples` — tracked?

Nits

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

1. Issue: user-supplied `timeout=` can be exceeded by ~30s per retried poll