Skip to content

fix(endpoint): retry EndpointJob.wait() on transient httpx errors#340

Merged
deanq merged 5 commits into
mainfrom
deanq/ae-3154-endpointjob-wait-retry
May 27, 2026
Merged

fix(endpoint): retry EndpointJob.wait() on transient httpx errors#340
deanq merged 5 commits into
mainfrom
deanq/ae-3154-endpointjob-wait-retry

Conversation

@deanq
Copy link
Copy Markdown
Member

@deanq deanq commented May 25, 2026

Summary

  • EndpointJob.wait() previously called self.status() with zero exception handling, so a single transient httpx.RemoteProtocolError (or any transport/timeout failure) on the Runpod /v2/{id}/status/{job_id} poll aborted the whole wait — even though the underlying job was still healthy. Cold starts (model download, vLLM compile, CUDA graph capture) make this very visible: one dropped poll fails a five-minute wait that was nearly complete.
  • The polling loop now catches httpx.TransportError and httpx.TimeoutException, logs at debug, applies the existing exponential backoff, and continues. It re-raises only when the user-supplied timeout deadline is exceeded (still TimeoutError), or when _POLL_MAX_CONSECUTIVE_ERRORS (5) consecutive failures hit — so genuinely dead endpoints still fail loud. The counter resets on any successful poll.
  • httpx.HTTPStatusError (4xx auth/config bugs from raise_for_status) is intentionally NOT caught — it propagates immediately.
  • The user-space _wait_resilient workaround in flash-examples/02_ml_inference/02_vllm_chat/vllm_chat.py is now obsolete; cleanup of that file is intentionally out of scope for this PR.

Refs AE-3154.

Test plan

  • make quality-check passes (all tests + lint/format, coverage 85.45%).
  • New unit tests in tests/unit/test_endpoint_client.py::TestEndpointJobWaitTransientErrors:
    • transient RemoteProtocolError once, then COMPLETEDwait() returns normally (2 polls).
    • persistent RemoteProtocolErrorwait() re-raises after _POLL_MAX_CONSECUTIVE_ERRORS polls.
    • error / success / error burst / success / COMPLETED — counter resets, wait() completes.
    • HTTPStatusError(401) is NOT swallowed; re-raised on first call.
    • RemoteProtocolError forever + timeout=0.1wait() raises TimeoutError, not the httpx error.
  • Manual smoke against an endpoint with cold workers (vLLM): await job.wait() survives mid-poll TCP drops instead of aborting.

EndpointJob.wait() previously aborted on a single httpx.RemoteProtocolError
(or any other transient transport/timeout failure) raised by the Runpod
/v2/{id}/status/{job_id} poll, even though the underlying job was still
healthy. Multi-minute cold starts amplify this: one dropped poll fails a
five-minute wait that was nearly complete.

Catch httpx.TransportError and httpx.TimeoutException inside the polling
loop, log at debug, apply the existing exponential backoff, and continue.
Re-raise only when:
  - the user-supplied timeout deadline is exceeded (TimeoutError), or
  - _POLL_MAX_CONSECUTIVE_ERRORS (5) consecutive failures hit, so dead
    endpoints still fail loud.

The counter resets on any successful poll. httpx.HTTPStatusError (4xx
auth/config bugs) is intentionally NOT caught — it propagates immediately.

Refs AE-3154.
@promptless
Copy link
Copy Markdown

promptless Bot commented May 25, 2026

Promptless prepared a documentation update related to this change.

Triggered by runpod/flash PR #340

Documents that the Flash SDK's EndpointJob.wait() method now automatically retries transient network errors (connection drops, timeouts, protocol errors) with exponential backoff, making it resilient during cold starts. HTTP status errors like 401 and 404 fail immediately without retry.

Review: Document EndpointJob.wait() retry behavior for transient errors

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the resilience of EndpointJob.wait() polling by tolerating transient httpx transport/timeouts during job status checks, rather than aborting the entire wait on a single dropped connection.

Changes:

  • Add transient-error retry handling to EndpointJob.wait() with exponential backoff and a maximum consecutive-error threshold.
  • Introduce _POLL_MAX_CONSECUTIVE_ERRORS to cap tolerated consecutive transient failures.
  • Add unit tests covering transient error retry, threshold behavior, counter reset, and HTTPStatusError propagation.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
src/runpod_flash/endpoint.py Adds retry/backoff logic in EndpointJob.wait() for transient httpx transport/timeout errors with a consecutive-error threshold.
tests/unit/test_endpoint_client.py Adds unit tests validating retry behavior and error propagation for EndpointJob.wait().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/unit/test_endpoint_client.py Outdated
The previous test passed `timeout=0.1` against the default
`_POLL_INITIAL_INTERVAL=0.25`, so `wait()` raised `TimeoutError` from
its pre-sleep deadline guard before ever calling `status()`. The retry
path was never exercised — the test only validated the pre-sleep guard.

Apply the `fast_poll` fixture and lift the consecutive-error threshold
above the number of retries the deadline allows, so multiple httpx
errors are actually suppressed before the deadline trips. Assert
`_api_get.call_count >= 2` to lock in that the retry path runs.

Surfaced by Copilot review on AE-3154.
@deanq deanq requested review from KAJdev, jhcipar and runpod-Henrik May 25, 2026 04:27
Copy link
Copy Markdown
Contributor

@runpod-Henrik runpod-Henrik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Henrik's AI-Powered Bug Finder — PR #340 Review

Verdict: PASS WITH NITS

Targeted, minimal fix for a real cold-start failure mode. Catching (httpx.TransportError, httpx.TimeoutException) around self.status(), escalating after _POLL_MAX_CONSECUTIVE_ERRORS consecutive failures, and resetting on success is the right shape. HTTPStatusError deliberately not caught — auth/config bugs still fail loud, as documented. Counter reset on success prevents cumulative noise. Tests cover all five paths (one-shot transient, threshold escalation, burst with reset, HTTPStatusError pass-through, deadline still authoritative). CI green on 3.10/3.11/3.12.


1. Issue: user-supplied timeout= can be exceeded by ~30s per retried poll

User scenario: a caller writes await job.wait(timeout=10) expecting the call to return within 10 seconds. A worker is mid-cold-start and the proxy is dropping connections. Each await self.status() hangs against _api_get's default timeout=30.0 (endpoint.py:898) before raising httpx.ReadTimeout. The new code catches that, increments the counter, and loops back — but the deadline check at endpoint.py:144,150 happens before asyncio.sleep, never around the status() call. So if the deadline trips while _api_get is blocked in its 30-second httpx timeout, the user waits the full 30s past their declared deadline before the next loop iteration finally raises TimeoutError.

Concretely: wait(timeout=10) against a hung endpoint can realistically take 30–40 seconds wall-clock to raise.

Two cheap mitigations:

  • Tighten _api_get's effective timeout for the wait() poll path by passing a bound (e.g., min of remaining deadline and 5s) when called from wait().
  • Or wrap the status() call in asyncio.wait_for(self.status(), timeout=remaining_deadline) so the deadline is authoritative even mid-call.

Pre-existing concern in shape, but this PR makes it worse: previously a single httpx timeout bailed immediately; now we suppress it and re-enter the wait. Worth at least a docstring note that `timeout` is best-effort and can be exceeded by one poll's httpx timeout.


2. Nit: `(httpx.TransportError, httpx.TimeoutException)` is redundant

`httpx.TimeoutException` is a subclass of `httpx.TransportError` (`HTTPError → RequestError → TransportError → TimeoutException` in httpx's exception tree). Catching both has no effect over catching just `TransportError`. Either drop `TimeoutException` from the tuple, or add a comment noting the redundancy is intentional for documentation purposes.


3. Question: should `httpx.DecodingError` be retried too?

`DecodingError` is a `RequestError` but not a `TransportError`. If the Runpod API returns a malformed response (corrupt gzip, partial JSON during a server hiccup), the current catch wouldn't swallow it and `wait()` would abort. That's defensible — "the server is returning garbage" feels different from "the connection dropped" — but worth confirming whether you've actually seen this shape in the wild, or it's just hypothetical. If hypothetical, leave it strict.


4. Nit: retry events are logged at DEBUG

`log.debug("transient httpx error polling job %s ...")` — operators on the default `LOG_LEVEL=INFO` (or `WARNING`) won't see why a `wait()` is taking longer than expected. When someone files "my wait() hung for 90 seconds before failing," the only diagnostic is to turn debug logging on and reproduce.

Consider logging the first retry at INFO ("transient error polling job %s, retrying") and subsequent at DEBUG. Or expose `consecutive_errors` / cumulative retry count on `EndpointJob` so callers can introspect.

Not a blocker — debug is the conservative choice — but worth weighing against operability.


5. Question: `_wait_resilient` cleanup in `flash-examples` — tracked?

PR description notes the user-space workaround in `flash-examples/02_ml_inference/02_vllm_chat/vllm_chat.py` is now obsolete and cleanup is intentionally deferred. Is there a tracking ticket / follow-up PR open against flash-examples? If not, this kind of dangling cruft tends to live forever — would be worth filing one before merge so it doesn't get lost.


Nits

  • `tests/unit/test_endpoint_client.py:212` — `fast_poll` fixture is great. Worth applying it to `test_wait_timeout_raises` too (currently at line ~199 in the unchanged section) so existing timeout tests don't sit on a real 0.3s sleep.
  • `endpoint.py:144,150` — the deadline check is duplicated before and after `asyncio.sleep`. Pre-existing, but with the new retry path it might be cleaner to extract a `_check_deadline()` helper and call it once per loop top. Out of scope for this PR.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

deanq and others added 3 commits May 27, 2026 12:16
Address Henrik QA feedback on PR #340:

- Wrap status() call in asyncio.wait_for with remaining deadline so the
  user-supplied timeout is authoritative even when _api_get's httpx
  timeout would otherwise exceed it.
- Drop redundant httpx.TimeoutException from except tuple
  (subclass of TransportError).
- Log first transient retry at INFO so operators see why wait() is
  taking longer; keep subsequent retries at DEBUG to avoid spam.
@deanq deanq merged commit c126357 into main May 27, 2026
4 checks passed
@deanq deanq deleted the deanq/ae-3154-endpointjob-wait-retry branch May 27, 2026 20:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants