fix(endpoint): retry EndpointJob.wait() on transient httpx errors#340
Conversation
EndpointJob.wait() previously aborted on a single httpx.RemoteProtocolError
(or any other transient transport/timeout failure) raised by the Runpod
/v2/{id}/status/{job_id} poll, even though the underlying job was still
healthy. Multi-minute cold starts amplify this: one dropped poll fails a
five-minute wait that was nearly complete.
Catch httpx.TransportError and httpx.TimeoutException inside the polling
loop, log at debug, apply the existing exponential backoff, and continue.
Re-raise only when:
- the user-supplied timeout deadline is exceeded (TimeoutError), or
- _POLL_MAX_CONSECUTIVE_ERRORS (5) consecutive failures hit, so dead
endpoints still fail loud.
The counter resets on any successful poll. httpx.HTTPStatusError (4xx
auth/config bugs) is intentionally NOT caught — it propagates immediately.
Refs AE-3154.
|
Promptless prepared a documentation update related to this change. Triggered by runpod/flash PR #340 Documents that the Flash SDK's Review: Document EndpointJob.wait() retry behavior for transient errors |
There was a problem hiding this comment.
Pull request overview
This PR improves the resilience of EndpointJob.wait() polling by tolerating transient httpx transport/timeouts during job status checks, rather than aborting the entire wait on a single dropped connection.
Changes:
- Add transient-error retry handling to
EndpointJob.wait()with exponential backoff and a maximum consecutive-error threshold. - Introduce
_POLL_MAX_CONSECUTIVE_ERRORSto cap tolerated consecutive transient failures. - Add unit tests covering transient error retry, threshold behavior, counter reset, and
HTTPStatusErrorpropagation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
src/runpod_flash/endpoint.py |
Adds retry/backoff logic in EndpointJob.wait() for transient httpx transport/timeout errors with a consecutive-error threshold. |
tests/unit/test_endpoint_client.py |
Adds unit tests validating retry behavior and error propagation for EndpointJob.wait(). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
The previous test passed `timeout=0.1` against the default `_POLL_INITIAL_INTERVAL=0.25`, so `wait()` raised `TimeoutError` from its pre-sleep deadline guard before ever calling `status()`. The retry path was never exercised — the test only validated the pre-sleep guard. Apply the `fast_poll` fixture and lift the consecutive-error threshold above the number of retries the deadline allows, so multiple httpx errors are actually suppressed before the deadline trips. Assert `_api_get.call_count >= 2` to lock in that the retry path runs. Surfaced by Copilot review on AE-3154.
runpod-Henrik
left a comment
There was a problem hiding this comment.
Henrik's AI-Powered Bug Finder — PR #340 Review
Verdict: PASS WITH NITS
Targeted, minimal fix for a real cold-start failure mode. Catching (httpx.TransportError, httpx.TimeoutException) around self.status(), escalating after _POLL_MAX_CONSECUTIVE_ERRORS consecutive failures, and resetting on success is the right shape. HTTPStatusError deliberately not caught — auth/config bugs still fail loud, as documented. Counter reset on success prevents cumulative noise. Tests cover all five paths (one-shot transient, threshold escalation, burst with reset, HTTPStatusError pass-through, deadline still authoritative). CI green on 3.10/3.11/3.12.
1. Issue: user-supplied timeout= can be exceeded by ~30s per retried poll
User scenario: a caller writes await job.wait(timeout=10) expecting the call to return within 10 seconds. A worker is mid-cold-start and the proxy is dropping connections. Each await self.status() hangs against _api_get's default timeout=30.0 (endpoint.py:898) before raising httpx.ReadTimeout. The new code catches that, increments the counter, and loops back — but the deadline check at endpoint.py:144,150 happens before asyncio.sleep, never around the status() call. So if the deadline trips while _api_get is blocked in its 30-second httpx timeout, the user waits the full 30s past their declared deadline before the next loop iteration finally raises TimeoutError.
Concretely: wait(timeout=10) against a hung endpoint can realistically take 30–40 seconds wall-clock to raise.
Two cheap mitigations:
- Tighten
_api_get's effective timeout for thewait()poll path by passing a bound (e.g., min of remaining deadline and 5s) when called fromwait(). - Or wrap the
status()call inasyncio.wait_for(self.status(), timeout=remaining_deadline)so the deadline is authoritative even mid-call.
Pre-existing concern in shape, but this PR makes it worse: previously a single httpx timeout bailed immediately; now we suppress it and re-enter the wait. Worth at least a docstring note that `timeout` is best-effort and can be exceeded by one poll's httpx timeout.
2. Nit: `(httpx.TransportError, httpx.TimeoutException)` is redundant
`httpx.TimeoutException` is a subclass of `httpx.TransportError` (`HTTPError → RequestError → TransportError → TimeoutException` in httpx's exception tree). Catching both has no effect over catching just `TransportError`. Either drop `TimeoutException` from the tuple, or add a comment noting the redundancy is intentional for documentation purposes.
3. Question: should `httpx.DecodingError` be retried too?
`DecodingError` is a `RequestError` but not a `TransportError`. If the Runpod API returns a malformed response (corrupt gzip, partial JSON during a server hiccup), the current catch wouldn't swallow it and `wait()` would abort. That's defensible — "the server is returning garbage" feels different from "the connection dropped" — but worth confirming whether you've actually seen this shape in the wild, or it's just hypothetical. If hypothetical, leave it strict.
4. Nit: retry events are logged at DEBUG
`log.debug("transient httpx error polling job %s ...")` — operators on the default `LOG_LEVEL=INFO` (or `WARNING`) won't see why a `wait()` is taking longer than expected. When someone files "my wait() hung for 90 seconds before failing," the only diagnostic is to turn debug logging on and reproduce.
Consider logging the first retry at INFO ("transient error polling job %s, retrying") and subsequent at DEBUG. Or expose `consecutive_errors` / cumulative retry count on `EndpointJob` so callers can introspect.
Not a blocker — debug is the conservative choice — but worth weighing against operability.
5. Question: `_wait_resilient` cleanup in `flash-examples` — tracked?
PR description notes the user-space workaround in `flash-examples/02_ml_inference/02_vllm_chat/vllm_chat.py` is now obsolete and cleanup is intentionally deferred. Is there a tracking ticket / follow-up PR open against flash-examples? If not, this kind of dangling cruft tends to live forever — would be worth filing one before merge so it doesn't get lost.
Nits
- `tests/unit/test_endpoint_client.py:212` — `fast_poll` fixture is great. Worth applying it to `test_wait_timeout_raises` too (currently at line ~199 in the unchanged section) so existing timeout tests don't sit on a real 0.3s sleep.
- `endpoint.py:144,150` — the deadline check is duplicated before and after `asyncio.sleep`. Pre-existing, but with the new retry path it might be cleaner to extract a `_check_deadline()` helper and call it once per loop top. Out of scope for this PR.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
Address Henrik QA feedback on PR #340: - Wrap status() call in asyncio.wait_for with remaining deadline so the user-supplied timeout is authoritative even when _api_get's httpx timeout would otherwise exceed it. - Drop redundant httpx.TimeoutException from except tuple (subclass of TransportError). - Log first transient retry at INFO so operators see why wait() is taking longer; keep subsequent retries at DEBUG to avoid spam.
Summary
EndpointJob.wait()previously calledself.status()with zero exception handling, so a single transienthttpx.RemoteProtocolError(or any transport/timeout failure) on the Runpod/v2/{id}/status/{job_id}poll aborted the whole wait — even though the underlying job was still healthy. Cold starts (model download, vLLM compile, CUDA graph capture) make this very visible: one dropped poll fails a five-minute wait that was nearly complete.httpx.TransportErrorandhttpx.TimeoutException, logs at debug, applies the existing exponential backoff, and continues. It re-raises only when the user-suppliedtimeoutdeadline is exceeded (stillTimeoutError), or when_POLL_MAX_CONSECUTIVE_ERRORS(5) consecutive failures hit — so genuinely dead endpoints still fail loud. The counter resets on any successful poll.httpx.HTTPStatusError(4xx auth/config bugs fromraise_for_status) is intentionally NOT caught — it propagates immediately._wait_resilientworkaround inflash-examples/02_ml_inference/02_vllm_chat/vllm_chat.pyis now obsolete; cleanup of that file is intentionally out of scope for this PR.Refs AE-3154.
Test plan
make quality-checkpasses (all tests + lint/format, coverage 85.45%).tests/unit/test_endpoint_client.py::TestEndpointJobWaitTransientErrors:RemoteProtocolErroronce, thenCOMPLETED—wait()returns normally (2 polls).RemoteProtocolError—wait()re-raises after_POLL_MAX_CONSECUTIVE_ERRORSpolls.COMPLETED— counter resets,wait()completes.HTTPStatusError(401)is NOT swallowed; re-raised on first call.RemoteProtocolErrorforever +timeout=0.1—wait()raisesTimeoutError, not the httpx error.await job.wait()survives mid-poll TCP drops instead of aborting.