fix(endpoint): retry EndpointJob.wait() on transient httpx errors #340
deanq wants to merge 2 commits
EndpointJob.wait() previously aborted on a single httpx.RemoteProtocolError
(or any other transient transport/timeout failure) raised by the Runpod
/v2/{id}/status/{job_id} poll, even though the underlying job was still
healthy. Multi-minute cold starts amplify this: one dropped poll fails a
five-minute wait that was nearly complete.
Catch httpx.TransportError and httpx.TimeoutException inside the polling
loop, log at debug, apply the existing exponential backoff, and continue.
Re-raise only when:
- the user-supplied timeout deadline is exceeded (TimeoutError), or
- _POLL_MAX_CONSECUTIVE_ERRORS (5) consecutive failures hit, so dead
endpoints still fail loud.
The counter resets on any successful poll. httpx.HTTPStatusError (4xx
auth/config bugs) is intentionally NOT caught — it propagates immediately.
Refs AE-3154.
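
The behavior described above can be sketched roughly as follows. This is a minimal, self-contained illustration, not the code from `src/runpod_flash/endpoint.py`: `wait_with_retry`, `TransientPollError` (a stand-in for `httpx.TransportError` / `httpx.TimeoutException`), the interval values, and the exact shape of the deadline guard are all assumptions made for the example. Only the threshold of 5 and the reset-on-success rule come from the description.

```python
import asyncio
import logging
import time

log = logging.getLogger(__name__)


class TransientPollError(Exception):
    """Hypothetical stand-in for httpx.TransportError / httpx.TimeoutException."""


MAX_CONSECUTIVE_ERRORS = 5  # mirrors _POLL_MAX_CONSECUTIVE_ERRORS (5) from the PR


async def wait_with_retry(poll, timeout=None, initial=0.01, factor=2.0,
                          max_interval=0.05, transient=(TransientPollError,)):
    """Poll until a terminal status, tolerating transient poll failures.

    `poll` is any zero-argument coroutine function returning a status string.
    Interval values are illustrative and kept tiny so the sketch runs quickly.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    interval = initial
    consecutive = 0
    while True:
        # Assumed pre-sleep guard: give up if the next sleep would overshoot.
        if deadline is not None and time.monotonic() + interval > deadline:
            raise TimeoutError("job did not finish before the deadline")
        await asyncio.sleep(interval)
        try:
            status = await poll()
        except transient as exc:
            consecutive += 1
            log.debug("transient poll error %d/%d: %r",
                      consecutive, MAX_CONSECUTIVE_ERRORS, exc)
            if consecutive >= MAX_CONSECUTIVE_ERRORS:
                raise  # endpoint looks dead: fail loud
        else:
            consecutive = 0  # any successful poll resets the counter
            if status in ("COMPLETED", "FAILED"):  # illustrative terminal states
                return status
        # Exceptions outside `transient` (e.g. an HTTP 4xx) are never caught here.
        interval = min(interval * factor, max_interval)
```

Because only the exception types listed in `transient` are caught, anything else (the analogue of `httpx.HTTPStatusError`) propagates on the first occurrence without any extra handling.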
Pull request overview
This PR makes EndpointJob.wait() polling more resilient: transient httpx transport and timeout errors during job status checks are now tolerated, instead of a single dropped connection aborting the entire wait.
Changes:
- Add transient-error retry handling to `EndpointJob.wait()` with exponential backoff and a maximum consecutive-error threshold.
- Introduce `_POLL_MAX_CONSECUTIVE_ERRORS` to cap tolerated consecutive transient failures.
- Add unit tests covering transient error retry, threshold behavior, counter reset, and `HTTPStatusError` propagation.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `src/runpod_flash/endpoint.py` | Adds retry/backoff logic in `EndpointJob.wait()` for transient httpx transport/timeout errors with a consecutive-error threshold. |
| `tests/unit/test_endpoint_client.py` | Adds unit tests validating retry behavior and error propagation for `EndpointJob.wait()`. |
The previous test passed `timeout=0.1` against the default `_POLL_INITIAL_INTERVAL=0.25`, so `wait()` raised `TimeoutError` from its pre-sleep deadline guard before ever calling `status()`. The retry path was never exercised — the test only validated the pre-sleep guard. Apply the `fast_poll` fixture and lift the consecutive-error threshold above the number of retries the deadline allows, so multiple httpx errors are actually suppressed before the deadline trips. Assert `_api_get.call_count >= 2` to lock in that the retry path runs. Surfaced by Copilot review on AE-3154.
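
The arithmetic behind that fix can be illustrated with a small simulation. This is a sketch under assumptions: the guard is modeled as "raise if the next sleep would pass the deadline", and the backoff factor, the cap, and the `0.01` fast-poll interval are hypothetical values, not ones taken from the `fast_poll` fixture or from `endpoint.py`. The function name `polls_before_deadline` is likewise invented for the example.

```python
def polls_before_deadline(timeout, initial_interval, factor=2.0, max_interval=5.0):
    """Count polls an assumed pre-sleep deadline guard would allow.

    Uses a simulated clock (no real sleeping) and ignores the time spent
    inside each poll, so the result is an upper bound.
    """
    elapsed = 0.0
    interval = initial_interval
    polls = 0
    # Guard: only sleep (and then poll) if the sleep ends before the deadline.
    while elapsed + interval <= timeout:
        elapsed += interval
        polls += 1
        interval = min(interval * factor, max_interval)
    return polls


# timeout=0.1 with the default 0.25 s initial interval: the guard trips first.
print(polls_before_deadline(0.1, 0.25))  # 0
# Same timeout with a hypothetical 0.01 s fast-poll interval.
print(polls_before_deadline(0.1, 0.01))  # 3
```

Under these assumed numbers the old test never reached `status()`, while a shorter interval lets several polls run. For the deadline rather than the error cap to end the wait, the consecutive-error threshold has to exceed the number of polls that fit before the deadline, which is why the test raises it.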
Summary
- `EndpointJob.wait()` previously called `self.status()` with zero exception handling, so a single transient `httpx.RemoteProtocolError` (or any transport/timeout failure) on the Runpod `/v2/{id}/status/{job_id}` poll aborted the whole wait, even though the underlying job was still healthy. Cold starts (model download, vLLM compile, CUDA graph capture) make this very visible: one dropped poll fails a five-minute wait that was nearly complete.
- The polling loop now catches `httpx.TransportError` and `httpx.TimeoutException`, logs at debug, applies the existing exponential backoff, and continues. It re-raises only when the user-supplied `timeout` deadline is exceeded (still `TimeoutError`), or when `_POLL_MAX_CONSECUTIVE_ERRORS` (5) consecutive failures hit, so genuinely dead endpoints still fail loud. The counter resets on any successful poll. `httpx.HTTPStatusError` (4xx auth/config bugs from `raise_for_status`) is intentionally NOT caught; it propagates immediately.
- The `_wait_resilient` workaround in `flash-examples/02_ml_inference/02_vllm_chat/vllm_chat.py` is now obsolete; cleanup of that file is intentionally out of scope for this PR.

Refs AE-3154.
Test plan
- `make quality-check` passes (all tests + lint/format, coverage 85.45%).
- `tests/unit/test_endpoint_client.py::TestEndpointJobWaitTransientErrors`:
  - `RemoteProtocolError` once, then `COMPLETED`: `wait()` returns normally (2 polls).
  - `RemoteProtocolError` on every poll: `wait()` re-raises after `_POLL_MAX_CONSECUTIVE_ERRORS` polls.
  - Errors interrupted by a successful poll, then `COMPLETED`: counter resets, `wait()` completes.
  - `HTTPStatusError(401)` is NOT swallowed; re-raised on first call.
  - `RemoteProtocolError` forever + `timeout=0.1`: `wait()` raises `TimeoutError`, not the httpx error.
- `await job.wait()` survives mid-poll TCP drops instead of aborting.