fix(upload_file): unstick concurrent uploads stalling at 30s#803
Open
jrvb-rl wants to merge 1 commit into
Open
fix(upload_file): unstick concurrent uploads stalling at 30s#803jrvb-rl wants to merge 1 commit into
jrvb-rl wants to merge 1 commit into
Conversation
…erver Customers running many concurrent upload_file calls through a single SDK instance were hitting 408 "Request timed out waiting for request data" from mux/Jetty. Triage on 2026-05-14 found the request-body simply wasn't arriving within Jetty's 30 s data-wait window. Three contributing causes, all introduced by recent SDK changes: 1. DEFAULT_CONNECTION_LIMITS was lowered from (100, 20) to (20, 10) in #797. Combined with the new shared process-global pool, parallel uploads divide bandwidth across far fewer TCP connections than before and starve each other on multipart bodies. Restore the prior (100, 20). 2. _should_retry blanket-retries 408, including upload_file. Each retry pushes another large multipart body onto the same exhausted pool and amplifies the stall instead of recovering from it. Carve out upload_file 408 — other endpoints still retry 408 as before. 3. _files.py read the entire PathLike into memory via read_bytes() before the request was built, contradicting the docstring claim of streaming. Hand httpx an open file handle so the multipart encoder reads lazily and the pool slot isn't held during disk I/O. Tests updated to assert IsInstance(io.IOBase) instead of IsBytes() and to close handles. Also exposes a public connection_limits knob on Runloop / AsyncRunloop (and copy()) so customers can raise the cap above 100 for upload-heavy workloads without needing a custom httpx client. Passing connection_limits implicitly opts out of the shared pool, since the shared pool is process-global and can't honor per-client limits. Includes examples/stress_upload_file.py — a standalone repro that fires N concurrent upload_file calls and reports status-code distribution and latency percentiles, so the before/after delta is one command away. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Customers running many concurrent
upload_filecalls through a single SDK instance were hitting bursts of408 Request timed out waiting for request datafrom mux/Jetty. Triage on 2026-05-14 traced this to three things, all introduced by recent SDK changes:_should_retryblanket-retries 408, so a stalled upload immediately queues another large multipart body onto the same exhausted pool — amplifying the stall instead of recovering from it._files.pyread the entirePathLikeinto memory viaread_bytes()before the request was even built, contradicting the docstring's claim of streaming. Holds the pool slot during disk I/O for large files.Changes
_constants.pyDEFAULT_CONNECTION_LIMITSrestored to(max_connections=100, max_keepalive_connections=20)._base_client.py_should_retryreturnsFalsefor 408s on/upload_file. Other 408s, 429s, 5xxs still retry._base_client.pyconnection_limits: httpx.Limits | None = NoneonSyncAPIClient/AsyncAPIClient. Passing it gives the client a private pool with the requested limits (implicitly opts out of the shared pool — that's process-global and can't honor per-client limits)._client.pyconnection_limitsonRunloop/AsyncRunloopconstructors andcopy()/with_options()._files.py_transform_file/_async_transform_filereturnpath.open(\"rb\")(anio.IOBase) instead ofread_bytes(). httpx streams from the handle.tests/test_files.pyto expect streamed handles, added 2 new_should_retrytests, addedTestSyncConnectionLimits/TestAsyncConnectionLimitscovering the new option.examples/stress_upload_file.pyupload_filecalls and reports per-status counts and p50/p95/p99 latency. Knobs:--concurrency,--total,--file-size-kb,--max-retries,--max-connections,--max-keepalive.Usage of the new option
Test plan
uv run pytest tests/— 1079 passed locallytest_upload_file_408_does_not_retry,test_non_upload_file_408_still_retries,TestSyncConnectionLimits,TestAsyncConnectionLimitsexamples/stress_upload_file.pyagainst prod with--concurrency 50 --total 200:v1.20.3, expect a burst of 408s at ~30 s eachNotes for reviewers
_constants.pyis Stainless-generated; the (100, 20) value just restores the pre-feat: share HTTP connection pool across SDK instances; refactor polling #797 default and we should probably move this off the generated file longer-term so it stops being a release-time merge conflict candidate.http2: bool = Trueas a sibling knob if that's a follow-up we want.🤖 Generated with Claude Code