perf(tracing): span queue linger + per-loop httpx keepalive by smoreinis · Pull Request #362 · scaleapi/scale-agentex-python

smoreinis · 2026-05-20T19:41:55Z

Summary

Two compounding causes of slow SGP trace export under load test, fixed together:

Span queue linger — the async drain loop returned size-1 batches almost every time because there was no time window for spans to accumulate. AsyncSpanQueue now lingers up to 100ms (env-tunable via AGENTEX_SPAN_QUEUE_LINGER_MS) so concurrently-emitted spans coalesce into one upsert_batch call. Stops early when the batch fills or on shutdown.
Per-event-loop httpx keepalive — SGPAsyncTracingProcessor, AgentexAsyncTracingProcessor, and the ADK TracingModule disabled keepalive (max_keepalive_connections=0) to avoid "bound to a different event loop" errors in sync-ACP, paying a full TLS handshake per span. Replaced with a per-loop client cache keyed on id(asyncio.get_running_loop()): max_keepalive_connections=20 within each loop, and cross-loop safety is preserved by giving each loop its own client.

The combination is what matters: keepalive alone is wasted if requests stay size-1, and batching alone is wasted if every request pays a TLS handshake.

Test plan

rye run lint — ruff + pyright clean
rye run pytest tests/lib/core/tracing/ — 38 passed, 2 skipped (pre-existing load-test gates)
New tests cover:
- Linger coalesces staggered enqueues into one batch
- linger_ms=0 preserves the old immediate-drain behavior
- Linger respects batch_size cap
- _get_client caches per event loop and only builds the client once
- max_keepalive_connections > 0 regression guard
- Disabled processor (empty sgp_api_key/sgp_account_id) returns None client
- WeakKeyDictionary evicts the entry after a closed loop is GC'd (regression guard against the id()-recycling bug)
Local end-to-end run against examples/tutorials/00_sync/030_langgraph with real SGP credentials (see below)
Re-run the original load test that surfaced the slow exports and confirm improved export latency / throughput

End-to-end verification

Validated the fix against examples/tutorials/00_sync/030_langgraph (sync-ACP + LangGraph + SGP — the exact pattern that motivated the per-loop client cache). Editable-installed this branch's SDK into the tutorial's uv venv, ran agentex agents run --manifest manifest.yaml against the local AgentEx stack, sent three messages, then queried SGP back via the API.

Agent boot: clean — add_tracing_processor_config at module load, processor registers, _get_client() returns a real client (not None).
Span hierarchy preserved: AGENT_WORKFLOW root with COMPLETION/CUSTOM children parented correctly.
Zero "bound to a different event loop" errors in the agent log throughout.
All three traces fully delivered to SGP, including the tool-calling trace from the LangGraph callback handler:

trace_id (prefix)	spans landed in SGP
`8916dd2f-…`	`AGENT_WORKFLOW:message` → `COMPLETION:llm:gpt-5`
`2c534aa7-…`	`AGENT_WORKFLOW:message` → `COMPLETION:llm:gpt-5`
`8e5346a6-…`	`AGENT_WORKFLOW:message` → 2× `COMPLETION:llm:gpt-5` + `CUSTOM:tool:get_weather`

Greptile Summary

This PR addresses two compounding performance bottlenecks in SGP trace export: a span queue that returned size-1 batches due to no coalescing window, and per-request TLS handshakes caused by disabled HTTP keepalive. Both are fixed together, since batching without keepalive and keepalive without batching each only deliver partial benefit.

Span queue linger: AsyncSpanQueue now waits up to linger_ms (default 100ms, env-tunable) after the first item arrives, using asyncio.wait_for with a deadline. Items stop accumulating on batch fill, timeout, or shutdown. The linger is suppressed immediately when _stopping is set so new batches after shutdown begin draining at once.
Per-loop keepalive client cache: All three processors (SGPAsyncTracingProcessor, AgentexAsyncTracingProcessor, TracingModule) replaced max_keepalive_connections=0 with a WeakKeyDictionary[asyncio.AbstractEventLoop, client] cache keyed on the live loop object. Cross-loop safety is preserved because each loop gets its own client, and GC'd loops (and their stale clients) are evicted automatically — closing the id()-recycling race that the previous review flagged.
New tests cover linger coalescing, linger_ms=0 back-compat, batch-size cap enforcement, per-loop caching, keepalive regression, disabled-processor short-circuit, and WeakKeyDictionary eviction.

Confidence Score: 5/5

Safe to merge — changes are purely additive performance improvements with no behavioral regression in the happy path and well-tested edge cases.

The linger window is bounded, shutdown interaction is correct (task_done always fires in finally, _queue.join() waits for in-flight batches), and the WeakKeyDictionary cache addresses the id()-recycling issue raised in the previous review. All three processors now use consistent patterns with comprehensive regression tests.

No files require special attention.

Important Files Changed

Filename	Overview
src/agentex/lib/core/tracing/span_queue.py	Adds linger window via asyncio.wait_for with a loop-time deadline; shutdown interaction is safe (linger bounded by linger_ms, task_done always called in finally block, _queue.join() accounts for in-flight items)
src/agentex/lib/core/tracing/processors/sgp_tracing_processor.py	Replaces single eagerly-built client with WeakKeyDictionary per-loop cache; _get_client() correctly short-circuits on disabled, handles no-running-loop fallback, and preserves existing warning log on on_spans_start when disabled
src/agentex/lib/core/tracing/processors/agentex_tracing_processor.py	Mirrors SGP pattern with WeakKeyDictionary; client property is lazily built per loop — construction no longer touches httpx at import time, eliminating the no-running-loop risk at module load
src/agentex/lib/adk/_modules/tracing.py	One-liner: max_keepalive_connections 0 to 20; the existing _bound_loop_id check already rebuilds the client on loop change, so cross-loop safety is unaffected
tests/lib/core/tracing/test_span_queue.py	Three new linger test cases covering coalescing, zero-linger back-compat, and batch-size cap; timing-based but margins are generous (100ms window, 20ms gaps) and consistent with the existing queue test style
tests/lib/core/tracing/processors/test_sgp_tracing_processor.py	Existing tests updated to stub _get_client() instead of wiring sgp_async_client directly; four new tests cover caching, keepalive regression, WeakKeyDictionary eviction, and disabled-processor path
tests/lib/core/tracing/processors/test_agentex_tracing_processor.py	New test file mirroring SGP processor tests; importorskip guard for pydantic_ai is a good defensive touch for dev envs without the full install

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Queue.get() — block for first item"] --> B["batch = [first]"]
    B --> C{linger_ms > 0 AND not _stopping?}
    C -- Yes --> D["deadline = loop.time() + linger_ms/1000"]
    D --> E{len batch < batch_size?}
    E -- Yes --> F["remaining = deadline minus now"]
    F --> G{remaining > 0?}
    G -- No --> P
    G -- Yes --> H["await wait_for(queue.get(), timeout=remaining)"]
    H -- item --> I["append to batch"] --> E
    H -- TimeoutError --> P
    C -- No --> J["get_nowait() until empty or batch_size reached"]
    J --> P
    P["Process: START items to on_spans_start, END items to on_spans_end"] --> Q["finally: task_done x len batch"]
    Q --> A
    subgraph Per-loop client lookup
        R["get_running_loop()"] -- RuntimeError --> S["build one-off client no cache"]
        R -- loop --> T{loop in WeakKeyDict?}
        T -- Yes --> U["return cached client"]
        T -- No --> V["_build_client() + store in WeakKeyDict"] --> U
    end

_{Reviews (3): Last reviewed commit: "test(tracing): cover Agentex per-loop ca..." | Re-trigger Greptile}

Two compounding causes of slow SGP trace export under load: - The async drain loop returned size-1 batches almost every time because there was no time window for spans to accumulate. Adds a 100ms linger (tunable via AGENTEX_SPAN_QUEUE_LINGER_MS) so concurrently-emitted spans coalesce into one upsert_batch call. - httpx keepalive was disabled (max_keepalive_connections=0) in SGPAsyncTracingProcessor, AgentexAsyncTracingProcessor, and the ADK TracingModule to avoid "bound to a different event loop" errors in sync-ACP. Each span paid a full TLS handshake. Replaced with a per-event-loop client cache keyed on id(asyncio.get_running_loop()); connections are reused within a loop and cross-loop safety is preserved. Tests cover linger coalescing, batch-size cap interaction, per-loop client caching, a keepalive-enabled regression guard, and disabled-processor null-client behavior.

Addresses Greptile review feedback on PR #362. The original `dict[int, AsyncSGPClient]` cache used `id(asyncio.get_running_loop())` as the key. In CPython `id()` returns a memory address, and once a loop is garbage-collected its address can be assigned to a new loop — a fresh loop hashing to a stale entry would receive a client whose httpx.AsyncClient was bound to the dead loop, reintroducing the "bound to a different event loop" error this PR was built to prevent. Switching the cache to `weakref.WeakKeyDictionary` keyed on the loop object itself fixes the bug: the entry is evicted automatically when the loop is collected, so id() recycling can't cause stale-client reuse. Multi-loop caching benefit is preserved (better than the single-slot pattern in TracingModule for agents that bounce between loops). Same fix applied to AgentexAsyncTracingProcessor. Added a regression test verifying the cache evicts a closed/dropped loop's entry after gc.collect().

Addresses both Greptile P3 findings on PR #362: - AgentexAsyncTracingProcessor implemented the same per-loop client cache pattern as SGPAsyncTracingProcessor but had no dedicated test file. Added test_agentex_tracing_processor.py mirroring the SGP coverage: single-build-per-loop, keepalive-enabled regression guard, and WeakKeyDictionary eviction after GC. Skipped cleanly with pytest.importorskip when pydantic_ai isn't installed (the SDK dev venv state), since agentex_tracing_processor pulls in agentex.lib.adk which requires it. - test_linger_respects_batch_size_cap used linger_ms=500, forcing the tail singleton batch to wait out the full 500ms timeout — the test only asserts no batch exceeds the cap, so dropping to linger_ms=50 keeps correctness while cutting wall time by ~10x.

smoreinis · 2026-05-21T17:42:12Z

Both Greptile P3 follow-ups addressed in 9bb4ae6:

AgentexAsyncTracingProcessor per-loop cache now has dedicated coverage in tests/lib/core/tracing/processors/test_agentex_tracing_processor.py, mirroring the SGP side: single-build-per-loop, max_keepalive_connections > 0 regression guard, and WeakKeyDictionary eviction after GC. The file uses pytest.importorskip("pydantic_ai") so it skips cleanly in the SDK dev venv (where pydantic_ai isn't installed) and runs in venvs that have the fuller dep set.
test_linger_respects_batch_size_cap dropped from linger_ms=500 to linger_ms=50, cutting the tail-singleton wait by ~10× without changing the assertion.

Verified locally: 3 new Agentex tests pass when pydantic_ai is available; cap test still passes; full rye run pytest tests/lib/core/tracing/ still green (39 passed, 3 skipped from the importorskip).

stainless-app Bot and others added 4 commits May 18, 2026 22:40

feat(api): add schedule, checkpoints, and deployment endpoints

53b5c36

fix: resolve lint and test failures from new endpoints (#360)

bdf129c

feat: added Pydantic AI sync, async, temporal integration (#359)

781dfe1

greptile-apps Bot reviewed May 20, 2026

View reviewed changes

Comment thread src/agentex/lib/core/tracing/processors/sgp_tracing_processor.py Outdated

danielmillerp approved these changes May 20, 2026

View reviewed changes

stainless-app Bot force-pushed the next branch from 781dfe1 to 84dbf72 Compare May 21, 2026 18:46

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(tracing): span queue linger + per-loop httpx keepalive#362

perf(tracing): span queue linger + per-loop httpx keepalive#362
smoreinis wants to merge 6 commits into
nextfrom
stas/tracing-perf-linger-keepalive

smoreinis commented May 20, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

smoreinis commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

smoreinis commented May 20, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test plan

End-to-end verification

Greptile Summary

Confidence Score: 5/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

smoreinis commented May 21, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

smoreinis commented May 20, 2026 •

edited by greptile-apps Bot

Loading