Skip to content

perf(tracing): span queue linger + per-loop httpx keepalive#362

Open
smoreinis wants to merge 6 commits into
nextfrom
stas/tracing-perf-linger-keepalive
Open

perf(tracing): span queue linger + per-loop httpx keepalive#362
smoreinis wants to merge 6 commits into
nextfrom
stas/tracing-perf-linger-keepalive

Conversation

@smoreinis
Copy link
Copy Markdown
Contributor

@smoreinis smoreinis commented May 20, 2026

Summary

Two compounding causes of slow SGP trace export under load test, fixed together:

  • Span queue linger — the async drain loop returned size-1 batches almost every time because there was no time window for spans to accumulate. AsyncSpanQueue now lingers up to 100ms (env-tunable via AGENTEX_SPAN_QUEUE_LINGER_MS) so concurrently-emitted spans coalesce into one upsert_batch call. Stops early when the batch fills or on shutdown.
  • Per-event-loop httpx keepaliveSGPAsyncTracingProcessor, AgentexAsyncTracingProcessor, and the ADK TracingModule disabled keepalive (max_keepalive_connections=0) to avoid "bound to a different event loop" errors in sync-ACP, paying a full TLS handshake per span. Replaced with a per-loop client cache keyed on id(asyncio.get_running_loop()): max_keepalive_connections=20 within each loop, and cross-loop safety is preserved by giving each loop its own client.

The combination is what matters: keepalive alone is wasted if requests stay size-1, and batching alone is wasted if every request pays a TLS handshake.

Test plan

  • rye run lint — ruff + pyright clean
  • rye run pytest tests/lib/core/tracing/ — 38 passed, 2 skipped (pre-existing load-test gates)
  • New tests cover:
    • Linger coalesces staggered enqueues into one batch
    • linger_ms=0 preserves the old immediate-drain behavior
    • Linger respects batch_size cap
    • _get_client caches per event loop and only builds the client once
    • max_keepalive_connections > 0 regression guard
    • Disabled processor (empty sgp_api_key/sgp_account_id) returns None client
    • WeakKeyDictionary evicts the entry after a closed loop is GC'd (regression guard against the id()-recycling bug)
  • Local end-to-end run against examples/tutorials/00_sync/030_langgraph with real SGP credentials (see below)
  • Re-run the original load test that surfaced the slow exports and confirm improved export latency / throughput

End-to-end verification

Validated the fix against examples/tutorials/00_sync/030_langgraph (sync-ACP + LangGraph + SGP — the exact pattern that motivated the per-loop client cache). Editable-installed this branch's SDK into the tutorial's uv venv, ran agentex agents run --manifest manifest.yaml against the local AgentEx stack, sent three messages, then queried SGP back via the API.

  • Agent boot: clean — add_tracing_processor_config at module load, processor registers, _get_client() returns a real client (not None).
  • Span hierarchy preserved: AGENT_WORKFLOW root with COMPLETION/CUSTOM children parented correctly.
  • Zero "bound to a different event loop" errors in the agent log throughout.
  • All three traces fully delivered to SGP, including the tool-calling trace from the LangGraph callback handler:
trace_id (prefix) spans landed in SGP
8916dd2f-… AGENT_WORKFLOW:messageCOMPLETION:llm:gpt-5
2c534aa7-… AGENT_WORKFLOW:messageCOMPLETION:llm:gpt-5
8e5346a6-… AGENT_WORKFLOW:message → 2× COMPLETION:llm:gpt-5 + CUSTOM:tool:get_weather

Greptile Summary

This PR addresses two compounding performance bottlenecks in SGP trace export: a span queue that returned size-1 batches due to no coalescing window, and per-request TLS handshakes caused by disabled HTTP keepalive. Both are fixed together, since batching without keepalive and keepalive without batching each only deliver partial benefit.

  • Span queue linger: AsyncSpanQueue now waits up to linger_ms (default 100ms, env-tunable) after the first item arrives, using asyncio.wait_for with a deadline. Items stop accumulating on batch fill, timeout, or shutdown. The linger is suppressed immediately when _stopping is set so new batches after shutdown begin draining at once.
  • Per-loop keepalive client cache: All three processors (SGPAsyncTracingProcessor, AgentexAsyncTracingProcessor, TracingModule) replaced max_keepalive_connections=0 with a WeakKeyDictionary[asyncio.AbstractEventLoop, client] cache keyed on the live loop object. Cross-loop safety is preserved because each loop gets its own client, and GC'd loops (and their stale clients) are evicted automatically — closing the id()-recycling race that the previous review flagged.
  • New tests cover linger coalescing, linger_ms=0 back-compat, batch-size cap enforcement, per-loop caching, keepalive regression, disabled-processor short-circuit, and WeakKeyDictionary eviction.

Confidence Score: 5/5

Safe to merge — changes are purely additive performance improvements with no behavioral regression in the happy path and well-tested edge cases.

The linger window is bounded, shutdown interaction is correct (task_done always fires in finally, _queue.join() waits for in-flight batches), and the WeakKeyDictionary cache addresses the id()-recycling issue raised in the previous review. All three processors now use consistent patterns with comprehensive regression tests.

No files require special attention.

Important Files Changed

Filename Overview
src/agentex/lib/core/tracing/span_queue.py Adds linger window via asyncio.wait_for with a loop-time deadline; shutdown interaction is safe (linger bounded by linger_ms, task_done always called in finally block, _queue.join() accounts for in-flight items)
src/agentex/lib/core/tracing/processors/sgp_tracing_processor.py Replaces single eagerly-built client with WeakKeyDictionary per-loop cache; _get_client() correctly short-circuits on disabled, handles no-running-loop fallback, and preserves existing warning log on on_spans_start when disabled
src/agentex/lib/core/tracing/processors/agentex_tracing_processor.py Mirrors SGP pattern with WeakKeyDictionary; client property is lazily built per loop — construction no longer touches httpx at import time, eliminating the no-running-loop risk at module load
src/agentex/lib/adk/_modules/tracing.py One-liner: max_keepalive_connections 0 to 20; the existing _bound_loop_id check already rebuilds the client on loop change, so cross-loop safety is unaffected
tests/lib/core/tracing/test_span_queue.py Three new linger test cases covering coalescing, zero-linger back-compat, and batch-size cap; timing-based but margins are generous (100ms window, 20ms gaps) and consistent with the existing queue test style
tests/lib/core/tracing/processors/test_sgp_tracing_processor.py Existing tests updated to stub _get_client() instead of wiring sgp_async_client directly; four new tests cover caching, keepalive regression, WeakKeyDictionary eviction, and disabled-processor path
tests/lib/core/tracing/processors/test_agentex_tracing_processor.py New test file mirroring SGP processor tests; importorskip guard for pydantic_ai is a good defensive touch for dev envs without the full install

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["Queue.get() — block for first item"] --> B["batch = [first]"]
    B --> C{linger_ms > 0 AND not _stopping?}
    C -- Yes --> D["deadline = loop.time() + linger_ms/1000"]
    D --> E{len batch < batch_size?}
    E -- Yes --> F["remaining = deadline minus now"]
    F --> G{remaining > 0?}
    G -- No --> P
    G -- Yes --> H["await wait_for(queue.get(), timeout=remaining)"]
    H -- item --> I["append to batch"] --> E
    H -- TimeoutError --> P
    C -- No --> J["get_nowait() until empty or batch_size reached"]
    J --> P
    P["Process: START items to on_spans_start, END items to on_spans_end"] --> Q["finally: task_done x len batch"]
    Q --> A
    subgraph Per-loop client lookup
        R["get_running_loop()"] -- RuntimeError --> S["build one-off client no cache"]
        R -- loop --> T{loop in WeakKeyDict?}
        T -- Yes --> U["return cached client"]
        T -- No --> V["_build_client() + store in WeakKeyDict"] --> U
    end
Loading

Reviews (3): Last reviewed commit: "test(tracing): cover Agentex per-loop ca..." | Re-trigger Greptile

stainless-app Bot and others added 4 commits May 18, 2026 22:40
Two compounding causes of slow SGP trace export under load:

- The async drain loop returned size-1 batches almost every time
  because there was no time window for spans to accumulate.  Adds a
  100ms linger (tunable via AGENTEX_SPAN_QUEUE_LINGER_MS) so
  concurrently-emitted spans coalesce into one upsert_batch call.

- httpx keepalive was disabled (max_keepalive_connections=0) in
  SGPAsyncTracingProcessor, AgentexAsyncTracingProcessor, and the ADK
  TracingModule to avoid "bound to a different event loop" errors in
  sync-ACP.  Each span paid a full TLS handshake.  Replaced with a
  per-event-loop client cache keyed on id(asyncio.get_running_loop());
  connections are reused within a loop and cross-loop safety is
  preserved.

Tests cover linger coalescing, batch-size cap interaction, per-loop
client caching, a keepalive-enabled regression guard, and
disabled-processor null-client behavior.
Comment thread src/agentex/lib/core/tracing/processors/sgp_tracing_processor.py Outdated
Addresses Greptile review feedback on PR #362.  The original
`dict[int, AsyncSGPClient]` cache used `id(asyncio.get_running_loop())`
as the key.  In CPython `id()` returns a memory address, and once a
loop is garbage-collected its address can be assigned to a new loop —
a fresh loop hashing to a stale entry would receive a client whose
httpx.AsyncClient was bound to the dead loop, reintroducing the
"bound to a different event loop" error this PR was built to prevent.

Switching the cache to `weakref.WeakKeyDictionary` keyed on the loop
object itself fixes the bug: the entry is evicted automatically when
the loop is collected, so id() recycling can't cause stale-client
reuse.  Multi-loop caching benefit is preserved (better than the
single-slot pattern in TracingModule for agents that bounce between
loops).

Same fix applied to AgentexAsyncTracingProcessor.  Added a regression
test verifying the cache evicts a closed/dropped loop's entry after
gc.collect().
Addresses both Greptile P3 findings on PR #362:

- AgentexAsyncTracingProcessor implemented the same per-loop client
  cache pattern as SGPAsyncTracingProcessor but had no dedicated test
  file.  Added test_agentex_tracing_processor.py mirroring the SGP
  coverage: single-build-per-loop, keepalive-enabled regression guard,
  and WeakKeyDictionary eviction after GC.  Skipped cleanly with
  pytest.importorskip when pydantic_ai isn't installed (the SDK dev
  venv state), since agentex_tracing_processor pulls in agentex.lib.adk
  which requires it.

- test_linger_respects_batch_size_cap used linger_ms=500, forcing the
  tail singleton batch to wait out the full 500ms timeout — the test
  only asserts no batch exceeds the cap, so dropping to linger_ms=50
  keeps correctness while cutting wall time by ~10x.
@smoreinis
Copy link
Copy Markdown
Contributor Author

Both Greptile P3 follow-ups addressed in 9bb4ae6:

  • AgentexAsyncTracingProcessor per-loop cache now has dedicated coverage in tests/lib/core/tracing/processors/test_agentex_tracing_processor.py, mirroring the SGP side: single-build-per-loop, max_keepalive_connections > 0 regression guard, and WeakKeyDictionary eviction after GC. The file uses pytest.importorskip("pydantic_ai") so it skips cleanly in the SDK dev venv (where pydantic_ai isn't installed) and runs in venvs that have the fuller dep set.
  • test_linger_respects_batch_size_cap dropped from linger_ms=500 to linger_ms=50, cutting the tail-singleton wait by ~10× without changing the assertion.

Verified locally: 3 new Agentex tests pass when pydantic_ai is available; cap test still passes; full rye run pytest tests/lib/core/tracing/ still green (39 passed, 3 skipped from the importorskip).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants