
perf(tracing): span queue linger + per-loop httpx keepalive#362

Merged
smoreinis merged 8 commits into next from stas/tracing-perf-linger-keepalive on May 29, 2026

@smoreinis smoreinis commented May 20, 2026

Summary

Two compounding causes of slow SGP trace export under load test, fixed together:

  • Span queue linger — the async drain loop returned size-1 batches almost every time because there was no time window for spans to accumulate. AsyncSpanQueue now lingers up to 100ms (env-tunable via AGENTEX_SPAN_QUEUE_LINGER_MS) so concurrently-emitted spans coalesce into one upsert_batch call. Stops early when the batch fills or on shutdown.
  • Per-event-loop httpx keepalive: SGPAsyncTracingProcessor, AgentexAsyncTracingProcessor, and the ADK TracingModule disabled keepalive (max_keepalive_connections=0) to avoid "bound to a different event loop" errors in sync-ACP, paying a full TLS handshake per span. Replaced with a per-loop client cache keyed on id(asyncio.get_running_loop()): max_keepalive_connections=20 within each loop, and cross-loop safety is preserved by giving each loop its own client.

The combination is what matters: keepalive alone is wasted if requests stay size-1, and batching alone is wasted if every request pays a TLS handshake.
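
The linger behavior described above can be sketched roughly as follows. This is a minimal illustration, not the actual AsyncSpanQueue code: `collect_batch` and `demo` are hypothetical names, and the real drain loop also handles shutdown and `task_done()` bookkeeping, which are omitted here.

```python
import asyncio
import time


async def collect_batch(queue: asyncio.Queue, batch_size: int, linger_ms: int) -> list:
    """Hypothetical helper: block for the first item, then linger for more."""
    batch = [await queue.get()]
    deadline = time.monotonic() + linger_ms / 1000
    while len(batch) < batch_size:  # stop early once the batch is full
        remaining = deadline - time.monotonic()
        try:
            if remaining > 0:
                # Window still open: wait for the next span, but no longer
                # than the time left in the window.
                item = await asyncio.wait_for(queue.get(), timeout=remaining)
            else:
                # Window closed (or linger_ms=0): take only what is already queued.
                item = queue.get_nowait()
        except (asyncio.TimeoutError, asyncio.QueueEmpty):
            break
        batch.append(item)
    return batch


async def demo(batch_size: int = 50, linger_ms: int = 200) -> list:
    queue: asyncio.Queue = asyncio.Queue()

    async def producer() -> None:
        for i in range(3):
            queue.put_nowait(f"span_{i}")
            await asyncio.sleep(0.02)  # staggered emits, well inside the window

    task = asyncio.create_task(producer())
    batch = await collect_batch(queue, batch_size, linger_ms)
    await task
    return batch


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

With a 200ms window the three staggered spans end up in one batch; with `batch_size=2` the call returns as soon as the second span arrives instead of waiting out the window.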

Test plan

  • rye run lint — ruff + pyright clean
  • rye run pytest tests/lib/core/tracing/ — 38 passed, 2 skipped (pre-existing load-test gates)
  • New tests cover:
    • Linger coalesces staggered enqueues into one batch
    • linger_ms=0 preserves the old immediate-drain behavior
    • Linger respects batch_size cap
    • _get_client caches per event loop and only builds the client once
    • max_keepalive_connections > 0 regression guard
    • Disabled processor (empty sgp_api_key/sgp_account_id) returns None client
    • WeakKeyDictionary evicts the entry after a closed loop is GC'd (regression guard against the id()-recycling bug)
  • Local end-to-end run against examples/tutorials/00_sync/030_langgraph with real SGP credentials (see below)
  • Re-run the original load test that surfaced the slow exports and confirm improved export latency / throughput

End-to-end verification

Validated the fix against examples/tutorials/00_sync/030_langgraph (sync-ACP + LangGraph + SGP — the exact pattern that motivated the per-loop client cache). Editable-installed this branch's SDK into the tutorial's uv venv, ran agentex agents run --manifest manifest.yaml against the local AgentEx stack, sent three messages, then queried SGP back via the API.

  • Agent boot: clean — add_tracing_processor_config at module load, processor registers, _get_client() returns a real client (not None).
  • Span hierarchy preserved: AGENT_WORKFLOW root with COMPLETION/CUSTOM children parented correctly.
  • Zero "bound to a different event loop" errors in the agent log throughout.
  • All three traces fully delivered to SGP, including the tool-calling trace from the LangGraph callback handler:
| trace_id (prefix) | spans landed in SGP |
| --- | --- |
| 8916dd2f-… | AGENT_WORKFLOW:message → COMPLETION:llm:gpt-5 |
| 2c534aa7-… | AGENT_WORKFLOW:message → COMPLETION:llm:gpt-5 |
| 8e5346a6-… | AGENT_WORKFLOW:message → 2× COMPLETION:llm:gpt-5 + CUSTOM:tool:get_weather |

Greptile Summary

This PR addresses two compounding causes of slow SGP trace export: a missing linger window in AsyncSpanQueue that caused size-1 batches on every drain cycle, and disabled httpx keepalive in SGPAsyncTracingProcessor and AgentexAsyncTracingProcessor, which cost a full TLS handshake per span.

  • Span queue linger: AsyncSpanQueue now waits up to linger_ms (default 100ms, env-tunable) for spans to coalesce before dispatching, with early exit on batch-full or shutdown. The PR also adds bounded-queue support with drop counting and per-processor retry logic for transient HTTP errors (429/5xx).
  • Per-loop client cache: Both Async*TracingProcessor classes replace the old max_keepalive_connections=0 workaround with a weakref.WeakKeyDictionary[asyncio.AbstractEventLoop, Client] cache, giving each event loop its own connection pool. TracingModule in tracing.py also enables keepalive but retains the original single-slot id(loop)-based guard instead of being updated to WeakKeyDictionary.
  • Tests: 38 passing tests cover linger coalescing, batch-size cap, drop observability, retry exhaustion, WeakKeyDictionary eviction regression guards, and keepalive regression guards for both processors.

Confidence Score: 4/5

Safe to merge for the two processors; TracingModule retains a loop-detection gap that could reintroduce the error the PR fixes.

The Async*TracingProcessor changes are well-implemented with correct WeakKeyDictionary eviction, sound task_done() bookkeeping in the linger path, and comprehensive test coverage. The one gap is TracingModule._tracing_service: it now enables keepalive (max_keepalive_connections=20) but still tracks the current loop via id() stored in _bound_loop_id. When CPython recycles the memory address of a GC'd loop, the stale httpx.AsyncClient (with live keepalive connections) is returned for the new loop — the same condition the PR was built to prevent. This affects the ADK span/start_span/end_span path in sync-ACP and streaming contexts.

src/agentex/lib/adk/_modules/tracing.py — TracingModule._tracing_service loop-detection logic

Important Files Changed

| Filename | Overview |
| --- | --- |
| src/agentex/lib/core/tracing/span_queue.py | Adds linger window (100ms env-tunable), bounded queue with drop counting, per-processor retry up to max_retries. Logic is sound: task_done() bookkeeping is correct, retry attempts are properly bounded, and sparse drop logging is well designed. |
| src/agentex/lib/adk/_modules/tracing.py | Enables keepalive (max_keepalive_connections=20) but retains id()-based event loop detection; id() recycling after GC can return a stale client to the new loop, reintroducing the exact "bound to a different event loop" error both processors were fixed to prevent. |
| src/agentex/lib/core/tracing/processors/sgp_tracing_processor.py | Correctly migrated to WeakKeyDictionary per-loop client cache with keepalive enabled; disabled check moved to _get_client(); span building now skipped when disabled (minor optimization). |
| src/agentex/lib/core/tracing/processors/agentex_tracing_processor.py | Migrated to WeakKeyDictionary per-loop client cache with keepalive; lazy client construction via property is clean and consistent with the SGP processor. |
| tests/lib/core/tracing/test_span_queue.py | New tests cover linger coalescing, linger=0 back-compat, batch-size cap, drop observability, and retry semantics (retryable vs non-retryable statuses, exhaustion). Coverage is thorough. |
| tests/lib/core/tracing/processors/test_sgp_tracing_processor.py | Updated fixtures to use _get_client stub instead of direct attribute assignment; new tests verify per-loop caching, keepalive regression guard, WeakKeyDictionary eviction, and disabled-processor None return. |
| tests/lib/core/tracing/processors/test_agentex_tracing_processor.py | New test file mirrors the SGP processor tests for the Agentex async processor; covers lazy construction, per-loop caching, keepalive regression guard, and WeakKeyDictionary eviction. |

Sequence Diagram

sequenceDiagram
    participant Agent
    participant AsyncSpanQueue
    participant DrainLoop
    participant Processor as SGP/Agentex Processor
    participant ClientCache as WeakKeyDictionary[Loop→Client]
    participant SGP as SGP API

    Agent->>AsyncSpanQueue: enqueue(span_start)
    AsyncSpanQueue->>DrainLoop: create_task(_drain_loop)
    Note over DrainLoop: await queue.get() → first item

    rect rgb(200, 230, 255)
        Note over DrainLoop: Linger window (default 100ms)
        DrainLoop->>AsyncSpanQueue: "wait_for(queue.get(), timeout=remaining)"
        AsyncSpanQueue-->>DrainLoop: span_1 (20ms later)
        DrainLoop->>AsyncSpanQueue: "wait_for(queue.get(), timeout=remaining)"
        AsyncSpanQueue-->>DrainLoop: span_2 (40ms later)
        DrainLoop->>AsyncSpanQueue: "wait_for(queue.get(), timeout=remaining)"
        Note over DrainLoop: TimeoutError → linger ends
    end

    DrainLoop->>Processor: on_spans_start([span_0, span_1, span_2])
    Processor->>ClientCache: get(running_loop)
    ClientCache-->>Processor: cached AsyncClient (keepalive)
    Processor->>SGP: upsert_batch([3 spans])
    SGP-->>Processor: 200 OK
    DrainLoop->>AsyncSpanQueue: task_done() × 3

    Note over DrainLoop: On transient error (429/5xx)
    DrainLoop->>AsyncSpanQueue: put_nowait(item, attempts+1)

Comments Outside Diff (1)

  1. src/agentex/lib/adk/_modules/tracing.py, line 57-83 (link)

    P1 id() recycling risk survives keepalive change

    TracingModule._tracing_service compares id(loop) against _bound_loop_id. If a loop is GC'd and CPython reuses its memory address for a new loop, id(new_loop) == _bound_loop_id so the guard doesn't fire and the stale _tracing_service_lazy (whose httpx.AsyncClient is bound to the dead loop) is returned — exactly the "bound to a different event loop" RuntimeError the PR aims to prevent.

    Both SGPAsyncTracingProcessor and AgentexAsyncTracingProcessor were correctly migrated to weakref.WeakKeyDictionary for the same reason (see the thread on this PR), but TracingModule was updated only to enable keepalive, leaving the id()-based guard in place. With keepalive now on, the live connection pool makes the stale-client scenario more likely to produce an error rather than just incurring a fresh TLS handshake.


Reviews (5): Last reviewed commit: "Merge remote-tracking branch 'origin/nex..."

stainless-app Bot and others added 4 commits May 18, 2026 22:40
Two compounding causes of slow SGP trace export under load:

- The async drain loop returned size-1 batches almost every time
  because there was no time window for spans to accumulate.  Adds a
  100ms linger (tunable via AGENTEX_SPAN_QUEUE_LINGER_MS) so
  concurrently-emitted spans coalesce into one upsert_batch call.

- httpx keepalive was disabled (max_keepalive_connections=0) in
  SGPAsyncTracingProcessor, AgentexAsyncTracingProcessor, and the ADK
  TracingModule to avoid "bound to a different event loop" errors in
  sync-ACP.  Each span paid a full TLS handshake.  Replaced with a
  per-event-loop client cache keyed on id(asyncio.get_running_loop());
  connections are reused within a loop and cross-loop safety is
  preserved.

Tests cover linger coalescing, batch-size cap interaction, per-loop
client caching, a keepalive-enabled regression guard, and
disabled-processor null-client behavior.
Comment thread src/agentex/lib/core/tracing/processors/sgp_tracing_processor.py Outdated
Addresses Greptile review feedback on PR #362.  The original
`dict[int, AsyncSGPClient]` cache used `id(asyncio.get_running_loop())`
as the key.  In CPython `id()` returns a memory address, and once a
loop is garbage-collected its address can be assigned to a new loop —
a fresh loop hashing to a stale entry would receive a client whose
httpx.AsyncClient was bound to the dead loop, reintroducing the
"bound to a different event loop" error this PR was built to prevent.

Switching the cache to `weakref.WeakKeyDictionary` keyed on the loop
object itself fixes the bug: the entry is evicted automatically when
the loop is collected, so id() recycling can't cause stale-client
reuse.  Multi-loop caching benefit is preserved (better than the
single-slot pattern in TracingModule for agents that bounce between
loops).

Same fix applied to AgentexAsyncTracingProcessor.  Added a regression
test verifying the cache evicts a closed/dropped loop's entry after
gc.collect().
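
A rough sketch of the per-loop cache pattern this commit describes. It is illustrative only: `FakeClient` and `PerLoopClients` are hypothetical stand-ins, not the processors' real classes, and no httpx client is actually constructed.

```python
import asyncio
import weakref


class FakeClient:
    """Hypothetical stand-in for a client that owns an httpx connection pool."""

    def __init__(self, max_keepalive_connections: int) -> None:
        self.max_keepalive_connections = max_keepalive_connections


class PerLoopClients:
    """Hypothetical cache: one client per event loop, keyed on the loop object."""

    def __init__(self) -> None:
        # Weak keys: an entry disappears once its loop is garbage-collected,
        # so a recycled id() can never resolve to a dead loop's client.
        self._clients = weakref.WeakKeyDictionary()
        self.builds = 0

    def get_client(self) -> FakeClient:
        loop = asyncio.get_running_loop()
        client = self._clients.get(loop)
        if client is None:
            client = FakeClient(max_keepalive_connections=20)
            self._clients[loop] = client
            self.builds += 1
        return client


cache = PerLoopClients()


async def same_client_within_loop() -> bool:
    # Two lookups inside one running loop should return the same object.
    return cache.get_client() is cache.get_client()


reused_first = asyncio.run(same_client_within_loop())   # first loop: one build
reused_second = asyncio.run(same_client_within_loop())  # new loop: second build
```

Each `asyncio.run` call creates a fresh loop, so the cache builds a second client rather than handing the new loop one bound to the old loop.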
Addresses both Greptile P3 findings on PR #362:

- AgentexAsyncTracingProcessor implemented the same per-loop client
  cache pattern as SGPAsyncTracingProcessor but had no dedicated test
  file.  Added test_agentex_tracing_processor.py mirroring the SGP
  coverage: single-build-per-loop, keepalive-enabled regression guard,
  and WeakKeyDictionary eviction after GC.  Skipped cleanly with
  pytest.importorskip when pydantic_ai isn't installed (the SDK dev
  venv state), since agentex_tracing_processor pulls in agentex.lib.adk
  which requires it.

- test_linger_respects_batch_size_cap used linger_ms=500, forcing the
  tail singleton batch to wait out the full 500ms timeout — the test
  only asserts no batch exceeds the cap, so dropping to linger_ms=50
  keeps correctness while cutting wall time by ~10x.
@smoreinis
Contributor Author

Both Greptile P3 follow-ups addressed in 9bb4ae6:

  • AgentexAsyncTracingProcessor per-loop cache now has dedicated coverage in tests/lib/core/tracing/processors/test_agentex_tracing_processor.py, mirroring the SGP side: single-build-per-loop, max_keepalive_connections > 0 regression guard, and WeakKeyDictionary eviction after GC. The file uses pytest.importorskip("pydantic_ai") so it skips cleanly in the SDK dev venv (where pydantic_ai isn't installed) and runs in venvs that have the fuller dep set.
  • test_linger_respects_batch_size_cap dropped from linger_ms=500 to linger_ms=50, cutting the tail-singleton wait by ~10× without changing the assertion.

Verified locally: 3 new Agentex tests pass when pydantic_ai is available; cap test still passes; full rye run pytest tests/lib/core/tracing/ still green (39 passed, 3 skipped from the importorskip).

@james-cardenas

james-cardenas commented May 28, 2026

SDK Validation Notes:

The Good News: PR #362 completely fixed our ingest queueing latency. At both 1x and 2x traffic, our p99 dropped from ~2.4s down to ~0.5s.

The Problem: When we push to 2x load (~42 RPS), the export pipeline inside the Mock Agent collapses and starts silently dropping data (though dropped spans also occurred in the 1x load run).

Here are the metrics detailing the breakdown of export.

1. Client export did not scale with RPS

At 2x scale, the agent only sent ~17% of the PUTs it was sending at 1x. Egress completely missed the floor target.

| Run | Partner RPS | Client PUT/s | Egress | Linear expectation from 1× steady |
| --- | --- | --- | --- | --- |
| M4 1× steady | 21.1 | 57.2 | ~522 KB/s | baseline |
| M5 2× steady | 41.7 | 9.6 | ~86 KB/s | ~114 PUT/s, ~1,044 KB/s |

2. Server batch ~169/s vs mock client ~10/s — platform is not idle

This doesn't seem to be an EGP bottleneck. The backend API runs steadily at ~169 PUT/s. The mock agent seems to be the issue.

| Run | Client PUT/s (p62_a) | Server PUT/s (p62_b) | Client ÷ server |
| --- | --- | --- | --- |
| M4 1× steady | 57 | 86 | 66% |
| M5 2× steady | 9.6 | 169 | 6% |

3. ~9.1 KB/PUT unchanged — fewer PUTs, not smaller bodies

The batches aren't getting smaller; they stay flat at ~9.1 KB/PUT. The agent is just sending far fewer of them.

| Run | PUT/turn | KB/turn |
| --- | --- | --- |
| M4 1× steady | 57.2 ÷ 21.1 ≈ 2.7 | ~24 KB |
| M5 2× steady | 9.6 ÷ 41.7 ≈ 0.23 | ~2.1 KB |
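
For anyone re-deriving the ratios in sections 1 and 3, here is a small sketch using only the figures reported above. It assumes the "linear expectation" is simply twice the 1× PUT rate, which matches the ~114 PUT/s in the first table.

```python
# Figures copied from the tables above.
m4 = {"rps": 21.1, "put_per_s": 57.2}  # M4, 1x steady
m5 = {"rps": 41.7, "put_per_s": 9.6}   # M5, 2x steady

# Fraction of the 1x PUT rate that the client achieved at 2x.
share_of_1x = m5["put_per_s"] / m4["put_per_s"]

# Assumption: "linear" means the PUT rate doubles when the load doubles.
linear_expectation = 2 * m4["put_per_s"]

# PUTs issued per partner request (turn) in each run.
put_per_turn_1x = m4["put_per_s"] / m4["rps"]
put_per_turn_2x = m5["put_per_s"] / m5["rps"]

print(f"2x run sent {share_of_1x:.0%} of the 1x PUT rate")
print(f"linear expectation: ~{linear_expectation:.0f} PUT/s")
print(f"PUT/turn: {put_per_turn_1x:.1f} at 1x, {put_per_turn_2x:.2f} at 2x")
```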

4. Memory is not the 2× blocker; EGP saturation matches history

This is not an OOM issue. Mock agent memory rose from ~32% of its limit at 1× to ~52% at 2×, well short of exhaustion.

| Signal | M4 1× steady | M5 2× steady |
| --- | --- | --- |
| Mock mem % limit | ~32% | ~52% |
| EGP mem % limit | ~54% | ~51% |
| Batch p99 | ~0.49 s | ~0.50 s |

Spans Getting Dropped (Logs):
I'm seeing dropped spans in both the 1x & the 2x runs:

Date,Host,Service,Content
"2026-05-27T23:09:40.976Z","ip-10-53-115-35.us-west-2.compute.internal-scale-egp-69a861dc369440bb8ebbc919","rocket-mock-agent-agentex-agent","Tracing processor SGPAsyncTracingProcessor failed handling 24 spans during end"
"2026-05-27T23:09:40.907Z","ip-10-53-115-35.us-west-2.compute.internal-scale-egp-69a861dc369440bb8ebbc919","rocket-mock-agent-agentex-agent","Tracing processor SGPAsyncTracingProcessor failed handling 25 spans during start"
"2026-05-27T23:09:40.880Z","ip-10-53-70-177.us-west-2.compute.internal-scale-egp-69a861dc369440bb8ebbc919","rocket-mock-agent-agentex-agent","Tracing processor SGPAsyncTracingProcessor failed handling 25 spans during start"
"2026-05-27T23:09:40.742Z","ip-10-53-115-35.us-west-2.compute.internal-scale-egp-69a861dc369440bb8ebbc919","rocket-mock-agent-agentex-agent","Tracing processor SGPAsyncTracingProcessor failed handling 25 spans during end"
"2026-05-27T23:06:44.205Z","ip-10-53-115-35.us-west-2.compute.internal-scale-egp-69a861dc369440bb8ebbc919","rocket-mock-agent-agentex-agent","Tracing processor SGPAsyncTracingProcessor failed handling 25 spans during end"
"2026-05-27T23:06:17.705Z","ip-10-53-115-35.us-west-2.compute.internal-scale-egp-69a861dc369440bb8ebbc919","rocket-mock-agent-agentex-agent","Tracing processor SGPAsyncTracingProcessor failed handling 26 spans during end"
"2026-05-27T22:51:41.362Z","ip-10-53-70-177.us-west-2.compute.internal-scale-egp-69a861dc369440bb8ebbc919","rocket-mock-agent-agentex-agent","Tracing processor SGPAsyncTracingProcessor failed handling 23 spans during end"
"2026-05-27T22:51:41.276Z","ip-10-53-66-228.us-west-2.compute.internal-scale-egp-69a861dc369440bb8ebbc919","rocket-mock-agent-agentex-agent","Tracing processor SGPAsyncTracingProcessor failed handling 25 spans during end"
"2026-05-27T22:51:41.017Z","ip-10-53-70-177.us-west-2.compute.internal-scale-egp-69a861dc369440bb8ebbc919","rocket-mock-agent-agentex-agent","Tracing processor SGPAsyncTracingProcessor failed handling 24 spans during start"
"2026-05-27T22:49:33.299Z","ip-10-53-70-177.us-west-2.compute.internal-scale-egp-69a861dc369440bb8ebbc919","rocket-mock-agent-agentex-agent","Tracing processor SGPAsyncTracingProcessor failed handling 25 spans during start"
"2026-05-27T22:49:33.299Z","ip-10-53-70-177.us-west-2.compute.internal-scale-egp-69a861dc369440bb8ebbc919","rocket-mock-agent-agentex-agent","Tracing processor SGPAsyncTracingProcessor failed handling 26 spans during end"
"2026-05-27T22:49:32.868Z","ip-10-53-155-37.us-west-2.compute.internal-scale-egp-69a861dc369440bb8ebbc919","rocket-mock-agent-agentex-agent","Tracing processor SGPAsyncTracingProcessor failed handling 24 spans during start"
"2026-05-27T19:53:53.290Z","ip-10-53-159-196.us-west-2.compute.internal-scale-egp-69a861dc369440bb8ebbc919","rocket-mock-agent-agentex-agent","Tracing processor SGPAsyncTracingProcessor failed handling 25 spans during start"

I think the agent is silently dropping ~25-span batches instead of retrying them — not sure if this is by design or not. Needless to say, under heavy load, the agent drain task is not keeping up and is causing a backlog.

NOTE:

  • I have not experimented with setting AGENTEX_SPAN_QUEUE_LINGER_MS to anything other than the default.

@james-cardenas

@smoreinis Quick update: I just finished auditing the DB across all traces from the 1x & 2x runs, and the data is all there. The median trace persisted the full expected number of spans. So the ~86 KB/s egress isn't due to missing rows or silent data drops. I think the issue lies purely with the queue drain mechanics. No need to dig deeper into the Spans Getting Dropped issue; this is a separate, smaller issue.

smoreinis added 2 commits May 28, 2026 16:42
Under load, span export failures were silently dropped and unbounded
queue growth was invisible. Add observability and a narrow retry:

- Bound the queue (AGENTEX_SPAN_QUEUE_MAX_SIZE, 0=unbounded default) and
  expose a dropped_spans counter + depth so span loss is measurable
  instead of silent.
- Re-enqueue only transient HTTP failures {429,500,502,503,504} up to
  AGENTEX_SPAN_QUEUE_MAX_RETRIES (default 1 = no retry). Auth/4xx (e.g.
  the 401s seen in the load test) and non-HTTP exceptions are dropped and
  counted, never retried, preserving the drain-continues-on-error contract.

Defaults preserve prior behavior (unbounded, no retry).
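
The retry rule in this commit can be sketched as below. This is an illustration under assumptions, not the span_queue.py implementation: `ExportError`, `QueueItem`, `should_requeue`, and `drain` are hypothetical names, and `max_retries` is interpreted as a total attempt budget so that 1 means "no retry", as the commit message states.

```python
from collections import deque
from dataclasses import dataclass

# Transient statuses named in the commit message.
RETRYABLE_STATUSES = frozenset({429, 500, 502, 503, 504})


class ExportError(Exception):
    """Hypothetical export failure carrying an HTTP status code."""

    def __init__(self, status_code: int) -> None:
        super().__init__(f"export failed with HTTP {status_code}")
        self.status_code = status_code


@dataclass
class QueueItem:
    span: str
    attempts: int = 0  # failed attempts so far


def should_requeue(exc: Exception, attempts: int, max_retries: int) -> bool:
    # Non-HTTP exceptions have no status_code and are never retried.
    status = getattr(exc, "status_code", None)
    return status in RETRYABLE_STATUSES and attempts + 1 < max_retries


def drain(spans, send, max_retries: int = 1):
    """Send every span; return (delivered, dropped_count)."""
    pending = deque(QueueItem(s) for s in spans)
    delivered, dropped = [], 0
    while pending:
        item = pending.popleft()
        try:
            send(item.span)
            delivered.append(item.span)
        except Exception as exc:
            if should_requeue(exc, item.attempts, max_retries):
                pending.append(QueueItem(item.span, item.attempts + 1))
            else:
                dropped += 1  # counted, never retried; draining continues
    return delivered, dropped


def make_send():
    """Hypothetical sender: 'auth' always gets a 401, 'flaky' gets one 503."""
    failed_once = set()

    def send(span: str) -> None:
        if span == "auth":
            raise ExportError(401)
        if span == "flaky" and span not in failed_once:
            failed_once.add(span)
            raise ExportError(503)

    return send
```

With `max_retries=2` the 503 span is delivered on its second attempt while the 401 span is dropped and counted; with the default of 1 both failures are dropped.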
…ger-keepalive

# Conflicts:
#	.stats.yml
#	pyproject.toml
#	src/agentex/lib/cli/templates/default-pydantic-ai/project/acp.py.j2
#	src/agentex/lib/cli/templates/sync-pydantic-ai/project/acp.py.j2
#	src/agentex/lib/cli/templates/temporal-pydantic-ai/project/workflow.py.j2
@smoreinis smoreinis merged commit feec842 into next May 29, 2026
41 checks passed
@smoreinis smoreinis deleted the stas/tracing-perf-linger-keepalive branch May 29, 2026 01:03
@stainless-app stainless-app Bot mentioned this pull request May 29, 2026