fix(tests): repair test_streaming_model so all 28 tests run and pass (#334)

Merged

Conversation
Four pre-existing bugs left this entire test file unrunnable on main (4 failures + 24 errors); fixing them here so the suite actually exercises TemporalStreamingModel and protects against regressions.

Bug 1 (24 errors): `conftest.py` defines fixture `mock_adk_streaming` (no underscore) but every test in TestStreamingModelSettings and TestStreamingModelTools requested it as `_mock_adk_streaming`, so pytest failed to resolve the fixture before the body ever ran. The fixture is `autouse=True` and the param value was never used in any test body, so the parameter was vestigial — replaced with `_streaming_context_vars`, which provides the ContextVar setup these tests now actually need.

Bug 2 (4 failures): `TemporalStreamingModel.get_response()` reads `task_id`, `trace_id`, and `parent_span_id` from ContextVars populated by `ContextInterceptor` from request headers in real Temporal flows. Tests had been passing `task_id=...` as a kwarg, which is silently swallowed by `**kwargs` and ignored, so all three ContextVars stayed at their defaults and the validation at the top of `get_response` raised before any work happened. The new `_streaming_context_vars` fixture in conftest sets all three vars (and resets them on teardown), simulating what `ContextInterceptor` does in production.

Bug 3 (test_computer_tool): A recent commit narrowed `ComputerTool` serialization to require an actual `Computer`/`AsyncComputer` instance, but `sample_computer_tool` still built a bare `MagicMock`. Switched to `MagicMock(spec=Computer)` so the production isinstance check passes.

Bug 4 (3 streaming-context tests): The 3 tests in TestStreamingModelBasics that assert on `streaming_task_message_context` calls built event sequences with raw `MagicMock(type="...")`. Production dispatches via `isinstance(event, ResponseOutputItemAddedEvent)` etc., which `MagicMock` without `spec` never satisfies, so dispatch was silently skipped and the assertions failed. Switched to `MagicMock(spec=...)` for each event type — passes isinstance without triggering pydantic validation on the event's required fields.

Also fixed `test_task_id_threading`, which had been asserting against a hardcoded `task_id="test_task_12345"` that was never actually threaded anywhere (the kwarg was ignored, just like in Bug 2); it now asserts against the value yielded by the fixture, which is the value production reads from the ContextVar.

After all four fixes: 28/28 pass, ruff clean, pyright clean.
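As a footnote to Bug 2: a minimal sketch of the kwarg-swallowing, with simplified names (`get_response` and the ContextVar here are stand-ins, not the SDK's actual code):

```python
from contextvars import ContextVar

streaming_task_id: ContextVar[str | None] = ContextVar("streaming_task_id", default=None)

def get_response(system_instructions=None, **kwargs):
    # A task_id=... keyword lands in **kwargs and is never read;
    # production reads the ContextVar that ContextInterceptor populates.
    task_id = streaming_task_id.get()
    if task_id is None:
        raise ValueError("task_id ContextVar not set")  # the validation the old tests tripped
    return f"responding for {task_id}"

# Old test style: the kwarg is silently ignored, the ContextVar stays default, the call raises.
# get_response(task_id="test_task_12345")

# Simulating ContextInterceptor (what the new fixture does) makes the same call succeed:
token = streaming_task_id.set("test_task_12345")
try:
    print(get_response())  # responding for test_task_12345
finally:
    streaming_task_id.reset(token)
```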
declan-scale approved these changes on Apr 30, 2026
smoreinis added a commit that referenced this pull request on Apr 30, 2026
After merging the test-suite repair from main (#334) into this branch, one model test (test_responses_api_streaming) regressed because its assert_called_with strict-matched all kwargs of streaming_task_message_context and didn't tolerate the new `streaming_mode='coalesced'` kwarg this PR adds. Switched to assert_called() + targeted kwarg checks so the test verifies what it cares about (task_id threading) without locking in implementation details.

Replaced the ad-hoc smoke scripts that lived in conversation with a real pytest module at tests/lib/core/services/adk/test_streaming.py covering:

- _delta_char_len, _can_merge, _merge_pair: per-channel correctness + None-handling
- _merge_consecutive: pure-text collapse, cross-channel order preservation, per-channel reconstruction matches per-token semantics
- CoalescingBuffer: first-delta-immediate flush within ~20ms, size-threshold flush before timer fires, multi-delta coalescing within one window, idle close, add-after-close no-op
- CoalescingBuffer cancel-during-flush regression test for the P1 fix: five queued chunks must all surface across publishes when close() cancels mid-flush (asserts substring presence rather than exact ordering, since the documented trade-off allows duplicates of the in-flight item)
- StreamingTaskMessageContext mode dispatch: "off" suppresses publishes but persists full content, "per_token" publishes each delta synchronously, "coalesced" batches and persists full content
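The loosened assertion style described above, sketched with an illustrative mock (names are not the test's actual code):

```python
from unittest.mock import MagicMock

streaming_ctx = MagicMock()
streaming_ctx(task_id="task-123", initial_content=[], streaming_mode="coalesced")

# Strict form breaks whenever a new kwarg like streaming_mode appears:
#   streaming_ctx.assert_called_with(task_id="task-123", initial_content=[])  # AssertionError

# Targeted form pins only what the test cares about (task_id threading):
streaming_ctx.assert_called()
assert streaming_ctx.call_args.kwargs["task_id"] == "task-123"
```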
smoreinis added a commit that referenced this pull request on Apr 30, 2026
perf(streaming): coalesce per-token publishes to Redis (50ms / 128-char window) (#333)

* perf(streaming): coalesce per-token publishes to Redis (50ms / 128-char window)

  Per-token Redis publishes from TemporalStreamingModel were adding ~45s (56-62%) overhead to agent response latency, mostly from head-of-line blocking on the model's event loop: each `await streaming_context.stream_update(...)` inside the OpenAI stream `async for` paused token consumption until the publish round-trip completed.

  This change introduces a `CoalescingBuffer` driven by an `asyncio.Event`, so the producer never awaits on Redis. Deltas are merged consecutive-only (preserving character order in every (type, index) channel) and flushed on a 50ms timer, on a 128-char size threshold, or immediately for the first delta to keep perceived responsiveness high. The buffer's `close()` drains remaining deltas before the DONE event, so consumers see the full sequence in order.

  A new `StreamingMode = Literal["off", "per_token", "coalesced"]` lives in `streaming.py` as the single source of truth and is plumbed through the adk streaming module, `StreamingService.streaming_task_message_context`, and `StreamingTaskMessageContext`. Default is `"coalesced"` everywhere, so all 13+ existing context callers (claude_agents, langgraph, litellm provider, openai sync provider, etc.) benefit automatically.

* chore(streaming): fix import ordering (ruff I001)

* fix(streaming): address greptile review findings

  - _run: when CancelledError is raised mid-flush in the for-loop, re-enqueue the in-flight item plus any remaining items in the local `drained` list back into self._buf so close()'s final drain can recover them. Previously the local `drained` list was unreachable after CancelledError exited the for-loop, causing the last coalesced batch to be silently dropped on close-during-flush races. Trade-off: the in-flight item may be duplicated on the consumer side (Redis pub may have completed before cancel was delivered), which is preferable to silent loss for streaming UX.
  - _merge_pair: replace `return b` fallback with AssertionError. All six current TaskMessageDelta variants have explicit isinstance branches, so the fallback is unreachable today. But _can_merge returns True for any same-type pair, so adding a 7th delta variant without updating _merge_pair would silently drop `a`'s accumulated content. Asserting turns a future silent data-loss into an immediate, diagnosable crash.

* test(streaming): add coalescing-layer tests; loosen one model assertion

  After merging the test-suite repair from main (#334) into this branch, one model test (test_responses_api_streaming) regressed because its assert_called_with strict-matched all kwargs of streaming_task_message_context and didn't tolerate the new `streaming_mode='coalesced'` kwarg this PR adds. Switched to assert_called() + targeted kwarg checks so the test verifies what it cares about (task_id threading) without locking in implementation details.

  Replaced the ad-hoc smoke scripts that lived in conversation with a real pytest module at tests/lib/core/services/adk/test_streaming.py covering:

  - _delta_char_len, _can_merge, _merge_pair: per-channel correctness + None-handling
  - _merge_consecutive: pure-text collapse, cross-channel order preservation, per-channel reconstruction matches per-token semantics
  - CoalescingBuffer: first-delta-immediate flush within ~20ms, size-threshold flush before timer fires, multi-delta coalescing within one window, idle close, add-after-close no-op
  - CoalescingBuffer cancel-during-flush regression test for the P1 fix: five queued chunks must all surface across publishes when close() cancels mid-flush (asserts substring presence rather than exact ordering, since the documented trade-off allows duplicates of the in-flight item)
  - StreamingTaskMessageContext mode dispatch: "off" suppresses publishes but persists full content, "per_token" publishes each delta synchronously, "coalesced" batches and persists full content

* chore(streaming): route TemporalStreamingModel logger through make_logger

  The model file used raw `logging.getLogger("agentex.temporal.streaming")`, which returns a logger with no handler attached and no level configured — so the existing `[TemporalStreamingModel] Initialized ... streaming_mode=...` INFO log was silently dropped, making it impossible to verify at runtime that a coalesced (or any) streaming mode was actually wired.

  Switch to the SDK's `make_logger` helper (level=INFO, RichHandler in local mode, StreamHandler otherwise) used everywhere else in the SDK. The explicit logger name `agentex.temporal.streaming` is preserved so any external logging configuration targeting that name keeps working.
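To ground the mechanics this message describes, here is a minimal sketch of an `asyncio.Event`-driven coalescing buffer. The class name, callback shape, and thresholds follow the description, but this is an illustration under those assumptions, not the SDK's actual `CoalescingBuffer` (real deltas are typed objects merged per (type, index) channel, not plain strings):

```python
import asyncio

class CoalescingBufferSketch:
    """Batches string deltas so the producer never awaits a Redis publish."""

    def __init__(self, publish, window_s: float = 0.05, max_chars: int = 128):
        self._publish = publish            # async callable receiving list[str]
        self._window_s = window_s          # 50ms flush window
        self._max_chars = max_chars        # 128-char size threshold
        self._buf: list[str] = []
        self._wake = asyncio.Event()
        self._closed = False
        self._task = asyncio.create_task(self._run())  # requires a running loop

    def add(self, delta: str) -> None:
        if self._closed:
            return                         # add-after-close is a no-op
        first = not self._buf
        self._buf.append(delta)
        if first or sum(map(len, self._buf)) >= self._max_chars:
            self._wake.set()               # flush now: first delta, or size threshold hit

    async def _run(self) -> None:
        while True:
            try:
                await asyncio.wait_for(self._wake.wait(), timeout=self._window_s)
            except asyncio.TimeoutError:
                pass                       # window elapsed; flush whatever accumulated
            self._wake.clear()
            await self._flush()

    async def _flush(self) -> None:
        if not self._buf:
            return
        drained, self._buf = self._buf, []
        try:
            await self._publish(drained)   # real code merges consecutive same-channel deltas first
        except asyncio.CancelledError:
            # The review fix in miniature: re-enqueue so close()'s final drain
            # recovers the batch (the in-flight publish may duplicate).
            self._buf = drained + self._buf
            raise

    async def close(self) -> None:
        self._closed = True
        self._task.cancel()
        try:
            await self._task
        except asyncio.CancelledError:
            pass
        await self._flush()                # drain remaining deltas before the DONE event
```

A producer inside the model's `async for` would call `buffer.add(delta)` synchronously and `await buffer.close()` only once the stream ends, so token consumption never blocks on a publish round-trip.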
declan-scale added a commit that referenced this pull request on Apr 30, 2026
* feat(api): api update
* codegen metadata (repeated)
* chore(internal): more robust bootstrap script
* fix: use correct field name format for multipart file arrays
* feat: support setting headers via env
* fix: allow litellm security patch (#336)
* fix(adk): Always inject headers on execute activity (#337)
* perf(streaming): coalesce per-token publishes to Redis (50ms / 128-char window) (#333; full message above)
* codegen metadata
* feat(api): api update
* release: 0.10.3

---------
Co-authored-by: stainless-app[bot] <142633134+stainless-app[bot]@users.noreply.github.com>
Co-authored-by: Brandon Allen <brandon.allen@scale.com>
Co-authored-by: Declan Brady <declan.brady@scale.com>
Co-authored-by: Stas Moreinis <stas.moreinis@scale.com>
Summary

The entire `test_streaming_model.py` suite has been broken on `main` for some time — running it shows 4 failed + 24 errors, 0 passing. None of those tests have actually been exercising `TemporalStreamingModel` since whichever refactor introduced the drift. This PR fixes the four root causes so the suite runs end-to-end (28/28 pass) and starts catching real regressions again.

Bugs fixed
Bug 1: Fixture name typo (24 errors)

`conftest.py` defines the streaming-mock fixture as `mock_adk_streaming` (no leading underscore), but every test in `TestStreamingModelSettings` and `TestStreamingModelTools` requests it as `_mock_adk_streaming`. Pytest can't resolve the fixture, so all 24 tests error before their bodies run.

The fixture is `autouse=True` and the parameter value was never used in any test body, so the param was purely vestigial — the original author appears to have intended the leading underscore as the standard pytest convention for intentionally unused side-effect fixtures, but applied it to the consumer instead of the fixture name.

Fix: Replaced the dead `_mock_adk_streaming` parameter with the new `_streaming_context_vars` fixture (see Bug 2), which these tests genuinely need.
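A minimal reproduction of the mismatch (fixture and test names abbreviated; not the real conftest):

```python
import pytest

@pytest.fixture(autouse=True)
def mock_adk_streaming():
    # Autouse side-effect fixture: active for every test without being requested.
    yield "patched"

def test_model_settings(_mock_adk_streaming):
    # Errors at setup with "fixture '_mock_adk_streaming' not found":
    # the leading underscore makes it a different fixture name entirely.
    assert True
```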
Bug 2: Test/code drift on validation contract (3 of 4 original failures)

`TemporalStreamingModel.get_response()` reads `task_id`, `trace_id`, and `parent_span_id` from ContextVars populated by `ContextInterceptor` from request headers in real Temporal flows. Tests had been passing `task_id=...` as a kwarg, which is silently swallowed by `**kwargs` and ignored — so all three ContextVars stayed at their defaults, the validation at the top of `get_response` raised, and no test in `TestStreamingModelBasics` ever reached its assertions.

Fix: New `_streaming_context_vars` fixture in `conftest.py` populates all three ContextVars (with proper teardown via tokens), simulating what `ContextInterceptor` does in production. Tests that need the validation to pass now request this fixture.
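A sketch of what such a fixture plausibly looks like (the ContextVar objects and sample values here are stand-ins, not the PR's exact code):

```python
import pytest
from contextvars import ContextVar

# Stand-ins for the production ContextVars that ContextInterceptor populates.
streaming_task_id: ContextVar[str | None] = ContextVar("streaming_task_id", default=None)
streaming_trace_id: ContextVar[str | None] = ContextVar("streaming_trace_id", default=None)
streaming_parent_span_id: ContextVar[str | None] = ContextVar("streaming_parent_span_id", default=None)

@pytest.fixture
def _streaming_context_vars():
    # Set all three vars the way ContextInterceptor would from request headers,
    # keeping the tokens so teardown restores whatever was there before.
    tokens = [
        streaming_task_id.set("sample_task_id"),
        streaming_trace_id.set("test-trace-id"),
        streaming_parent_span_id.set("test-parent-span-id"),
    ]
    yield streaming_task_id.get()  # tests assert against this yielded value
    for var, token in zip(
        (streaming_task_id, streaming_trace_id, streaming_parent_span_id), tokens
    ):
        var.reset(token)
```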
Bug 3: `sample_computer_tool` fixture vs. recent ComputerTool narrowing (1 error)

Commit `4c269080` ("Narrow ComputerTool.computer union for Responses API serialization") added `isinstance(computer, (Computer, AsyncComputer))` validation to `_convert_tools`, but `sample_computer_tool` still constructed a bare `MagicMock()`. `test_computer_tool` started failing at conversion time.

Fix: Switched to `MagicMock(spec=Computer)` so `isinstance` passes. Same pattern other fixtures in the file already use (`HostedMCPTool`, `ImageGenerationTool`, etc.).
Bug 4: Stale event mocks (1 of 4 original failures + 2 surfaced after Bug 2)

Three tests in `TestStreamingModelBasics` that assert on `streaming_task_message_context` calls built their event sequences with raw `MagicMock(type="...")`. Production dispatches via `isinstance(event, ResponseOutputItemAddedEvent)` etc. — bare `MagicMock` never satisfies that, so dispatch was silently skipped and the assertions failed.

Fix: Switched to `MagicMock(spec=ResponseOutputItemAddedEvent)` (and the other event types) for each event in those streams. `spec=` makes `isinstance` pass without triggering pydantic validation on the event's required fields (which would also need `sequence_number`, real `Response` instances, etc.).

`test_task_id_threading` additionally asserted against a hardcoded `task_id="test_task_12345"` that never actually got threaded anywhere — the kwarg was ignored just like in Bug 2. Updated the assertion to use the value yielded by `_streaming_context_vars`, which is what production reads from the ContextVar.
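In miniature, why `spec=` changes dispatch behavior (the event class below is a stub standing in for the real pydantic model):

```python
from unittest.mock import MagicMock

class ResponseOutputItemAddedEvent:  # stub for the real OpenAI event type
    pass

bare = MagicMock(type="response.output_item.added")
specced = MagicMock(spec=ResponseOutputItemAddedEvent)

assert not isinstance(bare, ResponseOutputItemAddedEvent)  # dispatch silently skipped
assert isinstance(specced, ResponseOutputItemAddedEvent)   # dispatch fires, no pydantic validation
```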
Verification

- `pytest src/agentex/lib/core/temporal/plugins/openai_agents/tests/test_streaming_model.py`
- `ruff check` on changed files
- `pyright` on changed files
Test plan

- Run the full suite locally (28/28 pass, vs. 4 failures + 24 errors on `main`)
- Search for any other uses of the `_mock_adk_streaming` name (none found in the codebase, but worth a CI re-check)
Why these are real fixes, not just "make the failures go away"

- The `_streaming_context_vars` fixture isn't a workaround — it explicitly mirrors what `ContextInterceptor` does in production. Tests that exercise `get_response()` now actually hit the same validation/dispatch path real flows do.
- `MagicMock(spec=...)` is the canonical pytest pattern for satisfying `isinstance` without rebuilding production data structures from scratch. Same approach is already used elsewhere in this conftest.
- Assertions about `streaming_task_message_context` being called are now actually verifiable — previously the dispatcher was skipping over `MagicMock` events without anyone noticing.

Greptile Summary
This PR correctly fixes four root causes that left all 28 streaming model tests broken. The `conftest.py` changes are clean — the new `_streaming_context_vars` fixture properly mirrors production's `ContextInterceptor` with correct token-based teardown, and the `MagicMock(spec=Computer)` fix is the right approach. One residual bare mock in `test_multiple_computer_tools_error` could cause the test to assert the wrong exception if the isinstance guard fires first.

Confidence Score: 4/5
Safe to merge after addressing the bare `MagicMock()` for `computer2` in `test_multiple_computer_tools_error`; all other findings are P2 quality suggestions.

One P1 finding remains: `computer2 = MagicMock()` (no `spec=Computer`) in `test_multiple_computer_tools_error` could cause the test to raise the wrong error if the isinstance guard in `_convert_tools` fires before the multi-tool check. All other issues are P2 concerns that don't block merge.

src/agentex/lib/core/temporal/plugins/openai_agents/tests/test_streaming_model.py — specifically line 551 (`computer2 = MagicMock()` should be `MagicMock(spec=Computer)`).
Important Files Changed
- conftest.py: Adds the `_streaming_context_vars` fixture with proper ContextVar setup/teardown via tokens, and fixes `sample_computer_tool` to use `MagicMock(spec=Computer)` — both changes are correct and well-motivated.
- test_streaming_model.py: Replaces `_mock_adk_streaming` with `_streaming_context_vars`, upgrades event mocks to spec-bound objects in `TestStreamingModelBasics`, and rewires `test_task_id_threading` to use the ContextVar value. One residual bare `MagicMock()` for `computer2` (line 551) may fail the isinstance guard before the intended multi-tool error is raised; the self-referential `initial_content` assertion (line 779) is vacuously true; Settings/Tools tests still use bare completed-event mocks that skip production dispatch.

Sequence Diagram
```mermaid
sequenceDiagram
    participant Test
    participant Fixture as _streaming_context_vars
    participant ContextVar as ContextVars (streaming_task_id etc.)
    participant GetResponse as TemporalStreamingModel.get_response()
    participant Interceptor as ContextInterceptor (production)
    Note over Interceptor,ContextVar: Production flow
    Interceptor->>ContextVar: set(task_id, trace_id, span_id) from request headers
    Note over Test,ContextVar: Test flow (this PR)
    Test->>Fixture: request _streaming_context_vars
    Fixture->>ContextVar: set(sample_task_id, test-trace-id, test-parent-span-id)
    Fixture-->>Test: yield sample_task_id (token held)
    Test->>GetResponse: get_response(...) — no task_id kwarg needed
    GetResponse->>ContextVar: read streaming_task_id.get()
    ContextVar-->>GetResponse: sample_task_id
    GetResponse-->>Test: ModelResponse
    Test->>Fixture: teardown
    Fixture->>ContextVar: reset(token) — restore prior state
```

Comments Outside Diff (3)
src/agentex/lib/core/temporal/plugins/openai_agents/tests/test_streaming_model.py, line 551: `MagicMock()` for second `ComputerTool` may fail isinstance check before the multi-tool guard

The PR's Bug 3 fix (switching `sample_computer_tool` to `MagicMock(spec=Computer)`) was needed because production's `_convert_tools` runs `isinstance(computer, (Computer, AsyncComputer))` validation. The same validation applies to every `ComputerTool` passed in, including `second_computer_tool` here. If the isinstance check fires before the "only one computer tool" guard, this test will raise a different error than `ValueError("You can only provide one computer tool")` — causing it to fail or produce a false positive.

Apply the same pattern used in `sample_computer_tool`:
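A sketch of that suggested change (the one-liner matches the review summary above; surrounding test code elided):

```python
computer2 = MagicMock(spec=Computer)  # was MagicMock(); spec makes the isinstance guard pass
```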
src/agentex/lib/core/temporal/plugins/openai_agents/tests/test_streaming_model.py, lines 777-780: `initial_content` assertion is trivially true

The `assert_called_with(...)` check reads the actual `call_args.kwargs['initial_content']` as the expected value, so it unconditionally passes for any non-empty call. Only the `task_id` comparison does real work here. Either assert against a concrete expected value for `initial_content`, or drop to `assert_called_once()` if the content is intentionally unspecified.
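In miniature, the vacuous pattern and a tightened version (mock and kwargs are illustrative):

```python
from unittest.mock import MagicMock

ctx = MagicMock()
ctx(task_id="t-1", initial_content=["anything at all"])

# Vacuous: the "expected" value is read back from the actual call, so this always passes.
ctx.assert_called_with(
    task_id="t-1",
    initial_content=ctx.call_args.kwargs["initial_content"],
)

# Tightened: pin only the kwarg under test, plus the call count.
ctx.assert_called_once()
assert ctx.call_args.kwargs["task_id"] == "t-1"
```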
src/agentex/lib/core/temporal/plugins/openai_agents/tests/test_streaming_model.py, lines 29-32: `MagicMock` completed events in `TestStreamingModelSettings`/`TestStreamingModelTools` skip production event dispatch

All ~24 mock streams in these classes still use `MagicMock(type="response.completed", ...)`. Because production dispatches via `isinstance(event, ResponseCompletedEvent)`, these bare mocks fail the isinstance check and the completed event is silently ignored. The tests pass only because they assert on `responses.create` call arguments (captured before streaming), not on the returned `ModelResponse`. Consider applying `MagicMock(spec=ResponseCompletedEvent)` (with `.response = MagicMock(output=[], usage=MagicMock())`), consistent with what `TestStreamingModelBasics` does after this PR.