
feat(openai_agents): expose real usage, response_id, plumb previous_response_id, opt-in prompt_cache_key for stateful responses and prompt caching#335

Merged
x merged 8 commits into main from dpeticolas/prompt-cache-upstream on May 4, 2026

Conversation

Contributor

@x x commented Apr 30, 2026

Summary

Four changes to the streaming Responses API path in temporal_streaming_model.py.

My goal is to:

  • Make it easier to track prompt cache hits (usage)
  • Make it easier to get cache hits (prompt_cache_key)
  • Make it easier to use the responses API (response_id, previous_response_id)
  1. Capture real usage from ResponseCompletedEvent.response.usage and surface in the span's output dict.

    TemporalStreamingModel.get_response was constructing a zero-filled Usage and discarding event.response.usage. It now reads usage off the streaming protocol and falls back to zeros only on the error path (stream ends without a completed event).

    The len(''.join(reasoning_contents)) // 4 reasoning-tokens estimator is dropped — the real value arrives in the API response.

    The streaming_model_get_response span now carries {input_tokens, output_tokens, total_tokens, cached_input_tokens, reasoning_tokens} so traces show cache-hit rate (a sketch of the capture logic follows this list).

  2. Return real response_id captured from ResponseCompletedEvent.response.id instead of the previously fabricated f"resp_{uuid.uuid4().hex[:8]}".

    OpenAI Agents SDK reads this field for previous_response_id chaining (run.py:145), exposes it as RunResult.last_response_id (result.py:108), and writes it into traces (tracing/span_data.py:164). A client-side UUID has never been issued by any server, so any caller that picks it up and tries to chain (the documented use case) gets a 400 "response not found".

    Latent since this file was added (commit 2f2a6ed7) because nothing in the codebase chains previous_response_id yet.

  3. Forward previous_response_id, conversation_id, and prompt from the SDK kwarg.

    The SDK's abstract Model.get_response has previous_response_id, conversation_id, prompt as required keyword-only params; the SDK threads previous_response_id down through _ServerConversationTracker when callers set it on Runner.run / RunConfig.

    Our implementation declared **kwargs # noqa: ARG002 and silently swallowed all three. Callers who used the official chaining API got no stateful behavior and no error. Replaced with explicit named params.

    This change is not strictly required: we need to pass previous_response_id, but we could do that via extra_args. That said, the silent swallowing is sketchy. I could be convinced that the right thing to do here is to raise an exception when we see these in kwargs and expect them in extra_args, but this felt less dangerous and more intuitive to the caller.

  4. Plumb prompt_cache_key to responses.create as an opt-in parameter.

    Callers set it via model_settings.extra_args["prompt_cache_key"]. We do not auto-inject a default. I don't trust prompt_cache_key to be standard across all OpenAI-compatible endpoints.

    When unset, the parameter resolves to NOT_GIVEN and is omitted from the request body entirely.
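
A minimal, hedged sketch of the capture described in items 1 and 2 (illustrative only; `UsageTotals`, `consume_stream`, and the event handling are stand-ins for the actual implementation):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class UsageTotals:
    """Stand-in for the usage fields surfaced in the span output."""
    input_tokens: int = 0
    output_tokens: int = 0
    total_tokens: int = 0
    cached_input_tokens: int = 0
    reasoning_tokens: int = 0


async def consume_stream(stream: Any) -> tuple[UsageTotals, Optional[str]]:
    captured_usage: Optional[UsageTotals] = None
    captured_response_id: Optional[str] = None

    async for event in stream:
        # ... forward text/reasoning deltas to Redis as before ...
        if type(event).__name__ == "ResponseCompletedEvent":
            usage = event.response.usage
            captured_response_id = event.response.id  # server-issued id, not a local UUID
            captured_usage = UsageTotals(
                input_tokens=usage.input_tokens,
                output_tokens=usage.output_tokens,
                total_tokens=usage.total_tokens,
                cached_input_tokens=usage.input_tokens_details.cached_tokens,
                reasoning_tokens=usage.output_tokens_details.reasoning_tokens,
            )

    # Error path: stream ended without a completed event -> zero usage, no response id.
    return captured_usage or UsageTotals(), captured_response_id
```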

Test plan

All 38 tests in test_streaming_model.py pass locally:

  • test_usage_captured_from_completed_event
  • test_usage_falls_back_when_no_completed_event
  • test_usage_emitted_in_span_output
  • test_response_id_captured_from_completed_event
  • test_response_id_is_none_when_no_completed_event — guards against the fake-UUID footgun
  • test_prompt_cache_key_not_sent_by_default — verifies NOT_GIVEN fallback for non-OpenAI compat
  • test_prompt_cache_key_forwarded_when_opted_in
  • test_previous_response_id_not_sent_by_default — verifies NOT_GIVEN fallback
  • test_previous_response_id_forwarded_via_sdk_kwarg — verifies the SDK's official chaining API now works
  • test_conversation_id_and_prompt_accepted_but_not_forwarded — verifies SDK contract compliance without surface-area expansion
  • All 28 pre-existing tests still pass (after stripping vestigial task_id kwargs)

Why these changes together

The motivating workstream (downstream of this PR) is a stateful-Responses-API migration for ST&S agents to chain via previous_response_id for 40–80% better cache utilization on reasoning models. That migration needs:

  • Real Usage to measure cache hit rate before/after.
  • Real response_id to actually pass back as previous_response_id.
  • previous_response_id actually reaching responses.create instead of being silently swallowed.
  • Optional prompt_cache_key for callers who want it without forcing it on everyone.

Once a release containing these changes lands and aimi-scale bumps its agentex-sdk pin, the runtime monkey-patch in st_s/*/agentex_usage_patch.py and the corresponding _apply_agentex_usage_patch() calls in each run_worker.py can be deleted.

Greptile Summary

Four correctness fixes to the Responses API streaming path: (1) real Usage is now read from ResponseCompletedEvent.response.usage and surfaced in the span; (2) response_id is the server-issued value instead of a fabricated UUID that would 400 on any chained turn; (3) previous_response_id, conversation_id, and prompt are explicitly accepted and forwarded rather than silently swallowed by **kwargs; (4) prompt_cache_key is opt-in via extra_args and resolves to NOT_GIVEN by default.

Confidence Score: 5/5

Safe to merge — all changes are correctness improvements with safe fallback paths and comprehensive test coverage.

No P0 or P1 findings. All four changes fix real latent bugs (fake UUID chaining, zero usage, silently dropped SDK params). Fallbacks are explicit (zeros, None, NOT_GIVEN). The removal of **kwargs is intentional and acknowledged. Tests cover every new code path.

No files require special attention.

Important Files Changed

  • src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py: Captures real usage and response_id from ResponseCompletedEvent, forwards previous_response_id/conversation_id/prompt as explicit kwargs (replacing **kwargs swallowing), and plumbs opt-in prompt_cache_key — all correctness improvements with safe fallbacks.
  • src/agentex/lib/core/temporal/plugins/openai_agents/tests/test_streaming_model.py: Adds 10 targeted tests for the new behaviour (usage capture, response_id, prompt_cache_key opt-in, previous_response_id chaining, conversation/prompt forwarding) and strips the now-vestigial task_id kwarg from 28 existing test calls.

Sequence Diagram

sequenceDiagram
    participant SDK as OpenAI Agents SDK
    participant TSM as TemporalStreamingModel
    participant OAI as OpenAI responses.create
    participant Redis as Redis (streaming)

    SDK->>TSM: get_response(..., previous_response_id, conversation_id, prompt)
    Note over TSM: pops prompt_cache_key from extra_args (NOT_GIVEN default)
    TSM->>OAI: responses.create(stream=True, previous_response_id, conversation, prompt, prompt_cache_key, ...)
    loop stream events
        OAI-->>TSM: ResponseTextDeltaEvent / etc.
        TSM->>Redis: stream partial content
    end
    OAI-->>TSM: ResponseCompletedEvent(response.id, response.usage)
    Note over TSM: captured_response_id = response.id<br/>captured_usage = response.usage
    TSM-->>SDK: ModelResponse(response_id=captured_response_id, usage=real_usage)
    Note over SDK: RunResult.last_response_id = captured_response_id<br/>(used for next turn's previous_response_id)


x added 3 commits April 30, 2026 17:12
…_key

Three related changes to the streaming Responses API path in
`TemporalStreamingModel.get_response`. All are observability/cache
improvements; none changes how the API is called for callers who
don't opt in.

1. Capture real `Usage` from `ResponseCompletedEvent.response.usage`.
   Previously a zero-filled `Usage` was constructed and `event.response.usage`
   from the streaming protocol was discarded. The reasoning-tokens
   estimator (`len(''.join(reasoning_contents)) // 4`) is also dropped —
   the real value arrives in the API response. Falls back to zeros only
   when the stream ends without a `ResponseCompletedEvent` (error path).

2. Surface usage in the span's `output` dict. The
   `streaming_model_get_response` span now carries
   `{input_tokens, output_tokens, total_tokens, cached_input_tokens,
   reasoning_tokens}` so traces show cache-hit rate without external
   log scraping.

3. Plumb `prompt_cache_key` to `responses.create` as an opt-in. Callers
   set it via `model_settings.extra_args["prompt_cache_key"]`. We do not
   auto-inject a default — `prompt_cache_key` is not standard across
   OpenAI-compatible endpoints, and a non-OpenAI server that strictly
   validates request bodies could reject the field. When unset, the
   parameter resolves to `NOT_GIVEN` and is omitted from the request
   body entirely. Behavior on alternative providers is identical to
   today's unless a caller explicitly opts in.
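
A sketch of the opt-in plumbing described above (a minimal sketch, not the actual method; the function and variable names are illustrative, and only `prompt_cache_key`, `extra_args`, and `NOT_GIVEN` come from the text):

```python
from typing import Any, Optional

from openai import NOT_GIVEN


async def create_stream_with_optional_cache_key(
    client: Any, model: str, input_items: Any, extra_args: Optional[dict[str, Any]]
):
    # Copy so popping does not mutate the caller's ModelSettings.extra_args.
    args = dict(extra_args or {})
    # Opt-in only: default to NOT_GIVEN so the field is omitted from the request
    # body for callers (and non-OpenAI backends) that never asked for it.
    prompt_cache_key = args.pop("prompt_cache_key", NOT_GIVEN)

    return await client.responses.create(
        model=model,
        input=input_items,
        stream=True,
        prompt_cache_key=prompt_cache_key,
        **args,  # popped above, so the key is never passed twice
    )
```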
`TemporalStreamingModel.get_response` was synthesizing a client-side
UUID for `ModelResponse.response_id`:

    response_id=f"resp_{uuid.uuid4().hex[:8]}"

Replace with the real `response.id` captured off
`ResponseCompletedEvent.response.id` (alongside the `Usage` capture
already happening in the same branch). On the error path, where the
stream ends without a `ResponseCompletedEvent`, we return `None` —
matching the documented `str | None` contract on
`ModelResponse.response_id`.

## Why this matters

The OpenAI Agents SDK reads `ModelResponse.response_id` in three places:

- `agents/run.py:145` — gates whether the SDK chains via
  `previous_response_id` on the next call (the conditional is
  None-tolerant: a None value just leaves the chain pointer alone).
- `agents/result.py:108` — exposes the value to user code as
  `RunResult.last_response_id`.
- `agents/tracing/span_data.py:164` — written into trace records.

A client-side UUID was never issued by any server. Any caller that
picks it up and tries to chain via `previous_response_id` (the
documented use case for `RunResult.last_response_id`) gets a 400
"response not found" from the API, surfacing far from the actual
cause.

Comparable SDK providers do this correctly:

- `agents/models/openai_responses.py:149`: `response_id=response.id`
- `agents/models/openai_chatcompletions.py:135`: `response_id=None`
- `agents/extensions/models/litellm_model.py:182`: `response_id=None`

`None` is the documented sentinel for "this provider doesn't support
response_id," and the SDK is built to handle it.

The bug has been latent since this file was added (commit 2f2a6ed,
Oct 10) because nothing in the codebase's call paths chains
`previous_response_id` yet. The first caller that does (e.g. a
multi-turn stateful Responses API workflow) triggers it.

## Compatibility

This change is invisible to callers that don't read `response_id` — and
nothing in `scale-agentex-python` reads it. A repo-wide grep finds zero
consumers; only the (now-fixed) write site exists. The field is
serialized into Temporal event history and trace records but consumed
only by the OpenAI Agents SDK, which already handles `None`.
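
A hedged illustration of the chaining the SDK documents (the agent object and messages are placeholders; `Runner.run`, `previous_response_id`, `last_response_id`, and `final_output` are the SDK surfaces cited above):

```python
from agents import Agent, Runner


async def two_turn_conversation(agent: Agent, first_msg: str, second_msg: str) -> str:
    first = await Runner.run(agent, first_msg)

    # Only works when the model returned the server-issued response.id; with the old
    # fabricated UUID this second call would 400 with "response not found".
    second = await Runner.run(
        agent,
        second_msg,
        previous_response_id=first.last_response_id,
    )
    return str(second.final_output)
```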
… key

Seven new tests in `TestStreamingModelUsageResponseIdAndCacheKey`:

- Usage captured from `ResponseCompletedEvent.response.usage`
- Usage falls back to zeros when stream ends without a completed event
- Usage emitted in span output_data["usage"]
- response_id captured from `ResponseCompletedEvent.response.id`
- response_id is None (NOT a fabricated UUID) when stream ends without
  a completed event — guards against the previous footgun where a
  client-side UUID would be returned and silently break downstream
  `previous_response_id` chaining
- prompt_cache_key resolves to NOT_GIVEN by default (omitted from
  request body, safe for non-OpenAI endpoints)
- prompt_cache_key forwarded when caller opts in via
  `model_settings.extra_args["prompt_cache_key"]`, and popped from
  extra_args so it isn't passed twice

Pre-existing tests in `TestStreamingModelBasics` (test_responses_api_streaming,
test_task_id_threading, test_redis_context_creation) updated to set
`response.id=None` on their `MagicMock(spec=ResponseCompletedEvent)`
mocks. Without this, the auto-generated MagicMock attribute for
`response.id` flows into `ModelResponse.response_id` and trips
pydantic's `str | None` validation.
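
Roughly what that mock adjustment looks like (illustrative only; the surrounding test setup is omitted and the explicit nested-mock assignment is an assumption, not the exact test code):

```python
from unittest.mock import MagicMock

from openai.types.responses import ResponseCompletedEvent

completed_event = MagicMock(spec=ResponseCompletedEvent)
# Pin the attributes that flow into ModelResponse; without this, MagicMock
# auto-generates child mocks that fail the `str | None` pydantic validation
# on ModelResponse.response_id.
completed_event.response = MagicMock()
completed_event.response.id = None
completed_event.response.usage = None  # also exercises the zero-usage fallback
```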
@x x force-pushed the dpeticolas/prompt-cache-upstream branch from dc7de0f to eb8ff68 on April 30, 2026 21:15
@x x changed the title from "feat(openai_agents): capture real Usage + enable prompt_cache_key routing" to "feat(openai_agents): expose real Usage and response_id, opt-in prompt_cache_key" Apr 30, 2026
@x x changed the title from "feat(openai_agents): expose real Usage and response_id, opt-in prompt_cache_key" to "feat(openai_agents): expose real response.usage, map response_id=response.id, opt-in prompt_cache_key" Apr 30, 2026
The OpenAI Agents SDK's `Model.get_response` abstract has three
keyword-only parameters: `previous_response_id`, `conversation_id`,
`prompt`. The SDK threads them down through `_ServerConversationTracker`
when callers use `Runner.run(..., previous_response_id=X)` or set
`RunConfig` with `auto_previous_response_id=True`.

`TemporalStreamingModel.get_response` was declared with
`**kwargs # noqa: ARG002`, which silently swallowed all three. Callers
who used the SDK's official chaining API saw their `previous_response_id`
disappear and got no stateful behavior — without an error.

This commit:

- Replaces `**kwargs` with explicit `previous_response_id`,
  `conversation_id`, `prompt` params, matching the abstract.
- Forwards `previous_response_id` to `responses.create` via
  `_non_null_or_not_given` (so `None` resolves to `NOT_GIVEN` and the
  field is omitted from the request body — identical behavior to today
  for callers that don't opt in).
- Accepts `conversation_id` and `prompt` for SDK contract compliance
  but does not forward them yet (marked `# noqa: ARG002`); they can be
  wired through later if a use case appears.
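
A simplified sketch of the new shape (not the actual method; the real implementation lives on `TemporalStreamingModel` and takes many more parameters, and `_non_null_or_not_given` is restated here only to keep the example self-contained):

```python
from typing import Any, Optional

from openai import NOT_GIVEN


def _non_null_or_not_given(value: Any) -> Any:
    """None -> NOT_GIVEN, so unset fields are omitted from the request body."""
    return value if value is not None else NOT_GIVEN


async def get_response_sketch(
    client: Any,
    *,
    model: str,
    input_items: Any,
    previous_response_id: Optional[str] = None,
    conversation_id: Optional[str] = None,  # accepted for SDK contract compliance,
    prompt: Optional[Any] = None,           # not forwarded yet (see the next commit)
):
    return await client.responses.create(
        model=model,
        input=input_items,
        stream=True,
        previous_response_id=_non_null_or_not_given(previous_response_id),
    )
```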

## Compatibility with non-OpenAI backends

Same opt-in pattern as `prompt_cache_key`. `TemporalStreamingModel`
calls `responses.create`, but the underlying client can be pointed at
any OpenAI-compatible server (LiteLLM proxy, Foundry, vLLM, etc.).
Some of those backends don't recognize `previous_response_id`. Because
we forward it only when explicitly set, callers who don't opt in see
no change in the wire request — the field is filtered out by
`NOT_GIVEN`. Callers who opt in are responsible for knowing whether
their backend supports it.

## Test housekeeping

The 27 existing tests that passed `task_id=sample_task_id` to
`get_response` were relying on `**kwargs` to silently swallow it.
Production reads `task_id` from a ContextVar (set by
`ContextInterceptor` in real Temporal flows, set by the
`_streaming_context_vars` fixture in tests), not from a function
argument. The kwarg was vestigial cruft. Removed.
@x x changed the title from "feat(openai_agents): expose real response.usage, map response_id=response.id, opt-in prompt_cache_key" to "feat(openai_agents): expose real Usage/response_id, plumb previous_response_id, opt-in prompt_cache_key" Apr 30, 2026
@x x changed the title from "feat(openai_agents): expose real Usage/response_id, plumb previous_response_id, opt-in prompt_cache_key" to "feat(openai_agents): expose real usage, response_id, plumb previous_response_id, opt-in prompt_cache_key for stateful responses and prompt caching" Apr 30, 2026
x added 2 commits April 30, 2026 18:08
…create

The SDK's ``Model.get_response`` abstract has three Responses API
server-state parameters: ``previous_response_id``, ``conversation_id``,
``prompt``. The prior commit wired up ``previous_response_id`` and
accepted the other two for SDK contract compliance but discarded them
with ``# noqa: ARG002``.

Accept-and-discard is a code smell: callers using the SDK's
``Runner.run(conversation_id=..., prompt=...)`` API would see their
arguments silently dropped. Since both map directly to ``responses.create``
kwargs and we're already on that endpoint, the cost of forwarding is two
lines and removes the smell entirely.

- ``conversation_id`` (SDK abstract name) → ``conversation`` (responses.create
  endpoint kwarg). The ``Conversation`` type accepts ``str`` directly, so
  no translation is needed.
- ``prompt`` is the same name on both sides.

Both follow the same opt-in pattern as ``previous_response_id`` and
``prompt_cache_key``: ``None`` resolves to ``NOT_GIVEN`` and is omitted
from the request body, so behavior on alternative OpenAI-compatible
backends is unchanged unless a caller explicitly opts in.
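
A minimal sketch of the two newly forwarded kwargs (names per the mapping above; the helper and wrapper function here are illustrative, not the actual code):

```python
from typing import Any, Optional

from openai import NOT_GIVEN


def _given(value: Any) -> Any:
    # Same idea as _non_null_or_not_given: unset values never reach the wire request.
    return value if value is not None else NOT_GIVEN


async def forward_server_state(
    client: Any, *, model: str, input_items: Any,
    conversation_id: Optional[str] = None, prompt: Optional[Any] = None,
):
    return await client.responses.create(
        model=model,
        input=input_items,
        stream=True,
        conversation=_given(conversation_id),  # SDK's conversation_id -> endpoint's `conversation`
        prompt=_given(prompt),                 # same name on both sides
    )
```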
… union

Two ruff fixes for the test file:

- ARG002 on 25 test method signatures: the prior commit
  (forward previous_response_id from SDK kwarg) stripped the
  vestigial ``task_id=sample_task_id`` kwargs from get_response calls,
  but left ``sample_task_id`` in the test method parameter lists.
  The contextvars fixture (``_streaming_context_vars``) already pulls
  ``sample_task_id`` transitively, so the explicit param is redundant.
  Removed from the 25 flagged signatures; preserved on
  ``test_responses_api_streaming`` where it's still used inside the
  body to assert against the streaming context.

- FA102 on _make_response_completed_event: the new test helper used a
  PEP 604 union (``str | None``) without ``from __future__ import
  annotations``. Switched to ``Optional[str]`` to keep the change
  local to the helper rather than retrofitting future annotations
  across the file.
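
For illustration, the helper signature after the FA102 fix (the body and parameter name are a hypothetical reduction; only the helper name and the `Optional[str]` annotation come from the commit):

```python
from typing import Optional
from unittest.mock import MagicMock


def _make_response_completed_event(response_id: Optional[str] = None) -> MagicMock:
    # `Optional[str]` instead of `str | None`: the test module has no
    # `from __future__ import annotations`, so PEP 604 syntax trips ruff's FA102.
    event = MagicMock()
    event.response.id = response_id
    return event
```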
@x x enabled auto-merge May 1, 2026 15:34
auto-merge was automatically disabled May 4, 2026 17:47

Merge commits are not allowed on this repository

@x x merged commit ba5d64b into main May 4, 2026
32 checks passed
@x x deleted the dpeticolas/prompt-cache-upstream branch May 4, 2026 18:47
@stainless-app stainless-app Bot mentioned this pull request May 4, 2026