
feat(openai_agents): expose real usage, response_id, plumb previous_response_id, opt-in prompt_cache_key for stateful responses and prompt caching#335

Merged
x merged 8 commits into main from dpeticolas/prompt-cache-upstream on May 4, 2026

Conversation

Contributor

@x x commented Apr 30, 2026

Summary

Four changes to the streaming Responses API path in temporal_streaming_model.py.

My goal is to:

  • Make it easier to track prompt cache hits (usage)
  • Make it easier to get cache hits (prompt_cache_key)
  • Make it easier to use the responses API (response_id, previous_response_id)
  1. Capture real usage from ResponseCompletedEvent.response.usage and surface in the span's output dict.

    TemporalStreamingModel.get_response was constructing a zero-filled Usage and discarding event.response.usage. It now reads usage off the streaming protocol and falls back to zeros only on the error path (stream ends without a completed event).

    The len(''.join(reasoning_contents)) // 4 reasoning-tokens estimator is dropped — the real value arrives in the API response.

    The streaming_model_get_response span now carries {input_tokens, output_tokens, total_tokens, cached_input_tokens, reasoning_tokens} so traces show cache-hit rate (a sketch of the capture logic follows this list).

  2. Return real response_id captured from ResponseCompletedEvent.response.id instead of the previously fabricated f"resp_{uuid.uuid4().hex[:8]}".

    OpenAI Agents SDK reads this field for previous_response_id chaining (run.py:145), exposes it as RunResult.last_response_id (result.py:108), and writes it into traces (tracing/span_data.py:164). A client-side UUID has never been issued by any server, so any caller that picks it up and tries to chain (the documented use case) gets a 400 "response not found".

    Latent since this file was added (commit 2f2a6ed7) because nothing in the codebase chains previous_response_id yet.

  3. Forward previous_response_id, conversation_id, and prompt from the SDK kwarg.

    The SDK's abstract Model.get_response has previous_response_id, conversation_id, prompt as required keyword-only params; the SDK threads previous_response_id down through _ServerConversationTracker when callers set it on Runner.run / RunConfig.

    Our implementation declared **kwargs # noqa: ARG002 and silently swallowed all three. Callers who used the official chaining API got no stateful behavior and no error. Replaced with explicit named params.

    This change is not strictly required: we need to pass previous_response_id, but we could do that via extra_args. That said, the silent swallowing is sketchy. I could be convinced that the right thing to do here is to raise an exception when we see these in kwargs and expect them in extra_args, but this felt less dangerous and more intuitive to the caller.

  4. Plumb prompt_cache_key to responses.create as an opt-in parameter.

    Callers set it via model_settings.extra_args["prompt_cache_key"]. We do not auto-inject a default. I don't trust prompt_cache_key to be standard across all OpenAI-compatible endpoints.

    When unset, the parameter resolves to NOT_GIVEN and is omitted from the request body entirely.
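
A minimal, hedged sketch of the capture described in items 1 and 2 (illustrative only; `UsageTotals`, `consume_stream`, and the event handling are stand-ins for the actual implementation):

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class UsageTotals:
    """Stand-in for the usage fields surfaced in the span output."""
    input_tokens: int = 0
    output_tokens: int = 0
    total_tokens: int = 0
    cached_input_tokens: int = 0
    reasoning_tokens: int = 0


async def consume_stream(stream: Any) -> tuple[UsageTotals, Optional[str]]:
    captured_usage: Optional[UsageTotals] = None
    captured_response_id: Optional[str] = None

    async for event in stream:
        # ... forward text/reasoning deltas to Redis as before ...
        if type(event).__name__ == "ResponseCompletedEvent":
            usage = event.response.usage
            captured_response_id = event.response.id  # server-issued id, not a local UUID
            captured_usage = UsageTotals(
                input_tokens=usage.input_tokens,
                output_tokens=usage.output_tokens,
                total_tokens=usage.total_tokens,
                cached_input_tokens=usage.input_tokens_details.cached_tokens,
                reasoning_tokens=usage.output_tokens_details.reasoning_tokens,
            )

    # Error path: stream ended without a completed event -> zero usage, no response id.
    return captured_usage or UsageTotals(), captured_response_id
```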

Test plan

All 38 tests in test_streaming_model.py pass locally:

  • test_usage_captured_from_completed_event
  • test_usage_falls_back_when_no_completed_event
  • test_usage_emitted_in_span_output
  • test_response_id_captured_from_completed_event
  • test_response_id_is_none_when_no_completed_event — guards against the fake-UUID footgun
  • test_prompt_cache_key_not_sent_by_default — verifies NOT_GIVEN fallback for non-OpenAI compat
  • test_prompt_cache_key_forwarded_when_opted_in
  • test_previous_response_id_not_sent_by_default — verifies NOT_GIVEN fallback
  • test_previous_response_id_forwarded_via_sdk_kwarg — verifies the SDK's official chaining API now works
  • test_conversation_id_and_prompt_accepted_but_not_forwarded — verifies SDK contract compliance without surface-area expansion
  • All 28 pre-existing tests still pass (after stripping vestigial task_id kwargs)

Why these changes together

The motivating workstream (downstream of this PR) is a stateful-Responses-API migration for ST&S agents to chain via previous_response_id for 40–80% better cache utilization on reasoning models. That migration needs:

  • Real Usage to measure cache hit rate before/after.
  • Real response_id to actually pass back as previous_response_id.
  • previous_response_id actually reaching responses.create instead of being silently swallowed.
  • Optional prompt_cache_key for callers who want it without forcing it on everyone.

Once a release containing these changes lands and aimi-scale bumps its agentex-sdk pin, the runtime monkey-patch in st_s/*/agentex_usage_patch.py and the corresponding _apply_agentex_usage_patch() calls in each run_worker.py can be deleted.

Greptile Summary

Four correctness fixes to the Responses API streaming path: (1) real Usage is now read from ResponseCompletedEvent.response.usage and surfaced in the span; (2) response_id is the server-issued value instead of a fabricated UUID that would 400 on any chained turn; (3) previous_response_id, conversation_id, and prompt are explicitly accepted and forwarded rather than silently swallowed by **kwargs; (4) prompt_cache_key is opt-in via extra_args and resolves to NOT_GIVEN by default.

Confidence Score: 5/5

Safe to merge — all changes are correctness improvements with safe fallback paths and comprehensive test coverage.

No P0 or P1 findings. All four changes fix real latent bugs (fake UUID chaining, zero usage, silently dropped SDK params). Fallbacks are explicit (zeros, None, NOT_GIVEN). The removal of **kwargs is intentional and acknowledged. Tests cover every new code path.

No files require special attention.

Important Files Changed

  • src/agentex/lib/core/temporal/plugins/openai_agents/models/temporal_streaming_model.py: Captures real usage and response_id from ResponseCompletedEvent, forwards previous_response_id/conversation_id/prompt as explicit kwargs (replacing **kwargs swallowing), and plumbs opt-in prompt_cache_key — all correctness improvements with safe fallbacks.
  • src/agentex/lib/core/temporal/plugins/openai_agents/tests/test_streaming_model.py: Adds 10 targeted tests for the new behaviour (usage capture, response_id, prompt_cache_key opt-in, previous_response_id chaining, conversation/prompt forwarding) and strips the now-vestigial task_id kwarg from 28 existing test calls.

Sequence Diagram

sequenceDiagram
    participant SDK as OpenAI Agents SDK
    participant TSM as TemporalStreamingModel
    participant OAI as OpenAI responses.create
    participant Redis as Redis (streaming)

    SDK->>TSM: get_response(..., previous_response_id, conversation_id, prompt)
    Note over TSM: pops prompt_cache_key from extra_args (NOT_GIVEN default)
    TSM->>OAI: responses.create(stream=True, previous_response_id, conversation, prompt, prompt_cache_key, ...)
    loop stream events
        OAI-->>TSM: ResponseTextDeltaEvent / etc.
        TSM->>Redis: stream partial content
    end
    OAI-->>TSM: ResponseCompletedEvent(response.id, response.usage)
    Note over TSM: captured_response_id = response.id<br/>captured_usage = response.usage
    TSM-->>SDK: ModelResponse(response_id=captured_response_id, usage=real_usage)
    Note over SDK: RunResult.last_response_id = captured_response_id<br/>(used for next turn's previous_response_id)


x added 3 commits April 30, 2026 17:12
…_key

Three related changes to the streaming Responses API path in
`TemporalStreamingModel.get_response`. All are observability/cache
improvements; none changes how the API is called for callers who
don't opt in.

1. Capture real `Usage` from `ResponseCompletedEvent.response.usage`.
   Previously a zero-filled `Usage` was constructed and `event.response.usage`
   from the streaming protocol was discarded. The reasoning-tokens
   estimator (`len(''.join(reasoning_contents)) // 4`) is also dropped —
   the real value arrives in the API response. Falls back to zeros only
   when the stream ends without a `ResponseCompletedEvent` (error path).

2. Surface usage in the span's `output` dict. The
   `streaming_model_get_response` span now carries
   `{input_tokens, output_tokens, total_tokens, cached_input_tokens,
   reasoning_tokens}` so traces show cache-hit rate without external
   log scraping.

3. Plumb `prompt_cache_key` to `responses.create` as an opt-in. Callers
   set it via `model_settings.extra_args["prompt_cache_key"]`. We do not
   auto-inject a default — `prompt_cache_key` is not standard across
   OpenAI-compatible endpoints, and a non-OpenAI server that strictly
   validates request bodies could reject the field. When unset, the
   parameter resolves to `NOT_GIVEN` and is omitted from the request
   body entirely. Behavior on alternative providers is identical to
   today's unless a caller explicitly opts in.
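
A sketch of the opt-in plumbing described above (a minimal sketch, not the actual method; the function and variable names are illustrative, and only `prompt_cache_key`, `extra_args`, and `NOT_GIVEN` come from the text):

```python
from typing import Any, Optional

from openai import NOT_GIVEN


async def create_stream_with_optional_cache_key(
    client: Any, model: str, input_items: Any, extra_args: Optional[dict[str, Any]]
):
    # Copy so popping does not mutate the caller's ModelSettings.extra_args.
    args = dict(extra_args or {})
    # Opt-in only: default to NOT_GIVEN so the field is omitted from the request
    # body for callers (and non-OpenAI backends) that never asked for it.
    prompt_cache_key = args.pop("prompt_cache_key", NOT_GIVEN)

    return await client.responses.create(
        model=model,
        input=input_items,
        stream=True,
        prompt_cache_key=prompt_cache_key,
        **args,  # popped above, so the key is never passed twice
    )
```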
`TemporalStreamingModel.get_response` was synthesizing a client-side
UUID for `ModelResponse.response_id`:

    response_id=f"resp_{uuid.uuid4().hex[:8]}"

Replace with the real `response.id` captured off
`ResponseCompletedEvent.response.id` (alongside the `Usage` capture
already happening in the same branch). On the error path, where the
stream ends without a `ResponseCompletedEvent`, we return `None` —
matching the documented `str | None` contract on
`ModelResponse.response_id`.

## Why this matters

The OpenAI Agents SDK reads `ModelResponse.response_id` in three places:

- `agents/run.py:145` — gates whether the SDK chains via
  `previous_response_id` on the next call (the conditional is
  None-tolerant: a None value just leaves the chain pointer alone).
- `agents/result.py:108` — exposes the value to user code as
  `RunResult.last_response_id`.
- `agents/tracing/span_data.py:164` — written into trace records.

A client-side UUID was never issued by any server. Any caller that
picks it up and tries to chain via `previous_response_id` (the
documented use case for `RunResult.last_response_id`) gets a 400
"response not found" from the API, surfacing far from the actual
cause.

Comparable SDK providers do this correctly:

- `agents/models/openai_responses.py:149`: `response_id=response.id`
- `agents/models/openai_chatcompletions.py:135`: `response_id=None`
- `agents/extensions/models/litellm_model.py:182`: `response_id=None`

`None` is the documented sentinel for "this provider doesn't support
response_id," and the SDK is built to handle it.

The bug has been latent since this file was added (commit 2f2a6ed,
Oct 10) because nothing in the codebase's call paths chains
`previous_response_id` yet. The first caller that does (e.g. a
multi-turn stateful Responses API workflow) triggers it.

## Compatibility

This change is invisible to callers that don't read `response_id` — and
nothing in `scale-agentex-python` reads it. A repo-wide grep finds zero
consumers; only the (now-fixed) write site exists. The field is
serialized into Temporal event history and trace records but consumed
only by the OpenAI Agents SDK, which already handles `None`.
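
A hedged illustration of the chaining the SDK documents (the agent object and messages are placeholders; `Runner.run`, `previous_response_id`, `last_response_id`, and `final_output` are the SDK surfaces cited above):

```python
from agents import Agent, Runner


async def two_turn_conversation(agent: Agent, first_msg: str, second_msg: str) -> str:
    first = await Runner.run(agent, first_msg)

    # Only works when the model returned the server-issued response.id; with the old
    # fabricated UUID this second call would 400 with "response not found".
    second = await Runner.run(
        agent,
        second_msg,
        previous_response_id=first.last_response_id,
    )
    return str(second.final_output)
```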
… key

Seven new tests in `TestStreamingModelUsageResponseIdAndCacheKey`:

- Usage captured from `ResponseCompletedEvent.response.usage`
- Usage falls back to zeros when stream ends without a completed event
- Usage emitted in span output_data["usage"]
- response_id captured from `ResponseCompletedEvent.response.id`
- response_id is None (NOT a fabricated UUID) when stream ends without
  a completed event — guards against the previous footgun where a
  client-side UUID would be returned and silently break downstream
  `previous_response_id` chaining
- prompt_cache_key resolves to NOT_GIVEN by default (omitted from
  request body, safe for non-OpenAI endpoints)
- prompt_cache_key forwarded when caller opts in via
  `model_settings.extra_args["prompt_cache_key"]`, and popped from
  extra_args so it isn't passed twice

Pre-existing tests in `TestStreamingModelBasics` (test_responses_api_streaming,
test_task_id_threading, test_redis_context_creation) updated to set
`response.id=None` on their `MagicMock(spec=ResponseCompletedEvent)`
mocks. Without this, the auto-generated MagicMock attribute for
`response.id` flows into `ModelResponse.response_id` and trips
pydantic's `str | None` validation.
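
Roughly what that mock adjustment looks like (illustrative only; the surrounding test setup is omitted and the explicit nested-mock assignment is an assumption, not the exact test code):

```python
from unittest.mock import MagicMock

from openai.types.responses import ResponseCompletedEvent

completed_event = MagicMock(spec=ResponseCompletedEvent)
# Pin the attributes that flow into ModelResponse; without this, MagicMock
# auto-generates child mocks that fail the `str | None` pydantic validation
# on ModelResponse.response_id.
completed_event.response = MagicMock()
completed_event.response.id = None
completed_event.response.usage = None  # also exercises the zero-usage fallback
```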
@x x force-pushed the dpeticolas/prompt-cache-upstream branch from dc7de0f to eb8ff68 on April 30, 2026 21:15
@x x changed the title from "feat(openai_agents): capture real Usage + enable prompt_cache_key routing" to "feat(openai_agents): expose real Usage and response_id, opt-in prompt_cache_key" Apr 30, 2026
@x x changed the title from "feat(openai_agents): expose real Usage and response_id, opt-in prompt_cache_key" to "feat(openai_agents): expose real response.usage, map response_id=response.id, opt-in prompt_cache_key" Apr 30, 2026
The OpenAI Agents SDK's `Model.get_response` abstract has three
keyword-only parameters: `previous_response_id`, `conversation_id`,
`prompt`. The SDK threads them down through `_ServerConversationTracker`
when callers use `Runner.run(..., previous_response_id=X)` or set
`RunConfig` with `auto_previous_response_id=True`.

`TemporalStreamingModel.get_response` was declared with
`**kwargs # noqa: ARG002`, which silently swallowed all three. Callers
who used the SDK's official chaining API saw their `previous_response_id`
disappear and got no stateful behavior — without an error.

This commit:

- Replaces `**kwargs` with explicit `previous_response_id`,
  `conversation_id`, `prompt` params, matching the abstract.
- Forwards `previous_response_id` to `responses.create` via
  `_non_null_or_not_given` (so `None` resolves to `NOT_GIVEN` and the
  field is omitted from the request body — identical behavior to today
  for callers that don't opt in).
- Accepts `conversation_id` and `prompt` for SDK contract compliance
  but does not forward them yet (marked `# noqa: ARG002`); they can be
  wired through later if a use case appears.
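
A simplified sketch of the new shape (not the actual method; the real implementation lives on `TemporalStreamingModel` and takes many more parameters, and `_non_null_or_not_given` is restated here only to keep the example self-contained):

```python
from typing import Any, Optional

from openai import NOT_GIVEN


def _non_null_or_not_given(value: Any) -> Any:
    """None -> NOT_GIVEN, so unset fields are omitted from the request body."""
    return value if value is not None else NOT_GIVEN


async def get_response_sketch(
    client: Any,
    *,
    model: str,
    input_items: Any,
    previous_response_id: Optional[str] = None,
    conversation_id: Optional[str] = None,  # accepted for SDK contract compliance,
    prompt: Optional[Any] = None,           # not forwarded yet (see the next commit)
):
    return await client.responses.create(
        model=model,
        input=input_items,
        stream=True,
        previous_response_id=_non_null_or_not_given(previous_response_id),
    )
```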

## Compatibility with non-OpenAI backends

Same opt-in pattern as `prompt_cache_key`. `TemporalStreamingModel`
calls `responses.create`, but the underlying client can be pointed at
any OpenAI-compatible server (LiteLLM proxy, Foundry, vLLM, etc.).
Some of those backends don't recognize `previous_response_id`. Because
we forward it only when explicitly set, callers who don't opt in see
no change in the wire request — the field is filtered out by
`NOT_GIVEN`. Callers who opt in are responsible for knowing whether
their backend supports it.

## Test housekeeping

The 27 existing tests that passed `task_id=sample_task_id` to
`get_response` were relying on `**kwargs` to silently swallow it.
Production reads `task_id` from a ContextVar (set by
`ContextInterceptor` in real Temporal flows, set by the
`_streaming_context_vars` fixture in tests), not from a function
argument. The kwarg was vestigial cruft. Removed.
@x x changed the title from "feat(openai_agents): expose real response.usage, map response_id=response.id, opt-in prompt_cache_key" to "feat(openai_agents): expose real Usage/response_id, plumb previous_response_id, opt-in prompt_cache_key" Apr 30, 2026
@x x changed the title from "feat(openai_agents): expose real Usage/response_id, plumb previous_response_id, opt-in prompt_cache_key" to "feat(openai_agents): expose real usage, response_id, plumb previous_response_id, opt-in prompt_cache_key for stateful responses and prompt caching" Apr 30, 2026
x added 2 commits April 30, 2026 18:08
…create

The SDK's ``Model.get_response`` abstract has three Responses API
server-state parameters: ``previous_response_id``, ``conversation_id``,
``prompt``. The prior commit wired up ``previous_response_id`` and
accepted the other two for SDK contract compliance but discarded them
with ``# noqa: ARG002``.

Accept-and-discard is a code smell: callers using the SDK's
``Runner.run(conversation_id=..., prompt=...)`` API would see their
arguments silently dropped. Since both map directly to ``responses.create``
kwargs and we're already on that endpoint, the cost of forwarding is two
lines and removes the smell entirely.

- ``conversation_id`` (SDK abstract name) → ``conversation`` (responses.create
  endpoint kwarg). The ``Conversation`` type accepts ``str`` directly, so
  no translation is needed.
- ``prompt`` is the same name on both sides.

Both follow the same opt-in pattern as ``previous_response_id`` and
``prompt_cache_key``: ``None`` resolves to ``NOT_GIVEN`` and is omitted
from the request body, so behavior on alternative OpenAI-compatible
backends is unchanged unless a caller explicitly opts in.
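
A minimal sketch of the two newly forwarded kwargs (names per the mapping above; the helper and wrapper function here are illustrative, not the actual code):

```python
from typing import Any, Optional

from openai import NOT_GIVEN


def _given(value: Any) -> Any:
    # Same idea as _non_null_or_not_given: unset values never reach the wire request.
    return value if value is not None else NOT_GIVEN


async def forward_server_state(
    client: Any, *, model: str, input_items: Any,
    conversation_id: Optional[str] = None, prompt: Optional[Any] = None,
):
    return await client.responses.create(
        model=model,
        input=input_items,
        stream=True,
        conversation=_given(conversation_id),  # SDK's conversation_id -> endpoint's `conversation`
        prompt=_given(prompt),                 # same name on both sides
    )
```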
… union

Two ruff fixes for the test file:

- ARG002 on 25 test method signatures: the prior commit
  (forward previous_response_id from SDK kwarg) stripped the
  vestigial ``task_id=sample_task_id`` kwargs from get_response calls,
  but left ``sample_task_id`` in the test method parameter lists.
  The contextvars fixture (``_streaming_context_vars``) already pulls
  ``sample_task_id`` transitively, so the explicit param is redundant.
  Removed from the 25 flagged signatures; preserved on
  ``test_responses_api_streaming`` where it's still used inside the
  body to assert against the streaming context.

- FA102 on _make_response_completed_event: the new test helper used a
  PEP 604 union (``str | None``) without ``from __future__ import
  annotations``. Switched to ``Optional[str]`` to keep the change
  local to the helper rather than retrofitting future annotations
  across the file.
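
For illustration, the helper signature after the FA102 fix (the body and parameter name are a hypothetical reduction; only the helper name and the `Optional[str]` annotation come from the commit):

```python
from typing import Optional
from unittest.mock import MagicMock


def _make_response_completed_event(response_id: Optional[str] = None) -> MagicMock:
    # `Optional[str]` instead of `str | None`: the test module has no
    # `from __future__ import annotations`, so PEP 604 syntax trips ruff's FA102.
    event = MagicMock()
    event.response.id = response_id
    return event
```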
@x x enabled auto-merge May 1, 2026 15:34
auto-merge was automatically disabled May 4, 2026 17:47

Merge commits are not allowed on this repository

@x x merged commit ba5d64b into main May 4, 2026
32 checks passed
@x x deleted the dpeticolas/prompt-cache-upstream branch May 4, 2026 18:47
@stainless-app stainless-app Bot mentioned this pull request May 4, 2026