feat(translate): eager SSE Prelude — TTFB decouples from upstream prefill#220

Merged: steventohme merged 4 commits into main from steven/router-sse-prelude-v2 on May 20, 2026

@steventohme
Collaborator

Stacked on the routing-marker-injection PR (head branch `makosblade/inject-model-choice-non-anthropic`). Assume that one merges first.

Why

Cross-format SSE clients (and Codex via /v1/responses) waited the full upstream prefill + first-decode for any visible byte, because no frame flushed until upstream's first `Write()`. Even after the routing decision was made we sat on it.

What

Add a `Prelude(streaming bool)` method on each outermost SSE writer. Proxy entry points call it right after the routing decision, which commits HTTP 200 and emits the writer's format-specific first frame immediately.

| Path | Wrapper | Prelude emits |
| --- | --- | --- |
| Claude Code → non-Anthropic upstream | `AnthropicSSETranslator` | `message_start` + routing-marker text block |
| Cursor / OpenAI chat completions | `OpenAIRoutingMarkerWriter` | marker `chat.completion.chunk` |
| Codex /v1/responses | `ResponsesWriter` | `response.created` |
| Gemini same-format | `GeminiRoutingMarkerWriter` | marker candidate chunk |

The marker-wrap is skipped when the outer sink is already `ResponsesWriter` (Codex injects its own badge on the first text delta — wrapping would produce a duplicate).
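As a rough illustration of the Prelude pattern described above, here is a minimal sketch of a marker writer. The type `markerWriter`, its fields, the placeholder marker payload, and the Content-Type check in the fallback path are all assumptions made for this example; the real `OpenAIRoutingMarkerWriter` is not shown in this PR and may differ.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// markerWriter is a hypothetical stand-in for an outermost SSE writer.
type markerWriter struct {
	inner         http.ResponseWriter
	marker        string // pre-rendered first-frame payload (placeholder)
	headersSent   bool
	markerEmitted bool
}

func (w *markerWriter) Header() http.Header { return w.inner.Header() }

// WriteHeader forwards the status once; later calls are ignored.
func (w *markerWriter) WriteHeader(code int) {
	if w.headersSent {
		return
	}
	w.headersSent = true
	w.inner.WriteHeader(code)
}

// emitMarker writes the first frame exactly once and flushes it.
func (w *markerWriter) emitMarker() error {
	if w.markerEmitted {
		return nil
	}
	w.markerEmitted = true
	if _, err := fmt.Fprintf(w.inner, "data: %s\n\n", w.marker); err != nil {
		return err
	}
	if f, ok := w.inner.(http.Flusher); ok {
		f.Flush()
	}
	return nil
}

// Prelude commits HTTP 200 and emits the first frame before any upstream
// byte arrives. It does nothing for non-streaming requests.
func (w *markerWriter) Prelude(streaming bool) error {
	if !streaming {
		return nil
	}
	w.Header().Set("Content-Type", "text/event-stream")
	w.WriteHeader(http.StatusOK)
	return w.emitMarker()
}

// Write is the upstream-triggered fallback: if Prelude never ran on an SSE
// response, the first write emits the marker; otherwise it is not repeated.
func (w *markerWriter) Write(p []byte) (int, error) {
	w.WriteHeader(http.StatusOK)
	if strings.HasPrefix(w.Header().Get("Content-Type"), "text/event-stream") {
		if err := w.emitMarker(); err != nil {
			return 0, err
		}
	}
	return w.inner.Write(p)
}

// demoPrelude runs Prelude, optionally followed by one upstream chunk, and
// returns everything the client would have received.
func demoPrelude(streaming bool, upstream string) string {
	rec := httptest.NewRecorder()
	w := &markerWriter{inner: rec, marker: `{"marker":true}`}
	_ = w.Prelude(streaming)
	if upstream != "" {
		_, _ = w.Write([]byte(upstream))
	}
	return rec.Body.String()
}
```

Calling `demoPrelude(true, "")` shows the marker frame reaching the client with no upstream write at all, which is the property the new tests in the test plan assert.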

Trade-off

Prelude locks the response to HTTP 200. Any later upstream error must surface as an in-stream error event rather than an HTTP status. Clients already handle this; middle-boxes that look at HTTP status won't see 5xx.
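To make the trade-off concrete, the sketch below shows one way an error path could branch on whether the status line has already been committed. The names `committedWriter` and `reportUpstreamError`, and the exact shape of the in-stream error event, are assumptions for illustration and are not taken from the router's code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// committedWriter tracks whether the status line has already been sent.
type committedWriter struct {
	http.ResponseWriter
	committed bool
}

func (w *committedWriter) WriteHeader(code int) {
	if w.committed {
		return
	}
	w.committed = true
	w.ResponseWriter.WriteHeader(code)
}

func (w *committedWriter) Write(p []byte) (int, error) {
	w.WriteHeader(http.StatusOK) // implicit 200, as in net/http
	return w.ResponseWriter.Write(p)
}

// reportUpstreamError uses a real HTTP status while that is still possible.
// Once Prelude has committed 200, the only channel left is an SSE event.
func reportUpstreamError(w *committedWriter, status int, msg string) {
	if !w.committed {
		http.Error(w, msg, status)
		return
	}
	payload, _ := json.Marshal(map[string]any{
		"type":  "error",
		"error": map[string]any{"type": "upstream_error", "message": msg},
	})
	fmt.Fprintf(w, "event: error\ndata: %s\n\n", payload)
}

// demoError returns the status code and body the client observes.
func demoError(afterPrelude bool) (int, string) {
	rec := httptest.NewRecorder()
	w := &committedWriter{ResponseWriter: rec}
	if afterPrelude {
		w.WriteHeader(http.StatusOK) // what Prelude would already have done
	}
	reportUpstreamError(w, http.StatusBadGateway, "upstream unavailable")
	return rec.Code, rec.Body.String()
}
```

With `afterPrelude` set, the recorded status stays 200 and the failure is only visible inside the stream, which is why status-based monitoring on middle-boxes would miss it.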

Test plan

  • `go test ./internal/translate/... ./internal/proxy/...` green
  • New: `TestOpenAIRoutingMarkerWriter_PreludeFiresBeforeUpstream` — proves marker flushes without any upstream Write, and the upstream-triggered fallback path skips re-emitting.
  • New: `TestOpenAIRoutingMarkerWriter_PreludeNoOpWhenNonStreaming` — non-streaming requests untouched.
  • Manual: hit /v1/messages with Claude Code, /v1/chat/completions with Cursor, /v1/responses with Codex; confirm visible first-frame latency drops to ~routing time.

🤖 Generated with Claude Code

makosblade and others added 2 commits May 20, 2026 15:45
…eams

The routing marker (✦ Weave Router → model · reason) was only visible in
Anthropic-format responses. Extend it to OpenAI and Gemini surfaces so
Cursor and other non-Anthropic clients can see which model was chosen.

- Add OpenAIRoutingMarkerWriter: emits a chat.completion.chunk with the
  marker before the first upstream chunk in streaming responses
- Add GeminiRoutingMarkerWriter: emits a Gemini candidate chunk with the
  marker before upstream data
- Extend StripRoutingMarkerFromMessages to handle OpenAI string content
  (previously only stripped from Anthropic content-block arrays)
- Strip markers from inbound OpenAI requests to prevent accumulation

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…efill

First-byte latency was gated on the upstream provider's first token because
no SSE frame flushed until upstream's first Write(). For cross-format and
Codex paths this meant TTFB = route_ms + upstream_connect + prefill + first
decode, even though we knew the routing decision much earlier.

Add a Prelude() method on each outermost writer that commits HTTP 200 +
emits its first format-specific frame immediately after the routing
decision. Locks the response to 200 (later upstream errors must surface
in-stream rather than as an HTTP status) but slashes perceived latency.

  * OpenAIRoutingMarkerWriter.Prelude   — Cursor / OpenAI chat completions
  * GeminiRoutingMarkerWriter.Prelude   — Gemini same-format
  * ResponsesWriter.Prelude             — Codex (response.created)
  * AnthropicSSETranslator.Prelude      — Claude Code -> non-Anthropic
                                          upstream (message_start + marker)

The marker-writer wrap is skipped when the outer sink is ResponsesWriter
since Codex injects its own badge on the first text delta; double-marker
would otherwise appear.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread internal/translate/responses.go
The v0.55 cluster bundle's model_registry.json reintroduces four OSS
models that catalog.go didn't list, so ValidateDeployed panicked at
boot when ROUTER_SWITCH_TIER_UPGRADE_ENABLED was on.

Adds:
  * mistralai/mistral-small-2603         (TierMid,  OpenRouter)
  * qwen/qwen3-30b-a3b-instruct-2507     (TierMid,  Fireworks + OR fallback)
  * qwen/qwen3-coder                     (TierHigh, Fireworks + OR fallback)
  * qwen/qwen3.5-flash-02-23             (TierLow,  OpenRouter)

The two Fireworks-dedicated rows carry a trailing OpenRouter binding so
managed-prod deploys without a Fireworks key still resolve them — same
shape as deepseek-v4-pro / kimi-k2.6. Pricing reflects OpenRouter list
prices on 2026-05-20; refine if the trainer surfaces real cost data.

Regenerated install/install.sh + install/cc-statusline.sh via
`go run ./cmd/genprices` per catalog/CLAUDE.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@steventohme force-pushed the steven/router-sse-prelude-v2 branch from 4c03fe4 to 0a659ac on May 20, 2026 23:23
…irst

Bugbot caught: ResponsesWriter.Prelude sets httpHeadersSent=true before
routing completes (Prelude fires immediately at ProxyOpenAIResponses
entry, before ProxyOpenAIChatCompletion runs routing). The guard then
short-circuited every later WriteHeader call, including the one where
the proxy has stamped x-router-model with the actually-routed model.

Result: response.completed reported the requested model name forever,
and computeBadgeText never showed the swap indicator because
t.model == t.requestedModel for the whole stream.

Move the header-read above the httpHeadersSent guard so it always runs,
while still letting the guard skip the duplicate inner.WriteHeader call.
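The ordering fix can be sketched as follows. This is a reduced, hypothetical `responsesWriter` with only the fields the commit message mentions; the `innerCalls` counter and the `swapped` helper exist purely for the demonstration, and `Prelude` is simplified to take no arguments.

```go
package main

import (
	"net/http"
	"net/http/httptest"
)

// responsesWriter is a hypothetical reduction of ResponsesWriter.
type responsesWriter struct {
	inner           http.ResponseWriter
	requestedModel  string
	model           string
	httpHeadersSent bool
	innerCalls      int // counts inner.WriteHeader calls, demo only
}

func (t *responsesWriter) Header() http.Header { return t.inner.Header() }

// Prelude commits HTTP 200 early, before routing has stamped x-router-model.
func (t *responsesWriter) Prelude() {
	if t.httpHeadersSent {
		return
	}
	t.httpHeadersSent = true
	t.innerCalls++
	t.inner.WriteHeader(http.StatusOK)
}

// WriteHeader reads the routed model first and only then applies the guard,
// so the read still happens when Prelude has already sent the headers.
func (t *responsesWriter) WriteHeader(code int) {
	if m := t.Header().Get("x-router-model"); m != "" {
		t.model = m
	}
	if t.httpHeadersSent {
		return // skip the duplicate inner.WriteHeader call
	}
	t.httpHeadersSent = true
	t.innerCalls++
	t.inner.WriteHeader(code)
}

// swapped reports whether routing picked a different model than requested.
func (t *responsesWriter) swapped() bool { return t.model != t.requestedModel }

// demoRoutedModel replays the order from the bug report: Prelude first,
// then the proxy stamps the header and calls WriteHeader.
func demoRoutedModel() (model string, swapped bool, innerCalls int) {
	rec := httptest.NewRecorder()
	t := &responsesWriter{inner: rec, requestedModel: "requested-model", model: "requested-model"}
	t.Prelude()
	t.Header().Set("x-router-model", "routed-model")
	t.WriteHeader(http.StatusOK)
	return t.model, t.swapped(), t.innerCalls
}
```

If the header read sat below the guard instead, `demoRoutedModel` would keep reporting `requested-model` and `swapped` would stay false, matching the symptom described above.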

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@steventohme changed the base branch from makosblade/inject-model-choice-non-anthropic to main on May 20, 2026 23:25
@steventohme merged commit 9834501 into main on May 20, 2026
8 checks passed
@steventohme deleted the steven/router-sse-prelude-v2 branch on May 20, 2026 23:32
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

Reviewed by Cursor Bugbot for commit 7c7dbac.

Comment thread internal/proxy/gemini.go
log.Error("Gemini routing-marker prelude failed", "err", err)
}
sink = mw
}

Gemini routing marker lacks ingress stripping

Medium Severity

The new GeminiRoutingMarkerWriter injects routing markers into Gemini-format SSE responses as real content (text parts in candidates). StripRoutingMarkerFromMessages is called on the Anthropic and OpenAI ingress paths but not on the Gemini path (ProxyGemini), and it only handles messages[] format — not Gemini's contents[].parts[].text. Clients that echo back conversation history will re-send the marker every turn, compounding token waste and destabilizing prefix caching at Google.
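One possible shape for the missing Gemini ingress strip is sketched below. It assumes the marker always starts with the literal prefix `✦ Weave Router →` and occupies the first line of a model-role text part; the types `geminiPart` and `geminiContent` and both function names are hypothetical reductions of the `contents[].parts[].text` request layout, not the router's actual types.

```go
package main

import "strings"

// markerPrefix is the assumed leading text of a router-injected marker line.
const markerPrefix = "✦ Weave Router →"

// geminiPart and geminiContent model only the fields this sketch needs.
type geminiPart struct {
	Text string `json:"text,omitempty"`
}

type geminiContent struct {
	Role  string       `json:"role"`
	Parts []geminiPart `json:"parts"`
}

// stripMarkerFromText removes a leading marker line, plus the blank lines
// that follow it. Text without the marker prefix is returned unchanged.
func stripMarkerFromText(s string) string {
	if !strings.HasPrefix(s, markerPrefix) {
		return s
	}
	_, rest, _ := strings.Cut(s, "\n")
	return strings.TrimLeft(rest, "\n")
}

// stripMarkerFromContents cleans model-role turns that a client echoed back.
// Parts that held nothing but the marker are dropped; parts that never had
// text (for example non-text parts) are kept as they are.
func stripMarkerFromContents(contents []geminiContent) []geminiContent {
	out := make([]geminiContent, 0, len(contents))
	for _, c := range contents {
		if c.Role != "model" {
			out = append(out, c)
			continue
		}
		kept := make([]geminiPart, 0, len(c.Parts))
		for _, p := range c.Parts {
			stripped := stripMarkerFromText(p.Text)
			if stripped == "" && p.Text != "" {
				continue // the part contained only the marker
			}
			p.Text = stripped
			kept = append(kept, p)
		}
		c.Parts = kept
		out = append(out, c)
	}
	return out
}
```

Restricting the strip to `model` turns is a design choice in this sketch: the router only injects the marker into responses, so user-authored text that happens to begin with the same characters is left alone.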


Triggered by learned rule: Router-injected response decorations must be stripped on ingress

t.statusCode = http.StatusOK
t.streaming = true
t.inner.WriteHeader(http.StatusOK)
t.headersEmitted = true

Prelude skips headersEmitted guard unlike other implementations

Low Severity

AnthropicSSETranslator.Prelude calls t.inner.WriteHeader(http.StatusOK) unconditionally without checking t.headersEmitted first. All three other Prelude implementations (GeminiRoutingMarkerWriter, OpenAIRoutingMarkerWriter, ResponsesWriter) guard this call with a headersEmitted/httpHeadersSent check. This inconsistency could cause a double WriteHeader call on the inner writer if the control flow ever changes so that WriteHeader is called before Prelude.

steventohme added a commit that referenced this pull request May 24, 2026
… failover

The merged-in eager SSE Prelude (main #220) writes HTTP 200 +
message_start to the client writer before the upstream is invoked, so
firstByteGuard.written flipped pre-upstream and every multi-binding
streaming request short-circuited to a single attempt. After the merge,
failover was dead-on-arrival on exactly the OSS deepseek/qwen/moonshot
paths the feature was designed for.

Replace firstByteGuard with preludeBuffer that absorbs pre-upstream
writes into memory and commits to the inner writer only when the first
post-seal write arrives (= upstream produced its first byte). Lifecycle:
  - newPreludeBuffer(w) snapshots the inner Header() so Discard()
    can undo Prelude's Set/Del.
  - Pre-Seal writes/WriteHeader land in the buffer; inner untouched.
  - Seal() marks the end of the Prelude phase.
  - First post-Seal write triggers commit(): flush bufStatus + bufBody
    + Flush, then pass through. Committed() is the new retry gate.
  - Discard() resets pre-commit state and restores Header().
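The lifecycle listed above can be approximated with the sketch below. It reuses the names from the commit message (`newPreludeBuffer`, `Seal`, `Committed`, `Discard`, `bufStatus`, `bufBody`), but the bodies are guesses written for illustration; in particular, how the real type treats a post-Seal `WriteHeader` is an assumption here.

```go
package main

import (
	"bytes"
	"net/http"
	"net/http/httptest"
)

// preludeBuffer is a simplified stand-in for the type the commit describes.
type preludeBuffer struct {
	inner     http.ResponseWriter
	snapshot  http.Header  // inner headers at construction, restored by Discard
	bufStatus int          // status recorded before commit (0 means none)
	bufBody   bytes.Buffer // body bytes written before Seal
	sealed    bool
	committed bool
}

func newPreludeBuffer(w http.ResponseWriter) *preludeBuffer {
	return &preludeBuffer{inner: w, snapshot: w.Header().Clone()}
}

func (b *preludeBuffer) Header() http.Header { return b.inner.Header() }
func (b *preludeBuffer) Seal()               { b.sealed = true }
func (b *preludeBuffer) Committed() bool     { return b.committed }

// WriteHeader only records the first status until commit (assumed behavior).
func (b *preludeBuffer) WriteHeader(code int) {
	if !b.committed && b.bufStatus == 0 {
		b.bufStatus = code
	}
}

// Write buffers before Seal; the first post-Seal write commits the buffered
// status and body, then passes the new bytes through.
func (b *preludeBuffer) Write(p []byte) (int, error) {
	if !b.sealed {
		return b.bufBody.Write(p)
	}
	if !b.committed {
		if err := b.commit(); err != nil {
			return 0, err
		}
	}
	return b.inner.Write(p)
}

func (b *preludeBuffer) commit() error {
	b.committed = true
	if b.bufStatus == 0 {
		b.bufStatus = http.StatusOK
	}
	b.inner.WriteHeader(b.bufStatus)
	_, err := b.inner.Write(b.bufBody.Bytes())
	b.bufBody.Reset()
	b.Flush()
	return err
}

// Flush is a no-op before commit.
func (b *preludeBuffer) Flush() {
	if f, ok := b.inner.(http.Flusher); ok && b.committed {
		f.Flush()
	}
}

// Discard drops pre-commit state and restores the header snapshot.
func (b *preludeBuffer) Discard() {
	if b.committed {
		return
	}
	b.bufStatus, b.sealed = 0, false
	b.bufBody.Reset()
	h := b.inner.Header()
	for k := range h {
		delete(h, k)
	}
	for k, v := range b.snapshot {
		h[k] = append([]string(nil), v...)
	}
}

// demoFailover simulates a failed first attempt followed by a successful one.
func demoFailover() (body, attemptHeader string, committedEarly bool) {
	rec := httptest.NewRecorder()
	buf := newPreludeBuffer(rec)

	buf.Header().Set("X-Attempt", "1") // attempt 1: Prelude runs, upstream fails
	buf.WriteHeader(http.StatusOK)
	buf.Write([]byte("prelude-1;"))
	buf.Seal()
	committedEarly = buf.Committed()
	buf.Discard()
	attemptHeader = rec.Header().Get("X-Attempt")

	buf.WriteHeader(http.StatusOK) // attempt 2: upstream produces a byte
	buf.Write([]byte("prelude-2;"))
	buf.Seal()
	buf.Write([]byte("upstream"))
	return rec.Body.String(), attemptHeader, committedEarly
}
```

In `demoFailover` the client never sees anything from the discarded first attempt, and `Committed()` stays false until the second attempt's upstream byte arrives, which is what makes it usable as the retry gate.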

Conditional wrap: ONLY engage preludeBuffer when len(bindings) > 1.
Single-binding requests keep main's TTFB-decoupled Prelude semantics
verbatim — the buffer adds no latency to the only path that gets the
Prelude TTFB win.

Wired through:
  - ProxyMessages: preludeBuffer when multi-binding; per-attempt closure
    calls buf.Seal() between translator.Prelude and p.Proxy.
  - ProxyOpenAIChatCompletion: same pattern; OpenAIRoutingMarkerWriter
    moves inside the per-attempt closure (makeMarkerSink) so retries
    re-emit into a fresh buffer state.
  - ProxyOpenAIResponses: explicit wrapper.Prelude removed at the
    Responses entry point; deferred to ProxyOpenAIChatCompletion which
    fires it eagerly only when single-binding. Multi-binding /v1/responses
    relies on ResponsesWriter's lazy emitCreated-on-first-Write — small
    TTFB regression for Codex on multi-binding models, the trade for
    failover correctness on /v1/responses.

Format-specific exhaustion rendering via new failoverInputs.flushErr:
  - ProxyMessages uses flushUpstreamErrorAsAnthropic, translating the
    upstream OpenAI/Fireworks JSON envelope via
    translate.OpenAIToAnthropicError so the Anthropic-format client sees
    `{"type":"error","error":{...}}` rather than the raw upstream shape.
  - ProxyOpenAIChatCompletion uses flushBufferedIfPresent (passthrough).
  - Hop-by-hop headers and Content-Type are scrubbed/forced before write.
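For readers unfamiliar with the two error envelopes, the following sketch shows the kind of reshaping the Anthropic path needs. The function `openAIToAnthropicErrorSketch` is hypothetical and is not `translate.OpenAIToAnthropicError`; it assumes the upstream body follows the common OpenAI `{"error":{"message":...}}` layout and always reports `api_error` rather than mapping error types.

```go
package main

import "encoding/json"

// openAIErrorEnvelope is the assumed upstream error layout.
type openAIErrorEnvelope struct {
	Error struct {
		Message string `json:"message"`
		Type    string `json:"type"`
	} `json:"error"`
}

// anthropicError is the client-facing {"type":"error","error":{...}} shape.
type anthropicError struct {
	Type  string `json:"type"`
	Error struct {
		Type    string `json:"type"`
		Message string `json:"message"`
	} `json:"error"`
}

// openAIToAnthropicErrorSketch re-wraps an upstream error body. When the
// body cannot be parsed, the raw text is carried over as the message.
func openAIToAnthropicErrorSketch(upstream []byte) []byte {
	out := anthropicError{Type: "error"}
	out.Error.Type = "api_error"
	out.Error.Message = string(upstream)

	var in openAIErrorEnvelope
	if err := json.Unmarshal(upstream, &in); err == nil && in.Error.Message != "" {
		out.Error.Message = in.Error.Message
	}
	b, _ := json.Marshal(out)
	return b
}
```

A production translator would likely also map upstream error types and HTTP statuses onto the closest Anthropic error type; this sketch deliberately leaves that out.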

Tests:
  - Existing dispatch tests updated to use preludeBuffer with Seal()
    inside the test closure (mimics production order).
  - Three new TestPreludeBuffer_* cases assert: pre-seal buffering +
    post-seal first-write commit + body ordering, Discard's header
    restore across Set/Del, pre-commit Flush no-op.
  - All 21 packages green via go test -tags=no_onnx ./...

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
steventohme added a commit that referenced this pull request May 24, 2026
…outer-200

Wires real Service + real openaicompat clients + two httptest upstreams
to assert the full chain: dispatch loop, preludeBuffer, format-specific
flushErr, per-attempt prep rebuild. Catches both blocker-class holes by
construction:

  - TestProxyMessages_FireworksFailureFallbackToOpenRouter exercises a
    503 from Fireworks + 200 SSE from OpenRouter. Asserts the client
    receives valid Anthropic SSE (the failover is invisible at the wire
    layer), the x-router-fallback-from header surfaces the primary, and
    OpenRouter's request body carries the `provider`/`reasoning` gates
    that emit_openai.go only writes when opts.TargetProvider ==
    openrouter — proves the per-attempt prep rebuild is wired
    end-to-end.

  - TestProxyMessages_BothBindingsFail exhausts every binding with
    upstream-shape JSON errors and asserts that the customer-facing
    response is rendered in Anthropic's `{"type":"error", ...}` shape
    via translate.OpenAIToAnthropicError, not the raw OpenAI envelope.
    Validates flushUpstreamErrorAsAnthropic.

  - TestProxyMessages_SingleBindingPreservesEagerPrelude wires a
    single-binding Anthropic-native model and asserts the response
    streams cleanly with no fallback header — preludeBuffer is NOT
    engaged on this path, preserving main #220's TTFB-decoupled Prelude
    semantics for everything that isn't a multi-binding OSS model.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2 participants