feat(translate): eager SSE Prelude — TTFB decouples from upstream prefill#220
Conversation
…eams The routing marker (✦ Weave Router → model · reason) was only visible in Anthropic-format responses. Extend it to OpenAI and Gemini surfaces so Cursor and other non-Anthropic clients can see which model was chosen. - Add OpenAIRoutingMarkerWriter: emits a chat.completion.chunk with the marker before the first upstream chunk in streaming responses - Add GeminiRoutingMarkerWriter: emits a Gemini candidate chunk with the marker before upstream data - Extend StripRoutingMarkerFromMessages to handle OpenAI string content (previously only stripped from Anthropic content-block arrays) - Strip markers from inbound OpenAI requests to prevent accumulation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…efill
First-byte latency was gated on the upstream provider's first token because
no SSE frame flushed until upstream's first Write(). For cross-format and
Codex paths this meant TTFB = route_ms + upstream_connect + prefill + first
decode, even though we knew the routing decision much earlier.
Add a Prelude() method on each outermost writer that commits HTTP 200 +
emits its first format-specific frame immediately after the routing
decision. Locks the response to 200 (later upstream errors must surface
in-stream rather than as an HTTP status) but slashes perceived latency.
* OpenAIRoutingMarkerWriter.Prelude — Cursor / OpenAI chat completions
* GeminiRoutingMarkerWriter.Prelude — Gemini same-format
* ResponsesWriter.Prelude — Codex (response.created)
* AnthropicSSETranslator.Prelude — Claude Code -> non-Anthropic
upstream (message_start + marker)
The marker-writer wrap is skipped when the outer sink is ResponsesWriter
since Codex injects its own badge on the first text delta; double-marker
would otherwise appear.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The v0.55 cluster bundle's model_registry.json reintroduces four OSS models that catalog.go didn't list, so ValidateDeployed panicked at boot when ROUTER_SWITCH_TIER_UPGRADE_ENABLED was on. Adds: * mistralai/mistral-small-2603 (TierMid, OpenRouter) * qwen/qwen3-30b-a3b-instruct-2507 (TierMid, Fireworks + OR fallback) * qwen/qwen3-coder (TierHigh, Fireworks + OR fallback) * qwen/qwen3.5-flash-02-23 (TierLow, OpenRouter) The two Fireworks-dedicated rows carry a trailing OpenRouter binding so managed-prod deploys without a Fireworks key still resolve them — same shape as deepseek-v4-pro / kimi-k2.6. Pricing reflects OpenRouter list prices on 2026-05-20; refine if the trainer surfaces real cost data. Regenerated install/install.sh + install/cc-statusline.sh via \`go run ./cmd/genprices\` per catalog/CLAUDE.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
4c03fe4 to
0a659ac
Compare
…irst Bugbot caught: ResponsesWriter.Prelude sets httpHeadersSent=true before routing completes (Prelude fires immediately at ProxyOpenAIResponses entry, before ProxyOpenAIChatCompletion runs routing). The guard then short-circuited every later WriteHeader call, including the one where the proxy has stamped x-router-model with the actually-routed model. Result: response.completed reported the requested model name forever, and computeBadgeText never showed the swap indicator because t.model == t.requestedModel for the whole stream. Move the header-read above the httpHeadersSent guard so it always runs, while still letting the guard skip the duplicate inner.WriteHeader call. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 7c7dbac. Configure here.
| log.Error("Gemini routing-marker prelude failed", "err", err) | ||
| } | ||
| sink = mw | ||
| } |
There was a problem hiding this comment.
Gemini routing marker lacks ingress stripping
Medium Severity
The new GeminiRoutingMarkerWriter injects routing markers into Gemini-format SSE responses as real content (text parts in candidates). StripRoutingMarkerFromMessages is called on the Anthropic and OpenAI ingress paths but not on the Gemini path (ProxyGemini), and it only handles messages[] format — not Gemini's contents[].parts[].text. Clients that echo back conversation history will re-send the marker every turn, compounding token waste and destabilizing prefix caching at Google.
Additional Locations (1)
Triggered by learned rule: Router-injected response decorations must be stripped on ingress
Reviewed by Cursor Bugbot for commit 7c7dbac. Configure here.
| t.statusCode = http.StatusOK | ||
| t.streaming = true | ||
| t.inner.WriteHeader(http.StatusOK) | ||
| t.headersEmitted = true |
There was a problem hiding this comment.
Prelude skips headersEmitted guard unlike other implementations
Low Severity
AnthropicSSETranslator.Prelude calls t.inner.WriteHeader(http.StatusOK) unconditionally without checking t.headersEmitted first. All three other Prelude implementations (GeminiRoutingMarkerWriter, OpenAIRoutingMarkerWriter, ResponsesWriter) guard this call with a headersEmitted/httpHeadersSent check. This inconsistency could cause a double WriteHeader call on the inner writer if the control flow ever changes so that WriteHeader is called before Prelude.
Reviewed by Cursor Bugbot for commit 7c7dbac. Configure here.
… failover The merged-in eager SSE Prelude (main #220) writes HTTP 200 + message_start to the client writer before the upstream is invoked, so firstByteGuard.written flipped pre-upstream and every multi-binding streaming request short-circuited to a single attempt. After the merge, failover was dead-on-arrival on exactly the OSS deepseek/qwen/moonshot paths the feature was designed for. Replace firstByteGuard with preludeBuffer that absorbs pre-upstream writes into memory and commits to the inner writer only when the first post-seal write arrives (= upstream produced its first byte). Lifecycle: - newPreludeBuffer(w) snapshots the inner Header() so Discard() can undo Prelude's Set/Del. - Pre-Seal writes/WriteHeader land in the buffer; inner untouched. - Seal() marks the end of the Prelude phase. - First post-Seal write triggers commit(): flush bufStatus + bufBody + Flush, then pass through. Committed() is the new retry gate. - Discard() resets pre-commit state and restores Header(). Conditional wrap: ONLY engage preludeBuffer when len(bindings) > 1. Single-binding requests keep main's TTFB-decoupled Prelude semantics verbatim — the buffer adds no latency to the only path that gets the Prelude TTFB win. Wired through: - ProxyMessages: preludeBuffer when multi-binding; per-attempt closure calls buf.Seal() between translator.Prelude and p.Proxy. - ProxyOpenAIChatCompletion: same pattern; OpenAIRoutingMarkerWriter moves inside the per-attempt closure (makeMarkerSink) so retries re-emit into a fresh buffer state. - ProxyOpenAIResponses: explicit wrapper.Prelude removed at the Responses entry point; deferred to ProxyOpenAIChatCompletion which fires it eagerly only when single-binding. Multi-binding /v1/responses relies on ResponsesWriter's lazy emitCreated-on-first-Write — small TTFB regression for Codex on multi-binding models, the trade for failover correctness on /v1/responses. Format-specific exhaustion rendering via new failoverInputs.flushErr: - ProxyMessages uses flushUpstreamErrorAsAnthropic, translating the upstream OpenAI/Fireworks JSON envelope via translate.OpenAIToAnthropicError so the Anthropic-format client sees `{"type":"error","error":{...}}` rather than the raw upstream shape. - ProxyOpenAIChatCompletion uses flushBufferedIfPresent (passthrough). - Hop-by-hop headers and Content-Type are scrubbed/forced before write. Tests: - Existing dispatch tests updated to use preludeBuffer with Seal() inside the test closure (mimics production order). - Three new TestPreludeBuffer_* cases assert: pre-seal buffering + post-seal first-write commit + body ordering, Discard's header restore across Set/Del, pre-commit Flush no-op. - All 21 packages green via go test -tags=no_onnx ./... Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…outer-200
Wires real Service + real openaicompat clients + two httptest upstreams
to assert the full chain: dispatch loop, preludeBuffer, format-specific
flushErr, per-attempt prep rebuild. Catches both blocker-class holes by
construction:
- TestProxyMessages_FireworksFailureFallbackToOpenRouter exercises a
503 from Fireworks + 200 SSE from OpenRouter. Asserts the client
receives valid Anthropic SSE (the failover is invisible at the wire
layer), the x-router-fallback-from header surfaces the primary, and
OpenRouter's request body carries the `provider`/`reasoning` gates
that emit_openai.go only writes when opts.TargetProvider ==
openrouter — proves the per-attempt prep rebuild is wired
end-to-end.
- TestProxyMessages_BothBindingsFail exhausts every binding with
upstream-shape JSON errors and asserts that the customer-facing
response is rendered in Anthropic's `{"type":"error", ...}` shape
via translate.OpenAIToAnthropicError, not the raw OpenAI envelope.
Validates flushUpstreamErrorAsAnthropic.
- TestProxyMessages_SingleBindingPreservesEagerPrelude wires a
single-binding Anthropic-native model and asserts the response
streams cleanly with no fallback header — preludeBuffer is NOT
engaged on this path, preserving main #220's TTFB-decoupled Prelude
semantics for everything that isn't a multi-binding OSS model.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


Stacked on #$(gh pr list --head makosblade/inject-model-choice-non-anthropic --json number --jq '.[0].number' 2>/dev/null || echo '?') (the routing-marker-injection PR). Assume that merges first.
Why
Cross-format SSE clients (and Codex via /v1/responses) waited the full upstream prefill + first-decode for any visible byte, because no frame flushed until upstream's first `Write()`. Even after the routing decision was made we sat on it.
What
Add a `Prelude(streaming bool)` method on each outermost SSE writer. Proxy entry points call it right after the routing decision, which commits HTTP 200 and emits the writer's format-specific first frame immediately.
The marker-wrap is skipped when the outer sink is already `ResponsesWriter` (Codex injects its own badge on the first text delta — wrapping would produce a duplicate).
Trade-off
Prelude locks the response to HTTP 200. Any later upstream error must surface as an in-stream error event rather than an HTTP status. Clients already handle this; middle-boxes that look at HTTP status won't see 5xx.
Test plan
🤖 Generated with Claude Code