feat(voice): backend voice picker + advanced controls behind disclosure (#742) #743
heavygee wants to merge 49 commits into
Stack note: This PR is blocked by #692. For review, prefer the incremental diff on the fork: heavygee/hapi@feat/pluggable-voice-backend...feat/voice-selection-all-backends (4 commits: catalog scaffold, backend chooser, Gemini/Qwen descriptions, Playwright dogfood script).
Findings
- [Major] Qwen proxy forwards arbitrary client frames with the hub API key: `POST /api/voice/qwen-token` correctly keeps the DashScope key server-side, but the new `/api/voice/qwen-ws` proxy then opens DashScope with that key and blindly forwards every browser frame upstream. Unlike the Gemini path, which rejects client-provided setup frames, any authenticated web client can connect directly and send its own `session.update`/`response.create` payloads, turning the hub into a generic DashScope proxy and allowing client-controlled instructions/tools under server credentials. Evidence: `hub/src/web/server.ts:159`.
  Suggested fix: Move the initial `session.update` construction into the hub proxy, as Gemini does with `buildGeminiLiveSetupMessage`, and only let the browser send runtime audio/tool-response events afterward.

    const allowedQwenRuntimeEvents = new Set([
        'input_audio_buffer.append',
        'input_audio_buffer.commit',
        'response.create',
        'conversation.item.create'
    ])

    function parseQwenClientEvent(message: string | ArrayBuffer | Uint8Array): { type?: string } | null {
        try {
            return JSON.parse(decodeWsText(message)) as { type?: string }
        } catch {
            return null
        }
    }

    message(clientWs: ServerWebSocket<unknown>, message: string | ArrayBuffer | Uint8Array) {
        const event = parseQwenClientEvent(message)
        if (!event?.type || !allowedQwenRuntimeEvents.has(event.type)) {
            try { clientWs.close(1008, 'Client-provided Qwen setup is not allowed') } catch { /* */ }
            return
        }
        const upstream = upstreamMap.get(clientWs)
        if (upstream?.readyState === WebSocket.OPEN) {
            upstream.send(message)
        }
    }
Questions
- None.
Summary
- Review mode: initial
- One issue found: Qwen realtime proxy needs the same server-owned setup boundary as Gemini before this is safe to merge.
Testing
- Not run (automation; static review only, per PR security instructions).
HAPI Bot
|
Noise check (re: default diff vs
GitHub would not let us retarget base to Automated review note: The |
Rebased from Overbaker/hapi#401 onto current main.
Adds a pluggable voice backend architecture that extends the existing ElevenLabs integration:
- Gemini 2.5 Live (gemini-live): Google real-time audio via WebSocket with full function calling (messageCodingAgent, processPermissionRequest)
- Qwen Realtime (qwen-realtime): Alibaba DashScope via hub WebSocket proxy (browser cannot set Authorization header directly)
- VoiceBackendSession: dynamic backend selector with React.lazy loading, gates voice button until backend module is registered
- Hub WS proxies: JWT-authenticated /api/voice/gemini-ws and /api/voice/qwen-ws endpoints in Bun.serve, with message queueing during upstream connect to prevent dropped setup frames
- AudioWorklet pipeline: inline Blob URL recorder, 24 kHz PCM player, serial tool call execution, AudioContext created in user gesture for mobile
- Backend discovery: GET /voice/backend + POST /voice/gemini-token / POST /voice/qwen-token hub routes; frontend auto-detects active backend
Merge notes:
- Rebased 135 upstream commits cleanly; HappyComposer keeps upstream's configurable enter-behavior setting (supersedes hard-coded Ctrl+Enter)
- Converted gemini test files from bun:test to vitest (web package uses vitest)
- All 221 hub tests and 636 web tests pass; TypeScript clean
turnComplete handler was unconditionally calling setMuted(false), which re-enabled the mic track even when the user had manually muted. Now restores to state.micMuted instead.
buildGeminiLiveConfig was appending VOICE_CHINESE_LANGUAGE_BLOCK which forced Gemini to always respond in Mandarin regardless of user locale. Gemini now uses the neutral base prompt and responds in the language the user speaks to it, consistent with the ElevenLabs behaviour.
If the session closes while Gemini is mid-speech, cleanup() left state.modelSpeaking=true. The next startSession() would then drop all mic audio in sendAudioChunk() until a model turn eventually flipped the flag — effectively deaf until page reload.
ws.onclose operated on module-level state.ws, not the socket that fired the event. A rapid stop/restart could cause the old socket's onclose to call cleanup() after the new socket was assigned, tearing down the live session. Guard with `if (state.ws !== ws) return` before cleanup. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
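The stale-socket guard described in this commit can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's code: `SocketLike`, `attach`, and the `cleanups` counter are hypothetical stand-ins for the real WebSocket and session state.

```typescript
// Hypothetical stand-in for the browser WebSocket; only onclose matters here.
interface SocketLike {
    onclose: (() => void) | null
}

// Module-level state, as in the commit message (shape assumed for illustration).
const state: { ws: SocketLike | null; cleanups: number } = { ws: null, cleanups: 0 }

function cleanup(): void {
    state.ws = null
    state.cleanups += 1
}

function attach(ws: SocketLike): void {
    state.ws = ws
    ws.onclose = () => {
        // Ignore close events from a socket that has already been replaced.
        if (state.ws !== ws) return
        cleanup()
    }
}
```

Because the handler closes over its own `ws`, a late close event from the old socket compares unequal to `state.ws` and returns before `cleanup()` can touch the new session.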
Matches the Gemini fix — both backends now use VOICE_SYSTEM_PROMPT without the Chinese language block, giving consistent English-default behaviour across all non-ElevenLabs backends. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
Adds a "Proactive voice" toggle (default: off = reactive) to the Voice Assistant settings section. Reactive (default): initial context and agent-ready events are fed silently; the assistant waits for the user to speak first. Proactive: original behaviour — Gemini/Qwen narrate context on connect and speak unprompted when the agent finishes a task. ElevenLabs is also affected via onReady sending a user message rather than a silent update. Covers all three backends uniformly. localStorage key: hapi-voice-proactive. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
… visibility
- hub/server.ts: add toClientCloseCode() to normalize reserved upstream close codes (1005/1006/1015) to 1011 before forwarding to browser; abnormal upstream drops (1006) would otherwise throw on clientWs.close() and leave the browser socket open
- realtime/index.ts: remove static GeminiLiveVoiceSession and QwenVoiceSession barrel exports; VoiceBackendSession lazy-imports both, so barrel re-exports created static dependencies that defeated the intended code-split
- App.tsx: gate global useVisibilityReporter on !sessionEventSubscription so the always-on SSE connection does not suppress native Web Push notifications for sessions the user is not currently viewing
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
…toggle label
- buildGeminiLiveConfig() now accepts optional language param; appends VOICE_CHINESE_LANGUAGE_BLOCK only when language === 'zh'
- GeminiLiveVoiceSession passes config.language through
- QwenVoiceSession conditionally builds basePrompt from language setting
- Fixes silent no-op when user selects Chinese in voice settings on Gemini/Qwen backends (was ElevenLabs-only)
- Rename voice-start toggle label to 'Start voice session with summary'
- Fix description: clarifies the choice is about session-open behaviour (summary vs greeting), not ongoing narration
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
Gemini Live has no built-in first-message like ElevenLabs agents do; without an explicit turnComplete:true it sits silently. In reactive mode (default, toggle off) now sends a greeting instruction after any silent context feed so Gemini introduces itself and invites the user to speak. Proactive mode is unchanged: the context summary is the opening speech. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
…reeting - VOICE_SYSTEM_PROMPT: explicit instruction never to call itself Gemini, Google, or any underlying model/provider name — always HAPI - Greeting trigger text: instruct to greet as HAPI only, suppress model name and any reference to context/recent activity in the opening line via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
Gemini + Qwen client:
- onerror now sets setupDone/sessionReady and nulls state.ws before
calling reject(), so the stale-close guard trips in onclose and
prevents a duplicate statusCallback('error') on WS failure
Gemini client:
- Proactive mode with no initialContext now falls through to the
greeting trigger instead of sitting silently
- Remove unused handleBargeIn callback (dead code)
Qwen client:
- Add input_audio_sample_rate: 16000 to session.update so PCM rate
is declared explicitly rather than relying on DashScope's default
Hub proxy:
- Remove no-op ternary in Gemini flush loop and message handler
(typeof x === 'string' ? x : x); use upstream.send(msg) directly
- Qwen onerror now calls upstreamMap.delete() before closing client,
eliminating the stale map entry window
- Align Qwen hub fallback model string with QWEN_REALTIME_MODEL
constant ('qwen3-omni-flash-realtime')
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
hub/voice.ts:
- Replace string-concat WS URL construction with buildVoiceWsUrl() which
uses URL API to set protocol/pathname cleanly — fixes double-slash when
HAPI_PUBLIC_URL has a trailing slash (would silently skip the proxy route)
QwenVoiceSession.tsx:
- Wrap tool definitions in {type:'function', function:{...}} as required
by Qwen-Omni realtime schema — previous flat shape caused session.update
rejection before audio capture could start
- Use pcm16/pcm24 audio formats matching DashScope spec; remove
input_audio_sample_rate (encoded in format name)
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
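The trailing-slash fix in hub/voice.ts can be illustrated with a small sketch. The function name `buildVoiceWsUrl` comes from the commit message; its signature, the scheme mapping, and the example host are assumptions made for illustration only.

```typescript
// Sketch only: the real buildVoiceWsUrl() signature is not shown in this PR page.
function buildVoiceWsUrl(publicUrl: string, path: string): string {
    const url = new URL(publicUrl)
    // Map http(s) to ws(s); both are "special" schemes, so the setter accepts the change.
    url.protocol = url.protocol === 'https:' ? 'wss:' : 'ws:'
    // Assigning pathname replaces the whole path, so a trailing slash on the
    // base URL cannot produce '//api/...'.
    url.pathname = path
    url.search = ''
    return url.toString()
}
```

With string concatenation, a base of `https://hub.example.com/` plus `/api/voice/qwen-ws` yields a double slash; here both `https://hub.example.com` and `https://hub.example.com/` produce the same result.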
…ose codes
GeminiLiveVoiceSession + QwenVoiceSession:
- startAudioCapture() is now async and awaits recorder.start() before
calling setMuted() — previously setMuted ran before getUserMedia resolved
so a session restarted while muted would open the mic anyway
- statusCallback('connected') now fires after audio is ready
- setMuted() called unconditionally (not just when true) to correctly
apply saved state in either direction
hub/src/web/server.ts:
- Both Gemini and Qwen close() handlers now pass the client code through
toClientCloseCode() before forwarding to upstream — prevents reserved
codes (e.g. 1006) from causing WebSocket.close() to throw and leave
the upstream session open until provider timeout
- Reason string capped at 123 bytes (WebSocket protocol limit)
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
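The close-code handling in this commit can be sketched roughly as below. `toClientCloseCode` is named in the commit message, but its body here is an assumption based on the description (reserved codes 1005/1006/1015 mapped to 1011); `capCloseReason` is a hypothetical helper name for the 123-byte reason cap.

```typescript
// Assumed behaviour: codes that may not be sent in a Close frame become 1011.
function toClientCloseCode(code: number): number {
    switch (code) {
        case 1005: // no status received
        case 1006: // abnormal closure
        case 1015: // TLS handshake failure
            return 1011
        default:
            return code
    }
}

// Hypothetical helper: keep the UTF-8 encoded reason within 123 bytes
// without splitting a multi-byte character.
function capCloseReason(reason: string, maxBytes: number = 123): string {
    const encoder = new TextEncoder()
    let out = ''
    let used = 0
    for (const ch of reason) {
        const size = encoder.encode(ch).length
        if (used + size > maxBytes) break
        out += ch
        used += size
    }
    return out
}
```

The limit is counted in bytes, not characters, so a reason made of two-byte characters is cut at 61 characters rather than 123.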
An unhandled rejection inside the async onmessage callback does not
propagate to the outer startSession Promise — the UI hangs on
'connecting' and the provider socket stays partially open. Wrapping
the await in try/catch calls cleanup()/statusCallback('error')/reject()
so failures surface correctly.
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
…alling back to ElevenLabs
fetchVoiceBackend no longer catches errors and defaults to 'elevenlabs' — any
network or server failure now throws so VoiceBackendSession can surface it via
onStatusChange('error', ...) rather than silently mounting the wrong backend.
VoiceBackendSession also resets backend state to null when api changes, so
a stale ElevenLabs registration from a prior discovery cannot persist into
a new session.
via [HAPI](https://hapi.run)
Co-Authored-By: HAPI <noreply@hapi.run>
…alling back to ElevenLabs Unknown backend strings (future values, typos) now throw rather than defaulting to elevenlabs, closing the narrow remaining form of the original misrouting bug. Also removes the unnecessary `as VoiceBackendResponse` cast. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
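A strict backend parser along the lines of these two commits might look like the sketch below. The backend identifiers are the ones named in this PR; `parseVoiceBackend` and the error text are hypothetical, and the real `fetchVoiceBackend` is not shown on this page.

```typescript
const KNOWN_BACKENDS = ['elevenlabs', 'gemini-live', 'qwen-realtime'] as const
type VoiceBackend = (typeof KNOWN_BACKENDS)[number]

// Hypothetical helper: throw on anything that is not a known backend id,
// instead of silently defaulting to 'elevenlabs'.
function parseVoiceBackend(value: unknown): VoiceBackend {
    const match = KNOWN_BACKENDS.find((known) => known === value)
    if (!match) {
        throw new Error(`Unknown voice backend: ${String(value)}`)
    }
    return match
}
```

Returning the matched constant rather than the input avoids a type cast, and a typo or a future backend value surfaces as an error the caller can report.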
…r base64 uploads Qwen session.updated handler now sends the same proactive summary or greeting trigger that Gemini does — previously it started silently in both proactive and reactive modes. maxHttpBufferSize raised to 68 MiB to account for base64 expansion: 50 MiB decoded files become ~66.7 MiB as base64 JSON, so the previous 55 MiB ceiling would disconnect uploads above ~41 MiB before they reached the CLI. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
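The buffer-size arithmetic in this commit can be checked with a short sketch. It assumes standard padded base64 (4 output bytes per 3 input bytes) and ignores the small JSON envelope around the payload; the helper names are illustrative.

```typescript
const MIB = 1024 * 1024

// Length of padded base64 output for a given number of raw bytes.
function base64Length(decodedBytes: number): number {
    return 4 * Math.ceil(decodedBytes / 3)
}

// Largest raw payload whose base64 form fits in the given budget.
function maxDecodedBytes(base64Budget: number): number {
    return Math.floor(base64Budget / 4) * 3
}

// 50 MiB of file data as base64, in MiB (about 66.7).
const encoded50MiB = base64Length(50 * MIB) / MIB

// Largest file that fit under the old 55 MiB ceiling, in MiB (41.25).
const oldLimitMiB = maxDecodedBytes(55 * MIB) / MIB
```

Under these assumptions a 50 MiB file needs roughly 66.7 MiB on the wire, which fits the new 68 MiB limit with about 1.3 MiB left for the JSON wrapper.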
….update for Qwen text Qwen's realtime API only supports conversation.item.create for function_call_output. Sending it with type:'message' for greetings/context was invalid and could fail before the user spoke. sendTextMessage and sendContextualUpdate now update session instructions via session.update (accumulating context into the system prompt) and trigger response.create only when a spoken reply is needed — matching Qwen's supported client event surface. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
…n start session.updated now returns early after the first ack — subsequent session.update calls (instruction appends) also echo session.updated but must not re-trigger audio capture or the greeting path. currentSessionConfig is now reset to null at the top of startSession so a stale config from a failed previous session cannot leak into the new one. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
Without this guard, a missing wsUrl in the hub token response would silently attempt to connect directly to Google with "proxied" as the API key — producing a confusing auth failure instead of a clear error. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
DashScope realtime API accepts only 'pcm' for both input and output audio formats. The pcm16/pcm24 values caused session.update rejection before audio capture could start, leaving the Qwen backend unusable. Also updates the default voice from Mia (not in the qwen3-omni-flash- realtime voice list) to Cherry, which is documented as supported. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
Failed token fetch, microphone denial, or WebSocket error during setup left state.playbackContext open. Each failure path now calls cleanup() before throwing/rejecting, preventing AudioContext leaks on mobile browsers with hard limits on concurrent contexts. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
Reverts changes to files that shouldn't differ from upstream:
- .gitignore: remove fork-only AGENTS.local.md entry
- web/src/App.tsx: restore dual-subscription SSE pattern (scope-aware)
- web/src/hooks/useSSE.ts: restore SSEScope/scope parameter
- web/src/hooks/useSSE.test.ts: restore (was accidentally deleted)
- web/src/lib/appSseSubscriptions.ts: restore (was accidentally deleted)
- web/src/lib/appSseSubscriptions.test.ts: restore (was accidentally deleted)
- hub/src/sync/syncEngine.ts: restore (off-topic change)
Hub sends HAPI-owned Gemini setup on proxy connect and rejects client setup frames. Qwen proxy always uses QWEN_REALTIME_MODEL instead of a client query parameter. Shared buildGeminiLiveSetupMessage() keeps wire format in one place. Co-authored-by: Cursor <cursoragent@cursor.com>
Expose configured backends from hub, let Settings pick provider when multiple API keys exist, and wire voice selection through Gemini/Qwen/ElevenLabs paths. Co-authored-by: Cursor <cursoragent@cursor.com>
Surface catalog descriptions on the voice row and in the picker, with a hint when preview is ElevenLabs-only. Disabled preview buttons stay visible. Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
When hub/.env (operator-local) sets GEMINI_API_KEY, the 'falls back to elevenlabs for unknown VOICE_BACKEND values' test leaks gemini-live into the backends list. Delete the four non-elevenlabs key env vars defensively at the start of the test, mirroring the cleanup pattern used by the other tests in the same describe block. No behavior change. Co-authored-by: Cursor <cursoragent@cursor.com>
Stacks on voice-selection-all-backends. Adds shared voice-personality presets, Settings accordions (character + backend-specific tuning), and ElevenLabs session overrides from stored preferences. Co-authored-by: Cursor <cursoragent@cursor.com>
Replace append-only personality notes with the bundled HAPI voice system prompt in an advanced editor. User edits replace the base instruction for ElevenLabs, Gemini (incl. hub proxy), and Qwen. Presets only drive TTS sliders unless the user appends delivery text explicitly. Co-authored-by: Cursor <cursoragent@cursor.com>
Split voice instructions into platform fixtures, provider guardrails, and editable identity/character; compose at runtime for ElevenLabs, Gemini, and Qwen. Session history uses a small connect bootstrap plus deferred contextual chunks, gated by the proactive-summary setting. Settings UI shows wire budgets and read-only fixtures. Co-authored-by: Cursor <cursoragent@cursor.com>
The web workspace runs tests via vitest run. Both voice tests imported describe/expect/test from 'bun:test', which vite cannot bundle and caused 'Cannot bundle built-in module bun:test' transform failures during driver soup verify. Only generic test APIs are used; switching the import to 'vitest' is a no-op behavior change. Co-authored-by: Cursor <cursoragent@cursor.com>
|
Heads-up from a downstream soup-rebuild (heavygee/hapi driver/integration, 2026-05-31): To get this branch + the voice-advanced layer building cleanly on the current 1. On
The 2. On
Both files imported Surfacing in case you re-do the rebase from scratch and overwrite these locally - they would silently come back as test failures on the next CI/build. No action needed if you keep them. |
7469b27 to
238ad4c
Compare
Findings
- [Major] Qwen proxy still forwards runtime frames before setup is acknowledged: the browser waits for `session.updated` before starting audio, but an authenticated client can connect directly and send `response.create`, `conversation.item.create`, or instruction-only `session.update` while `pendingSetupMap` is still set. Those frames are forwarded immediately at `hub/src/web/server.ts:188`, before the hub-owned `session.update` sent at `hub/src/web/server.ts:159` is acknowledged. That can run the upstream session under the provider default config or race ahead of HAPI's locked tools/voice/instructions, reopening part of the generic-proxy boundary the previous review called out.
  Suggested fix:

    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

    // after session.created
    pendingSetupMap.delete(clientWs)
    setupAckMap.set(clientWs, false)
    upstream.send(pendingSetup)

    // after session.updated for the hub-owned setup
    setupAckMap.set(clientWs, true)
    for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
    pendingClientFrames.delete(clientWs)

    message(clientWs, message) {
        if (!isQwenSafeClientFrame(message)) { /* close */ return }
        const upstream = upstreamMap.get(clientWs)
        if (upstream?.readyState !== WebSocket.OPEN) return
        if (pendingSetupMap.has(clientWs) || setupAckMap.get(clientWs) !== true) {
            const pending = pendingClientFrames.get(clientWs) ?? []
            pending.push(message)
            pendingClientFrames.set(clientWs, pending)
            return
        }
        upstream.send(message)
    }
Questions
- None.
Summary
- Review mode: follow-up after new commits
- The previous Qwen proxy issue is partly addressed, but the proxy still needs to gate client frames until the hub-owned Qwen setup has been acknowledged.
Testing
- Not run (automation; static review only, per PR security instructions).
HAPI Bot
The previous Qwen proxy hardening sent the hub-owned session.update once upstream's session.created landed, but never waited for session.updated before forwarding client frames. An authenticated client could push response.create / conversation.item.create / instruction-only session.update during that window and run the upstream session under the provider default config or partially-applied state - reopening part of the generic-proxy boundary the prior review called out.
Changes:
- Extract createQwenProxyWebSocketHandler into hub/src/web/qwenProxyHandler.ts so the gating behaviour can be unit-tested without spinning up Bun.serve.
- Track a per-client setupAcked WeakMap and a pendingClientFrames queue. Client frames are queued (after passing isQwenSafeClientFrame) until the upstream emits session.updated; on ack the queue is flushed in order.
- Clear the WeakMaps on upstream error / upstream close / client close so no state leaks across reused references.
- Inject the upstream WebSocket constructor for tests; production keeps the real global WebSocket.
Tests: 4 new ack-gate cases + existing 245 hub tests still green.
Resolves the [Major] thread on PR tiann#743 follow-up review.
Co-authored-by: Cursor <cursoragent@cursor.com>
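The ack-gate behaviour described in this commit (queue client frames until the upstream acknowledges the hub-owned setup, then flush in order) can be sketched in isolation. `SetupAckGate` and its method names are hypothetical; the actual handler in hub/src/web/qwenProxyHandler.ts tracks this state in WeakMaps keyed by the client socket.

```typescript
type Frame = string | ArrayBuffer | Uint8Array

// Hypothetical per-connection gate; one instance per client socket.
class SetupAckGate {
    private acked = false
    private pending: Frame[] = []
    private forward: (frame: Frame) => void

    constructor(forward: (frame: Frame) => void) {
        this.forward = forward
    }

    // Called for each client frame that already passed the safety filter.
    submit(frame: Frame): void {
        if (!this.acked) {
            this.pending.push(frame)
            return
        }
        this.forward(frame)
    }

    // Called when the upstream acknowledges the hub-owned setup.
    acknowledge(): void {
        if (this.acked) return
        this.acked = true
        for (const frame of this.pending.splice(0)) this.forward(frame)
    }

    // Called on upstream error/close or client close so no state leaks.
    reset(): void {
        this.acked = false
        this.pending.length = 0
    }
}
```

Keeping the gate free of socket types is one way to unit-test the ordering guarantee without a running server, which is the same motivation the commit gives for extracting the handler.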
Findings
- [Major] Gate Gemini client frames until setupComplete: the proxy sends HAPI's setup on upstream open, but it then flushes queued non-setup frames immediately and forwards later frames whenever the upstream socket is open. A direct authenticated client can therefore send `clientContent`/tool frames before Google acknowledges the hub-owned setup with `setupComplete`, racing execution ahead of HAPI's locked model, tools, system prompt, language, and voice config. Evidence: `hub/src/web/server.ts:77` and `hub/src/web/server.ts:107`.
  Suggested fix:

    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
    }

    upstream.onmessage = (event) => {
        const text = typeof event.data === 'string'
            ? event.data
            : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
        if (setupCompleteMap.get(clientWs) === false && JSON.parse(text).setupComplete) {
            setupCompleteMap.set(clientWs, true)
            for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
            pendingClientFrames.delete(clientWs)
        }
        if (clientWs.readyState === 1) clientWs.send(event.data)
    }

    message(clientWs, message) {
        if (isGeminiSetupFrame(message)) {
            clientWs.close(1008, 'Client-provided Gemini setup is not allowed')
            return
        }
        const upstream = upstreamMap.get(clientWs)
        if (!upstream || upstream.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
            const pending = pendingClientFrames.get(clientWs) ?? []
            pending.push(message)
            pendingClientFrames.set(clientWs, pending)
            return
        }
        upstream.send(message)
    }
Questions
- None.
Summary
- Review mode: follow-up after new commits
- The previous Qwen ack-gate issue appears addressed in the follow-up, but the newly added Gemini proxy still has the same setup-ack race at its proxy boundary.
Testing
- Not run (automation; static review only, per PR security instructions).
HAPI Bot
| upstream.onopen = () => { | ||
| // Hub-owned setup only — never forward client setup (prevents generic Gemini proxy abuse). | ||
| upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName))) | ||
| for (const queued of pending.splice(0)) { |
[Major] Gate Gemini client frames until setupComplete
This flushes queued non-setup frames immediately after sending HAPI's setup, before Google has acknowledged it with setupComplete; once the upstream socket is open, message() also forwards later frames directly. A direct authenticated client can send clientContent or tool frames before HAPI's hub-owned model, tools, system prompt, language, and voice setup is installed, so the session can race ahead under default or partially-applied config.
Suggested fix:
    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
    }

    // On the first upstream setupComplete, flip the gate and flush queued client frames.
    if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
        setupCompleteMap.set(clientWs, true)
        for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
        pendingClientFrames.delete(clientWs)
    }

    message(clientWs, message) {
        if (isGeminiSetupFrame(message)) {
            clientWs.close(1008, 'Client-provided Gemini setup is not allowed')
            return
        }
        const upstream = upstreamMap.get(clientWs)
        if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
            const pending = pendingClientFrames.get(clientWs) ?? []
            pending.push(message)
            pendingClientFrames.set(clientWs, pending)
            return
        }
        upstream.send(message)
    }
ElevenLabs ConvAI rejects sessions whose `overrides` payload references any
property the agent has not explicitly authorized. The server reply is a malformed
error packet on the LiveKit data channel, which the convai-react SDK then dereferences
as `event.error_type` (undefined -> TypeError -> disconnect), so the only symptom is
a generic "Unknown DataChannel error" that gives no hint of the cause.
Empty-prefs sessions now emit exactly `{ agent: { language } }` for byte-parity
with the upstream/main baseline that the configured agent permits today:
- agent.prompt: only when user has customized identity or character
- tts (full slider object): only when preset diverges from balanced defaults
- tts: { voice_id }: only when a voice was picked and sliders are at defaults
Co-authored-by: Cursor <cursoragent@cursor.com>
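The emission rules listed in this commit can be sketched as a small builder. Everything here is an illustration under assumptions: `buildOverrides`, `VoicePrefs`, the balanced default values, the nested `prompt: { prompt }` shape, and the choice to include `voice_id` alongside diverged sliders are not confirmed by this page.

```typescript
interface TtsSliders {
    stability: number
    similarity_boost: number
    style: number
    speed: number
}

type TtsOverride = Partial<TtsSliders> & { voice_id?: string }

interface VoicePrefs {
    language: string
    customPrompt?: string // set only when identity or character was edited
    voiceId?: string
    sliders?: TtsSliders
}

interface Overrides {
    agent: { language: string; prompt?: { prompt: string } }
    tts?: TtsOverride
}

// Hypothetical "balanced" defaults; the real values live in the preset catalog.
const BALANCED: TtsSliders = { stability: 0.5, similarity_boost: 0.75, style: 0, speed: 1 }

function divergesFromBalanced(s: TtsSliders): boolean {
    return s.stability !== BALANCED.stability
        || s.similarity_boost !== BALANCED.similarity_boost
        || s.style !== BALANCED.style
        || s.speed !== BALANCED.speed
}

function buildOverrides(prefs: VoicePrefs): Overrides {
    // Baseline: exactly { agent: { language } } for empty preferences.
    const overrides: Overrides = { agent: { language: prefs.language } }
    if (prefs.customPrompt) {
        overrides.agent.prompt = { prompt: prefs.customPrompt }
    }
    if (prefs.sliders && divergesFromBalanced(prefs.sliders)) {
        // Assumption: a picked voice rides along with the full slider object.
        const tts: TtsOverride = { ...prefs.sliders }
        if (prefs.voiceId) tts.voice_id = prefs.voiceId
        overrides.tts = tts
    } else if (prefs.voiceId) {
        overrides.tts = { voice_id: prefs.voiceId }
    }
    return overrides
}
```

Emitting nothing beyond the baseline for untouched preferences is what keeps the payload within what an agent without declared overrides will accept.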
…onvAI overrides
UI: wrap fixtures, identity, character, delivery preset, and tuning sliders in a single collapsed "Advanced voice settings" disclosure. Defaults stay quiet; sub-sections start collapsed when the master opens. Shows a "customized" badge if any layer differs from defaults so a user who tweaked settings still knows where to find them.
Hub: on every /voice/token resolution, PATCH the resolved ConvAI agent's platform_settings.overrides to match buildVoiceAgentConfig() (agent.prompt, tts.voice_id/stability/similarity_boost/style/speed). Idempotent per-process (cached per agent_id), best-effort (non-fatal on PATCH error). Fixes the "Cannot read properties of undefined (reading 'error_type')" crash on operators who have an existing agent that predates the override declaration.
Co-authored-by: Cursor <cursoragent@cursor.com>
Brings layered voice prompt UI, context bootstrap + streaming, and ConvAI agent override reconciliation into the existing voice picker PR. All new surfaces live behind a single collapsed "Advanced voice settings" disclosure; defaults stay quiet for users who only want to pick a voice. Co-authored-by: Cursor <cursoragent@cursor.com> # Conflicts: # hub/src/web/server.ts
Findings
- [Major] Gate Gemini client frames until setupComplete: the proxy sends HAPI's setup, then immediately flushes queued client frames and forwards later frames as soon as the upstream socket is open. A direct authenticated client can race `clientContent` or tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's model, tools, prompt, language, and voice configuration are installed. Evidence: `hub/src/web/server.ts:88` and `hub/src/web/server.ts:118`.
  Suggested fix:

    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName, data.systemInstruction)))
    }

    upstream.onmessage = (event) => {
        const text = typeof event.data === 'string'
            ? event.data
            : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
        const parsed = JSON.parse(text) as { setupComplete?: unknown }
        if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
            setupCompleteMap.set(clientWs, true)
            for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
            pendingClientFrames.delete(clientWs)
        }
        if (clientWs.readyState === 1) clientWs.send(event.data)
    }

    if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
- [Major] Pass decoded systemInstruction into Qwen setup: `/api/voice/qwen-ws` decodes `?systemPrompt=` and stores it on `ws.data`, and the Qwen client sends that param for advanced voice settings, but `createQwenProxyWebSocketHandler()` casts the data without `systemInstruction` and calls `buildQwenSessionUpdateMessage(data.language, data.voiceName)`. Qwen sessions therefore ignore the user-edited identity/character/delivery prompt while Gemini uses it. Evidence: `hub/src/web/qwenProxyHandler.ts:53` and `hub/src/web/qwenProxyHandler.ts:61`.
  Suggested fix:

    const data = clientWs.data as {
        apiKey: string
        model: string
        language?: string
        voiceName?: string
        systemInstruction?: string
    }
    pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
Questions
- None.
Summary
- Review mode: follow-up after new commits
- The previous Gemini setup-ack race remains present, and the new Qwen prompt transport path drops the decoded custom prompt before upstream setup. Add proxy-level tests for both setup contracts.
Testing
- Not run (automation; static review only).
HAPI Bot
| data.voiceName, | ||
| data.systemInstruction | ||
| ))) | ||
| for (const queued of pending.splice(0)) { |
[Major] Gate Gemini client frames until setupComplete
This flushes queued client frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's model, tools, prompt, language, and voice setup is installed.
Suggested fix:
    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName, data.systemInstruction)))
    }

    if (setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    // Flip the gate and flush only after the first upstream setupComplete frame.
| }) | ||
|
|
||
| upstreamMap.set(clientWs, upstream) | ||
| pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName))) |
[Major] Pass the decoded prompt into Qwen's hub-owned setup
The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.
Suggested fix:
    const data = clientWs.data as {
        apiKey: string
        model: string
        language?: string
        voiceName?: string
        systemInstruction?: string
    }
    pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
…service
Live-tested against DashScope international API:
- Model: qwen3-omni-flash-realtime → qwen3.5-omni-flash-realtime (previous model ID did not exist on DashScope)
- Default voice: Cherry → Tina (confirmed from session.created response on qwen3.5-omni-flash-realtime)
- Default WS base: dashscope.aliyuncs.com → dashscope-intl.aliyuncs.com (international accounts use the -intl endpoint; China endpoint rejects international API keys; QWEN_REALTIME_WS_URL env var still overrides)
Two dogfooding fixes verified against live Qwen Realtime session:
sendTextMessage: switch from instruction-injection to conversation.item.create
Qwen Realtime requires a user conversation item before response.create.
The previous approach (updateInstructions + response.create) produced
"input messages do not contain elements with role user" errors. Now sends
{type:message, role:user, content:[{type:input_text}]} then response.create.
sendContextualUpdate is unchanged (instruction-only, no response trigger).
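The ordering described above (user item first, then the response trigger) can be sketched as a pair of events. The event types and the message/user/input_text values are the ones quoted in the commit message; the `item` wrapper, the `text` field, and the function name are assumptions about the wire format, not code from the PR.

```typescript
// Hypothetical builder: buildUserTextEvents is an illustrative name only.
type QwenClientEvent =
    | {
          type: 'conversation.item.create'
          item: { type: 'message'; role: 'user'; content: Array<{ type: 'input_text'; text: string }> }
      }
    | { type: 'response.create' }

function buildUserTextEvents(text: string): QwenClientEvent[] {
    return [
        // 1. Add a user-role conversation item so the request contains a user message.
        {
            type: 'conversation.item.create',
            item: { type: 'message', role: 'user', content: [{ type: 'input_text', text }] }
        },
        // 2. Only then ask the model to respond.
        { type: 'response.create' }
    ]
}

// Usage: serialize each event as one JSON text frame and send them in order.
const frames = buildUserTextEvents('What is the build status?').map((event) => JSON.stringify(event))
```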
Language handling: replace zh-only branch with buildVoiceLanguageBlock()
Previously, only language='zh' added any instruction; all other languages
(including English) sent no language block, causing Qwen to drift to Chinese.
buildVoiceLanguageBlock() now covers three cases:
- 'zh'/'zh-*': existing Chinese block (unchanged)
- explicit code ('en','es','fr',...): "Always respond in [Language]"
- undefined/auto: "Detect the user's language and maintain it"
Applied to buildGeminiLiveConfig, buildQwenSessionUpdateMessage, and the
client-side currentInstructions mirror in QwenVoiceSession.
Also removes the Gemini hub proxy's zh-only filter, which was discarding
explicit language selections other than Chinese.
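A rough sketch of the three language cases listed in this commit message. It is a hypothetical re-implementation for illustration: the real buildVoiceLanguageBlock may use different wording and cover more languages, the name table here is an assumption, and the Chinese block is represented by a placeholder because its text is not quoted in the thread.

```typescript
// Assumed, deliberately tiny language table; the PR's own coverage is not shown.
const LANGUAGE_NAMES: Record<string, string> = { en: 'English', es: 'Spanish', fr: 'French' }

// Placeholder for the existing Chinese instruction block (text not available here).
const ZH_BLOCK_PLACEHOLDER = '<existing Chinese block>'

function buildVoiceLanguageBlockSketch(language?: string): string {
    const code = language?.trim().toLowerCase()
    // Case 3: undefined or auto, so let the model follow the user.
    if (!code || code === 'auto') {
        return "Detect the user's language and maintain it."
    }
    // Case 1: zh and zh-* keep the existing Chinese block.
    if (code === 'zh' || code.startsWith('zh-')) {
        return ZH_BLOCK_PLACEHOLDER
    }
    // Case 2: any other explicit code gets a fixed-language instruction.
    const base = code.split('-')[0] ?? code
    const name = LANGUAGE_NAMES[base] ?? code
    return `Always respond in ${name}.`
}
```

The important property is that every input produces some language instruction, which is what prevents the drift to Chinese described above.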
…-backends
# Conflicts:
#   hub/src/web/server.ts
#   shared/src/voice.ts
#   web/src/realtime/QwenVoiceSession.tsx
Findings
- [Major] Gate Gemini client frames until setupComplete - the proxy sends HAPI's setup, then immediately flushes queued frames and forwards later frames as soon as the upstream socket is open. A direct authenticated client can still race clientContent/tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's prompt, tools, language, and voice config are installed. Evidence: hub/src/web/server.ts:88 and hub/src/web/server.ts:118.
  Suggested fix:
    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
            data.language,
            data.voiceName,
            data.systemInstruction
        )))
    }

    upstream.onmessage = (event) => {
        const text = typeof event.data === 'string'
            ? event.data
            : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
        const parsed = JSON.parse(text) as { setupComplete?: unknown }
        if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
            setupCompleteMap.set(clientWs, true)
            for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
            pendingClientFrames.delete(clientWs)
        }
        if (clientWs.readyState === 1) clientWs.send(event.data)
    }

    if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
- [Major] Pass decoded systemInstruction into Qwen setup - /api/voice/qwen-ws decodes ?systemPrompt= and stores it on ws.data, and the Qwen client now sends that param for advanced voice settings. The proxy still casts data without systemInstruction and calls buildQwenSessionUpdateMessage(data.language, data.voiceName), so Qwen sessions ignore the user-edited identity/character/delivery prompt while Gemini applies it. Evidence: hub/src/web/qwenProxyHandler.ts:53 and hub/src/web/qwenProxyHandler.ts:61.
  Suggested fix:
    const data = clientWs.data as {
        apiKey: string
        model: string
        language?: string
        voiceName?: string
        systemInstruction?: string
    }
    pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
Questions
- None.
Summary
Review mode: follow-up after new commits
The prior Gemini setup-ack race remains unresolved, and the Qwen prompt transport path still drops the decoded custom prompt before upstream setup. Add proxy-level tests for both setup contracts.
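One way the requested proxy-level test for the Gemini setup contract could look is a small gate object driven by a fake upstream. This is a sketch under stated assumptions: SetupGate, Sender, and the method names are hypothetical and do not appear in the PR, and the real proxy keeps this state in WeakMaps keyed by the client socket rather than in a class.

```typescript
type Frame = string | ArrayBuffer | Uint8Array

interface Sender {
    send(frame: Frame): void
}

// Hypothetical gate: holds client frames until the upstream acknowledges setup.
class SetupGate {
    private ready = false
    private readonly pending: Frame[] = []
    private readonly upstream: Sender

    constructor(upstream: Sender) {
        this.upstream = upstream
    }

    // Every client frame passes through here; nothing is forwarded before the ack.
    fromClient(frame: Frame): void {
        if (!this.ready) {
            this.pending.push(frame)
            return
        }
        this.upstream.send(frame)
    }

    // Every upstream text frame passes through here; the first setupComplete opens the gate.
    fromUpstream(text: string): void {
        if (this.ready) return
        let parsed: { setupComplete?: unknown } | null = null
        try {
            parsed = JSON.parse(text) as { setupComplete?: unknown } | null
        } catch {
            return // non-JSON frames never open the gate
        }
        if (parsed && parsed.setupComplete !== undefined) {
            this.ready = true
            for (const frame of this.pending.splice(0)) this.upstream.send(frame)
        }
    }
}

// Usage with a fake upstream that records what it was sent.
const sent: Frame[] = []
const gate = new SetupGate({ send: (frame) => { sent.push(frame) } })
gate.fromClient('early')
const sentBeforeAck = sent.length
gate.fromUpstream('{"setupComplete":{}}')
gate.fromClient('late')
```

Queued frames are flushed in arrival order, so a client that starts streaming early loses nothing; it simply cannot act before the hub-owned setup is acknowledged.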
Testing
- Not run (automation; bun is not installed in this runner).
HAPI Bot
data.voiceName,
data.systemInstruction
)))
for (const queued of pending.splice(0)) {
[Major] Gate Gemini client frames until setupComplete
This flushes queued frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's prompt, tools, language, and voice setup is installed.
Suggested fix:
const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()
upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName, data.systemInstruction)))
}
if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
// Flip the gate and flush only after the first upstream setupComplete frame.

})

upstreamMap.set(clientWs, upstream)
pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName)))
[Major] Pass the decoded prompt into Qwen's hub-owned setup
The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.
Suggested fix:
const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}
pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))

638871b to 9d4a5f5
Findings
- [Major] Gate Gemini client frames until setupComplete - the proxy sends HAPI's setup and then flushes queued client frames immediately on upstream open, and later forwards frames whenever the upstream socket is open. A direct authenticated client can still race clientContent or tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's prompt, tools, language, and voice config are installed. Evidence: hub/src/web/server.ts:88.
  Suggested fix:
    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
            data.language,
            data.voiceName,
            data.systemInstruction
        )))
    }

    upstream.onmessage = (event) => {
        const text = typeof event.data === 'string'
            ? event.data
            : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
        const parsed = JSON.parse(text) as { setupComplete?: unknown }
        if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
            setupCompleteMap.set(clientWs, true)
            for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
            pendingClientFrames.delete(clientWs)
        }
        if (clientWs.readyState === 1) clientWs.send(event.data)
    }

    if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
- [Major] Pass decoded systemInstruction into Qwen setup - /api/voice/qwen-ws decodes ?systemPrompt= and stores it on ws.data, and the Qwen client sends that param for advanced voice settings. The extracted proxy handler still casts data without systemInstruction and calls buildQwenSessionUpdateMessage(data.language, data.voiceName), so Qwen sessions ignore the user-edited identity/character/delivery prompt while Gemini applies it. Evidence: hub/src/web/qwenProxyHandler.ts:61.
  Suggested fix:
    const data = clientWs.data as {
        apiKey: string
        model: string
        language?: string
        voiceName?: string
        systemInstruction?: string
    }
    pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
Questions
- None.
Summary
Review mode: follow-up after new commits
The prior Gemini setup-ack race remains unresolved, and the Qwen prompt transport path still drops the decoded custom prompt before upstream setup. Add proxy-level tests for both setup contracts.
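For the Qwen half of the requested tests, a sketch of a prompt-propagation check might look like the following. The builder here is a stand-in, not the PR's buildQwenSessionUpdateMessage: the session.voice and session.instructions field names, the way the language is folded into the instructions, and both function names are assumptions. The QwenWsData fields and the default voice Tina come from the thread.

```typescript
interface QwenWsData {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

// Hypothetical stand-in for the hub's session.update builder.
function buildSessionUpdateSketch(language?: string, voiceName?: string, systemInstruction?: string) {
    const parts = [systemInstruction?.trim(), language ? `Language: ${language}` : undefined]
    const instructions = parts.filter((part): part is string => Boolean(part)).join('\n\n')
    return { type: 'session.update' as const, session: { voice: voiceName ?? 'Tina', instructions } }
}

function buildPendingSetup(data: QwenWsData): string {
    // Forward all three fields; omitting systemInstruction is the bug described above.
    return JSON.stringify(buildSessionUpdateSketch(data.language, data.voiceName, data.systemInstruction))
}

// Usage: the custom prompt decoded from ?systemPrompt= must survive into the setup frame.
const setup = JSON.parse(buildPendingSetup({
    apiKey: 'test-key',
    model: 'qwen3.5-omni-flash-realtime',
    language: 'en',
    systemInstruction: 'You are a terse release assistant.'
})) as { type: string; session: { voice: string; instructions: string } }
```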
Testing
- Not run (automation; PR review security policy says not to execute PR code).
HAPI Bot
data.voiceName,
data.systemInstruction
)))
for (const queued of pending.splice(0)) {
[Major] Gate Gemini client frames until setupComplete
This flushes queued frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's prompt, tools, language, and voice setup is installed.
Suggested fix:
const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()
upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
}
if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
// Flip the gate and flush only after the first upstream setupComplete frame.

})

upstreamMap.set(clientWs, upstream)
pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName)))
[Major] Pass the decoded prompt into Qwen's hub-owned setup
The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.
Suggested fix:
const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}
pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))

Discovered during dogfooding that the advanced settings disclosure mixed
two unrelated concerns: acoustic delivery and conversational behaviour.
Remove the top-level "Advanced voice settings" collapsible entirely.
Replace with two flat, always-visible subsections:
How It Sounds — voice backend, voice picker, acoustic tuning sliders
(stability, expressiveness, speaking rate, similarity, affective
dialog, per-backend hints). All controls that affect the audio signal.
How It Responds — language, proactive toggle, identity, character
prompt, delivery preset, platform fixtures. All controls that affect
what the agent chooses to say and in what language.
Language moves from above How It Sounds (where it was awkwardly grouped
with the backend picker) into How It Responds, where it belongs.
Proactive toggle moves from the separate "Session behavior" block into
How It Responds — it is a conversational behaviour control, not an
acoustic one.
VoiceAdvancedControls exports two named components (VoiceRespondsControls,
VoiceSoundsControls); the old VoiceAdvancedControls export is kept as a
deprecated re-export for any external callers.
zh-CN locale: 声音效果 / 对话风格.
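The split described in this commit can be pictured as a plain mapping from section to controls. This is only a sketch: the section keys and control ids below paraphrase the commit message and are not identifiers from the PR, and the real components are React components rather than a lookup table.

```typescript
type VoiceSection = 'sounds' | 'responds'

// Hypothetical control ids, grouped the way the commit message groups them.
const VOICE_SECTIONS: Record<VoiceSection, readonly string[]> = {
    // How It Sounds: controls that affect the audio signal.
    sounds: ['backend', 'voice', 'stability', 'expressiveness', 'speakingRate', 'similarity', 'affectiveDialog'],
    // How It Responds: controls that affect what the agent says and in which language.
    responds: ['language', 'proactive', 'identity', 'characterPrompt', 'deliveryPreset', 'platformFixtures']
}

function sectionFor(control: string): VoiceSection | undefined {
    const sections = Object.keys(VOICE_SECTIONS) as VoiceSection[]
    return sections.find((section) => VOICE_SECTIONS[section].includes(control))
}
```

With this shape, moving a control such as the language selector between sections is a one-line change, which matches the kind of regrouping the commit performs.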
Findings
- [Major] Gate Gemini client frames until setupComplete - the proxy sends HAPI's setup and then immediately flushes queued client frames, and message() forwards later frames whenever the upstream socket is open. A direct authenticated client can still race clientContent or tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's prompt, tools, language, and voice config are installed. Evidence: hub/src/web/server.ts:88.
  Suggested fix:
    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
            data.language,
            data.voiceName,
            data.systemInstruction
        )))
    }

    upstream.onmessage = (event) => {
        const text = typeof event.data === 'string'
            ? event.data
            : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
        const parsed = JSON.parse(text) as { setupComplete?: unknown }
        if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
            setupCompleteMap.set(clientWs, true)
            for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
            pendingClientFrames.delete(clientWs)
        }
        if (clientWs.readyState === 1) clientWs.send(event.data)
    }

    if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
- [Major] Pass decoded systemInstruction into Qwen setup - /api/voice/qwen-ws decodes ?systemPrompt= and stores it on ws.data, and the Qwen client sends that param for advanced voice settings. The extracted proxy handler still casts data without systemInstruction and calls buildQwenSessionUpdateMessage(data.language, data.voiceName), so Qwen sessions ignore the user-edited identity/character/delivery prompt while Gemini applies it. Evidence: hub/src/web/qwenProxyHandler.ts:61.
  Suggested fix:
    const data = clientWs.data as {
        apiKey: string
        model: string
        language?: string
        voiceName?: string
        systemInstruction?: string
    }
    pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
Questions
- None.
Summary
- Review mode: follow-up after new commits
- The previous Qwen ack gate issue is partly addressed, but the Gemini setup race remains and Qwen still drops the decoded custom prompt before upstream setup. Add proxy-level tests covering Gemini setupComplete gating and Qwen custom prompt propagation.
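A gating test also needs a reliable way to recognise the acknowledgement frame. The suggested fix calls JSON.parse on every upstream frame, which throws on non-JSON payloads; the helper below is a hedged sketch of a tolerant check. The name isSetupCompleteFrame is hypothetical, and it assumes, as the suggested fix does, that the ack is a JSON object carrying a setupComplete key.

```typescript
// Hypothetical helper: returns true only for a JSON object frame that has a setupComplete key.
function isSetupCompleteFrame(data: string | ArrayBuffer | Uint8Array): boolean {
    const text = typeof data === 'string'
        ? data
        : new TextDecoder().decode(data instanceof Uint8Array ? data : new Uint8Array(data))
    try {
        const parsed: unknown = JSON.parse(text)
        return typeof parsed === 'object' && parsed !== null && 'setupComplete' in parsed
    } catch {
        return false // audio or otherwise non-JSON frames are never a setup ack
    }
}
```

Keeping the check in one pure function lets the same code run in the proxy and in its tests without a live upstream socket.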
Testing
- Not run (automation; PR review security policy says not to execute PR code).
HAPI Bot
data.voiceName,
data.systemInstruction
)))
for (const queued of pending.splice(0)) {
[Major] Gate Gemini client frames until setupComplete
This flushes queued frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's prompt, tools, language, and voice setup is installed.
Suggested fix:
const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()
upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
}
// Queue all client frames until the first upstream setupComplete frame.
if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
upstream.send(message)

})

upstreamMap.set(clientWs, upstream)
pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName)))
[Major] Pass the decoded prompt into Qwen's hub-owned setup
The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.
Suggested fix:
const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}
pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))
Summary
Two voice features against #742, stacked on #692. The visible default surface stays small (one picker, one toggle); everything else lives behind a single collapsed disclosure so a user who just wants to pick a voice never sees the tuning UI.
Backend-aware voice picker
- shared/voicePickerCatalog.ts - static Gemini/Qwen voice lists, per-backend localStorage keys, resolve helpers
- GET /api/voice/backend returns { backend, backends } when multiple providers are configured
- GET /api/voice/voices - ElevenLabs list available even when default backend is Gemini
- gemini-ws - ?voice= query param wired into setup message
- VoiceSessionConfig.voiceName + stored preference per backend
Composed system prompt + bootstrap-and-stream context (folded in from feat/voice-advanced-controls)
- shared/voicePromptLayers.ts: platform fixtures (read-only - tool contracts, routing, TTS rules) + provider guardrails + user-editable identity + user-editable character. composeVoiceAgentPrompt merges them.
- sendContextualUpdate after connect. Honest UI wire-budget hints.
- {agent:{language:'en'}} payload (byte-parity with upstream baseline). Custom layers/sliders/voiceId opt in their respective override fields. Fixes the unauthorized-override crash (Cannot read properties of undefined ('error_type')).
- /voice/token resolution the hub PATCHes the agent's platform_settings.overrides to match buildVoiceAgentConfig() (agent.prompt + tts.voice_id/stability/similarity_boost/style/speed). Idempotent per-process and best-effort - existing agents that predate the override declaration now self-heal on next session start instead of requiring operator-side console edits.
UI discipline
- web/src/components/settings/VoiceAdvancedControls.tsx wraps fixtures preview, identity editor, character editor, delivery preset selector, and tuning sliders inside one master Advanced voice settings disclosure (collapsed by default).
- customized badge appears next to the master title if any layer differs from defaults so power users still find their tweaks.
Test plan
- bunx tsc --noEmit (hub + web)
- bun test voice routes (hub) - 20 pass including new reconciles platform_settings.overrides on existing agents test
- bun test voice client tests (web) - voicePersonalitySession + voiceContextPlan green
- hapi-driver-rebuild --build-web --verify then dogfood the three backends end-to-end on driver
Merge order
- feat/pluggable-voice-backend)
- upstream/main (PR diff should drop to this PR's own delta), merge feat(voice): backend voice picker + advanced controls behind disclosure (#742) #743
Issues
Ref #742
Blocked by #692
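The layered prompt and the customized badge described in the PR summary can be sketched roughly as follows. The four layer names follow the summary, but composePromptSketch and isCustomized are hypothetical stand-ins: the real composeVoiceAgentPrompt signature, its separator, and its layer ordering are not shown in the thread, and the default texts below are invented purely for the example.

```typescript
interface VoicePromptLayers {
    platformFixtures: string   // read-only: tool contracts, routing, TTS rules
    providerGuardrails: string // read-only, per backend
    identity: string           // user-editable
    character: string          // user-editable
}

// Assumed merge rule: fixed layers first, user layers last, empty layers skipped.
function composePromptSketch(layers: VoicePromptLayers): string {
    return [layers.platformFixtures, layers.providerGuardrails, layers.identity, layers.character]
        .map((layer) => layer.trim())
        .filter((layer) => layer.length > 0)
        .join('\n\n')
}

// Drives a "customized" badge: only the user-editable layers can differ from defaults.
function isCustomized(current: VoicePromptLayers, defaults: VoicePromptLayers): boolean {
    return current.identity !== defaults.identity || current.character !== defaults.character
}

// Usage with made-up default texts.
const defaults: VoicePromptLayers = {
    platformFixtures: 'Use tools as documented.',
    providerGuardrails: '',
    identity: 'You are a voice assistant.',
    character: 'Friendly.'
}
const edited: VoicePromptLayers = { ...defaults, character: 'Dry and brief.' }
const prompt = composePromptSketch(edited)
```

Keeping the read-only layers ahead of the user-editable ones means a user edit can change tone and persona without displacing the tool contracts.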