feat(voice): backend voice picker + advanced controls behind disclosure (#742) by heavygee · Pull Request #743 · tiann/hapi

heavygee · 2026-05-30T22:49:10Z

Reviewers: do not use the default "Files changed" tab on this PR.
This branch is stacked on #692 (feat/pluggable-voice-backend). Against tiann/main GitHub will show the union of #692 + this PR.
Review only: heavygee/hapi@feat/pluggable-voice-backend...feat/voice-selection-all-backends
Do not merge until #692 lands, then rebase onto main so this PR shrinks to its own delta only.

Summary

Two voice features against #742, stacked on #692. The visible default surface stays small (one picker, one toggle); everything else lives behind a single collapsed disclosure so a user who just wants to pick a voice never sees the tuning UI.

Backend-aware voice picker

shared/voicePickerCatalog.ts - static Gemini/Qwen voice lists, per-backend localStorage keys, resolve helpers
Hub GET /api/voice/backend returns { backend, backends } when multiple providers are configured
Hub GET /api/voice/voices - ElevenLabs list available even when default backend is Gemini
Hub gemini-ws - ?voice= query param wired into setup message
Web Settings - voice backend chooser (when 2+ backends), voice list follows selection, Gemini/Qwen descriptions, preview-is-EL-only hint
Voice sessions - VoiceSessionConfig.voiceName + stored preference per backend
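
The per-backend storage keys and resolve helpers listed above could look roughly like the sketch below. It is not the contents of shared/voicePickerCatalog.ts: the catalog entries, the key format, and the names `VOICE_CATALOG`, `voiceStorageKey`, and `resolveVoiceName` are assumptions made for illustration.

```typescript
type VoiceBackend = 'elevenlabs' | 'gemini-live' | 'qwen-realtime'
type CatalogBackend = Exclude<VoiceBackend, 'elevenlabs'>

interface VoiceOption {
    name: string
    description: string
}

// Illustrative entries only; the real static lists live in the catalog module.
const VOICE_CATALOG: Record<CatalogBackend, VoiceOption[]> = {
    'gemini-live': [
        { name: 'Puck', description: 'placeholder description' },
        { name: 'Kore', description: 'placeholder description' }
    ],
    'qwen-realtime': [
        { name: 'Cherry', description: 'placeholder description' }
    ]
}

// Hypothetical key format: one stored preference per backend.
function voiceStorageKey(backend: VoiceBackend): string {
    return `hapi-voice-name:${backend}`
}

// Return the stored voice if the catalog still lists it, otherwise the first catalog entry.
function resolveVoiceName(backend: CatalogBackend, stored: string | null): string {
    const options = VOICE_CATALOG[backend]
    const match = options.find((option) => option.name === stored)
    return (match ?? options[0]).name
}
```

Falling back to a catalog entry means a preference saved for a voice that is later removed resolves to something the backend still accepts.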

Composed system prompt + bootstrap-and-stream context (folded in from `feat/voice-advanced-controls`)

Layered prompt in shared/voicePromptLayers.ts: platform fixtures (read-only - tool contracts, routing, TTS rules) + provider guardrails + user-editable identity + user-editable character. composeVoiceAgentPrompt merges them.
Bootstrap + stream context: small initial conversation payload at handshake (~4 KB) plus streaming chunks via sendContextualUpdate after connect. Honest UI wire-budget hints.
All three backends: ElevenLabs ConvAI, Gemini Live, and Qwen Realtime each compose + bootstrap + stream.
ElevenLabs minimal-overrides discipline: empty prefs produce a minimal {agent:{language:'en'}} payload (byte-parity with upstream baseline). Custom layers/sliders/voiceId opt in their respective override fields. Fixes the unauthorized-override crash (Cannot read properties of undefined ('error_type')).
Hub-side ConvAI override reconciliation: on every /voice/token resolution the hub PATCHes the agent's platform_settings.overrides to match buildVoiceAgentConfig() (agent.prompt + tts.voice_id/stability/similarity_boost/style/speed). Idempotent per-process and best-effort - existing agents that predate the override declaration now self-heal on next session start instead of requiring operator-side console edits.
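
As a rough illustration of the layered prompt described above, the sketch below joins the four layers in a fixed order. The real composeVoiceAgentPrompt signature is not shown in this thread, so `VoicePromptLayers`, `composePromptLayers`, the ordering, and the blank-line separator are all assumptions.

```typescript
interface VoicePromptLayers {
    fixtures: string    // platform-owned, read-only (tool contracts, routing, TTS rules)
    guardrails: string  // provider-specific, read-only
    identity: string    // user-editable
    character: string   // user-editable
}

// Hypothetical stand-in for composeVoiceAgentPrompt: drop empty layers, keep a fixed order.
function composePromptLayers(layers: VoicePromptLayers): string {
    return [layers.fixtures, layers.guardrails, layers.identity, layers.character]
        .map((layer) => layer.trim())
        .filter((layer) => layer.length > 0)
        .join('\n\n')
}
```

Keeping the read-only layers first means a user edit to identity or character cannot displace the tool contracts.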

UI discipline

New web/src/components/settings/VoiceAdvancedControls.tsx wraps fixtures preview, identity editor, character editor, delivery preset selector, and tuning sliders inside one master Advanced voice settings disclosure (collapsed by default).
Sub-sections (fixtures / identity / character / delivery / tuning) start collapsed when the master opens.
A customized badge appears next to the master title if any layer differs from defaults so power users still find their tweaks.
Backend picker + voice picker remain at the top, outside the disclosure.

Test plan

bunx tsc --noEmit (hub + web)
bun test voice routes (hub) - 20 pass including new reconciles platform_settings.overrides on existing agents test
bun test voice client tests (web) - voicePersonalitySession + voiceContextPlan green
hapi-driver-rebuild --build-web --verify then dogfood the three backends end-to-end on driver
Operator dogfood on driver soup (PR review gate)

Merge order

Merge feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime #692 (feat/pluggable-voice-backend)
Rebase this branch onto upstream/main (PR diff should drop to this PR's own delta), merge feat(voice): backend voice picker + advanced controls behind disclosure (#742) #743

Issues

Ref #742
Blocked by #692

heavygee · 2026-05-31T00:53:06Z

Stack note: This PR is blocked by #692. For review, prefer the incremental diff on the fork:

heavygee/hapi@feat/pluggable-voice-backend...feat/voice-selection-all-backends

(4 commits: catalog scaffold, backend chooser, Gemini/Qwen descriptions, Playwright dogfood script.)

github-actions

Findings

[Major] Qwen proxy forwards arbitrary client frames with the hub API key — POST /api/voice/qwen-token correctly keeps the DashScope key server-side, but the new /api/voice/qwen-ws proxy then opens DashScope with that key and blindly forwards every browser frame to upstream. Unlike the Gemini path, which rejects client-provided setup frames, any authenticated web client can connect directly and send its own session.update/response.create payloads, turning the hub into a generic DashScope proxy and allowing client-controlled instructions/tools under server credentials. Evidence: hub/src/web/server.ts:159.
Suggested fix:

const allowedQwenRuntimeEvents = new Set([
    'input_audio_buffer.append',
    'input_audio_buffer.commit',
    'response.create',
    'conversation.item.create'
])

function parseQwenClientEvent(message: string | ArrayBuffer | Uint8Array): { type?: string } | null {
    try {
        return JSON.parse(decodeWsText(message)) as { type?: string }
    } catch {
        return null
    }
}

message(clientWs: ServerWebSocket<unknown>, message: string | ArrayBuffer | Uint8Array) {
    const event = parseQwenClientEvent(message)
    if (!event?.type || !allowedQwenRuntimeEvents.has(event.type)) {
        try { clientWs.close(1008, 'Client-provided Qwen setup is not allowed') } catch { /* */ }
        return
    }

    const upstream = upstreamMap.get(clientWs)
    if (upstream?.readyState === WebSocket.OPEN) {
        upstream.send(message)
    }
}

Move the initial session.update construction into the hub proxy, as Gemini does with buildGeminiLiveSetupMessage, and only let the browser send runtime audio/tool-response events afterward.

Questions

None.

Summary

Review mode: initial
One issue found: Qwen realtime proxy needs the same server-owned setup boundary as Gemini before this is safe to merge.

Testing

Not run (automation; static review only, per PR security instructions).

HAPI Bot

heavygee · 2026-05-31T01:06:56Z

Noise check (re: default diff vs main)

Compare base	Commits	Files
`tiann/main` (GitHub default)	29	35
`feat/pluggable-voice-backend` (#692 tip)	4	23

GitHub would not let us retarget base to feat/pluggable-voice-backend on tiann/hapi (branch only exists on heavygee/hapi). Until #692 merges, use the compare link in the PR description for review.

Automated review note: The github-actions MAJOR on Qwen proxy (hub/src/web/server.ts) is from the #692 stack in this branch, not from the 4 #742 commits — please route that feedback to #692 if still open.

Rebased from Overbaker/hapi#401 onto current main. Adds a pluggable voice backend architecture that extends the existing ElevenLabs integration:

- Gemini 2.5 Live (gemini-live): Google real-time audio via WebSocket with full function calling (messageCodingAgent, processPermissionRequest)
- Qwen Realtime (qwen-realtime): Alibaba DashScope via hub WebSocket proxy (browser cannot set Authorization header directly)
- VoiceBackendSession: dynamic backend selector with React.lazy loading, gates voice button until backend module is registered
- Hub WS proxies: JWT-authenticated /api/voice/gemini-ws and /api/voice/qwen-ws endpoints in Bun.serve, with message queueing during upstream connect to prevent dropped setup frames
- AudioWorklet pipeline: inline Blob URL recorder, 24 kHz PCM player, serial tool call execution, AudioContext created in user gesture for mobile
- Backend discovery: GET /voice/backend + POST /voice/gemini-token / POST /voice/qwen-token hub routes; frontend auto-detects active backend

Merge notes:

- Rebased 135 upstream commits cleanly; HappyComposer keeps upstream's configurable enter-behavior setting (supersedes hard-coded Ctrl+Enter)
- Converted gemini test files from bun:test to vitest (web package uses vitest)
- All 221 hub tests and 636 web tests pass; TypeScript clean

turnComplete handler was unconditionally calling setMuted(false), which re-enabled the mic track even when the user had manually muted. Now restores to state.micMuted instead.

buildGeminiLiveConfig was appending VOICE_CHINESE_LANGUAGE_BLOCK which forced Gemini to always respond in Mandarin regardless of user locale. Gemini now uses the neutral base prompt and responds in the language the user speaks to it, consistent with the ElevenLabs behaviour.

If the session closes while Gemini is mid-speech, cleanup() left state.modelSpeaking=true. The next startSession() would then drop all mic audio in sendAudioChunk() until a model turn eventually flipped the flag — effectively deaf until page reload.

ws.onclose operated on module-level state.ws, not the socket that fired the event. A rapid stop/restart could cause the old socket's onclose to call cleanup() after the new socket was assigned, tearing down the live session. Guard with `if (state.ws !== ws) return` before cleanup. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
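
The stale-close guard can be shown in isolation. This is a minimal sketch with stand-in types: `SocketLike`, `SessionState`, `attach`, and the `cleanups` counter are hypothetical and exist only to make the behaviour observable.

```typescript
interface SocketLike {
    onclose: (() => void) | null
}

interface SessionState {
    ws: SocketLike | null
    cleanups: number
}

// Module-level state, as in the description above.
const state: SessionState = { ws: null, cleanups: 0 }

function cleanup(): void {
    state.ws = null
    state.cleanups += 1
}

// Attach a close handler that only tears down the session it belongs to.
function attach(ws: SocketLike): void {
    state.ws = ws
    ws.onclose = () => {
        if (state.ws !== ws) return // a newer socket owns the session now
        cleanup()
    }
}
```

Because the handler closes over its own `ws`, a late close event from a replaced socket becomes a no-op instead of tearing down the live session.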

Matches the Gemini fix — both backends now use VOICE_SYSTEM_PROMPT without the Chinese language block, giving consistent English-default behaviour across all non-ElevenLabs backends. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

Adds a "Proactive voice" toggle (default: off = reactive) to the Voice Assistant settings section. Reactive (default): initial context and agent-ready events are fed silently; the assistant waits for the user to speak first. Proactive: original behaviour — Gemini/Qwen narrate context on connect and speak unprompted when the agent finishes a task. ElevenLabs is also affected via onReady sending a user message rather than a silent update. Covers all three backends uniformly. localStorage key: hapi-voice-proactive. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

… visibility

- hub/server.ts: add toClientCloseCode() to normalize reserved upstream close codes (1005/1006/1015) to 1011 before forwarding to browser; abnormal upstream drops (1006) would otherwise throw on clientWs.close() and leave the browser socket open
- realtime/index.ts: remove static GeminiLiveVoiceSession and QwenVoiceSession barrel exports; VoiceBackendSession lazy-imports both, so barrel re-exports created static dependencies that defeated the intended code-split
- App.tsx: gate global useVisibilityReporter on !sessionEventSubscription so the always-on SSE connection does not suppress native Web Push notifications for sessions the user is not currently viewing

via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
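
The body of toClientCloseCode() is not shown in this thread. A minimal sketch of the mapping the commit message describes, with `RESERVED_CLOSE_CODES` as an assumed helper name, might be:

```typescript
// Codes that RFC 6455 reserves for local reporting; they must not be sent in a Close frame.
const RESERVED_CLOSE_CODES = new Set<number>([1005, 1006, 1015])

// Map an upstream close code to one that is safe to pass to close() on the browser socket.
function toClientCloseCode(code: number): number {
    return RESERVED_CLOSE_CODES.has(code) ? 1011 : code
}
```

Code 1011 signals an internal error, which is a reasonable stand-in when the upstream dropped without sending a usable code.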

…toggle label

- buildGeminiLiveConfig() now accepts optional language param; appends VOICE_CHINESE_LANGUAGE_BLOCK only when language === 'zh'
- GeminiLiveVoiceSession passes config.language through
- QwenVoiceSession conditionally builds basePrompt from language setting
- Fixes silent no-op when user selects Chinese in voice settings on Gemini/Qwen backends (was ElevenLabs-only)
- Rename voice-start toggle label to 'Start voice session with summary'
- Fix description: clarifies the choice is about session-open behaviour (summary vs greeting), not ongoing narration

via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

Gemini Live has no built-in first-message like ElevenLabs agents do; without an explicit turnComplete:true it sits silently. In reactive mode (default, toggle off) now sends a greeting instruction after any silent context feed so Gemini introduces itself and invites the user to speak. Proactive mode is unchanged: the context summary is the opening speech. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

…reeting - VOICE_SYSTEM_PROMPT: explicit instruction never to call itself Gemini, Google, or any underlying model/provider name — always HAPI - Greeting trigger text: instruct to greet as HAPI only, suppress model name and any reference to context/recent activity in the opening line via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

Gemini + Qwen client:

- onerror now sets setupDone/sessionReady and nulls state.ws before calling reject(), so the stale-close guard trips in onclose and prevents a duplicate statusCallback('error') on WS failure

Gemini client:

- Proactive mode with no initialContext now falls through to the greeting trigger instead of sitting silently
- Remove unused handleBargeIn callback (dead code)

Qwen client:

- Add input_audio_sample_rate: 16000 to session.update so PCM rate is declared explicitly rather than relying on DashScope's default

Hub proxy:

- Remove no-op ternary in Gemini flush loop and message handler (typeof x === 'string' ? x : x); use upstream.send(msg) directly
- Qwen onerror now calls upstreamMap.delete() before closing client, eliminating the stale map entry window
- Align Qwen hub fallback model string with QWEN_REALTIME_MODEL constant ('qwen3-omni-flash-realtime')

via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

hub/voice.ts:

- Replace string-concat WS URL construction with buildVoiceWsUrl(), which uses the URL API to set protocol/pathname cleanly; fixes double-slash when HAPI_PUBLIC_URL has a trailing slash (would silently skip the proxy route)

QwenVoiceSession.tsx:

- Wrap tool definitions in {type:'function', function:{...}} as required by Qwen-Omni realtime schema; the previous flat shape caused session.update rejection before audio capture could start
- Use pcm16/pcm24 audio formats matching DashScope spec; remove input_audio_sample_rate (encoded in format name)

via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
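
To illustrate why the URL API avoids the double slash, here is a sketch of a helper in the spirit of buildVoiceWsUrl(). The two-argument signature and the example host are assumptions; the real helper may take different inputs.

```typescript
// Derive a WebSocket URL from an HTTP(S) base URL and an absolute route path.
// Assigning pathname replaces the whole path, so a trailing slash on the base
// cannot produce "//api/...". Any path prefix on the base is dropped in this sketch.
function buildVoiceWsUrl(publicUrl: string, pathname: string): string {
    const url = new URL(publicUrl)
    url.protocol = url.protocol === 'https:' ? 'wss:' : 'ws:'
    url.pathname = pathname
    url.search = ''
    return url.toString()
}
```

For example, `buildVoiceWsUrl('https://hapi.example.com/', '/api/voice/gemini-ws')` yields `wss://hapi.example.com/api/voice/gemini-ws`, with or without the trailing slash on the base.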

…ose codes

GeminiLiveVoiceSession + QwenVoiceSession:

- startAudioCapture() is now async and awaits recorder.start() before calling setMuted(); previously setMuted ran before getUserMedia resolved, so a session restarted while muted would open the mic anyway
- statusCallback('connected') now fires after audio is ready
- setMuted() called unconditionally (not just when true) to correctly apply saved state in either direction

hub/src/web/server.ts:

- Both Gemini and Qwen close() handlers now pass the client code through toClientCloseCode() before forwarding to upstream; prevents reserved codes (e.g. 1006) from causing WebSocket.close() to throw and leave the upstream session open until provider timeout
- Reason string capped at 123 bytes (WebSocket protocol limit)

via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
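
The 123-byte limit follows from the 125-byte cap on control frame payloads minus the 2-byte close code, and it applies to the UTF-8 encoding rather than the string length. A sketch of a byte-aware cap, with `capCloseReason` as a hypothetical helper name:

```typescript
const MAX_CLOSE_REASON_BYTES = 123

// Truncate a close reason so its UTF-8 encoding fits in a Close frame.
function capCloseReason(reason: string): string {
    const encoder = new TextEncoder()
    let result = ''
    let used = 0
    // for...of walks code points, so a surrogate pair is never split in half.
    for (const char of reason) {
        const size = encoder.encode(char).length
        if (used + size > MAX_CLOSE_REASON_BYTES) break
        result += char
        used += size
    }
    return result
}
```

Slicing by character count alone would overshoot for non-ASCII reasons, since a single character can occupy up to four bytes.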

An unhandled rejection inside the async onmessage callback does not propagate to the outer startSession Promise — the UI hangs on 'connecting' and the provider socket stays partially open. Wrapping the await in try/catch calls cleanup()/statusCallback('error')/reject() so failures surface correctly. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

…alling back to ElevenLabs fetchVoiceBackend no longer catches errors and defaults to 'elevenlabs' — any network or server failure now throws so VoiceBackendSession can surface it via onStatusChange('error', ...) rather than silently mounting the wrong backend. VoiceBackendSession also resets backend state to null when api changes, so a stale ElevenLabs registration from a prior discovery cannot persist into a new session. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

…alling back to ElevenLabs Unknown backend strings (future values, typos) now throw rather than defaulting to elevenlabs, closing the narrow remaining form of the original misrouting bug. Also removes the unnecessary `as VoiceBackendResponse` cast. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
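
A strict parser of this kind can be sketched as follows. The list contains only the three backends named in this thread, and `VOICE_BACKENDS` and `parseVoiceBackend` are hypothetical names; the real validation may differ.

```typescript
const VOICE_BACKENDS = ['elevenlabs', 'gemini-live', 'qwen-realtime'] as const
type VoiceBackend = (typeof VOICE_BACKENDS)[number]

// Accept only known backend identifiers; anything else is an error, not a fallback.
function parseVoiceBackend(value: unknown): VoiceBackend {
    const match = VOICE_BACKENDS.find((backend) => backend === value)
    if (!match) throw new Error(`Unknown voice backend: ${String(value)}`)
    return match
}
```

Returning the matched constant rather than casting the input keeps the result typed without an `as` assertion.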

…r base64 uploads Qwen session.updated handler now sends the same proactive summary or greeting trigger that Gemini does — previously it started silently in both proactive and reactive modes. maxHttpBufferSize raised to 68 MiB to account for base64 expansion: 50 MiB decoded files become ~66.7 MiB as base64 JSON, so the previous 55 MiB ceiling would disconnect uploads above ~41 MiB before they reached the CLI. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>
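
The buffer-size figures can be checked with a few lines of arithmetic. The helper names below are made up for this sketch; the only facts used are the 4:3 base64 expansion and the sizes quoted in the commit message.

```typescript
const MIB = 1024 * 1024

// Base64 emits 4 output bytes for every 3 input bytes (padding rounds up).
function base64Length(decodedBytes: number): number {
    return Math.ceil(decodedBytes / 3) * 4
}

// Largest decoded payload whose base64 form fits under a byte ceiling.
function maxDecodedBytes(ceilingBytes: number): number {
    return Math.floor(ceilingBytes / 4) * 3
}

const encodedMiB = base64Length(50 * MIB) / MIB     // about 66.67
const oldLimitMiB = maxDecodedBytes(55 * MIB) / MIB // 41.25
```

A 68 MiB ceiling therefore leaves a little over 1 MiB for the JSON envelope around a 50 MiB file.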

….update for Qwen text Qwen's realtime API only supports conversation.item.create for function_call_output. Sending it with type:'message' for greetings/context was invalid and could fail before the user spoke. sendTextMessage and sendContextualUpdate now update session instructions via session.update (accumulating context into the system prompt) and trigger response.create only when a spoken reply is needed — matching Qwen's supported client event surface. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

…n start session.updated now returns early after the first ack — subsequent session.update calls (instruction appends) also echo session.updated but must not re-trigger audio capture or the greeting path. currentSessionConfig is now reset to null at the top of startSession so a stale config from a failed previous session cannot leak into the new one. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

Without this guard, a missing wsUrl in the hub token response would silently attempt to connect directly to Google with "proxied" as the API key — producing a confusing auth failure instead of a clear error. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

DashScope realtime API accepts only 'pcm' for both input and output audio formats. The pcm16/pcm24 values caused session.update rejection before audio capture could start, leaving the Qwen backend unusable. Also updates the default voice from Mia (not in the qwen3-omni-flash- realtime voice list) to Cherry, which is documented as supported. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

Failed token fetch, microphone denial, or WebSocket error during setup left state.playbackContext open. Each failure path now calls cleanup() before throwing/rejecting, preventing AudioContext leaks on mobile browsers with hard limits on concurrent contexts. via [HAPI](https://hapi.run) Co-Authored-By: HAPI <noreply@hapi.run>

Reverts changes to files that shouldn't differ from upstream:

- .gitignore: remove fork-only AGENTS.local.md entry
- web/src/App.tsx: restore dual-subscription SSE pattern (scope-aware)
- web/src/hooks/useSSE.ts: restore SSEScope/scope parameter
- web/src/hooks/useSSE.test.ts: restore (was accidentally deleted)
- web/src/lib/appSseSubscriptions.ts: restore (was accidentally deleted)
- web/src/lib/appSseSubscriptions.test.ts: restore (was accidentally deleted)
- hub/src/sync/syncEngine.ts: restore (off-topic change)

Hub sends HAPI-owned Gemini setup on proxy connect and rejects client setup frames. Qwen proxy always uses QWEN_REALTIME_MODEL instead of a client query parameter. Shared buildGeminiLiveSetupMessage() keeps wire format in one place. Co-authored-by: Cursor <cursoragent@cursor.com>

Expose configured backends from hub, let Settings pick provider when multiple API keys exist, and wire voice selection through Gemini/Qwen/ElevenLabs paths. Co-authored-by: Cursor <cursoragent@cursor.com>

Surface catalog descriptions on the voice row and in the picker, with a hint when preview is ElevenLabs-only. Disabled preview buttons stay visible. Co-authored-by: Cursor <cursoragent@cursor.com>

Co-authored-by: Cursor <cursoragent@cursor.com>

When hub/.env (operator-local) sets GEMINI_API_KEY, the 'falls back to elevenlabs for unknown VOICE_BACKEND values' test leaks gemini-live into the backends list. Delete the four non-elevenlabs key env vars defensively at the start of the test, mirroring the cleanup pattern used by the other tests in the same describe block. No behavior change. Co-authored-by: Cursor <cursoragent@cursor.com>

Stacks on voice-selection-all-backends. Adds shared voice-personality presets, Settings accordions (character + backend-specific tuning), and ElevenLabs session overrides from stored preferences. Co-authored-by: Cursor <cursoragent@cursor.com>

Replace append-only personality notes with the bundled HAPI voice system prompt in an advanced editor. User edits replace the base instruction for ElevenLabs, Gemini (incl. hub proxy), and Qwen. Presets only drive TTS sliders unless the user appends delivery text explicitly. Co-authored-by: Cursor <cursoragent@cursor.com>

Split voice instructions into platform fixtures, provider guardrails, and editable identity/character; compose at runtime for ElevenLabs, Gemini, and Qwen. Session history uses a small connect bootstrap plus deferred contextual chunks, gated by the proactive-summary setting. Settings UI shows wire budgets and read-only fixtures. Co-authored-by: Cursor <cursoragent@cursor.com>
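
One way to picture the connect bootstrap plus deferred chunks is the planner below. It is only a sketch under stated assumptions: `ContextPlan`, `planContext`, the newest-first bootstrap policy, and both byte budgets are hypothetical and are not taken from voiceContextPlan.

```typescript
interface ContextPlan {
    bootstrap: string[]  // sent with the connect handshake
    chunks: string[][]   // sent after connect, one contextual update per chunk
}

const byteLength = (text: string): number => new TextEncoder().encode(text).length

// Newest messages fill the bootstrap budget; older ones are grouped into chunks, oldest first.
function planContext(messages: string[], bootstrapBytes: number, chunkBytes: number): ContextPlan {
    const bootstrap: string[] = []
    let used = 0
    let index = messages.length - 1
    for (; index >= 0; index--) {
        const size = byteLength(messages[index])
        if (used + size > bootstrapBytes) break
        bootstrap.unshift(messages[index])
        used += size
    }

    const chunks: string[][] = []
    let current: string[] = []
    let currentSize = 0
    for (const message of messages.slice(0, index + 1)) {
        const size = byteLength(message)
        // Start a new chunk when this message would overflow the current one.
        // A single oversized message still gets a chunk of its own.
        if (current.length > 0 && currentSize + size > chunkBytes) {
            chunks.push(current)
            current = []
            currentSize = 0
        }
        current.push(message)
        currentSize += size
    }
    if (current.length > 0) chunks.push(current)
    return { bootstrap, chunks }
}
```

With a bootstrap budget near 4 KB, the handshake stays small while each entry in `chunks` would map to one sendContextualUpdate call after the session is connected.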

The web workspace runs tests via vitest run. Both voice tests imported describe/expect/test from 'bun:test', which vite cannot bundle; this caused 'Cannot bundle built-in module bun:test' transform failures during driver soup verify. Only generic test APIs are used, so switching the import to 'vitest' does not change behavior. Co-authored-by: Cursor <cursoragent@cursor.com>

heavygee · 2026-05-31T23:12:48Z

Heads-up from a downstream soup-rebuild (heavygee/hapi driver/integration, 2026-05-31):

To get this branch + the voice-advanced layer building cleanly on the current feat/pluggable-voice-backend tip, two pre-existing test bugs needed fixing. Both are local-only commits on the rebased branches and will be lost on the next force-push unless preserved.

1. On feat/voice-selection-all-backends (this PR)

238ad4c test(voice): isolate VOICE_BACKEND fallback test from leaked env vars

hub/src/web/routes/voice.test.ts | 4 ++++

The "falls back to elevenlabs for unknown VOICE_BACKEND values" test never deleted GEMINI_API_KEY / QWEN_API_KEY / OPENAI_API_KEY / QWEN_REALTIME_API_KEY before asserting. When hub/.env has any of those set, gemini-live (or another backend) leaks into the resolved backends list and the test fails. The fix adds four delete process.env.* lines at the top of the test, mirroring the cleanup pattern already used by the other tests in the same describe block. No behavior change.

2. On feat/voice-advanced-controls (sibling branch, stacks on this one)

f3ab0f3 test(web): import from vitest instead of bun:test

web/src/lib/voicePersonalitySession.test.ts        | 2 +-
web/src/realtime/hooks/voiceContextPlan.test.ts    | 2 +-

Both files imported describe/expect/test from 'bun:test'. The web workspace runs vitest run, which cannot bundle the built-in bun:test module and fails the transform during bun run build. Only generic test APIs are used, so the import swap does not change behavior.

Surfacing in case you re-do the rebase from scratch and overwrite these locally - they would silently come back as test failures on the next CI/build. No action needed if you keep them.

github-actions

Findings

[Major] Qwen proxy still forwards runtime frames before setup is acknowledged — the browser waits for session.updated before starting audio, but an authenticated client can connect directly and send response.create, conversation.item.create, or instruction-only session.update while pendingSetupMap is still set. Those frames are forwarded immediately at hub/src/web/server.ts:188, before the hub-owned session.update sent at hub/src/web/server.ts:159 is acknowledged. That can run the upstream session under the provider default config or race ahead of HAPI's locked tools/voice/instructions, reopening part of the generic-proxy boundary the previous review called out.
Suggested fix:

const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

// after session.created
pendingSetupMap.delete(clientWs)
setupAckMap.set(clientWs, false)
upstream.send(pendingSetup)

// after session.updated for the hub-owned setup
setupAckMap.set(clientWs, true)
for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
pendingClientFrames.delete(clientWs)

message(clientWs, message) {
    if (!isQwenSafeClientFrame(message)) { /* close */ return }
    const upstream = upstreamMap.get(clientWs)
    if (upstream?.readyState !== WebSocket.OPEN) return
    if (pendingSetupMap.has(clientWs) || setupAckMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
}

Questions

None.

Summary

Review mode: follow-up after new commits
The previous Qwen proxy issue is partly addressed, but the proxy still needs to gate client frames until the hub-owned Qwen setup has been acknowledged.

Testing

Not run (automation; static review only, per PR security instructions).

HAPI Bot

The previous Qwen proxy hardening sent the hub-owned session.update once upstream's session.created landed, but never waited for session.updated before forwarding client frames. An authenticated client could push response.create / conversation.item.create / instruction-only session.update during that window and run the upstream session under the provider default config or partially-applied state, reopening part of the generic-proxy boundary the prior review called out.

Changes:

- Extract createQwenProxyWebSocketHandler into hub/src/web/qwenProxyHandler.ts so the gating behaviour can be unit-tested without spinning up Bun.serve.
- Track a per-client setupAcked WeakMap and a pendingClientFrames queue. Client frames are queued (after passing isQwenSafeClientFrame) until the upstream emits session.updated; on ack the queue is flushed in order.
- Clear the WeakMaps on upstream error / upstream close / client close so no state leaks across reused references.
- Inject the upstream WebSocket constructor for tests; production keeps the real global WebSocket.

Tests: 4 new ack-gate cases + existing 245 hub tests still green.

Resolves the [Major] thread on PR tiann#743 follow-up review. Co-authored-by: Cursor <cursoragent@cursor.com>

github-actions

Findings

[Major] Gate Gemini client frames until setupComplete — the proxy sends HAPI's setup on upstream open, but it then flushes queued non-setup frames immediately and forwards later frames whenever the upstream socket is open. A direct authenticated client can therefore send clientContent/tool frames before Google acknowledges the hub-owned setup with setupComplete, racing execution ahead of HAPI's locked model, tools, system prompt, language, and voice config. Evidence: hub/src/web/server.ts:77 and hub/src/web/server.ts:107.
Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
}

upstream.onmessage = (event) => {
    const text = typeof event.data === 'string'
        ? event.data
        : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
    if (setupCompleteMap.get(clientWs) === false && JSON.parse(text).setupComplete) {
        setupCompleteMap.set(clientWs, true)
        for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
        pendingClientFrames.delete(clientWs)
    }
    if (clientWs.readyState === 1) clientWs.send(event.data)
}

message(clientWs, message) {
    if (isGeminiSetupFrame(message)) {
        clientWs.close(1008, 'Client-provided Gemini setup is not allowed')
        return
    }
    const upstream = upstreamMap.get(clientWs)
    if (!upstream || upstream.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
}

Questions

None.

Summary

Review mode: follow-up after new commits
The previous Qwen ack-gate issue appears addressed in the follow-up, but the newly added Gemini proxy still has the same setup-ack race at its proxy boundary.

Testing

Not run (automation; static review only, per PR security instructions).

HAPI Bot

github-actions · 2026-06-01T02:41:54Z

+            upstream.onopen = () => {
+                // Hub-owned setup only — never forward client setup (prevents generic Gemini proxy abuse).
+                upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
+                for (const queued of pending.splice(0)) {


[Major] Gate Gemini client frames until setupComplete

This flushes queued non-setup frames immediately after sending HAPI's setup, before Google has acknowledged it with setupComplete; once the upstream socket is open, message() also forwards later frames directly. A direct authenticated client can send clientContent or tool frames before HAPI's hub-owned model, tools, system prompt, language, and voice setup is installed, so the session can race ahead under default or partially-applied config.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
}

// On the first upstream setupComplete, flip the gate and flush queued client frames.
if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
    setupCompleteMap.set(clientWs, true)
    for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
    pendingClientFrames.delete(clientWs)
}

message(clientWs, message) {
    if (isGeminiSetupFrame(message)) {
        clientWs.close(1008, 'Client-provided Gemini setup is not allowed')
        return
    }
    const upstream = upstreamMap.get(clientWs)
    if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
}

ElevenLabs ConvAI rejects sessions whose `overrides` payload references any property the agent has not explicitly authorized. The server reply is a malformed error packet on the LiveKit data channel, which the convai-react SDK then dereferences as `event.error_type` (undefined -> TypeError -> disconnect), so the only symptom is a generic "Unknown DataChannel error" with no indication of the cause.

Empty-prefs sessions now emit exactly `{ agent: { language } }` for byte-parity with the upstream/main baseline that the configured agent permits today. Other fields are added only when needed:

- agent.prompt: only when user has customized identity or character
- tts (full slider object): only when preset diverges from balanced defaults
- tts: { voice_id }: only when a voice was picked and sliders are at defaults

Co-authored-by: Cursor <cursoragent@cursor.com>
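
The opt-in rules above can be expressed as a small builder. This is a sketch, not buildVoiceAgentConfig(): the `VoicePrefs` shape, the nested `prompt` object, and the choice to include `voice_id` alongside customized sliders are assumptions layered on the field names quoted in this thread.

```typescript
interface TtsSliders {
    stability: number
    similarity_boost: number
    style: number
    speed: number
}

interface VoicePrefs {
    language: string
    customPrompt?: string  // set only when identity or character was edited
    voiceId?: string       // set only when the user picked a voice
    sliders?: TtsSliders   // set only when they differ from the balanced defaults
}

interface ConvaiOverrides {
    agent: { language: string; prompt?: { prompt: string } }
    tts?: Partial<TtsSliders> & { voice_id?: string }
}

// Start from the minimal payload and add a field only when the user opted into it.
function buildOverrides(prefs: VoicePrefs): ConvaiOverrides {
    const overrides: ConvaiOverrides = { agent: { language: prefs.language } }
    if (prefs.customPrompt) overrides.agent.prompt = { prompt: prefs.customPrompt }
    if (prefs.sliders) {
        overrides.tts = { ...prefs.sliders, ...(prefs.voiceId ? { voice_id: prefs.voiceId } : {}) }
    } else if (prefs.voiceId) {
        overrides.tts = { voice_id: prefs.voiceId }
    }
    return overrides
}
```

With no customization, `buildOverrides({ language: 'en' })` serializes to `{"agent":{"language":"en"}}`, which is the byte-parity case described above.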

…onvAI overrides

UI: wrap fixtures, identity, character, delivery preset, and tuning sliders in a single collapsed "Advanced voice settings" disclosure. Defaults stay quiet; sub-sections start collapsed when the master opens. Shows a "customized" badge if any layer differs from defaults so a user who tweaked settings still knows where to find them.

Hub: on every /voice/token resolution, PATCH the resolved ConvAI agent's platform_settings.overrides to match buildVoiceAgentConfig() (agent.prompt, tts.voice_id/stability/similarity_boost/style/speed). Idempotent per-process (cached per agent_id), best-effort (non-fatal on PATCH error). Fixes the "Cannot read properties of undefined (reading 'error_type')" crash on operators who have an existing agent that predates the override declaration.

Co-authored-by: Cursor <cursoragent@cursor.com>

Brings layered voice prompt UI, context bootstrap + streaming, and ConvAI agent override reconciliation into the existing voice picker PR. All new surfaces live behind a single collapsed "Advanced voice settings" disclosure; defaults stay quiet for users who only want to pick a voice.

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#   hub/src/web/server.ts

github-actions

Findings

[Major] Gate Gemini client frames until setupComplete — the proxy sends HAPI's setup, then immediately flushes queued client frames and forwards later frames as soon as the upstream socket is open. A direct authenticated client can race clientContent or tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's model, tools, prompt, language, and voice configuration are installed. Evidence: hub/src/web/server.ts:88 and hub/src/web/server.ts:118.
Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName, data.systemInstruction)))
}

upstream.onmessage = (event) => {
    const text = typeof event.data === 'string'
        ? event.data
        : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
    const parsed = JSON.parse(text) as { setupComplete?: unknown }
    if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
        setupCompleteMap.set(clientWs, true)
        for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
        pendingClientFrames.delete(clientWs)
    }
    if (clientWs.readyState === 1) clientWs.send(event.data)
}

if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}

[Major] Pass decoded systemInstruction into Qwen setup — /api/voice/qwen-ws decodes ?systemPrompt= and stores it on ws.data, and the Qwen client sends that param for advanced voice settings, but createQwenProxyWebSocketHandler() casts the data without systemInstruction and calls buildQwenSessionUpdateMessage(data.language, data.voiceName). Qwen sessions therefore ignore the user-edited identity/character/delivery prompt while Gemini uses it. Evidence: hub/src/web/qwenProxyHandler.ts:53 and hub/src/web/qwenProxyHandler.ts:61.
Suggested fix:
```
const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))
```

Questions

None.

Summary
Review mode: follow-up after new commits
The previous Gemini setup-ack race remains present, and the new Qwen prompt transport path drops the decoded custom prompt before upstream setup. Add proxy-level tests for both setup contracts.
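A proxy-level test for the Gemini contract is easier to write if the gate is isolated from the socket plumbing. The sketch below is one way to do that and is not taken from the PR: `SetupGate` and `Sendable` are hypothetical names, and the hub is assumed to send its own setup frame to the upstream directly, outside the gate.

```typescript
type Frame = string | ArrayBuffer | Uint8Array

// Minimal stand-in for the upstream socket; only `send` is needed here.
interface Sendable {
    send(frame: Frame): void
}

class SetupGate {
    private upstream: Sendable
    private ready = false
    private pending: Frame[] = []

    constructor(upstream: Sendable) {
        this.upstream = upstream
    }

    // Client -> upstream: queue every frame until setup has been acknowledged.
    fromClient(frame: Frame): void {
        if (!this.ready) {
            this.pending.push(frame)
            return
        }
        this.upstream.send(frame)
    }

    // Upstream -> client: call with each upstream text frame before relaying it.
    fromUpstream(text: string): void {
        if (this.ready) return
        let parsed: unknown
        try {
            parsed = JSON.parse(text)
        } catch {
            return // non-JSON frames never open the gate
        }
        if (typeof parsed !== 'object' || parsed === null || !('setupComplete' in parsed)) return
        this.ready = true
        // Flush in arrival order, then forward later frames directly.
        for (const frame of this.pending.splice(0)) this.upstream.send(frame)
    }
}
```

A test can then drive the gate with a fake upstream that records frames: nothing should be recorded before a `setupComplete` frame arrives, and queued frames should appear in order afterwards.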

Testing

Not run (automation; static review only).

HAPI Bot

github-actions · 2026-06-01T07:16:16Z

+                    data.voiceName,
+                    data.systemInstruction
+                )))
+                for (const queued of pending.splice(0)) {


[Major] Gate Gemini client frames until setupComplete

This flushes queued client frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's model, tools, prompt, language, and voice setup is installed.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName, data.systemInstruction)))
}

if (setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
// Flip the gate and flush only after the first upstream setupComplete frame.

github-actions · 2026-06-01T07:16:16Z

+            })
+
+            upstreamMap.set(clientWs, upstream)
+            pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName)))


[Major] Pass the decoded prompt into Qwen's hub-owned setup

The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.

Suggested fix:

const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))

…service

Live-tested against DashScope international API:

- Model: qwen3-omni-flash-realtime → qwen3.5-omni-flash-realtime (previous model ID did not exist on DashScope)
- Default voice: Cherry → Tina (confirmed from session.created response on qwen3.5-omni-flash-realtime)
- Default WS base: dashscope.aliyuncs.com → dashscope-intl.aliyuncs.com (international accounts use the -intl endpoint; China endpoint rejects international API keys; QWEN_REALTIME_WS_URL env var still overrides)

Two dogfooding fixes verified against live Qwen Realtime session:

sendTextMessage: switch from instruction-injection to conversation.item.create

Qwen Realtime requires a user conversation item before response.create. The previous approach (updateInstructions + response.create) produced "input messages do not contain elements with role user" errors. Now sends {type:message, role:user, content:[{type:input_text}]} then response.create. sendContextualUpdate is unchanged (instruction-only, no response trigger).

Language handling: replace zh-only branch with buildVoiceLanguageBlock()

Previously, only language='zh' added any instruction; all other languages (including English) sent no language block, causing Qwen to drift to Chinese. buildVoiceLanguageBlock() now covers three cases:

- 'zh'/'zh-*': existing Chinese block (unchanged)
- explicit code ('en','es','fr',...): "Always respond in [Language]"
- undefined/auto: "Detect the user's language and maintain it"

Applied to buildGeminiLiveConfig, buildQwenSessionUpdateMessage, and the client-side currentInstructions mirror in QwenVoiceSession. Also removes the Gemini hub proxy's zh-only filter, which was discarding explicit language selections other than Chinese.
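The three language cases can be sketched as below. This is a stand-in and not the PR's `buildVoiceLanguageBlock()`: the function name `languageBlockSketch`, the small `LANGUAGE_NAMES` table, and the `ZH_BLOCK` placeholder are all hypothetical, and the wording of the real Chinese block is not shown in the PR text.

```typescript
// Illustrative table only; the real helper presumably covers more languages.
const LANGUAGE_NAMES: Record<string, string> = {
    en: 'English',
    es: 'Spanish',
    fr: 'French'
}

// Placeholder for the existing Chinese block, whose wording is not shown here.
const ZH_BLOCK = '<existing Chinese-language instruction block>'

function languageBlockSketch(language?: string): string {
    // Case 3: no explicit choice, so ask the model to follow the user.
    if (!language || language === 'auto') {
        return "Detect the user's language and maintain it."
    }
    const code = language.toLowerCase()
    // Case 1: 'zh' and any regional variant keep the existing block.
    if (code === 'zh' || code.startsWith('zh-')) return ZH_BLOCK
    // Case 2: explicit code; fall back to the raw code for unknown languages.
    const dash = code.indexOf('-')
    const base = dash === -1 ? code : code.slice(0, dash)
    const name = LANGUAGE_NAMES[base] ?? language
    return `Always respond in ${name}.`
}
```

The important property is that every input yields some instruction, so no language (English included) is left without a block.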

…-backends

# Conflicts:
#   hub/src/web/server.ts
#   shared/src/voice.ts
#   web/src/realtime/QwenVoiceSession.tsx

github-actions

Findings

[Major] Gate Gemini client frames until setupComplete - the proxy sends HAPI's setup, then immediately flushes queued frames and forwards later frames as soon as the upstream socket is open. A direct authenticated client can still race clientContent/tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's prompt, tools, language, and voice config are installed. Evidence: hub/src/web/server.ts:88 and hub/src/web/server.ts:118.
Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
}

upstream.onmessage = (event) => {
    const text = typeof event.data === 'string'
        ? event.data
        : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
    const parsed = JSON.parse(text) as { setupComplete?: unknown }
    if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
        setupCompleteMap.set(clientWs, true)
        for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
        pendingClientFrames.delete(clientWs)
    }
    if (clientWs.readyState === 1) clientWs.send(event.data)
}

if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}

[Major] Pass decoded systemInstruction into Qwen setup - /api/voice/qwen-ws decodes ?systemPrompt= and stores it on ws.data, and the Qwen client now sends that param for advanced voice settings. The proxy still casts data without systemInstruction and calls buildQwenSessionUpdateMessage(data.language, data.voiceName), so Qwen sessions ignore the user-edited identity/character/delivery prompt while Gemini applies it. Evidence: hub/src/web/qwenProxyHandler.ts:53 and hub/src/web/qwenProxyHandler.ts:61.
Suggested fix:
```
const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))
```

Questions

None.

Summary
Review mode: follow-up after new commits
The prior Gemini setup-ack race remains unresolved, and the Qwen prompt transport path still drops the decoded custom prompt before upstream setup. Add proxy-level tests for both setup contracts.
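For the Qwen side, a proxy-level test mainly needs to assert that the decoded prompt reaches the initial `session.update`. The sketch below uses a stand-in builder, since the output shape of the real `buildQwenSessionUpdateMessage` is not shown here: `sessionUpdateSketch`, `initialSetupFrame`, and the `session.voice` / `session.instructions` fields are assumptions for illustration.

```typescript
interface QwenWsData {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

// Stand-in builder; the field layout is assumed, not copied from the PR.
function sessionUpdateSketch(language?: string, voiceName?: string, systemInstruction?: string) {
    const parts: string[] = []
    if (systemInstruction) parts.push(systemInstruction)
    if (language) parts.push(`Language: ${language}`)
    return {
        type: 'session.update',
        session: {
            voice: voiceName ?? 'Tina', // default voice named in this PR
            instructions: parts.join('\n')
        }
    }
}

// What the handler would queue as the hub-owned setup frame.
function initialSetupFrame(data: QwenWsData): string {
    return JSON.stringify(sessionUpdateSketch(
        data.language,
        data.voiceName,
        data.systemInstruction // the argument the review says is currently dropped
    ))
}
```

A regression test would build the frame from a `ws.data` object carrying a custom prompt and fail if the prompt text is missing from the serialized setup.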

Testing

Not run (automation; bun is not installed in this runner).

HAPI Bot

github-actions · 2026-06-01T20:55:36Z

+                    data.voiceName,
+                    data.systemInstruction
+                )))
+                for (const queued of pending.splice(0)) {


[Major] Gate Gemini client frames until setupComplete

This flushes queued frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's prompt, tools, language, and voice setup is installed.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName, data.systemInstruction)))
}

if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
// Flip the gate and flush only after the first upstream setupComplete frame.

github-actions · 2026-06-01T20:55:36Z

+            })
+
+            upstreamMap.set(clientWs, upstream)
+            pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName)))


[Major] Pass the decoded prompt into Qwen's hub-owned setup

The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.

Suggested fix:

const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))

…I_KEY

github-actions

Findings

[Major] Gate Gemini client frames until setupComplete - the proxy sends HAPI's setup and then flushes queued client frames immediately on upstream open, and later forwards frames whenever the upstream socket is open. A direct authenticated client can still race clientContent or tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's prompt, tools, language, and voice config are installed. Evidence: hub/src/web/server.ts:88.
Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
}

upstream.onmessage = (event) => {
    const text = typeof event.data === 'string'
        ? event.data
        : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
    const parsed = JSON.parse(text) as { setupComplete?: unknown }
    if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
        setupCompleteMap.set(clientWs, true)
        for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
        pendingClientFrames.delete(clientWs)
    }
    if (clientWs.readyState === 1) clientWs.send(event.data)
}

if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
upstream.send(message)

[Major] Pass decoded systemInstruction into Qwen setup - /api/voice/qwen-ws decodes ?systemPrompt= and stores it on ws.data, and the Qwen client sends that param for advanced voice settings. The extracted proxy handler still casts data without systemInstruction and calls buildQwenSessionUpdateMessage(data.language, data.voiceName), so Qwen sessions ignore the user-edited identity/character/delivery prompt while Gemini applies it. Evidence: hub/src/web/qwenProxyHandler.ts:61.
Suggested fix:
```
const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))
```

Questions

None.

Summary
Review mode: follow-up after new commits
The prior Gemini setup-ack race remains unresolved, and the Qwen prompt transport path still drops the decoded custom prompt before upstream setup. Add proxy-level tests for both setup contracts.

Testing

Not run (automation; PR review security policy says not to execute PR code).

HAPI Bot

github-actions · 2026-06-01T21:02:41Z

+                    data.voiceName,
+                    data.systemInstruction
+                )))
+                for (const queued of pending.splice(0)) {


[Major] Gate Gemini client frames until setupComplete

This flushes queued frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's prompt, tools, language, and voice setup is installed.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
}

if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
// Flip the gate and flush only after the first upstream setupComplete frame.

github-actions · 2026-06-01T21:02:41Z

+            })
+
+            upstreamMap.set(clientWs, upstream)
+            pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName)))


[Major] Pass the decoded prompt into Qwen's hub-owned setup

The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.

Suggested fix:

const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))

Discovered during dogfooding that the advanced settings disclosure mixed two unrelated concerns: acoustic delivery and conversational behaviour.

Remove the top-level "Advanced voice settings" collapsible entirely. Replace with two flat, always-visible subsections:

How It Sounds: voice backend, voice picker, acoustic tuning sliders (stability, expressiveness, speaking rate, similarity, affective dialog, per-backend hints). All controls that affect the audio signal.

How It Responds: language, proactive toggle, identity, character prompt, delivery preset, platform fixtures. All controls that affect what the agent chooses to say and in what language.

Language moves from above How It Sounds (where it was awkwardly grouped with the backend picker) into How It Responds, where it belongs. Proactive toggle moves from the separate "Session behavior" block into How It Responds; it is a conversational behaviour control, not an acoustic one.

VoiceAdvancedControls exports two named components (VoiceRespondsControls, VoiceSoundsControls); the old VoiceAdvancedControls export is kept as a deprecated re-export for any external callers.

zh-CN locale: 声音效果 / 对话风格.

github-actions

Findings

[Major] Gate Gemini client frames until setupComplete - the proxy sends HAPI's setup and then immediately flushes queued client frames, and message() forwards later frames whenever the upstream socket is open. A direct authenticated client can still race clientContent or tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's prompt, tools, language, and voice config are installed. Evidence: hub/src/web/server.ts:88.
Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
}

upstream.onmessage = (event) => {
    const text = typeof event.data === 'string'
        ? event.data
        : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
    const parsed = JSON.parse(text) as { setupComplete?: unknown }
    if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
        setupCompleteMap.set(clientWs, true)
        for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
        pendingClientFrames.delete(clientWs)
    }
    if (clientWs.readyState === 1) clientWs.send(event.data)
}

if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
upstream.send(message)

[Major] Pass decoded systemInstruction into Qwen setup - /api/voice/qwen-ws decodes ?systemPrompt= and stores it on ws.data, and the Qwen client sends that param for advanced voice settings. The extracted proxy handler still casts data without systemInstruction and calls buildQwenSessionUpdateMessage(data.language, data.voiceName), so Qwen sessions ignore the user-edited identity/character/delivery prompt while Gemini applies it. Evidence: hub/src/web/qwenProxyHandler.ts:61.
Suggested fix:
```
const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))
```

Questions

None.

Summary

Review mode: follow-up after new commits
The previous Qwen ack gate issue is partly addressed, but the Gemini setup race remains and Qwen still drops the decoded custom prompt before upstream setup. Add proxy-level tests covering Gemini setupComplete gating and Qwen custom prompt propagation.

Testing

Not run (automation; PR review security policy says not to execute PR code).

HAPI Bot

github-actions · 2026-06-01T21:33:08Z

+                    data.voiceName,
+                    data.systemInstruction
+                )))
+                for (const queued of pending.splice(0)) {


[Major] Gate Gemini client frames until setupComplete

This flushes queued frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's prompt, tools, language, and voice setup is installed.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
}

// Queue all client frames until the first upstream setupComplete frame.
if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
upstream.send(message)

github-actions · 2026-06-01T21:33:08Z

+            })
+
+            upstreamMap.set(clientWs, upstream)
+            pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName)))


[Major] Pass the decoded prompt into Qwen's hub-owned setup

The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.

Suggested fix:

const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))

heavygee marked this pull request as ready for review May 31, 2026 00:53

github-actions Bot reviewed May 31, 2026

Comment thread hub/src/web/server.ts Outdated

heavygee mentioned this pull request May 31, 2026

feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime #692

Open

9 tasks

heavygee and others added 25 commits May 31, 2026 15:39

fix(voice): restore user mic mute state after Gemini turn completes

58ebf8e

turnComplete handler was unconditionally calling setMuted(false), which re-enabled the mic track even when the user had manually muted. Now restores to state.micMuted instead.
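The fix described in this commit comes down to restoring the stored preference instead of a constant. A minimal sketch, assuming a state object with a `micMuted` flag and a `setMuted` callback as named in the commit message (the `MicState` interface and `onTurnCompleteSketch` name are hypothetical):

```typescript
// Hypothetical state shape; only `micMuted` is named in the commit message.
interface MicState {
    micMuted: boolean // the user's own mute choice
}

// Before the fix (as described), the handler always called setMuted(false).
// After the fix, it restores whatever the user had chosen.
function onTurnCompleteSketch(state: MicState, setMuted: (muted: boolean) => void): void {
    setMuted(state.micMuted)
}
```

With this shape, a user who muted manually stays muted after the agent's turn, while an unmuted user gets the mic back as before.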

heavygee and others added 11 commits May 31, 2026 23:24

feat(voice): backend chooser and per-provider voice settings (tiann#742)

6036734

Expose configured backends from hub, let Settings pick provider when multiple API keys exist, and wire voice selection through Gemini/Qwen/ElevenLabs paths. Co-authored-by: Cursor <cursoragent@cursor.com>

fix(web): show Gemini/Qwen voice descriptions in Settings

abe1ef5

Surface catalog descriptions on the voice row and in the picker, with a hint when preview is ElevenLabs-only. Disabled preview buttons stay visible. Co-authored-by: Cursor <cursoragent@cursor.com>

chore(dev): harden voice settings Playwright dogfood wait strategy

d24d993

Co-authored-by: Cursor <cursoragent@cursor.com>

fix(web): VoiceAdvancedControls checkbox handler typo

c841ae3

fix(web): ElevenLabs overrides keep client tools, skip default prompt

daa838e

fix(web): cap ElevenLabs WebRTC payload (65KB message limit)

cbfbd5b

heavygee force-pushed the feat/voice-selection-all-backends branch from 7469b27 to 238ad4c Compare June 1, 2026 02:19