feat(voice): backend voice picker + advanced controls behind disclosure (#742) #743

Open
heavygee wants to merge 49 commits into tiann:main from heavygee:feat/voice-selection-all-backends
@heavygee
Contributor

@heavygee heavygee commented May 30, 2026

Reviewers: do not use the default "Files changed" tab on this PR.
This branch is stacked on #692 (feat/pluggable-voice-backend). Against tiann/main GitHub will show the union of #692 + this PR.
Review only: heavygee/hapi@feat/pluggable-voice-backend...feat/voice-selection-all-backends
Do not merge until #692 lands, then rebase onto main so this PR shrinks to its own delta only.

Summary

Two voice features against #742, stacked on #692. The visible default surface stays small (one picker, one toggle); everything else lives behind a single collapsed disclosure so a user who just wants to pick a voice never sees the tuning UI.

Backend-aware voice picker

  • shared/voicePickerCatalog.ts - static Gemini/Qwen voice lists, per-backend localStorage keys, resolve helpers
  • Hub GET /api/voice/backend returns { backend, backends } when multiple providers are configured
  • Hub GET /api/voice/voices - ElevenLabs list available even when default backend is Gemini
  • Hub gemini-ws - ?voice= query param wired into setup message
  • Web Settings - voice backend chooser (when 2+ backends), voice list follows selection, Gemini/Qwen descriptions, preview-is-EL-only hint
  • Voice sessions - VoiceSessionConfig.voiceName + stored preference per backend
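
A minimal sketch of how the per-backend storage keys and resolve helpers listed above might fit together. The key format, the KeyValueStore interface, and the catalog entries are assumptions for illustration; the actual lists and helpers live in shared/voicePickerCatalog.ts and may differ.

```typescript
// Hypothetical shapes; not the actual exports of shared/voicePickerCatalog.ts.
type CatalogBackend = 'gemini-live' | 'qwen-realtime'

// Minimal subset of the Web Storage API so the helper can run without a browser.
interface KeyValueStore {
    getItem(key: string): string | null
    setItem(key: string, value: string): void
}

// Illustrative entries only; the real catalog is assumed to be longer.
const VOICE_CATALOG: Record<CatalogBackend, readonly string[]> = {
    'gemini-live': ['Puck', 'Kore'],
    'qwen-realtime': ['Cherry']
}

// Assumed key format: one key per backend, so switching backends keeps each choice.
function voiceStorageKey(backend: CatalogBackend): string {
    return `hapi-voice-name:${backend}`
}

// Returns the stored voice if the catalog still lists it, else the backend's first voice.
function resolveVoiceName(backend: CatalogBackend, store: KeyValueStore): string {
    const voices = VOICE_CATALOG[backend]
    const stored = store.getItem(voiceStorageKey(backend))
    return stored !== null && voices.includes(stored) ? stored : voices[0]
}
```

Validating the stored value against the catalog means a voice that is later dropped from a backend's list falls back to a supported one instead of being sent upstream.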

Composed system prompt + bootstrap-and-stream context (folded in from feat/voice-advanced-controls)

  • Layered prompt in shared/voicePromptLayers.ts: platform fixtures (read-only - tool contracts, routing, TTS rules) + provider guardrails + user-editable identity + user-editable character. composeVoiceAgentPrompt merges them.
  • Bootstrap + stream context: small initial conversation payload at handshake (~4 KB) plus streaming chunks via sendContextualUpdate after connect. Honest UI wire-budget hints.
  • All three backends: ElevenLabs ConvAI, Gemini Live, and Qwen Realtime each compose + bootstrap + stream.
  • ElevenLabs minimal-overrides discipline: empty prefs produce a minimal {agent:{language:'en'}} payload (byte-parity with upstream baseline). Custom layers/sliders/voiceId opt in their respective override fields. Fixes the unauthorized-override crash (Cannot read properties of undefined ('error_type')).
  • Hub-side ConvAI override reconciliation: on every /voice/token resolution the hub PATCHes the agent's platform_settings.overrides to match buildVoiceAgentConfig() (agent.prompt + tts.voice_id/stability/similarity_boost/style/speed). Idempotent per-process and best-effort - existing agents that predate the override declaration now self-heal on next session start instead of requiring operator-side console edits.
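
The layered prompt from the first bullet above could be composed roughly as follows. This is a sketch under assumptions: the VoicePromptLayers field names, the layer order, and the blank-line separator are illustrative, and only the function name composeVoiceAgentPrompt comes from the PR description.

```typescript
// Hypothetical layer shape; the real definitions live in shared/voicePromptLayers.ts.
interface VoicePromptLayers {
    platformFixtures: string   // read-only: tool contracts, routing, TTS rules
    providerGuardrails: string // per-backend constraints
    identity: string           // user-editable
    character: string          // user-editable
}

// Joins the non-empty layers in a fixed order, with the read-only layers first.
function composeVoiceAgentPrompt(layers: VoicePromptLayers): string {
    return [
        layers.platformFixtures,
        layers.providerGuardrails,
        layers.identity,
        layers.character
    ]
        .map((layer) => layer.trim())
        .filter((layer) => layer.length > 0)
        .join('\n\n')
}
```

Dropping empty layers keeps the composed prompt free of stray blank sections when a user clears the identity or character editor.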

UI discipline

  • New web/src/components/settings/VoiceAdvancedControls.tsx wraps fixtures preview, identity editor, character editor, delivery preset selector, and tuning sliders inside one master Advanced voice settings disclosure (collapsed by default).
  • Sub-sections (fixtures / identity / character / delivery / tuning) start collapsed when the master opens.
  • A customized badge appears next to the master title if any layer differs from defaults so power users still find their tweaks.
  • Backend picker + voice picker remain at the top, outside the disclosure.

Test plan

  • bunx tsc --noEmit (hub + web)
  • bun test voice routes (hub) - 20 pass, including the new "reconciles platform_settings.overrides on existing agents" test
  • bun test voice client tests (web) - voicePersonalitySession + voiceContextPlan green
  • hapi-driver-rebuild --build-web --verify then dogfood the three backends end-to-end on driver
  • Operator dogfood on driver soup (PR review gate)

Merge order

  1. Merge feat(voice): pluggable voice backend with Gemini Live & Qwen Realtime #692 (feat/pluggable-voice-backend)
  2. Rebase this branch onto upstream/main (the PR diff should then drop to this PR's own delta), then merge feat(voice): backend voice picker + advanced controls behind disclosure (#742) #743

Issues

Ref #742
Blocked by #692

@heavygee heavygee marked this pull request as ready for review May 31, 2026 00:53
@heavygee
Contributor Author

Stack note: This PR is blocked by #692. For review, prefer the incremental diff on the fork:

heavygee/hapi@feat/pluggable-voice-backend...feat/voice-selection-all-backends

(4 commits: catalog scaffold, backend chooser, Gemini/Qwen descriptions, Playwright dogfood script.)

@github-actions github-actions Bot left a comment

Findings

  • [Major] Qwen proxy forwards arbitrary client frames with the hub API key — POST /api/voice/qwen-token correctly keeps the DashScope key server-side, but the new /api/voice/qwen-ws proxy then opens DashScope with that key and blindly forwards every browser frame to upstream. Unlike the Gemini path, which rejects client-provided setup frames, any authenticated web client can connect directly and send its own session.update/response.create payloads, turning the hub into a generic DashScope proxy and allowing client-controlled instructions/tools under server credentials. Evidence: hub/src/web/server.ts:159.
    Suggested fix:
    const allowedQwenRuntimeEvents = new Set([
        'input_audio_buffer.append',
        'input_audio_buffer.commit',
        'response.create',
        'conversation.item.create'
    ])
    
    function parseQwenClientEvent(message: string | ArrayBuffer | Uint8Array): { type?: string } | null {
        try {
            return JSON.parse(decodeWsText(message)) as { type?: string }
        } catch {
            return null
        }
    }
    
    message(clientWs: ServerWebSocket<unknown>, message: string | ArrayBuffer | Uint8Array) {
        const event = parseQwenClientEvent(message)
        if (!event?.type || !allowedQwenRuntimeEvents.has(event.type)) {
            try { clientWs.close(1008, 'Client-provided Qwen setup is not allowed') } catch { /* */ }
            return
        }
    
        const upstream = upstreamMap.get(clientWs)
        if (upstream?.readyState === WebSocket.OPEN) {
            upstream.send(message)
        }
    }
    Move the initial session.update construction into the hub proxy, as Gemini does with buildGeminiLiveSetupMessage, and only let the browser send runtime audio/tool-response events afterward.

Questions

  • None.

Summary

  • Review mode: initial
  • One issue found: Qwen realtime proxy needs the same server-owned setup boundary as Gemini before this is safe to merge.

Testing

  • Not run (automation; static review only, per PR security instructions).

HAPI Bot

Comment thread hub/src/web/server.ts Outdated
@heavygee
Contributor Author

Noise check (re: default diff vs main)

| Compare base | Commits | Files |
| --- | --- | --- |
| tiann/main (GitHub default) | 29 | 35 |
| feat/pluggable-voice-backend (#692 tip) | 4 | 23 |

GitHub would not let us retarget base to feat/pluggable-voice-backend on tiann/hapi (branch only exists on heavygee/hapi). Until #692 merges, use the compare link in the PR description for review.

Automated review note: The github-actions MAJOR on Qwen proxy (hub/src/web/server.ts) is from the #692 stack in this branch, not from the 4 #742 commits — please route that feedback to #692 if still open.

heavygee and others added 25 commits May 31, 2026 15:39
Rebased from Overbaker/hapi#401 onto current main. Adds a pluggable voice
backend architecture that extends the existing ElevenLabs integration:

- Gemini 2.5 Live (gemini-live): Google real-time audio via WebSocket
  with full function calling (messageCodingAgent, processPermissionRequest)
- Qwen Realtime (qwen-realtime): Alibaba DashScope via hub WebSocket
  proxy (browser cannot set Authorization header directly)
- VoiceBackendSession: dynamic backend selector with React.lazy loading,
  gates voice button until backend module is registered
- Hub WS proxies: JWT-authenticated /api/voice/gemini-ws and
  /api/voice/qwen-ws endpoints in Bun.serve, with message queueing during
  upstream connect to prevent dropped setup frames
- AudioWorklet pipeline: inline Blob URL recorder, 24 kHz PCM player,
  serial tool call execution, AudioContext created in user gesture for mobile
- Backend discovery: GET /voice/backend + POST /voice/gemini-token /
  POST /voice/qwen-token hub routes; frontend auto-detects active backend

Merge notes:
- Rebased 135 upstream commits cleanly; HappyComposer keeps upstream's
  configurable enter-behavior setting (supersedes hard-coded Ctrl+Enter)
- Converted gemini test files from bun:test to vitest (web package uses vitest)
- All 221 hub tests and 636 web tests pass; TypeScript clean

turnComplete handler was unconditionally calling setMuted(false), which
re-enabled the mic track even when the user had manually muted. Now
restores to state.micMuted instead.

buildGeminiLiveConfig was appending VOICE_CHINESE_LANGUAGE_BLOCK which
forced Gemini to always respond in Mandarin regardless of user locale.
Gemini now uses the neutral base prompt and responds in the language the
user speaks to it, consistent with the ElevenLabs behaviour.

If the session closes while Gemini is mid-speech, cleanup() left
state.modelSpeaking=true. The next startSession() would then drop all
mic audio in sendAudioChunk() until a model turn eventually flipped
the flag - effectively deaf until page reload.

ws.onclose operated on module-level state.ws, not the socket that fired
the event. A rapid stop/restart could cause the old socket's onclose to
call cleanup() after the new socket was assigned, tearing down the live
session. Guard with `if (state.ws !== ws) return` before cleanup.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
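
The stale-socket guard from the onclose fix above (`if (state.ws !== ws) return`) can be illustrated in isolation. This is a sketch only: SocketLike, attachSocket, and the cleanupCalls counter are hypothetical stand-ins, not the session module's real state.

```typescript
// Minimal stand-in for a WebSocket; only the close callback matters here.
interface SocketLike {
    onclose: (() => void) | null
}

// Module-level state, mirroring the pattern described in the commit message.
const state: { ws: SocketLike | null; cleanupCalls: number } = { ws: null, cleanupCalls: 0 }

function cleanup(): void {
    state.ws = null
    state.cleanupCalls += 1
}

// Registers a socket as the live one and installs a close handler bound to that socket.
function attachSocket(ws: SocketLike): void {
    state.ws = ws
    ws.onclose = () => {
        // A stale socket closing late must not tear down its replacement.
        if (state.ws !== ws) return
        cleanup()
    }
}
```

Because the handler closes over its own `ws`, a late close event from a replaced socket becomes a no-op while the live socket's close still triggers cleanup.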
Matches the Gemini fix — both backends now use VOICE_SYSTEM_PROMPT
without the Chinese language block, giving consistent English-default
behaviour across all non-ElevenLabs backends.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
Adds a "Proactive voice" toggle (default: off = reactive) to the Voice
Assistant settings section.

Reactive (default): initial context and agent-ready events are fed
silently; the assistant waits for the user to speak first.

Proactive: original behaviour — Gemini/Qwen narrate context on connect
and speak unprompted when the agent finishes a task. ElevenLabs is also
affected via onReady sending a user message rather than a silent update.

Covers all three backends uniformly. localStorage key: hapi-voice-proactive.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
… visibility

- hub/server.ts: add toClientCloseCode() to normalize reserved upstream
  close codes (1005/1006/1015) to 1011 before forwarding to browser;
  abnormal upstream drops (1006) would otherwise throw on clientWs.close()
  and leave the browser socket open

- realtime/index.ts: remove static GeminiLiveVoiceSession and QwenVoiceSession
  barrel exports; VoiceBackendSession lazy-imports both, so barrel re-exports
  created static dependencies that defeated the intended code-split

- App.tsx: gate global useVisibilityReporter on !sessionEventSubscription so
  the always-on SSE connection does not suppress native Web Push notifications
  for sessions the user is not currently viewing

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
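
A minimal sketch of the close-code normalization described in the hub/server.ts bullet above. Only the function name and the 1005/1006/1015 to 1011 mapping come from the commit message; the signature and the Set-based lookup are assumptions.

```typescript
// Close codes reserved by the WebSocket protocol: they describe how a connection
// ended and must not be sent in a Close frame by an endpoint.
const RESERVED_CLOSE_CODES = new Set<number>([1005, 1006, 1015])

// Maps reserved upstream codes to 1011 (internal error) so that forwarding the
// code through close() does not throw; all other codes pass through unchanged.
function toClientCloseCode(upstreamCode: number): number {
    return RESERVED_CLOSE_CODES.has(upstreamCode) ? 1011 : upstreamCode
}
```

A proxy would then call something like `clientWs.close(toClientCloseCode(event.code), reason)` when relaying an upstream close.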
…toggle label

- buildGeminiLiveConfig() now accepts optional language param; appends
  VOICE_CHINESE_LANGUAGE_BLOCK only when language === 'zh'
- GeminiLiveVoiceSession passes config.language through
- QwenVoiceSession conditionally builds basePrompt from language setting
- Fixes silent no-op when user selects Chinese in voice settings on
  Gemini/Qwen backends (was ElevenLabs-only)

- Rename voice-start toggle label to 'Start voice session with summary'
- Fix description: clarifies the choice is about session-open behaviour
  (summary vs greeting), not ongoing narration

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
Gemini Live has no built-in first-message like ElevenLabs agents do;
without an explicit turnComplete:true it sits silently. In reactive mode
(default, toggle off) now sends a greeting instruction after any silent
context feed so Gemini introduces itself and invites the user to speak.

Proactive mode is unchanged: the context summary is the opening speech.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
…reeting

- VOICE_SYSTEM_PROMPT: explicit instruction never to call itself Gemini,
  Google, or any underlying model/provider name — always HAPI
- Greeting trigger text: instruct to greet as HAPI only, suppress model
  name and any reference to context/recent activity in the opening line

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
Gemini + Qwen client:
- onerror now sets setupDone/sessionReady and nulls state.ws before
  calling reject(), so the stale-close guard trips in onclose and
  prevents a duplicate statusCallback('error') on WS failure

Gemini client:
- Proactive mode with no initialContext now falls through to the
  greeting trigger instead of sitting silently
- Remove unused handleBargeIn callback (dead code)

Qwen client:
- Add input_audio_sample_rate: 16000 to session.update so PCM rate
  is declared explicitly rather than relying on DashScope's default

Hub proxy:
- Remove no-op ternary in Gemini flush loop and message handler
  (typeof x === 'string' ? x : x); use upstream.send(msg) directly
- Qwen onerror now calls upstreamMap.delete() before closing client,
  eliminating the stale map entry window
- Align Qwen hub fallback model string with QWEN_REALTIME_MODEL
  constant ('qwen3-omni-flash-realtime')

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
hub/voice.ts:
- Replace string-concat WS URL construction with buildVoiceWsUrl() which
  uses URL API to set protocol/pathname cleanly — fixes double-slash when
  HAPI_PUBLIC_URL has a trailing slash (would silently skip the proxy route)

QwenVoiceSession.tsx:
- Wrap tool definitions in {type:'function', function:{...}} as required
  by Qwen-Omni realtime schema — previous flat shape caused session.update
  rejection before audio capture could start
- Use pcm16/pcm24 audio formats matching DashScope spec; remove
  input_audio_sample_rate (encoded in format name)

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
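
The URL-API approach from the hub/voice.ts note above might look like the following. The parameter list is an assumption (the commit message names buildVoiceWsUrl() but not its signature), and hub.example.com is a placeholder host.

```typescript
// Builds a ws:// or wss:// URL for a proxy route from the hub's public HTTP(S) URL.
// Assigning pathname replaces whatever path the base URL had, so a trailing
// slash on the base cannot produce a double slash.
function buildVoiceWsUrl(publicUrl: string, routePath: string): string {
    const url = new URL(publicUrl)
    url.protocol = url.protocol === 'https:' ? 'wss:' : 'ws:'
    url.pathname = routePath
    return url.toString()
}

// String concatenation, by contrast, keeps both slashes.
const naive = 'https://hub.example.com/' + '/api/voice/gemini-ws'
const built = buildVoiceWsUrl('https://hub.example.com/', '/api/voice/gemini-ws')
```

Here `naive` contains `//api/voice/gemini-ws`, while `built` has a single slash before `api`.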
…ose codes

GeminiLiveVoiceSession + QwenVoiceSession:
- startAudioCapture() is now async and awaits recorder.start() before
  calling setMuted() — previously setMuted ran before getUserMedia resolved
  so a session restarted while muted would open the mic anyway
- statusCallback('connected') now fires after audio is ready
- setMuted() called unconditionally (not just when true) to correctly
  apply saved state in either direction

hub/src/web/server.ts:
- Both Gemini and Qwen close() handlers now pass the client code through
  toClientCloseCode() before forwarding to upstream — prevents reserved
  codes (e.g. 1006) from causing WebSocket.close() to throw and leave
  the upstream session open until provider timeout
- Reason string capped at 123 bytes (WebSocket protocol limit)

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
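
The 123-byte cap on the close reason follows from the protocol: a Close frame's payload is limited to 125 bytes, and 2 of those carry the status code. One way to apply the cap without splitting a multi-byte UTF-8 character is sketched below; capCloseReason is a hypothetical helper name, not necessarily what the hub uses.

```typescript
// Truncates a close reason to at most maxBytes of UTF-8, cutting only on
// code point boundaries so the result stays valid text.
function capCloseReason(reason: string, maxBytes: number = 123): string {
    const encoder = new TextEncoder()
    let result = ''
    let usedBytes = 0
    for (const codePoint of reason) { // string iteration yields whole code points
        const size = encoder.encode(codePoint).length
        if (usedBytes + size > maxBytes) break
        result += codePoint
        usedBytes += size
    }
    return result
}
```

Counting bytes rather than string length matters because a reason with non-ASCII characters can exceed the limit while still being under 123 characters long.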
An unhandled rejection inside the async onmessage callback does not
propagate to the outer startSession Promise — the UI hangs on
'connecting' and the provider socket stays partially open. Wrapping
the await in try/catch calls cleanup()/statusCallback('error')/reject()
so failures surface correctly.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
…alling back to ElevenLabs

fetchVoiceBackend no longer catches errors and defaults to 'elevenlabs' — any
network or server failure now throws so VoiceBackendSession can surface it via
onStatusChange('error', ...) rather than silently mounting the wrong backend.

VoiceBackendSession also resets backend state to null when api changes, so
a stale ElevenLabs registration from a prior discovery cannot persist into
a new session.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
…alling back to ElevenLabs

Unknown backend strings (future values, typos) now throw rather than defaulting
to elevenlabs, closing the narrow remaining form of the original misrouting bug.
Also removes the unnecessary `as VoiceBackendResponse` cast.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
…r base64 uploads

Qwen session.updated handler now sends the same proactive summary or greeting
trigger that Gemini does — previously it started silently in both proactive and
reactive modes.

maxHttpBufferSize raised to 68 MiB to account for base64 expansion: 50 MiB
decoded files become ~66.7 MiB as base64 JSON, so the previous 55 MiB ceiling
would disconnect uploads above ~41 MiB before they reached the CLI.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
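
The buffer-size arithmetic in the commit above can be checked with a short calculation. The helper names are illustrative, and the sketch ignores the small overhead of the surrounding JSON envelope.

```typescript
const MIB = 1024 * 1024

// Base64 emits 4 output characters for every 3 input bytes, padding the final group.
function base64EncodedSize(decodedBytes: number): number {
    return Math.ceil(decodedBytes / 3) * 4
}

// Largest decoded payload whose base64 form fits under a given encoded ceiling.
function maxDecodedSize(encodedLimitBytes: number): number {
    return Math.floor(encodedLimitBytes / 4) * 3
}

// A 50 MiB file grows to roughly 66.7 MiB once encoded.
const encodedFiftyMiB = base64EncodedSize(50 * MIB) / MIB
// The previous 55 MiB ceiling admitted at most 41.25 MiB of decoded data.
const oldDecodedCeilingMiB = maxDecodedSize(55 * MIB) / MIB
```

The encoded 50 MiB file fits under the new 68 MiB limit with a little room for the JSON wrapper, which the old 55 MiB limit did not allow.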
….update for Qwen text

Qwen's realtime API only supports conversation.item.create for function_call_output.
Sending it with type:'message' for greetings/context was invalid and could fail
before the user spoke.

sendTextMessage and sendContextualUpdate now update session instructions via
session.update (accumulating context into the system prompt) and trigger
response.create only when a spoken reply is needed — matching Qwen's supported
client event surface.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
…n start

session.updated now returns early after the first ack — subsequent session.update
calls (instruction appends) also echo session.updated but must not re-trigger
audio capture or the greeting path.

currentSessionConfig is now reset to null at the top of startSession so a stale
config from a failed previous session cannot leak into the new one.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
Without this guard, a missing wsUrl in the hub token response would
silently attempt to connect directly to Google with "proxied" as the
API key — producing a confusing auth failure instead of a clear error.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
DashScope realtime API accepts only 'pcm' for both input and output
audio formats. The pcm16/pcm24 values caused session.update rejection
before audio capture could start, leaving the Qwen backend unusable.

Also updates the default voice from Mia (not in the qwen3-omni-flash-
realtime voice list) to Cherry, which is documented as supported.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
Failed token fetch, microphone denial, or WebSocket error during
setup left state.playbackContext open. Each failure path now calls
cleanup() before throwing/rejecting, preventing AudioContext leaks
on mobile browsers with hard limits on concurrent contexts.

via [HAPI](https://hapi.run)

Co-Authored-By: HAPI <noreply@hapi.run>
Reverts changes to files that shouldn't differ from upstream:
- .gitignore: remove fork-only AGENTS.local.md entry
- web/src/App.tsx: restore dual-subscription SSE pattern (scope-aware)
- web/src/hooks/useSSE.ts: restore SSEScope/scope parameter
- web/src/hooks/useSSE.test.ts: restore (was accidentally deleted)
- web/src/lib/appSseSubscriptions.ts: restore (was accidentally deleted)
- web/src/lib/appSseSubscriptions.test.ts: restore (was accidentally deleted)
- hub/src/sync/syncEngine.ts: restore (off-topic change)
Hub sends HAPI-owned Gemini setup on proxy connect and rejects client
setup frames. Qwen proxy always uses QWEN_REALTIME_MODEL instead of a
client query parameter. Shared buildGeminiLiveSetupMessage() keeps wire
format in one place.

Co-authored-by: Cursor <cursoragent@cursor.com>
heavygee and others added 11 commits May 31, 2026 23:24
Expose configured backends from hub, let Settings pick provider when multiple
API keys exist, and wire voice selection through Gemini/Qwen/ElevenLabs paths.

Co-authored-by: Cursor <cursoragent@cursor.com>
Surface catalog descriptions on the voice row and in the picker, with a
hint when preview is ElevenLabs-only. Disabled preview buttons stay visible.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
When hub/.env (operator-local) sets GEMINI_API_KEY, the 'falls back to
elevenlabs for unknown VOICE_BACKEND values' test leaks gemini-live
into the backends list. Delete the four non-elevenlabs key env vars
defensively at the start of the test, mirroring the cleanup pattern
used by the other tests in the same describe block. No behavior change.

Co-authored-by: Cursor <cursoragent@cursor.com>
Stacks on voice-selection-all-backends. Adds shared voice-personality presets,
Settings accordions (character + backend-specific tuning), and ElevenLabs session
overrides from stored preferences.

Co-authored-by: Cursor <cursoragent@cursor.com>
Replace append-only personality notes with the bundled HAPI voice system
prompt in an advanced editor. User edits replace the base instruction for
ElevenLabs, Gemini (incl. hub proxy), and Qwen. Presets only drive TTS
sliders unless the user appends delivery text explicitly.

Co-authored-by: Cursor <cursoragent@cursor.com>
Split voice instructions into platform fixtures, provider guardrails, and
editable identity/character; compose at runtime for ElevenLabs, Gemini, and Qwen.
Session history uses a small connect bootstrap plus deferred contextual chunks,
gated by the proactive-summary setting. Settings UI shows wire budgets and
read-only fixtures.

Co-authored-by: Cursor <cursoragent@cursor.com>
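
One way to split session history into a small connect bootstrap plus deferred chunks, as the commit above describes, is sketched below. The newest-first selection, the planVoiceContext name, and the exact 4096-byte budget are assumptions; the PR description only says the handshake payload is about 4 KB and that the rest is streamed via sendContextualUpdate after connect.

```typescript
// Assumed budget for the handshake payload ("~4 KB" in the PR description).
const BOOTSTRAP_BUDGET_BYTES = 4 * 1024

interface ContextPlan {
    bootstrap: string[] // sent with the initial handshake
    deferred: string[]  // streamed after connect, oldest first
}

function utf8Length(text: string): number {
    return new TextEncoder().encode(text).length
}

// Keeps as many of the most recent messages as fit in the budget for the
// bootstrap; everything older is deferred for streaming after connect.
function planVoiceContext(messages: string[], budget: number = BOOTSTRAP_BUDGET_BYTES): ContextPlan {
    const bootstrap: string[] = []
    let used = 0
    let index = messages.length - 1
    for (; index >= 0; index--) {
        const size = utf8Length(messages[index])
        if (used + size > budget) break
        bootstrap.unshift(messages[index])
        used += size
    }
    return { bootstrap, deferred: messages.slice(0, index + 1) }
}
```

Preferring recent messages for the bootstrap is a design guess: it gives the assistant the freshest context at connect time while older history arrives in the background.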
The web workspace runs tests via vitest run. Both voice tests imported
describe/expect/test from 'bun:test', which vite cannot bundle and
caused 'Cannot bundle built-in module bun:test' transform failures
during driver soup verify. Only generic test APIs are used; switching
the import to 'vitest' is a no-op behavior change.

Co-authored-by: Cursor <cursoragent@cursor.com>
@heavygee
Contributor Author

Heads-up from a downstream soup-rebuild (heavygee/hapi driver/integration, 2026-05-31):

To get this branch + the voice-advanced layer building cleanly on the current feat/pluggable-voice-backend tip, two pre-existing test bugs needed fixing. Both are local-only commits on the rebased branches and will be lost on the next force-push unless preserved.

1. On feat/voice-selection-all-backends (this PR)

238ad4c test(voice): isolate VOICE_BACKEND fallback test from leaked env vars

hub/src/web/routes/voice.test.ts | 4 ++++

The "falls back to elevenlabs for unknown VOICE_BACKEND values" test never deleted GEMINI_API_KEY / QWEN_API_KEY / OPENAI_API_KEY / QWEN_REALTIME_API_KEY before asserting. When hub/.env has any of those set, gemini-live (or another backend) leaks into the resolved backends list and the test fails. The fix adds four delete process.env.* lines at the top of the test, mirroring the cleanup pattern already used by the other tests in the same describe block. No behavior change.

2. On feat/voice-advanced-controls (sibling branch, stacks on this one)

f3ab0f3 test(web): import from vitest instead of bun:test

web/src/lib/voicePersonalitySession.test.ts        | 2 +-
web/src/realtime/hooks/voiceContextPlan.test.ts    | 2 +-

Both files imported describe/expect/test from 'bun:test'. The web workspace runs vitest run, which cannot bundle the built-in bun:test module, so the transform fails during bun run build. Only generic test APIs are used, so swapping the import changes no behavior.

Surfacing in case you re-do the rebase from scratch and overwrite these locally - they would silently come back as test failures on the next CI/build. No action needed if you keep them.

@heavygee heavygee force-pushed the feat/voice-selection-all-backends branch from 7469b27 to 238ad4c Compare June 1, 2026 02:19

@github-actions github-actions Bot left a comment

Findings

  • [Major] Qwen proxy still forwards runtime frames before setup is acknowledged — the browser waits for session.updated before starting audio, but an authenticated client can connect directly and send response.create, conversation.item.create, or instruction-only session.update while pendingSetupMap is still set. Those frames are forwarded immediately at hub/src/web/server.ts:188, before the hub-owned session.update sent at hub/src/web/server.ts:159 is acknowledged. That can run the upstream session under the provider default config or race ahead of HAPI's locked tools/voice/instructions, reopening part of the generic-proxy boundary the previous review called out.
    Suggested fix:
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()
    
    // after session.created
    pendingSetupMap.delete(clientWs)
    setupAckMap.set(clientWs, false)
    upstream.send(pendingSetup)
    
    // after session.updated for the hub-owned setup
    setupAckMap.set(clientWs, true)
    for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
    pendingClientFrames.delete(clientWs)
    
    message(clientWs, message) {
        if (!isQwenSafeClientFrame(message)) { /* close */ return }
        const upstream = upstreamMap.get(clientWs)
        if (upstream?.readyState !== WebSocket.OPEN) return
        if (pendingSetupMap.has(clientWs) || setupAckMap.get(clientWs) !== true) {
            const pending = pendingClientFrames.get(clientWs) ?? []
            pending.push(message)
            pendingClientFrames.set(clientWs, pending)
            return
        }
        upstream.send(message)
    }

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The previous Qwen proxy issue is partly addressed, but the proxy still needs to gate client frames until the hub-owned Qwen setup has been acknowledged.

Testing

  • Not run (automation; static review only, per PR security instructions).

HAPI Bot

Comment thread hub/src/web/server.ts Outdated
The previous Qwen proxy hardening sent the hub-owned session.update once
upstream's session.created landed, but never waited for session.updated
before forwarding client frames. An authenticated client could push
response.create / conversation.item.create / instruction-only
session.update during that window and run the upstream session under the
provider default config or partially-applied state - reopening part of
the generic-proxy boundary the prior review called out.

Changes:
- Extract createQwenProxyWebSocketHandler into hub/src/web/qwenProxyHandler.ts
  so the gating behaviour can be unit-tested without spinning up Bun.serve.
- Track a per-client setupAcked WeakMap and a pendingClientFrames queue.
  Client frames are queued (after passing isQwenSafeClientFrame) until the
  upstream emits session.updated; on ack the queue is flushed in order.
- Clear the WeakMaps on upstream error / upstream close / client close so
  no state leaks across reused references.
- Inject the upstream WebSocket constructor for tests; production keeps
  the real global WebSocket.

Tests: 4 new ack-gate cases + existing 245 hub tests still green.
Resolves the [Major] thread on PR tiann#743 follow-up review.

Co-authored-by: Cursor <cursoragent@cursor.com>

@github-actions github-actions Bot left a comment

Findings

  • [Major] Gate Gemini client frames until setupComplete — the proxy sends HAPI's setup on upstream open, but it then flushes queued non-setup frames immediately and forwards later frames whenever the upstream socket is open. A direct authenticated client can therefore send clientContent/tool frames before Google acknowledges the hub-owned setup with setupComplete, racing execution ahead of HAPI's locked model, tools, system prompt, language, and voice config. Evidence: hub/src/web/server.ts:77 and hub/src/web/server.ts:107.
    Suggested fix:
    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()
    
    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
    }
    
    upstream.onmessage = (event) => {
        const text = typeof event.data === 'string'
            ? event.data
            : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
        if (setupCompleteMap.get(clientWs) === false && JSON.parse(text).setupComplete) {
            setupCompleteMap.set(clientWs, true)
            for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
            pendingClientFrames.delete(clientWs)
        }
        if (clientWs.readyState === 1) clientWs.send(event.data)
    }
    
    message(clientWs, message) {
        if (isGeminiSetupFrame(message)) {
            clientWs.close(1008, 'Client-provided Gemini setup is not allowed')
            return
        }
        const upstream = upstreamMap.get(clientWs)
        if (!upstream || upstream.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
            const pending = pendingClientFrames.get(clientWs) ?? []
            pending.push(message)
            pendingClientFrames.set(clientWs, pending)
            return
        }
        upstream.send(message)
    }

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The previous Qwen ack-gate issue appears addressed in the follow-up, but the newly added Gemini proxy still has the same setup-ack race at its proxy boundary.

Testing

  • Not run (automation; static review only, per PR security instructions).

HAPI Bot

Comment thread hub/src/web/server.ts
upstream.onopen = () => {
    // Hub-owned setup only - never forward client setup (prevents generic Gemini proxy abuse).
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
    for (const queued of pending.splice(0)) {

[Major] Gate Gemini client frames until setupComplete

This flushes queued non-setup frames immediately after sending HAPI's setup, before Google has acknowledged it with setupComplete; once the upstream socket is open, message() also forwards later frames directly. A direct authenticated client can send clientContent or tool frames before HAPI's hub-owned model, tools, system prompt, language, and voice setup is installed, so the session can race ahead under default or partially-applied config.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName)))
}

// On the first upstream setupComplete, flip the gate and flush queued client frames.
if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
    setupCompleteMap.set(clientWs, true)
    for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
    pendingClientFrames.delete(clientWs)
}

message(clientWs, message) {
    if (isGeminiSetupFrame(message)) {
        clientWs.close(1008, 'Client-provided Gemini setup is not allowed')
        return
    }
    const upstream = upstreamMap.get(clientWs)
    if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
}
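
The gating idea in the suggested fix can also be isolated from the socket plumbing. The following is a minimal sketch, not code from the PR: `SetupGate`, `push`, `open`, and `pendingCount` are hypothetical names, and the upstream socket is replaced by a plain callback so the behavior can be checked without a network.

```typescript
// Hypothetical helper (not part of the PR): buffers client frames until the
// upstream has acknowledged the hub-owned setup, then forwards them in order.
type Frame = string | ArrayBuffer | Uint8Array

class SetupGate {
    private ready = false
    private readonly queue: Frame[] = []
    private readonly sendUpstream: (frame: Frame) => void

    constructor(sendUpstream: (frame: Frame) => void) {
        this.sendUpstream = sendUpstream
    }

    /** Call for every client frame; forwards only once the gate is open. */
    push(frame: Frame): void {
        if (this.ready) {
            this.sendUpstream(frame)
            return
        }
        this.queue.push(frame)
    }

    /** Call when the upstream reports setupComplete; later calls are no-ops. */
    open(): void {
        if (this.ready) return
        this.ready = true
        for (const frame of this.queue.splice(0)) this.sendUpstream(frame)
    }

    get pendingCount(): number {
        return this.queue.length
    }
}

// Usage: frames sent before open() are held back, later frames pass through.
const forwarded: Frame[] = []
const gate = new SetupGate((frame) => { forwarded.push(frame) })
gate.push('clientContent-1')
gate.open()
gate.push('clientContent-2')
```

A real proxy would probably also want a cap on the queue length so a client cannot grow it without bound while the upstream is still connecting; that limit is left out of the sketch.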

heavygee and others added 3 commits June 1, 2026 07:31
ElevenLabs convai rejects sessions whose `overrides` payload references any
property the agent has not explicitly authorized. The server replies with a
malformed error packet on the LiveKit data channel, which the convai-react SDK
then dereferences as `event.error_type` (undefined -> TypeError -> disconnect),
so the only symptom is a generic "Unknown DataChannel error" with no indication
of the cause.

Empty-prefs sessions now emit exactly `{ agent: { language } }` for byte-parity
with the upstream/main baseline that the configured agent permits today:

- agent.prompt: only when user has customized identity or character
- tts (full slider object): only when preset diverges from balanced defaults
- tts: { voice_id }: only when a voice was picked and sliders are at defaults
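
The three rules above can be sketched as a single builder. This is an illustration under assumptions, not the PR's `buildVoiceAgentConfig()`: `VoicePrefs`, `buildOverrides`, and the numeric values in `BALANCED` are placeholders, the prompt is shown as a plain string, and including `voice_id` alongside diverged sliders is a guess at what "full slider object" means.

```typescript
interface TtsSliders {
    stability: number
    similarity_boost: number
    style: number
    speed: number
}

// Hypothetical preference shape for this sketch.
interface VoicePrefs {
    language: string
    customPrompt?: string // set only when identity or character was edited
    voiceId?: string
    sliders?: TtsSliders
}

// Placeholder numbers: the real "balanced" preset values are not shown in this thread.
const BALANCED: TtsSliders = { stability: 0.5, similarity_boost: 0.75, style: 0, speed: 1 }

interface Overrides {
    agent: { language: string; prompt?: string }
    tts?: Partial<TtsSliders> & { voice_id?: string }
}

function slidersAtDefaults(sliders: TtsSliders | undefined): boolean {
    if (!sliders) return true
    const keys = Object.keys(BALANCED) as Array<keyof TtsSliders>
    return keys.every((key) => sliders[key] === BALANCED[key])
}

function buildOverrides(prefs: VoicePrefs): Overrides {
    // Baseline: only the language, matching what the agent already permits.
    const overrides: Overrides = { agent: { language: prefs.language } }
    if (prefs.customPrompt) overrides.agent.prompt = prefs.customPrompt
    if (!slidersAtDefaults(prefs.sliders)) {
        // Sliders diverge: send the whole slider object (plus the voice, if picked).
        overrides.tts = { ...prefs.sliders, ...(prefs.voiceId ? { voice_id: prefs.voiceId } : {}) }
    } else if (prefs.voiceId) {
        // Sliders at defaults: send only the voice.
        overrides.tts = { voice_id: prefs.voiceId }
    }
    return overrides
}
```

With empty preferences, `buildOverrides({ language: 'en' })` serializes to `{"agent":{"language":"en"}}`, which is the byte-parity case the commit message describes.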

Co-authored-by: Cursor <cursoragent@cursor.com>
…onvAI overrides

UI: wrap fixtures, identity, character, delivery preset, and tuning sliders
in a single collapsed "Advanced voice settings" disclosure. Defaults stay
quiet; sub-sections start collapsed when the master opens. Shows a
"customized" badge if any layer differs from defaults so a user who tweaked
settings still knows where to find them.
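
One way the "customized" badge check might look, as a rough sketch: the layer names below follow the commit message, but `VoiceLayers`, `DEFAULT_LAYERS`, and `customizedLayers` are hypothetical and the real preference types are not shown in this thread.

```typescript
// Hypothetical layer shape for illustration only.
interface VoiceLayers {
    identity: string
    character: string
    deliveryPreset: string
}

// Assumed defaults: empty user text and the "balanced" preset.
const DEFAULT_LAYERS: VoiceLayers = { identity: '', character: '', deliveryPreset: 'balanced' }

/** Returns the names of the layers that differ from their defaults. */
function customizedLayers(
    current: VoiceLayers,
    defaults: VoiceLayers = DEFAULT_LAYERS
): Array<keyof VoiceLayers> {
    const keys = Object.keys(defaults) as Array<keyof VoiceLayers>
    return keys.filter((key) => current[key].trim() !== defaults[key].trim())
}

// The badge is shown when at least one layer was changed.
const showBadge = customizedLayers({ ...DEFAULT_LAYERS, character: 'Dry wit.' }).length > 0
```

Returning the list of changed layers, rather than a bare boolean, would also let the UI point at the specific sub-section that was edited.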

Hub: on every /voice/token resolution, PATCH the resolved ConvAI agent's
platform_settings.overrides to match buildVoiceAgentConfig() (agent.prompt,
tts.voice_id/stability/similarity_boost/style/speed). Idempotent per-process
(cached per agent_id), best-effort (non-fatal on PATCH error). Fixes the
"Cannot read properties of undefined (reading 'error_type')" crash for
operators whose existing agent predates the override declaration.
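
The "idempotent per-process, best-effort" behavior could be sketched as below. This is an assumption-laden illustration: `createOverrideReconciler` and `PatchFn` are hypothetical names, the HTTP PATCH is injected as a callback rather than written against the real ElevenLabs endpoint, and the choice to cache only successful reconciliations (so a failed PATCH is retried on the next token request) is a guess, not something the commit message states.

```typescript
// Injected PATCH call; the real request shape is not reproduced here.
type PatchFn = (agentId: string, overrides: unknown) => Promise<void>

function createOverrideReconciler(patch: PatchFn) {
    // Per-process cache of agents that were already reconciled.
    const reconciled = new Set<string>()

    return async function reconcile(agentId: string, overrides: unknown): Promise<boolean> {
        if (reconciled.has(agentId)) return true
        try {
            await patch(agentId, overrides)
            reconciled.add(agentId)
            return true
        } catch (error) {
            // Best-effort: log and continue so token resolution is not blocked.
            console.warn(`override reconciliation failed for ${agentId}`, error)
            return false
        }
    }
}
```

A caller would await `reconcile(agentId, overrides)` before returning the token and ignore a `false` result, which keeps the PATCH failure non-fatal.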

Co-authored-by: Cursor <cursoragent@cursor.com>
Brings layered voice prompt UI, context bootstrap + streaming, and
ConvAI agent override reconciliation into the existing voice picker
PR. All new surfaces live behind a single collapsed "Advanced voice
settings" disclosure; defaults stay quiet for users who only want to
pick a voice.

Co-authored-by: Cursor <cursoragent@cursor.com>

# Conflicts:
#	hub/src/web/server.ts
@heavygee heavygee changed the title feat(voice): backend-aware voice picker for Gemini and Qwen (#742) feat(voice): backend voice picker + advanced controls behind disclosure (#742) Jun 1, 2026

@github-actions github-actions Bot left a comment

Findings

  • [Major] Gate Gemini client frames until setupComplete — the proxy sends HAPI's setup, then immediately flushes queued client frames and forwards later frames as soon as the upstream socket is open. A direct authenticated client can race clientContent or tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's model, tools, prompt, language, and voice configuration are installed. Evidence: hub/src/web/server.ts:88 and hub/src/web/server.ts:118.
    Suggested fix:

    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()
    
    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName, data.systemInstruction)))
    }
    
    upstream.onmessage = (event) => {
        const text = typeof event.data === 'string'
            ? event.data
            : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
        const parsed = JSON.parse(text) as { setupComplete?: unknown }
        if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
            setupCompleteMap.set(clientWs, true)
            for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
            pendingClientFrames.delete(clientWs)
        }
        if (clientWs.readyState === 1) clientWs.send(event.data)
    }
    
    // Inside message(clientWs, message): queue client frames until setupComplete.
    if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
  • [Major] Pass decoded systemInstruction into Qwen setup — /api/voice/qwen-ws decodes ?systemPrompt= and stores it on ws.data, and the Qwen client sends that param for advanced voice settings, but createQwenProxyWebSocketHandler() casts the data without systemInstruction and calls buildQwenSessionUpdateMessage(data.language, data.voiceName). Qwen sessions therefore ignore the user-edited identity/character/delivery prompt while Gemini uses it. Evidence: hub/src/web/qwenProxyHandler.ts:53 and hub/src/web/qwenProxyHandler.ts:61.
    Suggested fix:

    const data = clientWs.data as {
        apiKey: string
        model: string
        language?: string
        voiceName?: string
        systemInstruction?: string
    }
    
    pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
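
A note on the `upstream.onmessage` snippet in the first finding: it calls `JSON.parse` on every upstream frame, which would throw if a frame were ever not valid JSON. A defensive variant of the detection step might look like the sketch below. `isSetupCompleteFrame` is a hypothetical helper, and treating the mere presence of a `setupComplete` key as the acknowledgement is an assumption about the upstream message shape.

```typescript
/**
 * Returns true when an upstream frame is a JSON object carrying a
 * `setupComplete` key. Non-JSON frames are treated as "not an ack".
 */
function isSetupCompleteFrame(data: string | ArrayBuffer | Uint8Array): boolean {
    // Upstream frames may arrive as text or as binary; normalize to a string.
    const text = typeof data === 'string' ? data : new TextDecoder().decode(data)
    try {
        const parsed: unknown = JSON.parse(text)
        return typeof parsed === 'object' && parsed !== null && 'setupComplete' in parsed
    } catch {
        return false
    }
}

// Usage: flip the gate only on the acknowledgement frame.
const ack = isSetupCompleteFrame('{"setupComplete":{}}')
const other = isSetupCompleteFrame('{"serverContent":{}}')
```

Checking for the key rather than a truthy value avoids depending on what the acknowledgement payload contains.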

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The previous Gemini setup-ack race remains present, and the new Qwen prompt transport path drops the decoded custom prompt before upstream setup. Add proxy-level tests for both setup contracts.

Testing

  • Not run (automation; static review only).

HAPI Bot

Comment thread hub/src/web/server.ts
data.voiceName,
data.systemInstruction
)))
for (const queued of pending.splice(0)) {

[Major] Gate Gemini client frames until setupComplete

This flushes queued client frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's model, tools, prompt, language, and voice setup is installed.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName, data.systemInstruction)))
}

if (setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
// Flip the gate and flush only after the first upstream setupComplete frame.

})

upstreamMap.set(clientWs, upstream)
pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName)))

[Major] Pass the decoded prompt into Qwen's hub-owned setup

The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.

Suggested fix:

const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))
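
A proxy-level check for this contract could be as small as the sketch below. Everything here is a stand-in: `buildSessionUpdateStub` only imitates `buildQwenSessionUpdateMessage` (its output shape, the default prompt text, and the language sentence are assumptions), and `initialSetupFrame` is a hypothetical wrapper around the call the handler makes.

```typescript
interface QwenSocketData {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

// Stand-in for the real builder; the session.update shape is assumed.
function buildSessionUpdateStub(language?: string, voiceName?: string, systemInstruction?: string) {
    const languageLine = language ? ` Always respond in ${language}.` : ''
    return {
        type: 'session.update',
        session: {
            voice: voiceName ?? 'Tina',
            instructions: (systemInstruction ?? 'default hub prompt') + languageLine,
        },
    }
}

// Hypothetical wrapper: forwards all three fields, including the decoded prompt.
function initialSetupFrame(data: QwenSocketData): string {
    return JSON.stringify(buildSessionUpdateStub(data.language, data.voiceName, data.systemInstruction))
}

// The regression this guards against: a custom prompt that never reaches setup.
const frame = JSON.parse(initialSetupFrame({
    apiKey: 'test-key',
    model: 'test-model',
    systemInstruction: 'Speak like a pirate.',
})) as { session: { instructions: string } }
const promptReachedSetup = frame.session.instructions.includes('Speak like a pirate.')
```

In the real test the stub would be replaced by the actual builder and the assertion made against the frame the handler stores in `pendingSetupMap`.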

heavygee added 3 commits June 1, 2026 18:31
…service

Live-tested against DashScope international API:
- Model: qwen3-omni-flash-realtime → qwen3.5-omni-flash-realtime
  (previous model ID did not exist on DashScope)
- Default voice: Cherry → Tina
  (confirmed from session.created response on qwen3.5-omni-flash-realtime)
- Default WS base: dashscope.aliyuncs.com → dashscope-intl.aliyuncs.com
  (international accounts use the -intl endpoint; China endpoint rejects
  international API keys; QWEN_REALTIME_WS_URL env var still overrides)

Two dogfooding fixes verified against a live Qwen Realtime session:

sendTextMessage: switch from instruction-injection to conversation.item.create
  Qwen Realtime requires a user conversation item before response.create.
  The previous approach (updateInstructions + response.create) produced
  "input messages do not contain elements with role user" errors. Now sends
  {type:message, role:user, content:[{type:input_text}]} then response.create.
  sendContextualUpdate is unchanged (instruction-only, no response trigger).
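
The two-event sequence described above might be built like this. It is a sketch under assumptions: the commit message confirms the `message` / `user` / `input_text` pieces and the trailing `response.create`, while the nesting under `item` and the `text` field name follow the OpenAI-style Realtime event format and have not been checked against the Qwen documentation. `buildUserTextEvents` is a hypothetical name.

```typescript
interface RealtimeEvent {
    type: string
    item?: {
        type: 'message'
        role: 'user'
        content: Array<{ type: 'input_text'; text: string }>
    }
}

/** Builds the user conversation item followed by the response trigger. */
function buildUserTextEvents(text: string): RealtimeEvent[] {
    return [
        {
            type: 'conversation.item.create',
            item: {
                type: 'message',
                role: 'user',
                content: [{ type: 'input_text', text }],
            },
        },
        // Sent second, so the model always has a user item to respond to.
        { type: 'response.create' },
    ]
}

// Usage: serialize each event in order before sending it on the socket.
const payloads = buildUserTextEvents('What changed in the last commit?').map((event) => JSON.stringify(event))
```

Keeping the pair in one function makes it harder to send `response.create` without the preceding user item, which is the error case the commit describes.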

Language handling: replace zh-only branch with buildVoiceLanguageBlock()
  Previously, only language='zh' added any instruction; all other languages
  (including English) sent no language block, causing Qwen to drift to Chinese.
  buildVoiceLanguageBlock() now covers three cases:
    - 'zh'/'zh-*': existing Chinese block (unchanged)
    - explicit code ('en','es','fr',...): "Always respond in [Language]"
    - undefined/auto: "Detect the user's language and maintain it"
  Applied to buildGeminiLiveConfig, buildQwenSessionUpdateMessage, and the
  client-side currentInstructions mirror in QwenVoiceSession.
  Also removes the Gemini hub proxy's zh-only filter, which was discarding
  explicit language selections other than Chinese.
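
The three cases listed for `buildVoiceLanguageBlock()` can be sketched as follows. This is not the PR's implementation: the function is named `buildVoiceLanguageBlockSketch` to make that clear, `ZH_BLOCK` is a placeholder for the existing Chinese block (whose text is not shown in this thread), and the small `LANGUAGE_NAMES` table is an assumption about how codes map to display names.

```typescript
// Placeholder for the existing Chinese-language block, whose text is not shown here.
const ZH_BLOCK = '[existing Chinese-language instruction block]'

// Assumed code-to-name table; unknown codes fall back to the raw code.
const LANGUAGE_NAMES: Record<string, string> = {
    en: 'English',
    es: 'Spanish',
    fr: 'French',
    de: 'German',
    ja: 'Japanese',
}

function buildVoiceLanguageBlockSketch(language?: string): string {
    // Case 3: undefined or auto, let the model follow the user.
    if (!language || language === 'auto') {
        return "Detect the user's language and maintain it."
    }
    const code = language.toLowerCase()
    // Case 1: 'zh' and any 'zh-*' variant keep the existing Chinese block.
    if (code === 'zh' || code.startsWith('zh-')) return ZH_BLOCK
    // Case 2: explicit code, pin the response language.
    const base = code.split('-')[0] ?? code
    const name = LANGUAGE_NAMES[base] ?? language
    return `Always respond in ${name}.`
}
```

The important property is that every input produces some language block, so an English or Spanish selection can no longer fall through to an empty instruction.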
…-backends

# Conflicts:
#	hub/src/web/server.ts
#	shared/src/voice.ts
#	web/src/realtime/QwenVoiceSession.tsx

@github-actions github-actions Bot left a comment

Findings

  • [Major] Gate Gemini client frames until setupComplete - the proxy sends HAPI's setup, then immediately flushes queued frames and forwards later frames as soon as the upstream socket is open. A direct authenticated client can still race clientContent/tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's prompt, tools, language, and voice config are installed. Evidence: hub/src/web/server.ts:88 and hub/src/web/server.ts:118.
    Suggested fix:

    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()
    
    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
            data.language,
            data.voiceName,
            data.systemInstruction
        )))
    }
    
    upstream.onmessage = (event) => {
        const text = typeof event.data === 'string'
            ? event.data
            : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
        const parsed = JSON.parse(text) as { setupComplete?: unknown }
        if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
            setupCompleteMap.set(clientWs, true)
            for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
            pendingClientFrames.delete(clientWs)
        }
        if (clientWs.readyState === 1) clientWs.send(event.data)
    }
    
    // Inside message(clientWs, message): queue client frames until setupComplete.
    if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
  • [Major] Pass decoded systemInstruction into Qwen setup - /api/voice/qwen-ws decodes ?systemPrompt= and stores it on ws.data, and the Qwen client now sends that param for advanced voice settings. The proxy still casts data without systemInstruction and calls buildQwenSessionUpdateMessage(data.language, data.voiceName), so Qwen sessions ignore the user-edited identity/character/delivery prompt while Gemini applies it. Evidence: hub/src/web/qwenProxyHandler.ts:53 and hub/src/web/qwenProxyHandler.ts:61.
    Suggested fix:

    const data = clientWs.data as {
        apiKey: string
        model: string
        language?: string
        voiceName?: string
        systemInstruction?: string
    }
    
    pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The prior Gemini setup-ack race remains unresolved, and the Qwen prompt transport path still drops the decoded custom prompt before upstream setup. Add proxy-level tests for both setup contracts.

Testing

  • Not run (automation; bun is not installed in this runner).

HAPI Bot

Comment thread hub/src/web/server.ts
data.voiceName,
data.systemInstruction
)))
for (const queued of pending.splice(0)) {

[Major] Gate Gemini client frames until setupComplete

This flushes queued frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's prompt, tools, language, and voice setup is installed.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(data.language, data.voiceName, data.systemInstruction)))
}

if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
// Flip the gate and flush only after the first upstream setupComplete frame.

})

upstreamMap.set(clientWs, upstream)
pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName)))

[Major] Pass the decoded prompt into Qwen's hub-owned setup

The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.

Suggested fix:

const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))

@heavygee heavygee force-pushed the feat/voice-selection-all-backends branch from 638871b to 9d4a5f5 Compare June 1, 2026 20:58

@github-actions github-actions Bot left a comment

Findings

  • [Major] Gate Gemini client frames until setupComplete - the proxy sends HAPI's setup and then flushes queued client frames immediately on upstream open, and later forwards frames whenever the upstream socket is open. A direct authenticated client can still race clientContent or tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's prompt, tools, language, and voice config are installed. Evidence: hub/src/web/server.ts:88.
    Suggested fix:

    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()
    
    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
            data.language,
            data.voiceName,
            data.systemInstruction
        )))
    }
    
    upstream.onmessage = (event) => {
        const text = typeof event.data === 'string'
            ? event.data
            : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
        const parsed = JSON.parse(text) as { setupComplete?: unknown }
        if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
            setupCompleteMap.set(clientWs, true)
            for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
            pendingClientFrames.delete(clientWs)
        }
        if (clientWs.readyState === 1) clientWs.send(event.data)
    }
    
    if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
  • [Major] Pass decoded systemInstruction into Qwen setup - /api/voice/qwen-ws decodes ?systemPrompt= and stores it on ws.data, and the Qwen client sends that param for advanced voice settings. The extracted proxy handler still casts data without systemInstruction and calls buildQwenSessionUpdateMessage(data.language, data.voiceName), so Qwen sessions ignore the user-edited identity/character/delivery prompt while Gemini applies it. Evidence: hub/src/web/qwenProxyHandler.ts:61.
    Suggested fix:

    const data = clientWs.data as {
        apiKey: string
        model: string
        language?: string
        voiceName?: string
        systemInstruction?: string
    }
    
    pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The prior Gemini setup-ack race remains unresolved, and the Qwen prompt transport path still drops the decoded custom prompt before upstream setup. Add proxy-level tests for both setup contracts.

Testing

  • Not run (automation; PR review security policy says not to execute PR code).

HAPI Bot

Comment thread hub/src/web/server.ts
data.voiceName,
data.systemInstruction
)))
for (const queued of pending.splice(0)) {

[Major] Gate Gemini client frames until setupComplete

This flushes queued frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's prompt, tools, language, and voice setup is installed.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
}

if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
// Flip the gate and flush only after the first upstream setupComplete frame.

})

upstreamMap.set(clientWs, upstream)
pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName)))

[Major] Pass the decoded prompt into Qwen's hub-owned setup

The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.

Suggested fix:

const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))

Discovered during dogfooding that the advanced settings disclosure mixed
two unrelated concerns: acoustic delivery and conversational behaviour.

Remove the top-level "Advanced voice settings" collapsible entirely.
Replace with two flat, always-visible subsections:

  How It Sounds — voice backend, voice picker, acoustic tuning sliders
    (stability, expressiveness, speaking rate, similarity, affective
    dialog, per-backend hints). All controls that affect the audio signal.

  How It Responds — language, proactive toggle, identity, character
    prompt, delivery preset, platform fixtures. All controls that affect
    what the agent chooses to say and in what language.

Language moves from above How It Sounds (where it was awkwardly grouped
with the backend picker) into How It Responds, where it belongs.

Proactive toggle moves from the separate "Session behavior" block into
How It Responds — it is a conversational behaviour control, not an
acoustic one.

VoiceAdvancedControls exports two named components (VoiceRespondsControls,
VoiceSoundsControls); the old VoiceAdvancedControls export is kept as a
deprecated re-export for any external callers.

zh-CN locale: 声音效果 / 对话风格.

@github-actions github-actions Bot left a comment

Findings

  • [Major] Gate Gemini client frames until setupComplete - the proxy sends HAPI's setup and then immediately flushes queued client frames, and message() forwards later frames whenever the upstream socket is open. A direct authenticated client can still race clientContent or tool frames before Google acknowledges the hub-owned setup, so execution can start before HAPI's prompt, tools, language, and voice config are installed. Evidence: hub/src/web/server.ts:88.
    Suggested fix:

    const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
    const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()
    
    upstream.onopen = () => {
        setupCompleteMap.set(clientWs, false)
        upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
            data.language,
            data.voiceName,
            data.systemInstruction
        )))
    }
    
    upstream.onmessage = (event) => {
        const text = typeof event.data === 'string'
            ? event.data
            : new TextDecoder().decode(new Uint8Array(event.data as ArrayBuffer))
        const parsed = JSON.parse(text) as { setupComplete?: unknown }
        if (setupCompleteMap.get(clientWs) === false && parsed.setupComplete) {
            setupCompleteMap.set(clientWs, true)
            for (const queued of pendingClientFrames.get(clientWs) ?? []) upstream.send(queued)
            pendingClientFrames.delete(clientWs)
        }
        if (clientWs.readyState === 1) clientWs.send(event.data)
    }
    
    if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
        const pending = pendingClientFrames.get(clientWs) ?? []
        pending.push(message)
        pendingClientFrames.set(clientWs, pending)
        return
    }
    upstream.send(message)
  • [Major] Pass decoded systemInstruction into Qwen setup - /api/voice/qwen-ws decodes ?systemPrompt= and stores it on ws.data, and the Qwen client sends that param for advanced voice settings. The extracted proxy handler still casts data without systemInstruction and calls buildQwenSessionUpdateMessage(data.language, data.voiceName), so Qwen sessions ignore the user-edited identity/character/delivery prompt while Gemini applies it. Evidence: hub/src/web/qwenProxyHandler.ts:61.
    Suggested fix:

    const data = clientWs.data as {
        apiKey: string
        model: string
        language?: string
        voiceName?: string
        systemInstruction?: string
    }
    
    pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))

Questions

  • None.

Summary

  • Review mode: follow-up after new commits
  • The previous Qwen ack gate issue is partly addressed, but the Gemini setup race remains and Qwen still drops the decoded custom prompt before upstream setup. Add proxy-level tests covering Gemini setupComplete gating and Qwen custom prompt propagation.

Testing

  • Not run (automation; PR review security policy says not to execute PR code).

HAPI Bot

Comment thread hub/src/web/server.ts
data.voiceName,
data.systemInstruction
)))
for (const queued of pending.splice(0)) {

[Major] Gate Gemini client frames until setupComplete

This flushes queued frames immediately after sending HAPI's setup, before Google acknowledges it with setupComplete; message() also forwards later frames whenever the upstream socket is open. A direct authenticated client can race clientContent or tool frames before HAPI's prompt, tools, language, and voice setup is installed.

Suggested fix:

const setupCompleteMap = new WeakMap<ServerWebSocket<unknown>, boolean>()
const pendingClientFrames = new WeakMap<ServerWebSocket<unknown>, Array<string | ArrayBuffer | Uint8Array>>()

upstream.onopen = () => {
    setupCompleteMap.set(clientWs, false)
    upstream.send(JSON.stringify(buildGeminiLiveSetupMessage(
        data.language,
        data.voiceName,
        data.systemInstruction
    )))
}

// Queue all client frames until the first upstream setupComplete frame.
if (upstream?.readyState !== WebSocket.OPEN || setupCompleteMap.get(clientWs) !== true) {
    const pending = pendingClientFrames.get(clientWs) ?? []
    pending.push(message)
    pendingClientFrames.set(clientWs, pending)
    return
}
upstream.send(message)

})

upstreamMap.set(clientWs, upstream)
pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(data.language, data.voiceName)))

[Major] Pass the decoded prompt into Qwen's hub-owned setup

The server decodes ?systemPrompt= for /api/voice/qwen-ws and stores it on ws.data, and the Qwen client sends that param for the advanced voice prompt. This handler drops it when building the initial session.update, so Qwen ignores the user-edited identity/character/delivery prompt while Gemini applies it.

Suggested fix:

const data = clientWs.data as {
    apiKey: string
    model: string
    language?: string
    voiceName?: string
    systemInstruction?: string
}

pendingSetupMap.set(clientWs, JSON.stringify(buildQwenSessionUpdateMessage(
    data.language,
    data.voiceName,
    data.systemInstruction
)))
