Home
A self-hosted, full-duplex voice agent loop with swappable streaming STT, LLM, and TTS. This wiki is the deep reference. The README is the product page; start there if you just want to run it.
Built by Sarma Linux. MIT licence.
- Architecture: pipeline and sequence diagrams, the IDLE/LISTEN/THINK/SPEAK state machine, the barge-in and function-call flows, the component table, and the design rationale.
- Quick-Start: install, the providers and environment variables you need, the keyless path, the self-hosted server setup, and a first-call walkthrough.
- Roadmap: what shipped in 1.1.0, what is planned, what I will not ship, and how to contribute.
One voice session maps to one Orchestrator instance, created when the browser opens the /voice WebSocket and disposed when it closes. The orchestrator owns a four-state machine and never blocks: every provider call is a stream consumed with for await, and every cancellable operation is wrapped in an AbortController.
- Capture. The browser runs the microphone through an AudioContext resampled to 16 kHz mono, converts each frame to PCM16, base64-encodes it, and sends it as a { type: 'audio', payload } message. The orchestrator is transport-agnostic, so terminating over a mediasoup or LiveKit SFU is a swap at the edge that the pipeline never sees.
- Detect. Each frame runs through the stateful VAD in vad.ts, which uses hysteresis and a hangover so transients do not flip the decision. When speech is confirmed the machine moves IDLE to LISTEN.
- Transcribe. While in LISTEN, voice frames feed the selected STT adapter for live partials. Trailing silence past a frame threshold flushes the adapter for a final transcript, which triggers the move to THINK.
- Think. The final transcript is appended to the conversation and streamed to the LLM adapter with the registered tools advertised. The first token flips the machine to SPEAK and is fed straight into TTS, so audio starts before the completion finishes.
- Speak. The TTS adapter synthesises sentence by sentence and the orchestrator streams PCM chunks back to the client. When the turn drains, the machine returns to IDLE.
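The five steps above can be read as a small transition function. The sketch below is illustrative and not lifted from the repository: the event names (`speech_start`, `final_transcript`, `first_token`, `turn_drained`) and the `transition` helper are assumptions made for the example, and only the four state names come from this page.

```typescript
type State = 'IDLE' | 'LISTEN' | 'THINK' | 'SPEAK'

// Hypothetical event names; the real orchestrator may model these differently.
type TurnEvent =
  | 'speech_start' // VAD confirmed speech
  | 'final_transcript' // STT flushed after trailing silence
  | 'first_token' // LLM produced its first token
  | 'turn_drained' // TTS finished streaming the reply

function transition(state: State, event: TurnEvent): State {
  switch (event) {
    case 'speech_start':
      // From IDLE this opens a turn; from THINK or SPEAK it is a barge-in.
      return 'LISTEN'
    case 'final_transcript':
      return state === 'LISTEN' ? 'THINK' : state
    case 'first_token':
      return state === 'THINK' ? 'SPEAK' : state
    case 'turn_drained':
      return state === 'SPEAK' ? 'IDLE' : state
  }
}
```

Events that do not apply to the current state leave it unchanged, which keeps late or duplicated provider events from corrupting the turn.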
Barge-in is the part people get wrong. If the VAD detects speech while the machine is in THINK or SPEAK, the orchestrator aborts both the LLM and TTS streams through their AbortControllers, resets the STT and TTS adapters, emits a barge-in control message, and drops back to LISTEN. Because the abort signal propagates through the fetch body reader and the for await loops, there are no orphaned streams talking over the user.
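A minimal sketch of the cancellation pattern described above, assuming a single AbortController per turn. `fakeTokens` and `Turn` are hypothetical names standing in for a provider stream and the orchestrator's turn handling; the real code aborts both the LLM and TTS streams and resets the adapters as well.

```typescript
// Hypothetical stand-in for a provider stream: yields tokens until aborted.
async function* fakeTokens(signal: AbortSignal): AsyncGenerator<string> {
  for (const token of ['Hello', ' there', ',', ' how', ' can', ' I', ' help?']) {
    if (signal.aborted) return // stop producing once the turn is cancelled
    yield token
    await new Promise((resolve) => setTimeout(resolve, 10)) // simulate network pacing
  }
}

class Turn {
  private controller = new AbortController()
  readonly spoken: string[] = []

  // Consume the stream with for await; nothing here blocks the event loop.
  async run(): Promise<void> {
    for await (const token of fakeTokens(this.controller.signal)) {
      if (this.controller.signal.aborted) break
      this.spoken.push(token)
    }
  }

  // Called when the VAD confirms speech during THINK or SPEAK.
  bargeIn(): void {
    this.controller.abort()
  }
}
```

Because the producer and the consumer both check the same signal, calling `bargeIn()` ends the loop at the next token boundary and no further tokens are pushed.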
Function-call passthrough lets the LLM call server-side functions mid-turn. The orchestrator advertises the registered tool definitions on every call, runs the matching handler when the model requests it, appends the result to the conversation, and re-streams so the model answers with grounded data. Tool rounds are bounded to guard against loops.
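The bounded tool loop can be sketched as follows. This is an illustration under stated assumptions, not the shipped implementation: `runTurn`, `ModelReply`, the non-streaming `callModel` callback, and the limit of three rounds are all hypothetical, chosen to show the advertise, run, append, and re-ask cycle in a few lines.

```typescript
type ToolCall = { name: string; args: Record<string, unknown> }
type ModelReply = { text: string } | { toolCall: ToolCall }
type ChatMessage = { role: 'user' | 'assistant' | 'tool'; content: string }
type ToolHandler = (args: Record<string, unknown>) => Promise<string>

const MAX_TOOL_ROUNDS = 3 // hypothetical limit; the real bound may differ

async function runTurn(
  messages: ChatMessage[],
  callModel: (messages: ChatMessage[]) => Promise<ModelReply>,
  handlers: Map<string, ToolHandler>,
): Promise<string> {
  for (let round = 0; round < MAX_TOOL_ROUNDS; round++) {
    const reply = await callModel(messages)
    if ('text' in reply) return reply.text // plain answer: the turn is done

    // The model asked for a tool: run it and append the result to the conversation.
    const handler = handlers.get(reply.toolCall.name)
    const result = handler
      ? await handler(reply.toolCall.args)
      : `Unknown tool: ${reply.toolCall.name}`
    messages.push({ role: 'tool', content: result })
  }
  // Every round ended in another tool call, so give up instead of looping forever.
  return 'Tool round limit reached.'
}
```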
See Architecture for the diagrams and the full state-transition table.
All examples below are taken from the codebase as it ships, not invented.
You do not need provider keys to exercise the transport, state machine, barge-in, and tool calls. Leave the keys unset in .env, run pnpm dev, and the LLM adapter yields a single configuration message while the STT and TTS adapters return nothing: the IDLE/LISTEN/THINK/SPEAK transitions, barge-in, and tool-call routing all still work. This is the fastest way to validate a transport or VAD change before standing up the self-hosted servers.
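As a rough picture of the keyless behaviour, the sketch below picks between a real stream and a one-message fallback. `pickLlm` and `keylessLlm` are hypothetical helpers, and the wording of the configuration message is invented for the example; only the behaviour (one configuration message instead of a streamed reply) comes from this page.

```typescript
// Hypothetical keyless fallback: one configuration message, then the stream ends.
async function* keylessLlm(keyName: string): AsyncGenerator<string> {
  yield `LLM is not configured: set ${keyName} to enable streaming replies.`
}

// Choose the real stream when a key is present, otherwise the fallback.
function pickLlm(
  apiKey: string | undefined,
  real: () => AsyncGenerator<string>,
): AsyncGenerator<string> {
  return apiKey ? real() : keylessLlm('GROQ_API_KEY')
}
```

Since the fallback is still an async generator, the rest of the pipeline consumes it exactly as it would a live provider stream.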
Switching from OpenTTS to ElevenLabs is a single environment variable, no code change:
```sh
TTS_PROVIDER=elevenlabs ELEVENLABS_API_KEY=sk-... pnpm dev
```

The registry in apps/server/src/adapters/tts/registry.ts reads TTS_PROVIDER and returns the matching adapter. The orchestrator only sees the TtsAdapter interface.
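The lookup a registry like this performs can be sketched as below. This is not the contents of registry.ts: `TtsAdapterSketch`, `ttsFactories`, `selectTts`, the `opentts` provider value, and OpenTTS as the default are assumptions for illustration.

```typescript
// Minimal stand-in for the TtsAdapter interface; the real one streams PCM audio.
interface TtsAdapterSketch {
  id: string
}

// One factory per provider value. 'opentts' is an assumed name for the default.
const ttsFactories: Record<string, () => TtsAdapterSketch> = {
  opentts: () => ({ id: 'opentts' }),
  elevenlabs: () => ({ id: 'elevenlabs' }),
}

// Takes the environment as an argument so the lookup is easy to test.
function selectTts(env: Record<string, string | undefined>): TtsAdapterSketch {
  const name = env['TTS_PROVIDER'] ?? 'opentts' // assumed default
  const factory = ttsFactories[name]
  if (!factory) throw new Error(`Unknown TTS_PROVIDER: ${name}`)
  return factory()
}
```

Failing fast on an unknown value surfaces a typo in .env at startup rather than as silent audio later.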
apps/server/src/pipeline/tools.ts ships a ToolRegistry and two default tools. Registering your own is one call:
```ts
import { defaultTools } from './pipeline/tools.js'

const tools = defaultTools().register({
  definition: {
    name: 'lookup_order',
    description: 'Look up an order by id and return its status.',
    parameters: {
      type: 'object',
      properties: { id: { type: 'string' } },
      required: ['id'],
    },
  },
  handler: async (args) => {
    const order = await db.orders.find(args.id as string)
    return order ? `Order ${order.id} is ${order.status}.` : 'Order not found.'
  },
})
```

Pass the registry into new Orchestrator({ tools }). The model is then free to call lookup_order mid-turn, and the orchestrator feeds the result back so the next tokens are grounded in real data.
Create apps/server/src/adapters/llm/<provider>.ts exporting a factory that returns an LlmAdapter (an id and a stream() async generator that yields token and tool_call events and honours the AbortSignal). For an OpenAI-compatible provider, reuse parseChatSse, toWireMessages, and toWireTools from sse.ts, so the body is a few lines. Register it in apps/server/src/adapters/llm/registry.ts behind a new LLM_PROVIDER value. Nothing else in the pipeline changes.
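For orientation, here is a sketch of the adapter shape that paragraph describes. The exact types are assumptions: `LlmEvent`, `LlmAdapterSketch`, the `stream(prompt, signal)` signature, and the `createEchoAdapter` factory are hypothetical, so check the real LlmAdapter definition before copying anything. The real adapter would also call the provider and parse its response; this one only echoes the prompt.

```typescript
// Assumed event and adapter shapes, modelled on the description above.
type LlmEvent =
  | { type: 'token'; text: string }
  | { type: 'tool_call'; name: string; args: Record<string, unknown> }

interface LlmAdapterSketch {
  id: string
  stream(prompt: string, signal: AbortSignal): AsyncGenerator<LlmEvent>
}

// Hypothetical factory: echoes the prompt back word by word as token events.
function createEchoAdapter(): LlmAdapterSketch {
  return {
    id: 'echo',
    async *stream(prompt: string, signal: AbortSignal): AsyncGenerator<LlmEvent> {
      for (const word of prompt.split(' ')) {
        if (signal.aborted) return // honour cancellation between tokens
        yield { type: 'token', text: word + ' ' }
      }
    },
  }
}
```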
The detector in vad.ts is a stateful Vad with hysteresis (separate enter and exit thresholds) and a hangover (a run of confirming frames before the decision flips). Tune it per session through the orchestrator's vad option without touching vad.ts:
```ts
new Orchestrator({
  vad: {
    enterThreshold: 0.04, // raise to ignore a louder room
    exitThreshold: 0.02, // must stay at or below enterThreshold
    speechFrames: 2, // require two loud frames before speech onset
    hangoverFrames: 5, // hold through longer pauses before ending the turn
  },
})
```

Raise enterThreshold for a noisy room, raise hangoverFrames if the agent cuts speakers off mid-sentence, and lower speechFrames for the fastest possible onset. For real workloads, swap the frameRms energy term for a call into silero-vad-onnx and keep the hysteresis and hangover layer on top. The unit tests pin the contract either way.
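To make the four options concrete, here is a minimal sketch of an energy detector with hysteresis and a hangover. It is not the code in vad.ts: `SketchVad` and `VadOptionsSketch` are hypothetical, and this `frameRms` is a plain root-mean-square written for the example, which may differ from the shipped helper of the same name.

```typescript
interface VadOptionsSketch {
  enterThreshold: number
  exitThreshold: number
  speechFrames: number
  hangoverFrames: number
}

// Root-mean-square energy of one frame of samples in the range [-1, 1].
function frameRms(frame: Float32Array): number {
  if (frame.length === 0) return 0
  const sumSquares = frame.reduce((acc, s) => acc + s * s, 0)
  return Math.sqrt(sumSquares / frame.length)
}

class SketchVad {
  private opts: VadOptionsSketch
  private speaking = false
  private run = 0 // consecutive frames that argue for flipping the decision

  constructor(opts: VadOptionsSketch) {
    this.opts = opts
  }

  // Feed one frame's energy; returns true while speech is considered active.
  push(rms: number): boolean {
    if (!this.speaking) {
      // Onset: need speechFrames loud frames in a row at or above enterThreshold.
      this.run = rms >= this.opts.enterThreshold ? this.run + 1 : 0
      if (this.run >= this.opts.speechFrames) {
        this.speaking = true
        this.run = 0
      }
    } else {
      // Offset: need hangoverFrames quiet frames in a row at or below exitThreshold.
      this.run = rms <= this.opts.exitThreshold ? this.run + 1 : 0
      if (this.run >= this.opts.hangoverFrames) {
        this.speaking = false
        this.run = 0
      }
    }
    return this.speaking
  }
}
```

Energy between the two thresholds never counts toward either flip, which is what stops a level hovering near one cutoff from toggling the decision frame by frame.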
| Symptom | Likely cause | Fix |
|---|---|---|
| Browser never leaves IDLE | Microphone permission denied, or input level below the VAD enter threshold | Check site permissions for localhost:3000; lower the vad.enterThreshold passed to the orchestrator to confirm the path |
| WebSocket fails to connect | Server not listening on :3001 | Check the pnpm dev logs; confirm NEXT_PUBLIC_SERVER_URL matches the server port |
| Transcripts but no audio | TTS server unreachable or provider key missing | Confirm OPENTTS_URL points at a running OpenTTS server, or set the key for the provider named in TTS_PROVIDER |
| No transcripts at all | Whisper.cpp server unreachable, or no STT key set | Confirm WHISPERCPP_URL points at a running whisper-server, or set DEEPGRAM_API_KEY / OPENAI_API_KEY for the alternative adapters |
| LLM only returns a configuration message | GROQ_API_KEY (or the selected provider key) unset | Set the key; the adapter yields a configuration token instead of streaming when no key is present |
| Agent talks over me when I interrupt | Barge-in not firing because the VAD is not confirming your interruption | Confirm input frames are reaching the server; lower vad.enterThreshold or vad.speechFrames so mid-output speech is caught faster |
| Agent cuts me off mid-sentence | Hangover too short for your speaking pace | Raise vad.hangoverFrames so brief pauses do not end the utterance |
| Tool is never called | Tool not registered, or the model did not choose to call it | Confirm the tool is in the registry passed to the orchestrator; check /health and the server logs for the advertised tools |
| pnpm dev warns about peer dependencies | Next 15 lists an older React peer range against React 19 | Safe to ignore; the build and runtime are verified against React 19 |
| CI fails on pnpm install --frozen-lockfile | pnpm-lock.yaml out of sync with a manifest change | Run pnpm install locally and commit the updated lockfile |