sarmakska edited this page Jun 7, 2026 · 5 revisions

voice-agent-starter

A self-hosted, full-duplex voice agent loop with swappable streaming STT, LLM, and TTS. This wiki is the deep reference. The README is the product page; start there if you just want to run it.

Built by Sarma Linux. MIT licence.

Wiki index

  • Architecture: pipeline and sequence diagrams, the IDLE/LISTEN/THINK/SPEAK state machine, the barge-in and function-call flows, the component table, and the design rationale.
  • Quick-Start: install, the providers and environment variables you need, the keyless path, the self-hosted server setup, and a first-call walkthrough.
  • Roadmap: what shipped in 1.1.0, what is planned, what I will not ship, and how to contribute.

How the loop actually works

One voice session maps to one Orchestrator instance, created when the browser opens the /voice WebSocket and disposed when it closes. The orchestrator owns a four-state machine and never blocks: every provider call is a stream consumed with for await, and every cancellable operation is wrapped in an AbortController.

  1. Capture. The browser runs the microphone through an AudioContext resampled to 16 kHz mono, converts each frame to PCM16, base64-encodes it, and sends it as a { type: 'audio', payload } message. The orchestrator is transport-agnostic, so terminating over a mediasoup or LiveKit SFU is a swap at the edge that the pipeline never sees.
  2. Detect. Each frame runs through the stateful VAD in vad.ts, which uses hysteresis and a hangover so transients do not flip the decision. When speech is confirmed the machine moves IDLE to LISTEN.
  3. Transcribe. While in LISTEN, voice frames feed the selected STT adapter for live partials. Trailing silence past a frame threshold flushes the adapter for a final transcript, which triggers the move to THINK.
  4. Think. The final transcript is appended to the conversation and streamed to the LLM adapter with the registered tools advertised. The first token flips the machine to SPEAK and is fed straight into TTS, so audio starts before the completion finishes.
  5. Speak. The TTS adapter synthesises sentence by sentence and the orchestrator streams PCM chunks back to the client. When the turn drains, the machine returns to IDLE.
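
The capture step (step 1) can be sketched as follows. This is a minimal illustration, not the client code that ships: the helper names floatToPcm16, toBase64, and encodeAudioFrame are hypothetical, and little-endian sample order is an assumption, since this page specifies only PCM16, base64, and the { type: 'audio', payload } message shape.

```typescript
// Message shape taken from this page; everything else here is illustrative.
interface AudioMessage {
  type: 'audio'
  payload: string // base64-encoded PCM16 samples
}

// Convert Web Audio float samples in [-1, 1] to signed 16-bit PCM bytes.
// Little-endian byte order is an assumption, not something the page states.
function floatToPcm16(frame: Float32Array): Uint8Array {
  const bytes = new Uint8Array(frame.length * 2)
  const view = new DataView(bytes.buffer)
  for (let i = 0; i < frame.length; i++) {
    const clamped = Math.max(-1, Math.min(1, frame[i]))
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff
    view.setInt16(i * 2, Math.round(scaled), true)
  }
  return bytes
}

// Base64-encode raw bytes with btoa, which is available in browsers and
// in current Node releases.
function toBase64(bytes: Uint8Array): string {
  let binary = ''
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i])
  return btoa(binary)
}

// Build the message the browser would send over the /voice WebSocket.
function encodeAudioFrame(frame: Float32Array): AudioMessage {
  return { type: 'audio', payload: toBase64(floatToPcm16(frame)) }
}
```

In a browser you would call JSON.stringify(encodeAudioFrame(frame)) for each captured frame and send the result over the socket.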

Barge-in is the part people get wrong. If the VAD detects speech while the machine is in THINK or SPEAK, the orchestrator aborts both the LLM and TTS streams through their AbortControllers, resets the STT and TTS adapters, emits a barge-in control message, and drops back to LISTEN. Because the abort signal propagates through the fetch body reader and the for await loops, there are no orphaned streams talking over the user.
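
The barge-in path can be sketched roughly like this. It is a simplified illustration, not the orchestrator's code: the class, method, and field names are hypothetical, the adapter resets are omitted, and the control message is reduced to a string pushed onto an array.

```typescript
type State = 'IDLE' | 'LISTEN' | 'THINK' | 'SPEAK'

class BargeInSketch {
  state: State = 'IDLE'
  readonly controlMessages: string[] = []
  private llmAbort: AbortController | null = null
  private ttsAbort: AbortController | null = null

  // Called when a final transcript starts a model turn. Each stream gets
  // its own AbortController so both can be cancelled later.
  beginTurn(): { llm: AbortSignal; tts: AbortSignal } {
    this.llmAbort = new AbortController()
    this.ttsAbort = new AbortController()
    this.state = 'THINK'
    return { llm: this.llmAbort.signal, tts: this.ttsAbort.signal }
  }

  // Called when the VAD confirms speech. Returns true when it was a barge-in.
  onSpeechConfirmed(): boolean {
    if (this.state !== 'THINK' && this.state !== 'SPEAK') {
      if (this.state === 'IDLE') this.state = 'LISTEN'
      return false
    }
    // Abort both in-flight streams, tell the client, and listen again.
    this.llmAbort?.abort()
    this.ttsAbort?.abort()
    this.llmAbort = null
    this.ttsAbort = null
    this.controlMessages.push('barge-in')
    this.state = 'LISTEN'
    return true
  }
}

// A consumer loop that stops forwarding as soon as its signal is aborted.
// Returning from for await also closes the underlying iterator.
async function drain(
  stream: AsyncIterable<string>,
  signal: AbortSignal,
  sink: (chunk: string) => void,
): Promise<void> {
  for await (const chunk of stream) {
    if (signal.aborted) return
    sink(chunk)
  }
}
```

Passing the same signal to fetch and to every loop that consumes the response is what keeps a cancelled turn from producing further audio.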

Function-call passthrough lets the LLM call server-side functions mid-turn. The orchestrator advertises the registered tool definitions on every call, runs the matching handler when the model requests it, appends the result to the conversation, and re-streams so the model answers with grounded data. Tool rounds are bounded to guard against loops.
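
A bounded tool loop of this kind might look like the sketch below. It is an assumption-laden illustration: runTurn, the Model and Handler types, the default of three rounds, and the fallback strings are all hypothetical, and the model is reduced to a function that returns either text or one tool request instead of a token stream.

```typescript
type Message = { role: 'user' | 'assistant' | 'tool'; content: string }
type ModelReply =
  | { kind: 'text'; text: string }
  | { kind: 'tool_call'; name: string; args: Record<string, unknown> }
type Handler = (args: Record<string, unknown>) => Promise<string>
type Model = (messages: Message[]) => Promise<ModelReply>

// Ask the model, run any requested tool, append the result, and ask again.
// The handler runs at most maxToolRounds times, so a model that keeps
// requesting tools cannot loop forever.
async function runTurn(
  model: Model,
  handlers: Map<string, Handler>,
  messages: Message[],
  maxToolRounds = 3, // hypothetical bound; the real limit is not stated here
): Promise<string> {
  for (let round = 0; round <= maxToolRounds; round++) {
    const reply = await model(messages)
    if (reply.kind === 'text') return reply.text
    if (round === maxToolRounds) break
    const handler = handlers.get(reply.name)
    const result = handler
      ? await handler(reply.args)
      : `Unknown tool: ${reply.name}`
    messages.push({ role: 'tool', content: result })
  }
  return 'Tool round limit reached.'
}
```

The important property is that the tool result is appended to the conversation before the model is asked again, which is what grounds the follow-up answer.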

See Architecture for the diagrams and the full state-transition table.
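
As a quick orientation, the transitions named in this section can be condensed into a pure function. This is only a summary of what is described above, with hypothetical event names; the Architecture page remains the authoritative transition table.

```typescript
type State = 'IDLE' | 'LISTEN' | 'THINK' | 'SPEAK'

// Event names are illustrative labels for the triggers described above.
type VoiceEvent =
  | 'speech_confirmed' // VAD confirms speech
  | 'final_transcript' // STT flushes after trailing silence
  | 'first_token'      // LLM emits its first token
  | 'turn_drained'     // TTS output has been fully streamed

function nextState(state: State, event: VoiceEvent): State {
  switch (event) {
    case 'speech_confirmed':
      // IDLE -> LISTEN is speech onset; THINK or SPEAK -> LISTEN is barge-in.
      return 'LISTEN'
    case 'final_transcript':
      return state === 'LISTEN' ? 'THINK' : state
    case 'first_token':
      return state === 'THINK' ? 'SPEAK' : state
    case 'turn_drained':
      return state === 'SPEAK' ? 'IDLE' : state
  }
}
```

Events that do not apply to the current state leave it unchanged, which keeps late or duplicate signals from corrupting the session.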

Real-world examples

All examples below are taken from the codebase as it ships, not invented.

Run the pipeline without paying for any provider

You do not need provider keys to exercise the transport, state machine, barge-in, and tool calls. Leave the keys unset in .env and run pnpm dev. The LLM adapter then yields a single configuration message and the STT and TTS adapters return nothing, but the IDLE/LISTEN/THINK/SPEAK transitions, barge-in, and tool-call routing all still work. This is the fastest way to validate a transport or VAD change before standing up the self-hosted servers.

Swap the TTS provider for one call

Switching from OpenTTS to ElevenLabs needs no code change: set TTS_PROVIDER and supply the provider's API key.

```shell
TTS_PROVIDER=elevenlabs ELEVENLABS_API_KEY=sk-... pnpm dev
```

The registry in apps/server/src/adapters/tts/registry.ts reads TTS_PROVIDER and returns the matching adapter. The orchestrator only sees the TtsAdapter interface.
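
Unlike the shipped snippets in this section, the following is an illustrative sketch of the registry pattern, not the contents of registry.ts. The TtsAdapter shape, the selectTts name, the 'opentts' value, and the choice of OpenTTS as the default are all assumptions.

```typescript
// Hypothetical shape: this page names TtsAdapter but does not list its members.
interface TtsAdapter {
  id: string
}

type Env = Record<string, string | undefined>

// One factory per provider. A Map avoids accidental hits on inherited
// object keys such as "constructor".
const ttsFactories = new Map<string, (env: Env) => TtsAdapter>([
  ['opentts', () => ({ id: 'opentts' })],
  ['elevenlabs', () => ({ id: 'elevenlabs' })],
])

// Read TTS_PROVIDER and return the matching adapter.
function selectTts(env: Env): TtsAdapter {
  const name = env.TTS_PROVIDER ?? 'opentts' // default is an assumption
  const factory = ttsFactories.get(name)
  if (!factory) throw new Error(`Unknown TTS_PROVIDER: ${name}`)
  return factory(env)
}
```

Because callers depend only on the interface, adding a provider means adding one entry to the map.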

Register a server-side tool

apps/server/src/pipeline/tools.ts ships a ToolRegistry and two default tools. Registering your own is one call:

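```typescript
import { defaultTools } from './pipeline/tools.js'

// `db` is assumed to be your own data-access layer.
const tools = defaultTools().register({
  definition: {
    name: 'lookup_order',
    description: 'Look up an order by id and return its status.',
    // JSON Schema for the arguments the model must supply.
    parameters: {
      type: 'object',
      properties: { id: { type: 'string' } },
      required: ['id'],
    },
  },
  // Runs server-side when the model requests the tool; the returned string
  // is fed back into the conversation.
  handler: async (args) => {
    const order = await db.orders.find(args.id as string)
    return order ? `Order ${order.id} is ${order.status}.` : 'Order not found.'
  },
})
```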

Pass the registry into new Orchestrator({ tools }). The model is then free to call lookup_order mid-turn, and the orchestrator feeds the result back so the next tokens are grounded in real data.

Add a brand-new LLM provider

Create apps/server/src/adapters/llm/<provider>.ts exporting a factory that returns an LlmAdapter (an id and a stream() async generator that yields token and tool_call events and honours the AbortSignal). For an OpenAI-compatible provider, reuse parseChatSse, toWireMessages, and toWireTools from sse.ts, so the body is a few lines. Register it in apps/server/src/adapters/llm/registry.ts behind a new LLM_PROVIDER value. Nothing else in the pipeline changes.
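
Unlike the shipped snippets in this section, the following is an illustrative sketch of the adapter contract, not code from the repository. The stream() parameters are simplified to a prompt string and a signal, the event payload fields are guesses, and createEchoLlm is a hypothetical provider that echoes its input instead of calling an API.

```typescript
// Event and adapter shapes are assumptions based on the description above.
type LlmEvent =
  | { type: 'token'; text: string }
  | { type: 'tool_call'; name: string; args: Record<string, unknown> }

interface LlmAdapter {
  id: string
  stream(prompt: string, signal: AbortSignal): AsyncGenerator<LlmEvent>
}

// A stand-in provider that emits the prompt back one word at a time.
function createEchoLlm(): LlmAdapter {
  return {
    id: 'echo',
    async *stream(prompt: string, signal: AbortSignal): AsyncGenerator<LlmEvent> {
      for (const word of prompt.split(' ')) {
        // Honour cancellation before producing each token.
        if (signal.aborted) return
        yield { type: 'token', text: word }
      }
    },
  }
}
```

A real provider would replace the loop with a streaming HTTP request that receives the same signal, so aborting the turn also cancels the network call.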

Tighten the VAD for a noisy room

The detector in vad.ts is a stateful Vad with hysteresis (separate enter and exit thresholds) and a hangover (a run of confirming frames before the decision flips). Tune it per session through the orchestrator's vad option without touching vad.ts:

```typescript
new Orchestrator({
  vad: {
    enterThreshold: 0.04, // raise to ignore a louder room
    exitThreshold: 0.02,  // must stay at or below enterThreshold
    speechFrames: 2,      // require two loud frames before speech onset
    hangoverFrames: 5,    // hold through longer pauses before ending the turn
  },
})
```

Raise enterThreshold for a noisy room, raise hangoverFrames if the agent cuts speakers off mid-sentence, and lower speechFrames for the fastest possible onset. For real workloads, swap the frameRms energy term for a call into silero-vad-onnx and keep the hysteresis and hangover layer on top. The unit tests pin the contract either way.
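
Unlike the shipped snippets in this section, the following is an illustrative sketch of how hysteresis and hangover interact, not the contents of vad.ts. The option names come from this page; the VadSketch class, its push method, and this particular frameRms implementation are assumptions.

```typescript
interface VadOptions {
  enterThreshold: number
  exitThreshold: number
  speechFrames: number
  hangoverFrames: number
}

// RMS energy of a PCM16 frame, normalised to the range 0..1.
function frameRms(frame: Int16Array): number {
  if (frame.length === 0) return 0
  let sum = 0
  for (let i = 0; i < frame.length; i++) {
    const sample = frame[i] / 32768
    sum += sample * sample
  }
  return Math.sqrt(sum / frame.length)
}

class VadSketch {
  private readonly opts: VadOptions
  private speaking = false
  private run = 0 // consecutive frames that argue for flipping the decision

  constructor(opts: VadOptions) {
    this.opts = opts
  }

  // Feed one frame's energy; returns whether speech is currently active.
  push(energy: number): boolean {
    if (!this.speaking) {
      // Onset: need speechFrames consecutive frames at or above enterThreshold.
      this.run = energy >= this.opts.enterThreshold ? this.run + 1 : 0
      if (this.run >= this.opts.speechFrames) {
        this.speaking = true
        this.run = 0
      }
    } else {
      // Offset: need hangoverFrames consecutive frames below exitThreshold.
      this.run = energy < this.opts.exitThreshold ? this.run + 1 : 0
      if (this.run >= this.opts.hangoverFrames) {
        this.speaking = false
        this.run = 0
      }
    }
    return this.speaking
  }
}
```

Energy between the two thresholds never flips the decision in either direction, which is the hysteresis band; the frame counters are what absorb single-frame transients and short pauses.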

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Browser never leaves IDLE | Microphone permission denied, or input level below the VAD enter threshold | Check site permissions for localhost:3000; lower the vad.enterThreshold passed to the orchestrator to confirm the path |
| WebSocket fails to connect | Server not listening on :3001 | Check the pnpm dev logs; confirm NEXT_PUBLIC_SERVER_URL matches the server port |
| Transcripts but no audio | TTS server unreachable or provider key missing | Confirm OPENTTS_URL points at a running OpenTTS server, or set the key for the provider named in TTS_PROVIDER |
| No transcripts at all | Whisper.cpp server unreachable, or no STT key set | Confirm WHISPERCPP_URL points at a running whisper-server, or set DEEPGRAM_API_KEY / OPENAI_API_KEY for the alternative adapters |
| LLM only returns a configuration message | GROQ_API_KEY (or the selected provider key) unset | Set the key; the adapter yields a configuration token instead of streaming when no key is present |
| Agent talks over me when I interrupt | Barge-in not firing because the VAD is not confirming your interruption | Confirm input frames are reaching the server; lower vad.enterThreshold or vad.speechFrames so mid-output speech is caught faster |
| Agent cuts me off mid-sentence | Hangover too short for your speaking pace | Raise vad.hangoverFrames so brief pauses do not end the utterance |
| Tool is never called | Tool not registered, or the model did not choose to call it | Confirm the tool is in the registry passed to the orchestrator; check /health and the server logs for the advertised tools |
| pnpm dev warns about peer dependencies | Next 15 lists an older React peer range against React 19 | Safe to ignore; the build and runtime are verified against React 19 |
| CI fails on pnpm install --frozen-lockfile | pnpm-lock.yaml out of sync with a manifest change | Run pnpm install locally and commit the updated lockfile |

Repository

github.com/sarmakska/voice-agent-starter