Home

voice-agent-starter

A self-hosted, full-duplex voice agent loop with swappable streaming STT, LLM, and TTS. This wiki is the deep reference. The README is the product page; start there if you just want to run it.

Built by Sarma Linux. MIT licence.

Wiki index

Architecture: pipeline and sequence diagrams, the IDLE/LISTEN/THINK/SPEAK state machine, the barge-in and function-call flows, the component table, and the design rationale.
Quick-Start: install, the providers and environment variables you need, the keyless path, the self-hosted server setup, and a first-call walkthrough.
Roadmap: what shipped in 1.1.0, what is planned, what I will not ship, and how to contribute.

How the loop actually works

One voice session maps to one Orchestrator instance, created when the browser opens the /voice WebSocket and disposed when it closes. The orchestrator owns a four-state machine and never blocks: every provider call is a stream consumed with for await, and every cancellable operation is wrapped in an AbortController.

Capture. The browser runs the microphone through an AudioContext resampled to 16 kHz mono, converts each frame to PCM16, base64-encodes it, and sends it as a { type: 'audio', payload } message. The orchestrator is transport-agnostic, so terminating over a mediasoup or LiveKit SFU is a swap at the edge that the pipeline never sees.
Detect. Each frame runs through the stateful VAD in vad.ts, which uses hysteresis and a hangover so transients do not flip the decision. When speech is confirmed, the machine moves from IDLE to LISTEN.
Transcribe. While in LISTEN, voice frames feed the selected STT adapter for live partials. Trailing silence past a frame threshold flushes the adapter for a final transcript, which triggers the move to THINK.
Think. The final transcript is appended to the conversation and streamed to the LLM adapter with the registered tools advertised. The first token flips the machine to SPEAK and is fed straight into TTS, so audio starts before the completion finishes.
Speak. The TTS adapter synthesises sentence by sentence and the orchestrator streams PCM chunks back to the client. When the turn drains, the machine returns to IDLE.
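
The five steps above reduce to a small transition table. What follows is a minimal sketch of that table, not the repository's orchestrator: AgentState, TurnEvent, and nextState are hypothetical names, and the real machine also carries adapters and buffers alongside the state.

```typescript
// Hypothetical names throughout; the real orchestrator's types may differ.
type AgentState = 'IDLE' | 'LISTEN' | 'THINK' | 'SPEAK'
type TurnEvent =
  | 'speech_start' // VAD confirmed speech
  | 'final_transcript' // STT flushed after trailing silence
  | 'first_token' // LLM produced its first token
  | 'turn_drained' // TTS finished streaming the turn

// state -> event -> next state. A missing entry means "ignore the event".
const transitions: Record<AgentState, Partial<Record<TurnEvent, AgentState>>> = {
  IDLE: { speech_start: 'LISTEN' },
  LISTEN: { final_transcript: 'THINK' },
  THINK: { first_token: 'SPEAK', speech_start: 'LISTEN' }, // speech here is a barge-in
  SPEAK: { turn_drained: 'IDLE', speech_start: 'LISTEN' }, // speech here is a barge-in
}

function nextState(state: AgentState, event: TurnEvent): AgentState {
  return transitions[state][event] ?? state
}

// Walk one uninterrupted turn from IDLE back to IDLE.
const fullTurn: TurnEvent[] = ['speech_start', 'final_transcript', 'first_token', 'turn_drained']
let current: AgentState = 'IDLE'
for (const event of fullTurn) {
  current = nextState(current, event)
}
console.log(current) // prints "IDLE"
```

Keeping the transitions in data rather than in nested conditionals makes it easy to check the table against the one on the Architecture page.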

Barge-in is the part people get wrong. If the VAD detects speech while the machine is in THINK or SPEAK, the orchestrator aborts both the LLM and TTS streams through their AbortControllers, resets the STT and TTS adapters, emits a barge-in control message, and drops back to LISTEN. Because the abort signal propagates through the fetch body reader and the for await loops, there are no orphaned streams talking over the user.
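
The cancellation path can be sketched with one AbortController shared by the streams of a turn. This is an illustration under assumptions, not the orchestrator's code: fakeTokenStream and BargeInSketch are hypothetical, and the adapter resets and the barge-in control message are left out.

```typescript
// Stand-in for a provider stream (LLM or TTS) that honours an AbortSignal.
async function* fakeTokenStream(signal: AbortSignal): AsyncGenerator<string> {
  for (const token of ['The', ' forecast', ' for', ' today', ' is', ' sunny.']) {
    await new Promise((resolve) => setTimeout(resolve, 5)) // simulated latency
    if (signal.aborted) return // stop producing once the turn is cancelled
    yield token
  }
}

class BargeInSketch {
  state: 'LISTEN' | 'THINK' | 'SPEAK' = 'THINK'
  readonly spoken: string[] = []
  private readonly controller = new AbortController()

  // Consume the stream; onToken lets the caller simulate the VAD firing mid-reply.
  async run(onToken: (count: number) => void): Promise<void> {
    for await (const token of fakeTokenStream(this.controller.signal)) {
      if (this.controller.signal.aborted) break
      this.state = 'SPEAK'
      this.spoken.push(token)
      onToken(this.spoken.length)
    }
  }

  // Called when the VAD confirms speech during THINK or SPEAK.
  bargeIn(): void {
    if (this.state === 'LISTEN') return
    this.controller.abort() // cancels every stream that shares this signal
    this.state = 'LISTEN'
  }
}

async function demoBargeIn(): Promise<BargeInSketch> {
  const turn = new BargeInSketch()
  // Pretend the user starts talking right after the second token is spoken.
  await turn.run((count) => {
    if (count === 2) turn.bargeIn()
  })
  return turn
}
```

After demoBargeIn() resolves, the sketch sits in LISTEN with only the first two tokens spoken; the generator saw the aborted signal and returned instead of yielding the rest.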

Function-call passthrough lets the LLM call server-side functions mid-turn. The orchestrator advertises the registered tool definitions on every call, runs the matching handler when the model requests it, appends the result to the conversation, and re-streams so the model answers with grounded data. Tool rounds are bounded to guard against loops.
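
A bounded tool loop might look like the sketch below. The types, the runTurn name, and the MAX_TOOL_ROUNDS value are all assumptions made for illustration; this page does not state the real limit, and the real orchestrator streams tokens rather than returning a single step.

```typescript
type Message = { role: 'user' | 'assistant' | 'tool'; content: string }
type ModelStep =
  | { kind: 'text'; text: string }
  | { kind: 'tool_call'; name: string; args: Record<string, unknown> }
type Model = (conversation: Message[]) => Promise<ModelStep>
type Handler = (args: Record<string, unknown>) => Promise<string>

const MAX_TOOL_ROUNDS = 3 // hypothetical cap
const FALLBACK = 'Sorry, I could not complete that request.'

async function runTurn(
  model: Model,
  handlers: Map<string, Handler>,
  conversation: Message[],
): Promise<string> {
  let toolRounds = 0
  while (true) {
    const step = await model(conversation)
    if (step.kind === 'text') return step.text
    // The model wants another tool call; refuse once the budget is spent.
    if (toolRounds >= MAX_TOOL_ROUNDS) return FALLBACK
    toolRounds++
    const handler = handlers.get(step.name)
    const result = handler ? await handler(step.args) : `Unknown tool: ${step.name}`
    // Feed the result back so the next model call is grounded in it.
    conversation.push({ role: 'tool', content: result })
  }
}

// A scripted model: ask for the tool once, then answer from its result.
const groundedModel: Model = async (conversation) => {
  const last = conversation[conversation.length - 1]
  if (last && last.role === 'tool') {
    return { kind: 'text', text: `Here is what I found: ${last.content}` }
  }
  return { kind: 'tool_call', name: 'lookup_order', args: { id: '42' } }
}
```

Counting rounds rather than elapsed time keeps the guard deterministic: a model that requests a tool on every call gets exactly MAX_TOOL_ROUNDS executions and then the fallback.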

See Architecture for the diagrams and the full state-transition table.

Real-world examples

All examples below are taken from the codebase as it ships, not invented.

Run the pipeline without paying for any provider

You do not need provider keys to exercise the transport, state machine, barge-in, and tool calls. Leave the keys unset in .env and run pnpm dev. The LLM adapter then yields a single configuration message and the STT and TTS adapters return nothing, but the IDLE/LISTEN/THINK/SPEAK transitions, barge-in, and tool-call routing all still work. This is the fastest way to validate a transport or VAD change before standing up the self-hosted servers.
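
To illustrate why the loop keeps working without keys, here is a rough sketch, written for this page rather than taken from the codebase. The function names and the message wording are hypothetical; the point is only that a keyless adapter ends its stream normally instead of throwing.

```typescript
// Keyless LLM: one configuration message, then the stream ends normally.
async function* llmSketch(apiKey?: string): AsyncGenerator<string> {
  if (!apiKey) {
    yield 'No LLM key configured. Set a provider key in .env.' // hypothetical wording
    return
  }
  yield '(tokens from the real provider would stream here)'
}

// Keyless TTS: an empty stream rather than an error.
async function* ttsSketch(text: string, apiKey?: string): AsyncGenerator<Uint8Array> {
  if (!apiKey) return
  yield new TextEncoder().encode(text) // placeholder bytes standing in for PCM audio
}

// The consuming loops are identical with or without keys.
async function runKeylessTurn(): Promise<{ tokens: string[]; audioChunks: number }> {
  const tokens: string[] = []
  let audioChunks = 0
  for await (const token of llmSketch()) {
    tokens.push(token)
    for await (const _chunk of ttsSketch(token)) audioChunks++
  }
  return { tokens, audioChunks }
}
```

Because both generators complete cleanly, the for await loops finish and the turn can drain back to IDLE exactly as it would with real providers.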

Swap the TTS provider for one call

Switching from OpenTTS to ElevenLabs is a single environment variable, no code change:

TTS_PROVIDER=elevenlabs ELEVENLABS_API_KEY=sk-... pnpm dev

The registry in apps/server/src/adapters/tts/registry.ts reads TTS_PROVIDER and returns the matching adapter. The orchestrator only sees the TtsAdapter interface.
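
A registry of this kind is usually a lookup from the environment value to a factory. The sketch below is not the contents of registry.ts: TtsAdapterSketch is a trimmed-down hypothetical shape, and both the 'opentts' key and its use as the default are assumptions.

```typescript
// Hypothetical, trimmed-down adapter shape; the real TtsAdapter also streams audio.
interface TtsAdapterSketch {
  id: string
  configured: boolean
}

type Env = Record<string, string | undefined>
type TtsFactory = (env: Env) => TtsAdapterSketch

const ttsFactories = new Map<string, TtsFactory>([
  ['opentts', (env) => ({ id: 'opentts', configured: Boolean(env.OPENTTS_URL) })],
  ['elevenlabs', (env) => ({ id: 'elevenlabs', configured: Boolean(env.ELEVENLABS_API_KEY) })],
])

function selectTtsAdapter(env: Env): TtsAdapterSketch {
  // Falling back to 'opentts' is an assumption made for this sketch.
  const name = (env.TTS_PROVIDER ?? 'opentts').toLowerCase()
  const factory = ttsFactories.get(name)
  if (!factory) {
    const known = Array.from(ttsFactories.keys()).join(', ')
    throw new Error(`Unknown TTS_PROVIDER "${name}". Known providers: ${known}`)
  }
  return factory(env)
}

const adapter = selectTtsAdapter({ TTS_PROVIDER: 'elevenlabs', ELEVENLABS_API_KEY: 'sk-...' })
console.log(adapter.id) // prints "elevenlabs"
```

In a server you would pass process.env instead of a literal object. Failing loudly on an unknown name surfaces a typo in TTS_PROVIDER at startup rather than as silent audio.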

Register a server-side tool

apps/server/src/pipeline/tools.ts ships a ToolRegistry and two default tools. Registering your own is one call:

import { defaultTools } from './pipeline/tools.js'

const tools = defaultTools().register({
  definition: {
    name: 'lookup_order',
    description: 'Look up an order by id and return its status.',
    parameters: {
      type: 'object',
      properties: { id: { type: 'string' } },
      required: ['id'],
    },
  },
  handler: async (args) => {
    // `db` is a placeholder for your own data-access layer.
    const order = await db.orders.find(args.id as string)
    return order ? `Order ${order.id} is ${order.status}.` : 'Order not found.'
  },
})

Pass the registry into new Orchestrator({ tools }). The model is then free to call lookup_order mid-turn, and the orchestrator feeds the result back so the next tokens are grounded in real data.

Add a brand-new LLM provider

Create apps/server/src/adapters/llm/<provider>.ts exporting a factory that returns an LlmAdapter (an id and a stream() async generator that yields token and tool_call events and honours the AbortSignal). For an OpenAI-compatible provider, reuse parseChatSse, toWireMessages, and toWireTools from sse.ts, so the body is a few lines. Register it in apps/server/src/adapters/llm/registry.ts behind a new LLM_PROVIDER value. Nothing else in the pipeline changes.
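
As a rough picture of what such an adapter does, the sketch below parses OpenAI-style chat SSE lines into token events and stops on abort. It is not the repo's LlmAdapter or parseChatSse: the type shapes, the stream(signal) signature, and the injected line source are assumptions, and tool_call parsing is omitted for brevity.

```typescript
// Hypothetical shapes inferred from the description above; check the real types in the repo.
type LlmEvent =
  | { type: 'token'; text: string }
  | { type: 'tool_call'; name: string; args: Record<string, unknown> }

interface LlmAdapterSketch {
  id: string
  stream(signal: AbortSignal): AsyncGenerator<LlmEvent>
}

// Parse one SSE line from an OpenAI-compatible chat stream.
function parseSseLine(line: string): LlmEvent | 'done' | null {
  if (!line.startsWith('data:')) return null // comments, blank keep-alives
  const data = line.slice('data:'.length).trim()
  if (data === '[DONE]') return 'done'
  const chunk = JSON.parse(data) as { choices?: { delta?: { content?: string } }[] }
  const text = chunk.choices?.[0]?.delta?.content
  return text ? { type: 'token', text } : null
}

// The line source is injected so the sketch runs without a network call.
function createSketchAdapter(lines: Iterable<string> | AsyncIterable<string>): LlmAdapterSketch {
  return {
    id: 'sketch',
    async *stream(signal: AbortSignal): AsyncGenerator<LlmEvent> {
      for await (const line of lines) {
        if (signal.aborted) return // honour cancellation between events
        const event = parseSseLine(line)
        if (event === 'done') return
        if (event) yield event
      }
    },
  }
}
```

In a real adapter the lines would come from the provider's fetch response body, with the same AbortSignal passed to fetch so the request itself is cancelled on barge-in.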

Tighten the VAD for a noisy room

The detector in vad.ts is a stateful Vad with hysteresis (separate enter and exit thresholds) and a hangover (a run of confirming frames before the decision flips). Tune it per session through the orchestrator's vad option without touching vad.ts:

new Orchestrator({
  vad: {
    enterThreshold: 0.04, // raise to ignore a louder room
    exitThreshold: 0.02,  // must stay at or below enterThreshold
    speechFrames: 2,      // require two loud frames before speech onset
    hangoverFrames: 5,    // hold through longer pauses before ending the turn
  },
})

Raise enterThreshold for a noisy room, raise hangoverFrames if the agent cuts speakers off mid-sentence, and lower speechFrames for the fastest possible onset. For real workloads, swap the frameRms energy term for a call into silero-vad-onnx and keep the hysteresis and hangover layer on top. The unit tests pin the contract either way.
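
To make the four options concrete, here is a minimal sketch of a hysteresis-plus-hangover detector. It reuses the option names shown above but is not the code in vad.ts: VadSketch and rms are hypothetical, and the exact comparison rules in the real detector may differ.

```typescript
interface VadOptions {
  enterThreshold: number
  exitThreshold: number
  speechFrames: number
  hangoverFrames: number
}

// Root-mean-square energy of one frame of samples in [-1, 1].
function rms(frame: Float32Array): number {
  if (frame.length === 0) return 0
  let sum = 0
  for (let i = 0; i < frame.length; i++) sum += frame[i] * frame[i]
  return Math.sqrt(sum / frame.length)
}

class VadSketch {
  private readonly opts: VadOptions
  private speaking = false
  private loudRun = 0 // consecutive frames at or above enterThreshold
  private quietRun = 0 // consecutive frames at or below exitThreshold

  constructor(opts: VadOptions) {
    this.opts = opts
  }

  // Returns true while the detector considers the user to be speaking.
  process(frame: Float32Array): boolean {
    const energy = rms(frame)
    if (!this.speaking) {
      this.loudRun = energy >= this.opts.enterThreshold ? this.loudRun + 1 : 0
      if (this.loudRun >= this.opts.speechFrames) {
        this.speaking = true
        this.quietRun = 0
      }
    } else {
      // Energy between the two thresholds resets the quiet run: that gap is the hysteresis.
      this.quietRun = energy <= this.opts.exitThreshold ? this.quietRun + 1 : 0
      if (this.quietRun >= this.opts.hangoverFrames) {
        this.speaking = false
        this.loudRun = 0
      }
    }
    return this.speaking
  }
}
```

With the values from the example above, a single loud frame never triggers onset, and an utterance only ends after five consecutive quiet frames, so a short pause between words keeps the turn open.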

Troubleshooting

Symptom	Likely cause	Fix
Browser never leaves IDLE	Microphone permission denied, or input level below the VAD enter threshold	Check site permissions for `localhost:3000`; lower the `vad.enterThreshold` passed to the orchestrator to confirm the path
WebSocket fails to connect	Server not listening on `:3001`	Check the `pnpm dev` logs; confirm `NEXT_PUBLIC_SERVER_URL` matches the server port
Transcripts but no audio	TTS server unreachable or provider key missing	Confirm `OPENTTS_URL` points at a running OpenTTS server, or set the key for the provider named in `TTS_PROVIDER`
No transcripts at all	Whisper.cpp server unreachable, or no STT key set	Confirm `WHISPERCPP_URL` points at a running whisper-server, or set `DEEPGRAM_API_KEY` / `OPENAI_API_KEY` for the alternative adapters
LLM only returns a configuration message	`GROQ_API_KEY` (or the selected provider key) unset	Set the key; the adapter yields a configuration token instead of streaming when no key is present
Agent talks over me when I interrupt	Barge-in not firing because the VAD is not confirming your interruption	Confirm input frames are reaching the server; lower `vad.enterThreshold` or `vad.speechFrames` so mid-output speech is caught faster
Agent cuts me off mid-sentence	Hangover too short for your speaking pace	Raise `vad.hangoverFrames` so brief pauses do not end the utterance
Tool is never called	Tool not registered, or the model did not choose to call it	Confirm the tool is in the registry passed to the orchestrator; check `/health` and the server logs for the advertised tools
`pnpm dev` warns about peer dependencies	Next 15 lists an older React peer range against React 19	Safe to ignore; the build and runtime are verified against React 19
CI fails on `pnpm install --frozen-lockfile`	`pnpm-lock.yaml` out of sync with a manifest change	Run `pnpm install` locally and commit the updated lockfile

Repository

github.com/sarmakska/voice-agent-starter
