sarmakska edited this page May 31, 2026 · 5 revisions

voice-agent-starter

Sub-second, full-duplex voice agent loop with swappable STT, LLM, and TTS. This wiki is the deep reference. The README is the product page; start there if you just want to run it.

Built by Sarma Linux. MIT licence.

Wiki index

  • Architecture: pipeline sequence diagram, the IDLE/LISTEN/THINK/SPEAK state machine, the component table, and the design rationale behind each piece.
  • Quick-Start: install, the environment variables you need, the keyless smoke-test path, and a first-call walkthrough.
  • Roadmap: what shipped in 1.0.0, what is planned, what I will not ship, and how to contribute.

How the loop actually works

One voice session maps to one Orchestrator instance, created when the browser opens the /voice WebSocket and disposed when it closes. The orchestrator owns a four-state machine and never blocks: every provider call is a stream consumed with for await, and every cancellable operation is wrapped in an AbortController.

  1. Capture. The browser runs the microphone through an AudioContext resampled to 16 kHz mono, converts each frame to PCM16, base64-encodes it, and sends it as a { type: 'audio', payload } message. In production you would carry this over a mediasoup SFU rather than a raw WebSocket; the orchestrator does not care which transport delivers the frames.
  2. Detect. Each frame runs through the RMS-threshold VAD in vad.ts. When the frame's energy crosses the threshold, the machine moves from IDLE to LISTEN.
  3. Transcribe. While in LISTEN, frames feed the selected STT adapter. Partial transcripts are pushed back to the client live; a final transcript triggers the move to THINK.
  4. Think. The final transcript is appended to the conversation history and streamed to the LLM adapter. The orchestrator hands off to TTS once it has a meaningful chunk of text rather than waiting for the full completion, which is what keeps time-to-first-audio low.
  5. Speak. TTS streams audio chunks straight back to the client as base64 PCM. When the stream finishes the machine returns to IDLE.
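The encoding in the Capture step can be sketched as follows. This is a minimal illustration, not the repository's client code: the helper names floatToPcm16, toBase64, and encodeFrame are hypothetical, and only the { type: 'audio', payload } message shape comes from the description above. It assumes the samples arrive as Float32 values in the range -1 to 1, which is what the Web Audio API delivers.

```typescript
// Message shape taken from the Capture step; everything else is illustrative.
type AudioMessage = { type: 'audio'; payload: string };

// Convert Float32 samples in [-1, 1] to little-endian 16-bit PCM bytes.
function floatToPcm16(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2);
  const view = new DataView(out.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]));
    const scaled = s < 0 ? s * 0x8000 : s * 0x7fff;
    view.setInt16(i * 2, Math.round(scaled), true);
  }
  return out;
}

// Base64-encode raw bytes using btoa, which is available in browsers
// and as a global in current Node.js releases.
function toBase64(bytes: Uint8Array): string {
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

// Build the message the client would send for one frame.
function encodeFrame(samples: Float32Array): AudioMessage {
  return { type: 'audio', payload: toBase64(floatToPcm16(samples)) };
}
```

A client would call encodeFrame on each captured frame and pass the result through JSON.stringify before sending it over the socket.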

Barge-in is the part people get wrong. If the VAD detects speech while the machine is in THINK or SPEAK, the orchestrator aborts both the LLM and TTS streams through their AbortControllers, emits a barge-in control message, and drops back to LISTEN. Because the abort signal propagates through the fetch body reader and the for await loops, there are no orphaned streams talking over the user.
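The barge-in path can be illustrated with a small sketch. It is a simplification under stated assumptions: TurnSketch and fakeStream are hypothetical names, the provider stream is faked with a timer, and a single AbortController stands in for the separate LLM and TTS controllers described above.

```typescript
type State = 'IDLE' | 'LISTEN' | 'THINK' | 'SPEAK';

// Hypothetical stand-in for a provider stream: yields chunks until aborted.
async function* fakeStream(
  chunks: string[],
  signal: AbortSignal,
): AsyncGenerator<string> {
  for (const chunk of chunks) {
    await new Promise<void>((resolve) => setTimeout(resolve, 5));
    // Check after the wait so an abort during it stops output.
    if (signal.aborted) return;
    yield chunk;
  }
}

class TurnSketch {
  state: State = 'IDLE';
  spoken: string[] = [];
  events: string[] = [];
  // Assumption: one controller for brevity; the text describes one per stream.
  private controller: AbortController | null = null;

  // Stream a response to the "client" (here, the spoken array).
  async respond(chunks: string[]): Promise<void> {
    this.controller = new AbortController();
    const signal = this.controller.signal;
    this.state = 'SPEAK';
    for await (const chunk of fakeStream(chunks, signal)) {
      this.spoken.push(chunk);
    }
    // Only a stream that finished naturally returns the machine to IDLE.
    if (!signal.aborted) this.state = 'IDLE';
  }

  // Called when the VAD reports speech.
  onVoiceDetected(): void {
    if (this.state === 'THINK' || this.state === 'SPEAK') {
      if (this.controller) this.controller.abort();
      this.events.push('barge-in');
    }
    this.state = 'LISTEN';
  }
}
```

Because the generator checks the signal before every yield, no chunk is emitted after onVoiceDetected runs, which is the property that prevents the agent from talking over the user.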

See Architecture for the sequence diagram and the full state-transition table.
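As a quick summary of the transitions described in this section, the state machine can be written as a pure function. This is a sketch, not the table from the Architecture page: the event names are hypothetical, and the THINK to SPEAK trigger (the first hand-off to TTS) is inferred from the Think step rather than stated explicitly.

```typescript
type State = 'IDLE' | 'LISTEN' | 'THINK' | 'SPEAK';

// Hypothetical event names for the triggers described in the steps above.
type LoopEvent = 'voice' | 'final-transcript' | 'tts-started' | 'tts-finished';

function nextState(state: State, event: LoopEvent): State {
  switch (event) {
    case 'voice':
      // IDLE starts listening; THINK and SPEAK are interrupted (barge-in).
      return 'LISTEN';
    case 'final-transcript':
      return state === 'LISTEN' ? 'THINK' : state;
    case 'tts-started':
      // Assumption: the hand-off to TTS is what moves THINK to SPEAK.
      return state === 'THINK' ? 'SPEAK' : state;
    case 'tts-finished':
      return state === 'SPEAK' ? 'IDLE' : state;
    default:
      return state;
  }
}
```

Events that do not apply to the current state leave it unchanged, so a late tts-finished arriving after a barge-in cannot pull the machine out of LISTEN.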

Real-world examples

Run the pipeline without paying for any provider

You do not need provider keys to exercise the transport and state machine. Leave the keys unset in .env, run pnpm dev, and the adapters fall back to stubs: you will see placeholder transcripts and silent audio while the IDLE/LISTEN/THINK/SPEAK transitions and barge-in all work. This is the fastest way to validate a transport or VAD change before spending on API calls.

Swap the TTS provider for one call

Switching from Cartesia to ElevenLabs takes one environment variable plus that provider's API key, with no code change:

TTS_PROVIDER=elevenlabs ELEVENLABS_API_KEY=sk-... pnpm dev

The registry in apps/server/src/adapters/tts/registry.ts reads TTS_PROVIDER and returns the matching adapter. The orchestrator only sees the TtsAdapter interface.
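A registry of this kind might look like the sketch below. The TtsAdapter shape, the makeStub factory, and the choice of cartesia as the default are all assumptions made for illustration; the real interface and defaults live in the repository. The environment is passed in as a parameter so the function is easy to test.

```typescript
// Assumed interface shape; the repository defines the real TtsAdapter.
interface TtsAdapter {
  name: string;
  synthesize(text: string, signal: AbortSignal): AsyncIterable<Uint8Array>;
}

// Hypothetical stub adapter that yields one silent chunk.
function makeStub(name: string): TtsAdapter {
  return {
    name,
    async *synthesize(_text: string, signal: AbortSignal) {
      if (!signal.aborted) yield new Uint8Array(320);
    },
  };
}

// One factory per supported TTS_PROVIDER value.
const factories: Record<string, () => TtsAdapter> = {
  cartesia: () => makeStub('cartesia'),
  elevenlabs: () => makeStub('elevenlabs'),
};

type Env = Record<string, string | undefined>;

// Read TTS_PROVIDER and return the matching adapter.
function selectTts(env: Env): TtsAdapter {
  // Assumption: cartesia is the default when the variable is unset.
  const key = env.TTS_PROVIDER ?? 'cartesia';
  const factory = factories[key];
  if (!factory) {
    throw new Error('Unknown TTS_PROVIDER: ' + key);
  }
  return factory();
}
```

On the server this would be called as selectTts(process.env). Failing loudly on an unknown value is a design choice in this sketch; it surfaces a typo in the variable at startup rather than on the first call.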

Add a brand-new provider

Create apps/server/src/adapters/llm/<provider>.ts exporting a factory that returns an object implementing the LlmAdapter interface (a single stream() async generator that yields token strings and honours the AbortSignal). Register it in apps/server/src/adapters/llm/registry.ts behind a new LLM_PROVIDER value. Nothing else in the pipeline changes, which is the whole point of the adapter seam.
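For orientation, a provider file could follow the shape below. The ChatMessage and LlmAdapter types here are assumed approximations of the repository's interface, and createEchoAdapter is a hypothetical provider that simply echoes the last message word by word; a real adapter would read tokens from the provider's streaming response instead.

```typescript
// Assumed shapes; check the repository for the real definitions.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface LlmAdapter {
  stream(history: ChatMessage[], signal: AbortSignal): AsyncGenerator<string>;
}

// Hypothetical provider factory: echoes the last message one word at a time.
function createEchoAdapter(): LlmAdapter {
  return {
    async *stream(history: ChatMessage[], signal: AbortSignal) {
      const last = history[history.length - 1];
      const words = last ? last.content.split(' ') : [];
      for (const word of words) {
        // Honour the AbortSignal before every token so barge-in stops output.
        if (signal.aborted) return;
        yield word + ' ';
      }
    },
  };
}
```

The caller consumes it with for await, passing the signal from its own AbortController:

```typescript
async function collect(adapter: LlmAdapter, text: string): Promise<string[]> {
  const controller = new AbortController();
  const tokens: string[] = [];
  const history: ChatMessage[] = [{ role: 'user', content: text }];
  for await (const token of adapter.stream(history, controller.signal)) {
    tokens.push(token);
  }
  return tokens;
}
```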

Tighten the VAD for a noisy room

The default detector in vad.ts is a plain RMS threshold (RMS_THRESHOLD = 0.02). Raise it for noisy environments or, for real workloads, replace detectVoice with a call into webrtcvad-wasm or silero-vad-onnx. The function signature stays (Buffer) => boolean, so the orchestrator is unaffected and the existing smoke test still guards the contract.
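An RMS-threshold detector with that signature can be sketched as below. This is an illustration rather than the contents of vad.ts: it assumes frames are little-endian PCM16 and that samples are normalised to the range -1 to 1 before the comparison, which is consistent with a threshold of 0.02 but is not confirmed by the source.

```typescript
// Threshold value from the text; the normalisation below is an assumption.
const RMS_THRESHOLD = 0.02;

// Returns true when the frame's RMS energy exceeds the threshold.
function detectVoice(frame: Buffer): boolean {
  const sampleCount = Math.floor(frame.length / 2);
  if (sampleCount === 0) return false;

  let sumSquares = 0;
  for (let i = 0; i < sampleCount; i++) {
    // Read one little-endian 16-bit sample and scale it to [-1, 1).
    const sample = frame.readInt16LE(i * 2) / 32768;
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / sampleCount) > RMS_THRESHOLD;
}
```

Raising RMS_THRESHOLD makes the detector ignore more background noise at the cost of missing quiet speech, which is why a model-based VAD is the better option for real workloads.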

Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Browser never leaves IDLE | Microphone permission denied, or input level below the RMS threshold | Check site permissions for localhost:3000; lower RMS_THRESHOLD in vad.ts to confirm the path |
| WebSocket fails to connect | Server not listening on :3001 | Check the pnpm dev logs; confirm NEXT_PUBLIC_SERVER_URL matches the server port |
| Transcripts appear but no audio comes back | TTS provider key missing or invalid | Check the server logs for the failed adapter call; set the key for the provider named in TTS_PROVIDER |
| LLM never streams tokens | SARMALINK_API_KEY (or the selected provider key) unset | Set the key; without one, the SarmaLink adapter yields a configuration message instead of streaming |
| Agent talks over me when I interrupt | Barge-in not firing because the VAD is not detecting your interruption | Confirm input frames are reaching the server; lower the RMS threshold so mid-output speech is caught |
| pnpm dev warns about peer dependencies | Next 15 lists an older React peer range against React 19 | Safe to ignore; the build and runtime are verified against React 19 |
| CI fails on pnpm install --frozen-lockfile | pnpm-lock.yaml out of sync with a manifest change | Run pnpm install locally and commit the updated lockfile |

Repository

github.com/sarmakska/voice-agent-starter
