Home
Sub-second, full-duplex voice agent loop with swappable STT, LLM, and TTS. This wiki is the deep reference. The README is the product page; start there if you just want to run it.
Built by Sarma Linux. MIT licence.
- Architecture: pipeline sequence diagram, the IDLE/LISTEN/THINK/SPEAK state machine, the component table, and the design rationale behind each piece.
- Quick-Start: install, the environment variables you need, the keyless smoke-test path, and a first-call walkthrough.
- Roadmap: what shipped in 1.0.0, what is planned, what I will not ship, and how to contribute.
One voice session maps to one Orchestrator instance, created when the browser opens the /voice WebSocket and disposed when it closes. The orchestrator owns a four-state machine and never blocks: every provider call is a stream consumed with for await, and every cancellable operation is wrapped in an AbortController.
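As a rough orientation, the four-state machine can be sketched as a pure transition function. The event names (`speechStart`, `finalTranscript`, `firstTtsChunk`, `ttsDone`) are hypothetical, and the exact trigger for THINK to SPEAK is an assumption; the real orchestrator may model both differently.

```typescript
// Hypothetical names throughout; this is a sketch, not the project's code.
type State = "IDLE" | "LISTEN" | "THINK" | "SPEAK";
type VoiceEvent = "speechStart" | "finalTranscript" | "firstTtsChunk" | "ttsDone";

// Returns the next state, or the current state when the event
// has no meaning in that state.
function nextState(state: State, event: VoiceEvent): State {
  switch (state) {
    case "IDLE":
      return event === "speechStart" ? "LISTEN" : state;
    case "LISTEN":
      return event === "finalTranscript" ? "THINK" : state;
    case "THINK":
      if (event === "speechStart") return "LISTEN"; // barge-in while thinking
      return event === "firstTtsChunk" ? "SPEAK" : state;
    case "SPEAK":
      if (event === "speechStart") return "LISTEN"; // barge-in while speaking
      return event === "ttsDone" ? "IDLE" : state;
  }
}
```

Keeping the transitions in one pure function makes them easy to unit-test without a microphone, a WebSocket, or any provider.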
- Capture. The browser runs the microphone through an AudioContext resampled to 16 kHz mono, converts each frame to PCM16, base64-encodes it, and sends it as a { type: 'audio', payload } message. In production you would terminate this over a mediasoup SFU rather than a raw WebSocket; the orchestrator does not care which transport delivers the frames.
- Detect. Each frame runs through the RMS-threshold VAD in vad.ts. When energy crosses the threshold the machine moves IDLE to LISTEN.
- Transcribe. While in LISTEN, frames feed the selected STT adapter. Partial transcripts are pushed back to the client live; a final transcript triggers the move to THINK.
- Think. The final transcript is appended to the conversation history and streamed to the LLM adapter. The orchestrator hands off to TTS once it has a meaningful chunk of text rather than waiting for the full completion, which is what keeps time-to-first-audio low.
- Speak. TTS streams audio chunks straight back to the client as base64 PCM. When the stream finishes the machine returns to IDLE.
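The early hand-off in the Think step can be illustrated with a small async generator that groups streamed tokens into sentence-sized chunks. The sentence-boundary rule used here is an assumption standing in for whatever "meaningful chunk" means in the real orchestrator, and `chunkForTts` and `fromArray` are hypothetical names.

```typescript
// Turns an in-memory array into an async stream (demo helper only).
async function* fromArray(items: string[]): AsyncGenerator<string> {
  for (const item of items) yield item;
}

// Buffers tokens and yields a chunk as soon as a sentence is complete,
// so TTS can start before the LLM has finished the whole reply.
// Assumption: a sentence ends at ".", "!" or "?" followed by whitespace.
async function* chunkForTts(tokens: AsyncIterable<string>): AsyncGenerator<string> {
  let buffer = "";
  for await (const token of tokens) {
    buffer += token;
    let match = /^[\s\S]*?[.!?]\s+/.exec(buffer);
    while (match !== null) {
      yield match[0].trim();
      buffer = buffer.slice(match[0].length);
      match = /^[\s\S]*?[.!?]\s+/.exec(buffer);
    }
  }
  // Flush whatever is left when the LLM stream ends.
  if (buffer.trim().length > 0) yield buffer.trim();
}
```

Requiring whitespace after the punctuation avoids splitting inside a number such as 3.14, at the cost of holding the last sentence until the next token or the end of the stream arrives.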
Barge-in is the part people get wrong. If the VAD detects speech while the machine is in THINK or SPEAK, the orchestrator aborts both the LLM and TTS streams through their AbortControllers, emits a barge-in control message, and drops back to LISTEN. Because the abort signal propagates through the fetch body reader and the for await loops, there are no orphaned streams talking over the user.
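A minimal sketch of that abort path, assuming a single stream and a single controller (the orchestrator described above aborts both the LLM and TTS streams). `fakeStream` and `playTurn` are hypothetical stand-ins, not functions from the codebase.

```typescript
// Stand-in for a provider stream that honours an AbortSignal.
async function* fakeStream(chunks: string[], signal: AbortSignal): AsyncGenerator<string> {
  for (const chunk of chunks) {
    if (signal.aborted) return; // stop producing once the caller aborts
    yield chunk;
    await new Promise<void>((resolve) => setTimeout(resolve, 0)); // simulated latency
  }
}

// Forwards chunks to the client until the stream ends or is aborted.
async function playTurn(
  chunks: string[],
  controller: AbortController,
  onAudio: (chunk: string) => void,
): Promise<"done" | "barge-in"> {
  for await (const chunk of fakeStream(chunks, controller.signal)) {
    if (controller.signal.aborted) break; // never forward audio after an abort
    onAudio(chunk);
  }
  return controller.signal.aborted ? "barge-in" : "done";
}
```

Calling `controller.abort()` from the VAD path while `playTurn` is running ends the loop on the next iteration, after which the caller can emit the barge-in control message and return to LISTEN.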
See Architecture for the sequence diagram and the full state-transition table.
You do not need provider keys to exercise the transport and state machine. Leave the keys unset in .env, run pnpm dev, and the adapters fall back to stubs: you will see placeholder transcripts and silent audio while the IDLE/LISTEN/THINK/SPEAK transitions and barge-in all work. This is the fastest way to validate a transport or VAD change before spending on API calls.
Switching from Cartesia to ElevenLabs is a single environment variable, no code change:

    TTS_PROVIDER=elevenlabs ELEVENLABS_API_KEY=sk-... pnpm dev

The registry in apps/server/src/adapters/tts/registry.ts reads TTS_PROVIDER and returns the matching adapter. The orchestrator only sees the TtsAdapter interface.
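A registry of this kind might look roughly like the sketch below. The shape of `TtsAdapter`, the `selectTts` name, the placeholder adapters, and the Cartesia default are all assumptions made for illustration; only the `TTS_PROVIDER` variable and the two provider names come from this page.

```typescript
// Assumed adapter shape; the real TtsAdapter interface may differ.
interface TtsAdapter {
  readonly name: string;
  synthesize(text: string, signal: AbortSignal): AsyncIterable<Uint8Array>;
}

type Env = Record<string, string | undefined>;

// Placeholder adapter that yields a single empty audio chunk.
function placeholderAdapter(name: string): TtsAdapter {
  return {
    name,
    async *synthesize() {
      yield new Uint8Array(0);
    },
  };
}

// Provider name -> factory. A Map avoids accidental hits on Object.prototype keys.
const factories = new Map<string, () => TtsAdapter>([
  ["cartesia", () => placeholderAdapter("cartesia")],
  ["elevenlabs", () => placeholderAdapter("elevenlabs")],
]);

// Reads TTS_PROVIDER from the given environment and returns the matching adapter.
// Assumption: Cartesia is the default when the variable is unset.
function selectTts(env: Env): TtsAdapter {
  const provider = (env.TTS_PROVIDER ?? "cartesia").toLowerCase();
  const factory = factories.get(provider);
  if (factory === undefined) throw new Error(`Unknown TTS_PROVIDER: ${provider}`);
  return factory();
}
```

In a Node server you would call `selectTts(process.env)`; passing the environment as an argument keeps the lookup testable without touching global state.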
Create apps/server/src/adapters/llm/<provider>.ts exporting a factory that returns an object implementing the LlmAdapter interface (a single stream() async generator that yields token strings and honours the AbortSignal). Register it in apps/server/src/adapters/llm/registry.ts behind a new LLM_PROVIDER value. Nothing else in the pipeline changes, which is the whole point of the adapter seam.
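For illustration, a toy adapter that satisfies that contract could look like this. The parameter list of `stream()` is an assumption (this page only says it is an async generator that yields token strings and honours the AbortSignal), and `createEchoAdapter` is a hypothetical factory that echoes the prompt back word by word instead of calling a model.

```typescript
// Assumed interface shape; check the real LlmAdapter definition before copying.
interface LlmAdapter {
  stream(prompt: string, signal: AbortSignal): AsyncGenerator<string>;
}

// Hypothetical factory: yields the prompt back one word at a time.
function createEchoAdapter(): LlmAdapter {
  return {
    async *stream(prompt: string, signal: AbortSignal): AsyncGenerator<string> {
      for (const word of prompt.split(/\s+/).filter((w) => w.length > 0)) {
        if (signal.aborted) return; // honour cancellation between tokens
        yield word + " ";
      }
    },
  };
}
```

A real provider adapter would replace the loop body with a streamed HTTP call and pass the same signal to `fetch`, so that an abort also cancels the network request.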
The default detector in vad.ts is a plain RMS threshold (RMS_THRESHOLD = 0.02). Raise it for noisy environments or, for real workloads, replace detectVoice with a call into webrtcvad-wasm or silero-vad-onnx. The function signature stays (Buffer) => boolean, so the orchestrator is unaffected and the existing smoke test still guards the contract.
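An RMS-threshold detector of this kind can be sketched as follows. The sketch assumes little-endian PCM16 frames normalised to the range [-1, 1) and a strict greater-than comparison; the real vad.ts may differ on any of these. The parameter is typed as `Uint8Array` so the example runs outside Node, and a Node `Buffer` can be passed unchanged because `Buffer` is a `Uint8Array` subclass. `pcm16Frame` is a hypothetical helper for building test frames.

```typescript
const RMS_THRESHOLD = 0.02;

// Root-mean-square energy of a little-endian PCM16 frame, normalised to [0, 1].
function rms(frame: Uint8Array): number {
  const view = new DataView(frame.buffer, frame.byteOffset, frame.byteLength);
  const sampleCount = Math.floor(frame.byteLength / 2);
  if (sampleCount === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < sampleCount; i++) {
    const sample = view.getInt16(i * 2, true) / 32768; // scale to [-1, 1)
    sumSquares += sample * sample;
  }
  return Math.sqrt(sumSquares / sampleCount);
}

// True when the frame's energy exceeds the threshold.
function detectVoice(frame: Uint8Array): boolean {
  return rms(frame) > RMS_THRESHOLD;
}

// Demo helper: packs 16-bit samples into a little-endian byte frame.
function pcm16Frame(samples: number[]): Uint8Array {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  samples.forEach((sample, i) => view.setInt16(i * 2, sample, true));
  return bytes;
}
```

With these assumptions a constant sample value of 8000 has an RMS of about 0.24 and counts as speech, while a value of 100 (about 0.003) does not.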
| Symptom | Likely cause | Fix |
|---|---|---|
| Browser never leaves IDLE | Microphone permission denied, or input level below the RMS threshold | Check site permissions for localhost:3000; lower RMS_THRESHOLD in vad.ts to confirm the path |
| WebSocket fails to connect | Server not listening on :3001 | Check the pnpm dev logs; confirm NEXT_PUBLIC_SERVER_URL matches the server port |
| Transcripts appear but no audio comes back | TTS provider key missing or invalid | Server logs show the failed adapter call; set the key for the provider named in TTS_PROVIDER |
| LLM never streams tokens | SARMALINK_API_KEY (or the selected provider key) unset | The SarmaLink adapter yields a configuration message instead of streaming when no key is set |
| Agent talks over me when I interrupt | Barge-in not firing because the VAD is not detecting your interruption | Confirm input frames are reaching the server; lower the RMS threshold so mid-output speech is caught |
| pnpm dev warns about peer dependencies | Next 15 lists an older React peer range against React 19 | Safe to ignore; the build and runtime are verified against React 19 |
| CI fails on pnpm install --frozen-lockfile | pnpm-lock.yaml out of sync with a manifest change | Run pnpm install locally and commit the updated lockfile |