Architecture
```mermaid
sequenceDiagram
    participant U as User browser
    participant SRV as Fastify / WebSocket
    participant V as VAD
    participant S as Streaming STT
    participant L as Streaming LLM
    participant TO as Tool registry
    participant T as Streaming TTS
    U->>SRV: PCM16 frames (16kHz mono, base64)
    SRV->>V: each frame
    V->>V: detect voice activity
    V->>S: forward voice frames
    S-->>SRV: partial transcripts (live)
    Note over S: trailing silence -> flush -> final transcript
    S->>L: final transcript + tool definitions
    L-->>T: first token (after first token, state -> SPEAK)
    L-->>TO: tool_call (if requested)
    TO-->>L: tool result (re-stream)
    T-->>SRV: PCM audio chunks (per sentence)
    SRV-->>U: audio playback
    Note over U,SRV: total latency P50 ~800ms self-hosted
```
```mermaid
stateDiagram-v2
    [*] --> IDLE
    IDLE --> LISTEN: voice detected
    LISTEN --> THINK: final transcript
    THINK --> SPEAK: first LLM token
    SPEAK --> IDLE: turn complete
    LISTEN --> IDLE: silence, no transcript
    THINK --> LISTEN: barge-in (cancel LLM + TTS)
    SPEAK --> LISTEN: barge-in (cancel LLM + TTS)
```
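The state diagram can be read as a small transition table. The sketch below is one way to encode it, not the orchestrator's actual code: the `TurnState` and `TurnEvent` names, the event labels, and the `nextState` helper are all assumptions made for illustration.

```typescript
// Hypothetical encoding of the diagram above; names are illustrative only.
type TurnState = 'IDLE' | 'LISTEN' | 'THINK' | 'SPEAK'
type TurnEvent =
  | 'voice_detected'
  | 'final_transcript'
  | 'no_transcript'
  | 'first_token'
  | 'turn_complete'
  | 'barge_in'

// One entry per arrow in the diagram.
const TRANSITIONS: Record<TurnState, Partial<Record<TurnEvent, TurnState>>> = {
  IDLE: { voice_detected: 'LISTEN' },
  LISTEN: { final_transcript: 'THINK', no_transcript: 'IDLE' },
  THINK: { first_token: 'SPEAK', barge_in: 'LISTEN' },
  SPEAK: { turn_complete: 'IDLE', barge_in: 'LISTEN' },
}

// Events with no arrow out of the current state leave the state unchanged.
function nextState(state: TurnState, event: TurnEvent): TurnState {
  return TRANSITIONS[state][event] ?? state
}
```

Keeping the arrows in a table makes it easy to see that barge-in is only meaningful in THINK and SPEAK, and that every other event is ignored there.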
| File | Responsibility |
|---|---|
| `apps/server/src/index.ts` | Fastify server, /health and /voice WebSocket, message dispatch |
| `apps/server/src/pipeline/orchestrator.ts` | Duplex state machine, barge-in, function-call passthrough |
| `apps/server/src/pipeline/tools.ts` | Tool registry and default tools |
| `apps/server/src/pipeline/vad.ts` | Stateful VAD with hysteresis and hangover, plus the stateless RMS primitive |
| `apps/server/src/adapters/audio.ts` | PCM/WAV conversion and sentence splitting |
| `apps/server/src/adapters/llm/sse.ts` | OpenAI-compatible SSE reader and wire mapping |
| `apps/server/src/adapters/stt/*.ts` | Streaming STT adapters (Whisper.cpp, Deepgram, OpenAI Whisper) |
| `apps/server/src/adapters/llm/*.ts` | Streaming LLM adapters (Groq, SarmaLink-AI, OpenAI) |
| `apps/server/src/adapters/tts/*.ts` | Chunked TTS adapters (OpenTTS, Cartesia, ElevenLabs) |
| `apps/web/app/page.tsx` | Browser client with microphone capture |
```ts
interface SttAdapter {
  readonly id: string
  feed(pcm: Buffer): Promise<{ text: string; final: boolean } | null>
  flush(): Promise<{ text: string; final: boolean } | null>
  reset(): void
}

interface LlmAdapter {
  readonly id: string
  stream(opts: { messages: ChatMessage[]; signal: AbortSignal; tools?: ToolDefinition[] })
    : AsyncGenerator<{ type: 'token'; text: string } | { type: 'tool_call'; call: ToolCall }>
}

interface TtsAdapter {
  readonly id: string
  feed(text: string): void
  stream(opts: { signal: AbortSignal }): AsyncGenerator<Buffer>
  end(): void
  reset(): void
}
```

The per-session detector lives in vad.ts. There are two layers. frameRms (and the convenience detectVoice) is the stateless energy primitive: root-mean-square of a PCM16 frame against a threshold, exact and side-effect free. Vad is the stateful detector the orchestrator actually runs, and it adds the two techniques a single energy gate is missing for full-duplex use:
- Hysteresis. The enter threshold (default `0.025`) is higher than the exit threshold (default `0.015`). Energy that hovers near one boundary cannot rattle the state back and forth frame by frame: once speaking, the level only has to stay above the lower exit threshold to sustain.
- Hangover. A run of confirming frames is required before the decision flips: `speechFrames` frames above enter to declare speech, `hangoverFrames` frames below exit to declare silence. A lone loud transient (a door, a keyboard tap) never declares speech, and a brief intra-word dip never ends the utterance.
The orchestrator constructs one Vad per session with speechFrames: 1 for a fast onset, leaving the hangover (default 3 frames) to absorb transients during output. Tune any of the four parameters through the orchestrator's vad option. Eight unit tests pin the onset run, transient rejection, the hangover hold, the hysteresis band, and the threshold-ordering guard.
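To make the two layers concrete, here is a minimal sketch of an RMS primitive plus a detector with hysteresis and hangover. It is an independent illustration, not the code in vad.ts. The `sketchFrameRms` and `SketchVad` names, the `Int16Array` input, the `enterThreshold` and `exitThreshold` option names, the `push` method, and the default of 2 for `speechFrames` are all assumptions. Only the two threshold defaults and the hangover default of 3 come from the text above.

```typescript
// Stateless primitive: RMS of a PCM16 frame, normalised to the 0..1 range.
// The signature is an assumption; the real frameRms may take a Buffer.
function sketchFrameRms(frame: Int16Array): number {
  if (frame.length === 0) return 0
  let sum = 0
  for (let i = 0; i < frame.length; i++) {
    const x = frame[i] / 32768
    sum += x * x
  }
  return Math.sqrt(sum / frame.length)
}

interface SketchVadOptions {
  enterThreshold?: number // hypothetical option name
  exitThreshold?: number // hypothetical option name
  speechFrames?: number
  hangoverFrames?: number
}

class SketchVad {
  private speaking = false
  private run = 0 // consecutive frames confirming a flip
  private readonly enter: number
  private readonly exit: number
  private readonly speechFrames: number
  private readonly hangoverFrames: number

  constructor(opts: SketchVadOptions = {}) {
    this.enter = opts.enterThreshold ?? 0.025
    this.exit = opts.exitThreshold ?? 0.015
    this.speechFrames = opts.speechFrames ?? 2 // assumed default
    this.hangoverFrames = opts.hangoverFrames ?? 3
    // Threshold-ordering guard: hysteresis needs exit <= enter.
    if (this.exit > this.enter) {
      throw new RangeError('exitThreshold must not exceed enterThreshold')
    }
  }

  // Feed one frame; returns true while speech is declared.
  push(frame: Int16Array): boolean {
    const rms = sketchFrameRms(frame)
    if (this.speaking) {
      // Sustain while at or above exit; count consecutive frames below it.
      this.run = rms < this.exit ? this.run + 1 : 0
      if (this.run >= this.hangoverFrames) {
        this.speaking = false
        this.run = 0
      }
    } else {
      // Count consecutive frames at or above enter before declaring speech.
      this.run = rms >= this.enter ? this.run + 1 : 0
      if (this.run >= this.speechFrames) {
        this.speaking = true
        this.run = 0
      }
    }
    return this.speaking
  }
}
```

With `speechFrames: 2`, a single loud frame followed by silence never flips the detector. With `speechFrames: 1`, as the orchestrator is described as using, onset is immediate and only the hangover filters short dips.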
If the VAD confirms speech while the machine is in THINK or SPEAK, the orchestrator aborts both the LLM and TTS AbortControllers, resets the STT and TTS adapters, emits a barge-in control message, and drops to LISTEN. The abort propagates through the fetch body reader and every for await loop, so no orphaned stream keeps talking over the user. Confirmation is the point: because the VAD requires sustained energy rather than a single frame, room noise during output does not abort the turn, while a real interruption still cuts in within a frame or two. This path is covered by an end-to-end test that interrupts a long-running turn mid-stream and asserts the machine returns to LISTEN.
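The abort path described above can be sketched with two `AbortController` instances and a stream that checks its signal. This is a simplified illustration under stated assumptions: `TurnSketch`, `fakeStream`, and `demoBargeIn` are hypothetical names, and the adapter resets and the barge-in control message are left out.

```typescript
type DuplexState = 'LISTEN' | 'THINK' | 'SPEAK'

// Hypothetical stand-in for a streaming LLM or TTS: yields until aborted.
async function* fakeStream(signal: AbortSignal): AsyncGenerator<string> {
  for (let i = 0; ; i++) {
    if (signal.aborted) return
    yield `chunk-${i}`
    await new Promise<void>((resolve) => setTimeout(resolve, 1))
  }
}

class TurnSketch {
  state: DuplexState
  readonly llmAbort = new AbortController()
  readonly ttsAbort = new AbortController()

  constructor(initial: DuplexState) {
    this.state = initial
  }

  // Called when the VAD confirms speech. Returns true if a turn was cancelled.
  onSpeechConfirmed(): boolean {
    if (this.state !== 'THINK' && this.state !== 'SPEAK') return false
    this.llmAbort.abort()
    this.ttsAbort.abort()
    this.state = 'LISTEN'
    return true
  }
}

// Consumes a stream and interrupts it after the third chunk.
async function demoBargeIn(): Promise<{ consumed: number; state: DuplexState }> {
  const turn = new TurnSketch('SPEAK')
  let consumed = 0
  for await (const _chunk of fakeStream(turn.llmAbort.signal)) {
    consumed++
    if (consumed === 3) turn.onSpeechConfirmed()
  }
  return { consumed, state: turn.state }
}
```

Because the generator checks `signal.aborted` before each yield, the `for await` loop ends on its own once the turn is cancelled, with no separate queue to drain.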
The orchestrator advertises the registered tool definitions on every LLM call. The shared SSE reader assembles fragmented tool_calls deltas (the model streams the function name first, then the JSON arguments a few characters at a time) into one complete call per index. When the model requests a tool, the orchestrator runs the handler from the ToolRegistry, appends an assistant turn recording the request and a tool turn carrying the result, and re-streams. Tool rounds are bounded by maxToolRounds (default 3), and handler errors are returned to the model as the tool result rather than crashing the session. This path is covered by an end-to-end test that scripts a tool-call turn followed by a grounded answer.
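The delta assembly step can be illustrated as follows. This is a sketch, not the shared SSE reader itself: it assumes the OpenAI-compatible streaming shape in which each `tool_calls` delta carries an `index` plus optional `id`, `function.name`, and `function.arguments` fragments. The `ToolCallDelta`, `AssembledCall`, and `assembleToolCalls` names are made up for the example.

```typescript
// Assumed shape of one streamed tool_calls delta (OpenAI-compatible wire format).
interface ToolCallDelta {
  index: number
  id?: string
  function?: { name?: string; arguments?: string }
}

interface AssembledCall {
  id: string
  name: string
  arguments: string // raw JSON text, complete only after the last fragment
}

// Concatenates fragments per index into one complete call per index.
function assembleToolCalls(deltas: ToolCallDelta[]): AssembledCall[] {
  const byIndex = new Map<number, AssembledCall>()
  for (const delta of deltas) {
    let call = byIndex.get(delta.index)
    if (!call) {
      call = { id: '', name: '', arguments: '' }
      byIndex.set(delta.index, call)
    }
    if (delta.id) call.id = delta.id
    if (delta.function?.name) call.name += delta.function.name
    if (delta.function?.arguments) call.arguments += delta.function.arguments
  }
  return Array.from(byIndex.entries())
    .sort((a, b) => a[0] - b[0])
    .map((entry) => entry[1])
}

// Example: the name arrives first, then the arguments in small pieces.
const assembled = assembleToolCalls([
  { index: 0, id: 'call_1', function: { name: 'get_time', arguments: '' } },
  { index: 0, function: { arguments: '{"tz"' } },
  { index: 0, function: { arguments: ':"UTC"}' } },
])
const parsedArgs = JSON.parse(assembled[0].arguments) as { tz: string }
```

The arguments string is only valid JSON once every fragment for that index has arrived, so parsing is deferred until the stream signals the call is complete.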
- Plain WebSocket transport rather than a WebRTC SFU. The orchestrator is transport-agnostic, so a mediasoup or LiveKit edge is a swap that the pipeline never sees. The starter keeps the transport simple on purpose.
- Self-hosted defaults (Groq, Whisper.cpp, OpenTTS) so the stack runs end-to-end with no per-minute provider fees. Hosted providers are one environment variable away per layer.
- Groq Llama 4 as the default LLM because Groq's LPU inference stack targets a sub-300ms first token, and time to first token is the bottleneck for perceived responsiveness in voice.
- Pluggable adapters because every team disagrees on which provider is best. The orchestrator does not care which one you pick.
- Barge-in via AbortController rather than queue cancellation because abort signals propagate cleanly through `fetch` streams and `for await` loops.
- Injectable adapters and a `Sink` seam so the entire pipeline is testable without a live socket or provider keys.
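As an illustration of the last point, a scripted adapter that satisfies the `LlmAdapter` interface can stand in for a real provider in tests. The sketch below is hypothetical: `ScriptedLlm` and `collectText` are invented names, and the `ChatMessage`, `ToolDefinition`, and `ToolCall` shapes are minimal placeholders because their real definitions are not shown here. The `Sink` seam is not sketched for the same reason.

```typescript
// Placeholder shapes; the project's real types are not shown in this document.
type ChatMessage = { role: string; content: string }
type ToolDefinition = { name: string }
type ToolCall = { name: string; arguments: string }
type LlmEvent = { type: 'token'; text: string } | { type: 'tool_call'; call: ToolCall }

interface LlmAdapter {
  readonly id: string
  stream(opts: {
    messages: ChatMessage[]
    signal: AbortSignal
    tools?: ToolDefinition[]
  }): AsyncGenerator<LlmEvent>
}

// Replays a fixed list of events; needs no network access or API key.
class ScriptedLlm implements LlmAdapter {
  readonly id = 'scripted'
  private readonly script: LlmEvent[]

  constructor(script: LlmEvent[]) {
    this.script = script
  }

  async *stream(opts: {
    messages: ChatMessage[]
    signal: AbortSignal
    tools?: ToolDefinition[]
  }): AsyncGenerator<LlmEvent> {
    for (const event of this.script) {
      if (opts.signal.aborted) return // honour barge-in, as a real adapter would
      yield event
    }
  }
}

// Test helper: concatenates the token text an adapter produces.
async function collectText(llm: LlmAdapter, signal: AbortSignal): Promise<string> {
  let text = ''
  for await (const event of llm.stream({ messages: [], signal })) {
    if (event.type === 'token') text += event.text
  }
  return text
}
```

A test can pass `new ScriptedLlm([...])` wherever the orchestrator expects an `LlmAdapter`, then assert on the collected output or on how the pipeline reacts to a scripted `tool_call` event.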