Architecture

sarmakska edited this page Jun 7, 2026 · 3 revisions

Pipeline sequence

sequenceDiagram
  participant U as User browser
  participant SRV as Fastify / WebSocket
  participant V as VAD
  participant S as Streaming STT
  participant L as Streaming LLM
  participant TO as Tool registry
  participant T as Streaming TTS

  U->>SRV: PCM16 frames (16kHz mono, base64)
  SRV->>V: each frame
  V->>V: detect voice activity
  V->>S: forward voice frames
  S-->>SRV: partial transcripts (live)
  Note over S: trailing silence -> flush -> final transcript
  S->>L: final transcript + tool definitions
  L-->>T: first token (after first token, state -> SPEAK)
  L-->>TO: tool_call (if requested)
  TO-->>L: tool result (re-stream)
  T-->>SRV: PCM audio chunks (per sentence)
  SRV-->>U: audio playback
  Note over U,SRV: total latency P50 ~800ms self-hosted
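
The first arrow in the diagram carries base64-encoded PCM16 frames. A minimal sketch of how a client might produce one such frame from Float32 microphone samples follows. The `floatToPcm16`, `toBase64`, and `encodeFrame` helpers and the `{ type, data }` envelope are assumptions for illustration; the wire format the server actually expects is not specified here.

```typescript
// Convert Float32 samples in [-1, 1] to little-endian PCM16 bytes.
function floatToPcm16(samples: Float32Array): Uint8Array {
  const out = new Uint8Array(samples.length * 2)
  const view = new DataView(out.buffer)
  for (let i = 0; i < samples.length; i++) {
    // Clamp first so out-of-range input cannot wrap around.
    const s = Math.max(-1, Math.min(1, samples[i]))
    // Scale asymmetrically: -1 maps to -32768 and +1 maps to 32767.
    view.setInt16(i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true)
  }
  return out
}

// Base64-encode raw bytes using the standard btoa global.
function toBase64(bytes: Uint8Array): string {
  let binary = ''
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i])
  return btoa(binary)
}

// Hypothetical message envelope; the field names are assumptions.
function encodeFrame(samples: Float32Array): string {
  return JSON.stringify({ type: 'audio', data: toBase64(floatToPcm16(samples)) })
}
```

A client would call `encodeFrame` once per captured frame and send the resulting string over the socket.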

State machine

stateDiagram-v2
  [*] --> IDLE
  IDLE --> LISTEN: voice detected
  LISTEN --> THINK: final transcript
  THINK --> SPEAK: first LLM token
  SPEAK --> IDLE: turn complete
  LISTEN --> IDLE: silence, no transcript
  THINK --> LISTEN: barge-in (cancel LLM + TTS)
  SPEAK --> LISTEN: barge-in (cancel LLM + TTS)
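
The diagram above maps directly onto a transition table. The sketch below is one way to express it, not the orchestrator's actual code; the event names (`voice`, `finalTranscript`, and so on) are hypothetical labels for the diagram's edge captions.

```typescript
type State = 'IDLE' | 'LISTEN' | 'THINK' | 'SPEAK'

// Hypothetical event names, one per edge label in the state diagram.
type TurnEvent =
  | 'voice'
  | 'finalTranscript'
  | 'silenceNoTranscript'
  | 'firstToken'
  | 'turnComplete'
  | 'bargeIn'

// Every edge in the diagram appears exactly once in this table.
const transitions: Record<State, Partial<Record<TurnEvent, State>>> = {
  IDLE: { voice: 'LISTEN' },
  LISTEN: { finalTranscript: 'THINK', silenceNoTranscript: 'IDLE' },
  THINK: { firstToken: 'SPEAK', bargeIn: 'LISTEN' },
  SPEAK: { turnComplete: 'IDLE', bargeIn: 'LISTEN' },
}

// Events with no edge from the current state leave the state unchanged.
function next(state: State, event: TurnEvent): State {
  return transitions[state][event] ?? state
}
```

Keeping the edges in data rather than in scattered conditionals makes it easy to check the code against the diagram line by line.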

Components

| File | Responsibility |
| --- | --- |
| apps/server/src/index.ts | Fastify server, /health and /voice WebSocket, message dispatch |
| apps/server/src/pipeline/orchestrator.ts | Duplex state machine, barge-in, function-call passthrough |
| apps/server/src/pipeline/tools.ts | Tool registry and default tools |
| apps/server/src/pipeline/vad.ts | Stateful VAD with hysteresis and hangover, plus the stateless RMS primitive |
| apps/server/src/adapters/audio.ts | PCM/WAV conversion and sentence splitting |
| apps/server/src/adapters/llm/sse.ts | OpenAI-compatible SSE reader and wire mapping |
| apps/server/src/adapters/stt/*.ts | Streaming STT adapters (Whisper.cpp, Deepgram, OpenAI Whisper) |
| apps/server/src/adapters/llm/*.ts | Streaming LLM adapters (Groq, SarmaLink-AI, OpenAI) |
| apps/server/src/adapters/tts/*.ts | Chunked TTS adapters (OpenTTS, Cartesia, ElevenLabs) |
| apps/web/app/page.tsx | Browser client with microphone capture |

Adapter interfaces

interface SttAdapter {
  readonly id: string
  feed(pcm: Buffer): Promise<{ text: string; final: boolean } | null>
  flush(): Promise<{ text: string; final: boolean } | null>
  reset(): void
}

interface LlmAdapter {
  readonly id: string
  stream(opts: { messages: ChatMessage[]; signal: AbortSignal; tools?: ToolDefinition[] })
    : AsyncGenerator<{ type: 'token'; text: string } | { type: 'tool_call'; call: ToolCall }>
}

interface TtsAdapter {
  readonly id: string
  feed(text: string): void
  stream(opts: { signal: AbortSignal }): AsyncGenerator<Buffer>
  end(): void
  reset(): void
}
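
To illustrate how small an adapter can be, here is a sketch of a scripted `LlmAdapter` of the kind a test might inject. The `ChatMessage`, `ToolDefinition`, and `ToolCall` shapes are assumed minimal versions (the project's real types may carry more fields), and `ScriptedLlm` and `collectText` are hypothetical names.

```typescript
// Assumed minimal shapes; the real project types may differ.
type ChatMessage = { role: 'system' | 'user' | 'assistant' | 'tool'; content: string }
type ToolDefinition = { name: string; description: string; parameters: object }
type ToolCall = { id: string; name: string; arguments: string }
type LlmEvent = { type: 'token'; text: string } | { type: 'tool_call'; call: ToolCall }

interface LlmAdapter {
  readonly id: string
  stream(opts: { messages: ChatMessage[]; signal: AbortSignal; tools?: ToolDefinition[] }): AsyncGenerator<LlmEvent>
}

// Hypothetical test double: replays a fixed script and stops once aborted.
class ScriptedLlm implements LlmAdapter {
  readonly id = 'scripted'
  private readonly script: LlmEvent[]

  constructor(script: LlmEvent[]) {
    this.script = script
  }

  async *stream(opts: { messages: ChatMessage[]; signal: AbortSignal; tools?: ToolDefinition[] }): AsyncGenerator<LlmEvent> {
    for (const event of this.script) {
      // Honour cancellation between events, as a real adapter would between chunks.
      if (opts.signal.aborted) return
      yield event
    }
  }
}

// Concatenate the token events of one streamed reply, ignoring tool calls.
async function collectText(llm: LlmAdapter, messages: ChatMessage[]): Promise<string> {
  const controller = new AbortController()
  let text = ''
  for await (const event of llm.stream({ messages, signal: controller.signal })) {
    if (event.type === 'token') text += event.text
  }
  return text
}
```

Because the consumer only depends on the interface, the same `collectText` loop would work unchanged against a provider-backed adapter.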

Voice activity detection

The per-session detector lives in vad.ts. There are two layers. frameRms (and the convenience detectVoice) is the stateless energy primitive: root-mean-square of a PCM16 frame against a threshold, exact and side-effect free. Vad is the stateful detector the orchestrator actually runs, and it adds the two techniques a single energy gate is missing for full-duplex use:

  • Hysteresis. The enter threshold (default 0.025) is higher than the exit threshold (default 0.015). Energy that hovers near one boundary cannot rattle the state back and forth frame by frame: once speaking, the level only has to stay above the lower exit threshold to sustain.
  • Hangover. A run of confirming frames is required before the decision flips: speechFrames frames above enter to declare speech, hangoverFrames frames below exit to declare silence. A lone loud transient (a door, a keyboard tap) never declares speech, and a brief intra-word dip never ends the utterance.

The orchestrator constructs one Vad per session with speechFrames: 1 for a fast onset, leaving the hangover (default 3 frames) to absorb transients during output. Tune any of the four parameters through the orchestrator's vad option. Eight unit tests pin the onset run, transient rejection, the hangover hold, the hysteresis band, and the threshold-ordering guard.
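
The two layers described above can be sketched as follows. This is an illustrative reimplementation, not the contents of vad.ts: the normalisation of RMS to the 0..1 range, the `push` method name, the option names other than `speechFrames` and `hangoverFrames`, and the `speechFrames` default of 2 are all assumptions. The enter, exit, and hangover defaults are the ones quoted above.

```typescript
// Stateless primitive: RMS of a PCM16 frame, normalised to 0..1 (assumed scale).
function frameRms(frame: Int16Array): number {
  if (frame.length === 0) return 0
  let sum = 0
  for (let i = 0; i < frame.length; i++) {
    const s = frame[i] / 32768
    sum += s * s
  }
  return Math.sqrt(sum / frame.length)
}

interface VadOptions {
  enter: number          // level required to start counting towards speech
  exit: number           // level below which frames count towards silence
  speechFrames: number   // confirming frames needed to declare speech
  hangoverFrames: number // confirming frames needed to declare silence
}

class Vad {
  private readonly opts: VadOptions
  private speaking = false
  private run = 0 // consecutive frames arguing against the current state

  constructor(opts: Partial<VadOptions> = {}) {
    // speechFrames: 2 is an assumed default; the others are quoted in the text.
    this.opts = { enter: 0.025, exit: 0.015, speechFrames: 2, hangoverFrames: 3, ...opts }
    // Threshold-ordering guard: hysteresis needs exit strictly below enter.
    if (this.opts.exit >= this.opts.enter) {
      throw new RangeError('exit threshold must be below enter threshold')
    }
  }

  // Feed one frame; returns true while the user is considered to be speaking.
  push(frame: Int16Array): boolean {
    const rms = frameRms(frame)
    if (!this.speaking) {
      this.run = rms >= this.opts.enter ? this.run + 1 : 0
      if (this.run >= this.opts.speechFrames) {
        this.speaking = true
        this.run = 0
      }
    } else {
      // Levels inside the band (between exit and enter) sustain speech.
      this.run = rms < this.opts.exit ? this.run + 1 : 0
      if (this.run >= this.opts.hangoverFrames) {
        this.speaking = false
        this.run = 0
      }
    }
    return this.speaking
  }
}
```

A single counter is enough because only one kind of evidence matters at a time: frames above enter while silent, frames below exit while speaking.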

Barge-in

If the VAD confirms speech while the machine is in THINK or SPEAK, the orchestrator aborts both the LLM and TTS AbortControllers, resets the STT and TTS adapters, emits a barge-in control message, and drops to LISTEN. The abort propagates through the fetch body reader and every for await loop, so no orphaned stream keeps talking over the user. Confirmation is the point: because the VAD requires sustained energy rather than a single frame, room noise during output does not abort the turn, while a real interruption still cuts in within a frame or two. This path is covered by an end-to-end test that interrupts a long-running turn mid-stream and asserts the machine returns to LISTEN.
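
The abort propagation can be demonstrated in isolation. The sketch below is a simplified stand-in for the orchestrator's barge-in path: `tokens` plays the role of a streaming provider, and the `Turn` class, its fields, and `runTurn` are hypothetical names. It omits the adapter resets and the control message mentioned above.

```typescript
// Hypothetical stand-in for a streaming provider: yields until aborted.
async function* tokens(signal: AbortSignal): AsyncGenerator<string> {
  for (let i = 0; !signal.aborted; i++) {
    yield `token-${i}`
    // Simulate network pacing between chunks.
    await new Promise((resolve) => setTimeout(resolve, 1))
  }
}

class Turn {
  readonly llm = new AbortController()
  readonly tts = new AbortController()
  state: 'THINK' | 'SPEAK' | 'LISTEN' = 'THINK'

  // Called when the VAD confirms speech during THINK or SPEAK.
  bargeIn(): void {
    this.llm.abort()
    this.tts.abort()
    this.state = 'LISTEN'
  }
}

// Consume the stream and simulate the user cutting in after N tokens.
async function runTurn(turn: Turn, interruptAfter: number): Promise<string[]> {
  const spoken: string[] = []
  for await (const token of tokens(turn.llm.signal)) {
    if (spoken.length === 0) turn.state = 'SPEAK' // first token moves THINK -> SPEAK
    spoken.push(token)
    if (spoken.length === interruptAfter) turn.bargeIn()
  }
  // The loop ends on its own: the generator sees the aborted signal and returns.
  return spoken
}
```

Note that the consumer never breaks out of the loop explicitly. The producer observes the aborted signal and finishes, which is what lets the same mechanism reach through nested streams.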

Function-call passthrough

The orchestrator advertises the registered tool definitions on every LLM call. The shared SSE reader assembles fragmented tool_calls deltas (the model streams the function name first, then the JSON arguments a few characters at a time) into one complete call per index. When the model requests a tool, the orchestrator runs the handler from the ToolRegistry, appends an assistant turn recording the request and a tool turn carrying the result, and re-streams. Tool rounds are bounded by maxToolRounds (default 3), and handler errors are returned to the model as the tool result rather than crashing the session. This path is covered by an end-to-end test that scripts a tool-call turn followed by a grounded answer.
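
The delta assembly step can be sketched as below. The delta shape follows the OpenAI-compatible streaming format, where each chunk carries an `index` plus optional `id`, `function.name`, and `function.arguments` fragments. The `ToolCallAssembler` class and the flattened `ToolCall` shape are assumptions for illustration, not the shared SSE reader's actual code.

```typescript
// One streamed fragment of a tool call (OpenAI-compatible wire shape).
type ToolCallDelta = {
  index: number
  id?: string
  function?: { name?: string; arguments?: string }
}

// Assumed flattened result shape; arguments stay a raw JSON string.
type ToolCall = { id: string; name: string; arguments: string }

class ToolCallAssembler {
  private readonly calls = new Map<number, ToolCall>()

  // Merge the tool_calls array from one stream chunk into the running state.
  add(deltas: ToolCallDelta[]): void {
    for (const d of deltas) {
      const call = this.calls.get(d.index) ?? { id: '', name: '', arguments: '' }
      if (d.id) call.id = d.id
      if (d.function?.name) call.name += d.function.name
      // Arguments arrive a few characters at a time; append in order.
      if (d.function?.arguments) call.arguments += d.function.arguments
      this.calls.set(d.index, call)
    }
  }

  // Complete calls ordered by index; call this once the stream has finished.
  finish(): ToolCall[] {
    return Array.from(this.calls.entries())
      .sort((a, b) => a[0] - b[0])
      .map(([, call]) => call)
  }
}
```

Parsing `arguments` as JSON is deliberately deferred until `finish`, since any intermediate fragment is usually not valid JSON on its own.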

Why each piece

  • Plain WebSocket transport rather than a WebRTC SFU. The orchestrator is transport-agnostic, so a mediasoup or LiveKit edge is a swap that the pipeline never sees. The starter keeps the transport simple on purpose.
  • Low-cost defaults (Whisper.cpp and OpenTTS self-hosted, Groq for the LLM) so the stack runs end-to-end with no per-minute provider fees. Other hosted providers are one environment variable away per layer.
  • Groq Llama 4 as the default LLM because Groq's LPU inference stack makes a sub-300ms first-token target realistic, and time to first token is the main bottleneck in how responsive a voice turn feels.
  • Pluggable adapters because every team disagrees on which provider is best. The orchestrator does not care which one you pick.
  • Barge-in via AbortController rather than queue cancellation because abort signals propagate cleanly through fetch streams and for await loops.
  • Injectable adapters and a Sink seam so the entire pipeline is testable without a live socket or provider keys.
