Architecture

sarmakska edited this page May 31, 2026 · 3 revisions

Pipeline sequence

sequenceDiagram
  participant U as User browser
  participant SRV as Fastify / WebSocket
  participant V as VAD
  participant S as Streaming STT
  participant L as Streaming LLM
  participant TO as Tool registry
  participant T as Streaming TTS

  U->>SRV: PCM16 frames (16kHz mono, base64)
  SRV->>V: each frame
  V->>V: detect voice activity
  V->>S: forward voice frames
  S-->>SRV: partial transcripts (live)
  Note over S: trailing silence -> flush -> final transcript
  S->>L: final transcript + tool definitions
  L-->>T: first token (after first token, state -> SPEAK)
  L-->>TO: tool_call (if requested)
  TO-->>L: tool result (re-stream)
  T-->>SRV: PCM audio chunks (per sentence)
  SRV-->>U: audio playback
  Note over U,SRV: total latency P50 ~800ms self-hosted
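
The VAD step in the sequence above is described in the Components table as RMS-threshold detection. A minimal sketch of that idea follows, assuming frames arrive as already-decoded 16-bit samples. The names `rms` and `RmsVad`, the threshold of 0.02, and the hangover of 10 frames are illustrative assumptions, not the project's actual values.

```typescript
// Root-mean-square energy of one PCM16 frame, normalised to the range [0, 1].
function rms(frame: Int16Array): number {
  if (frame.length === 0) return 0
  const sumSquares = frame.reduce((acc, sample) => {
    const s = sample / 32768
    return acc + s * s
  }, 0)
  return Math.sqrt(sumSquares / frame.length)
}

// Hypothetical VAD: a frame counts as speech when its RMS meets the threshold.
// A short "hangover" keeps the utterance open across brief pauses.
class RmsVad {
  private readonly threshold: number
  private readonly hangoverFrames: number
  private silentFrames = 0
  private speaking = false

  constructor(threshold = 0.02, hangoverFrames = 10) {
    this.threshold = threshold
    this.hangoverFrames = hangoverFrames
  }

  /** Returns true while the frame is considered part of an utterance. */
  process(frame: Int16Array): boolean {
    if (rms(frame) >= this.threshold) {
      this.speaking = true
      this.silentFrames = 0
    } else if (this.speaking && ++this.silentFrames > this.hangoverFrames) {
      this.speaking = false
    }
    return this.speaking
  }
}
```

A fixed threshold is cheap and predictable, but it is sensitive to microphone gain and background noise, so a real deployment would likely need to tune it per environment.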

State machine

stateDiagram-v2
  [*] --> IDLE
  IDLE --> LISTEN: voice detected
  LISTEN --> THINK: final transcript
  THINK --> SPEAK: first LLM token
  SPEAK --> IDLE: turn complete
  LISTEN --> IDLE: silence, no transcript
  THINK --> LISTEN: barge-in (cancel LLM + TTS)
  SPEAK --> LISTEN: barge-in (cancel LLM + TTS)
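
The transitions in the diagram can be written down as a lookup table. This is only a sketch: the event names (`voice_detected`, `barge_in`, and so on) and the `next` helper are hypothetical labels derived from the diagram's edge text, not identifiers from orchestrator.ts.

```typescript
type State = 'IDLE' | 'LISTEN' | 'THINK' | 'SPEAK'

// Hypothetical event names, one per edge label in the state diagram.
type TurnEvent =
  | 'voice_detected'
  | 'final_transcript'
  | 'first_llm_token'
  | 'turn_complete'
  | 'silence_no_transcript'
  | 'barge_in'

// Each state lists only the events it reacts to.
const transitions: Record<State, Partial<Record<TurnEvent, State>>> = {
  IDLE: { voice_detected: 'LISTEN' },
  LISTEN: { final_transcript: 'THINK', silence_no_transcript: 'IDLE' },
  THINK: { first_llm_token: 'SPEAK', barge_in: 'LISTEN' },
  SPEAK: { turn_complete: 'IDLE', barge_in: 'LISTEN' },
}

// Events that are not valid in the current state leave it unchanged.
function next(state: State, event: TurnEvent): State {
  return transitions[state][event] ?? state
}
```

Keeping the table as data makes it easy to see that barge-in is only meaningful in THINK and SPEAK: in IDLE or LISTEN the same event is a no-op.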

Components

| File | Responsibility |
| --- | --- |
| `apps/server/src/index.ts` | Fastify server, /health and /voice WebSocket, message dispatch |
| `apps/server/src/pipeline/orchestrator.ts` | Duplex state machine, barge-in, function-call passthrough |
| `apps/server/src/pipeline/tools.ts` | Tool registry and default tools |
| `apps/server/src/pipeline/vad.ts` | RMS-threshold voice activity detection |
| `apps/server/src/adapters/audio.ts` | PCM/WAV conversion and sentence splitting |
| `apps/server/src/adapters/llm/sse.ts` | OpenAI-compatible SSE reader and wire mapping |
| `apps/server/src/adapters/stt/*.ts` | Streaming STT adapters (Whisper.cpp, Deepgram, OpenAI Whisper) |
| `apps/server/src/adapters/llm/*.ts` | Streaming LLM adapters (Groq, SarmaLink-AI, OpenAI) |
| `apps/server/src/adapters/tts/*.ts` | Chunked TTS adapters (OpenTTS, Cartesia, ElevenLabs) |
| `apps/web/app/page.tsx` | Browser client with microphone capture |
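
The sentence splitting listed for audio.ts is what lets TTS start on the first sentence while the LLM is still streaming. The sketch below shows one naive way to do it. `SentenceBuffer` is a hypothetical name, and the boundary rule (terminal punctuation followed by whitespace) is an assumption rather than the project's actual rule.

```typescript
// Accumulates streamed LLM text and releases it one complete sentence at a time.
class SentenceBuffer {
  private pending = ''

  /** Append streamed text and return any sentences completed so far. */
  push(text: string): string[] {
    this.pending += text
    const sentences: string[] = []
    // Naive boundary: ., ! or ? followed by whitespace.
    const boundary = /[.!?]+\s+/g
    let start = 0
    while (boundary.exec(this.pending) !== null) {
      const end = boundary.lastIndex // index just past the matched boundary
      sentences.push(this.pending.slice(start, end).trim())
      start = end
    }
    this.pending = this.pending.slice(start)
    return sentences
  }

  /** Return whatever is left when the stream ends, or null if nothing remains. */
  flush(): string | null {
    const rest = this.pending.trim()
    this.pending = ''
    return rest.length > 0 ? rest : null
  }
}
```

Because the rule waits for whitespace after the punctuation, the final sentence of a turn is only released by `flush()`. It will also split on abbreviations such as "e.g. ", which a production splitter would need to handle.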

Adapter interfaces

interface SttAdapter {
  readonly id: string
  feed(pcm: Buffer): Promise<{ text: string; final: boolean } | null>
  flush(): Promise<{ text: string; final: boolean } | null>
  reset(): void
}

interface LlmAdapter {
  readonly id: string
  stream(opts: { messages: ChatMessage[]; signal: AbortSignal; tools?: ToolDefinition[] })
    : AsyncGenerator<{ type: 'token'; text: string } | { type: 'tool_call'; call: ToolCall }>
}

interface TtsAdapter {
  readonly id: string
  feed(text: string): void
  stream(opts: { signal: AbortSignal }): AsyncGenerator<Buffer>
  end(): void
  reset(): void
}
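
Because the orchestrator depends only on these interfaces, a test can substitute an adapter that replays a fixed script. The sketch below implements `LlmAdapter` that way. The `ChatMessage`, `ToolDefinition`, and `ToolCall` shapes are placeholders, since their real definitions are not shown on this page, and `ScriptedLlm` and `collectText` are hypothetical names.

```typescript
// Placeholder shapes: the real types are not shown on this page.
type ChatMessage = { role: string; content: string }
type ToolDefinition = { name: string }
type ToolCall = { name: string; arguments: string }
type LlmEvent = { type: 'token'; text: string } | { type: 'tool_call'; call: ToolCall }
type StreamOptions = { messages: ChatMessage[]; signal: AbortSignal; tools?: ToolDefinition[] }

interface LlmAdapter {
  readonly id: string
  stream(opts: StreamOptions): AsyncGenerator<LlmEvent>
}

// Replays a fixed list of events and stops early if the signal is aborted.
class ScriptedLlm implements LlmAdapter {
  readonly id = 'scripted'
  private readonly script: LlmEvent[]

  constructor(script: LlmEvent[]) {
    this.script = script
  }

  async *stream(opts: StreamOptions): AsyncGenerator<LlmEvent> {
    for (const event of this.script) {
      if (opts.signal.aborted) return
      yield event
    }
  }
}

// Example consumer: concatenates token events and ignores tool calls.
async function collectText(llm: LlmAdapter, signal: AbortSignal): Promise<string> {
  let text = ''
  for await (const event of llm.stream({ messages: [], signal })) {
    if (event.type === 'token') text += event.text
  }
  return text
}
```

A scripted adapter like this needs no network access or provider key, which is what makes the orchestrator testable in isolation.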

Barge-in

If the VAD detects speech while the machine is in THINK or SPEAK, the orchestrator aborts both the LLM and TTS AbortControllers, resets the STT and TTS adapters, emits a barge-in control message, and drops to LISTEN. The abort propagates through the fetch body reader and every for await loop, so no orphaned stream keeps talking over the user. This path is covered by an end-to-end test that interrupts a long-running turn mid-stream and asserts the machine returns to LISTEN.
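
The abort path described above can be sketched with a stand-in stream. This is a simplified illustration under stated assumptions: `slowTokens` and `Turn` are hypothetical names, a single generator stands in for both the LLM and TTS streams, and the adapter resets and the barge-in control message are omitted.

```typescript
type State = 'IDLE' | 'LISTEN' | 'THINK' | 'SPEAK'

// Hypothetical stand-in for a streaming provider: yields a token every few
// milliseconds and stops as soon as the signal is aborted.
async function* slowTokens(signal: AbortSignal): AsyncGenerator<string> {
  for (let i = 0; !signal.aborted; i++) {
    await new Promise<void>((resolve) => setTimeout(resolve, 5))
    if (signal.aborted) return
    yield `token-${i}`
  }
}

class Turn {
  state: State = 'THINK'
  readonly spoken: string[] = []
  private readonly llm = new AbortController()
  private readonly tts = new AbortController()

  // Consumes the stream until it ends or either controller is aborted.
  async run(): Promise<void> {
    for await (const token of slowTokens(this.llm.signal)) {
      if (this.tts.signal.aborted) break
      if (this.state === 'THINK') this.state = 'SPEAK' // first token starts speech
      this.spoken.push(token)
    }
  }

  // Called when the VAD detects speech during THINK or SPEAK.
  bargeIn(): void {
    if (this.state !== 'THINK' && this.state !== 'SPEAK') return
    this.llm.abort()
    this.tts.abort()
    this.state = 'LISTEN'
  }
}
```

The generator checks the signal after every await, so once `bargeIn()` runs, the `for await` loop ends on its own and nothing more is pushed to `spoken`.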

Function-call passthrough

The orchestrator advertises the registered tool definitions on every LLM call. The shared SSE reader assembles fragmented tool_calls deltas (the model streams the function name first, then the JSON arguments a few characters at a time) into one complete call per index. When the model requests a tool, the orchestrator runs the handler from the ToolRegistry, appends an assistant turn recording the request and a tool turn carrying the result, and re-streams. Tool rounds are bounded by maxToolRounds (default 3), and handler errors are returned to the model as the tool result rather than crashing the session. This path is covered by an end-to-end test that scripts a tool-call turn followed by a grounded answer.
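
The fragment assembly described above can be sketched as follows. The delta shape follows the OpenAI-compatible streaming format, where each fragment carries an `index` and optional `id`, `function.name`, and `function.arguments` pieces. `ToolCallAssembler` and `AssembledCall` are hypothetical names, not the identifiers used in sse.ts.

```typescript
// One streamed fragment of a tool call in the OpenAI-compatible delta format.
interface ToolCallDelta {
  index: number
  id?: string
  function?: { name?: string; arguments?: string }
}

interface AssembledCall {
  id: string
  name: string
  arguments: string // raw JSON text, parsed only once the call is complete
}

// Merges fragments that share an index into one complete call per index.
class ToolCallAssembler {
  private readonly calls = new Map<number, AssembledCall>()

  add(delta: ToolCallDelta): void {
    const call = this.calls.get(delta.index) ?? { id: '', name: '', arguments: '' }
    if (delta.id) call.id = delta.id
    if (delta.function?.name) call.name += delta.function.name
    if (delta.function?.arguments) call.arguments += delta.function.arguments
    this.calls.set(delta.index, call)
  }

  /** Completed calls, ordered by their stream index. */
  finish(): AssembledCall[] {
    return Array.from(this.calls.entries())
      .sort((a, b) => a[0] - b[0])
      .map(([, call]) => call)
  }
}
```

The arguments string is only valid JSON after the last fragment arrives, so parsing belongs after `finish()`, not inside `add()`.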

Why each piece

  • Plain WebSocket transport rather than a WebRTC SFU. The orchestrator is transport-agnostic, so a mediasoup or LiveKit edge is a swap that the pipeline never sees. The starter keeps the transport simple on purpose.
  • Low-cost defaults (Groq, Whisper.cpp, OpenTTS) so the stack runs end-to-end with no per-minute provider fees. Whisper.cpp and OpenTTS are self-hosted; Groq is a hosted API. Other hosted providers are one environment variable away per layer.
  • Groq Llama 4 as the default LLM because Groq's LPU inference stack targets a sub-300ms first token, and time to first token is the bottleneck for perceived voice latency.
  • Pluggable adapters because every team disagrees on which provider is best. The orchestrator does not care which one you pick.
  • Barge-in via AbortController rather than queue cancellation because abort signals propagate cleanly through fetch streams and for await loops.
  • Injectable adapters and a Sink seam so the entire pipeline is testable without a live socket or provider keys.
