Architecture

Pipeline sequence

sequenceDiagram
  participant U as User browser
  participant SRV as Fastify / WebSocket
  participant V as VAD
  participant S as Streaming STT
  participant L as Streaming LLM
  participant TO as Tool registry
  participant T as Streaming TTS

  U->>SRV: PCM16 frames (16kHz mono, base64)
  SRV->>V: each frame
  V->>V: detect voice activity
  V->>S: forward voice frames
  S-->>SRV: partial transcripts (live)
  Note over S: trailing silence -> flush -> final transcript
  S->>L: final transcript + tool definitions
  L-->>T: first token (after first token, state -> SPEAK)
  L-->>TO: tool_call (if requested)
  TO-->>L: tool result (re-stream)
  T-->>SRV: PCM audio chunks (per sentence)
  SRV-->>U: audio playback
  Note over U,SRV: total latency P50 ~800ms self-hosted
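
To make the first arrow of the diagram concrete, here is a minimal sketch of decoding one incoming frame on the server. It assumes the payload is little-endian signed 16-bit PCM encoded as base64; `decodePcm16Frame` and `frameDurationMs` are illustrative names, not functions from the repository.

```typescript
// Hypothetical helper: decode one base64-encoded PCM16 frame into samples.
// Assumes little-endian signed 16-bit mono audio.
function decodePcm16Frame(base64: string): Int16Array {
  const bytes = Buffer.from(base64, "base64");
  // Ignore a trailing odd byte, if any, so every sample is complete.
  const sampleCount = Math.floor(bytes.length / 2);
  const samples = new Int16Array(sampleCount);
  for (let i = 0; i < sampleCount; i++) {
    samples[i] = bytes.readInt16LE(i * 2);
  }
  return samples;
}

// Duration of a frame in milliseconds at the given sample rate.
function frameDurationMs(samples: Int16Array, sampleRate = 16000): number {
  return (samples.length * 1000) / sampleRate;
}
```

For example, at 16 kHz mono a 640-byte frame decodes to 320 samples, which is 20 ms of audio.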

State machine

stateDiagram-v2
  [*] --> IDLE
  IDLE --> LISTEN: voice detected
  LISTEN --> THINK: final transcript
  THINK --> SPEAK: first LLM token
  SPEAK --> IDLE: turn complete
  LISTEN --> IDLE: silence, no transcript
  THINK --> LISTEN: barge-in (cancel LLM + TTS)
  SPEAK --> LISTEN: barge-in (cancel LLM + TTS)
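
The diagram above can be written down as a plain transition table. This is only a sketch of the same edges; the type names, event names, and `nextState` are made up for illustration and are not taken from `orchestrator.ts`.

```typescript
type TurnState = "IDLE" | "LISTEN" | "THINK" | "SPEAK";
type TurnEvent =
  | "voice_detected"
  | "final_transcript"
  | "first_llm_token"
  | "turn_complete"
  | "silence_no_transcript"
  | "barge_in";

// One entry per edge in the state diagram.
const transitions: Record<TurnState, Partial<Record<TurnEvent, TurnState>>> = {
  IDLE: { voice_detected: "LISTEN" },
  LISTEN: { final_transcript: "THINK", silence_no_transcript: "IDLE" },
  THINK: { first_llm_token: "SPEAK", barge_in: "LISTEN" },
  SPEAK: { turn_complete: "IDLE", barge_in: "LISTEN" },
};

// Events that have no edge from the current state leave it unchanged.
function nextState(state: TurnState, event: TurnEvent): TurnState {
  return transitions[state][event] ?? state;
}
```

Keeping the edges in one table makes it easy to check that barge-in is only reachable from THINK and SPEAK.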

Components

| File | Responsibility |
| --- | --- |
| `apps/server/src/index.ts` | Fastify server, `/health` and `/voice` WebSocket, message dispatch |
| `apps/server/src/pipeline/orchestrator.ts` | Duplex state machine, barge-in, function-call passthrough |
| `apps/server/src/pipeline/tools.ts` | Tool registry and default tools |
| `apps/server/src/pipeline/vad.ts` | RMS-threshold voice activity detection |
| `apps/server/src/adapters/audio.ts` | PCM/WAV conversion and sentence splitting |
| `apps/server/src/adapters/llm/sse.ts` | OpenAI-compatible SSE reader and wire mapping |
| `apps/server/src/adapters/stt/*.ts` | Streaming STT adapters (Whisper.cpp, Deepgram, OpenAI Whisper) |
| `apps/server/src/adapters/llm/*.ts` | Streaming LLM adapters (Groq, SarmaLink-AI, OpenAI) |
| `apps/server/src/adapters/tts/*.ts` | Chunked TTS adapters (OpenTTS, Cartesia, ElevenLabs) |
| `apps/web/app/page.tsx` | Browser client with microphone capture |
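
For readers unfamiliar with RMS-threshold voice activity detection, the idea can be sketched in a few lines. This is not the code in `vad.ts`; the function names and the `0.02` threshold are assumptions chosen for illustration.

```typescript
// Root-mean-square energy of a PCM16 frame, normalized to the range 0..1.
function rms(samples: Int16Array): number {
  if (samples.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < samples.length; i++) {
    const x = samples[i] / 32768; // scale int16 to roughly -1..1
    sumSquares += x * x;
  }
  return Math.sqrt(sumSquares / samples.length);
}

// Hypothetical threshold; a real deployment would tune it per microphone
// and would usually require several consecutive voiced frames.
function isVoice(samples: Int16Array, threshold = 0.02): boolean {
  return rms(samples) >= threshold;
}
```

A frame counts as voice when its average energy crosses the threshold, which is cheap to compute per frame but sensitive to background noise.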

Adapter interfaces

interface SttAdapter {
  readonly id: string
  feed(pcm: Buffer): Promise<{ text: string; final: boolean } | null>
  flush(): Promise<{ text: string; final: boolean } | null>
  reset(): void
}

interface LlmAdapter {
  readonly id: string
  stream(opts: { messages: ChatMessage[]; signal: AbortSignal; tools?: ToolDefinition[] })
    : AsyncGenerator<{ type: 'token'; text: string } | { type: 'tool_call'; call: ToolCall }>
}

interface TtsAdapter {
  readonly id: string
  feed(text: string): void
  stream(opts: { signal: AbortSignal }): AsyncGenerator<Buffer>
  end(): void
  reset(): void
}
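
One way these interfaces could fit together is shown below: tokens from an LLM stream are buffered until a sentence boundary and then handed to `feed`, which matches the per-sentence audio chunks in the sequence diagram. This is a sketch under assumptions: `splitSentences`, `pipeTokensToTts`, and the boundary rule (terminator followed by whitespace) are illustrative, and the real splitter in `audio.ts` may differ.

```typescript
// Hypothetical splitter: returns complete sentences plus the unfinished tail.
// A sentence is treated as complete only when ., ! or ? is followed by whitespace.
function splitSentences(buffer: string): { sentences: string[]; rest: string } {
  const sentences: string[] = [];
  const boundary = /[.!?]+\s+/g;
  let start = 0;
  let match: RegExpExecArray | null;
  while ((match = boundary.exec(buffer)) !== null) {
    const end = match.index + match[0].length;
    sentences.push(buffer.slice(start, end).trim());
    start = end;
  }
  return { sentences, rest: buffer.slice(start) };
}

// Only the part of TtsAdapter that this sketch needs.
interface TtsFeed {
  feed(text: string): void;
  end(): void;
}

// Buffer streamed tokens and feed the TTS one sentence at a time.
async function pipeTokensToTts(tokens: AsyncIterable<string>, tts: TtsFeed): Promise<void> {
  let buffer = "";
  for await (const token of tokens) {
    buffer += token;
    const { sentences, rest } = splitSentences(buffer);
    for (const sentence of sentences) tts.feed(sentence);
    buffer = rest;
  }
  // Flush whatever is left once the LLM stream ends.
  const tail = buffer.trim();
  if (tail.length > 0) tts.feed(tail);
  tts.end();
}
```

Feeding whole sentences lets synthesis start on the first sentence while the model is still generating the rest.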

Barge-in

If the VAD detects speech while the machine is in THINK or SPEAK, the orchestrator aborts both the LLM and TTS AbortControllers, resets the STT and TTS adapters, emits a barge-in control message, and drops to LISTEN. The abort propagates through the fetch body reader and every for await loop, so no orphaned stream keeps talking over the user. This path is covered by an end-to-end test that interrupts a long-running turn mid-stream and asserts the machine returns to LISTEN.
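
The cancellation path described above can be sketched as follows. The `Turn` shape, `bargeIn`, and `slowStream` are hypothetical stand-ins rather than the orchestrator's actual code, and the adapter resets and the control message are omitted for brevity.

```typescript
// Hypothetical per-turn bookkeeping: current state plus one controller per stage.
interface Turn {
  state: "LISTEN" | "THINK" | "SPEAK";
  llmAbort: AbortController;
  ttsAbort: AbortController;
}

// Cancel both streams and fall back to LISTEN. Returns false when there is
// nothing to interrupt (the machine is not in THINK or SPEAK).
function bargeIn(turn: Turn): boolean {
  if (turn.state !== "THINK" && turn.state !== "SPEAK") return false;
  turn.llmAbort.abort();
  turn.ttsAbort.abort();
  turn.state = "LISTEN";
  return true;
}

// Stand-in for a streaming LLM or TTS source that honors an AbortSignal.
async function* slowStream(signal: AbortSignal): AsyncGenerator<string> {
  for (let i = 0; i < 100; i++) {
    if (signal.aborted) return; // stop producing once cancelled
    await new Promise((resolve) => setTimeout(resolve, 5));
    yield `chunk-${i}`;
  }
}
```

A consumer written as `for await (const chunk of slowStream(turn.llmAbort.signal))` simply sees the stream end after `bargeIn(turn)` is called, so no extra cancellation logic is needed in the loop body.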

Function-call passthrough

The orchestrator advertises the registered tool definitions on every LLM call. The shared SSE reader assembles fragmented tool_calls deltas (the model streams the function name first, then the JSON arguments a few characters at a time) into one complete call per index. When the model requests a tool, the orchestrator runs the handler from the ToolRegistry, appends an assistant turn recording the request and a tool turn carrying the result, and re-streams. Tool rounds are bounded by maxToolRounds (default 3), and handler errors are returned to the model as the tool result rather than crashing the session. This path is covered by an end-to-end test that scripts a tool-call turn followed by a grounded answer.
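
The two mechanisms in this section, delta assembly and the bounded tool loop, can be sketched together. The delta shape follows the OpenAI-compatible streaming format (`index`, optional `id`, and partial `function.name` / `function.arguments` strings); everything else, including `assembleToolCalls`, `runWithTools`, and the way the assistant turn is recorded, is an assumption made for illustration.

```typescript
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}
interface AssembledCall { id: string; name: string; arguments: string }

// Concatenate fragments that share an index into one complete call each.
function assembleToolCalls(deltas: ToolCallDelta[]): AssembledCall[] {
  const byIndex = new Map<number, AssembledCall>();
  for (const delta of deltas) {
    let call = byIndex.get(delta.index);
    if (!call) {
      call = { id: "", name: "", arguments: "" };
      byIndex.set(delta.index, call);
    }
    if (delta.id) call.id = delta.id;
    if (delta.function?.name) call.name += delta.function.name;
    if (delta.function?.arguments) call.arguments += delta.function.arguments;
  }
  return Array.from(byIndex.entries())
    .sort((a, b) => a[0] - b[0])
    .map((entry) => entry[1]);
}

type Msg = { role: "user" | "assistant" | "tool"; content: string };
type LlmEvent = { type: "token"; text: string } | { type: "tool_call"; call: AssembledCall };
type StreamFn = (messages: Msg[]) => AsyncGenerator<LlmEvent>;
type Handler = (args: unknown) => Promise<string>;

// Stream, run at most maxToolRounds tool calls, and re-stream after each one.
async function runWithTools(
  stream: StreamFn,
  handlers: Record<string, Handler>,
  messages: Msg[],
  maxToolRounds = 3,
): Promise<string> {
  let answer = "";
  for (let round = 0; round <= maxToolRounds; round++) {
    let requested: AssembledCall | null = null;
    for await (const event of stream(messages)) {
      if (event.type === "token") answer += event.text;
      else { requested = event.call; break; }
    }
    if (!requested || round === maxToolRounds) return answer;
    let result: string;
    try {
      const handler = handlers[requested.name];
      if (!handler) throw new Error(`unknown tool: ${requested.name}`);
      result = await handler(JSON.parse(requested.arguments));
    } catch (err) {
      // Errors go back to the model as the tool result instead of ending the session.
      result = `error: ${err instanceof Error ? err.message : String(err)}`;
    }
    messages.push({ role: "assistant", content: `tool_call ${requested.name} ${requested.arguments}` });
    messages.push({ role: "tool", content: result });
  }
  return answer;
}
```

Returning handler failures as ordinary tool results lets the model apologize or retry in its next stream, while the round limit keeps a misbehaving model from looping forever.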

Why each piece

Plain WebSocket transport rather than a WebRTC SFU. The orchestrator is transport-agnostic, so a mediasoup or LiveKit edge is a swap that the pipeline never sees. The starter keeps the transport simple on purpose.
Low-cost defaults (Groq, Whisper.cpp, OpenTTS) so the stack runs end-to-end with no per-minute provider fees. Whisper.cpp and OpenTTS run on your own hardware, while Groq is a hosted API. Other hosted providers are one environment variable away per layer.
Llama 4 on Groq as the default LLM because Groq's LPU inference stack makes a sub-300ms first-token target realistic, and time to first token is the main bottleneck for how responsive a voice agent feels.
Pluggable adapters because every team disagrees on which provider is best. The orchestrator does not care which one you pick.
Barge-in via AbortController rather than queue cancellation because abort signals propagate cleanly through fetch streams and for await loops.
Injectable adapters and a Sink seam so the entire pipeline is testable without a live socket or provider keys.
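
To illustrate the last point, the sketch below drives a turn with a scripted LLM and an in-memory sink, with no socket and no provider key. The `Sink` shape, the message types, and `runTurn` are hypothetical; the source only states that adapters are injectable and that a Sink seam exists.

```typescript
type TokenEvent = { type: "token"; text: string };

// Only the part of LlmAdapter that this sketch needs.
interface LlmLike {
  readonly id: string;
  stream(opts: { signal: AbortSignal }): AsyncGenerator<TokenEvent>;
}

// Hypothetical Sink seam: whatever the orchestrator would write to the socket.
interface SinkMessage { type: string; text?: string }
interface Sink { send(message: SinkMessage): void }

// Test double that replays a fixed list of tokens.
function scriptedLlm(tokens: string[]): LlmLike {
  return {
    id: "scripted",
    async *stream(opts: { signal: AbortSignal }): AsyncGenerator<TokenEvent> {
      for (const text of tokens) {
        if (opts.signal.aborted) return;
        yield { type: "token", text };
      }
    },
  };
}

// Test double that records every outgoing message for later assertions.
function recordingSink(): Sink & { messages: SinkMessage[] } {
  const messages: SinkMessage[] = [];
  return { messages, send: (message) => { messages.push(message); } };
}

// Stand-in for the part of a turn that forwards LLM output to the client.
async function runTurn(llm: LlmLike, sink: Sink, signal: AbortSignal): Promise<void> {
  for await (const event of llm.stream({ signal })) {
    sink.send({ type: "token", text: event.text });
  }
  if (!signal.aborted) sink.send({ type: "turn_complete" });
}
```

Because both ends are plain objects, a test can assert on `sink.messages` directly after `await runTurn(scriptedLlm(["Hi", " there"]), sink, new AbortController().signal)`.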
