Architecture
```mermaid
sequenceDiagram
    participant U as User browser
    participant SRV as Fastify / WebSocket
    participant V as VAD
    participant S as Streaming STT
    participant L as Streaming LLM
    participant TO as Tool registry
    participant T as Streaming TTS
    U->>SRV: PCM16 frames (16kHz mono, base64)
    SRV->>V: each frame
    V->>V: detect voice activity
    V->>S: forward voice frames
    S-->>SRV: partial transcripts (live)
    Note over S: trailing silence -> flush -> final transcript
    S->>L: final transcript + tool definitions
    L-->>T: first token (after first token, state -> SPEAK)
    L-->>TO: tool_call (if requested)
    TO-->>L: tool result (re-stream)
    T-->>SRV: PCM audio chunks (per sentence)
    SRV-->>U: audio playback
    Note over U,SRV: total latency P50 ~800ms self-hosted
```
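The first arrow in the sequence diagram carries PCM16 frames (16 kHz mono, base64). As a rough sketch of what a client might do before sending a frame, the function below converts Float32 samples in the range [-1, 1] to little-endian PCM16 and base64-encodes the bytes. The name `floatToPcm16Base64` is hypothetical, the little-endian byte order is an assumption, and the WebSocket message envelope is left out because the source does not specify it.

```typescript
// Hypothetical helper: not taken from the repository.
// Converts Float32 samples in [-1, 1] to little-endian PCM16, then base64.
function floatToPcm16Base64(samples: Float32Array): string {
  const bytes = new Uint8Array(samples.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < samples.length; i++) {
    // Clamp so out-of-range input cannot overflow the 16-bit range.
    const s = Math.max(-1, Math.min(1, samples[i]));
    // Negative samples scale to -32768, positive samples to 32767.
    const v = Math.round(s < 0 ? s * 0x8000 : s * 0x7fff);
    view.setInt16(i * 2, v, true); // true = little-endian (assumed)
  }
  // Build a binary string byte by byte, then base64-encode it.
  let binary = "";
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);
  }
  return btoa(binary);
}

const encodedFrame = floatToPcm16Base64(new Float32Array([0, 1, -1]));
```

At 16 kHz mono, each sample occupies two bytes, so a frame of N samples yields 2N bytes before base64 expansion.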
```mermaid
stateDiagram-v2
    [*] --> IDLE
    IDLE --> LISTEN: voice detected
    LISTEN --> THINK: final transcript
    THINK --> SPEAK: first LLM token
    SPEAK --> IDLE: turn complete
    LISTEN --> IDLE: silence, no transcript
    THINK --> LISTEN: barge-in (cancel LLM + TTS)
    SPEAK --> LISTEN: barge-in (cancel LLM + TTS)
```
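The state diagram can be read as a transition table. The following is a minimal sketch of that table, not the orchestrator's actual code: the event names (`voice_detected`, `barge_in`, and so on) and the `nextState` function are hypothetical labels for the edges above, and ignoring unlisted events is an assumption.

```typescript
type TurnState = "IDLE" | "LISTEN" | "THINK" | "SPEAK";

// Hypothetical event names, one per labelled edge in the diagram.
type TurnEvent =
  | "voice_detected"
  | "final_transcript"
  | "first_llm_token"
  | "turn_complete"
  | "silence_no_transcript"
  | "barge_in";

// Each state lists only the events that move it somewhere else.
const transitions: Record<TurnState, Partial<Record<TurnEvent, TurnState>>> = {
  IDLE: { voice_detected: "LISTEN" },
  LISTEN: { final_transcript: "THINK", silence_no_transcript: "IDLE" },
  THINK: { first_llm_token: "SPEAK", barge_in: "LISTEN" },
  SPEAK: { turn_complete: "IDLE", barge_in: "LISTEN" },
};

// Events with no edge from the current state leave the state unchanged
// (an assumption; the real orchestrator may treat them differently).
function nextState(state: TurnState, event: TurnEvent): TurnState {
  const target = transitions[state][event];
  return target === undefined ? state : target;
}
```

For example, `nextState("SPEAK", "barge_in")` yields `"LISTEN"`, matching the barge-in edges out of both THINK and SPEAK.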
| File | Responsibility |
|---|---|
| `apps/server/src/index.ts` | Fastify server, `/health` and `/voice` WebSocket, message dispatch |
| `apps/server/src/pipeline/orchestrator.ts` | Duplex state machine, barge-in, function-call passthrough |
| `apps/server/src/pipeline/tools.ts` | Tool registry and default tools |
| `apps/server/src/pipeline/vad.ts` | RMS-threshold voice activity detection |
| `apps/server/src/adapters/audio.ts` | PCM/WAV conversion and sentence splitting |
| `apps/server/src/adapters/llm/sse.ts` | OpenAI-compatible SSE reader and wire mapping |
| `apps/server/src/adapters/stt/*.ts` | Streaming STT adapters (Whisper.cpp, Deepgram, OpenAI Whisper) |
| `apps/server/src/adapters/llm/*.ts` | Streaming LLM adapters (Groq, SarmaLink-AI, OpenAI) |
| `apps/server/src/adapters/tts/*.ts` | Chunked TTS adapters (OpenTTS, Cartesia, ElevenLabs) |
| `apps/web/app/page.tsx` | Browser client with microphone capture |
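The table describes the VAD as RMS-threshold based. A minimal sketch of that idea, under stated assumptions: frames arrive as `Int16Array` PCM16 samples, and the threshold of `0.02` on a normalized 0 to 1 scale is an illustrative value, not the one used in `vad.ts`. The names `frameRms` and `isVoiceFrame` are hypothetical.

```typescript
// Root-mean-square energy of a PCM16 frame, normalized to [0, 1].
function frameRms(frame: Int16Array): number {
  if (frame.length === 0) return 0;
  let sumSquares = 0;
  for (let i = 0; i < frame.length; i++) {
    const normalized = frame[i] / 32768; // map int16 to roughly [-1, 1)
    sumSquares += normalized * normalized;
  }
  return Math.sqrt(sumSquares / frame.length);
}

// A frame counts as voice when its energy reaches the threshold.
// The default threshold is an assumed value for illustration only.
function isVoiceFrame(frame: Int16Array, threshold: number = 0.02): boolean {
  return frameRms(frame) >= threshold;
}
```

A production detector would usually add a hangover window so that short pauses between words do not end the utterance; that is omitted here.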
```typescript
interface SttAdapter {
  readonly id: string
  feed(pcm: Buffer): Promise<{ text: string; final: boolean } | null>
  flush(): Promise<{ text: string; final: boolean } | null>
  reset(): void
}

interface LlmAdapter {
  readonly id: string
  stream(opts: { messages: ChatMessage[]; signal: AbortSignal; tools?: ToolDefinition[] })
    : AsyncGenerator<{ type: 'token'; text: string } | { type: 'tool_call'; call: ToolCall }>
}

interface TtsAdapter {
  readonly id: string
  feed(text: string): void
  stream(opts: { signal: AbortSignal }): AsyncGenerator<Buffer>
  end(): void
  reset(): void
}
```

If the VAD detects speech while the machine is in THINK or SPEAK, the orchestrator aborts both the LLM and TTS `AbortController`s, resets the STT and TTS adapters, emits a barge-in control message, and drops to LISTEN. The abort propagates through the `fetch` body reader and every `for await` loop, so no orphaned stream keeps talking over the user. This path is covered by an end-to-end test that interrupts a long-running turn mid-stream and asserts the machine returns to LISTEN.
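The barge-in sequence described above can be sketched as a small class. This is an illustration under assumptions, not the orchestrator's code: `BargeInSketch`, `startTurn`, `onVoiceDetected`, the `Resettable` interface, and the `{ type: "barge-in" }` message shape are all hypothetical. Only the order of operations (abort, reset, emit, return to LISTEN) follows the description.

```typescript
type MachineState = "IDLE" | "LISTEN" | "THINK" | "SPEAK";
interface Resettable { reset(): void }
type ControlMessage = { type: "barge-in" }; // assumed message shape

class BargeInSketch {
  state: MachineState = "IDLE";
  llmAbort: AbortController = new AbortController();
  ttsAbort: AbortController = new AbortController();
  stt: Resettable;
  tts: Resettable;
  emit: (msg: ControlMessage) => void;

  constructor(stt: Resettable, tts: Resettable, emit: (msg: ControlMessage) => void) {
    this.stt = stt;
    this.tts = tts;
    this.emit = emit;
  }

  // Each turn gets fresh controllers, because an aborted signal stays aborted.
  startTurn(): void {
    this.llmAbort = new AbortController();
    this.ttsAbort = new AbortController();
    this.state = "THINK";
  }

  // Called when the VAD reports speech.
  onVoiceDetected(): void {
    if (this.state === "THINK" || this.state === "SPEAK") {
      // Any fetch or `for await` loop holding these signals stops here.
      this.llmAbort.abort();
      this.ttsAbort.abort();
      this.stt.reset();
      this.tts.reset();
      this.emit({ type: "barge-in" });
    }
    this.state = "LISTEN";
  }
}
```

Creating new controllers per turn matters because an `AbortSignal` cannot be un-aborted; reusing one would cancel the next turn immediately.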
The orchestrator advertises the registered tool definitions on every LLM call. The shared SSE reader assembles fragmented `tool_calls` deltas (the model streams the function name first, then the JSON arguments a few characters at a time) into one complete call per index. When the model requests a tool, the orchestrator runs the handler from the `ToolRegistry`, appends an assistant turn recording the request and a tool turn carrying the result, and re-streams. Tool rounds are bounded by `maxToolRounds` (default 3), and handler errors are returned to the model as the tool result rather than crashing the session. This path is covered by an end-to-end test that scripts a tool-call turn followed by a grounded answer.
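To make the delta assembly concrete, here is a minimal sketch that merges fragments by index. It assumes the delta shape used by OpenAI-compatible streaming responses (`index`, optional `id`, optional `function.name` and `function.arguments`); `assembleToolCalls` and `AssembledCall` are hypothetical names and not the API of the shared SSE reader.

```typescript
// Assumed fragment shape, modelled on OpenAI-compatible streaming deltas.
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

interface AssembledCall {
  id: string;
  name: string;
  arguments: string; // raw JSON text, parsed only once complete
}

function assembleToolCalls(deltas: ToolCallDelta[]): AssembledCall[] {
  const calls: (AssembledCall | undefined)[] = [];
  for (const delta of deltas) {
    let call = calls[delta.index];
    if (call === undefined) {
      call = { id: "", name: "", arguments: "" };
      calls[delta.index] = call;
    }
    if (delta.id) call.id = delta.id;
    if (delta.function) {
      // Concatenate so fragmented names and argument chunks both work.
      if (delta.function.name) call.name += delta.function.name;
      if (delta.function.arguments) call.arguments += delta.function.arguments;
    }
  }
  // Drop any gaps left by indices that never appeared.
  return calls.filter((c): c is AssembledCall => c !== undefined);
}

const assembled = assembleToolCalls([
  { index: 0, id: "call_1", function: { name: "get_weather", arguments: "" } },
  { index: 0, function: { arguments: '{"ci' } },
  { index: 0, function: { arguments: 'ty":"Oslo"}' } },
]);
```

The arguments string is only valid JSON after the last fragment arrives, so parsing is deferred until the stream signals that the call is complete.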
- Plain WebSocket transport rather than a WebRTC SFU. The orchestrator is transport-agnostic, so a mediasoup or LiveKit edge is a swap that the pipeline never sees. The starter keeps the transport simple on purpose.
- Self-hosted defaults (Groq, Whisper.cpp, OpenTTS) so the stack runs end-to-end with no per-minute provider fees. Hosted providers are one environment variable away per layer.
- Groq Llama 4 as the default LLM because its LPU inference stack makes a sub-300ms time to first token a realistic target, and first-token latency is the bottleneck for how responsive a voice agent feels.
- Pluggable adapters because every team disagrees on which provider is best. The orchestrator does not care which one you pick.
- Barge-in via `AbortController` rather than queue cancellation because abort signals propagate cleanly through `fetch` streams and `for await` loops.
- Injectable adapters and a `Sink` seam so the entire pipeline is testable without a live socket or provider keys.
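The "one environment variable away per layer" idea could look roughly like the sketch below. Everything in it is an assumption for illustration: the variable name `LLM_PROVIDER`, the adapter ids `groq` and `openai`, and the `pickAdapter` helper are hypothetical and may not match the repository's configuration. The environment is passed in as a plain object so the function stays testable without touching the process environment.

```typescript
// Every adapter interface in this README exposes a readonly `id`.
interface IdentifiedAdapter { readonly id: string }

// Picks the adapter whose id matches the environment variable,
// falling back to a default id when the variable is unset.
function pickAdapter<T extends IdentifiedAdapter>(
  env: Record<string, string | undefined>,
  variable: string,
  available: T[],
  fallbackId: string
): T {
  const value = env[variable];
  const wanted = value === undefined || value === "" ? fallbackId : value;
  for (const adapter of available) {
    if (adapter.id === wanted) return adapter;
  }
  // Fail loudly at startup rather than mid-conversation.
  throw new Error("Unknown adapter '" + wanted + "' for " + variable);
}

// Hypothetical adapter stubs; real adapters would also implement stream().
const llmAdapters: IdentifiedAdapter[] = [{ id: "groq" }, { id: "openai" }];

const defaultLlm = pickAdapter({}, "LLM_PROVIDER", llmAdapters, "groq");
const hostedLlm = pickAdapter({ LLM_PROVIDER: "openai" }, "LLM_PROVIDER", llmAdapters, "groq");
```

Because the orchestrator only depends on the adapter interface, the same helper would serve the STT and TTS layers with a different variable name and adapter list.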