Quick Start

sarmakska edited this page May 31, 2026 · 2 revisions

git clone https://github.com/sarmakska/voice-agent-starter.git
cd voice-agent-starter
pnpm install
cp .env.example .env
pnpm dev

This starts the server on port 3001 and the web client on port 3000. Open http://localhost:3000, click Start, grant microphone access, and start talking.

The self-hosted default stack

The defaults avoid per-minute provider fees: speech-to-text and text-to-speech run on self-hosted, open-source servers, and only the LLM calls a hosted API.

  • LLM: Groq Llama 4. Set GROQ_API_KEY from the Groq console. Groq is a hosted API but has a generous free tier and is the fastest path to a sub-300ms first token.
  • STT: Whisper.cpp. Run a whisper-server and point WHISPERCPP_URL at it (default http://localhost:8090).
  • TTS: OpenTTS Coqui XTTS v2. Run an OpenTTS server and point OPENTTS_URL at it (default http://localhost:5500). The default voice is coqui-tts:en_vctk#xtts_v2.

The quickest way to stand up the two self-hosted servers is their official containers: ghcr.io/ggml-org/whisper.cpp for Whisper.cpp and synesthesiam/opentts for OpenTTS. Expose them on the ports above and the defaults work with no further configuration.
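As a rough illustration of how those defaults could be resolved, here is a minimal sketch. The environment variable names, default URLs, and default voice are taken from the list above; the `resolveSelfHostedConfig` helper and its types are hypothetical and are not the project's actual code.

```typescript
// Hypothetical helper: illustrates the documented defaults, not the project's real loader.
type EnvVars = Record<string, string | undefined>;

interface SelfHostedConfig {
  whisperUrl: string;
  openTtsUrl: string;
  voice: string;
}

// Default values quoted from the list above.
const SELF_HOSTED_DEFAULTS: SelfHostedConfig = {
  whisperUrl: "http://localhost:8090",
  openTtsUrl: "http://localhost:5500",
  voice: "coqui-tts:en_vctk#xtts_v2",
};

// Use the variable when it is set and non-blank, otherwise fall back to the default.
function resolveSelfHostedConfig(env: EnvVars): SelfHostedConfig {
  const whisper = (env["WHISPERCPP_URL"] ?? "").trim();
  const openTts = (env["OPENTTS_URL"] ?? "").trim();
  return {
    whisperUrl: whisper !== "" ? whisper : SELF_HOSTED_DEFAULTS.whisperUrl,
    openTtsUrl: openTts !== "" ? openTts : SELF_HOSTED_DEFAULTS.openTtsUrl,
    voice: SELF_HOSTED_DEFAULTS.voice,
  };
}
```

Passing the environment in as a plain object, rather than reading a global, keeps the sketch easy to exercise without any servers running.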

Using hosted providers instead

The provider for each layer is selected by a single environment variable. To run entirely on hosted APIs:

STT_PROVIDER=deepgram
LLM_PROVIDER=openai
TTS_PROVIDER=elevenlabs

DEEPGRAM_API_KEY=...
OPENAI_API_KEY=...
ELEVENLABS_API_KEY=...

See .env.example for the full list.
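Before starting the server, it can help to check that every selected hosted provider has its key set. The sketch below assumes the provider-to-key pairing implied by the example above; the `missingProviderKeys` function and its lookup table are hypothetical, and providers not listed in the table are simply ignored.

```typescript
// Hypothetical pre-flight check; variable names come from the example above.
type ProviderEnv = Record<string, string | undefined>;

// Assumed pairing of provider name to the API key variable it needs.
const REQUIRED_KEY: Record<string, string> = {
  deepgram: "DEEPGRAM_API_KEY",
  openai: "OPENAI_API_KEY",
  elevenlabs: "ELEVENLABS_API_KEY",
};

const PROVIDER_VARS = ["STT_PROVIDER", "LLM_PROVIDER", "TTS_PROVIDER"] as const;

// Return the key variables that are required by the selected providers but unset.
function missingProviderKeys(env: ProviderEnv): string[] {
  const missing: string[] = [];
  for (const name of PROVIDER_VARS) {
    const provider = env[name];
    if (!provider) continue; // unset: the project's default for this layer applies
    const keyVar = REQUIRED_KEY[provider];
    if (keyVar && !env[keyVar]) missing.push(keyVar);
  }
  return missing;
}
```

An empty result means every hosted provider named in the environment has a key; anything else lists the variables still to fill in.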

The keyless path

If you have not configured any providers yet, the pipeline still runs end-to-end. The LLM adapter yields a single configuration message and the STT and TTS adapters return nothing, but the IDLE/LISTEN/THINK/SPEAK transitions, barge-in, and tool-call routing all work. This is useful for verifying the transport and state machine before standing up the servers, and it is exactly what the end-to-end test suite drives.
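To make the keyless behavior concrete, here is a minimal sketch of what such placeholder adapters might look like. Everything in it is an assumption for illustration: the `KeylessAdapters` interface, the `runTurn` helper, and the wording of the configuration message are hypothetical, and the adapters return plain arrays rather than streams to keep the example short.

```typescript
// Hypothetical adapter shapes; the project's real interfaces may differ.
interface KeylessAdapters {
  transcribe(audio: Uint8Array): string[]; // STT: transcripts
  complete(prompt: string): string[]; // LLM: response messages
  synthesize(text: string): Uint8Array[]; // TTS: audio chunks
}

// Placeholder wording, not the project's actual message.
const CONFIG_NOTICE = "No providers are configured yet.";

const keylessAdapters: KeylessAdapters = {
  transcribe: () => [], // STT returns nothing
  complete: () => [CONFIG_NOTICE], // LLM yields a single configuration message
  synthesize: () => [], // TTS returns nothing
};

interface TurnResult {
  transcripts: string[];
  reply: string[];
  audioChunks: Uint8Array[];
}

// Run one STT -> LLM -> TTS pass through whichever adapters are supplied.
function runTurn(adapters: KeylessAdapters, audio: Uint8Array): TurnResult {
  const transcripts = adapters.transcribe(audio);
  const reply = adapters.complete(transcripts.join(" "));
  const audioChunks: Uint8Array[] = [];
  for (const sentence of reply) {
    audioChunks.push(...adapters.synthesize(sentence));
  }
  return { transcripts, reply, audioChunks };
}
```

Because the turn still passes through all three stages, the surrounding transport and state handling can be exercised even though no audio or transcripts are produced.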

What you should see

  1. Browser status flips to listening when you start talking.
  2. Partial transcripts appear as you speak.
  3. THINK starts when you stop speaking: the trailing-silence flush produces the final transcript.
  4. SPEAK plays back the response, starting from the first sentence.
  5. Interrupting the response cancels it and returns to LISTEN.
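The sequence above can be read as a small state machine. The sketch below uses the four state names from this page; the event names and the `nextState` function are hypothetical, and returning to IDLE when a response finishes uninterrupted is an assumption, since the page only states what happens on interruption.

```typescript
// State names come from this page; event names are illustrative assumptions.
type AgentState = "IDLE" | "LISTEN" | "THINK" | "SPEAK";
type AgentEvent =
  | "speech_started" // the user starts talking
  | "final_transcript" // trailing-silence flush produced the final transcript
  | "first_sentence_ready" // the first sentence of the response is ready to play
  | "barge_in" // the user interrupts the response
  | "response_done"; // playback finished without interruption

function nextState(state: AgentState, event: AgentEvent): AgentState {
  switch (state) {
    case "IDLE":
      return event === "speech_started" ? "LISTEN" : state;
    case "LISTEN":
      return event === "final_transcript" ? "THINK" : state;
    case "THINK":
      return event === "first_sentence_ready" ? "SPEAK" : state;
    case "SPEAK":
      if (event === "barge_in") return "LISTEN"; // cancel and listen again
      if (event === "response_done") return "IDLE"; // assumption, see lead-in
      return state;
  }
}
```

Events that do not apply to the current state leave it unchanged, which keeps stray or late events from moving the agent into an unexpected state.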

Verify it locally

pnpm lint
pnpm typecheck
pnpm build
pnpm test

The test suite runs the full pipeline through fake adapters, so it passes with no provider keys.

If something breaks

See the Troubleshooting table on the Home page for the common symptoms and fixes.
