A production-grade, real-time voice assistant engineered for frontier-audio-grade performance: adaptive jitter buffering, spectral analysis, AGC-normalized capture, tool-grounded intelligence, and sub-second conversational latency. Built for frontline operational use where audio quality, speed, and interruptibility are non-negotiable.
Browser (React + Vite)             Server (Node.js + TypeScript)
┌─────────────────────────┐        ┌──────────────────────────────────┐
│ AudioWorklet (mic)      │──PCM──▸│ Whisper STT / Browser Speech     │
│ Web Speech API (primary)│        │ Claude LLM (tool use + ReAct)    │
│ Playback Manager        │◂──PCM──│ Local TTS (edge-tts) / 11Labs    │
│ VoiceOrb + Premium UI   │◂─JSON──│ 6-State Machine (orchestrator)   │
│ Audio Health Panel      │        │ 24 Tools (GitHub, Util, Memory)  │
│ Barge-In Detection      │──JSON─▸│ PostgreSQL + pgvector + Redis    │
└─────────────────────────┘        └──────────────────────────────────┘
Pipeline: Text/Speech → Whisper STT or Browser Speech API → Claude (with tools) → Local TTS (edge-tts) → Audio Playback
Measured latency: LLM TTFT ~1.3s, TTS first chunk ~0.6s, E2E ~2-5s depending on tool calls.
This isn't a toy voice wrapper — the audio layer is engineered for production environments.
- Automatic Gain Control (AGC): Normalizes to -18 dBFS target with 10ms attack / 100ms release envelope
- Clipping detection: Counts samples at +/-1.0, both pre- and post-AGC
- Noise floor estimation: Running minimum RMS over 2-second sliding window
- Peak tracking: Exponential decay (300ms time constant) for smooth metering
- Per-buffer metrics: RMS, peak, clipping count, noise floor, AGC gain — all emitted at buffer rate
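The capture-side processing listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual worklet code: the names `AgcState`, `BufferMetrics`, `processBuffer`, the 16 kHz default sample rate, and the `MAX_GAIN` cap are all assumptions, and the envelope here follows the rectified sample level rather than a true RMS.

```typescript
// Hypothetical AGC sketch: one-pole envelope follower with separate
// attack (10 ms) and release (100 ms) constants, targeting -18 dBFS.
interface AgcState { envelope: number; gain: number }
interface BufferMetrics {
  rms: number; peak: number; clippedPre: number; clippedPost: number; gain: number;
}

const TARGET_DBFS = -18;
const TARGET_LINEAR = Math.pow(10, TARGET_DBFS / 20); // about 0.126
const MAX_GAIN = 16; // assumed cap so near-silence is not amplified without bound

// Per-sample smoothing coefficient for a given time constant.
function smoothingCoeff(timeMs: number, sampleRate: number): number {
  return Math.exp(-1 / ((timeMs / 1000) * sampleRate));
}

function processBuffer(
  input: Float32Array,
  state: AgcState,
  sampleRate = 16000, // assumed capture rate
): { output: Float32Array; metrics: BufferMetrics } {
  const attack = smoothingCoeff(10, sampleRate);
  const release = smoothingCoeff(100, sampleRate);
  const output = new Float32Array(input.length);
  let sumSq = 0, peak = 0, clippedPre = 0, clippedPost = 0;

  for (let i = 0; i < input.length; i++) {
    const x = input[i] ?? 0;
    const level = Math.abs(x);
    if (level >= 1) clippedPre++; // clipping before AGC

    // The envelope rises quickly (attack) and falls slowly (release).
    const coeff = level > state.envelope ? attack : release;
    state.envelope = coeff * state.envelope + (1 - coeff) * level;
    state.gain = Math.min(MAX_GAIN, TARGET_LINEAR / Math.max(state.envelope, 1e-6));

    const y = Math.max(-1, Math.min(1, x * state.gain));
    if (Math.abs(y) >= 1) clippedPost++; // clipping after AGC
    output[i] = y;
    sumSq += y * y;
    peak = Math.max(peak, Math.abs(y));
  }

  const rms = Math.sqrt(sumSq / Math.max(input.length, 1));
  return { output, metrics: { rms, peak, clippedPre, clippedPost, gain: state.gain } };
}
```

The state object is carried from one buffer to the next so the gain does not reset at buffer boundaries. Noise floor and decaying peak tracking would be layered on top of the per-buffer `rms` and `peak` values.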
[0x01] [version:1] [codec:1] [sampleRate:2 LE] [seqNum:4 LE] [timestamp:4 LE] [PCM data]
- Versioned wire format for forward/backward compatibility
- Codec field supports PCM-16 (0x01), PCM-24 (0x02), extensible to Opus
- Sequence numbers for packet loss and reorder detection
- Timestamps for jitter measurement and lip-sync
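As an illustration of the frame layout above, the 13-byte header (1 + 1 + 1 + 2 + 4 + 4) could be packed and unpacked with a `DataView`. This is a sketch under stated assumptions: `AudioFrame`, `encodeFrame`, and `decodeFrame` are hypothetical names, and the timestamp unit (milliseconds) is assumed because the source does not specify it.

```typescript
// Hypothetical encoder/decoder for the audio frame header.
const FRAME_TYPE_AUDIO = 0x01;
const HEADER_BYTES = 13; // type + version + codec + sampleRate(2) + seqNum(4) + timestamp(4)

interface AudioFrame {
  version: number;
  codec: number;      // 0x01 = PCM-16, 0x02 = PCM-24
  sampleRate: number; // Hz; must fit in 16 bits (for example 16000 or 48000)
  seqNum: number;
  timestamp: number;  // assumed to be milliseconds, wrapped to 32 bits
  pcm: Uint8Array;
}

function encodeFrame(frame: AudioFrame): Uint8Array {
  if (frame.sampleRate > 0xffff) {
    throw new RangeError("sampleRate does not fit in the 2-byte field");
  }
  const out = new Uint8Array(HEADER_BYTES + frame.pcm.length);
  const view = new DataView(out.buffer);
  view.setUint8(0, FRAME_TYPE_AUDIO);
  view.setUint8(1, frame.version);
  view.setUint8(2, frame.codec);
  view.setUint16(3, frame.sampleRate, true);     // little-endian
  view.setUint32(5, frame.seqNum >>> 0, true);   // little-endian
  view.setUint32(9, frame.timestamp >>> 0, true);
  out.set(frame.pcm, HEADER_BYTES);
  return out;
}

function decodeFrame(bytes: Uint8Array): AudioFrame {
  if (bytes.length < HEADER_BYTES || bytes[0] !== FRAME_TYPE_AUDIO) {
    throw new Error("not an audio frame");
  }
  // Respect byteOffset so this also works on views into a larger buffer.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    version: view.getUint8(1),
    codec: view.getUint8(2),
    sampleRate: view.getUint16(3, true),
    seqNum: view.getUint32(5, true),
    timestamp: view.getUint32(9, true),
    pcm: bytes.subarray(HEADER_BYTES),
  };
}
```

Note that a 2-byte sample rate field tops out at 65,535 Hz, so rates such as 96 kHz would need a different encoding or a new protocol version.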
- Adaptive buffer depth: Targets 2x measured jitter (50ms min, 500ms max)
- Underrun/overrun tracking: Counters exported as Prometheus metrics
- Sequence-aware scheduling: Detects out-of-order and dropped packets
- Gap-free playback: AudioContext.currentTime scheduling for zero-gap audio
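The adaptive depth rule above (2x measured jitter, clamped to 50-500ms) might look like the sketch below. The source does not say how jitter is measured; this example substitutes the smoothed inter-arrival estimator in the style of RFC 3550 (gain 1/16), and `JitterEstimator` with its fields is a hypothetical name. Sequence number wrap-around is ignored for brevity.

```typescript
// Hypothetical jitter tracker feeding the adaptive buffer depth.
class JitterEstimator {
  private jitterMs = 0;
  private lastTransitMs: number | null = null;
  private lastSeq: number | null = null;
  dropped = 0;     // gaps seen in the sequence numbers
  outOfOrder = 0;  // packets arriving with an older sequence number

  onPacket(seqNum: number, senderTimestampMs: number, arrivalMs: number): void {
    if (this.lastSeq !== null) {
      if (seqNum <= this.lastSeq) {
        this.outOfOrder++;
      } else {
        this.dropped += seqNum - this.lastSeq - 1;
      }
    }
    if (this.lastSeq === null || seqNum > this.lastSeq) this.lastSeq = seqNum;

    // Jitter is the smoothed change in transit time between packets.
    const transit = arrivalMs - senderTimestampMs;
    if (this.lastTransitMs !== null) {
      const delta = Math.abs(transit - this.lastTransitMs);
      this.jitterMs += (delta - this.jitterMs) / 16;
    }
    this.lastTransitMs = transit;
  }

  // Target depth: twice the measured jitter, clamped to [50, 500] ms.
  targetDepthMs(): number {
    return Math.min(500, Math.max(50, 2 * this.jitterMs));
  }
}
```

In this simplified version a late packet is counted as both dropped (when the gap is first seen) and out-of-order (when it finally arrives); a production buffer would reconcile the two counters.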
- Real-time FFT spectral analyzer (64-band, 60fps canvas)
- Input level meter (dBFS), VAD state, AEC status
- AGC gain, noise floor, SNR estimate, clipping counter
- Network RTT and jitter (WebSocket ping/pong)
- TTS playback state, buffer depth, barge-in counter
[Capture] → [AGC] → [Encode] → [Network↓] → [STT] → [LLM-TTFT]
↓
[TTS-first] ← [LLM-gen] → [Tools]
↓
[Network↑] → [Jitter Buffer] → [Playback]
Each stage is measured with millisecond precision. P50/P95 percentiles are computed over a rolling 100-sample window. The full breakdown is available through the /status command and the Prometheus export.
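A rolling percentile tracker of this kind can be quite small. The sketch below assumes the nearest-rank percentile definition and a hypothetical class name `RollingPercentiles`; the project's latency tracker may use a different interpolation.

```typescript
// Hypothetical per-stage latency window with nearest-rank percentiles.
class RollingPercentiles {
  private samples: number[] = [];
  private readonly windowSize: number;

  constructor(windowSize = 100) {
    this.windowSize = windowSize;
  }

  record(latencyMs: number): void {
    this.samples.push(latencyMs);
    // Keep only the most recent `windowSize` samples.
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  percentile(p: number): number | undefined {
    if (this.samples.length === 0) return undefined;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank method
    return sorted[Math.max(0, rank - 1)];
  }
}

// One tracker per pipeline stage, for example STT, LLM TTFT, TTS first chunk.
const sttLatency = new RollingPercentiles();
sttLatency.record(120);
sttLatency.record(135);
```

Sorting on every query is acceptable at a window of 100 samples; a larger window would justify an order-statistics structure instead.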
Requirements: Node.js 20+, Docker
# 1. Install
git clone https://github.com/yashkuceriya/Jarvis.git && cd Jarvis
npm install
pip3 install edge-tts # Local TTS (free, no API key)
brew install ffmpeg # Audio format conversion
# 2. Configure
cp .env.example .env
# Required: ANTHROPIC_API_KEY
# Recommended: OPENAI_API_KEY, GITHUB_TOKEN
# 3. Start infrastructure
docker compose up -d # PostgreSQL (pgvector) + Redis
# 4. Run migrations
npx knex migrate:latest --knexfile knexfile.ts
# 5. Launch (3 services)
python3 server/tts-server.py & # Local TTS on :5757
npx tsx server/src/index.ts & # Server on :3131
cd client && npx vite --port 4173 & # Client on :4173
Open http://localhost:4173. The client auto-connects; type in the text box or switch to Voice mode.
| SLO | Implementation | Metric |
|---|---|---|
| Cadence | 800ms working cue timer; audible chime during long tool calls | Never leaves user in silence |
| Latency | Per-stage budget tracking: STT → LLM TTFT → TTS first chunk | Displayed in status bar |
| Truth | Tool-grounded responses only; refuses when evidence unavailable | Zero fabricated data |
| Interruptibility | Barge-in stops TTS, aborts LLM, preserves context with `[interrupted]` | Sub-200ms interruption |
IDLE ──▸ LISTENING_PASSIVE ──▸ LISTENING_ACTIVE ──▸ PROCESSING ──▸ SPEAKING
▴ │
└──────────────── TTS_DONE ◂───────────────────────────┘
BARGE_IN ──▸ CANCELLING
6 states, 18 valid transitions. The working cue auto-fires after 800ms in PROCESSING. Barge-in is a first-class state transition that preserves conversation context.
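A table-driven state machine is one way to enforce valid transitions. The sketch below is illustrative only: the six state names come from the diagram, but every event name other than `TTS_DONE` and `BARGE_IN`, the target states of `TTS_DONE` and `CANCELLED`, and the choice of which states accept `BARGE_IN` are assumptions. It covers 8 transitions, not the full set of 18.

```typescript
// Hypothetical subset of the orchestrator's transition table.
type State =
  | "IDLE" | "LISTENING_PASSIVE" | "LISTENING_ACTIVE"
  | "PROCESSING" | "SPEAKING" | "CANCELLING";

// Only TTS_DONE and BARGE_IN appear in the source; the rest are assumed names.
type PipelineEvent =
  | "START" | "SPEECH_START" | "UTTERANCE_END"
  | "TTS_START" | "TTS_DONE" | "BARGE_IN" | "CANCELLED";

const TRANSITIONS: Record<State, Partial<Record<PipelineEvent, State>>> = {
  IDLE: { START: "LISTENING_PASSIVE" },
  LISTENING_PASSIVE: { SPEECH_START: "LISTENING_ACTIVE" },
  LISTENING_ACTIVE: { UTTERANCE_END: "PROCESSING" },
  PROCESSING: { TTS_START: "SPEAKING", BARGE_IN: "CANCELLING" },
  SPEAKING: { TTS_DONE: "LISTENING_PASSIVE", BARGE_IN: "CANCELLING" },
  // Assumed: after cancelling, the user is already talking, so listen actively.
  CANCELLING: { CANCELLED: "LISTENING_ACTIVE" },
};

function transition(state: State, event: PipelineEvent): State {
  const next = TRANSITIONS[state][event];
  if (next === undefined) {
    throw new Error(`Invalid transition: ${state} on ${event}`);
  }
  return next;
}
```

Rejecting unknown transitions explicitly keeps barge-in handling predictable: an interruption can only take effect from states that list `BARGE_IN`, and anything else surfaces as an error rather than a silent no-op.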
| Tool | Description |
|---|---|
| `github_get_repo` | Repository metadata, stars, language |
| `github_list_repos` | List user's repositories |
| `github_list_issues` | List/filter issues by state, labels |
| `github_get_issue` | Issue details + comments |
| `github_create_issue` | Create a new GitHub issue |
| `github_list_prs` | List pull requests by state |
| `github_get_pr` | PR details + diff stats + reviews |
| `github_list_commits` | Recent commit history |
| `github_list_branches` | List repository branches |
| `github_list_workflow_runs` | CI/CD pipeline status (GitHub Actions) |
| `github_get_file` | Read file contents from a repository |
| `github_search_code` | Search code across repositories |
| `github_dry_run_pr` | Preview PR proposal without pushing |
| `data_query` | Query external data API with freshness tracking |
| `data_get_freshness` | Report cache freshness timestamps |
| `memory_recall` | Semantic search across stored memories |
| `memory_what_do_you_remember` | List all facts/preferences about user |
| `memory_store_fact` | Persist a fact or preference |
| `get_current_time` | Current date/time with timezone support |
| `calculate` | Evaluate math expressions |
| `web_fetch` | Fetch any URL or API endpoint |
| `set_reminder` | Set a timed reminder |
Tools execute via a ReAct agent loop — Claude autonomously decides which tools to call, processes results, and may chain multiple tool calls before responding.
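The shape of such a loop can be sketched without any real LLM client. Everything here is hypothetical: `ModelStep` stands in for a call to the model, `ModelTurn` for its parsed reply, and `maxSteps` is an assumed safety limit. It does not reproduce the Anthropic SDK's API.

```typescript
// Hypothetical ReAct-style loop: ask the model, run a tool if requested, repeat.
type ToolCall = { name: string; input: Record<string, unknown> };
type ModelTurn =
  | { kind: "tool_call"; call: ToolCall }
  | { kind: "final"; text: string };
type Message = { role: "user" | "assistant" | "tool"; content: string };
type Tool = (input: Record<string, unknown>) => Promise<string>;
type ModelStep = (history: Message[]) => Promise<ModelTurn>;

async function runAgent(
  userText: string,
  model: ModelStep,
  tools: Map<string, Tool>,
  maxSteps = 5, // assumed cap to avoid unbounded tool chains
): Promise<string> {
  const history: Message[] = [{ role: "user", content: userText }];

  for (let step = 0; step < maxSteps; step++) {
    const turn = await model(history);
    if (turn.kind === "final") return turn.text;

    // Run the requested tool and feed its output back to the model.
    const tool = tools.get(turn.call.name);
    const result = tool !== undefined
      ? await tool(turn.call.input)
      : `error: unknown tool ${turn.call.name}`;
    history.push({ role: "assistant", content: `call ${turn.call.name}` });
    history.push({ role: "tool", content: result });
  }
  return "I could not complete that request within the step limit.";
}
```

Returning an error string for an unknown tool, rather than throwing, lets the model see the failure and either pick another tool or decline to answer.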
Type / in the text input to see available commands:
| Command | Description |
|---|---|
| `/help` | Show capabilities and command reference |
| `/tools` | List all registered tools with descriptions |
| `/clear` | Clear conversation history |
| `/verify` | Enable verification mode (confidence + citations) |
| `/status` | Show pipeline status, session info, latency stats |
| Mode | How | API Key |
|---|---|---|
| Text | Type in input box | None |
| Voice | Browser-native SpeechRecognition (Chrome/Edge/Safari) | None |
| Whisper | OpenAI Whisper server-side STT | OPENAI_API_KEY |
The system auto-detects available API keys and uses real services when configured, mock implementations otherwise. Each service (STT, LLM, TTS) can independently be real or mock.
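The per-service switching could be expressed as a small pure function over the environment. This is a sketch under assumptions: `resolveServiceModes` and the `localTtsAvailable` flag are hypothetical, `stt` refers only to server-side Whisper (browser speech recognition needs no key), and the real detection logic may differ.

```typescript
// Hypothetical resolver: each service is independently "real" or "mock".
type ServiceMode = "real" | "mock";
interface ServiceModes { stt: ServiceMode; llm: ServiceMode; tts: ServiceMode }

function resolveServiceModes(
  env: Record<string, string | undefined>,
  localTtsAvailable: boolean, // assumed probe result for the local edge-tts server
): ServiceModes {
  // A key counts as present only if it is set and non-blank.
  const has = (key: string): boolean => (env[key] ?? "").trim().length > 0;
  return {
    stt: has("OPENAI_API_KEY") ? "real" : "mock",
    llm: has("ANTHROPIC_API_KEY") ? "real" : "mock",
    // Local TTS needs no key; ElevenLabs acts as the keyed fallback.
    tts: localTtsAvailable || has("ELEVENLABS_API_KEY") ? "real" : "mock",
  };
}
```

Keeping the decision in one pure function makes the mixed configurations (for example real LLM with mock STT) easy to unit test.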
┌─────────────────┐  ┌───────────────────┐  ┌──────────────────┐
│ Conversation    │  │ Semantic Memory   │  │ Preferences      │
│ Store (PG)      │  │ (pgvector)        │  │ (PG)             │
│                 │  │                   │  │                  │
│ Immutable log   │  │ Embeddings of     │  │ JSON rules       │
│ of transcripts, │  │ facts, summaries, │  │ evaluated        │
│ tool calls,     │  │ preferences with  │  │ before every     │
│ responses       │  │ cosine similarity │  │ response         │
│                 │  │ search            │  │                  │
└─────────────────┘  └───────────────────┘  └──────────────────┘
- Immutable audit log: Every message, tool call, and response stored in PostgreSQL
- Semantic retrieval: pgvector embeddings for "what do you remember about me?"
- Preference policies: Enforced constraints ("always use metric units")
- Post-session summarization: Claude Haiku auto-summarizes after each session
- Latency dashboard: Per-stage timing in status bar (STT:120ms LLM:800ms TTS:300ms)
- Audio Health Panel: Input level (dBFS), VAD state, AEC status, TTS state, barge-in count, pipeline state
- Prometheus metrics: `/api/metrics` endpoint with histograms for STT/LLM/TTS latency, tool call counts, active sessions
- Structured logging: Pino with correlation IDs per session
| Endpoint | Description |
|---|---|
| `GET /health` | Health check with DB/Redis status |
| `GET /api/status` | API key configuration status |
| `GET /api/sessions` | List recent sessions |
| `GET /api/sessions/:id` | Session details + message history |
| `GET /api/sessions/:id/transcript` | Formatted transcript |
| `GET /api/memory/:userId` | List stored memories |
| `GET /api/preferences/:userId` | List active preferences |
| `POST /api/preferences/:userId` | Add a preference rule |
| `GET /api/metrics` | Prometheus metrics |
cd server && npm test # 86 unit tests (vitest)
npx tsx scripts/test-db.ts # Database integration tests
npx tsx scripts/test-latency.ts # End-to-end latency measurement
| Suite | Tests | Coverage |
|---|---|---|
| State Machine | 35 | All transitions, events, working cue timer |
| Text Chunker | 18 | Sentence boundaries, flush, min chunk |
| Transcript Buffer | 15 | Interim/final, utterance commit |
| Protocol | 18 | Encode/decode roundtrip, frame types |
├── shared/protocol.ts # WebSocket message types (single source of truth)
├── server/src/
│ ├── orchestrator/ # State machine, pipeline, barge-in, session
│ ├── stt/ # Whisper STT, browser speech, mock
│ ├── llm/ # Claude streaming, ReAct loop, truth guard
│ ├── tts/ # Local TTS (edge-tts), ElevenLabs, mock
│ ├── tools/ # Tool registry, GitHub, Data API, Memory, Commands
│ ├── memory/ # Conversation store, semantic memory, embeddings
│ ├── api/ # REST endpoints
│ ├── db/ # PostgreSQL, Redis, migrations
│ └── observability/ # Latency tracker, Prometheus metrics, Pino logger
├── client/src/
│ ├── components/ # VoiceOrb, TranscriptPane, AudioHealthPanel, etc.
│ ├── hooks/ # useWebSocket, useAudioCapture, useWebSpeech
│ └── audio/ # AudioWorklet, playback manager, working cue
└── scripts/ # Dev, DB test, latency test, seed preferences
92 source files, ~9,500 lines of TypeScript + Python.
- Single WebSocket with 1-byte frame prefix (`0x00`=JSON, `0x01`=audio): simplifies connection management
- Per-service mock/real switching: each of STT, LLM, TTS independently uses real or mock based on API key presence
- Sentence-boundary TTS chunking — buffers LLM tokens until sentence ends for natural prosody
- Dual endpointing — Deepgram's silence-based (300ms) + UtteranceEnd (1000ms) for noisy environments
- Evidence-first truth contract — prompt engineering + tool output tracking, not post-hoc filtering
- Auto-reconnect with exponential backoff — 1s → 2s → 4s → 8s → 10s max
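The reconnect schedule in the last item corresponds to a capped doubling delay. A minimal sketch, with `reconnectDelayMs` and its parameter names as assumptions:

```typescript
// Hypothetical helper: delay before reconnect attempt N (0-based),
// doubling from 1 s and capped at 10 s.
function reconnectDelayMs(attempt: number, baseMs = 1000, maxMs = 10000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Delays for the first six attempts.
const schedule = [0, 1, 2, 3, 4, 5].map((attempt) => reconnectDelayMs(attempt));
// schedule is [1000, 2000, 4000, 8000, 10000, 10000]
```

A client would typically reset the attempt counter to zero once a connection succeeds, so a later disconnect starts again from the 1 s delay.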
See .env.example for the complete list. Required for full functionality:
| Variable | Required | Description |
|---|---|---|
| `ANTHROPIC_API_KEY` | Yes | Claude API for LLM |
| `OPENAI_API_KEY` | Recommended | Whisper STT + embeddings for semantic memory |
| `GITHUB_TOKEN` | Recommended | GitHub tool access (repos, issues, PRs, etc.) |
| `ELEVENLABS_API_KEY` | Optional | ElevenLabs TTS (fallback if local TTS unavailable) |