Jarvis — Real-Time Voice Assistant

A real-time voice assistant built around its audio path: adaptive jitter buffering, spectral analysis, AGC-normalized capture, tool-grounded responses, and per-stage latency tracking (measured figures are listed under Architecture). Built for frontline operational use where audio quality, speed, and interruptibility are non-negotiable.

Architecture

Browser (React + Vite)              Server (Node.js + TypeScript)
┌──────────────────────────┐        ┌────────────────────────────────────┐
│  AudioWorklet (mic)      │──PCM──▸│  Whisper STT / Browser Speech      │
│  Web Speech API (primary)│        │  Claude LLM (tool use + ReAct)     │
│  Playback Manager        │◂──PCM──│  Local TTS (edge-tts) / ElevenLabs │
│  VoiceOrb + Premium UI   │◂─JSON──│  6-State Machine (orchestrator)    │
│  Audio Health Panel      │        │  24 Tools (GitHub, Util, Memory)   │
│  Barge-In Detection      │──JSON─▸│  PostgreSQL + pgvector + Redis     │
└──────────────────────────┘        └────────────────────────────────────┘

Pipeline: Text/Speech → Whisper STT or Browser Speech API → Claude (with tools) → Local TTS (edge-tts) → Audio Playback

Measured latency: LLM time to first token (TTFT) ~1.3 s, TTS first chunk ~0.6 s, end-to-end ~2-5 s depending on tool calls.

Audio Engineering

The audio layer is engineered for production environments rather than treated as a thin wrapper around a speech API.

Capture Pipeline (AudioWorklet)

Automatic Gain Control (AGC): Normalizes to -18 dBFS target with 10ms attack / 100ms release envelope
Clipping detection: Counts samples at +/-1.0, both pre- and post-AGC
Noise floor estimation: Running minimum RMS over 2-second sliding window
Peak tracking: Exponential decay (300ms time constant) for smooth metering
Per-buffer metrics: RMS, peak, clipping count, noise floor, AGC gain — all emitted at buffer rate
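
A rough sketch of how the per-buffer metrics and the AGC envelope listed above could be computed. The target level and time constants come from the list; the function names, the `BufferMetrics` shape, and the one-pole smoothing formula are assumptions made for illustration and are not taken from the repository.

```typescript
// Hypothetical sketch of per-buffer metering and an AGC gain update.
// The target and time constants come from the list above; all names are assumed.
const AGC_TARGET_DBFS = -18;
const AGC_ATTACK_MS = 10;
const AGC_RELEASE_MS = 100;

interface BufferMetrics {
  rms: number;     // linear RMS of the buffer (0..1 for unclipped audio)
  peak: number;    // largest absolute sample value
  clipped: number; // number of samples at or beyond +/-1.0
}

function measureBuffer(samples: Float32Array): BufferMetrics {
  let sumSquares = 0;
  let peak = 0;
  let clipped = 0;
  for (let i = 0; i < samples.length; i++) {
    const s = samples[i] ?? 0;
    const magnitude = Math.abs(s);
    sumSquares += s * s;
    if (magnitude > peak) peak = magnitude;
    if (magnitude >= 1.0) clipped++;
  }
  const rms = samples.length > 0 ? Math.sqrt(sumSquares / samples.length) : 0;
  return { rms, peak, clipped };
}

function dbfsToLinear(dbfs: number): number {
  return Math.pow(10, dbfs / 20);
}

// One-pole smoothing toward the gain that would put `rms` on target.
// Lowering the gain (signal too loud) uses the fast attack constant;
// raising it (signal too quiet) uses the slower release constant.
function nextAgcGain(currentGain: number, rms: number, bufferMs: number): number {
  if (rms < 1e-6) return currentGain; // near-silence: hold the gain steady
  const desiredGain = dbfsToLinear(AGC_TARGET_DBFS) / rms;
  const tauMs = desiredGain < currentGain ? AGC_ATTACK_MS : AGC_RELEASE_MS;
  const alpha = 1 - Math.exp(-bufferMs / tauMs);
  return currentGain + alpha * (desiredGain - currentGain);
}
```

Holding the gain during near-silence is one way to avoid amplifying the noise floor between utterances; a real implementation might gate on the noise floor estimate instead.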

Audio Frame Protocol

[0x01] [version:1] [codec:1] [sampleRate:2 LE] [seqNum:4 LE] [timestamp:4 LE] [PCM data]

Versioned wire format for forward/backward compatibility
Codec field supports PCM-16 (0x01), PCM-24 (0x02), extensible to Opus
Sequence numbers for packet loss and reorder detection
Timestamps for jitter measurement and lip-sync
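
The header layout above adds up to 13 bytes (1 + 1 + 1 + 2 + 4 + 4) ahead of the PCM payload. The following is a minimal sketch of an encoder and decoder for that layout; the `PcmFrame` type and function names are hypothetical, and treating `sampleRate` as Hz and `timestamp` as an unsigned 32-bit value are assumptions, since the README does not state the units.

```typescript
// Hypothetical encoder/decoder for the audio frame header shown above.
const FRAME_TYPE_AUDIO = 0x01;
const CODEC_PCM16 = 0x01;
const HEADER_BYTES = 13; // type + version + codec + sampleRate(2) + seqNum(4) + timestamp(4)

interface PcmFrame {
  version: number;    // uint8
  codec: number;      // uint8, e.g. CODEC_PCM16
  sampleRate: number; // uint16 (assumed to be Hz)
  seqNum: number;     // uint32
  timestamp: number;  // uint32 (unit not specified in the README)
  pcm: Uint8Array;    // raw PCM payload
}

function encodeAudioFrame(frame: PcmFrame): Uint8Array {
  const out = new Uint8Array(HEADER_BYTES + frame.pcm.length);
  const view = new DataView(out.buffer);
  view.setUint8(0, FRAME_TYPE_AUDIO);
  view.setUint8(1, frame.version);
  view.setUint8(2, frame.codec);
  view.setUint16(3, frame.sampleRate, true); // true = little-endian
  view.setUint32(5, frame.seqNum, true);
  view.setUint32(9, frame.timestamp, true);
  out.set(frame.pcm, HEADER_BYTES);
  return out;
}

function decodeAudioFrame(bytes: Uint8Array): PcmFrame {
  if (bytes.length < HEADER_BYTES || bytes[0] !== FRAME_TYPE_AUDIO) {
    throw new Error("not an audio frame");
  }
  // Respect byteOffset so the decoder also works on views into larger buffers.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return {
    version: view.getUint8(1),
    codec: view.getUint8(2),
    sampleRate: view.getUint16(3, true),
    seqNum: view.getUint32(5, true),
    timestamp: view.getUint32(9, true),
    pcm: bytes.subarray(HEADER_BYTES),
  };
}
```

Note that a 2-byte sample rate field tops out at 65,535, which covers 16 kHz and 48 kHz capture but not higher rates.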

Playback (Adaptive Jitter Buffer)

Adaptive buffer depth: Targets 2x measured jitter (50ms min, 500ms max)
Underrun/overrun tracking: Counters exported as Prometheus metrics
Sequence-aware scheduling: Detects out-of-order and dropped packets
Gap-free playback: chunks are scheduled back-to-back against AudioContext.currentTime
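
The adaptive depth rule above (2x measured jitter, clamped to 50-500 ms) can be sketched as follows. The README does not say how jitter is measured, so this example borrows the smoothed interarrival estimate from RTP (RFC 3550) as a stand-in; the class and method names are hypothetical.

```typescript
// Hypothetical sketch: jitter estimate plus clamped buffer target.
const MIN_DEPTH_MS = 50;
const MAX_DEPTH_MS = 500;

class JitterEstimator {
  private jitterMs = 0;
  private lastTransitMs: number | null = null;

  // arrivalMs: local receive time; timestampMs: sender timestamp from the frame header.
  // Returns the updated smoothed jitter in milliseconds.
  update(arrivalMs: number, timestampMs: number): number {
    const transitMs = arrivalMs - timestampMs;
    if (this.lastTransitMs !== null) {
      const deltaMs = Math.abs(transitMs - this.lastTransitMs);
      // RFC 3550 style smoothing: move 1/16 of the way toward the new sample.
      this.jitterMs += (deltaMs - this.jitterMs) / 16;
    }
    this.lastTransitMs = transitMs;
    return this.jitterMs;
  }

  // Target depth is twice the measured jitter, clamped to the configured bounds.
  targetDepthMs(): number {
    return Math.min(MAX_DEPTH_MS, Math.max(MIN_DEPTH_MS, 2 * this.jitterMs));
  }
}
```

Because only differences in transit time are used, the sender and receiver clocks do not need to be synchronized for this estimate to work.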

Real-Time Diagnostics (Audio Health Panel)

Real-time FFT spectral analyzer (64-band, 60fps canvas)
Input level meter (dBFS), VAD state, AEC status
AGC gain, noise floor, SNR estimate, clipping counter
Network RTT and jitter (WebSocket ping/pong)
TTS playback state, buffer depth, barge-in counter

Latency Budget (10-stage tracking)

[Capture] → [AGC] → [Encode] → [Network↓] → [STT] → [LLM-TTFT]
                                                          ↓
                                    [TTS-first] ← [LLM-gen] → [Tools]
                                         ↓
                                    [Network↑] → [Jitter Buffer] → [Playback]

Each stage is measured with millisecond precision. P50/P95 percentiles are computed over a rolling 100-sample window. The full breakdown is available through the /status command and the Prometheus export.
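
As an illustration of the rolling-window percentiles described above, here is a minimal sketch of a per-stage tracker. The class name and the nearest-rank percentile method are assumptions; the README only specifies the window size and the P50/P95 outputs.

```typescript
// Hypothetical per-stage latency tracker with a fixed-size rolling window.
class RollingPercentiles {
  private readonly windowSize: number;
  private samples: number[] = [];

  constructor(windowSize: number = 100) {
    this.windowSize = windowSize;
  }

  // Records one latency sample in milliseconds, evicting the oldest when full.
  record(ms: number): void {
    this.samples.push(ms);
    if (this.samples.length > this.windowSize) this.samples.shift();
  }

  // Nearest-rank percentile over the current window; null when empty.
  percentile(p: number): number | null {
    if (this.samples.length === 0) return null;
    const sorted = [...this.samples].sort((a, b) => a - b);
    const rank = Math.ceil((p / 100) * sorted.length);
    return sorted[Math.max(0, rank - 1)] ?? null;
  }
}

// Example: one tracker per stage name (stage names here are illustrative).
const sttLatency = new RollingPercentiles();
sttLatency.record(120);
sttLatency.record(135);
console.log("STT p50:", sttLatency.percentile(50), "p95:", sttLatency.percentile(95));
```

Sorting a 100-element copy on each read is cheap at this window size; a larger window would call for a streaming quantile structure instead.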

Quick Start

Requirements: Node.js 20+, Docker, Python 3 with pip (for edge-tts), ffmpeg

# 1. Install
git clone https://github.com/yashkuceriya/Jarvis.git && cd Jarvis
npm install
pip3 install edge-tts    # Local TTS (free, no API key)
brew install ffmpeg      # Audio format conversion (macOS; use your system package manager elsewhere)

# 2. Configure
cp .env.example .env
# Required: ANTHROPIC_API_KEY
# Recommended: OPENAI_API_KEY, GITHUB_TOKEN

# 3. Start infrastructure
docker compose up -d    # PostgreSQL (pgvector) + Redis

# 4. Run migrations
npx knex migrate:latest --knexfile knexfile.ts

# 5. Launch (3 services)
python3 server/tts-server.py &          # Local TTS on :5757
npx tsx server/src/index.ts &           # Server on :3131
cd client && npx vite --port 4173 &     # Client on :4173

Open http://localhost:4173. The client connects automatically; type in the text box or switch to Voice mode.

SLO-Driven Design

SLO	Implementation	Metric
Cadence	800ms working cue timer; audible chime during long tool calls	Never leaves user in silence
Latency	Per-stage budget tracking: STT → LLM TTFT → TTS first chunk	Displayed in status bar
Truth	Tool-grounded responses only; refuses when evidence unavailable	Zero fabricated data
Interruptibility	Barge-in stops TTS, aborts LLM, preserves context with `[interrupted]`	Sub-200ms interruption

State Machine

IDLE ──▸ LISTENING_PASSIVE ──▸ LISTENING_ACTIVE ──▸ PROCESSING ──▸ SPEAKING
              ▴                                                      │
              └──────────────── TTS_DONE ◂───────────────────────────┘
                                                    BARGE_IN ──▸ CANCELLING

The machine has 6 states and 18 valid transitions. The working cue fires automatically after 800ms in PROCESSING. Barge-in is a first-class state transition that preserves conversation context.
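
A table-driven sketch of the state machine is shown below. The six state names and the `TTS_DONE` and `BARGE_IN` events come from the diagram above; every other event name, the barge-in transition out of PROCESSING, and the exit from CANCELLING are assumptions, and only a subset of the 18 transitions is modeled.

```typescript
// Hypothetical transition table covering a subset of the real transitions.
type VoiceState =
  | "IDLE"
  | "LISTENING_PASSIVE"
  | "LISTENING_ACTIVE"
  | "PROCESSING"
  | "SPEAKING"
  | "CANCELLING";

// TTS_DONE and BARGE_IN appear in the diagram; the other event names are assumed.
type VoiceEvent =
  | "START"
  | "SPEECH_START"
  | "UTTERANCE_END"
  | "RESPONSE_READY"
  | "TTS_DONE"
  | "BARGE_IN"
  | "CANCEL_DONE";

const TRANSITIONS: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  IDLE: { START: "LISTENING_PASSIVE" },
  LISTENING_PASSIVE: { SPEECH_START: "LISTENING_ACTIVE" },
  LISTENING_ACTIVE: { UTTERANCE_END: "PROCESSING" },
  PROCESSING: { RESPONSE_READY: "SPEAKING", BARGE_IN: "CANCELLING" },
  SPEAKING: { TTS_DONE: "LISTENING_PASSIVE", BARGE_IN: "CANCELLING" },
  CANCELLING: { CANCEL_DONE: "LISTENING_ACTIVE" }, // assumed: resume listening to the interrupter
};

// Returns the next state, or throws when the event is not valid in this state.
function transition(state: VoiceState, event: VoiceEvent): VoiceState {
  const next = TRANSITIONS[state][event];
  if (next === undefined) {
    throw new Error(`invalid transition: ${state} + ${event}`);
  }
  return next;
}
```

Keeping the transitions in a single table makes it straightforward to test every valid pair and to assert that everything else is rejected.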

Tools (24 registered)

Tool	Description
`github_get_repo`	Repository metadata, stars, language
`github_list_repos`	List user's repositories
`github_list_issues`	List/filter issues by state, labels
`github_get_issue`	Issue details + comments
`github_create_issue`	Create a new GitHub issue
`github_list_prs`	List pull requests by state
`github_get_pr`	PR details + diff stats + reviews
`github_list_commits`	Recent commit history
`github_list_branches`	List repository branches
`github_list_workflow_runs`	CI/CD pipeline status (GitHub Actions)
`github_get_file`	Read file contents from a repository
`github_search_code`	Search code across repositories
`github_dry_run_pr`	Preview PR proposal without pushing
`data_query`	Query external data API with freshness tracking
`data_get_freshness`	Report cache freshness timestamps
`memory_recall`	Semantic search across stored memories
`memory_what_do_you_remember`	List all facts/preferences about user
`memory_store_fact`	Persist a fact or preference
`get_current_time`	Current date/time with timezone support
`calculate`	Evaluate math expressions
`web_fetch`	Fetch any URL or API endpoint
`set_reminder`	Set a timed reminder

Tools execute via a ReAct agent loop: Claude autonomously decides which tools to call, processes the results, and may chain multiple tool calls before responding. The table above lists 22 of the 24 registered tools; the /tools command lists all of them.
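
The shape of such a loop can be sketched without any particular SDK. In the example below the model is an injected function, so nothing here reflects the actual Claude API or this project's code; `ModelTurn`, `runAgentLoop`, and the step limit are hypothetical, and only the tool name `calculate` is taken from the table above.

```typescript
// Hypothetical agent loop: the model either requests a tool or gives a final answer.
type ModelTurn =
  | { kind: "tool_call"; tool: string; input: unknown }
  | { kind: "final"; text: string };

type ToolFn = (input: unknown) => Promise<string>;
type Model = (transcript: string[]) => Promise<ModelTurn>;

async function runAgentLoop(
  model: Model,
  tools: Record<string, ToolFn>,
  userText: string,
  maxSteps: number = 5, // assumed safety limit on chained tool calls
): Promise<string> {
  const transcript: string[] = [`user: ${userText}`];
  for (let step = 0; step < maxSteps; step++) {
    const turn = await model(transcript);
    if (turn.kind === "final") return turn.text;
    // Run the requested tool and feed its output back to the model.
    const tool = tools[turn.tool];
    const result = tool ? await tool(turn.input) : `error: unknown tool ${turn.tool}`;
    transcript.push(`tool ${turn.tool}: ${result}`);
  }
  return "I could not finish within the step limit.";
}

// Usage with a scripted stand-in model and a stubbed `calculate` tool.
const demoModel: Model = async (transcript) =>
  transcript.length === 1
    ? { kind: "tool_call", tool: "calculate", input: "2+2" }
    : { kind: "final", text: "The answer is 4." };

runAgentLoop(demoModel, { calculate: async () => "4" }, "what is 2+2?").then((answer) =>
  console.log(answer),
);
```

Feeding unknown-tool errors back into the transcript, rather than throwing, gives the model a chance to recover on the next step.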

Commands

Type / in the text input to see available commands:

Command	Description
`/help`	Show capabilities and command reference
`/tools`	List all registered tools with descriptions
`/clear`	Clear conversation history
`/verify`	Enable verification mode (confidence + citations)
`/status`	Show pipeline status, session info, latency stats

Input Modes

Mode	How	API Key
Text	Type in input box	None
Voice	Browser-native SpeechRecognition (Chrome/Edge/Safari)	None
Whisper	OpenAI Whisper server-side STT	`OPENAI_API_KEY`

The system auto-detects available API keys and uses real services when configured, mock implementations otherwise. Each service (STT, LLM, TTS) can independently be real or mock.
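
One way the per-service switching described above might look is sketched here. The environment variable names come from this README; the function name, the `localTtsAvailable` flag, and the exact rule for TTS (local edge-tts or an ElevenLabs key counts as real) are assumptions.

```typescript
// Hypothetical sketch of per-service mock/real selection.
type ServiceMode = "real" | "mock";

interface ServiceModes {
  stt: ServiceMode;
  llm: ServiceMode;
  tts: ServiceMode;
}

function resolveServiceModes(
  env: Record<string, string | undefined>,
  localTtsAvailable: boolean, // assumed flag: true when the local edge-tts server is reachable
): ServiceModes {
  // Treat missing, empty, and whitespace-only values as "not configured".
  const hasKey = (key: string): boolean => (env[key] ?? "").trim().length > 0;
  return {
    stt: hasKey("OPENAI_API_KEY") ? "real" : "mock",
    llm: hasKey("ANTHROPIC_API_KEY") ? "real" : "mock",
    tts: (localTtsAvailable || hasKey("ELEVENLABS_API_KEY")) ? "real" : "mock",
  };
}
```

In Node.js this could be called as `resolveServiceModes(process.env, true)`; passing the environment in as a parameter keeps the function easy to test.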

Memory Architecture

┌─────────────────┐     ┌───────────────────┐     ┌──────────────────┐
│   Conversation   │     │  Semantic Memory   │     │   Preferences    │
│   Store (PG)     │     │  (pgvector)        │     │   (PG)           │
│                  │     │                    │     │                  │
│  Immutable log   │     │  Embeddings of     │     │  JSON rules      │
│  of transcripts, │     │  facts, summaries, │     │  evaluated       │
│  tool calls,     │     │  preferences with  │     │  before every    │
│  responses       │     │  cosine similarity │     │  response        │
│                  │     │  search            │     │                  │
└─────────────────┘     └───────────────────┘     └──────────────────┘

Immutable audit log: Every message, tool call, and response stored in PostgreSQL
Semantic retrieval: pgvector embeddings for "what do you remember about me?"
Preference policies: Enforced constraints ("always use metric units")
Post-session summarization: Claude Haiku auto-summarizes after each session
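
To make the semantic retrieval step concrete, the sketch below ranks stored memories by cosine similarity in plain TypeScript. It is an in-memory stand-in for what pgvector does inside PostgreSQL, not the project's query code; the `MemoryRow` shape and function names are hypothetical, and the tiny two-dimensional vectors stand in for real embeddings.

```typescript
// Hypothetical in-memory version of cosine-similarity recall.
interface MemoryRow {
  text: string;
  embedding: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0; // zero vectors have no direction
  return dot / Math.sqrt(normA * normB);
}

// Returns the k rows most similar to the query embedding, best match first.
function recallTopK(query: number[], rows: MemoryRow[], k: number): MemoryRow[] {
  return rows
    .map((row) => ({ row, score: cosineSimilarity(query, row.embedding) }))
    .sort((p, q) => q.score - p.score)
    .slice(0, k)
    .map((scored) => scored.row);
}

const storedMemories: MemoryRow[] = [
  { text: "prefers metric units", embedding: [1, 0] },
  { text: "works on the Jarvis repo", embedding: [0, 1] },
  { text: "asked about unit conversion", embedding: [1, 1] },
];
console.log(recallTopK([1, 0], storedMemories, 2).map((m) => m.text));
```

In the database itself the same ranking would typically be expressed as an ORDER BY on pgvector's cosine distance operator, which lets an index do the work instead of a full scan.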

Observability

Latency dashboard: Per-stage timing in status bar (e.g. STT:120ms LLM:800ms TTS:300ms)
Audio Health Panel: Input level (dBFS), VAD state, AEC status, TTS state, barge-in count, pipeline state
Prometheus metrics: /api/metrics endpoint with histograms for STT/LLM/TTS latency, tool call counts, active sessions
Structured logging: Pino with correlation IDs per session

REST API

Endpoint	Description
`GET /health`	Health check with DB/Redis status
`GET /api/status`	API key configuration status
`GET /api/sessions`	List recent sessions
`GET /api/sessions/:id`	Session details + message history
`GET /api/sessions/:id/transcript`	Formatted transcript
`GET /api/memory/:userId`	List stored memories
`GET /api/preferences/:userId`	List active preferences
`POST /api/preferences/:userId`	Add a preference rule
`GET /api/metrics`	Prometheus metrics

Testing

cd server && npm test          # 86 unit tests (vitest)
npx tsx scripts/test-db.ts     # Database integration tests
npx tsx scripts/test-latency.ts # End-to-end latency measurement

Suite	Tests	Coverage
State Machine	35	All transitions, events, working cue timer
Text Chunker	18	Sentence boundaries, flush, min chunk
Transcript Buffer	15	Interim/final, utterance commit
Protocol	18	Encode/decode roundtrip, frame types

Project Structure

├── shared/protocol.ts          # WebSocket message types (single source of truth)
├── server/src/
│   ├── orchestrator/           # State machine, pipeline, barge-in, session
│   ├── stt/                    # Whisper STT, browser speech, mock
│   ├── llm/                    # Claude streaming, ReAct loop, truth guard
│   ├── tts/                    # Local TTS (edge-tts), ElevenLabs, mock
│   ├── tools/                  # Tool registry, GitHub, Data API, Memory, Commands
│   ├── memory/                 # Conversation store, semantic memory, embeddings
│   ├── api/                    # REST endpoints
│   ├── db/                     # PostgreSQL, Redis, migrations
│   └── observability/          # Latency tracker, Prometheus metrics, Pino logger
├── client/src/
│   ├── components/             # VoiceOrb, TranscriptPane, AudioHealthPanel, etc.
│   ├── hooks/                  # useWebSocket, useAudioCapture, useWebSpeech
│   └── audio/                  # AudioWorklet, playback manager, working cue
└── scripts/                    # Dev, DB test, latency test, seed preferences

92 source files, ~9,500 lines of TypeScript + Python.

Key Design Decisions

Single WebSocket with 1-byte frame prefix (0x00=JSON, 0x01=audio) — simplifies connection management
Per-service mock/real switching — each of STT, LLM, TTS independently uses real or mock based on API key presence
Sentence-boundary TTS chunking — buffers LLM tokens until sentence ends for natural prosody
Dual endpointing — Deepgram's silence-based (300ms) + UtteranceEnd (1000ms) for noisy environments
Evidence-first truth contract — prompt engineering + tool output tracking, not post-hoc filtering
Auto-reconnect with exponential backoff — 1s → 2s → 4s → 8s → 10s max
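
Two of the decisions above are small enough to sketch: the capped reconnect backoff and sentence-boundary chunking of streamed tokens. Both are illustrative only. The function and class names are hypothetical, and the chunker's rule (a sentence ends at `.`, `!`, or `?` followed by whitespace) is an assumption; the real text chunker may also enforce a minimum chunk size.

```typescript
// Hypothetical reconnect delay: 1s, 2s, 4s, 8s, then capped at 10s.
function reconnectDelayMs(attempt: number, baseMs: number = 1000, maxMs: number = 10000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}

// Hypothetical chunker: buffers streamed tokens and releases whole sentences.
class SentenceChunker {
  private buffer = "";

  // Appends a token and returns any sentences that are now complete.
  // A sentence is released only once whitespace follows its end punctuation,
  // so "3." followed by "14" is not split in the middle of a number.
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    let match = /^([\s\S]*?[.!?]+)\s+/.exec(this.buffer);
    while (match !== null) {
      sentences.push((match[1] ?? "").trim());
      this.buffer = this.buffer.slice(match[0].length);
      match = /^([\s\S]*?[.!?]+)\s+/.exec(this.buffer);
    }
    return sentences;
  }

  // Returns whatever is left when the LLM stream ends, or null if nothing remains.
  flush(): string | null {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? rest : null;
  }
}
```

Waiting for whitespace after the punctuation costs one extra token of latency per sentence, which is the trade-off for not cutting abbreviations and decimals in half.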

Environment Variables

See .env.example for the complete list. The main variables are:

Variable	Required	Description
`ANTHROPIC_API_KEY`	Yes	Claude API for LLM
`OPENAI_API_KEY`	Recommended	Whisper STT + embeddings for semantic memory
`GITHUB_TOKEN`	Recommended	GitHub tool access (repos, issues, PRs, etc.)
`ELEVENLABS_API_KEY`	Optional	ElevenLabs TTS (fallback if local TTS unavailable)
