Jarvis — Real-Time Voice Assistant

A production-grade, real-time voice assistant built around a carefully engineered audio path: adaptive jitter buffering, spectral analysis, AGC-normalized capture, tool-grounded responses, and a tracked latency budget (roughly 2-5s end to end, depending on tool calls). Built for frontline operational use where audio quality, speed, and interruptibility are non-negotiable.

Architecture

Browser (React + Vite)              Server (Node.js + TypeScript)
┌─────────────────────────┐         ┌──────────────────────────────────┐
│  AudioWorklet (mic)      │──PCM──▸│  Whisper STT / Browser Speech     │
│  Web Speech API (primary)│        │  Claude LLM (tool use + ReAct)   │
│  Playback Manager        │◂─PCM──│  Local TTS (edge-tts) / 11Labs   │
│  VoiceOrb + Premium UI   │◂─JSON─│  6-State Machine (orchestrator)  │
│  Audio Health Panel      │        │  24 Tools (GitHub, Util, Memory) │
│  Barge-In Detection      │──JSON─▸│  PostgreSQL + pgvector + Redis   │
└─────────────────────────┘         └──────────────────────────────────┘

Pipeline: Text/Speech → Whisper STT or Browser Speech API → Claude (with tools) → Local TTS (edge-tts) → Audio Playback

Measured latency: LLM TTFT ~1.3s, TTS first chunk ~0.6s, E2E ~2-5s depending on tool calls.

Audio Engineering

This isn't a toy voice wrapper — the audio layer is engineered for production environments.

Capture Pipeline (AudioWorklet)

  • Automatic Gain Control (AGC): Normalizes to -18 dBFS target with 10ms attack / 100ms release envelope
  • Clipping detection: Counts samples at +/-1.0, both pre- and post-AGC
  • Noise floor estimation: Running minimum RMS over 2-second sliding window
  • Peak tracking: Exponential decay (300ms time constant) for smooth metering
  • Per-buffer metrics: RMS, peak, clipping count, noise floor, AGC gain — all emitted at buffer rate
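As a rough illustration of the per-buffer metrics and the AGC envelope listed above, the sketch below computes RMS, peak, and clipping for one buffer and then moves the gain toward the -18 dBFS target using a one-pole smoother. It is not the project's worklet code: the function names, the `MAX_GAIN` cap, and the choice of a one-pole smoother are assumptions made for the example.

```typescript
// Illustrative sketch only: names, structure, and MAX_GAIN are assumptions.
const TARGET_DBFS = -18; // AGC target from the list above
const ATTACK_MS = 10;    // used when gain must fall (input too loud)
const RELEASE_MS = 100;  // used when gain may rise (input too quiet)
const MAX_GAIN = 10;     // hypothetical cap (+20 dB) so silence is not amplified without bound

interface BufferMetrics {
  rmsDbfs: number;
  peak: number;
  clippedSamples: number;
}

// Convert a linear amplitude (1.0 = full scale) to dBFS, guarding against log(0).
function toDbfs(linear: number): number {
  return 20 * Math.log10(Math.max(linear, 1e-9));
}

// Compute RMS level, peak magnitude, and the number of samples at or beyond +/-1.0.
function analyzeBuffer(samples: Float32Array): BufferMetrics {
  let sumSquares = 0;
  let peak = 0;
  let clippedSamples = 0;
  for (let i = 0; i < samples.length; i++) {
    const s = samples[i] ?? 0;
    const magnitude = Math.abs(s);
    sumSquares += s * s;
    if (magnitude > peak) peak = magnitude;
    if (magnitude >= 1.0) clippedSamples++;
  }
  const rms = samples.length > 0 ? Math.sqrt(sumSquares / samples.length) : 0;
  return { rmsDbfs: toDbfs(rms), peak, clippedSamples };
}

// Move the current gain toward the gain that would bring the input to the target level.
// bufferMs is the duration of one buffer; the smoothing factor depends on direction.
function nextGain(currentGain: number, rmsDbfs: number, bufferMs: number): number {
  const desired = Math.min(MAX_GAIN, Math.pow(10, (TARGET_DBFS - rmsDbfs) / 20));
  const timeConstantMs = desired < currentGain ? ATTACK_MS : RELEASE_MS;
  const k = 1 - Math.exp(-bufferMs / timeConstantMs);
  return currentGain + k * (desired - currentGain);
}
```

Using the fast attack when the gain has to drop and the slower release when it may rise is what keeps loud transients from clipping while avoiding audible pumping on quiet passages.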

Audio Frame Protocol

[0x01] [version:1] [codec:1] [sampleRate:2 LE] [seqNum:4 LE] [timestamp:4 LE] [PCM data]
  • Versioned wire format for forward/backward compatibility
  • Codec field supports PCM-16 (0x01), PCM-24 (0x02), extensible to Opus
  • Sequence numbers for packet loss and reorder detection
  • Timestamps for jitter measurement and lip-sync
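The header layout above is 13 bytes followed by the PCM payload. A minimal encode/decode sketch with `DataView` might look like the following. The `AudioFrame` type and function names are hypothetical, and the timestamp unit (assumed here to be milliseconds that wrap at 32 bits) is not stated in this README.

```typescript
// Illustrative sketch of the wire format described above; type and function names are assumptions.
const AUDIO_FRAME_TYPE = 0x01;
const HEADER_BYTES = 13; // 1 type + 1 version + 1 codec + 2 sampleRate + 4 seqNum + 4 timestamp

interface AudioFrame {
  version: number;
  codec: number;      // 0x01 = PCM-16, 0x02 = PCM-24
  sampleRate: number; // must fit in 16 bits
  seqNum: number;     // 32-bit unsigned, little-endian on the wire
  timestamp: number;  // 32-bit unsigned; unit assumed to be milliseconds
  pcm: Uint8Array;
}

function encodeAudioFrame(frame: AudioFrame): Uint8Array {
  const out = new Uint8Array(HEADER_BYTES + frame.pcm.byteLength);
  const view = new DataView(out.buffer);
  view.setUint8(0, AUDIO_FRAME_TYPE);
  view.setUint8(1, frame.version);
  view.setUint8(2, frame.codec);
  view.setUint16(3, frame.sampleRate, true); // true = little-endian
  view.setUint32(5, frame.seqNum, true);
  view.setUint32(9, frame.timestamp, true);
  out.set(frame.pcm, HEADER_BYTES);
  return out;
}

function decodeAudioFrame(bytes: Uint8Array): AudioFrame {
  if (bytes.byteLength < HEADER_BYTES) throw new Error("frame too short");
  // Respect byteOffset so this also works on a view into a larger buffer.
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  if (view.getUint8(0) !== AUDIO_FRAME_TYPE) throw new Error("not an audio frame");
  return {
    version: view.getUint8(1),
    codec: view.getUint8(2),
    sampleRate: view.getUint16(3, true),
    seqNum: view.getUint32(5, true),
    timestamp: view.getUint32(9, true),
    pcm: bytes.subarray(HEADER_BYTES),
  };
}
```

A receiver can compare each decoded `seqNum` with the previous one to spot gaps or reordering before handing the payload to playback.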

Playback (Adaptive Jitter Buffer)

  • Adaptive buffer depth: Targets 2x measured jitter (50ms min, 500ms max)
  • Underrun/overrun tracking: Counters exported as Prometheus metrics
  • Sequence-aware scheduling: Detects out-of-order and dropped packets
  • Gap-free playback: AudioContext.currentTime scheduling for zero-gap audio
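The adaptive depth rule above (twice the measured jitter, clamped to 50-500ms) can be sketched as follows. How the project measures jitter is not described here, so the example borrows the smoothed interarrival estimator from RFC 3550 as a stand-in; the class and function names are hypothetical.

```typescript
// Illustrative sketch: the RFC 3550-style estimator is an assumption, not the project's method.
const MIN_DEPTH_MS = 50;
const MAX_DEPTH_MS = 500;

class JitterEstimator {
  private jitterMs = 0;
  private lastTransitMs: number | null = null;

  // sentMs comes from the frame timestamp, receivedMs from the local clock.
  // Only differences between transit times matter, so the clocks need not be synchronized.
  update(sentMs: number, receivedMs: number): number {
    const transitMs = receivedMs - sentMs;
    if (this.lastTransitMs !== null) {
      const delta = Math.abs(transitMs - this.lastTransitMs);
      this.jitterMs += (delta - this.jitterMs) / 16; // smoothing gain of 1/16 as in RFC 3550
    }
    this.lastTransitMs = transitMs;
    return this.jitterMs;
  }
}

// Target buffer depth: 2x measured jitter, clamped to the documented bounds.
function targetDepthMs(jitterMs: number): number {
  return Math.min(MAX_DEPTH_MS, Math.max(MIN_DEPTH_MS, 2 * jitterMs));
}
```

The lower bound keeps a small safety margin on quiet networks, and the upper bound stops a burst of jitter from adding more than half a second of playback delay.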

Real-Time Diagnostics (Audio Health Panel)

  • Real-time FFT spectral analyzer (64-band, 60fps canvas)
  • Input level meter (dBFS), VAD state, AEC status
  • AGC gain, noise floor, SNR estimate, clipping counter
  • Network RTT and jitter (WebSocket ping/pong)
  • TTS playback state, buffer depth, barge-in counter

Latency Budget (10-stage tracking)

[Capture] → [AGC] → [Encode] → [Network↓] → [STT] → [LLM-TTFT]
                                                          ↓
                                    [TTS-first] ← [LLM-gen] → [Tools]
                                         ↓
                                    [Network↑] → [Jitter Buffer] → [Playback]

Each stage is measured with millisecond precision. P50 and P95 percentiles are computed over a rolling 100-sample window. The full breakdown is available through the /status command and the Prometheus export.
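A rolling-window percentile tracker of the kind described above could look like this sketch. It uses the nearest-rank percentile definition, which is an assumption; the class name and API are hypothetical.

```typescript
// Illustrative sketch: nearest-rank percentiles over the most recent N samples.
class RollingPercentiles {
  private readonly capacity: number;
  private samples: number[] = [];

  constructor(capacity: number = 100) {
    this.capacity = capacity;
  }

  // Record one latency sample, discarding the oldest once the window is full.
  record(ms: number): void {
    this.samples.push(ms);
    if (this.samples.length > this.capacity) this.samples.shift();
  }

  // Nearest-rank percentile (p in 0..100); undefined when no samples exist yet.
  percentile(p: number): number | undefined {
    const n = this.samples.length;
    if (n === 0) return undefined;
    const sorted = this.samples.slice().sort((a, b) => a - b);
    const rank = Math.min(n, Math.max(1, Math.ceil((p * n) / 100)));
    return sorted[rank - 1];
  }
}
```

One tracker per pipeline stage is enough to report P50 and P95 for STT, LLM TTFT, and TTS first chunk independently.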

Quick Start

Requirements: Node.js 20+, Docker, Python 3 with pip (for edge-tts), ffmpeg

# 1. Install
git clone https://github.com/yashkuceriya/Jarvis.git && cd Jarvis
npm install
pip3 install edge-tts    # Local TTS (free, no API key)
brew install ffmpeg      # Audio format conversion (Homebrew; use your system's package manager elsewhere)

# 2. Configure
cp .env.example .env
# Required: ANTHROPIC_API_KEY
# Recommended: OPENAI_API_KEY, GITHUB_TOKEN

# 3. Start infrastructure
docker compose up -d    # PostgreSQL (pgvector) + Redis

# 4. Run migrations
npx knex migrate:latest --knexfile knexfile.ts

# 5. Launch (3 services)
python3 server/tts-server.py &          # Local TTS on :5757
npx tsx server/src/index.ts &           # Server on :3131
cd client && npx vite --port 4173 &     # Client on :4173

Open http://localhost:4173. The client connects automatically; type in the text box or switch to Voice mode.

SLO-Driven Design

| SLO | Implementation | Metric |
| --- | --- | --- |
| Cadence | 800ms working cue timer; audible chime during long tool calls | Never leaves user in silence |
| Latency | Per-stage budget tracking: STT → LLM TTFT → TTS first chunk | Displayed in status bar |
| Truth | Tool-grounded responses only; refuses when evidence unavailable | Zero fabricated data |
| Interruptibility | Barge-in stops TTS, aborts LLM, preserves context with [interrupted] | Sub-200ms interruption |

State Machine

IDLE ──▸ LISTENING_PASSIVE ──▸ LISTENING_ACTIVE ──▸ PROCESSING ──▸ SPEAKING
              ▴                                                      │
              └──────────────── TTS_DONE ◂───────────────────────────┘
                                                    BARGE_IN ──▸ CANCELLING

The machine has 6 states and 18 valid transitions. The working cue fires automatically after 800ms in PROCESSING. Barge-in is a first-class state transition that preserves conversation context.
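A table-driven sketch of the state machine is shown below. It encodes only the transitions visible in the diagram, not the full set of 18. `TTS_DONE` and `BARGE_IN` come from the diagram; the other event names, and the assumption that barge-in is taken from SPEAKING, are hypothetical.

```typescript
// Illustrative sketch: a partial transition table covering only the edges drawn above.
type State =
  | "IDLE"
  | "LISTENING_PASSIVE"
  | "LISTENING_ACTIVE"
  | "PROCESSING"
  | "SPEAKING"
  | "CANCELLING";

// START, SPEECH_START, UTTERANCE_END and TTS_START are made-up names for this example.
type VoiceEvent =
  | "START"
  | "SPEECH_START"
  | "UTTERANCE_END"
  | "TTS_START"
  | "TTS_DONE"
  | "BARGE_IN";

const TRANSITIONS: Record<State, Partial<Record<VoiceEvent, State>>> = {
  IDLE: { START: "LISTENING_PASSIVE" },
  LISTENING_PASSIVE: { SPEECH_START: "LISTENING_ACTIVE" },
  LISTENING_ACTIVE: { UTTERANCE_END: "PROCESSING" },
  PROCESSING: { TTS_START: "SPEAKING" },
  SPEAKING: { TTS_DONE: "LISTENING_PASSIVE", BARGE_IN: "CANCELLING" },
  CANCELLING: {}, // outgoing edges are not shown in the diagram, so none are guessed here
};

// Return the next state, or throw if the event is not valid in the current state.
function transition(state: State, event: VoiceEvent): State {
  const next = TRANSITIONS[state][event];
  if (next === undefined) {
    throw new Error(`invalid transition: ${state} + ${event}`);
  }
  return next;
}

const WORKING_CUE_MS = 800;

// True once the pipeline has spent at least 800ms in PROCESSING.
function workingCueDue(enteredProcessingAtMs: number, nowMs: number): boolean {
  return nowMs - enteredProcessingAtMs >= WORKING_CUE_MS;
}
```

Keeping the transitions in a single table makes invalid moves fail loudly and lets a test suite enumerate every edge.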

Tools (24 registered)

| Tool | Description |
| --- | --- |
| github_get_repo | Repository metadata, stars, language |
| github_list_repos | List user's repositories |
| github_list_issues | List/filter issues by state, labels |
| github_get_issue | Issue details + comments |
| github_create_issue | Create a new GitHub issue |
| github_list_prs | List pull requests by state |
| github_get_pr | PR details + diff stats + reviews |
| github_list_commits | Recent commit history |
| github_list_branches | List repository branches |
| github_list_workflow_runs | CI/CD pipeline status (GitHub Actions) |
| github_get_file | Read file contents from a repository |
| github_search_code | Search code across repositories |
| github_dry_run_pr | Preview PR proposal without pushing |
| data_query | Query external data API with freshness tracking |
| data_get_freshness | Report cache freshness timestamps |
| memory_recall | Semantic search across stored memories |
| memory_what_do_you_remember | List all facts/preferences about user |
| memory_store_fact | Persist a fact or preference |
| get_current_time | Current date/time with timezone support |
| calculate | Evaluate math expressions |
| web_fetch | Fetch any URL or API endpoint |
| set_reminder | Set a timed reminder |

Tools execute via a ReAct agent loop — Claude autonomously decides which tools to call, processes results, and may chain multiple tool calls before responding.
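The general shape of such a loop is sketched below with a pluggable model function so it runs without any API key. The `ModelTurn`, `Tool`, and `Model` types, the transcript format, and the step limit are all assumptions for illustration; the project's Claude integration is not reproduced here.

```typescript
// Illustrative sketch of a ReAct-style loop; all types and names here are hypothetical.
type ModelTurn =
  | { kind: "tool_call"; tool: string; input: Record<string, unknown> }
  | { kind: "final"; text: string };

type Tool = (input: Record<string, unknown>) => Promise<string>;
type Model = (transcript: string[]) => Promise<ModelTurn>;

// Alternate between asking the model what to do and executing the tool it picked,
// feeding each observation back in, until it produces a final answer.
async function runAgentLoop(
  model: Model,
  tools: Record<string, Tool>,
  userText: string,
  maxSteps: number = 5,
): Promise<string> {
  const transcript: string[] = [`user: ${userText}`];
  for (let step = 0; step < maxSteps; step++) {
    const turn = await model(transcript);
    if (turn.kind === "final") return turn.text;

    const tool: Tool | undefined = tools[turn.tool];
    const observation =
      tool !== undefined ? await tool(turn.input) : `error: unknown tool ${turn.tool}`;
    transcript.push(`tool_call: ${turn.tool} ${JSON.stringify(turn.input)}`);
    transcript.push(`observation: ${observation}`);
  }
  // Bounded loop: give up rather than call tools forever.
  return "I could not complete that within the step limit.";
}
```

Bounding the number of steps matters in a voice setting, since every extra tool call adds to the time the user waits for speech.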

Commands

Type / in the text input to see available commands:

| Command | Description |
| --- | --- |
| /help | Show capabilities and command reference |
| /tools | List all registered tools with descriptions |
| /clear | Clear conversation history |
| /verify | Enable verification mode (confidence + citations) |
| /status | Show pipeline status, session info, latency stats |

Input Modes

| Mode | How | API Key |
| --- | --- | --- |
| Text | Type in input box | None |
| Voice | Browser-native SpeechRecognition (Chrome/Edge/Safari) | None |
| Whisper | OpenAI Whisper server-side STT | OPENAI_API_KEY |

The system auto-detects available API keys and uses real services when configured, mock implementations otherwise. Each service (STT, LLM, TTS) can independently be real or mock.
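One way to express the per-service switching is a small resolver over the environment, sketched below. The mapping of keys to services follows the Environment Variables table, but the exact rules are assumptions: in particular, the `localTtsAvailable` flag is a hypothetical stand-in for however the server detects the local edge-tts process.

```typescript
// Illustrative sketch: the resolution rules and the localTtsAvailable flag are assumptions.
type Mode = "real" | "mock";

interface ServiceModes {
  stt: Mode;
  llm: Mode;
  tts: Mode;
}

// A key counts as present only if it is set and not blank.
function hasKey(env: Record<string, string | undefined>, name: string): boolean {
  return (env[name] ?? "").trim().length > 0;
}

// Decide real vs mock independently for each service.
function resolveServiceModes(
  env: Record<string, string | undefined>,
  localTtsAvailable: boolean,
): ServiceModes {
  return {
    stt: hasKey(env, "OPENAI_API_KEY") ? "real" : "mock",
    llm: hasKey(env, "ANTHROPIC_API_KEY") ? "real" : "mock",
    tts: localTtsAvailable || hasKey(env, "ELEVENLABS_API_KEY") ? "real" : "mock",
  };
}
```

On the server this would typically be called once at startup as `resolveServiceModes(process.env, ttsReachable)`, so each stage of the pipeline can be developed and tested without the others being configured.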

Memory Architecture

┌─────────────────┐     ┌───────────────────┐     ┌──────────────────┐
│   Conversation   │     │  Semantic Memory   │     │   Preferences    │
│   Store (PG)     │     │  (pgvector)        │     │   (PG)           │
│                  │     │                    │     │                  │
│  Immutable log   │     │  Embeddings of     │     │  JSON rules      │
│  of transcripts, │     │  facts, summaries, │     │  evaluated       │
│  tool calls,     │     │  preferences with  │     │  before every    │
│  responses       │     │  cosine similarity │     │  response        │
│                  │     │  search            │     │                  │
└─────────────────┘     └───────────────────┘     └──────────────────┘
  • Immutable audit log: Every message, tool call, and response stored in PostgreSQL
  • Semantic retrieval: pgvector embeddings for "what do you remember about me?"
  • Preference policies: Enforced constraints ("always use metric units")
  • Post-session summarization: Claude Haiku auto-summarizes after each session
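To make the semantic retrieval step concrete, the sketch below ranks stored memories by cosine similarity to a query embedding. In the project this comparison is performed by pgvector inside PostgreSQL; the in-memory version here only illustrates what that search computes, and the `Memory` type and `topK` helper are hypothetical.

```typescript
// Illustrative in-memory sketch of cosine-similarity retrieval; the real search runs in pgvector.
interface Memory {
  text: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 for zero-length vectors.
function cosineSimilarity(a: readonly number[], b: readonly number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Return the k memories most similar to the query embedding, best match first.
function topK(query: readonly number[], memories: readonly Memory[], k: number): Memory[] {
  return memories
    .map((memory) => ({ memory, score: cosineSimilarity(query, memory.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((entry) => entry.memory);
}
```

A question such as "what do you remember about me?" would be embedded once and then passed as `query`, with the top results injected into the prompt.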

Observability

  • Latency dashboard: Per-stage timing in status bar (STT:120ms LLM:800ms TTS:300ms)
  • Audio Health Panel: Input level (dBFS), VAD state, AEC status, TTS state, barge-in count, pipeline state
  • Prometheus metrics: /api/metrics endpoint with histograms for STT/LLM/TTS latency, tool call counts, active sessions
  • Structured logging: Pino with correlation IDs per session

REST API

| Endpoint | Description |
| --- | --- |
| GET /health | Health check with DB/Redis status |
| GET /api/status | API key configuration status |
| GET /api/sessions | List recent sessions |
| GET /api/sessions/:id | Session details + message history |
| GET /api/sessions/:id/transcript | Formatted transcript |
| GET /api/memory/:userId | List stored memories |
| GET /api/preferences/:userId | List active preferences |
| POST /api/preferences/:userId | Add a preference rule |
| GET /api/metrics | Prometheus metrics |

Testing

cd server && npm test          # 86 unit tests (vitest)
npx tsx scripts/test-db.ts     # Database integration tests
npx tsx scripts/test-latency.ts # End-to-end latency measurement

| Suite | Tests | Coverage |
| --- | --- | --- |
| State Machine | 35 | All transitions, events, working cue timer |
| Text Chunker | 18 | Sentence boundaries, flush, min chunk |
| Transcript Buffer | 15 | Interim/final, utterance commit |
| Protocol | 18 | Encode/decode roundtrip, frame types |

Project Structure

├── shared/protocol.ts          # WebSocket message types (single source of truth)
├── server/src/
│   ├── orchestrator/           # State machine, pipeline, barge-in, session
│   ├── stt/                    # Whisper STT, browser speech, mock
│   ├── llm/                    # Claude streaming, ReAct loop, truth guard
│   ├── tts/                    # Local TTS (edge-tts), ElevenLabs, mock
│   ├── tools/                  # Tool registry, GitHub, Data API, Memory, Commands
│   ├── memory/                 # Conversation store, semantic memory, embeddings
│   ├── api/                    # REST endpoints
│   ├── db/                     # PostgreSQL, Redis, migrations
│   └── observability/          # Latency tracker, Prometheus metrics, Pino logger
├── client/src/
│   ├── components/             # VoiceOrb, TranscriptPane, AudioHealthPanel, etc.
│   ├── hooks/                  # useWebSocket, useAudioCapture, useWebSpeech
│   └── audio/                  # AudioWorklet, playback manager, working cue
└── scripts/                    # Dev, DB test, latency test, seed preferences

92 source files, ~9,500 lines of TypeScript + Python.

Key Design Decisions

  1. Single WebSocket with 1-byte frame prefix (0x00=JSON, 0x01=audio) — simplifies connection management
  2. Per-service mock/real switching — each of STT, LLM, TTS independently uses real or mock based on API key presence
  3. Sentence-boundary TTS chunking — buffers LLM tokens until sentence ends for natural prosody
  4. Dual endpointing — Deepgram's silence-based (300ms) + UtteranceEnd (1000ms) for noisy environments
  5. Evidence-first truth contract — prompt engineering + tool output tracking, not post-hoc filtering
  6. Auto-reconnect with exponential backoff — 1s → 2s → 4s → 8s → 10s max
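Two of the decisions above are small enough to sketch directly: the reconnect schedule (item 6) and sentence-boundary chunking (item 3). Both are hedged illustrations with hypothetical names. The chunker in particular uses a simple "punctuation followed by whitespace" rule and omits the minimum chunk size mentioned in the test suite.

```typescript
// Illustrative sketches; function and class names are assumptions.

// Reconnect delay for a 0-based attempt number: 1s, 2s, 4s, 8s, then capped at 10s.
function reconnectDelayMs(attempt: number): number {
  const BASE_MS = 1000;
  const MAX_MS = 10000;
  return Math.min(MAX_MS, BASE_MS * Math.pow(2, Math.max(0, attempt)));
}

// Buffers streamed LLM tokens and releases text only at sentence boundaries,
// so the TTS engine receives whole sentences.
class SentenceChunker {
  private buffer = "";

  // Append a token; return any complete sentences now available.
  // A sentence is treated as complete once its terminal punctuation is followed by
  // whitespace, which avoids splitting inside numbers such as "3.14".
  push(token: string): string[] {
    this.buffer += token;
    const sentences: string[] = [];
    const boundary = /[.!?]+\s+/;
    let match = boundary.exec(this.buffer);
    while (match !== null) {
      const end = match.index + match[0].length;
      const sentence = this.buffer.slice(0, end).trim();
      if (sentence.length > 0) sentences.push(sentence);
      this.buffer = this.buffer.slice(end);
      match = boundary.exec(this.buffer);
    }
    return sentences;
  }

  // Call when the LLM stream ends to release whatever text remains.
  flush(): string[] {
    const rest = this.buffer.trim();
    this.buffer = "";
    return rest.length > 0 ? [rest] : [];
  }
}
```

Waiting for the whitespace after the punctuation costs one extra token of delay per sentence, in exchange for not cutting sentences at abbreviations or decimals that end a token.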

Environment Variables

See .env.example for the complete list. Required for full functionality:

| Variable | Required | Description |
| --- | --- | --- |
| ANTHROPIC_API_KEY | Yes | Claude API for LLM |
| OPENAI_API_KEY | Recommended | Whisper STT + embeddings for semantic memory |
| GITHUB_TOKEN | Recommended | GitHub tool access (repos, issues, PRs, etc.) |
| ELEVENLABS_API_KEY | Optional | ElevenLabs TTS (fallback if local TTS unavailable) |
