rexper101/jarvis

Jarvis — AI-Driven Desktop Virtual Assistant

MCA (Data Science) Final Year Capstone Project

An intelligent desktop assistant with an animated avatar, voice interaction, computer vision, and autonomous task planning. It runs entirely locally, with no cloud costs.


Architecture at a Glance

User speaks → Wake Word → STT (Faster-Whisper)
                              ↓
                    Supervisor (LangGraph)
                    ┌──────────────────────┐
                    │  Intent classifier   │
                    │  (Phi-3-mini)        │
                    └──────┬───────────────┘
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
   Conversation     Planning Agent    Vision Agent
   Agent (Qwen)     (Qwen + plan)     (LLaVA + OCR)
          │                │                │
          └────────────────┼────────────────┘
                           ▼
                   Unified Response
                           │
          ┌────────────────┼────────────────┐
          ▼                ▼                ▼
    TTS (Piper)     Avatar (Godot)    Memory (ChromaDB)
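
The routing step in the diagram can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's supervisor: the keyword-based `classify_intent` is a stand-in for the Phi-3-mini intent classifier, the three agent functions are hypothetical stubs, and no LangGraph APIs are used.

```python
import re
from typing import Callable, Dict

# Hypothetical stub agents; real ones would call Qwen or LLaVA through Ollama.
def conversation_agent(text: str) -> str:
    return f"conversation: {text}"

def planning_agent(text: str) -> str:
    return f"plan: {text}"

def vision_agent(text: str) -> str:
    return f"vision: {text}"

AGENTS: Dict[str, Callable[[str], str]] = {
    "conversation": conversation_agent,
    "planning": planning_agent,
    "vision": vision_agent,
}

def classify_intent(text: str) -> str:
    """Keyword stand-in for the LLM intent classifier (an assumption)."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    if words & {"screen", "see", "look"}:
        return "vision"
    if words & {"create", "open", "search"}:
        return "planning"
    return "conversation"

def supervise(text: str) -> dict:
    """Route one utterance to a single agent and wrap the result."""
    intent = classify_intent(text)
    return {"intent": intent, "response": AGENTS[intent](text)}
```

Keeping the agents in a dictionary keyed by intent means a new agent only needs a new entry and a classifier label, which mirrors the fan-out and fan-in shape of the diagram.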

Quick Start

# 1. Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &

# 2. Clone and set up
git clone <your-repo>
cd jarvis
python setup.py          # downloads all models (~15GB)

# 3. Run
python -m uvicorn api.main:app --port 8000

# 4. Test
curl -X POST http://localhost:8000/chat \
     -H "Content-Type: application/json" \
     -d '{"text": "Create a folder on my desktop called MCA Project"}'
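
The same test request can be sent from Python using only the standard library. This is a minimal sketch: the helper names `build_chat_request` and `send_chat` are hypothetical, the `session_id` field is taken from the API Reference example, and whether the server requires it is an assumption.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # port taken from the uvicorn command

def build_chat_request(text: str, session_id: str = "user1") -> urllib.request.Request:
    """Build the POST /chat request without sending it."""
    payload = json.dumps({"text": text, "session_id": session_id}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/chat",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_chat(text: str, session_id: str = "user1", timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON reply (needs a running server)."""
    request = build_chat_request(text, session_id)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With the server running, `send_chat("Create a folder on my desktop called MCA Project")` should return the parsed JSON reply as a dictionary.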

Project Structure

jarvis/
├── core/
│   ├── supervisor.py        LangGraph orchestrator
│   └── llm/
│       └── ollama_client.py Ollama wrapper
├── speech/
│   ├── wake_word.py         OpenWakeWord listener
│   ├── stt.py               Faster-Whisper STT
│   └── tts.py               Piper / Coqui TTS
├── memory/
│   └── long_term.py         ChromaDB + SQLite memory
├── vision/
│   └── screen_capture.py   OCR + UI detection + LLaVA
├── automation/
│   └── executor.py          PyAutoGUI + Playwright
├── avatar/
│   ├── avatar_controller.py WebSocket bridge to Godot
│   └── godot_project/       Godot 4 avatar scene
├── api/
│   └── main.py              FastAPI backend (entry point)
├── config/
│   └── settings.yaml
├── requirements.txt
└── setup.py

Hardware Requirements

| Tier | GPU | RAM | Performance |
|------|-----|-----|-------------|
| Minimum | GTX 1060 6GB | 16GB | Good: 7B models, ~1.5s latency |
| Recommended | RTX 3060 12GB | 32GB | Excellent: 13B models, <1s latency |
| CPU-only | None | 16GB | Degraded: 3B models, ~5s latency |
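
As a rough illustration, the tiers can be expressed as a lookup on available memory. This sketch assumes the table rows act as lower bounds on VRAM and system RAM, which the table itself does not state; `pick_tier` and its return labels are hypothetical.

```python
def pick_tier(vram_gb: float, ram_gb: float) -> str:
    """Map available memory to a hardware tier (thresholds are assumptions)."""
    if vram_gb >= 12 and ram_gb >= 32:
        return "recommended"  # 13B models
    if vram_gb >= 6 and ram_gb >= 16:
        return "minimum"      # 7B models
    if ram_gb >= 16:
        return "cpu-only"     # 3B models
    return "unsupported"
```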

Technology Stack

| Component | Technology | Reason |
|-----------|------------|--------|
| LLM | Qwen2.5-7B via Ollama | Best reasoning per GB |
| STT | Faster-Whisper | 4× faster than Whisper, same accuracy |
| TTS | Piper TTS | 50ms latency, 35+ languages |
| Wake word | OpenWakeWord | Apache 2.0, fully offline, trainable |
| Agents | LangGraph | Fine-grained control over agent flow |
| Memory | ChromaDB + SQLite | Vector + structured storage |
| Vision | LLaVA + EasyOCR | LLM-grade screen understanding |
| Automation | PyAutoGUI + Playwright | GUI + browser control |
| Avatar | Godot 4 | MIT license, WebSocket API |
| Backend | FastAPI | Async, WebSocket support |

Research Contributions

  1. Emotion-aware memory retrieval — weights past memories by emotional context
  2. Proactive task anticipation — learns user patterns and suggests actions
  3. Visual workflow recording — records human actions → replayable plan
  4. Cross-app context transfer — shares context between applications
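
One possible reading of the first contribution, emotion-aware memory retrieval, is a ranking that blends semantic similarity with an emotion match. The sketch below is an assumption for illustration only, not the project's method: the `Memory` class, the linear blend, and the default `emotion_weight` are all hypothetical, and plain lists stand in for ChromaDB embeddings.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class Memory:
    text: str
    embedding: Sequence[float]
    emotion: str  # emotion label stored with the memory (assumed field)

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_memories(
    query_embedding: Sequence[float],
    query_emotion: str,
    memories: List[Memory],
    emotion_weight: float = 0.3,
) -> List[Memory]:
    """Sort memories by a blend of similarity and emotion match, best first."""
    def score(memory: Memory) -> float:
        match = 1.0 if memory.emotion == query_emotion else 0.0
        similarity = cosine(query_embedding, memory.embedding)
        return (1 - emotion_weight) * similarity + emotion_weight * match
    return sorted(memories, key=score, reverse=True)
```

Setting `emotion_weight` to 0 reduces this to ordinary similarity search, which makes the effect of the emotional term easy to measure in an ablation.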

API Reference

POST /chat

{"text": "Open Chrome and search for deep learning", "session_id": "user1"}

Response:

{
  "response": "[HAPPY] Opening Chrome and searching...",
  "emotion": "happy",
  "intent": "automation",
  "action_plan": [...],
  "latency_ms": 1240
}
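
In the example response, the `response` text starts with an emotion tag (`[HAPPY]`) that matches the `emotion` field. A client that passes the text to TTS may want to strip that tag first. The sketch below assumes the tag is always a single uppercase word in square brackets at the start of the string, which is inferred from this one example; `split_emotion_tag` and `SpokenReply` are hypothetical names.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed tag format: one uppercase word in brackets at the start of the text.
_TAG = re.compile(r"^\[([A-Z]+)\]\s*")

@dataclass
class SpokenReply:
    emotion: Optional[str]  # lowercase tag name, or None if no tag was found
    text: str               # reply text with the tag removed

def split_emotion_tag(response: str) -> SpokenReply:
    """Separate a leading emotion tag from the text to be spoken."""
    match = _TAG.match(response)
    if match is None:
        return SpokenReply(emotion=None, text=response)
    return SpokenReply(emotion=match.group(1).lower(), text=response[match.end():])
```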

WebSocket /ws

Connect to ws://localhost:8000/ws and send:

{"text": "What's on my screen?", "session_id": "user1"}

GET /health

{"status": "ok", "ollama": true, "avatar": true}
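
A startup script could use this endpoint to wait until the backend is usable. The sketch below treats the service as ready only when `status` is `"ok"` and both `ollama` and `avatar` are `true`. That readiness rule is an assumption, and `is_ready` and `fetch_health` are hypothetical helper names.

```python
import json
import urllib.request

def is_ready(health: dict) -> bool:
    """Return True only if every field in the health payload reports success."""
    return health.get("status") == "ok" and all(
        health.get(key) is True for key in ("ollama", "avatar")
    )

def fetch_health(base_url: str = "http://localhost:8000", timeout: float = 5.0) -> dict:
    """GET /health and decode the JSON body (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```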

License

MIT — Free for academic and personal use.

Nano-

An AI-powered desktop virtual assistant with an animated avatar, voice interaction, system automation, computer vision, memory, and LLM-based reasoning using free and open-source AI models.
