MCA (Data Science) Final Year Capstone Project
An intelligent desktop assistant with animated avatar, voice interaction, computer vision, and autonomous task planning — running 100% locally with zero cloud cost.
User speaks → Wake Word → STT (Faster-Whisper)
                       ↓
            Supervisor (LangGraph)
                ┌──────────────────────┐
                │  Intent classifier   │
                │     (Phi-3-mini)     │
                └──────┬───────────────┘
                       │
      ┌────────────────┼────────────────┐
      ▼                ▼                ▼
Conversation    Planning Agent    Vision Agent
Agent (Qwen)     (Qwen + plan)    (LLaVA + OCR)
      │                │                │
      └────────────────┼────────────────┘
                       ▼
               Unified Response
                       │
      ┌────────────────┼────────────────┐
      ▼                ▼                ▼
 TTS (Piper)    Avatar (Godot)  Memory (ChromaDB)
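The routing step in the diagram can be sketched in plain Python. This is a minimal illustration, not the project's `supervisor.py`: the keyword-based `classify_intent` is a hypothetical stand-in for the Phi-3-mini classifier, the three agent functions are placeholders for the Qwen and LLaVA calls, and the intent labels are assumed from the diagram.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Reply:
    intent: str
    text: str


def classify_intent(text: str) -> str:
    """Keyword stand-in for the Phi-3-mini intent classifier (hypothetical)."""
    lowered = text.lower()
    if any(word in lowered for word in ("screen", "see", "look")):
        return "vision"
    if any(word in lowered for word in ("open", "create", "search", "delete")):
        return "planning"
    return "conversation"


# Placeholder agents; the real ones would call the local models.
def conversation_agent(text: str) -> str:
    return f"(conversation) {text}"


def planning_agent(text: str) -> str:
    return f"(plan) {text}"


def vision_agent(text: str) -> str:
    return f"(vision) {text}"


AGENTS: Dict[str, Callable[[str], str]] = {
    "conversation": conversation_agent,
    "planning": planning_agent,
    "vision": vision_agent,
}


def supervise(text: str) -> Reply:
    """Classify the request, dispatch to one agent, return a unified reply."""
    intent = classify_intent(text)
    return Reply(intent=intent, text=AGENTS[intent](text))
```

Keeping the dispatch table separate from the classifier means either side can be replaced (for example, by a LangGraph conditional edge) without touching the other.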
# 1. Install Ollama
curl -fsSL https://ollama.ai/install.sh | sh
ollama serve &
# 2. Clone and set up
git clone <your-repo>
cd jarvis
python setup.py # downloads all models (~15GB)
# 3. Run
python -m uvicorn api.main:app --port 8000
# 4. Test
curl -X POST http://localhost:8000/chat \
  -H "Content-Type: application/json" \
  -d '{"text": "Create a folder on my desktop called MCA Project"}'
jarvis/
├── core/
│   ├── supervisor.py            LangGraph orchestrator
│   └── llm/
│       └── ollama_client.py     Ollama wrapper
├── speech/
│   ├── wake_word.py             OpenWakeWord listener
│   ├── stt.py                   Faster-Whisper STT
│   └── tts.py                   Piper / Coqui TTS
├── memory/
│   └── long_term.py             ChromaDB + SQLite memory
├── vision/
│   └── screen_capture.py        OCR + UI detection + LLaVA
├── automation/
│   └── executor.py              PyAutoGUI + Playwright
├── avatar/
│   ├── avatar_controller.py     WebSocket bridge to Godot
│   └── godot_project/           Godot 4 avatar scene
├── api/
│   └── main.py                  FastAPI backend (entry point)
├── config/
│   └── settings.yaml
├── requirements.txt
└── setup.py
| Tier | GPU | RAM | Performance |
|---|---|---|---|
| Minimum | GTX 1060 6GB | 16GB | Good — 7B models, ~1.5s latency |
| Recommended | RTX 3060 12GB | 32GB | Excellent — 13B models, <1s |
| CPU-only | None | 16GB | Degraded — 3B models, ~5s latency |
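As a rough sketch, the tiers above could be mapped to a model size at startup. The VRAM thresholds are read off the example GPUs in the table (6 GB and 12 GB); the `TIERS` dictionary and `pick_tier` function are hypothetical names and not part of the project's `settings.yaml`.

```python
from typing import Optional

# Ordered from most to least demanding; thresholds assumed from the table.
TIERS = {
    "recommended": {"min_vram_gb": 12, "model_size": "13B"},
    "minimum": {"min_vram_gb": 6, "model_size": "7B"},
    "cpu-only": {"min_vram_gb": 0, "model_size": "3B"},
}


def pick_tier(vram_gb: Optional[float]) -> str:
    """Return the first tier whose VRAM requirement is met (None = no GPU)."""
    if vram_gb is None:
        return "cpu-only"
    for name, spec in TIERS.items():
        if vram_gb >= spec["min_vram_gb"]:
            return name
    return "cpu-only"
```

For example, `TIERS[pick_tier(8)]["model_size"]` selects the 7B class for an 8 GB card.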
| Component | Technology | Reason |
|---|---|---|
| LLM | Qwen2.5-7B via Ollama | Best reasoning per GB |
| STT | Faster-Whisper | Up to 4× faster than Whisper at the same accuracy |
| TTS | Piper TTS | 50ms latency, 35+ languages |
| Wake word | OpenWakeWord | Apache 2.0, fully offline, trainable |
| Agents | LangGraph | Fine-grained control over agent flow |
| Memory | ChromaDB + SQLite | Vector + structured storage |
| Vision | LLaVA + EasyOCR | LLM-grade screen understanding |
| Automation | PyAutoGUI + Playwright | GUI + browser control |
| Avatar | Godot 4 | MIT license, WebSocket API |
| Backend | FastAPI | Async, WebSocket support |
- Emotion-aware memory retrieval — weights past memories by emotional context
- Proactive task anticipation — learns user patterns and suggests actions
- Visual workflow recording — records human actions → replayable plan
- Cross-app context transfer — shares context between applications
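One way to read "emotion-aware memory retrieval" is as a re-ranking step: score each stored memory by embedding similarity, then add a small bonus when its emotion label matches the current one. The sketch below illustrates that idea only; the `MemoryItem` fields, the additive `emotion_boost`, and the `retrieve` function are assumptions, not the project's actual scoring scheme or the ChromaDB API.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class MemoryItem:
    text: str
    embedding: Sequence[float]
    emotion: str


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_embedding: Sequence[float],
    query_emotion: str,
    memories: List[MemoryItem],
    emotion_boost: float = 0.25,  # assumed weight, to be tuned
    top_k: int = 3,
) -> List[MemoryItem]:
    """Rank memories by similarity plus a bonus for a matching emotion."""

    def score(item: MemoryItem) -> float:
        base = cosine(query_embedding, item.embedding)
        return base + emotion_boost if item.emotion == query_emotion else base

    return sorted(memories, key=score, reverse=True)[:top_k]
```

An additive bonus is used rather than a multiplier so that a matching emotion never lowers the score of a memory with negative similarity.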
{"text": "Open Chrome and search for deep learning", "session_id": "user1"}
Response:
{
  "response": "[HAPPY] Opening Chrome and searching...",
  "emotion": "happy",
  "intent": "automation",
  "action_plan": [...],
  "latency_ms": 1240
}
Connect to ws://localhost:8000/ws and send:
{"text": "What's on my screen?", "session_id": "user1"}
{"status": "ok", "ollama": true, "avatar": true}
An AI-powered desktop virtual assistant with an animated avatar, voice interaction, system automation, computer vision, memory, and LLM-based reasoning using free and open-source AI models.
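The `/chat` request and response shapes shown above can be exercised from Python with only the standard library. This is a minimal client sketch: the URL and payload fields come from the examples in this README, while the helper names (`build_chat_request`, `split_emotion_tag`, `chat`) and the assumption that the emotion tag is always an uppercase word in square brackets at the start of `response` are illustrative.

```python
import json
import re
import urllib.request
from typing import Optional, Tuple

CHAT_URL = "http://localhost:8000/chat"
# Assumed tag format, based on the "[HAPPY] ..." example response.
EMOTION_TAG = re.compile(r"^\[([A-Z]+)\]\s*")


def build_chat_request(text: str, session_id: str) -> urllib.request.Request:
    """Build the same POST request as the curl example."""
    body = json.dumps({"text": text, "session_id": session_id}).encode("utf-8")
    return urllib.request.Request(
        CHAT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def split_emotion_tag(response_text: str) -> Tuple[Optional[str], str]:
    """Separate a leading "[EMOTION]" tag from the spoken text, if present."""
    match = EMOTION_TAG.match(response_text)
    if not match:
        return None, response_text
    return match.group(1).lower(), response_text[match.end():]


def chat(text: str, session_id: str = "user1", timeout: float = 30.0) -> dict:
    """Send one message to a running backend and return the parsed JSON."""
    request = build_chat_request(text, session_id)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With the server from step 3 running, `chat("Open Chrome and search for deep learning")` should return a dictionary shaped like the example response, and `split_emotion_tag(reply["response"])` strips the tag before the text is passed to TTS.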