A local-first AI system that combines Retrieval-Augmented Generation (RAG), streaming responses, and philosophical reasoning — built entirely on a laptop.
This project is an end-to-end AI knowledge system designed to:
- 📚 Answer questions using custom PDFs (RAG)
- ⚡ Stream responses in real-time (ChatGPT-like UX)
- 🕉️ Generate daily philosophical contemplations
- 🔒 Run fully locally (Ollama + FastAPI + React)
No cloud dependencies. Full control over data and compute.
```
PDFs → Embeddings → FAISS Index
        ↓
User Query → Retrieval → LLM (Ollama)
        ↓
Streaming Response → UI
```
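The retrieval step of the pipeline above can be sketched in a few lines. The chunks, the toy 3-dimensional vectors, and the `retrieve` helper below are purely illustrative (real embeddings come from Sentence Transformers and the index lives in FAISS):

```python
import math

# Toy corpus: in the real pipeline these chunks come from ingested PDFs
chunks = [
    "Dharma is righteous duty in the Bhagavad Gita.",
    "FAISS performs fast nearest-neighbour search.",
    "Streaming sends tokens to the UI as they are generated.",
]

# Stand-in embeddings, one vector per chunk
index = [
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.1, 0.9],
]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(query_vec, k=1):
    """Return the top-k chunks ranked by cosine similarity."""
    ranked = sorted(range(len(index)),
                    key=lambda i: cosine(query_vec, index[i]),
                    reverse=True)
    return [chunks[i] for i in ranked[:k]]

# A query embedding close to the first chunk retrieves that chunk
print(retrieve([1.0, 0.0, 0.1]))
```

The retrieved chunks are then packed into the LLM prompt as context, which is what grounds the answer in your own documents.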
- FastAPI (Python)
- FAISS (Vector Search)
- Sentence Transformers (Embeddings)
- Ollama (Llama 3.2 - Local LLM)
- React + Vite
- Streaming Fetch API
- Discord Webhook (Daily Contemplations)
- Context-aware answers from your own PDFs
- Smart chunking + embedding pipeline
- Local vector database (FAISS)
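The chunking step might look roughly like this sketch; the character-window approach, chunk size, and overlap values are illustrative, not the project's actual parameters:

```python
def chunk_text(text, size=200, overlap=50):
    """Split text into overlapping windows so that sentences cut at a
    chunk boundary still appear intact in at least one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # advance by the stride, keeping overlap
    return chunks

print(len(chunk_text("x" * 500)))  # → 4
```

Overlap trades a little index size for better recall: a fact split across a boundary would otherwise be invisible to retrieval.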
- Token-by-token streaming (like ChatGPT)
- Optimized for local inference
- Fault-tolerant fallback to non-stream mode
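The fault-tolerant fallback can be sketched as a generator that wraps the streaming backend; all function names here are invented for illustration, and the sketch assumes the failure happens before the first token arrives:

```python
def stream_tokens(generate_stream, generate_full):
    """Yield tokens from the streaming backend; if the stream raises,
    fall back to one blocking call that returns the whole answer."""
    try:
        for token in generate_stream():
            yield token
    except Exception:
        yield generate_full()

# Illustrative backends
def good_stream():
    for t in ["Dharma ", "means ", "duty."]:
        yield t

def broken_stream():
    raise ConnectionError("Ollama stream unavailable")
    yield  # unreachable, but makes this a generator

def full_answer():
    return "Dharma means righteous duty."

print("".join(stream_tokens(good_stream, full_answer)))    # streamed path
print("".join(stream_tokens(broken_stream, full_answer)))  # fallback path
```

The UI consumes the same token iterator either way, so it never needs to know which path produced the text.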
- AI-generated philosophy (Gita, Upanishads, Epics)
- Automatically sent to Discord
- Dynamic topic generation (no static prompts)
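Dynamic topic generation might be as simple as composing a fresh prompt per day; the source list, theme list, and template below are invented for illustration:

```python
import random

SOURCES = ["Bhagavad Gita", "Upanishads", "Mahabharata", "Ramayana"]
THEMES = ["dharma", "detachment", "impermanence", "duty", "self-knowledge"]

def build_prompt(rng=random):
    """Compose a different contemplation prompt each run,
    instead of reusing one static prompt."""
    source = rng.choice(SOURCES)
    theme = rng.choice(THEMES)
    return (f"Write a short contemplation on {theme}, "
            f"grounded in the {source}, for today's reflection.")

print(build_prompt())
```

The generated prompt goes to the local LLM, and the result is posted via the Discord webhook.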
- Runs entirely on laptop
- No external API dependency
- Full data privacy
- Structured reasoning prompts
- Context grounding (reduces hallucination)
- Insight generation (not copy-paste)
- Batched embeddings for scalability
- Chunk limiting to prevent overload
- Streaming optimization for latency
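The batching and chunk-limiting ideas above can be sketched like this; the batch size, cap, and `embed_fn` stand-in are illustrative:

```python
def batched(items, batch_size=32):
    """Yield fixed-size batches so the embedding model is called
    once per batch instead of once per chunk."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def embed_all(chunks, embed_fn, batch_size=32, max_chunks=1000):
    """Embed chunks in batches, capping the total to avoid overload."""
    chunks = chunks[:max_chunks]  # chunk limiting
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_fn(batch))  # one model call per batch
    return vectors

# Illustrative embed_fn: the length of each chunk stands in for a vector
fake_embed = lambda batch: [len(c) for c in batch]
print(embed_all(["a", "bb", "ccc"], fake_embed, batch_size=2))  # → [1, 2, 3]
```

Batching amortizes per-call overhead, which matters on a laptop where the embedding model shares compute with the LLM.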
```bash
# Clone and start the backend
git clone https://github.com/thescalableguy/rag-implementation.git
cd rag-implementation
pip install -r requirements.txt
uvicorn api.server:app --reload
```

```bash
# Start the frontend
cd ui
npm install
npm run dev
```

```bash
# Start the local LLM
ollama serve
ollama run llama3.2
```

```bash
# Ingest your PDFs
python core/rag/ingest.py
```
- 📖 Ask: "What is Dharma according to Bhagavad Gita?"
- 🧠 Get: Context-grounded, explained answer
- ⚡ Delivered via streaming UI
- Combines LLMs + Systems Engineering + Performance Thinking
- Demonstrates local AI deployment (future trend)
- Implements real-world RAG pipeline
- Focuses on latency, scalability, and reliability
- RAG is primarily a data + retrieval problem, not just LLM usage
- Streaming systems require careful state management
- Local AI introduces performance constraints → design trade-offs
- Source-level citation (Perplexity-style)
- Memory layer (personalized responses)
- LLM observability (latency, tokens/sec)
- Krishna Mode (philosophy-driven reasoning engine)
Sayan Bhattacharya
Senior Performance Engineer | AI Systems Builder
Building systems where performance meets intelligence
Give it a star ⭐ and feel free to contribute!