sandeshgoyal09/Ai-Rag

MasterRAG — Multimodal RAG with Gemini & YOLO

A production-ready pipeline for extracting and retrieving information from complex documents containing text, tables, images, and formulas. Uses YOLO for document-layout detection, Camelot for table parsing, and Gemini for semantic understanding.

Features

  • "Eyes & Brain" Approach — Computer Vision (YOLO) + Multimodal LLMs (Gemini).
  • Hybrid Retrieval — Vector search + BM25 keyword search fused with Reciprocal Rank Fusion (RRF).
  • Agentic RAG — LangGraph-based agent with query optimization and optional fact-checking via Tavily.
  • Visual Understanding — Extracts and captions charts, diagrams, formulas, and tables.
  • Auth & Multi-tenancy — JWT-based auth with per-user document isolation.
  • Fully Configurable — All models, keys, and settings controlled via .env.
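
The Hybrid Retrieval feature above fuses two ranked result lists with Reciprocal Rank Fusion. The following is a minimal sketch of RRF in general, not the code in db.py or retriever.py: the function name, the chunk IDs, and the constant k=60 (a commonly used default) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. k=60 is an assumed default, not a project setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two retrievers.
vector_hits = ["chunk_a", "chunk_b", "chunk_c"]
bm25_hits = ["chunk_b", "chunk_d", "chunk_a"]

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Chunks that rank well in both lists rise to the top, while a chunk found by only one retriever still stays in the fused list.
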

Project Structure

├── config.py           # Central configuration (reads .env)
├── db.py               # MongoDB vector service (handles indexing & vector/BM25 search)
├── embedding.py        # Embedding generation script (using Gemini models)
├── ingest.py           # PDF extraction pipeline (YOLO layout detection + Gemini vision)
├── retriever.py        # LangGraph RAG agent (query logic & fact checking)
├── server.py           # FastAPI backend server (auth, upload, query endpoints)
├── main.py             # CLI entry point for local operations (ingest, query)
├── diagnose_rag.py     # Diagnostic tool for testing vector/BM25/hybrid search manually
├── ui.py               # Streamlit chat UI frontend
├── prompts.py          # LLM prompt templates for analysis, extraction, and generation
├── logger.py           # Colored CLI logging setup
├── .env.example        # Template for environment variables (.env should be created locally)
├── static/             # Frontend assets (HTML/CSS/JS for the dashboard)
├── uploads/            # Temporary directory where uploaded PDFs are stored before ingestion
├── models/             # Directory for downloaded local weights (e.g., YOLO model)
├── data/               # Local data repository (used for large sample PDFs not pushed to Git)
└── output/             # Stores intermediate ingestion artifacts (useful for debugging!)

🔎 Verifying Extraction Quality in the output/ Folder

Each time a document is ingested (e.g., document.pdf), the pipeline saves intermediate artifacts under output/{filename}/ (here, output/document/). Use these files to verify that extraction worked as expected:

  • output/{filename}/full_text.txt: The complete extracted text of the document before chunking. Use it to check how Gemini represented or captioned tables, figures, and formulas.
  • output/{filename}/chunks.json: The chunks that will be stored in MongoDB, so you can inspect their size, overlap, and embedding representation.
  • output/{filename}/yolo/page_{num}.png: One image per processed page with the YOLO bounding boxes drawn over it (tables, figures, headings, etc.). Use these to visually verify that layout detection is accurate.
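
A quick way to confirm that an ingestion run produced all three kinds of artifacts is to check for them programmatically. The sketch below relies only on the file names listed above. The helper name summarize_artifacts is hypothetical and is not part of the repository.

```python
from pathlib import Path


def summarize_artifacts(doc_dir):
    """Report which ingestion artifacts exist under output/{filename}/."""
    doc_dir = Path(doc_dir)
    yolo_dir = doc_dir / "yolo"
    return {
        "full_text": (doc_dir / "full_text.txt").is_file(),
        "chunks": (doc_dir / "chunks.json").is_file(),
        # Count the annotated page images written by the layout-detection step.
        "yolo_pages": len(list(yolo_dir.glob("page_*.png"))) if yolo_dir.is_dir() else 0,
    }


if __name__ == "__main__":
    # "output/document" corresponds to an ingested document.pdf.
    print(summarize_artifacts("output/document"))
```

A result with yolo_pages equal to 0, or with either flag set to False, suggests the run stopped before that stage completed.
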

Quick Start

1. Install dependencies

python -m venv .venv
.venv\Scripts\activate        # Windows
source .venv/bin/activate     # macOS/Linux
pip install -r requirements.txt

Note: You also need Poppler for PDF-to-image conversion.

2. Configure environment

cp .env.example .env          # Windows (cmd): copy .env.example .env

Edit .env with your keys:

GOOGLE_API_KEY=your_gemini_key
TAVILY_API_KEY=your_tavily_key
MONGO_URI=your_mongodb_uri
DB_NAME=MasterRAG
GEMINI_MODEL=models/gemini-2.5-flash
EMBEDDING_MODEL=models/gemini-embedding-001
JWT_SECRET=your_secret
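
To illustrate how these variables might be consumed, here is a minimal sketch of a settings loader. It is not the actual config.py. The Settings class, the load_settings function, and the choice of which keys are required are assumptions. The defaults mirror the example values above, and TAVILY_API_KEY is treated as optional because fact-checking is described as optional. The sketch reads from the process environment and does not parse the .env file itself.

```python
import os
from dataclasses import dataclass

# Assumption: these keys must be set; the real config.py may differ.
REQUIRED_KEYS = ("GOOGLE_API_KEY", "MONGO_URI", "JWT_SECRET")


@dataclass(frozen=True)
class Settings:
    google_api_key: str
    mongo_uri: str
    jwt_secret: str
    tavily_api_key: str
    db_name: str
    gemini_model: str
    embedding_model: str


def load_settings(env=None):
    """Build Settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise RuntimeError("Missing required settings: " + ", ".join(missing))
    return Settings(
        google_api_key=env["GOOGLE_API_KEY"],
        mongo_uri=env["MONGO_URI"],
        jwt_secret=env["JWT_SECRET"],
        tavily_api_key=env.get("TAVILY_API_KEY", ""),
        db_name=env.get("DB_NAME", "MasterRAG"),
        gemini_model=env.get("GEMINI_MODEL", "models/gemini-2.5-flash"),
        embedding_model=env.get("EMBEDDING_MODEL", "models/gemini-embedding-001"),
    )
```

Failing at startup when a key is missing tends to be easier to debug than an authentication error in the middle of an ingestion run.
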

3. Run the API server

python server.py

Server starts at http://localhost:8000.

4. Open the UI

With the server running, the UI is available at http://localhost:8000.

CLI Usage

# Ingest a PDF
python main.py ingest path/to/document.pdf --user-id user123

# Query
python main.py query "What is the revenue growth?" --user-id user123
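
For readers who want to see how a command line of this shape can be parsed, the sketch below reproduces the two subcommands with argparse. It is an illustration only, not the contents of main.py. The argument names pdf_path and question, and the decision to make --user-id required, are assumptions.

```python
import argparse


def build_parser():
    """Build a parser for the `ingest` and `query` subcommands."""
    parser = argparse.ArgumentParser(prog="main.py")
    subparsers = parser.add_subparsers(dest="command", required=True)

    ingest = subparsers.add_parser("ingest", help="Ingest a PDF")
    ingest.add_argument("pdf_path", help="Path to the PDF file")
    ingest.add_argument("--user-id", required=True, help="Owner of the document")

    query = subparsers.add_parser("query", help="Ask a question")
    query.add_argument("question", help="Question to answer")
    query.add_argument("--user-id", required=True, help="User whose documents are searched")
    return parser


# Parse the same arguments as the query example above.
args = build_parser().parse_args(
    ["query", "What is the revenue growth?", "--user-id", "user123"]
)
print(args.command, args.user_id)
```

argparse converts the --user-id flag into the attribute args.user_id, so downstream code can pass it straight to the ingestion or query function.
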

API Endpoints

Method  Endpoint                Auth    Description
POST    /api/register           No      Create account
POST    /api/login              No      Get JWT token
POST    /api/documents/upload   Bearer  Upload & ingest PDF
GET     /api/documents          Bearer  List user's documents
POST    /api/query              Bearer  Ask a question (RAG)
GET     /api/history            Bearer  Get query history
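
The Bearer-protected endpoints expect the JWT from /api/login in the Authorization header. The sketch below builds such a request for /api/query using only the standard library. The JSON field name "question" and the helper name build_query_request are assumptions, since the request schema is not documented here. Check server.py for the actual payload shape.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_query_request(token, question, base_url=BASE_URL):
    """Build (but do not send) an authenticated POST to /api/query."""
    # Assumption: the endpoint accepts a JSON body with a "question" field.
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/query",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )


request = build_query_request("your_jwt_token", "What is the revenue growth?")
print(request.get_method(), request.full_url)
```

With the server running, passing the request to urllib.request.urlopen sends it. The same Authorization header applies to the other Bearer endpoints in the table.
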

🧠 Extraction Pipeline Deep Dive

For a comprehensive technical deep dive into how our multimodal extraction pipeline works (incorporating YOLO11, Gemini Vision, and LangGraph), please refer to:

👉 extraction_pipeline_deep_dive.md

License

MIT License
