A production-ready pipeline for extracting and retrieving information from complex documents containing text, tables, images, and formulas. Uses YOLO for document-layout detection, Camelot for table parsing, and Gemini for semantic understanding.
- "Eyes & Brain" Approach — Computer Vision (YOLO) + Multimodal LLMs (Gemini).
- Hybrid Retrieval — Vector search + BM25 keyword search fused with Reciprocal Rank Fusion (RRF).
- Agentic RAG — LangGraph-based agent with query optimization and optional fact-checking via Tavily.
- Visual Understanding — Extracts and captions charts, diagrams, formulas, and tables.
- Auth & Multi-tenancy — JWT-based auth with per-user document isolation.
- Fully Configurable — All models, keys, and settings controlled via .env.
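The Reciprocal Rank Fusion step in the hybrid retrieval above is simple enough to sketch. A minimal, illustrative version (not the project's actual db.py implementation), fusing a vector ranking and a BM25 ranking of document IDs with the conventional constant k=60:

```python
from collections import defaultdict

def rrf_fuse(vector_ranking, bm25_ranking, k=60):
    """Fuse two ranked lists of document IDs with Reciprocal Rank Fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked well by *both* retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in (vector_ranking, bm25_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" appears near the top of both lists, so it wins after fusion.
fused = rrf_fuse(["a", "b", "c"], ["b", "c", "d"])
```

Because RRF only needs ranks, not scores, it sidesteps the problem of normalizing cosine similarities against BM25 scores, which live on incompatible scales.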
├── config.py # Central configuration (reads .env)
├── db.py # MongoDB vector service (handles indexing & vector/BM25 search)
├── embedding.py # Embedding generation script (using Gemini models)
├── ingest.py # PDF extraction pipeline (YOLO layout detection + Gemini vision)
├── retriever.py # LangGraph RAG agent (query logic & fact checking)
├── server.py # FastAPI backend server (auth, upload, query endpoints)
├── main.py # CLI entry point for local operations (ingest, query)
├── diagnose_rag.py # Diagnostic tool for testing vector/BM25/hybrid search manually
├── ui.py # Streamlit chat UI frontend
├── prompts.py # LLM prompt templates for analysis, extraction, and generation
├── logger.py # Colored CLI logging setup
├── .env.example # Template for environment variables (.env should be created locally)
├── static/ # Frontend assets (HTML/CSS/JS for the dashboard)
├── uploads/ # Temporary directory where uploaded PDFs are stored before ingestion
├── models/ # Directory for downloaded local weights (e.g., YOLO model)
├── data/ # Local data repository (used for large sample PDFs not pushed to Git)
└── output/ # Stores intermediate ingestion artifacts (useful for debugging!)
Whenever a new document is ingested (e.g., document.pdf), the system automatically saves local artifacts inside the output/document/ directory. Use these files to verify that the pipeline extracted data properly:

- output/document/full_text.txt — the complete extracted text representation of the document before chunking. Shows how tables, figures, and formulas have been represented or captioned by Gemini.
- output/document/chunks.json — lets you inspect the exact chunking logic: the size, overlap, and embedding representation of what will be stored in MongoDB.
- output/document/yolo/page_{num}.png — for every page processed, an image with the exact bounding boxes the YOLO model drew over the document (highlighting tables, figures, headings, etc.). Extremely useful for visually verifying that layout detection is working accurately.
python -m venv .venv
.venv\Scripts\activate       # Windows
source .venv/bin/activate    # macOS/Linux
pip install -r requirements.txt

Note: You also need Poppler for PDF-to-image conversion.
cp .env.example .env

Edit .env with your keys:
GOOGLE_API_KEY=your_gemini_key
TAVILY_API_KEY=your_tavily_key
MONGO_URI=your_mongodb_uri
DB_NAME=MasterRAG
GEMINI_MODEL=models/gemini-2.5-flash
EMBEDDING_MODEL=models/gemini-embedding-001
JWT_SECRET=your_secret

python server.py

Server starts at http://localhost:8000, with the UI served at the same address.
# Ingest a PDF
python main.py ingest path/to/document.pdf --user-id user123
# Query
python main.py query "What is the revenue growth?" --user-id user123

| Method | Endpoint | Auth | Description |
|---|---|---|---|
| POST | /api/register | No | Create account |
| POST | /api/login | No | Get JWT token |
| POST | /api/documents/upload | Bearer | Upload & ingest PDF |
| GET | /api/documents | Bearer | List user's documents |
| POST | /api/query | Bearer | Ask a question (RAG) |
| GET | /api/history | Bearer | Get query history |
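The endpoints above can be exercised from Python. A hedged sketch using requests — the payload field names (username, password) and the access_token response key are assumptions; check the request models in server.py for the exact schemas:

```python
import requests

def login(base_url, username, password):
    """Obtain a JWT from /api/login (field names are assumed, not verified)."""
    resp = requests.post(f"{base_url}/api/login",
                         json={"username": username, "password": password})
    resp.raise_for_status()
    return resp.json()["access_token"]

def bearer_headers(token):
    """Build the Authorization header the Bearer-protected endpoints expect."""
    return {"Authorization": f"Bearer {token}"}

# Usage (requires the server running locally):
# token = login("http://localhost:8000", "alice", "secret")
# requests.post("http://localhost:8000/api/query",
#               json={"question": "What is the revenue growth?"},
#               headers=bearer_headers(token))
```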
For a comprehensive technical deep dive into how our multimodal extraction pipeline works (incorporating YOLO11, Gemini Vision, and LangGraph), please refer to:
👉 extraction_pipeline_deep_dive.md
MIT License

