A Retrieval-Augmented Generation (RAG) system that processes PDF documents and provides intelligent question-answering capabilities using embeddings, vector search, and large language models.
- PDF Processing: Extract and chunk text from PDF documents using intelligent splitting strategies
- Vector Embeddings: Generate semantic embeddings using SentenceTransformers
- Vector Storage: Store and query embeddings using ChromaDB
- Query Augmentation: Enhance queries with hypothetical answers for improved retrieval
- LLM Integration: Generate responses using Mistral AI models
- Visualization: Plot embedding spaces with UMAP dimensionality reduction
- Python >=3.13
- Mistral API key
- Clone the repository:

  ```shell
  git clone <repository-url>
  cd rag-advanced
  ```

- Install dependencies using uv (recommended):

  ```shell
  uv sync
  ```

  Or using pip:

  ```shell
  pip install -e .
  ```

- Set up environment variables:

  ```shell
  cp .env.example .env
  # Edit .env and add your MISTRAL_API_KEY
  ```

Run the main RAG pipeline:

```shell
python main.py
```

This will:
- Process the NVIDIA annual report PDF in `data/`
- Split text into chunks and generate embeddings
- Store embeddings in ChromaDB
- Execute a sample query with query augmentation
- Generate and display the answer
- Create a visualization of the embedding space
Edit the `original_query` variable in `main.py:112` to ask different questions:

```python
original_query = "What was the total profit for the year, and how does it compare to the previous year?"
```

To process your own documents:

- Place PDF files in the `data/` directory
- Update the file path in `main.py:42`:

  ```python
  reader = PdfReader("data/your-document.pdf")
  ```

Project structure:

```
├── data/                              # PDF documents for processing
│   └── NVIDIA-2025-Annual-Report.pdf
├── main.py                            # Main RAG pipeline
├── helper_utils.py                    # Utility functions
├── .env.example                       # Environment variables template
├── .python-version                    # Python version specification
├── AGENTS.md                          # Guidelines for AI coding agents
├── pyproject.toml                     # Project dependencies
├── uv.lock                            # Dependency lock file
└── README.md                          # This file
```
- Document Processing: PDFs are loaded and text is extracted
- Text Chunking: Documents are split into manageable chunks using recursive character splitting and token-based splitting
- Embedding Generation: Each chunk is converted to vector embeddings using SentenceTransformers
- Vector Storage: Embeddings are stored in ChromaDB for efficient similarity search
- Query Augmentation: Original queries are enhanced with hypothetical answers to improve retrieval
- Retrieval: Most relevant document chunks are retrieved using similarity search
- Answer Generation: Mistral AI generates contextual answers based on retrieved chunks
- Visualization: UMAP creates 2D projections of the embedding space for analysis
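The chunking step above can be sketched in plain Python. This simplified recursive character splitter is illustrative only — the actual pipeline uses library splitters plus token-based splitting, and the function name and defaults here are assumptions:

```python
def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", ". ", " ")):
    """Recursively split text on progressively finer separators until
    every chunk fits within chunk_size characters."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            parts = text.split(sep)
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) > chunk_size:
                        # The piece itself is too long: recurse so finer
                        # separators (or the hard cut below) can break it up
                        chunks.extend(recursive_split(part, chunk_size, separators))
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator present at all: fall back to a hard character cut
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

The idea is the same as production splitters: prefer paragraph boundaries, fall back to sentences and words, and only cut mid-word as a last resort.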
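Query augmentation works by generating a hypothetical answer and embedding it together with the query, so the combined text sits closer to answer-shaped passages in embedding space. A minimal sketch with the LLM call abstracted as a callable — `augment_query` and the prompt wording are assumptions, not the project's exact code:

```python
def augment_query(original_query, generate_answer):
    """HyDE-style augmentation: pair the query with a hypothetical
    answer produced by an LLM before embedding. `generate_answer` is
    any callable taking a prompt string and returning text
    (in the real pipeline, a wrapper around the Mistral chat API)."""
    prompt = (
        "Provide a brief, plausible answer to the following question, "
        f"as it might appear in a financial report: {original_query}"
    )
    hypothetical_answer = generate_answer(prompt)
    # The joint text is what gets embedded and sent to the vector store
    return f"{original_query} {hypothetical_answer}"
```

Even a wrong hypothetical answer tends to help, because it shares vocabulary and structure with the real answer passages being searched for.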
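Under the hood, the similarity search that ChromaDB performs boils down to comparing embedding vectors. A self-contained illustration using cosine similarity — pure Python, no ChromaDB, and the function names are hypothetical:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical
    direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_emb, doc_embs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(doc_embs)),
                    key=lambda i: cosine_similarity(query_emb, doc_embs[i]),
                    reverse=True)
    return ranked[:k]
```

Real embeddings have hundreds of dimensions and ChromaDB uses approximate nearest-neighbor indexes rather than a full scan, but the ranking principle is the same.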
Environment variables:

- `MISTRAL_API_KEY`: Your Mistral AI API key (required)
- `TOKENIZERS_PARALLELISM`: Set to `false` to avoid tokenizer warnings
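These variables are typically read once at startup; a minimal sketch using only the standard library (the exact handling in `main.py` may differ):

```python
import os

# Silence HuggingFace tokenizers fork/parallelism warnings; this must be
# set before SentenceTransformers is imported to take effect.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# The key name matches .env.example. Loading the .env file itself is
# usually done earlier with python-dotenv's load_dotenv().
api_key = os.environ.get("MISTRAL_API_KEY")
if not api_key:
    print("Warning: MISTRAL_API_KEY is not set - LLM calls will fail")
```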
- Ensure your Mistral API key is valid and has sufficient credits
- Check that PDF files are readable and not encrypted
- Verify Python version compatibility (>=3.13 required)
- If you encounter import errors, run `uv sync` to install all dependencies
- For tokenizer warnings, set `TOKENIZERS_PARALLELISM=false` in your environment