A Retrieval-Augmented Generation (RAG) system that processes PDF documents and provides intelligent question-answering capabilities using embeddings, vector search, and large language models.
- PDF Processing: Extract and chunk text from PDF documents using intelligent splitting strategies
- Vector Embeddings: Generate semantic embeddings using SentenceTransformers
- Vector Storage: Store and query embeddings using ChromaDB
- Query Augmentation: Enhance queries with hypothetical answers for improved retrieval
- LLM Integration: Generate responses using Mistral AI models
- Visualization: Plot embedding spaces with UMAP dimensionality reduction
- Python >=3.13
- Mistral API key
- Clone the repository:

  ```shell
  git clone <repository-url>
  cd rag-advanced
  ```

- Install dependencies using uv (recommended):

  ```shell
  uv sync
  ```

  Or using pip:

  ```shell
  pip install -e .
  ```

- Set up environment variables:

  ```shell
  cp .env.example .env
  # Edit .env and add your MISTRAL_API_KEY
  ```

Run the main RAG pipeline:

```shell
python main.py
```

This will:
- Process the NVIDIA annual report PDF in `data/`
- Split text into chunks and generate embeddings
- Store embeddings in ChromaDB
- Execute a sample query with query augmentation
- Generate and display the answer
- Create a visualization of the embedding space
Edit the `original_query` variable in `main.py:112` to ask different questions:

```python
original_query = "What was the total profit for the year, and how does it compare to the previous year?"
```

To process your own documents:

- Place PDF files in the `data/` directory
- Update the file path in `main.py:42`:

  ```python
  reader = PdfReader("data/your-document.pdf")
  ```

Project structure:

```
├── data/                              # PDF documents for processing
│   └── NVIDIA-2025-Annual-Report.pdf
├── main.py                            # Main RAG pipeline
├── helper_utils.py                    # Utility functions
├── .env.example                       # Environment variables template
├── .python-version                    # Python version specification
├── AGENTS.md                          # Guidelines for AI coding agents
├── pyproject.toml                     # Project dependencies
├── uv.lock                            # Dependency lock file
└── README.md                          # This file
```
- Document Processing: PDFs are loaded and text is extracted
- Text Chunking: Documents are split into manageable chunks using recursive character splitting and token-based splitting
- Embedding Generation: Each chunk is converted to vector embeddings using SentenceTransformers
- Vector Storage: Embeddings are stored in ChromaDB for efficient similarity search
- Query Augmentation: Original queries are enhanced with hypothetical answers to improve retrieval
- Retrieval: Most relevant document chunks are retrieved using similarity search
- Answer Generation: Mistral AI generates contextual answers based on retrieved chunks
- Visualization: UMAP creates 2D projections of the embedding space for analysis
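The chunking step above can be sketched in plain Python. This simplified recursive character splitter is illustrative only — the actual pipeline uses library splitters plus token-based splitting, and the function name and defaults here are assumptions:

```python
def recursive_split(text, chunk_size=100, separators=("\n\n", "\n", ". ", " ")):
    """Recursively split text on progressively finer separators until
    every chunk fits within chunk_size characters."""
    if len(text) <= chunk_size:
        return [text]
    for sep in separators:
        if sep in text:
            parts = text.split(sep)
            chunks, current = [], ""
            for part in parts:
                candidate = current + sep + part if current else part
                if len(candidate) <= chunk_size:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    if len(part) > chunk_size:
                        # The piece itself is too long: recurse so finer
                        # separators (or the hard cut below) can break it up
                        chunks.extend(recursive_split(part, chunk_size, separators))
                        current = ""
                    else:
                        current = part
            if current:
                chunks.append(current)
            return chunks
    # No separator present at all: fall back to a hard character cut
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```

The idea is the same as production splitters: prefer paragraph boundaries, fall back to sentences and words, and only cut mid-word as a last resort.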
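Query augmentation works by generating a hypothetical answer and embedding it together with the query, so the combined text sits closer to answer-shaped passages in embedding space. A minimal sketch with the LLM call abstracted as a callable — `augment_query` and the prompt wording are assumptions, not the project's exact code:

```python
def augment_query(original_query, generate_answer):
    """HyDE-style augmentation: pair the query with a hypothetical
    answer produced by an LLM before embedding. `generate_answer` is
    any callable taking a prompt string and returning text
    (in the real pipeline, a wrapper around the Mistral chat API)."""
    prompt = (
        "Provide a brief, plausible answer to the following question, "
        f"as it might appear in a financial report: {original_query}"
    )
    hypothetical_answer = generate_answer(prompt)
    # The joint text is what gets embedded and sent to the vector store
    return f"{original_query} {hypothetical_answer}"
```

Even a wrong hypothetical answer tends to help, because it shares vocabulary and structure with the real answer passages being searched for.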
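Under the hood, the similarity search that ChromaDB performs boils down to comparing embedding vectors. A self-contained illustration using cosine similarity — pure Python, no ChromaDB, and the function names are hypothetical:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical
    direction, 0.0 means orthogonal (unrelated)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_emb, doc_embs, k=2):
    """Return the indices of the k chunks most similar to the query."""
    ranked = sorted(range(len(doc_embs)),
                    key=lambda i: cosine_similarity(query_emb, doc_embs[i]),
                    reverse=True)
    return ranked[:k]
```

Real embeddings have hundreds of dimensions and ChromaDB uses approximate nearest-neighbor indexes rather than a full scan, but the ranking principle is the same.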
Environment variables:

- `MISTRAL_API_KEY`: Your Mistral AI API key (required)
- `TOKENIZERS_PARALLELISM`: Set to `false` to avoid tokenizer warnings
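These variables are typically read once at startup; a minimal sketch using only the standard library (the exact handling in `main.py` may differ):

```python
import os

# Silence HuggingFace tokenizers fork/parallelism warnings; this must be
# set before SentenceTransformers is imported to take effect.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# The key name matches .env.example. Loading the .env file itself is
# usually done earlier with python-dotenv's load_dotenv().
api_key = os.environ.get("MISTRAL_API_KEY")
if not api_key:
    print("Warning: MISTRAL_API_KEY is not set - LLM calls will fail")
```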
- Ensure your Mistral API key is valid and has sufficient credits
- Check that PDF files are readable and not encrypted
- Verify Python version compatibility (>=3.13 required)
- If you encounter import errors, run `uv sync` to install all dependencies
- For tokenizer warnings, set `TOKENIZERS_PARALLELISM=false` in your environment