# Vision RAG: Multimodal Retrieval-Augmented Generation System

## 1. Introduction

This notebook is a guide to the Vision RAG system, a multimodal Retrieval-Augmented Generation (RAG) system that can process, understand, and query both textual and visual content.

### What is Vision RAG?

Vision RAG extends the traditional text-only RAG approach by incorporating visual understanding capabilities. While conventional RAG systems only work with text documents, this Vision RAG system can:

1. **Process Multimodal Documents**: Extract and understand both text and images from PDF files
2. **Generate Multimodal Embeddings**: Create separate vector representations for text (using OpenAI's text-embedding-3-large) and images (using Cohere's embed-v4.0)
3. **Perform Semantic Search**: Find relevant content across both modalities using cosine similarity
4. **Generate Contextual Answers**: Provide comprehensive responses using both text and image context

### Core Concepts

**Retrieval-Augmented Generation (RAG)**: A paradigm that combines the power of large language models with external knowledge retrieval. Instead of relying solely on pre-trained knowledge, RAG systems retrieve relevant information from a knowledge base and use it to generate more accurate, contextual responses.

**Vector Embeddings**: Dense numerical representations of text or images that capture semantic meaning. Similar content will have similar embeddings, enabling semantic search through vector similarity.
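
As a minimal illustration of "similar content has similar embeddings", the sketch below compares made-up 3-dimensional vectors with cosine similarity. The vectors and their labels are invented for the example; real embeddings from the models named above have thousands of dimensions.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy "embeddings" with made-up values, for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.8, 0.2, 0.1]
invoice = [0.0, 0.1, 0.9]

# Related concepts point in similar directions, unrelated ones do not.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```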

**Semantic Chunking**: An intelligent text segmentation approach that breaks documents into semantically coherent chunks rather than arbitrary fixed-size segments, preserving contextual meaning.
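
To make the idea concrete, here is a toy sketch that opens a new chunk wherever two adjacent sentences look unrelated. It is not how LangChain's SemanticChunker is implemented: word counts stand in for a real embedding model, and `toy_semantic_chunks` and its `threshold` are hypothetical names chosen for this example.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(sentence: str) -> Counter:
    """Stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_semantic_chunks(text: str, threshold: float = 0.1) -> List[str]:
    """Start a new chunk wherever adjacent sentences share too little vocabulary."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if similarity(bag_of_words(prev), bag_of_words(curr)) < threshold:
            chunks.append([curr])    # topic shift: open a new chunk
        else:
            chunks[-1].append(curr)  # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]


sample = ("The pump moves water. The pump needs water to prime. "
          "Invoices are due monthly. Late invoices incur fees.")
for chunk in toy_semantic_chunks(sample):
    print(chunk)
```

A fixed-size splitter could cut this sample in the middle of the pump description; splitting on a similarity drop keeps each topic together.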

**Multimodal Understanding**: The ability to process and understand multiple types of data (text, images) simultaneously, enabling more comprehensive document analysis.

### System Architecture

The Vision RAG system consists of three main components:

1. **Ingestion Pipeline**: Processes PDFs and images, extracts content, generates embeddings, and stores them in PostgreSQL with pgvector
2. **Query Engine**: Searches for relevant content across both text and images using vector similarity
3. **Answer Generation**: Generates answers from the retrieved context with either OpenAI GPT or Google Gemini

### Key Features

- **Dual Embedding Strategy**: Text embeddings (3072-dimensional) and image embeddings (1536-dimensional) using different specialized models
- **Flexible PDF Processing**: Can extract text and images separately or convert entire PDF pages to images
- **Advanced Text Chunking**: Uses LangChain's SemanticChunker for intelligent text segmentation
- **Vector Database**: Leverages PostgreSQL with pgvector extension for efficient similarity search
- **Provider Flexibility**: Supports both OpenAI and Google Gemini for answer generation
- **Comprehensive Metadata**: Stores rich metadata for better content organization and debugging

This implementation provides a robust foundation for applications requiring multimodal document understanding, such as technical documentation analysis, research paper exploration, and visual content management systems.

## 📺 Watch the Tutorial

Prefer a video walkthrough? Check out the accompanying tutorial on YouTube:

[Vision RAG](https://youtu.be/LNydD9ZemZ8)

## 2. Prerequisites

- Python 3.8+
- Docker and Docker Compose (for PostgreSQL only)
- OpenAI API key
- Cohere API key
- Google Gemini API key (optional, for Gemini provider)

## 3. Quick Start

### 3.1 Environment Setup

Create a `.env` file with your API keys:

```bash
OPENAI_API_KEY=your_openai_api_key_here
COHERE_API_KEY=your_cohere_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here
POSTGRES_CONNECTION_STRING=postgresql://username:password@localhost:5432/vision_rag_db
```
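
Before starting the application, it can help to confirm that the required settings are present. The sketch below assumes the `.env` values have already been loaded into the process environment (for example with a loader such as python-dotenv); how the project itself loads them is not shown here, and `missing_settings` is a hypothetical helper.

```python
import os
from typing import List, Mapping

# Names taken from the .env example above; GEMINI_API_KEY is optional.
REQUIRED_KEYS = ("OPENAI_API_KEY", "COHERE_API_KEY", "POSTGRES_CONNECTION_STRING")


def missing_settings(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required settings that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        print("Missing settings:", ", ".join(missing))
    else:
        print("All required settings are present.")
```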

### 3.2 Start PostgreSQL Database

```bash
# Start PostgreSQL with automatic database initialization
docker-compose up -d

# Check that database is running
docker-compose ps
```

The database will be automatically initialized with the required tables and indexes.

### 3.3 Setup Python Environment

```bash
# Create virtual environment
python -m venv venv

# Activate virtual environment
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

### 3.4 Add Your Documents

Place your PDF files and images in the `docs/` folder:

```
docs/
├── document1.pdf
├── document2.pdf
├── image1.png
└── subfolder/
    └── image2.jpg
```

### 3.5 Run the Application

```bash
# Make sure virtual environment is activated
source venv/bin/activate

# Run the application
python main.py
```

## 4. Usage

### 4.1 Basic Commands

```bash
# Start PostgreSQL database
docker-compose up -d

# Stop PostgreSQL database
docker-compose down

# View database logs
docker-compose logs -f postgres

# Check database status
docker-compose ps
```

### 4.2 Python Application

```bash
# Activate virtual environment
source venv/bin/activate

# Run main application (processes docs/ folder and runs example queries)
python main.py
```

### 4.3 Database Management

```bash
# Connect to PostgreSQL directly
docker-compose exec postgres psql -U username -d vision_rag_db

# Check table contents
docker-compose exec postgres psql -U username -d vision_rag_db -c "SELECT COUNT(*) FROM text_embeddings;"
docker-compose exec postgres psql -U username -d vision_rag_db -c "SELECT COUNT(*) FROM image_embeddings;"
```

### 4.4 Complete Cleanup Commands

```bash
# Stop and remove containers with volumes (removes all data)
docker-compose down -v

# Remove project-specific Docker images
docker-compose down --rmi all

# Complete project cleanup (containers, volumes, images, and networks)
docker-compose down -v --rmi all --remove-orphans

# Also prune unused volumes. Caution: `docker volume prune` is not scoped to this project;
# it can delete unused volumes belonging to other projects on the same machine.
docker-compose down -v --rmi all --remove-orphans && docker volume prune -f
```

### Architecture Overview
The complete workflow of our Vision RAG:

![Vision RAG](vision_rag_workflow.png)


## 5. System Architecture & Main Classes

The Vision RAG system is built around three core modules, each containing specialized classes designed for specific functionality:

### 5.1 Core Module Overview

1. **main.py**: Application entry point and orchestration
2. **ingestion.py**: Document processing and embedding generation
3. **query.py**: Search and answer generation

### 5.2 Data Flow Architecture

```
Documents (PDFs/Images) → Ingestion Pipeline → Vector Database → Query Engine → Answer Generation
                                    ↓                    ↓               ↓              ↓
                              Text/Image           PostgreSQL      Similarity      OpenAI/Gemini
                              Processing            +pgvector        Search         Generation
```

## 6. Main Application (main.py)

### 6.1 Purpose
The main module serves as the application entry point, orchestrating the entire Vision RAG workflow from document ingestion to query processing.

### 6.2 Core Functionality

```python
def main():
    """Main function to initialize and run the Vision-RAG system"""
    
    # Initialize components
    unified_ingestion = UnifiedIngestionPipe()
    rag_query = RagQuery()
```

### 6.3 Key Features

1. **Component Initialization**: Creates instances of the unified ingestion pipeline and query engine
2. **Configurable Processing**: Uses `DEFAULT_CONFIG` to control whether ingestion and querying are active
3. **Batch Document Processing**: Processes all files in the `docs/` folder when ingestion is enabled
4. **Example Query Execution**: Demonstrates the system capabilities with predefined queries
5. **Result Reporting**: Provides detailed statistics about processed documents and query results

### 6.4 Workflow Stages

1. **Initialization Phase**: Sets up logging and creates component instances
2. **Ingestion Phase**: Processes documents if `activate_ingestion` is enabled
3. **Query Phase**: Executes example queries if `activate_query` is enabled
4. **Reporting Phase**: Displays processing statistics and query results

### 6.5 Configuration Control
The application behavior is controlled through configuration flags:
- `DEFAULT_CONFIG.activate_ingestion`: Controls document processing
- `DEFAULT_CONFIG.activate_query`: Controls query execution

This design allows for flexible operation modes, such as ingestion-only for initial setup or query-only for production use.
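
The flag-driven flow can be sketched as follows. `RunConfig` and `run` are hypothetical stand-ins for `DEFAULT_CONFIG` and the body of `main()`; only the two flag names come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical stand-in for DEFAULT_CONFIG with only the two flags named above."""
    activate_ingestion: bool = True
    activate_query: bool = True


def run(config: RunConfig, ingest: Callable[[], None], query: Callable[[], None]) -> List[str]:
    """Run only the enabled phases and return their names in execution order."""
    executed = []
    if config.activate_ingestion:
        ingest()
        executed.append("ingestion")
    if config.activate_query:
        query()
        executed.append("query")
    return executed


# Query-only mode, e.g. when the documents were ingested in an earlier run.
phases = run(RunConfig(activate_ingestion=False), ingest=lambda: None, query=lambda: None)
print(phases)  # ['query']
```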

## 7. Ingestion Pipeline (ingestion.py)

The ingestion module contains four specialized classes that work together to process multimodal documents and generate vector embeddings.

### 7.1 ImageProcessor Class

**Purpose**: Utility class for image preprocessing and format conversion.

**Key Methods**:
- `resize_image()`: Ensures images don't exceed the maximum pixel budget (`MAX_PIXELS`, the area of a 1568×1568 image)
- `base64_from_image()`: Converts images to base64 encoding for storage and API calls

**Technical Details**:
```python
MAX_PIXELS = 1568 * 1568  # Optimized for vision model input
```

This class ensures images are properly formatted for both storage and processing by vision models.
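
The sketch below shows one way the two steps could work, using only the standard library. It assumes `MAX_PIXELS` is a budget on total pixel count (width × height); `target_size` and `to_base64` are hypothetical helpers, not the class's actual methods, and the resampling itself (for example with an imaging library) is left out.

```python
import base64
import math
from typing import Tuple

MAX_PIXELS = 1568 * 1568  # same budget as in the snippet above


def target_size(width: int, height: int, max_pixels: int = MAX_PIXELS) -> Tuple[int, int]:
    """Scale (width, height) down so width * height <= max_pixels, keeping aspect ratio."""
    if width * height <= max_pixels:
        return width, height  # already within budget: leave untouched
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, int(width * scale)), max(1, int(height * scale))


def to_base64(raw_bytes: bytes) -> str:
    """Encode raw image bytes as an ASCII base64 string."""
    return base64.b64encode(raw_bytes).decode("ascii")


print(target_size(3136, 3136))  # (1568, 1568)
print(target_size(800, 600))    # (800, 600)
```

Scaling both sides by the square root of the area ratio preserves the aspect ratio, and rounding down keeps the result within the budget.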

### 7.2 TextIngestionPipe Class

**Purpose**: Handles text document processing with advanced semantic chunking and embedding generation.

**Core Components**:
- **Semantic Chunker**: Uses LangChain's SemanticChunker for intelligent text segmentation
- **OpenAI Embeddings**: Generates 3072-dimensional vectors using text-embedding-3-large
- **PostgreSQL Storage**: Stores embeddings with pgvector extension

**Key Methods**:
```python
def chunk_text(self, text: str) -> List[str]:
    """Chunk text using semantic chunker"""
    # Uses semantic boundaries rather than fixed sizes
    
def compute_text_embedding(self, text: str) -> np.ndarray:
    """Compute embedding using OpenAI text-embedding-3-large"""
    # Returns 3072-dimensional vector
    
def store_text_embedding(self, text_content: str, ...):
    """Store embeddings in PostgreSQL with pgvector"""
    # Includes metadata for better organization
```

**Advanced Features**:
- Database connection verification with helpful error messages
- Chunk filtering based on minimum size requirements
- Comprehensive metadata storage for debugging and analysis

### 7.3 ImageIngestionPipe Class

**Purpose**: Processes images and generates embeddings using Cohere's multimodal model.

**Core Technology**:
- **Cohere Embed v4.0**: Generates 1536-dimensional image embeddings
- **Base64 Storage**: Stores original images for retrieval and display
- **MIME Type Detection**: Proper format handling for different image types

**Key Methods**:
```python
def compute_image_embedding(self, img_path: str) -> np.ndarray:
    """Generate embedding using Cohere embed-v4.0"""
    # Converts image to base64 and processes with Cohere API
    
def store_image_embedding(self, ...):
    """Store image embeddings with base64 data"""
    # Includes original image data for display purposes
```

**Unique Features**:
- Stores both embeddings and original image data
- Supports multiple image formats (PNG, JPG, GIF, BMP, TIFF, WebP)
- Automatic MIME type detection and validation
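
As an illustration of MIME type detection for the formats listed above, the sketch below maps file extensions to their standard MIME types and builds a data URI from a base64 payload. Both helpers are hypothetical, and whether the project passes images to the embedding API as data URIs is an assumption.

```python
from pathlib import Path

# Extensions from the list above mapped to their standard MIME types.
MIME_BY_EXTENSION = {
    ".png": "image/png",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".gif": "image/gif",
    ".bmp": "image/bmp",
    ".tif": "image/tiff",
    ".tiff": "image/tiff",
    ".webp": "image/webp",
}


def detect_mime_type(path: str) -> str:
    """Return the MIME type for a supported image path, or raise ValueError."""
    suffix = Path(path).suffix.lower()
    if suffix not in MIME_BY_EXTENSION:
        raise ValueError(f"unsupported image format: {path!r}")
    return MIME_BY_EXTENSION[suffix]


def to_data_uri(mime_type: str, base64_data: str) -> str:
    """Combine a MIME type and a base64 payload into a data URI."""
    return f"data:{mime_type};base64,{base64_data}"


print(detect_mime_type("docs/subfolder/image2.jpg"))  # image/jpeg
```

An explicit table keeps the accepted formats visible in one place and behaves the same on every platform.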

### 7.4 UnifiedIngestionPipe Class

**Purpose**: Orchestrates the complete ingestion workflow, combining text and image processing.

**Core Functionality**:
```python
def process_all_files(self) -> Dict[str, List[str]]:
    """Process all files from the docs folder"""
    # Handles both PDFs and standalone images
    # Returns document IDs for tracking
```

**Processing Modes**:
1. **Standard Mode**: Extracts text and images separately from PDFs
2. **Page-as-Image Mode**: Converts entire PDF pages to images (useful for layout-heavy documents)

**Advanced Features**:
- Recursive file discovery in subdirectories
- Duplicate prevention (excludes extracted_images directory)
- Flexible PDF processing modes based on configuration
- Comprehensive error handling and logging

**File Organization**:
```
docs/
├── document.pdf          # Processed for text + images
├── image.png            # Processed as standalone image
└── extracted_images/    # PDF-extracted images (auto-created)
    └── document_img1.png
```
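
Recursive discovery with duplicate prevention could look roughly like this. `discover_files` is a hypothetical helper written for this guide, not the project's `process_all_files`; it only collects paths and does no processing.

```python
from pathlib import Path
from typing import Dict, List

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff", ".webp"}
EXCLUDED_DIR = "extracted_images"  # auto-created folder; skipping it avoids double ingestion


def discover_files(docs_dir: str) -> Dict[str, List[Path]]:
    """Recursively collect PDFs and standalone images, skipping extracted images."""
    found: Dict[str, List[Path]] = {"pdfs": [], "images": []}
    root = Path(docs_dir)
    for path in sorted(root.rglob("*")):
        if not path.is_file() or EXCLUDED_DIR in path.relative_to(root).parts:
            continue
        suffix = path.suffix.lower()
        if suffix == ".pdf":
            found["pdfs"].append(path)
        elif suffix in IMAGE_SUFFIXES:
            found["images"].append(path)
    return found
```

Calling `discover_files("docs")` on the layout above would return `document.pdf` and `image.png` but not `document_img1.png`, since that file lives under the excluded directory.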

## 8. Query Engine (query.py)

The query module contains the RagQuery class, which handles the complete query workflow from search to answer generation.

### 8.1 RagQuery Class

**Purpose**: Implements multimodal search and answer generation using retrieved context.

**Core Architecture**:
- **Dual Search System**: Separate search mechanisms for text and images
- **Provider Flexibility**: Supports both OpenAI and Google Gemini for answer generation
- **Vector Database Integration**: Uses PostgreSQL with pgvector for efficient similarity search

### 8.2 Text Search Implementation

```python
def search_similar_texts(self, question: str, top_k: int = 3) -> List[Dict]:
    """Search for similar text chunks using pgvector cosine similarity"""
    # 1. Convert query to embedding using OpenAI text-embedding-3-large
    # 2. Perform vector similarity search in PostgreSQL
    # 3. Return top-k results with similarity scores
```

**Technical Details**:
- Uses 3072-dimensional embeddings from OpenAI
- Cosine distance calculation: `similarity = 1 - distance`
- Returns comprehensive metadata including source files and chunk information
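
A hedged sketch of what such a query might look like: pgvector's `<=>` operator returns cosine distance, so `1 - distance` gives the similarity. The table and column names follow the schema in section 9.1, while `build_text_search` and `to_vector_literal` are hypothetical helpers. No database connection is opened here; the function only prepares the SQL and its parameters.

```python
from typing import List, Sequence, Tuple

# Table and column names follow the schema shown in section 9.1.
TEXT_SEARCH_SQL = """
SELECT text_content, source_file, chunk_index, metadata,
       1 - (embedding <=> %s::vector) AS similarity
FROM text_embeddings
ORDER BY embedding <=> %s::vector
LIMIT %s
"""


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format a Python sequence as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_text_search(embedding: Sequence[float], top_k: int = 3) -> Tuple[str, List[object]]:
    """Return the SQL and parameters for a cosine top-k search."""
    if top_k < 1:
        raise ValueError("top_k must be at least 1")
    literal = to_vector_literal(embedding)
    return TEXT_SEARCH_SQL, [literal, literal, top_k]


sql, params = build_text_search([0.1, 0.2, 0.3], top_k=3)
# With a DB-API driver such as psycopg2 this would be run as: cursor.execute(sql, params)
```

Passing the embedding as a parameter instead of formatting it into the SQL string keeps the query safe from injection.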

### 8.3 Image Search Implementation

```python
def search_similar_images(self, question: str, top_k: int = 2) -> List[Dict]:
    """Search for similar images using Cohere embeddings"""
    # 1. Convert text query to embedding using Cohere embed-v4.0
    # 2. Search against image embeddings in PostgreSQL
    # 3. Return images with base64 data for display
```

**Key Features**:
- Cross-modal search: text queries can find relevant images
- 1536-dimensional embeddings from Cohere
- Includes original image data for answer generation

### 8.4 Answer Generation Systems

#### 8.4.1 OpenAI Integration
```python
def generate_answer_with_openai(self, question: str, text_context: List[Dict], 
                                image_context: List[Dict]) -> str:
    """Generate answer using OpenAI with text context"""
    # Uses LangChain ChatOpenAI with system/human message structure
    # Combines text and image metadata for comprehensive context
```
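
One way to combine text chunks and image metadata into a single prompt is sketched below. The dictionary keys match the result structure in section 8.5; the prompt wording and the `build_context_prompt` helper are assumptions made for illustration, not the project's actual prompt.

```python
from typing import Dict, List


def build_context_prompt(question: str, text_context: List[Dict], image_context: List[Dict]) -> str:
    """Assemble retrieved text chunks and image metadata into one prompt string."""
    lines = ["Answer the question using only the context below.", "", "Text context:"]
    for i, item in enumerate(text_context, start=1):
        lines.append(f"[{i}] ({item['source_file']}) {item['text_content']}")
    lines += ["", "Related images (metadata only):"]
    for item in image_context:
        lines.append(f"- {item['image_name']} from {item['source_file']}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


prompt = build_context_prompt(
    "Where is the tea store?",
    [{"source_file": "document.pdf", "text_content": "The tea store is on level 2."}],
    [{"image_name": "diagram.png", "source_file": "document.pdf"}],
)
print(prompt)
```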

#### 8.4.2 Google Gemini Integration
```python
def generate_answer_with_gemini(self, question: str, text_context: List[Dict], 
                                image_context: List[Dict]) -> str:
    """Generate answer using Google Gemini with both text and image context"""
    # Supports true multimodal input with both text and images
    # Converts base64 images to PIL format for Gemini API
```

**Gemini Advantages**:
- Native multimodal support (text + images simultaneously)
- Retrieved images are passed to the model themselves rather than as metadata only (base64 data is decoded to PIL images first)
- Better visual understanding for complex queries

### 8.5 Complete Query Workflow

```python
def query(self, question: str) -> Dict:
    """Complete query flow: search both text and images, then generate answer"""
    # 1. Search similar texts (top_k from config)
    # 2. Search similar images (top_k from config)
    # 3. Generate answer using retrieved context
    # 4. Return comprehensive result with context information
```

**Result Structure**:
```python
{
    "question": "User's original question",
    "answer": "Generated response",
    "text_context": [{
        "text_content": "Retrieved text chunk",
        "source_file": "document.pdf",
        "similarity": 0.85,
        "metadata": {...}
    }],
    "image_context": [{
        "image_name": "diagram.png",
        "source_file": "document.pdf",
        "similarity": 0.78,
        "base64_data": "...",
        "metadata": {...}
    }]
}
```

This comprehensive result structure allows for transparency in the retrieval process and enables applications to display source materials alongside answers.
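
For example, an application could list the retrieved sources next to the answer. `list_sources` is a hypothetical helper that relies only on the keys shown in the result structure above.

```python
from typing import Dict, List


def list_sources(result: Dict, min_similarity: float = 0.0) -> List[str]:
    """Return one 'kind: file (score)' line per retrieved item, grouped by modality."""
    lines = []
    for kind, key in (("text", "text_context"), ("image", "image_context")):
        hits = [hit for hit in result.get(key, []) if hit["similarity"] >= min_similarity]
        for hit in sorted(hits, key=lambda h: h["similarity"], reverse=True):
            lines.append(f"{kind}: {hit['source_file']} ({hit['similarity']:.2f})")
    return lines


example_result = {
    "question": "User's original question",
    "answer": "Generated response",
    "text_context": [{"source_file": "document.pdf", "similarity": 0.85}],
    "image_context": [{"source_file": "document.pdf", "similarity": 0.78}],
}
print(list_sources(example_result))
# ['text: document.pdf (0.85)', 'image: document.pdf (0.78)']
```

Text and image scores come from different embedding models, so the sketch ranks hits within each modality rather than merging them into one ordering.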

## 9. Technical Implementation Details

### 9.1 Database Schema

#### Text Embeddings Table
```sql
CREATE TABLE text_embeddings (
    id SERIAL PRIMARY KEY,
    text_content TEXT NOT NULL,
    chunk_index INTEGER,
    source_file TEXT,
    embedding VECTOR(3072),  -- OpenAI text-embedding-3-large
    metadata JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

#### Image Embeddings Table
```sql
CREATE TABLE image_embeddings (
    id SERIAL PRIMARY KEY,
    image_name TEXT NOT NULL,
    image_path TEXT,
    source_file TEXT,
    base64_data TEXT,  -- Original image data
    mime_type TEXT,
    embedding VECTOR(1536),  -- Cohere embed-v4.0
    metadata JSONB,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
```

### 9.2 Embedding Strategy

**Text Embeddings**: 
- Model: OpenAI text-embedding-3-large
- Dimensions: 3072
- Optimized for semantic text search

**Image Embeddings**:
- Model: Cohere embed-v4.0
- Dimensions: 1536  
- Supports cross-modal text-to-image search
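
Because each table's `VECTOR(n)` column has a fixed width, a dimension check before insertion can catch a mixed-up model early. The helper below is a hypothetical sketch; only the two dimension values come from this section.

```python
from typing import Sequence

# Dimensions as stated above for each modality.
EXPECTED_DIMENSIONS = {"text": 3072, "image": 1536}


def check_dimension(kind: str, embedding: Sequence[float]) -> None:
    """Raise ValueError if an embedding does not match the column width for its table."""
    expected = EXPECTED_DIMENSIONS[kind]
    if len(embedding) != expected:
        raise ValueError(f"{kind} embedding has {len(embedding)} dimensions, expected {expected}")


check_dimension("text", [0.0] * 3072)   # passes silently
check_dimension("image", [0.0] * 1536)  # passes silently
```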

### 9.3 Performance Optimizations

1. **Vector Indexing**: pgvector indexes for fast similarity search
2. **Image Resizing**: Automatic resizing to prevent API limits
3. **Semantic Chunking**: Preserves context while optimizing chunk size
4. **Connection Pooling**: Efficient database connection management
5. **Batch Processing**: Optimized for processing multiple documents

### 9.4 Error Handling & Logging

The system implements comprehensive error handling:
- Database connection verification
- API rate limit handling  
- File format validation
- Graceful degradation when components fail
- Detailed logging for debugging
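
Rate limit handling is commonly implemented as retry with exponential backoff. The sketch below shows that general technique, not necessarily what this project does: `RateLimitError` is a hypothetical placeholder for whatever exception the provider SDK raises, and `call_with_backoff` is an illustrative helper.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical placeholder for a provider's rate-limit exception."""


def call_with_backoff(func: Callable[[], T], max_attempts: int = 4, base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call func, retrying on RateLimitError with delays of base_delay * 2**attempt."""
    for attempt in range(max_attempts):
        try:
            return func()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so the retry schedule can be tested without actually waiting.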

### 9.5 Configuration Management

All system behavior is controlled through the configuration system:
- Processing modes (text-only, image-only, multimodal)
- Chunking parameters
- Retrieval settings (top-k values)
- Provider selection (OpenAI vs Gemini)
- Database connection settings

## 10. Best Practices & Usage Recommendations

### 10.1 Document Preparation

**Optimal PDF Structure**:
- Clear text formatting for better extraction
- High-quality images (minimum 300 DPI)
- Consistent document structure

**Image Guidelines**:
- Supported formats: PNG, JPG, GIF, BMP, TIFF, WebP
- Recommended resolution: 1024x1024 or higher
- Clear, well-lit images for better embeddings

### 10.2 Query Optimization

**Effective Queries**:
- Be specific about what you're looking for
- Include context about document types or content
- Use descriptive language for image searches

**Example Good Queries**:
- "Find a floor plan that includes a tea store and gaming space"
- "What are the technical specifications mentioned in the manual?"
- "Show me diagrams related to network architecture"

### 10.3 Performance Tuning

**Database Optimization**:
```sql
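-- Create indexes for better performance.
-- pgvector's ivfflat and hnsw indexes accept at most 2,000 dimensions for the vector type,
-- so the 3072-dimensional text column is indexed through a halfvec cast (pgvector 0.7.0+).
-- Queries must use the same cast expression for this index to be used.
CREATE INDEX idx_text_embeddings_vector ON text_embeddings USING ivfflat ((embedding::halfvec(3072)) halfvec_cosine_ops);
CREATE INDEX idx_image_embeddings_vector ON image_embeddings USING ivfflat (embedding vector_cosine_ops);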
```
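
An ivfflat index has two tuning knobs: `lists`, set at creation time with `WITH (lists = N)`, and `probes`, set per session with `SET ivfflat.probes = N;`. The helper below encodes the starting points suggested in the pgvector documentation (rows / 1000 lists up to one million rows, the square root of the row count beyond that, and probes near the square root of lists). `suggest_ivfflat_params` is a hypothetical name, and the values are starting points to measure against, not guarantees.

```python
import math
from typing import Dict


def suggest_ivfflat_params(row_count: int) -> Dict[str, int]:
    """Starting values for ivfflat `lists` and `probes` based on table size."""
    if row_count <= 1_000_000:
        lists = max(1, row_count // 1000)
    else:
        lists = max(1, round(math.sqrt(row_count)))
    probes = max(1, round(math.sqrt(lists)))
    return {"lists": lists, "probes": probes}


print(suggest_ivfflat_params(50_000))  # {'lists': 50, 'probes': 7}
```

More probes improve recall at the cost of speed, so it is worth checking both on a representative set of queries.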

**Configuration Tuning**:
- Adjust `top_k` values based on your use case
- Optimize chunk size for your document types
- Choose the appropriate answer provider for your needs

### 10.4 Monitoring & Maintenance

**Regular Maintenance**:
- Monitor database size and performance
- Update embeddings when documents change
- Review query performance and adjust indexes
- Monitor API usage and costs

**Debugging Tools**:
- Check processing logs for errors
- Verify embedding dimensions
- Test similarity scores for relevance
- Review metadata for completeness

## 11. Conclusion

The Vision RAG system combines large language models with multimodal retrieval. By processing both text and visual content, it can ground its answers in material that a text-only RAG pipeline cannot use, such as diagrams and floor plans.

### Key Achievements

1. **Multimodal Processing**: Successfully handles both text and image content from complex documents
2. **Advanced Chunking**: Uses semantic segmentation for better context preservation
3. **Flexible Architecture**: Supports multiple embedding providers and answer generation models
4. **Production Ready**: Includes comprehensive error handling, logging, and monitoring capabilities
5. **Scalable Design**: Built on PostgreSQL with pgvector for efficient similarity search at scale

### Future Enhancements

- **Additional File Formats**: Support for more document types (Word, PowerPoint, etc.)
- **Real-time Processing**: Live document updates and incremental ingestion
- **Advanced Chunking**: Domain-specific chunking strategies
- **Multi-language Support**: Embedding models for different languages
- **Enhanced Metadata**: Richer document structure understanding

This system provides a solid foundation for applications requiring sophisticated document understanding, from enterprise knowledge management to research assistance and automated content analysis.