# Multimodal RAG with LlamaIndex

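Multimodal RAG extends traditional RAG to handle images, tables, and other non-text content. This notebook shows how to query images with a vision LLM and how to index image descriptions and tables as text so that a standard LlamaIndex vector index can retrieve them.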

## Learning Objectives

By the end of this notebook, you will:
1. Understand multimodal RAG architecture
2. Process documents with images and tables
3. Use vision-language models for retrieval
4. Build applications that understand visual content

---

## What is Multimodal RAG?

Traditional RAG handles only text. **Multimodal RAG** extends this to:
- **Images**: Photos, diagrams, charts
- **Tables**: Structured data in documents
- **Mixed content**: PDFs with text, images, and tables

### Approaches to Multimodal RAG

| Approach | Description | Best For |
|----------|-------------|----------|
| **Text extraction** | Extract text from images/tables | Simple documents |
| **Image embeddings** | Embed images alongside text | Visual similarity |
| **Vision LLM** | Send images to a vision-capable model such as GPT-4o | Complex visual reasoning |
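
To make the image-embeddings row concrete, here is a minimal sketch of similarity ranking over embedding vectors. The three-dimensional vectors and file names are made up for illustration. In practice the vectors would come from an image embedding model such as CLIP, which this notebook does not set up. The function names are hypothetical as well.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # Treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query: List[float], embeddings: Dict[str, List[float]], top_k: int = 2
) -> List[Tuple[str, float]]:
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical toy embeddings; real ones have hundreds of dimensions.
toy_image_embeddings = {
    "bar_chart.png": [0.9, 0.1, 0.0],
    "line_chart.png": [0.8, 0.3, 0.1],
    "cat_photo.png": [0.0, 0.2, 0.9],
}

query_embedding = [1.0, 0.2, 0.0]  # Stand-in for an embedded query image or text
for name, score in rank_by_similarity(query_embedding, toy_image_embeddings):
    print(f"{name}: {score:.3f}")
```

The two chart images rank above the photo because their vectors point in nearly the same direction as the query. A vector store performs the same comparison at scale.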

In [None]:
# Setup
import nest_asyncio
nest_asyncio.apply()

from dotenv import load_dotenv
load_dotenv()

from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.multi_modal_llms.openai import OpenAIMultiModal

# Configure
Settings.llm = OpenAI(model="gpt-4o-mini")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")

print("✓ Setup complete!")

## 1. Vision Language Models

First, let's understand how to use vision-capable LLMs:

In [None]:
# Initialize the multimodal LLM (GPT-4o accepts image input)
multimodal_llm = OpenAIMultiModal(
    model="gpt-4o",  # Vision-capable model
    max_new_tokens=500,
)

print("✓ Multimodal LLM initialized!")
print(f"Model: {multimodal_llm.model}")

In [None]:
from llama_index.core.schema import ImageDocument
import base64
import requests
from pathlib import Path

# Helper function to create an image document from a URL
def create_image_doc_from_url(url: str) -> ImageDocument:
    """Download an image and wrap it in an ImageDocument."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # Fail early instead of encoding an error page
    image_data = base64.b64encode(response.content).decode()
    return ImageDocument(
        image=image_data,
        image_url=url,
    )

print("✓ Helper function defined!")

In [None]:
# Example: Analyze an image with vision LLM
# Using a sample image from the web
sample_image_url = "https://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Python-logo-notext.svg/200px-Python-logo-notext.svg.png"

try:
    image_doc = create_image_doc_from_url(sample_image_url)
    
    # Ask the vision LLM about the image
    response = multimodal_llm.complete(
        prompt="What is shown in this image? Describe it in detail.",
        image_documents=[image_doc],
    )
    
    print(f"Image analysis:\n{response}")
except Exception as e:
    print("Note: image analysis requires a vision-capable model and a valid image URL.")
    print(f"Error: {e}")
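
The helper above fetches images over HTTP. For images already on disk, the same base64 encoding can be done with the standard library alone. The sketch below assumes nothing about LlamaIndex. `encode_image_file` and `guess_image_mimetype` are hypothetical names, and the sample file holds only the PNG signature bytes, not a real image.

```python
import base64
import mimetypes
import tempfile
from pathlib import Path


def encode_image_file(path: Path) -> str:
    """Read an image file and return its contents as a base64 string."""
    return base64.b64encode(path.read_bytes()).decode("ascii")


def guess_image_mimetype(path: Path) -> str:
    """Guess the MIME type from the file extension, with a generic fallback."""
    mimetype, _ = mimetypes.guess_type(path.name)
    return mimetype or "application/octet-stream"


# Usage with a throwaway file so the example runs anywhere
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "pixel.png"
    sample_path.write_bytes(b"\x89PNG\r\n\x1a\n")  # PNG signature only, not a full image
    encoded = encode_image_file(sample_path)
    mimetype = guess_image_mimetype(sample_path)

print(mimetype, encoded)
```

The encoded string plays the same role as `image_data` in `create_image_doc_from_url`, so it could presumably be passed as `image=` when constructing an `ImageDocument`.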

## 2. Multimodal Document Processing

Process documents that contain both text and images. Here each image is represented by a text description, so a standard text index can retrieve it:

In [None]:
from llama_index.core.schema import TextNode, ImageNode
from typing import List

class MultimodalDocument:
    """A document with both text and image content."""
    
    def __init__(self, doc_id: str):
        self.doc_id = doc_id
        self.text_nodes: List[TextNode] = []
        self.image_nodes: List[ImageNode] = []  # Unused here: images are indexed via text descriptions
    
    def add_text(self, text: str, metadata: dict = None):
        """Add a text section."""
        node = TextNode(
            text=text,
            metadata={
                "doc_id": self.doc_id,
                "type": "text",
                **(metadata or {}),
            }
        )
        self.text_nodes.append(node)
    
    def add_image_description(self, description: str, image_ref: str, metadata: dict = None):
        """Add an image with its description."""
        # For retrieval, we index the description as a TextNode; image_ref points back to the image file
        node = TextNode(
            text=f"[IMAGE: {image_ref}]\n{description}",
            metadata={
                "doc_id": self.doc_id,
                "type": "image",
                "image_ref": image_ref,
                **(metadata or {}),
            }
        )
        self.text_nodes.append(node)
    
    def get_all_nodes(self) -> List[TextNode]:
        """Get all nodes for indexing."""
        return self.text_nodes

print("✓ MultimodalDocument class defined!")

In [None]:
# Create a sample multimodal document
doc = MultimodalDocument(doc_id="ai_overview")

# Add text content
doc.add_text("""
# Introduction to Artificial Intelligence

Artificial Intelligence (AI) is a branch of computer science that aims to create 
intelligent machines that can perform tasks typically requiring human intelligence.
""")

# Add image description (hand-written here, standing in for vision-LLM output)
doc.add_image_description(
    description="A diagram showing the relationship between AI, Machine Learning, "
                "and Deep Learning as nested circles. AI is the outermost circle, "
                "containing Machine Learning, which contains Deep Learning.",
    image_ref="ai_ml_dl_diagram.png"
)

doc.add_text("""
## Machine Learning

Machine Learning is a subset of AI that enables systems to learn from data 
and improve their performance without explicit programming.
""")

# Add another image description
doc.add_image_description(
    description="A flowchart showing the machine learning pipeline: Data Collection → "
                "Data Preprocessing → Model Training → Model Evaluation → Deployment.",
    image_ref="ml_pipeline.png"
)

doc.add_text("""
## Deep Learning

Deep Learning uses neural networks with many layers to learn hierarchical 
representations of data. It excels at tasks like image recognition and NLP.
""")

print(f"Created document with {len(doc.get_all_nodes())} nodes")

In [None]:
# Index the multimodal document
nodes = doc.get_all_nodes()

# Create index from nodes
mm_index = VectorStoreIndex(nodes=nodes)

# Create query engine
mm_query_engine = mm_index.as_query_engine(
    similarity_top_k=3,
)

print("✓ Multimodal index created!")

In [None]:
# Query about visual content
queries = [
    "What does the AI/ML diagram show?",
    "Describe the machine learning pipeline.",
    "What is the relationship between AI and deep learning?",
]

for query in queries:
    print(f"\nQ: {query}")
    response = mm_query_engine.query(query)
    print(f"A: {response}")
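
Because image descriptions are indexed as text, an answer can draw on an image node without saying which image it came from. The `image_ref` stored in the metadata makes that recoverable. The sketch below uses plain dictionaries shaped like the metadata written by `add_image_description`. With LlamaIndex, the same dictionaries should be reachable through `response.source_nodes` (as `source.node.metadata`), but that wiring is an assumption and is not exercised here. `collect_image_refs` is a hypothetical helper.

```python
from typing import Dict, List


def collect_image_refs(retrieved_metadata: List[Dict]) -> List[str]:
    """Return unique image references from retrieved nodes, in retrieval order."""
    refs: List[str] = []
    for metadata in retrieved_metadata:
        if metadata.get("type") != "image":
            continue  # Only image-description nodes carry an image_ref
        ref = metadata.get("image_ref")
        if ref and ref not in refs:
            refs.append(ref)
    return refs


# Hypothetical retrieval result mirroring the metadata layout used above
retrieved = [
    {"doc_id": "ai_overview", "type": "image", "image_ref": "ai_ml_dl_diagram.png"},
    {"doc_id": "ai_overview", "type": "text"},
    {"doc_id": "ai_overview", "type": "image", "image_ref": "ai_ml_dl_diagram.png"},
]
print(collect_image_refs(retrieved))
```

An application could use the returned references to display the original images next to the generated answer.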

## 3. Table Understanding

Handle tabular data in documents:

In [None]:
import pandas as pd

# Create sample tables
ml_algorithms = pd.DataFrame({
    "Algorithm": ["Linear Regression", "Decision Tree", "Random Forest", "SVM", "Neural Network"],
    "Type": ["Regression", "Both", "Both", "Both", "Both"],
    "Complexity": ["Low", "Medium", "High", "High", "Very High"],
    "Interpretability": ["High", "High", "Low", "Low", "Very Low"],
})

print("Sample table:")
print(ml_algorithms.to_string(index=False))

In [None]:
def table_to_text(df: pd.DataFrame, table_name: str) -> str:
    """Convert a pandas DataFrame to a searchable text representation."""
    lines = [f"TABLE: {table_name}"]
    lines.append(f"Columns: {', '.join(df.columns)}")
    lines.append("")
    
    # Add markdown representation (DataFrame.to_markdown requires the optional `tabulate` package)
    lines.append(df.to_markdown(index=False))
    lines.append("")
    
    # Add row-by-row natural language description
    lines.append("Row descriptions:")
    for _, row in df.iterrows():
        desc = ", ".join([f"{col}: {val}" for col, val in row.items()])
        lines.append(f"- {desc}")
    
    return "\n".join(lines)

# Convert table to text
table_text = table_to_text(ml_algorithms, "Machine Learning Algorithms Comparison")
print(table_text)
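
`table_to_text` puts the whole table into a single node, which works for five rows but can grow past a sensible chunk size for larger tables. One option is to split the rows into groups and repeat the header in each group so every chunk stays self-describing. The following is a minimal standard-library sketch. `chunk_table_rows` and the `rows_per_chunk` default are illustrative choices, not part of LlamaIndex.

```python
from typing import Dict, List


def chunk_table_rows(
    rows: List[Dict[str, str]], table_name: str, rows_per_chunk: int = 2
) -> List[str]:
    """Split table rows into text chunks, repeating the header in each chunk."""
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    if not rows:
        return []
    columns = list(rows[0].keys())
    header = f"TABLE: {table_name}\nColumns: {', '.join(columns)}"
    chunks = []
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        lines = [header]
        for row in group:
            lines.append("- " + ", ".join(f"{col}: {row[col]}" for col in columns))
        chunks.append("\n".join(lines))
    return chunks


sample_rows = [
    {"Algorithm": "Linear Regression", "Complexity": "Low"},
    {"Algorithm": "Decision Tree", "Complexity": "Medium"},
    {"Algorithm": "Random Forest", "Complexity": "High"},
]
for chunk in chunk_table_rows(sample_rows, "ML Algorithms"):
    print(chunk, end="\n\n")
```

For a DataFrame, `df.to_dict(orient="records")` produces the list-of-dicts shape this function expects, and each returned chunk could become its own `TextNode`.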

In [None]:
# Create document with table content
table_doc = MultimodalDocument(doc_id="ml_comparison")

table_doc.add_text("""
# Comparison of Machine Learning Algorithms

This document compares different machine learning algorithms based on 
their type, complexity, and interpretability.
""")

table_doc.add_text(table_text, metadata={"type": "table", "table_name": "ML Algorithms"})

table_doc.add_text("""
## Key Insights

- Linear Regression is the simplest and among the most interpretable
- Neural Networks are the most complex and the least interpretable
- Random Forest typically performs well, at the cost of high complexity and low interpretability
""")

# Index
table_nodes = table_doc.get_all_nodes()
table_index = VectorStoreIndex(nodes=table_nodes)
table_engine = table_index.as_query_engine()

print("✓ Table-aware index created!")

In [None]:
# Query about table content
table_queries = [
    "Which algorithm has the highest interpretability?",
    "Compare Random Forest and SVM complexity.",
    "Which algorithms can be used for both classification and regression?",
]

for query in table_queries:
    print(f"\nQ: {query}")
    response = table_engine.query(query)
    print(f"A: {response}")

## 4. Multimodal Pipeline Class

A complete pipeline for handling multimodal documents:

In [None]:
from typing import Optional, Dict, Any
from dataclasses import dataclass

@dataclass
class MultimodalContent:
    """Container for multimodal content."""
    content_type: str  # 'text', 'image', 'table'
    content: Any
    description: Optional[str] = None
    metadata: Optional[Dict] = None

class MultimodalRAGPipeline:
    """A complete pipeline for multimodal RAG."""
    
    def __init__(self, vision_llm=None):
        self.vision_llm = vision_llm
        self.nodes = []
        self.index = None
        self.query_engine = None
    
    def add_text(self, text: str, metadata: dict = None):
        """Add text content."""
        node = TextNode(
            text=text,
            metadata={"type": "text", **(metadata or {})}
        )
        self.nodes.append(node)
    
    def add_image(self, image_description: str, image_path: str = None, metadata: dict = None):
        """Add image content (via description)."""
        text = f"[IMAGE: {image_path or 'embedded'}]\nDescription: {image_description}"
        node = TextNode(
            text=text,
            metadata={
                "type": "image",
                "image_path": image_path,
                **(metadata or {})
            }
        )
        self.nodes.append(node)
    
    def add_table(self, df: pd.DataFrame, table_name: str, metadata: dict = None):
        """Add table content."""
        text = table_to_text(df, table_name)
        node = TextNode(
            text=text,
            metadata={
                "type": "table",
                "table_name": table_name,
                **(metadata or {})
            }
        )
        self.nodes.append(node)
    
    def build_index(self):
        """Build the vector index."""
        if not self.nodes:
            raise ValueError("No content added yet")
        
        self.index = VectorStoreIndex(nodes=self.nodes)
        self.query_engine = self.index.as_query_engine(similarity_top_k=5)
        print(f"✓ Index built with {len(self.nodes)} nodes")
    
    def query(self, question: str) -> str:
        """Query the multimodal index."""
        if not self.query_engine:
            raise ValueError("Index not built. Call build_index() first.")
        
        response = self.query_engine.query(question)
        return str(response)
    
    def get_stats(self) -> dict:
        """Get statistics about the content."""
        stats = {
            "total_nodes": len(self.nodes),
            "text_nodes": len([n for n in self.nodes if n.metadata.get("type") == "text"]),
            "image_nodes": len([n for n in self.nodes if n.metadata.get("type") == "image"]),
            "table_nodes": len([n for n in self.nodes if n.metadata.get("type") == "table"]),
        }
        return stats

print("✓ MultimodalRAGPipeline defined!")

In [None]:
# Use the pipeline
pipeline = MultimodalRAGPipeline()

# Add various content types
pipeline.add_text("""
# Deep Learning Architectures

Deep learning uses various neural network architectures for different tasks.
Common architectures include CNNs, RNNs, and Transformers.
""")

pipeline.add_image(
    image_description="A CNN architecture diagram showing input layer, "
                      "convolutional layers with pooling, fully connected layers, "
                      "and output layer for image classification.",
    image_path="cnn_architecture.png"
)

# Add architecture comparison table
architectures = pd.DataFrame({
    "Architecture": ["CNN", "RNN", "LSTM", "Transformer"],
    "Best For": ["Images", "Sequences", "Long sequences", "General-purpose (text, vision, audio)"],
    "Key Feature": ["Convolution", "Recurrence", "Memory gates", "Attention"],
    "Year Introduced": [1998, 1986, 1997, 2017],
})

pipeline.add_table(architectures, "Deep Learning Architectures")

pipeline.add_text("""
## Transformers Revolution

Transformers, introduced in the 2017 paper "Attention Is All You Need", 
have revolutionized NLP and are now being applied to vision tasks.
""")

# Build and query
pipeline.build_index()
print(f"\nStats: {pipeline.get_stats()}")

In [None]:
# Test multimodal queries
test_queries = [
    "What does the CNN diagram show?",
    "When was the Transformer architecture introduced?",
    "Which architecture is best for image processing?",
    "What is the key feature of LSTM?",
]

for q in test_queries:
    print(f"\nQ: {q}")
    print(f"A: {pipeline.query(q)}")
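
`MultimodalRAGPipeline` accepts a `vision_llm` but never calls it, so every image description above is written by hand. One way to close that gap is to inject a captioning function. In the sketch below, `describe` is any callable that maps an image path to text. In this notebook it could wrap a call to `multimodal_llm.complete`, but a stub is used here so the example runs without an API key. `caption_images` and `stub_describe` are hypothetical names.

```python
from typing import Callable, Dict, Iterable


def caption_images(
    image_paths: Iterable[str], describe: Callable[[str], str]
) -> Dict[str, str]:
    """Map each image path to a description, skipping images whose captioning fails."""
    captions: Dict[str, str] = {}
    for path in image_paths:
        try:
            description = describe(path).strip()
        except Exception as exc:  # A real caller may want narrower exception types
            print(f"Skipping {path}: {exc}")
            continue
        if description:
            captions[path] = description
    return captions


def stub_describe(path: str) -> str:
    """Stand-in for a vision LLM call; returns a canned description."""
    if path.endswith(".txt"):
        raise ValueError("not an image")
    return f"Placeholder description of {path}"


captions = caption_images(["cnn_architecture.png", "notes.txt"], stub_describe)
print(captions)
```

Each resulting pair could then be passed to `pipeline.add_image(image_description=..., image_path=...)` before calling `build_index()`.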

## 5. Summary

You've learned multimodal RAG techniques:

### Key Takeaways

| Technique | Description | Use Case |
|-----------|-------------|----------|
| **Vision LLM** | GPT-4o (or another vision-capable model) for image understanding | Complex visual reasoning |
| **Image descriptions** | Index image descriptions as text | Visual content retrieval |
| **Table processing** | Convert tables to searchable text | Structured data Q&A |
| **Multimodal pipeline** | Unified handling of mixed content | Complete documents |

### Best Practices

1. **Describe images thoroughly** for better retrieval
2. **Include table structure** in text representation
3. **Use metadata** to track content types
4. **Consider LlamaParse** for complex PDFs

### Next Steps

In the next notebook, we'll explore GraphRAG for entity relationships.

---

## Exercises

1. **PDF processing**: Process a PDF with mixed content using LlamaParse

2. **Chart understanding**: Create a pipeline that describes chart images

3. **Table Q&A**: Build a specialized engine for tabular data

4. **Image similarity**: Implement image-to-image retrieval

In [None]:
# Exercise space
# Build your multimodal application here!