# Multimodal RAG with RAG-Anything and Local Ollama

This notebook demonstrates how to build a multimodal Retrieval-Augmented Generation (RAG) system using the RAG-Anything library integrated with local Ollama models. We'll cover:

- Setting up the environment with RAG-Anything and Ollama
- Configuring custom model functions for text LLM, vision, and embeddings
- Processing multimodal documents (text + images)
- Performing queries with and without vision enhancement

All models run locally through Ollama, so documents never leave your machine and there are no per-request API fees.

In [None]:
# Install RAG-Anything with all extras (quoted so the shell does not expand the brackets)
%pip install "raganything[all]"

## Prerequisites

Before running this notebook, ensure you have Ollama installed and running locally on your machine.

1. Install Ollama from https://ollama.ai/
2. Start the Ollama service (run `ollama serve` or launch the desktop app); by default it listens on `http://localhost:11434`
3. Pull the required models:

```bash
ollama pull llama3.2
ollama pull llava
ollama pull nomic-embed-text
```
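
Before moving on, it can help to confirm that the server is reachable and that the three models are present. The sketch below assumes Ollama's default address and its `GET /api/tags` endpoint, which lists locally pulled models. The helper names `installed_models` and `missing_models` are illustrative, and the standard library's `urllib` is used so the check runs before any packages are installed.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address (assumed unchanged)
REQUIRED_MODELS = ["llama3.2", "llava", "nomic-embed-text"]

def installed_models(base_url=OLLAMA_URL, timeout=5):
    """Return the names of locally available models, e.g. 'llava:latest'."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["name"] for entry in payload.get("models", [])]

def missing_models(required, installed):
    """Return the required models that have no installed counterpart.

    Installed names carry a tag suffix such as ':latest', so a requirement
    matches either the full name or the part before the colon.
    """
    full_names = set(installed)
    base_names = {name.split(":", 1)[0] for name in installed}
    return [m for m in required if m not in full_names and m not in base_names]
```

With the server running, `missing_models(REQUIRED_MODELS, installed_models())` should return an empty list once all three pulls have finished.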

Note: `llava` can be swapped for another vision-capable model such as `qwen3-vl`. If you prefer it, pull it the same way and replace `llava` with its name wherever the vision model is referenced.

In [None]:
# Import required libraries (standard library first, then third-party)
import asyncio
import base64
import json
import os

import requests
from raganything import RAGAnything, RAGAnythingConfig  # the package imports as `raganything`

## Ollama Utility Functions

These functions are thin wrappers around the HTTP API exposed by the local Ollama server.

In [None]:
# Ollama utility functions
def chat_with_ollama(prompt, model="llama3.2"):
    """Send a prompt to Ollama's /api/generate endpoint and return the full reply text."""
    try:
        response = requests.post(
            "http://localhost:11434/api/generate",
            # "stream": False returns one JSON object. A streamed reply is
            # newline-delimited JSON, which response.json() cannot parse.
            json={"model": model, "prompt": prompt, "stream": False},
            timeout=60,
        )
        response.raise_for_status()
        return response.json()["response"]
    except requests.exceptions.RequestException as e:
        raise RuntimeError(f"Error communicating with Ollama: {e}") from e

def embed_with_ollama(text, model="nomic-embed-text"):
    """Return the embedding vector (a list of floats) for a single text."""
    try:
        response = requests.post(
            "http://localhost:11434/api/embeddings",
            json={"model": model, "prompt": text},
            timeout=30,
        )
        response.raise_for_status()
        return response.json()["embedding"]
    except requests.exceptions.RequestException as e:
        raise RuntimeError(f"Error getting embeddings from Ollama: {e}") from e

def vision_with_ollama(image_path, prompt, model="llava"):
    """Send an image and a prompt to an Ollama vision model and return its reply."""
    try:
        # Ollama expects images as base64-encoded strings.
        with open(image_path, "rb") as f:
            image_b64 = base64.b64encode(f.read()).decode("utf-8")

        response = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": model, "prompt": prompt, "images": [image_b64], "stream": False},
            timeout=120,
        )
        response.raise_for_status()
        return response.json()["response"]
    except FileNotFoundError as e:
        raise FileNotFoundError(f"Image file not found: {image_path}") from e
    except requests.exceptions.RequestException as e:
        raise RuntimeError(f"Error communicating with Ollama vision model: {e}") from e
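
When `"stream"` is true, Ollama's `/api/generate` endpoint replies with newline-delimited JSON: one object per line, each carrying a fragment in `response`, and a final object with `done` set to true. A single `response.json()` call cannot parse that. The sketch below shows one way to assemble such a reply. `collect_stream` is a hypothetical helper, and the sample lines imitate the server's output so the cell runs without Ollama.

```python
import json

def collect_stream(lines):
    """Join the 'response' fragments of a streamed /api/generate reply.

    `lines` is any iterable of JSON strings or bytes, one object per line,
    such as the values yielded by requests.Response.iter_lines().
    """
    parts = []
    for line in lines:
        if not line:  # skip keep-alive blank lines
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # the final object marks the end of the reply
            break
    return "".join(parts)

# Simulated stream in the shape Ollama emits; no server is needed for this demo.
sample = [
    b'{"response": "Hello", "done": false}',
    b'{"response": ", world", "done": false}',
    b'{"response": "", "done": true}',
]
print(collect_stream(sample))  # Hello, world
```

Against a live server, the same helper could consume `response.iter_lines()` from a `requests.post(..., stream=True)` call whose JSON body sets `"stream": True`.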

## Custom Function Definitions

These functions wrap the Ollama utilities to match the RAG-Anything API requirements.

In [None]:
# Custom model functions for RAG-Anything
def llm_model_func(prompt):
    """Text LLM function using Ollama llama3.2."""
    return chat_with_ollama(prompt, model="llama3.2")

def vision_model_func(image_path, prompt):
    """Vision model function using Ollama llava."""
    return vision_with_ollama(image_path, prompt, model="llava")

def embedding_func(text):
    """Embedding function using Ollama nomic-embed-text."""
    return embed_with_ollama(text, model="nomic-embed-text")
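
The wrappers above are blocking. If the pipeline calls its model functions as coroutines (an assumption here; check the version you have installed), a blocking function can be adapted with `asyncio.to_thread`, available from Python 3.9. `make_async` is a hypothetical helper, and `fake_llm` stands in for `llm_model_func` so the sketch runs without Ollama.

```python
import asyncio
import functools

def make_async(func):
    """Wrap a blocking callable so it can be awaited without blocking the event loop."""
    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        # Run the blocking call in a worker thread.
        return await asyncio.to_thread(func, *args, **kwargs)
    return wrapper

def fake_llm(prompt):
    """Stand-in for llm_model_func; a real call would reach the Ollama server."""
    return f"echo: {prompt}"

async_llm = make_async(fake_llm)
```

In a notebook cell, `await async_llm("hi")` works directly because the kernel already runs an event loop. In a plain script, use `asyncio.run(async_llm("hi"))`.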

## Configuration

Set up the RAG-Anything configuration with our custom model functions.

In [None]:
# RAG-Anything configuration
config = RAGAnythingConfig(
    llm_model_func=llm_model_func,
    vision_model_func=vision_model_func,
    embedding_func=embedding_func,
    embedding_dim=768  # nomic-embed-text produces 768-dimensional embeddings
)
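
The `embedding_dim` value has to match what the embedding model actually returns, or vector storage is likely to fail later with a less obvious error. A minimal sketch of a check that could run before initialization follows. `check_embedding_dim` is a hypothetical helper, and `fake_embed` replaces the real embedding function so the cell runs without Ollama.

```python
def check_embedding_dim(embed, expected_dim, probe="dimension probe"):
    """Embed a short probe string and confirm the vector length."""
    vector = embed(probe)
    if len(vector) != expected_dim:
        raise ValueError(
            f"Embedding model returned {len(vector)} dimensions, expected {expected_dim}"
        )
    return len(vector)

# Stand-in embedder; with the server running, pass the notebook's
# embedding_func instead.
def fake_embed(text):
    return [0.0] * 768

print(check_embedding_dim(fake_embed, 768))  # 768
```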

## Initialization

Create the RAG-Anything instance with the configured settings.

In [None]:
# Initialize RAG-Anything
rag = RAGAnything(config)
print("RAG-Anything initialized successfully!")

## Document Processing Example

Process a sample PDF document. Make sure you have a PDF file available (e.g., 'sample.pdf' in the current directory).

In [None]:
# Document processing example
# Replace 'sample.pdf' with the path to your PDF document
document_path = "sample.pdf"

if os.path.exists(document_path):
    try:
        # Process the document (this may take some time for large documents)
        rag.process_document(document_path)
        print(f"Document '{document_path}' processed successfully!")
    except Exception as e:
        print(f"Error processing document: {e}")
else:
    print(f"Document '{document_path}' not found. Please provide a valid PDF path.")
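
To process more than one file, the same existence check can be extended to a directory. The sketch below uses `pathlib`. `find_documents` is a hypothetical helper, and the demo creates empty throwaway files in a temporary directory rather than calling the RAG pipeline.

```python
import tempfile
from pathlib import Path

def find_documents(root, suffixes=(".pdf",)):
    """Return files under `root` whose extension matches, sorted by path."""
    wanted = {s.lower() for s in suffixes}
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in wanted
    )

# Demo on temporary files; in the notebook each path would be handed to the
# document-processing call shown above.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.PDF", "notes.txt"):
        (Path(tmp) / name).write_bytes(b"")
    names = [p.name for p in find_documents(tmp)]
    print(names)
```

Matching is case-insensitive on the extension, so `a.PDF` is picked up alongside `b.pdf` while `notes.txt` is skipped.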

## Querying Examples

The cell below runs two kinds of query: a text-only query against the processed document, and a vision-enhanced query that also passes an image.

In [None]:
# Querying examples

# Text query
text_query = "What is the main topic of this document?"
try:
    text_result = rag.query(text_query)
    print("Text Query Result:")
    print(text_result)
    print("\n" + "="*50 + "\n")
except Exception as e:
    print(f"Error with text query: {e}")

# Vision-enhanced query (requires an image file)
image_path = "sample_image.jpg"  # Replace with actual image path
vision_query = "Describe what you see in this image and how it relates to the document content."

if os.path.exists(image_path):
    try:
        vision_result = rag.query_with_vision(vision_query, image_path=image_path)
        print("Vision-Enhanced Query Result:")
        print(vision_result)
    except Exception as e:
        print(f"Error with vision query: {e}")
else:
    print(f"Image '{image_path}' not found. Skipping vision query example.")

## Results Display

The results from the queries are displayed above. In a real application, you might want to format these results better or integrate them into a user interface.
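
As one possible starting point for nicer output, the sketch below wraps a result to a fixed width under a ruled title using the standard library's `textwrap`. `format_result` is a hypothetical helper, and the sample text is made up for the demo.

```python
import textwrap

def format_result(title, text, width=80):
    """Return a titled, word-wrapped block suitable for printing.

    Runs of whitespace (including newlines) are collapsed to single spaces
    before wrapping, so paragraph breaks in the answer are not preserved.
    """
    rule = "=" * width
    body = textwrap.fill(" ".join(str(text).split()), width=width)
    return f"{rule}\n{title}\n{rule}\n{body}"

print(format_result("Text Query Result", "The document covers   multimodal retrieval.", width=40))
```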

This notebook provides a complete example of setting up multimodal RAG with local Ollama models. You can extend this by:

- Adding more document types
- Implementing conversation memory
- Creating a web interface
- Fine-tuning the retrieval parameters
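
As an illustration of the conversation-memory idea, the sketch below keeps the last few question/answer pairs and prepends them to the next question. `ConversationMemory` and its prompt layout are hypothetical and not part of RAG-Anything or Ollama. The resulting string would be passed as the query text.

```python
from collections import deque

class ConversationMemory:
    """Keep the most recent turns and build a prompt that includes them."""

    def __init__(self, max_turns=3):
        # A bounded deque discards the oldest turn automatically.
        self.turns = deque(maxlen=max_turns)

    def add(self, question, answer):
        self.turns.append((question, answer))

    def build_prompt(self, question):
        history = "\n".join(f"User: {q}\nAssistant: {a}" for q, a in self.turns)
        current = f"User: {question}"
        return f"{history}\n{current}" if history else current

memory = ConversationMemory(max_turns=2)
memory.add("What is the main topic?", "Multimodal retrieval.")
print(memory.build_prompt("Which models does it use?"))
```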