# PDF Querying with ChromaDB and ChatGPT

> PyCon Ireland 2024

## Overview

This script processes a PDF document, extracts its content, and stores it in ChromaDB.
It then uses a language model (ChatGPT or Ollama) to generate context-aware responses.

## How It Works

1. **PDF Ingestion**:
   - Downloads a sample PDF if not already available.
   - Extracts text from the PDF and splits it into manageable chunks.

2. **ChromaDB Setup**:
   - Initializes a ChromaDB client and creates or retrieves a collection.
   - Stores the text chunks in the ChromaDB collection using `upsert`.

3. **Querying**:
   - Searches the ChromaDB collection for chunks most similar to the user’s query.
   - Passes the top matches (three by default) as context to the selected language model (ChatGPT or Ollama).

4. **Response Generation**:
   - The language model generates a detailed response based on the retrieved context and user query.
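
The chunking and prompt-assembly steps above can be tried without ChromaDB or a language model. The following is a minimal sketch: `split_text_into_chunks` mirrors the function in the script, while `build_prompt` is a hypothetical helper (not part of the script) that reproduces the prompt format the script builds inline.

```python
from typing import List


def split_text_into_chunks(text: str, chunk_size: int = 500) -> List[str]:
    """Cut text into consecutive, non-overlapping slices of chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def build_prompt(query: str, context: List[str]) -> str:
    """Hypothetical helper: join the retrieved chunks and prepend them to the question."""
    return f"Context: {' '.join(context)}\n\nQuestion: {query}"


# A 10-character string split into chunks of 4 leaves a shorter final chunk.
chunks = split_text_into_chunks("abcdefghij", chunk_size=4)
print(chunks)  # ['abcd', 'efgh', 'ij']

# In the real workflow the context would be the chunks returned by the vector search.
print(build_prompt("What is this?", chunks[:2]))
```

Because the split is purely character-based, a sentence can be cut across two chunks; the chunk size is a trade-off between retrieval precision and context length.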

## Key Features

- **Configurable**: Supports ChatGPT (via OpenAI API) and Ollama (local model).
- **Vector-Based Search**: Efficient querying using semantic similarity.
- **Contextual Responses**: Provides answers tailored to the retrieved PDF content.
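
To illustrate what "semantic similarity" means here, the sketch below ranks hand-made toy vectors by cosine similarity. It is an illustration only: ChromaDB computes embeddings and distances internally, and its configured distance metric may differ from cosine similarity. The vectors and the `rank_chunks` helper are assumptions made for this example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Hypothetical helper: score every chunk against the query and keep the best top_k."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Toy 3-dimensional "embeddings"; real embeddings have hundreds of dimensions.
toy_embeddings = {
    "doc_0": [1.0, 0.0, 0.0],
    "doc_1": [0.8, 0.6, 0.0],
    "doc_2": [0.0, 0.0, 1.0],
}
ranking = rank_chunks([1.0, 0.1, 0.0], toy_embeddings, top_k=2)
print([cid for cid, _ in ranking])  # ['doc_0', 'doc_1']
```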

## Requirements

- Python 3.8 or higher
- OpenAI API key (if using ChatGPT)
- ChromaDB installed (`pip install chromadb`)
- Optional: Ollama running locally for local LLM support
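
A quick way to confirm the requirements before running the script is to check the interpreter version and look for the modules it imports. This is a minimal sketch; `missing_packages` and `REQUIRED_MODULES` are hypothetical names introduced for this example, and the module list is taken from the script's import statements.

```python
import importlib.util
import sys
from typing import Iterable, List


def missing_packages(module_names: Iterable[str]) -> List[str]:
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


# Import names used by the script (assumed list; adjust to your setup).
REQUIRED_MODULES = ["chromadb", "PyPDF2", "requests"]

if sys.version_info < (3, 8):
    print("Python 3.8 or higher is required.")

missing = missing_packages(REQUIRED_MODULES)
print("Missing modules:", ", ".join(missing) if missing else "none")
```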

## How to Use

1. Set up the constants in the script, including `QUERY` and `MODEL_TYPE`.
2. Run the script.
3. Inspect the results:
   - View top matches from the vector database.
   - Get a detailed response from the language model based on the query.
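
Before running, it can help to validate the two constants named in step 1. The sketch below is an assumption layered on top of the script: `validate_settings` and `SUPPORTED_MODEL_TYPES` do not exist in it, although the accepted values and the error message match the check the script performs when it queries the model.

```python
SUPPORTED_MODEL_TYPES = ("chatgpt", "ollama")


def validate_settings(query: str, model_type: str) -> None:
    """Hypothetical helper: fail early if the configured constants are unusable."""
    if not query.strip():
        raise ValueError("QUERY must not be empty.")
    if model_type not in SUPPORTED_MODEL_TYPES:
        raise ValueError("Invalid model type. Use 'chatgpt' or 'ollama'.")


# Same defaults as the constants cell in the script.
QUERY = "What does this document talk about?"
MODEL_TYPE = "ollama"

validate_settings(QUERY, MODEL_TYPE)  # Raises ValueError on a bad configuration
print(f"Using model type '{MODEL_TYPE}' for query: {QUERY}")
```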

## Workflow Diagram

```mermaid
flowchart TD
    A[Start] --> B[Download Sample PDF]
    B --> C[Extract Text from PDF]
    C --> D[Split Text into Chunks]
    D --> E[Initialize ChromaDB Client]
    E --> F[Store Chunks in ChromaDB Collection]
    F --> G[Query Vector DB with User Query]
    G --> H[Retrieve Top Matches]
    H --> I[Pass Top Matches to LLM]
    I --> J[Generate Response]
    J --> K[Display Results]
```    

In [None]:
# Install necessary libraries
!pip install chromadb openai PyPDF2 sentence-transformers

In [None]:
import os
import logging
from typing import List
from PyPDF2 import PdfReader
import chromadb
from chromadb.config import Settings
import requests
from openai import OpenAI  # Needed for the "chatgpt" branch of query_llm

# -----------------------------------------
# Setup Logging
# -----------------------------------------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger()


In [None]:
# -----------------------------------------
# Constants: Configure the script here
# -----------------------------------------
# Sample PDF URL to download if the file doesn't exist locally
SAMPLE_PDF_URL = "https://pdfobject.com/pdf/sample.pdf"

# Path to save the sample PDF file
PDF_PATH = "sample.pdf"  # Local file name for the PDF

# Default query for testing
QUERY = "What does this document talk about?"  # Example query

# Directory where ChromaDB will persist its data
CHROMA_DB_DIR = "./chromadb"  # Path to ChromaDB storage

# Name of the ChromaDB collection to create or retrieve
COLLECTION_NAME = "pdf_documents"  # Logical name for the collection

# Maximum size of text chunks when splitting the PDF content
CHUNK_SIZE = 500  # Number of characters per text chunk

# Configurable model selection: "chatgpt" or "ollama"
MODEL_TYPE = "ollama"  # Default is Ollama; change to "chatgpt" for OpenAI's GPT

# Ollama-specific configurations
OLLAMA_ENDPOINT = "http://localhost:11434/api/generate"  # Local Ollama server endpoint
DEFAULT_OLLAMA_MODEL = "phi3.5:latest"  # Default Ollama model

# OpenAI-specific configurations
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "your_openai_api_key")  # Replace with your OpenAI API key
CHATGPT_MODEL_NAME = "gpt-3.5-turbo"

In [None]:
# -----------------------------------------
# Utility Functions
# -----------------------------------------
def download_sample_pdf(url: str, save_path: str) -> None:
    """Downloads a sample PDF file from a given URL."""
    logger.info("Downloading sample PDF...")
    response = requests.get(url, timeout=30)  # Avoid hanging indefinitely on a stalled download
    response.raise_for_status()
    with open(save_path, "wb") as file:
        file.write(response.content)
    logger.info(f"Sample PDF downloaded to {save_path}")


def verify_pdf_file(file_path: str) -> None:
    """Verifies that the specified PDF file exists."""
    if not os.path.exists(file_path):
        raise FileNotFoundError(f"PDF file not found at {file_path}.")
    logger.info(f"PDF file verified at {file_path}")


def extract_text_from_pdf(pdf_path: str) -> str:
    """Extracts all text from a PDF file."""
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""  # Guard against pages that yield no text
    logger.info("Text extracted from PDF.")
    return text


def split_text_into_chunks(text: str, chunk_size: int = 500) -> List[str]:
    """Splits a large text into smaller chunks for embedding."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    logger.info(f"Text split into {len(chunks)} chunks.")
    return chunks


def create_chroma_client_and_collection(collection_name: str) -> chromadb.api.Collection:
    """Creates a ChromaDB client and retrieves a collection."""
    # PersistentClient (Chroma 0.4+) writes to CHROMA_DB_DIR; chromadb.Client() keeps data in memory only
    chroma_client = chromadb.PersistentClient(path=CHROMA_DB_DIR)
    collection = chroma_client.get_or_create_collection(name=collection_name)
    logger.info(f"ChromaDB collection '{collection_name}' created or retrieved.")
    return collection


def ingest_pdf_to_chromadb(pdf_path: str, collection: chromadb.api.Collection, chunk_size: int = 500) -> None:
    """Ingests text from a PDF into a ChromaDB collection."""
    text = extract_text_from_pdf(pdf_path)
    chunks = split_text_into_chunks(text, chunk_size)

    ingested = 0  # Count only the chunks actually stored
    for idx, chunk in enumerate(chunks):
        if chunk.strip():  # Skip empty chunks
            collection.upsert(
                documents=[chunk],
                ids=[f"doc_{idx}"]
            )
            ingested += 1
    logger.info(f"Ingested {ingested} chunks into ChromaDB.")


def query_vector_db(collection: chromadb.api.Collection, query: str, max_results: int = 3) -> List[str]:
    """Queries the vector database for the most similar chunks."""
    results = collection.query(
        query_texts=[query],
        n_results=max_results
    )
    documents = results["documents"][0]
    logger.info(f"Retrieved {len(documents)} matching chunks from the vector DB.")
    return documents


# -----------------------------------------
# Query Functions
# -----------------------------------------
def query_llm(
    query: str,
    context: List[str],
    model_type: str = MODEL_TYPE,
    ollama_model: str = DEFAULT_OLLAMA_MODEL
) -> str:
    """
    Queries a language model (LLM) with additional context.

    Args:
        query (str): The user query.
        context (List[str]): The context retrieved from the vector DB.
        model_type (str): The model type to use ("chatgpt" or "ollama").
        ollama_model (str): The specific Ollama model to use.

    Returns:
        str: The LLM response.
    """
    if model_type == "chatgpt":
        # Initialize the OpenAI client
        client = OpenAI(api_key=OPENAI_API_KEY)

        # Construct the messages with context
        messages = [
            {"role": "system", "content": "You are an assistant that answers questions about PDF documents."},
            {"role": "user", "content": f"Context: {' '.join(context)}\n\nQuestion: {query}"}
        ]

        # Create a chat completion
        response = client.chat.completions.create(
            model=CHATGPT_MODEL_NAME,
            messages=messages
        )

        logger.info("ChatGPT query completed.")
        # The OpenAI v1 client returns an object, not a dict, so use attribute access
        return response.choices[0].message.content

    elif model_type == "ollama":
        # Combine context into a single string
        combined_context = " ".join(context)
        prompt = f"Context: {combined_context}\n\nQuestion: {query}"

        # Specify the model and send the query to the Ollama server
        payload = {
            "model": ollama_model,
            "prompt": prompt,
            "stream": False  # Disable streaming for simplicity
        }
        try:
            response = requests.post(OLLAMA_ENDPOINT, json=payload)
            response.raise_for_status()
            logger.info(f"Ollama query completed using model: {ollama_model}.")
            return response.json().get("response", "No response received from Ollama.")
        except requests.exceptions.HTTPError as e:
            logger.error(f"HTTPError: {e}")
            logger.error(f"Response: {e.response.text}")
            raise
        except Exception as e:
            logger.error(f"Unexpected error querying Ollama: {e}")
            raise

    else:
        raise ValueError("Invalid model type. Use 'chatgpt' or 'ollama'.")


In [None]:
# -----------------------------------------
# Main Workflow
# -----------------------------------------
def main() -> None:
    """Main workflow to ingest a PDF into ChromaDB, query the vector DB, and query the LLM."""
    if not os.path.exists(PDF_PATH):
        download_sample_pdf(SAMPLE_PDF_URL, PDF_PATH)

    verify_pdf_file(PDF_PATH)

    logger.info("Setting up ChromaDB client and collection...")
    collection = create_chroma_client_and_collection(COLLECTION_NAME)

    logger.info(f"Ingesting PDF '{PDF_PATH}' into ChromaDB collection...")
    ingest_pdf_to_chromadb(PDF_PATH, collection, CHUNK_SIZE)

    logger.info("Querying vector DB...")
    top_matches = query_vector_db(collection, QUERY)

    logger.info("Top matches from the vector DB:")
    for i, match in enumerate(top_matches, 1):
        print(f"{i}. {match[:200]}...")  # Display the first 200 characters of each match

    logger.info("Querying LLM with context from vector DB...")
    response = query_llm(QUERY, top_matches)
    logger.info("LLM Response:")
    print(response)


if __name__ == "__main__":
    main()