# PDF Querying with ChromaDB and ChatGPT

> Build your own "Chat with Local Files" using Retrieval Augmented Generation
> Mihai Criveti, PyCon Ireland 2024

## Overview

This script processes a PDF document: it extracts the text, splits it into chunks, and stores the chunks in ChromaDB.
It then uses a language model (FakeLLM, ChatGPT, or Ollama) to generate context-aware responses.

## How It Works

1. **PDF Ingestion**:
   - Downloads a sample PDF if not already available.
   - Extracts text from the PDF and splits it into chunks.

2. **ChromaDB Setup**:
   - Initializes a ChromaDB client and creates or retrieves a collection.
   - Stores the text chunks in the ChromaDB collection.

3. **Querying**:
   - Searches the ChromaDB collection for chunks most similar to the user’s query.
   - Passes the top matches (`NUM_CHUNKS`, 2 by default) as context to the selected language model (FakeLLM, ChatGPT, or Ollama).

4. **Response Generation**:
   - The language model generates a response based on the retrieved context and the user query. The default `fakellm` option simply echoes both, which lets you test the pipeline without a real model.
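
To make steps 3 and 4 concrete, here is a minimal, dependency-free sketch of the retrieve-then-prompt idea. It is not how ChromaDB works internally: ChromaDB compares dense embedding vectors, whereas this toy uses word counts and cosine similarity as a stand-in. The helpers `bag_of_words`, `cosine_similarity`, and `top_k_chunks` are hypothetical names introduced for illustration only.

```python
import math
import re
from collections import Counter
from typing import List


def bag_of_words(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (assumption: stands in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return the k chunks most similar to the query, best match first."""
    query_vec = bag_of_words(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: cosine_similarity(query_vec, bag_of_words(chunk)),
        reverse=True,
    )
    return ranked[:k]


chunks = [
    "ChromaDB stores text chunks together with their embeddings.",
    "The weather in Dublin is often rainy in November.",
    "A language model answers the question using the retrieved chunks.",
]
query = "Where are text chunks stored?"
matches = top_k_chunks(query, chunks, k=2)

# The retrieved chunks become the context of the prompt sent to the model
prompt = f"Context: {' '.join(matches)}\n\nQuestion: {query}"
```

Word overlap is a crude similarity signal (common words such as "the" can dominate the score), which is one reason real pipelines rely on learned embeddings instead.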

## Workflow Diagram

```mermaid
flowchart TD
    A[Start] --> B[Download Sample PDF]
    B --> C[Extract Text from PDF]
    C --> D[Split Text into Chunks]
    D --> E[Initialize ChromaDB Client]
    E --> F[Store Chunks in ChromaDB Collection]
    F --> G[Query Vector DB with User Query]
    G --> H[Retrieve Top Matches]
    H --> I[Pass Top Matches to LLM]
    I --> J[Generate Response]
    J --> K[Display Results]
```    

In [1]:
# Install necessary libraries
!pip install chromadb openai PyPDF2 sentence-transformers

Collecting openai
  Downloading openai-1.54.4-py3-none-any.whl (389 kB)
Collecting sentence-transformers
  Downloading sentence_transformers-3.3.0-py3-none-any.whl (268 kB)
Collecting distro<2,>=1.7.0
  Using cached distro-1.9.0-py3-none-any.whl (20 kB)
Collecting jiter<1,>=0.4.0
  Downloading jiter-0.7.1-cp311-cp311-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (325 kB)
Collecting transformers<5.0.0,>=4.41.0
  Downloading transformers-4.46.2-py3-none-any.whl (10.0 MB)

In [2]:
import os
import logging
from typing import List
from PyPDF2 import PdfReader
import chromadb
from chromadb.config import Settings
import requests

# -----------------------------------------
# Setup Logging
# -----------------------------------------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger()

In [3]:
# -----------------------------------------
# Constants - configuration
# -----------------------------------------
SAMPLE_PDF_URL = "https://pdfobject.com/pdf/sample.pdf"  # URL for a sample PDF file
#SAMPLE_PDF_URL = "https://bugs.python.org/file47781/Tutorial_EDIT.pdf" # Python tutorial sample
PDF_PATH = "sample.pdf"  # Path to save or check for the PDF file
QUERIES = [
    "What is the main idea of the document?",
    "Summarize the key topics discussed."
]
CHROMA_DB_DIR = "./chromadb"  # Directory for ChromaDB storage
COLLECTION_NAME = "rag_documents"  # Logical name for the ChromaDB collection
CHUNK_SIZE = 500  # Number of characters in each text chunk
NUM_CHUNKS = 2  # Number of chunks to retrieve for each query
MODEL_TYPE = "fakellm"  # Default model ("chatgpt", "ollama", or "fakellm")
OLLAMA_ENDPOINT = "http://localhost:11434/api/generate"  # Ollama server endpoint
DEFAULT_OLLAMA_MODEL = "granite3-dense"  # Default Ollama model (2b model)
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")  # Define OPENAI_API_KEY in your ENV
CHATGPT_MODEL_NAME = "gpt-4o"
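
Because `OPENAI_API_KEY` falls back to an empty string, a missing key may only surface when the first ChatGPT request fails. One option is to validate the configuration before any work is done. The `validate_config` helper and the `VALID_MODEL_TYPES` set below are hypothetical additions, not part of the original script; the allowed values mirror those listed for `MODEL_TYPE`.

```python
VALID_MODEL_TYPES = {"chatgpt", "ollama", "fakellm"}


def validate_config(model_type: str, openai_api_key: str) -> None:
    """Raise ValueError early if the configuration cannot work."""
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(
            f"Invalid model type '{model_type}'. Use one of {sorted(VALID_MODEL_TYPES)}."
        )
    if model_type == "chatgpt" and not openai_api_key:
        raise ValueError("MODEL_TYPE is 'chatgpt' but OPENAI_API_KEY is not set.")


# The fake model needs no key, so this call passes silently
validate_config("fakellm", "")
```

In this notebook, such a check could be called as `validate_config(MODEL_TYPE, OPENAI_API_KEY)` at the start of `main()`.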

In [4]:
# -----------------------------------------
# Utility Functions
# -----------------------------------------
def download_sample_pdf(url: str, save_path: str) -> None:
    """Downloads a sample PDF file."""
    logger.info("Downloading sample PDF...")
    response = requests.get(url, timeout=30)  # Fail instead of hanging on a stalled connection
    response.raise_for_status()
    with open(save_path, "wb") as file:
        file.write(response.content)
    logger.info(f"Sample PDF downloaded to {save_path}")


def extract_text_from_pdf(pdf_path: str) -> str:
    """Extracts all text from a PDF file."""
    logger.info(f"Extracting text from PDF at {pdf_path}...")
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        text += page.extract_text() or ""  # Guard against pages with no extractable text
    logger.info("Text extraction completed.")
    return text


def split_text_into_chunks(text: str, chunk_size: int = 500) -> List[str]:
    """Splits text into smaller chunks."""
    logger.info("Splitting text into chunks...")
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    logger.info(f"Text split into {len(chunks)} chunks.")
    return chunks


def create_chroma_client_and_collection(collection_name: str) -> chromadb.api.Collection:
    """Creates a ChromaDB client and retrieves a collection."""
    logger.info("Setting up ChromaDB client and collection...")
    # PersistentClient writes the collection to CHROMA_DB_DIR so it survives restarts
    chroma_client = chromadb.PersistentClient(path=CHROMA_DB_DIR)
    collection = chroma_client.get_or_create_collection(name=collection_name)
    logger.info(f"ChromaDB collection '{collection_name}' created or retrieved.")
    return collection


def ingest_pdf_to_chromadb(pdf_path: str, collection: chromadb.api.Collection, chunk_size: int = 500) -> None:
    """Ingests text from a PDF into a ChromaDB collection."""
    logger.info(f"Ingesting PDF from '{pdf_path}' into ChromaDB...")
    text = extract_text_from_pdf(pdf_path)
    chunks = split_text_into_chunks(text, chunk_size)

    stored = 0
    for idx, chunk in enumerate(chunks):
        if chunk.strip():  # Skip empty chunks
            collection.upsert(
                documents=[chunk],
                ids=[f"doc_{idx}"]
            )
            stored += 1
    # Report only the chunks actually written, not the skipped empty ones
    logger.info(f"Ingested {stored} chunks into ChromaDB.")


def query_vector_db(collection: chromadb.api.Collection, query: str, max_results: int = NUM_CHUNKS) -> List[str]:
    """Queries the vector database for the most similar chunks."""
    logger.info(f"Querying vector DB for: '{query}'...")
    results = collection.query(
        query_texts=[query],
        n_results=max_results
    )
    documents = results["documents"][0]
    logger.info(f"Retrieved {len(documents)} matching chunks from the vector DB.")
    return documents
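
`split_text_into_chunks` slices at fixed character offsets, so it produces `ceil(len(text) / chunk_size)` chunks and a chunk can end in the middle of a word or sentence. A common mitigation is to let consecutive chunks overlap. The sketch below is one possible variant, not part of the script: `split_with_overlap` and its `overlap` parameter are hypothetical.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Each chunk starts `overlap` characters before the previous chunk ends,
    so text cut at a boundary also appears at the start of the next chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"
plain = split_with_overlap(sample, chunk_size=10, overlap=0)
# plain == ["abcdefghij", "klmnopqrst", "uvwxyz"]
overlapped = split_with_overlap(sample, chunk_size=10, overlap=3)
# overlapped == ["abcdefghij", "hijklmnopq", "opqrstuvwx", "vwxyz"]
```

With `overlap=0` the function behaves like `split_text_into_chunks`. Overlap increases the number of stored chunks, and when the remaining text is shorter than the overlap, the final chunk repeats characters already covered by the previous one.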

In [5]:
# -----------------------------------------
# Query Functions
# -----------------------------------------
def query_llm(
    query: str,
    context: List[str],
    model_type: str = MODEL_TYPE,
    ollama_model: str = DEFAULT_OLLAMA_MODEL
) -> str:
    """
    Queries a language model (LLM) with context.

    Args:
        query (str): User query.
        context (List[str]): Context from ChromaDB.
        model_type (str): Model type ("chatgpt", "ollama", "fakellm").
        ollama_model (str): Specific Ollama model.

    Returns:
        str: LLM response.
    """
    logger.info(f"Querying LLM {model_type} with query: {query} and context: {context}")
    if model_type == "chatgpt":
        from openai import OpenAI
        client = OpenAI(api_key=OPENAI_API_KEY)
        messages = [
            {"role": "system", "content": "You are an assistant for answering questions about PDF documents."},
            {"role": "user", "content": f"Context: {' '.join(context)}\n\nQuestion: {query}"}
        ]
        response = client.chat.completions.create(
            model=CHATGPT_MODEL_NAME,
            messages=messages
        )
        logger.info("ChatGPT query completed.")
        return response.choices[0].message.content

    elif model_type == "ollama":
        payload = {
            "model": ollama_model,
            "prompt": f"Context: {' '.join(context)}\n\nQuestion: {query}",
            "stream": False
        }
        try:
            response = requests.post(OLLAMA_ENDPOINT, json=payload)
            response.raise_for_status()
            logger.info(f"Ollama query completed using model: {ollama_model}.")
            return response.json().get("response", "No response from Ollama.")
        except Exception as e:
            logger.error(f"Ollama query error: {e}")
            raise

    elif model_type == "fakellm":
        logger.info("Fake LLM returning input as response.")
        return f"Context: {' '.join(context)}\n\nQuery: {query}"

    else:
        raise ValueError("Invalid model type. Use 'chatgpt', 'ollama', or 'fakellm'.")
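
All three branches join the retrieved chunks with spaces, which hides where one chunk ends and the next begins and places no limit on prompt length. A possible refinement, sketched here with a hypothetical `build_prompt` helper and an assumed `max_context_chars` budget, numbers each chunk and stops adding chunks once the budget would be exceeded.

```python
from typing import List


def build_prompt(query: str, context: List[str], max_context_chars: int = 2000) -> str:
    """Build a prompt with numbered context chunks.

    Chunks are added in ranked order; the loop stops at the first chunk
    that would push the total context length past max_context_chars.
    """
    parts: List[str] = []
    used = 0
    for number, chunk in enumerate(context, start=1):
        if used + len(chunk) > max_context_chars:
            break  # Keep the best-ranked chunks and drop the rest
        parts.append(f"[{number}] {chunk}")
        used += len(chunk)
    return "Context:\n" + "\n".join(parts) + f"\n\nQuestion: {query}"


prompt = build_prompt(
    "What is this about?",
    ["First chunk.", "Second chunk."],
    max_context_chars=15,
)
```

Here only the first chunk fits within the 15-character budget, so the second is left out. Counting characters is a rough proxy; models enforce their limits in tokens, so a real budget would need a tokenizer.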

In [6]:
# -----------------------------------------
# Main Workflow
# -----------------------------------------
def main() -> None:
    """Main workflow to ingest a PDF, query ChromaDB, and use an LLM."""
    if not os.path.exists(PDF_PATH):
        logger.info("PDF not found locally. Downloading...")
        download_sample_pdf(SAMPLE_PDF_URL, PDF_PATH)

    collection = create_chroma_client_and_collection(COLLECTION_NAME)
    ingest_pdf_to_chromadb(PDF_PATH, collection, CHUNK_SIZE)

    for query in QUERIES:
        logger.info(f"Processing query: {query}")
        matches = query_vector_db(collection, query)
        if not matches:
            logger.warning(f"No matches found for query: {query}")
            print(f"No matches for query: {query}\n")
            continue
        response = query_llm(query, matches)
        print(f"Query: {query}\nLLM Response:\n{response}\n")


if __name__ == "__main__":
    main()

2024-11-17 06:55:13,417 - INFO - PDF not found locally. Downloading...
2024-11-17 06:55:13,418 - INFO - Downloading sample PDF...
2024-11-17 06:55:13,875 - INFO - Sample PDF downloaded to sample.pdf
2024-11-17 06:55:13,875 - INFO - Setting up ChromaDB client and collection...
2024-11-17 06:55:13,895 - INFO - Anonymized telemetry enabled. See https://docs.trychroma.com/telemetry for more information.
2024-11-17 06:55:14,017 - INFO - ChromaDB collection 'rag_documents' created or retrieved.
2024-11-17 06:55:14,017 - INFO - Ingesting PDF from 'sample.pdf' into ChromaDB...
2024-11-17 06:55:14,018 - INFO - Extracting text from PDF at sample.pdf...
2024-11-17 06:55:14,030 - INFO - Text extraction completed.
2024-11-17 06:55:14,030 - INFO - Splitting text into chunks...
2024-11-17 06:55:14,030 - INFO - Text split into 6 chunks.
2024-11-17 06:55:14,384 - INFO - Ingested 6 chunks into ChromaDB.
2024-11-17 06:55:14,385 - INFO - Processing query: What is the main idea of the d

Query: What is the main idea of the document?
LLM Response:
Context: Sample PDFThis is a simple PDF ﬁle. Fun fun fun.Lorem ipsum dolor sit amet, consectetuer adipiscing elit. Phasellus facilisis odio sed mi. Curabitur suscipit. Nullam vel nisi. Etiam semper ipsum ut lectus. Proin aliquam, erat eget pharetra commodo, eros mi condimentum quam, sed commodo justo quam ut velit. Integer a erat. Cras laoreet ligula cursus enim. Aenean scelerisque velit et tellus. Vestibulum dictum aliquet sem. Nulla facilisi. Vestibulum accumsan ante vitae elit. Nulla erat dolor, bland rsus. Duis ut magna at justo dignissim condimentum. Cum sociis natoque penatibus et magnis dis parturient montes, nascetur ridiculus mus. Vivamus varius. Ut sit amet diam suscipit mauris ornare aliquam. Sed varius. Duis arcu. Etiam tristique massa eget dui. Phasellus congue. Aenean est erat, tincidunt eget, venenatis quis, commodo at, quam.

Query: What is the main idea of the document?

Query: Summarize the key topics discuss