<div style="max-width: 90%; background: linear-gradient(135deg, #130C49 0%, #262262 100%); padding: 20px; border-radius: 10px; color: white; text-align: center; box-shadow: 0 5px 15px rgba(0,0,0,0.2); margin-bottom: 15px; box-sizing: border-box; border: 2px solid #464a7fff;">
    <h1 style="font-size: 2em; margin: 0; text-shadow: 1px 1px 3px rgba(0,0,0,0.3);">🚀 Building a PDF-Driven RAG System</h1>
    <p style="font-size: 1.1em; margin: 10px 0;">Transform static documents into intelligent, searchable knowledge</p>
    <hr/>
    <p style="margin: 10px 0;">
        👀 <strong>Watch along</strong> or 💻 <strong>Code along</strong> - your choice!
    </p>    
</div>

<div style="max-width: 90%; background: linear-gradient(90deg, #F7F9FD 0%, #ECF4F8 100%); padding: 15px; border-radius: 10px; text-align: center; margin: 15px 0; box-sizing: border-box; border: 1px solid #B8C0DE;">
    <div style="display: flex; align-items: center; justify-content: center; gap: 15px;">
        <div style="width: 50px; height: 50px; background: #ECF4F8; border: 2px solid #7AD6EB; border-radius: 50%; display: flex; align-items: center; justify-content: center; font-size: 1.2em; color: #130C49;">
            <img src="assets/jp_profile_small_2023.jpeg" alt="JP Hwang" style="width: 46px; height: 46px; border-radius: 50%; object-fit: cover;"/>
        </div>
        <div style="text-align: left;">
            <h4 style="margin: 0; color: #130C49; font-size: 1em;">JP Hwang</h4>
            <p style="margin: 2px 0 0 0; color: #130C49; font-size: 0.85em; opacity: 0.8;">Senior Developer Educator</p>
        </div>
    </div>
</div>

<div style="max-width: 90%; background: linear-gradient(90deg, #ECF4F8 0%, #F7F9FD 100%); padding: 20px; border-radius: 10px; text-align: center; margin: 15px 0; box-sizing: border-box; border: 2px solid #61BD73;">
    <div style="max-width: 90%; background: rgba(97,189,115,0.1); padding: 15px; border-radius: 6px;">
        <h3 style="margin: 0 0 10px 0; color: #130C49; font-size: 1.1em;">🔧 Setup Required (Only if coding along)</h3>
        <p style="color: #130C49; font-size: 0.9em; margin-bottom: 10px;">
            Follow the <strong>setup instructions</strong> in our README<br/><br/><a href="https://github.com/weaviate-tutorials/workshop-pdf-driven-rag/" style="color: #130C49; text-decoration: underline;">https://github.com/weaviate-tutorials/workshop-pdf-driven-rag/</a>
        </p>
        <div style="text-align: left; display: inline-block;">
            <p style="color: #130C49; font-weight: bold; margin: 5px 0; font-size: 0.9em;">
            ✅ Python environment configured<br>
            ✅ Required dependencies installed<br>
            ✅ API keys set up in <code style="background: #130C49; color: white; padding: 2px 4px; border-radius: 3px;">.env</code> file<br>
            </p>
        </div>
    </div>
</div>

In [None]:
from preprocess_pdf_to_img import convert_pdf_to_images
from pathlib import Path

img_dir = Path("data/imgs")
for pdf_path in [
    Path("data/pdfs/hai_ai_index_report_2025_chapter_2.pdf"),
    Path("data/pdfs/howto-free-threading-python.pdf")
]:
    convert_pdf_to_images(pdf_path, img_dir)

![assets/about_weaviate.png](assets/about_weaviate.png)

![assets/workshop_agenda.jpg](assets/workshop_agenda.jpg)

## Preparation

Load environment variables. If you haven't set up your `.env` file, please refer to the [README](README.md).

In the live session, your instructor may provide you with temporary keys to use. 

In [None]:
from dotenv import load_dotenv

load_dotenv()

In [None]:
import os

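# Confirm the keys are loaded, showing only a short prefix of each
for key_name in ["ANTHROPIC_API_KEY", "COHERE_API_KEY"]:
    key_value = os.getenv(key_name)
    print(f"{key_name}: {key_value[:10] + '...' if key_value else 'NOT SET'}")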

## Working with PDFs - an introduction

### Get text from PDFs

PDFs contain rich formatting:

<img src="data/imgs/howto-free-threading-python_2_of_4.png" height="400px" />
<img src="data/imgs/hai_ai_index_report_2025_chapter_2_34_of_80.jpg" height="400px" />

Modern libraries can convert PDFs to text while preserving document structure, such as headings and tables. This makes the text easier to process without losing how its parts relate to each other.

Let's try a popular library, `docling`.

Here's how we can convert PDFs into text files:

In [None]:
from pathlib import Path
from docling.document_converter import DocumentConverter


def parse_pdf(input_doc_path: Path, output_dir: Path) -> Path:
    """Convert a PDF to markdown with docling and return the path of the saved file."""
    md_filepath = output_dir / f"{input_doc_path.stem}-parsed-text.md"

    # Check first, so an already-parsed PDF is not converted again
    if md_filepath.exists():
        print(f"Parsed text already exists at {md_filepath}, skipping text extraction.")
        return md_filepath

    doc_converter = DocumentConverter()
    conv_res = doc_converter.convert(input_doc_path)

    # Save markdown
    output_dir.mkdir(parents=True, exist_ok=True)
    print(f"Saving parsed text to {md_filepath}")
    with md_filepath.open("w", encoding="utf-8") as md_file:
        md_file.write(conv_res.document.export_to_markdown())
    return md_filepath

In [None]:
data_folder = Path("data/pdfs")
output_dir = Path("data/parsed")
output_dir.mkdir(parents=True, exist_ok=True)

input_pdf_name = "howto-free-threading-python.pdf"
input_doc_path = data_folder / input_pdf_name

In [None]:
# Run the parser function
# ADD YOUR CODE HERE

Inspect the converted file:

In [None]:
# md_filepath is the path returned by parse_pdf in the previous cell
md_txt = md_filepath.read_text(encoding="utf-8")

# Print some part of the parsed text:
# ADD YOUR CODE HERE

### Chunking

Chunking breaks documents into smaller, manageable pieces while preserving context. 

![assets/chunking_why.png](assets/chunking_why.png)

#### RAG - Recap

![assets/llm_2_rag_basic.png](assets/llm_2_rag_basic.png)

What does this have to do with chunking? 

Each chunk becomes one unit of context: the chunks retrieved for a query are what gets fed to the LLM as part of the RAG input!
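
To make this concrete, here is a minimal sketch of chunks being used as RAG context. It is an illustration only: the `score` and `build_prompt` helpers, the toy chunks, and the prompt wording are all made up for this example, and the word-overlap scoring is a stand-in for the vector search a real system would use.

```python
def score(query: str, chunk: str) -> int:
    """Toy relevance score: how many distinct query words appear in the chunk."""
    chunk_words = set(chunk.lower().split())
    return sum(1 for word in set(query.lower().split()) if word in chunk_words)


def build_prompt(query: str, chunks: list[str], top_k: int = 2) -> str:
    """Pick the top_k highest-scoring chunks and place them in the prompt as context."""
    ranked = sorted(chunks, key=lambda c: score(query, c), reverse=True)
    context = "\n---\n".join(ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hypothetical chunks, written by hand for this sketch
toy_chunks = [
    "The GIL prevents multiple threads from running Python bytecode at once.",
    "This chunk is about benchmark tables and is unrelated.",
    "Free-threaded builds of Python can run threads in parallel.",
]
prompt = build_prompt("How do threads run in free-threaded Python?", toy_chunks)
print(prompt)
```

Only the selected chunks reach the model, so the way a document is chunked directly shapes what context the LLM can see.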

### Chunking strategies

Different chunking strategies serve different use cases. 

![assets/chunking_methods.png](assets/chunking_methods.png)

Let's try a few options:

#### Chunk by text length with overlap

Overlapping chunks help maintain context across boundaries. 
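
As a small illustration of what overlap looks like, the sketch below uses a sliding window on a toy string. Note that `sliding_chunks` is a hypothetical helper for this example, and it makes a slightly different design choice from the workshop function: every window is at most `chunk_length` characters, and each one starts `chunk_length - overlap` characters after the previous one.

```python
def sliding_chunks(text: str, chunk_length: int, overlap: int) -> list[str]:
    """Fixed-size windows that each share `overlap` characters with the previous window."""
    if not 0 <= overlap < chunk_length:
        raise ValueError("overlap must be >= 0 and smaller than chunk_length")
    stride = chunk_length - overlap  # how far the window moves each step
    return [text[i:i + chunk_length] for i in range(0, len(text), stride)]


toy_text = "abcdefghijklmnopqrstuvwxyz"
toy_chunks = sliding_chunks(toy_text, chunk_length=10, overlap=4)
for chunk in toy_chunks:
    print(chunk)
```

The last four characters of each full window reappear at the start of the next one, so a sentence cut at a boundary is still intact in at least one chunk. The final windows can be shorter than `chunk_length`.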

In [None]:
def get_chunks_by_length_with_overlap(src_text: str, chunk_length: int = 500, overlap: int = 100) -> list[str]:
    """
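    Split text into chunks that start every `chunk_length` characters.

    Each chunk extends `overlap` characters into the next one, so chunks are
    up to `chunk_length + overlap` characters long.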
    """
    chunks = []
    for i in range(0, len(src_text), chunk_length):
        chunks.append(src_text[i:i + chunk_length + overlap])
    return chunks

In [None]:
# Try chunking using the get_chunks_by_length_with_overlap function
# ADD YOUR CODE HERE

# Have a look at some of the chunks
display(chunks[:5])
print(len(chunks[0]))

#### Chunk using markers

Using document markers (like headers) creates chunks that respect natural document boundaries. 

This approach preserves semantic structure and is ideal for documents with clear hierarchical organization like reports, manuals, or academic papers.
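
As a quick illustration on a toy document, one way to split at headings is a regular expression with a lookahead. This is a sketch under assumptions, not the workshop's implementation: `split_on_headings` and the toy markdown are made up here, and it splits only at level-2 headings (`## `) so that subsections stay with their parent section.

```python
import re


def split_on_headings(markdown_text: str) -> list[str]:
    """Split before every line starting with '## ', keeping each heading with its body."""
    # The lookahead (?=## ) matches the position before a heading without consuming it
    parts = re.split(r"(?m)^(?=## )", markdown_text)
    return [part.strip() for part in parts if part.strip()]


toy_md = (
    "Intro text.\n\n"
    "## Setup\nInstall things.\n\n"
    "### Details\nMore.\n\n"
    "## Usage\nRun it.\n"
)
sections = split_on_headings(toy_md)
for section in sections:
    print(repr(section))
```

Each chunk starts with its own heading, which gives the retriever and the LLM a hint about what the chunk covers.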

In [None]:
def get_chunks_using_markers(src_text: str) -> list[str]:
    """
    Split the source text into chunks at markdown headings of level 2 or deeper
    (lines starting with "##").
    """
    marker = "\n##"

    # Split by marker and reconstruct with markers (except first chunk)
    parts = src_text.split(marker)
    chunks = []

    # Add first chunk if it exists and isn't empty
    if parts[0].strip():
        chunks.append(parts[0].strip())

    # Add remaining chunks with the "##" marker reattached.
    # Only trailing whitespace is stripped, so "## Heading" keeps the space after "##".
    for part in parts[1:]:
        if part.strip():
            chunks.append("##" + part.rstrip())

    return chunks

In [None]:
# Try chunking using the marker-based method
# ADD YOUR CODE HERE

# Have a look at some of the chunks
display(chunks[:5])
print(len(chunks[0]))

### Choosing the right strategy

The best chunking strategy depends on your use case. 
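
One simple way to compare strategies is to look at the chunk sizes they produce. The sketch below assumes a hypothetical `summarize_chunks` helper and uses made-up chunk lists rather than real workshop output.

```python
from statistics import mean


def summarize_chunks(chunks: list[str]) -> dict[str, float]:
    """Return the number of chunks and basic statistics on their character lengths."""
    if not chunks:
        return {"count": 0, "min": 0, "max": 0, "mean": 0}
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": mean(lengths),
    }


# Made-up examples: evenly sized chunks vs. chunks that follow uneven sections
length_based = ["a" * 500, "b" * 500, "c" * 200]
marker_based = ["x" * 50, "y" * 1100, "z" * 50]

length_stats = summarize_chunks(length_based)
marker_stats = summarize_chunks(marker_based)
print("length-based:", length_stats)
print("marker-based:", marker_stats)
```

A wide spread between `min` and `max` suggests that some chunks may carry too little context while others may be too large to retrieve precisely.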

Marker-based chunking excels with structured documents - but in some cases, it may not work as well:

In [None]:
md_file_2 = Path("data/parsed/hai_ai_index_report_2025_chapter_2-parsed-text.md")
md_text_2 = md_file_2.read_text(encoding="utf-8")
get_chunks_using_markers(md_text_2)[:15]

Here, the parser has mistaken the repeated page headers for section headings, so the marker-based splitter also breaks chunks at every page header, not just at real section boundaries.

<img src="data/imgs/hai_ai_index_report_2025_chapter_2_05_of_80.jpg" height="400px" />
<img src="data/imgs/hai_ai_index_report_2025_chapter_2_06_of_80.jpg" height="400px" />

In cases like this, length-based chunking tends to be the more robust choice, since it does not rely on the parser identifying headings correctly.
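
If you want to check whether a parsed document has this problem, one rough heuristic is to count how often each heading line occurs. The sketch below is an assumption-laden illustration: `find_repeated_headings`, the `min_count` threshold, and the toy markdown are made up for this example, and the check would also pick up `#` comment lines inside code blocks.

```python
from collections import Counter


def find_repeated_headings(markdown_text: str, min_count: int = 3) -> dict[str, int]:
    """Return heading lines that occur at least `min_count` times in the text."""
    headings = [
        line.strip()
        for line in markdown_text.splitlines()
        if line.lstrip().startswith("#")
    ]
    counts = Counter(headings)
    return {heading: n for heading, n in counts.items() if n >= min_count}


# Toy parsed text where a page header was converted into a heading on every page
toy_parsed = (
    "## Example Report Title\n\n## 2.1 Overview\nSome text.\n\n"
    "## Example Report Title\n\n## 2.2 Language\nMore text.\n\n"
    "## Example Report Title\nEven more text.\n"
)
repeated = find_repeated_headings(toy_parsed)
print(repeated)
```

Headings that repeat many times are likely page headers or footers rather than real sections, which is a sign that marker-based chunking may not suit that document as parsed.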