<div style="max-width: 90%; background: linear-gradient(90deg, #F7F9FD 0%, #ECF4F8 100%); padding: 25px; border-radius: 10px; text-align: center; margin-bottom: 15px; box-sizing: border-box; border: 2px solid #7AD6EB;">
    <div style="margin-bottom: 20px;">
        <div style="width: 80px; height: 40px; background: #ECF4F8; border: 0px solid #61BD73; border-radius: 8px; display: inline-flex; align-items: center; justify-content: center; font-size: 0.8em; color: #130C49; font-weight: bold; margin-bottom: 25px;">
            <img src="assets/weaviate-logo-light-transparent-200.png" alt="Weaviate Logo" style="height: 30px;"/>
        </div>
    </div>
    <h1 style="font-size: 2.2em; margin: 0 0 25px 0; color: #130C49; font-weight: 700;">🚀 Building a PDF-Driven RAG System</h1>
    <p style="font-size: 1.2em; margin: 0 0 20px 0; color: #130C49; opacity: 0.8; font-weight: 400;">Transform static documents into intelligent, searchable knowledge<br/>with <strong>Weaviate</strong> and multi-modal models</p>
    <hr style="border: none; height: 1px; background: linear-gradient(90deg, transparent 0%, #61BD73 50%, transparent 100%); margin: 25px 0;"/>
    <p style="margin: 0; font-size: 1em; color: #130C49; opacity: 0.9;">
        👀 <strong>Watch along</strong> or 💻 <strong>Code along</strong> - your choice!
    </p>    
</div>

<div style="max-width: 90%; background: linear-gradient(90deg, #F7F9FD 0%, #ECF4F8 100%); padding: 15px; border-radius: 10px; text-align: center; margin: 15px 0; box-sizing: border-box; border: 1px solid #B8C0DE;">
    <div style="display: flex; align-items: center; justify-content: center; gap: 15px;">
        <div style="width: 50px; height: 50px; background: #ECF4F8; border: 2px solid #7AD6EB; border-radius: 50%; display: flex; align-items: center; justify-content: center; font-size: 1.2em; color: #130C49;">
            <img src="assets/jp_profile_small_2023.jpeg" alt="JP Hwang" style="width: 46px; height: 46px; border-radius: 50%; object-fit: cover;"/>
        </div>
        <div style="text-align: left;">
            <h4 style="margin: 0; color: #130C49; font-size: 1em;">JP Hwang</h4>
            <p style="margin: 2px 0 0 0; color: #130C49; font-size: 0.85em; opacity: 0.8;">Senior Developer Educator</p>
        </div>
    </div>
</div>

<div style="max-width: 90%; background: linear-gradient(90deg, #ECF4F8 0%, #F7F9FD 100%); padding: 20px; border-radius: 10px; text-align: center; margin: 15px 0; box-sizing: border-box; border: 2px solid #61BD73;">
    <div style="max-width: 90%; background: rgba(97,189,115,0.1); padding: 15px; border-radius: 6px;">
        <h3 style="margin: 0 0 10px 0; color: #130C49; font-size: 1.1em;">🔧 Setup Required (Only if coding along)</h3>
        <p style="color: #130C49; font-size: 0.9em; margin-bottom: 10px;">
            Follow the <strong>setup instructions</strong> in our README<br/><br/><a href="https://github.com/weaviate-tutorials/workshop-pdf-driven-rag/" style="color: #130C49; text-decoration: underline;">https://github.com/weaviate-tutorials/workshop-pdf-driven-rag/</a>
        </p>
        <div style="text-align: left; display: inline-block;">
            <p style="color: #130C49; font-weight: bold; margin: 5px 0; font-size: 0.9em;">
            ✅ Python environment configured<br>
            ✅ Required dependencies installed<br>
            ✅ API keys set up in <code style="background: #130C49; color: white; padding: 2px 4px; border-radius: 3px;">.env</code> file<br>
            </p>
        </div>
    </div>
</div>

In [13]:
from preprocess_pdf_to_img import convert_pdf_to_images
from pathlib import Path

img_dir = Path("data/imgs")
for pdf_path in [
    Path("data/pdfs/hai_ai_index_report_2025_chapter_2.pdf"),
    Path("data/pdfs/howto-free-threading-python.pdf")
]:
    convert_pdf_to_images(pdf_path, img_dir)

![assets/about_weaviate.png](assets/about_weaviate.png)

![assets/workshop_agenda.jpg](assets/workshop_agenda.jpg)

## Preparation

Load environment variables. If you haven't set up your `.env` file, please refer to the [README](README.md).

In the live session, your instructor may provide you with temporary keys to use. 

In [14]:
from dotenv import load_dotenv

load_dotenv()

True

In [15]:
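import os

# Print only a short prefix of each key to confirm it loaded, without exposing the full value
for key_name in ["ANTHROPIC_API_KEY", "COHERE_API_KEY"]:
    key = os.getenv(key_name)
    print(key[:10] if key else f"{key_name} is not set")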

sk-ant-api
RRSFuhizdT


## Working with PDFs - an introduction

### Get text from PDFs

PDFs contain rich formatting:

<img src="data/imgs/howto-free-threading-python_2_of_4.png" height="400px" />
<img src="data/imgs/hai_ai_index_report_2025_chapter_2_34_of_80.jpg" height="400px" />

Modern libraries can preserve document structure while converting to text format, making the text easier to process while maintaining semantic relationships.

Let's try a popular library, `docling`.

Here's how we can convert PDFs into text files:

In [16]:
from pathlib import Path
from docling.document_converter import DocumentConverter


def parse_pdf(input_doc_path: Path, output_dir: Path) -> Path:
    """Convert a PDF to Markdown with docling, skipping files that were already parsed."""
    md_filepath = output_dir / f"{input_doc_path.stem}-parsed-text.md"
    if md_filepath.exists():
        print(f"Parsed text already exists at {md_filepath}, skipping text extraction.")
        return md_filepath

    # Only run the (slow) conversion when there is no saved output yet
    conv_res = DocumentConverter().convert(input_doc_path)

    # Save markdown
    output_dir.mkdir(parents=True, exist_ok=True)
    print(f"Saving parsed text to {md_filepath}")
    md_filepath.write_text(conv_res.document.export_to_markdown(), encoding="utf-8")
    return md_filepath

In [17]:
data_folder = Path("data/pdfs")
output_dir = Path("data/parsed")
output_dir.mkdir(parents=True, exist_ok=True)

input_pdf_name = "howto-free-threading-python.pdf"
input_doc_path = data_folder / input_pdf_name

In [18]:
# Run the parser function
# BEGIN_SOLUTION
md_filepath = parse_pdf(input_doc_path, output_dir)
# END_SOLUTION

Parsed text already exists at data/parsed/howto-free-threading-python-parsed-text.md, skipping text extraction.


Inspect the converted file:

In [19]:
md_txt = md_filepath.read_text(encoding="utf-8")

# Print some part of the parsed text:
# BEGIN_SOLUTION
print(md_txt[:1000])
# END_SOLUTION

## Python experimental support for free threading

Release 3.13.3

## Guido van Rossum and the Python development team

April 27, 2025

Python Software Foundation Email: docs@python.org

## Contents

| 1     | Installation                                        |   2 |
|-------|-----------------------------------------------------|-----|
| 2     | Identifying free-threaded Python                    |   2 |
| 3     | The global interpreter lock in free-threaded Python |   2 |
| 4     | Thread safety                                       |   2 |
| 5     | Known limitations                                   |   2 |
| 5.1   | Immortalization . . . . . . . . . . . . . . . . .   |   3 |
| 5.2   | Frame objects . . . . . . . . . . . . . . . . . .   |   3 |
| 5.3   | Iterators . . . . . . . . . . . . . . . . . . . . . |   3 |
| 5.4   | Single-threaded performance . . . . . . . . . .     |   3 |
| Index |                                                     |   4 |

Starting with the 3.13 releas

### Chunking

Chunking breaks documents into smaller, manageable pieces while preserving context. 

![assets/chunking_why.png](assets/chunking_why.png)

#### RAG - Recap

![assets/llm_2_rag_basic.png](assets/llm_2_rag_basic.png)

What does this have to do with chunking? 

Each "chunk" becomes a "unit" of context to feed into a RAG input!
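
To make the link between chunks and RAG concrete, here is a minimal sketch of how retrieved chunks might be assembled into a prompt. The helper name `build_rag_prompt`, the prompt wording, and the placeholder chunk texts are all assumptions for illustration; they are not part of the workshop code or of any library.

```python
def build_rag_prompt(question: str, retrieved_chunks: list[str]) -> str:
    """Assemble a prompt from a question and retrieved chunks (hypothetical helper)."""
    # Each chunk is one labelled unit of context
    context = "\n\n".join(
        f"[Chunk {i}]\n{chunk}" for i, chunk in enumerate(retrieved_chunks, start=1)
    )
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder texts stand in for chunks returned by a search
example_prompt = build_rag_prompt(
    "What is this document about?",
    ["first retrieved chunk text", "second retrieved chunk text"],
)
print(example_prompt)
```

The size and boundaries of each chunk decide what the model actually sees, which is why the chunking strategy matters.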

### Chunking strategies

Different chunking strategies serve different use cases. 

![assets/chunking_methods.png](assets/chunking_methods.png)

Let's try a few options:

#### Chunk by text length with overlap

Overlapping chunks help maintain context across boundaries. 

In [20]:
def get_chunks_by_length_with_overlap(src_text: str, chunk_length: int = 500, overlap: int = 100) -> list[str]:
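    """
    Split text into overlapping chunks.

    Chunks start every `chunk_length` characters and extend `overlap` characters
    into the next one, so each chunk is up to `chunk_length + overlap` characters long.
    """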
    chunks = []
    for i in range(0, len(src_text), chunk_length):
        chunks.append(src_text[i:i + chunk_length + overlap])
    return chunks

In [21]:
# Try chunking using the get_chunks_by_length_with_overlap method
# BEGIN_SOLUTION
chunks = get_chunks_by_length_with_overlap(md_txt)
# END_SOLUTION

# Have a look at some of the chunks
display(chunks[:5])
print(len(chunks[0]))

['## Python experimental support for free threading\n\nRelease 3.13.3\n\n## Guido van Rossum and the Python development team\n\nApril 27, 2025\n\nPython Software Foundation Email: docs@python.org\n\n## Contents\n\n| 1     | Installation                                        |   2 |\n|-------|-----------------------------------------------------|-----|\n| 2     | Identifying free-threaded Python                    |   2 |\n| 3     | The global interpreter lock in free-threaded Python |   2 |\n| 4     | Thread safety                                       |   2 |\n| 5     | Known limitations                       ',
 'ety                                       |   2 |\n| 5     | Known limitations                                   |   2 |\n| 5.1   | Immortalization . . . . . . . . . . . . . . . . .   |   3 |\n| 5.2   | Frame objects . . . . . . . . . . . . . . . . . .   |   3 |\n| 5.3   | Iterators . . . . . . . . . . . . . . . . . . . . . |   3 |\n| 5.4   | Single-threaded performance . .

600
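
The overlap is easier to see on a tiny input. The sketch below reimplements the same slicing logic under a different name (`chunk_with_overlap`, chosen here for illustration) with small values, so that the shared characters between neighbouring chunks are visible.

```python
def chunk_with_overlap(text: str, chunk_length: int = 10, overlap: int = 3) -> list[str]:
    """Same slicing logic as above, with small defaults for demonstration."""
    return [text[i:i + chunk_length + overlap] for i in range(0, len(text), chunk_length)]


sample = "abcdefghijklmnopqrstuvwxyz"
sample_chunks = chunk_with_overlap(sample)
print(sample_chunks)
# ['abcdefghijklm', 'klmnopqrstuvw', 'uvwxyz']

# The last `overlap` characters of one chunk reappear at the start of the next
print(sample_chunks[0][-3:], sample_chunks[1][:3])
# klm klm
```

This also explains why the first chunk above is 600 characters long: 500 for `chunk_length` plus 100 for `overlap`.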


#### Chunk using markers

Using document markers (like headers) creates chunks that respect natural document boundaries. 

This approach preserves semantic structure and is ideal for documents with clear hierarchical organization like reports, manuals, or academic papers.

In [22]:
def get_chunks_using_markers(src_text: str) -> list[str]:
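    """
    Split the source text into chunks using markers.

    Each chunk after the first is prefixed with the marker. Whitespace directly
    after the marker is stripped, so headings appear as "\n##Title" in the output.
    """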
    marker = "\n##"

    # Split by marker and reconstruct with markers (except first chunk)
    parts = src_text.split(marker)
    chunks = []

    # Add first chunk if it exists and isn't empty
    if parts[0].strip():
        chunks.append(parts[0].strip())

    # Add remaining chunks with markers reattached
    for part in parts[1:]:
        if part.strip():
            chunks.append(marker + part.strip())

    return chunks

In [23]:
# Try chunking using the marker-based method
# BEGIN_SOLUTION
chunks = get_chunks_using_markers(md_txt)
# END_SOLUTION

# Have a look at some of the chunks
display(chunks[:5])
print(len(chunks[0]))

['## Python experimental support for free threading\n\nRelease 3.13.3',
 '\n##Guido van Rossum and the Python development team\n\nApril 27, 2025\n\nPython Software Foundation Email: docs@python.org',
 '\n##Contents\n\n| 1     | Installation                                        |   2 |\n|-------|-----------------------------------------------------|-----|\n| 2     | Identifying free-threaded Python                    |   2 |\n| 3     | The global interpreter lock in free-threaded Python |   2 |\n| 4     | Thread safety                                       |   2 |\n| 5     | Known limitations                                   |   2 |\n| 5.1   | Immortalization . . . . . . . . . . . . . . . . .   |   3 |\n| 5.2   | Frame objects . . . . . . . . . . . . . . . . . .   |   3 |\n| 5.3   | Iterators . . . . . . . . . . . . . . . . . . . . . |   3 |\n| 5.4   | Single-threaded performance . . . . . . . . . .     |   3 |\n| Index |                                                     |   4 |\n\

65
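
Note that the chunks above start with `'\n##Guido ...'`: the leading newline is kept and the space after `##` is lost. If you want each chunk to keep its heading exactly as written, one option is to split with a regular expression lookahead. This is a sketch of an alternative, not the workshop's method, and the name `split_on_headings` is an assumption. Unlike the function above, it splits only on level-2 headings (`## `), so `###` subsections stay with their parent section.

```python
import re


def split_on_headings(src_text: str) -> list[str]:
    """Split Markdown before each line that starts with '## ', keeping headings intact."""
    # The zero-width lookahead splits *before* the heading without consuming it
    parts = re.split(r"^(?=## )", src_text, flags=re.MULTILINE)
    return [part.strip() for part in parts if part.strip()]


sample_md = "## A\n\ntext a\n\n## B\n\ntext b\n\n### C\n\ntext c"
heading_chunks = split_on_headings(sample_md)
print(heading_chunks)
# ['## A\n\ntext a', '## B\n\ntext b\n\n### C\n\ntext c']
```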


### Choosing the right strategy

The best chunking strategy depends on your use case. 

Marker-based chunking works well on cleanly structured documents, but it depends on the parser identifying headings correctly. Here is a case where it breaks down:

In [24]:
md_file_2 = Path("data/parsed/hai_ai_index_report_2025_chapter_2-parsed-text.md")
md_text_2 = md_file_2.read_text(encoding="utf-8")
get_chunks_using_markers(md_text_2)[:15]

['## Arti fi cial Intelligence Index Report 2025\n\n<!-- image -->\n\n<!-- image -->\n\n<!-- image -->',
 '\n##Chapter 2: Technical Performance\n\nOverview\n\n84\n\nChapter Highlights\n\n85',
 '\n##2.1 Overview of AI in 2024\n\n87\n\nTimeline: Signi fi cant Model and Dataset Releases\n\n87\n\nState of AI Performance\n\n93\n\nOverall Review\n\n93\n\nClosed vs. Open-Weight Models\n\n94\n\nUS vs. China Technical Performance\n\n96\n\nImproved Performance From Smaller Models\n\n98\n\nModel Performance Converges at the Frontier\n\n99\n\nBenchmarking AI\n\n100',
 '\n##2.2 Language',
 '\n##103\n\n| Understanding                                        |   104 |\n|------------------------------------------------------|-------|\n| MMLU: Massive Multitask Language Understanding       |   104 |\n| Generation                                           |   105 |\n| Chatbot Arena Leaderboard                            |   105 |\n| Arena-Hard-Auto                                      |   107 |\n| WildBe

Here, the parser has mistaken page headers for section headings, so the split points no longer match the document's real structure.

<img src="data/imgs/hai_ai_index_report_2025_chapter_2_05_of_80.jpg" height="400px" />
<img src="data/imgs/hai_ai_index_report_2025_chapter_2_06_of_80.jpg" height="400px" />

In cases like this, length-based chunking tends to be more robust, because it doesn't rely on the parser detecting the document's structure correctly.
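
A quick way to spot this kind of problem is to look at the spread of chunk lengths: marker-based output like `'\n##2.2 Language'` above produces very short chunks alongside very long ones. The helper below is a hypothetical diagnostic (the name `summarize_chunk_lengths` is an assumption), shown here on made-up chunks rather than the real parsed files.

```python
from statistics import median


def summarize_chunk_lengths(chunks: list[str]) -> dict[str, float]:
    """Return basic length statistics (in characters) for a list of chunks."""
    if not chunks:
        raise ValueError("chunks must not be empty")
    lengths = [len(chunk) for chunk in chunks]
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "max": max(lengths),
    }


# Made-up chunks: one heading-only fragment and two longer sections
demo_chunks = ["## 2.2 Language", "x" * 400, "y" * 1200]
demo_stats = summarize_chunk_lengths(demo_chunks)
print(demo_stats)
# {'count': 3, 'min': 15, 'median': 400, 'max': 1200}
```

A very small minimum or a very large maximum suggests that the markers are not lining up with meaningful sections, and that a length-based strategy may be the safer choice for that document.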