## Working with PDFs that contain images

PDFs contain more than rich formatting - they have images!

Run the cells below to convert PDF data to images (this should take about a minute). 

In [None]:
try:
    import pymupdf
except ImportError:
    %pip install -Uqq pymupdf

In [None]:
%%bash
set -e  # stop before the success message if the conversion fails
python pdf_to_img.py hai*.pdf
echo "Images extracted from AI report PDF"

In [None]:
import matplotlib.pyplot as plt
from PIL import Image

images = [
    "data/imgs/hai_ai-index-report-2025_chapter2_excerpts_1_of_8.jpg",
    "data/imgs/hai_ai-index-report-2025_chapter2_excerpts_4_of_8.jpg",
    "data/imgs/hai_ai-index-report-2025_chapter2_excerpts_5_of_8.jpg",
]

fig, axes = plt.subplots(1, 3, figsize=(15, 40))

for i, img_path in enumerate(images):
    img = Image.open(img_path)
    axes[i].imshow(img)
    axes[i].axis('off')

plt.tight_layout()
plt.show()

How do we work with these for RAG?

### Approach 1 - Extract text and images separately

Some libraries (like `docling`) can extract text and images from PDFs, and convert them into Markdown files.

Here, we've pre-converted this PDF into a Markdown file.

In [None]:
from pathlib import Path

md_filepath = Path("data/parsed/hai_ai-index-report-2025_chapter2_excerpts-parsed-w-imgs.md")
md_txt = md_filepath.read_text(encoding="utf-8")
print(md_txt[:1000])

#### Chunking text files with images

Chunking Markdown that contains images is more complex than chunking plain text, since we need to handle the images as well:

- Each chunk must contain the entire image string; an image must not be split across chunks
- When vectorizing, optionally include the base64 encoding of the image
    - This requires a multimodal embedding model
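
The two requirements above can be sketched in plain Python. This is a minimal sketch, assuming images appear in the Markdown as standard `![alt](target)` references and that embedded images are stored as `data:image/...;base64,...` URIs. The helper names `has_broken_image` and `separate_images` are hypothetical and not part of any library.

```python
import re

# A complete Markdown image reference: ![alt](target)
IMAGE_RE = re.compile(r"!\[([^\]]*)\]\(([^)\s]+)\)")
# A base64 data URI, as used for images embedded directly in the Markdown
DATA_URI_RE = re.compile(r"^data:image/[\w.+-]+;base64,(.+)$", re.DOTALL)


def has_broken_image(chunk_text: str) -> bool:
    """Return True if an image reference starts in this chunk but never finishes."""
    # Remove every complete image reference; an "![" opener or a "base64,"
    # marker left over belongs to a reference cut off at the chunk boundary.
    remainder = IMAGE_RE.sub("", chunk_text)
    return "![" in remainder or "base64," in remainder


def separate_images(chunk_text: str) -> tuple[str, list[str]]:
    """Split a chunk into plain text and a list of base64 image payloads."""
    images: list[str] = []

    def _swap(match: re.Match) -> str:
        alt, target = match.group(1), match.group(2)
        data = DATA_URI_RE.match(target)
        if data is None:
            return match.group(0)  # linked (not embedded) image: keep as-is
        images.append(data.group(1))
        # Leave a short placeholder so the text still records the image
        return f"[image: {alt}]" if alt else "[image]"

    return IMAGE_RE.sub(_swap, chunk_text), images
```

`has_broken_image` only flags a chunk in which an image begins and is then truncated; it does not detect a chunk that holds only the tail of an image. The text returned by `separate_images` could go to a text embedding model, while the base64 payloads are only useful if the embedding model is multimodal.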

One option: use a specialized chunking library such as `chonkie` to handle this.

Chonkie offers a variety of chunking strategies:

<img src="images/chonkie_methods.png" />

There isn't going to be a "one size fits all" solution for chunking PDFs with images. But these libraries can help you get started.

Let's try a couple of different approaches:

In [None]:
%pip install -Uqq "chonkie[all]"

In [None]:
from chonkie import RecursiveChunker

# Initialize the recursive chunker to chunk Markdown
chunker = RecursiveChunker.from_recipe("markdown", lang="en")

In [None]:
chunk_texts = chunker.chunk(md_txt)

In [None]:
import textwrap

for chunk in chunk_texts[:5]:
    print("\n" + "=" * 40)
    print(f"Token count: {chunk.token_count}")
    print("Chunk text:")
    wrapped_text = textwrap.fill(chunk.text[:500] + "...", width=80)
    print(textwrap.indent(wrapped_text, "    "))

Let's try a "semantic" chunker:

In [None]:
from chonkie import SemanticChunker

# Basic initialization with default parameters
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    threshold=0.5,                               # Similarity threshold (0-1) or (1-100) or "auto"
    chunk_size=2048,                             # Maximum tokens per chunk
    min_sentences=1                              # Minimum sentences per chunk
)
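
To build intuition for `threshold`, here is a toy sketch of the general idea behind semantic chunking: consecutive sentences stay in the same chunk while their similarity is at or above the threshold. This is not Chonkie's actual algorithm. It uses word counts as a stand-in for a real embedding model, the helper names `bow_vector`, `cosine`, and `toy_semantic_chunks` are hypothetical, and the sentences are made up.

```python
import math
import re
from collections import Counter


def bow_vector(sentence: str) -> Counter:
    """Toy 'embedding': lowercase word counts (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def toy_semantic_chunks(sentences: list[str], threshold: float = 0.5) -> list[list[str]]:
    """Start a new chunk whenever adjacent sentences fall below the threshold."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(bow_vector(chunks[-1][-1]), bow_vector(sentence)) >= threshold:
            chunks[-1].append(sentence)  # similar enough: extend the current chunk
        else:
            chunks.append([sentence])    # topic shift (or first sentence): new chunk
    return chunks


# Made-up sentences, for illustration only
sentences = [
    "AI models improved on benchmarks.",
    "Benchmarks for AI models improved quickly.",
    "Hardware costs fell sharply.",
]
for group in toy_semantic_chunks(sentences, threshold=0.5):
    print(group)
```

In this toy version, a higher threshold splits more readily and so tends to produce more, smaller chunks.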

In [None]:
# Chunk text into `chunk_texts` as we've done before
# ADD YOUR CODE HERE

In [None]:
for chunk in chunk_texts[:5]:
    print("\n" + "=" * 40)
    print(f"Token count: {chunk.token_count}")
    print("Chunk text:")
    wrapped_text = textwrap.fill(chunk.text[:500] + "...", width=80)
    print(textwrap.indent(wrapped_text, "    "))

We get a relatively "even" distribution of chunk sizes here.
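
One way to check how even the chunks are is to summarize their token counts. This is a minimal sketch using only the standard library; `summarize_token_counts` is a hypothetical helper, and the counts in the example are made up.

```python
from statistics import mean, pstdev


def summarize_token_counts(token_counts: list[int]) -> dict[str, float]:
    """Summarize chunk sizes; a low spread means the chunks are similar in size."""
    if not token_counts:
        raise ValueError("token_counts is empty")
    avg = mean(token_counts)
    return {
        "n_chunks": len(token_counts),
        "min": min(token_counts),
        "max": max(token_counts),
        "mean": avg,
        # Coefficient of variation: standard deviation relative to the mean
        "cv": pstdev(token_counts) / avg if avg else 0.0,
    }


# In this notebook: summarize_token_counts([c.token_count for c in chunk_texts])
example = summarize_token_counts([100, 200, 300])  # made-up counts
print(example)
```

A coefficient of variation close to 0 indicates chunks of nearly equal size; larger values indicate a wider spread.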

So let's continue on with this approach. 