In [1]:
from dotenv import load_dotenv

load_dotenv()

True

## Working with PDFs with images

PDFs contain more than richly formatted text: they also contain images, such as charts and figures.

<img src="data/imgs/hai_ai_index_report_2025_chapter_2_34_of_80.jpg" width="200px" />
<img src="data/imgs/hai_ai_index_report_2025_chapter_2_58_of_80.jpg" width="200px" />
<img src="data/imgs/hai_ai_index_report_2025_chapter_2_69_of_80.jpg" width="200px" />

How do we work with these for RAG?

### Approach 1 - Extract text and images separately

Some libraries (like `docling`) can extract text and images from PDFs and convert them into Markdown files. Here, each extracted image is saved as a separate file and referenced from the Markdown.

In [2]:
from pathlib import Path

data_folder = Path("data/pdfs")
output_dir = Path("data/parsed")
output_dir.mkdir(parents=True, exist_ok=True)

In [3]:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import ImageRefMode

IMAGE_RESOLUTION_SCALE = 2.0


def parse_pdf_with_images(input_doc_path: Path, output_dir: Path):
    """Convert a PDF to Markdown in `output_dir`, saving extracted images alongside it."""
    # Reference: https://docling-project.github.io/docling/examples/export_figures/
    md_filename = output_dir / f"{input_doc_path.name.split('.')[0]}-parsed-w-imgs.md"
    if md_filename.exists():
        print(f"Skipping {md_filename} as it already exists.")
        return

    pipeline_options = PdfPipelineOptions()
    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_picture_images = True

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    conv_res = doc_converter.convert(input_doc_path)

    # Save Markdown; images are written as separate files and referenced by path
    conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED)


pdf_names = [f.name for f in data_folder.glob("*.pdf") if f.is_file()]

for pdf_fname in pdf_names:
    print(f"Processing file: {pdf_fname}")

    input_doc_path = data_folder / pdf_fname

    print(f"Converting document {input_doc_path} to multimodal pages...")
    parse_pdf_with_images(input_doc_path, output_dir)


Processing file: howto-free-threading-python.pdf
Converting document data/pdfs/howto-free-threading-python.pdf to multimodal pages...
Skipping data/parsed/howto-free-threading-python-parsed-w-imgs.md as it already exists.
Processing file: manual_bosch_WGG254Z0GR.pdf
Converting document data/pdfs/manual_bosch_WGG254Z0GR.pdf to multimodal pages...
Skipping data/parsed/manual_bosch_WGG254Z0GR-parsed-w-imgs.md as it already exists.
Processing file: hai_ai_index_report_2025_chapter_2.pdf
Converting document data/pdfs/hai_ai_index_report_2025_chapter_2.pdf to multimodal pages...
Skipping data/parsed/hai_ai_index_report_2025_chapter_2-parsed-w-imgs.md as it already exists.


In [4]:
md_filepath = Path("data/parsed/hai_ai_index_report_2025_chapter_2-parsed-w-imgs.md")
md_txt = md_filepath.read_text()
print(md_txt[:1000])

## Arti fi cial Intelligence Index Report 2025

![Image](hai_ai_index_report_2025_chapter_2-parsed-w-imgs_artifacts/image_000000_c6ec2f8165962bd018633d07074eaf41cfb78512dd53037fa2fcdda5ff3e6b52.png)

![Image](hai_ai_index_report_2025_chapter_2-parsed-w-imgs_artifacts/image_000001_23acef5ec9d7e4fa2bc47b734cc808da8578bc8d91748aea27d76aedc4f31dd0.png)

![Image](hai_ai_index_report_2025_chapter_2-parsed-w-imgs_artifacts/image_000002_2ef37eac1d46d33cf738050ed266e05091775c5ea224676eb6a41aeb01cb00f1.png)

## Chapter 2: Technical Performance

Overview

84

Chapter Highlights

85

## 2.1 Overview of AI in 2024

87

Timeline: Signi fi cant Model and Dataset Releases

87

State of AI Performance

93

Overall Review

93

Closed vs. Open-Weight Models

94

US vs. China Technical Performance

96

Improved Performance From Smaller Models

98

Model Performance Converges at the Frontier

99

Benchmarking AI

100

## 2.2 Language

## 103

| Understanding                                        |   104 |

#### Chunking text files with images

Chunking is more complex here than for plain text, since we need to handle image references as well.

- Each chunk must contain the entire image reference string (`![Image](...)`), never a fragment of it
- When vectorizing, you can optionally include the base64-encoded image
    - This requires a multimodal embedding model
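The first requirement can be met without any library. The sketch below is a minimal, hypothetical baseline (the function name `chunk_markdown_blocks` and the `max_chars` budget are illustrative, not part of any library). It splits the Markdown on blank lines and greedily packs whole blocks into chunks. It assumes each image reference sits in its own block, as in the docling output above, so a reference is never cut in two. A block longer than the budget is kept whole rather than split.

```python
import re


def chunk_markdown_blocks(md_text: str, max_chars: int = 1000) -> list[str]:
    """Greedily pack blank-line-separated Markdown blocks into chunks."""
    # Split on blank lines so headings, paragraphs, tables and image lines stay whole
    blocks = [b.strip() for b in re.split(r"\n\s*\n", md_text) if b.strip()]

    chunks = []
    current = ""
    for block in blocks:
        candidate = f"{current}\n\n{block}" if current else block
        if current and len(candidate) > max_chars:
            # Adding this block would exceed the budget: close the current chunk
            chunks.append(current)
            current = block
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hand-written sample; the image paths are made up for illustration
sample = (
    "## Title\n\n"
    "![Image](artifacts/img_1.png)\n\n"
    + "word " * 30
    + "\n\n![Image](artifacts/img_2.png)\n\nTail."
)

for chunk in chunk_markdown_blocks(sample, max_chars=60):
    print(repr(chunk[:40]))
```

This budget counts characters, not tokens, and ignores semantic boundaries, which is where dedicated chunking libraries come in.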

One option is to use a specialized library like `chonkie` to handle this.

Chonkie offers a variety of chunking strategies:

<img src="assets/chonkie_methods.png" />

There is no one-size-fits-all solution for chunking PDFs with images, but these libraries can help you get started.

Let's try a couple of different approaches:

In [5]:
from chonkie import RecursiveChunker

# Initialize the recursive chunker to chunk Markdown
chunker = RecursiveChunker.from_recipe("markdown", lang="en")

In [6]:
chunk_texts = chunker.chunk(md_txt)

In [7]:
import textwrap

for chunk in chunk_texts[:5]:
    print("\n" + "=" * 40)
    print(f"Token count: {chunk.token_count}")
    print("Chunk text:")
    wrapped_text = textwrap.fill(chunk.text[:500] + "...", width=80)
    print(textwrap.indent(wrapped_text, "    "))


Token count: 911
Chunk text:
    ## Arti fi cial Intelligence Index Report 2025
    ![Image](hai_ai_index_report_2025_chapter_2-parsed-w-imgs_artifacts/image_000000
    _c6ec2f8165962bd018633d07074eaf41cfb78512dd53037fa2fcdda5ff3e6b52.png)
    ![Image](hai_ai_index_report_2025_chapter_2-parsed-w-imgs_artifacts/image_000001
    _23acef5ec9d7e4fa2bc47b734cc808da8578bc8d91748aea27d76aedc4f31dd0.png)
    ![Image](hai_ai_index_report_2025_chapter_2-parsed-w-imgs_artifacts/image_000002
    _2ef37eac1d46d33cf738050ed266e05091775c5ea224676eb6a41aeb01cb00f1.pn...

Token count: 1755
Chunk text:
    ## 2.2 Language  ## 103  | Understanding
    |   104 | |------------------------------------------------------|-------| |
    MMLU: Massive Multitask Language Understanding       |   104 | | Generation
    |   105 | | Chatbot Arena Leaderboard                            |   105 | |
    Arena-Hard-Auto                                      |   107 | | WildBench
    |   108 | | Highlight: o1, o3,...
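Note that the `.pn...` at the end of the first chunk above comes from the 500-character print truncation, not from the chunker. To check the chunks themselves, a small helper can flag any chunk in which an image reference starts but never finishes. This is a hedged sketch: `has_truncated_image_ref` is a hypothetical helper, and it only catches references cut off at their tail, not chunks that begin with the leftover end of a reference.

```python
import re

# A complete Markdown image reference: ![alt](path)
COMPLETE_IMAGE_REF = re.compile(r"!\[[^\]]*\]\([^)]*\)")


def has_truncated_image_ref(chunk_text: str) -> bool:
    """Return True if an image reference starts in the chunk but is never closed."""
    # Strip every complete reference; any "![" left over belongs to a cut-off one
    remainder = COMPLETE_IMAGE_REF.sub("", chunk_text)
    return "![" in remainder


print(has_truncated_image_ref("Intro ![Image](artifacts/fig.png) outro"))  # False
print(has_truncated_image_ref("Intro ![Image](artifacts/fi"))  # True
```

With the chunk objects from the cells above, `[i for i, c in enumerate(chunk_texts) if has_truncated_image_ref(c.text)]` would list the indices of any offending chunks.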


Let's try a "semantic" chunker:

In [8]:
from chonkie import SemanticChunker

# Basic initialization with default parameters
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    threshold=0.5,                               # Similarity threshold (0-1) or (1-100) or "auto"
    chunk_size=2048,                             # Maximum tokens per chunk
    min_sentences=1                              # Initial sentences per chunk
)

In [9]:
# Chunk text into `chunk_texts` as we've done before
# BEGIN_SOLUTION
chunk_texts = chunker.chunk(md_txt)
# END_SOLUTION

In [10]:
for chunk in chunk_texts[:5]:
    print("\n" + "=" * 40)
    print(f"Token count: {chunk.token_count}")
    print("Chunk text:")
    wrapped_text = textwrap.fill(chunk.text[:500] + "...", width=80)
    print(textwrap.indent(wrapped_text, "    "))


Token count: 269
Chunk text:
    ## Arti fi cial Intelligence Index Report 2025
    ![Image](hai_ai_index_report_2025_chapter_2-parsed-w-imgs_artifacts/image_000000
    _c6ec2f8165962bd018633d07074eaf41cfb78512dd53037fa2fcdda5ff3e6b52.png)
    ![Image](hai_ai_index_report_2025_chapter_2-parsed-w-imgs_artifacts/image_000001
    _23acef5ec9d7e4fa2bc47b734cc808da8578bc8d91748aea27d76aedc4f31dd0.png)
    ![Image](hai_ai_index_report_2025_chapter_2-parsed-w-imgs_artifacts/image_000002
    _2ef37eac1d46d33cf738050ed266e05091775c5ea224676eb6a41aeb01cb00f1.pn...

Token count: 554
Chunk text:
     Chapter Highlights  85  ## 2.1 Overview of AI in 2024  87  Timeline: Signi fi
    cant Model and Dataset Releases  87  State of AI Performance  93  Overall Review
    93  Closed vs. Open-Weight Models  94  US vs. China Technical Performance  96
    Improved Performance From Smaller Models  98  Model Performance Converges at the
    Frontier  99  Benchmarking AI  100  ## 2.2 Language  ## 103  | Unders

### Set up Weaviate Collection

In [11]:
import utils

# Helper function to connect to Weaviate
client = utils.connect_to_weaviate()



In [12]:
client.collections.delete("Chunks")

In [13]:
from weaviate.classes.config import Property, DataType, Configure, Tokenization

client.collections.create(
    name="Chunks",
    properties=[
        Property(
            name="document_title",
            data_type=DataType.TEXT,
        ),
        Property(
            name="chunk",
            data_type=DataType.TEXT,
        ),
        Property(
            name="chunk_number",
            data_type=DataType.INT,
        ),
        Property(
            name="filename",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD
        ),
    ],
    vector_config=[
        Configure.Vectors.text2vec_cohere(
            name="default",
            source_properties=["document_title", "chunk"],
            model="embed-v4.0"
        )
    ]
)

<weaviate.collections.collection.sync.Collection at 0x1230000a0>

In [14]:
chunks = client.collections.get("Chunks")

### Import data

In [15]:
from tqdm import tqdm

with chunks.batch.fixed_size(batch_size=100) as batch:
    for i, chunk_text in tqdm(enumerate(chunk_texts)):
        obj = {
            "document_title": "Stanford HAI Report 2025",
            "filename": "data/pdfs/hai_ai_index_report_2025_chapter_2.pdf",
            "chunk": chunk_text.text,
            "chunk_number": i + 1,
        }

        # Add object to batch for import with (batch.add_object())
        # BEGIN_SOLUTION
        batch.add_object(
            properties=obj
        )
        # END_SOLUTION

144it [00:00, 56116.30it/s]


### RAG queries

How do we perform RAG in this scenario?

This differs from text-only RAG, because we haven't embedded the images or stored them in Weaviate. Only the text chunks, which contain the image references, are stored.

In this scenario, let's:

- Retrieve text chunks
- Get images referred to in the text
- Convert the images to base64
- Send (retrieved text + images + prompt) to LLM for RAG

In [16]:
response = chunks.query.hybrid(
    query="Latest developments in self-driving cars / autonomous vehicles",
    limit=10
)

for o in response.objects:
    print("\n" + "=" * 40)
    print(o.properties["chunk"][:1000] + "...")



Self-driving vehicles have long been a goal for AI researchers and technologists. However, their widespread adoption has been  slower  than  anticipated.  Despite  many  predictions that  fully  autonomous  driving  is  imminent,  widespread  use of  self-driving  vehicles  has yet  to  become  a  reality.  Still,  in recent  years,  signi fi cant  progress  has  been  made.  In  cities like  San  Francisco  and  Phoenix, fl eets  of  self-driving  taxis are  now  operating  commercially.  This  section  examines recent advancements in autonomous driving, focusing on deployment, technological breakthroughs and new benchmarks, safety performance, and policy challenges.

## Deployment

Self-driving cars are increasingly being deployed worldwide. Cruise, a subsidiary of General Motors, launched its autonomous vehicles  in  San  Francisco  in  late  2022  before having its license suspended in 2023 after a litany of safety incidents. Waymo, a subsidiary of Alphabet, began deploying its  

In [17]:
import re

def extract_image_paths(text):
    """Extract image paths from markdown-style image references."""
    pattern = r'!\[.*?\]\((.*?)\)'
    return re.findall(pattern, text)
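A quick check of the pattern on a hand-written sample (the file names below are made up for illustration). Because the pattern is non-greedy and stops at the first `)`, a path that itself contains a closing parenthesis would be cut short; the hash-based file names docling generated here do not.

```python
import re


def extract_image_paths(text):
    """Extract image paths from markdown-style image references."""
    pattern = r'!\[.*?\]\((.*?)\)'
    return re.findall(pattern, text)


sample = (
    "Some text.\n\n"
    "![Image](artifacts/fig_a.png)\n\n"
    "More text with an inline ![Chart](artifacts/fig_b.png) reference "
    "and a regular [link](https://example.com)."
)

print(extract_image_paths(sample))
# ['artifacts/fig_a.png', 'artifacts/fig_b.png']
```

Regular links are skipped because the pattern requires the leading `!`.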

In [18]:
import base64


def get_image_base64s(image_paths, base_path=None):
    """Read each image file and return its contents as a base64-encoded string."""
    base64_images = []
    for img_path in image_paths:
        # Paths in the Markdown are relative to the folder holding the Markdown file
        full_path = Path(base_path) / img_path if base_path else Path(img_path)
        image_bytes = full_path.read_bytes()
        base64_string = base64.b64encode(image_bytes).decode("utf-8")
        base64_images.append(base64_string)

    return base64_images

In [19]:
all_chunks = ""
all_images = []

for o in response.objects:
    chunk_text = o.properties["chunk"]
    image_paths = extract_image_paths(chunk_text)
    all_images.extend(get_image_base64s(image_paths, base_path="data/parsed"))

    all_chunks += "\n\n" + chunk_text
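Ten retrieved chunks may reference many figures, and every image adds to the size of the LLM request. One possible refinement, sketched here with a hypothetical helper (`select_image_paths` and `max_images` are illustrative names, and the cap of 3 in the example is arbitrary), is to collect the paths first, drop any duplicates (which could arise, for example, if chunks overlap) while keeping retrieval order, and optionally limit the count before encoding.

```python
def select_image_paths(paths, max_images=None):
    """Deduplicate paths, preserving first-seen order, then optionally cap the count."""
    unique_paths = list(dict.fromkeys(paths))  # dicts keep insertion order
    if max_images is not None:
        unique_paths = unique_paths[:max_images]
    return unique_paths


# Made-up paths, listed in retrieval order
retrieved = ["a/fig_1.png", "a/fig_2.png", "a/fig_1.png", "a/fig_3.png", "a/fig_4.png"]

print(select_image_paths(retrieved, max_images=3))
# ['a/fig_1.png', 'a/fig_2.png', 'a/fig_3.png']
```

Since hybrid search returns objects in ranked order, truncating the list keeps the images from the highest-ranked chunks.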

In [20]:
task_text = """
What developments in self-driving cars / autonomous vehicles are mentioned here? Answer based on the provided text and images.

Describe the details from the figures as well, if necessary.
""" + "\n\n" + all_chunks

message = {
    "role": "user", "content": [
        {"type": "text", "text": task_text}
    ]
}

for img in all_images:
    content = {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": img,
        }
    }
    # Append `content` to message["content"]
    # BEGIN_SOLUTION
    message["content"].append(content)
    # END_SOLUTION
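The loop above hard-codes `image/png`, which matches the PNG files docling wrote in this notebook. If images may come in other formats, the media type could be derived from the file extension instead. A minimal sketch, assuming a hypothetical helper `image_content_block` and using only the standard library's `mimetypes` module; check your LLM provider's documentation for the image types it accepts.

```python
import base64
import mimetypes
from pathlib import Path


def image_content_block(image_path):
    """Build a base64 image content block, inferring the media type from the extension."""
    path = Path(image_path)
    media_type, _ = mimetypes.guess_type(path.name)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"Cannot infer an image media type for {path}")

    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(path.read_bytes()).decode("utf-8"),
        },
    }
```

The guess is based on the file name only, so a mislabelled file would still get the wrong media type.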

In [21]:
import anthropic

anthropic_response = anthropic.Anthropic().messages.create(
    model="claude-3-5-haiku-latest",
    max_tokens=1024,
    # Add [message] as the messages to pass to Claude
    # BEGIN_SOLUTION
    messages=[message]
    # END_SOLUTION
)

In [22]:
print(anthropic_response.content[0].text)

Based on the provided text and images, here are the key developments in self-driving cars/autonomous vehicles mentioned:

## Deployment Developments

**United States:**
- **Waymo** (Alphabet subsidiary) has emerged as a leading player, operating in four major U.S. cities: Phoenix, San Francisco, Los Angeles, and Austin
- As of January 2025, Waymo provides 150,000 paid rides per week, covering over 1 million miles
- Plans to test in 10 additional cities including Las Vegas, San Diego, Miami, upstate New York, and Truckee, California (specifically chosen for snowy weather testing)

- **Cruise** (General Motors subsidiary) launched in San Francisco in late 2022 but had its license suspended in 2023 due to safety incidents

- **Self-driving trucks**: Companies like Kodiak completed first driverless deliveries, and Aurora reported over 1 million miles of autonomous freight hauling on U.S. highways since 2021 (with human safety drivers present). Aurora delayed its commercial launch from end 

In [23]:
client.close()

