## Working with PDFs that contain images

PDFs often combine text with rich formatting and images. Here is an example page:

<img src="data/imgs/manual_bosch_WGG254Z0GR_38_of_56.png" width="400px" />

### Approach 1 - Extract text and images separately

Some libraries, such as `docling`, can extract the text and images from a PDF and convert the result into a Markdown file that references the extracted images.

In [1]:
from pathlib import Path

data_folder = Path("data/pdfs")
output_dir = Path("data/parsed")
output_dir.mkdir(parents=True, exist_ok=True)

In [2]:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import ImageRefMode

IMAGE_RESOLUTION_SCALE = 2.0


def parse_pdf_with_images(input_doc_path: Path, output_dir: Path):
    # Reference: https://docling-project.github.io/docling/examples/export_figures/
    md_filename = output_dir / f"{input_doc_path.stem}-parsed-w-imgs.md"
    if md_filename.exists():
        print(f"Skipping {md_filename} as it already exists.")
        return

    pipeline_options = PdfPipelineOptions()
    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_picture_images = True

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    conv_res = doc_converter.convert(input_doc_path)

    # Save markdown with pictures written as separate files and referenced by path
    conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED)


pdf_names = [f.name for f in data_folder.glob("*.pdf") if f.is_file()]

for pdf_fname in pdf_names:
    print(f"Processing file: {pdf_fname}")

    input_doc_path = data_folder / pdf_fname

    print(f"Converting document {input_doc_path} to multimodal pages...")
    parse_pdf_with_images(input_doc_path, output_dir)


Processing file: howto-free-threading-python.pdf
Converting document data/pdfs/howto-free-threading-python.pdf to multimodal pages...
Processing file: manual_bosch_WGG254Z0GR.pdf
Converting document data/pdfs/manual_bosch_WGG254Z0GR.pdf to multimodal pages...




In [3]:
md_filepath = Path("data/parsed/manual_bosch_WGG254Z0GR-parsed-w-imgs.md")
md_txt = md_filepath.read_text(encoding="utf-8")
print(md_txt[:1000])

![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000000_5c54d11a8c20ca25ddd9d2f56c9f9680ede4e06e7883ce26742dd8b92f37e50b.png)

![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000001_b4602c47a93b10cce1805fd72fa0b8d8885610e99dc4769a522107f918300343.png)

## Washing machine

## WGG254Z0GR

User manual and installation

[en] instructions

## Futher information and explanations are available online:

![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000002_71ba2e278ecb31d5477d5b97185c8f2215227116e1615b161b0883eeea87bb26.png)

## Table of contents

| 1 Safety...........................................                      | 1 Safety...........................................                        | 4                           |
|--------------------------------------------------------------------------|----------------------------------------------------------------------------|-----------------------------|
| 1.1                                    

#### Chunking text files with images

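Chunking is more complex here than for plain text, because the Markdown also contains image references:

- Each chunk must contain the entire image reference string; a reference split across two chunks can no longer be resolved to a file.
- When vectorizing, image references must be replaced with the base64 encoding of the actual images.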
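
The first constraint can be illustrated with a deliberately naive splitter. This is a minimal sketch, not how any particular chunking library works: `chunk_keeping_images` and its `max_chars` parameter are hypothetical names, and it splits on character counts rather than tokens. Whenever a cut would land inside an image reference, the cut is moved to the end of that reference.

```python
import re

# Matches Markdown image references such as ![Image](path/to/file.png)
IMAGE_REF = re.compile(r"!\[.*?\]\(.*?\)")


def chunk_keeping_images(text, max_chars=500):
    """Split text into chunks of about max_chars without cutting an image reference."""
    # (start, end) character spans of every image reference in the text
    spans = [m.span() for m in IMAGE_REF.finditer(text)]
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        # If the cut falls strictly inside a reference, extend it to the reference's end
        for ref_start, ref_end in spans:
            if ref_start < end < ref_end:
                end = ref_end
                break
        chunks.append(text[start:end])
        start = end
    return chunks


text = "a" * 10 + "![Image](x/y.png)" + "b" * 10
chunks = chunk_keeping_images(text, max_chars=15)
print([len(c) for c in chunks])  # [27, 10]
```

A chunk can therefore exceed `max_chars` when it ends in a long image reference, which is the price of keeping the reference intact.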

One option is to use a specialized chunking library such as `chonkie`.

Chonkie offers a variety of chunking strategies:

<img src="assets/chonkie_methods.png" />

There is no one-size-fits-all solution for chunking PDFs with images, but these libraries are a reasonable starting point.

Let's try a couple of different approaches:

In [4]:
from chonkie import RecursiveChunker

# Initialize the recursive chunker to chunk Markdown
chunker = RecursiveChunker.from_recipe("markdown", lang="en")

In [5]:
chunk_texts = chunker.chunk(md_txt)

In [6]:
import textwrap

for chunk in chunk_texts[:5]:
    print("\n" + "=" * 40)
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")
    print("Chunk text:")
    # Show only the first 500 characters, wrapped and indented for readability
    wrapped_text = textwrap.fill(chunk.text[:500] + "...", width=80)
    print(textwrap.indent(wrapped_text, "    "))


Token count: 569
Start index: 0
End index: 569
Chunk text:
    ![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000000_5c54d11a8c
    20ca25ddd9d2f56c9f9680ede4e06e7883ce26742dd8b92f37e50b.png)
    ![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000001_b4602c47a9
    3b10cce1805fd72fa0b8d8885610e99dc4769a522107f918300343.png)  ## Washing machine
    ## WGG254Z0GR  User manual and installation  [en] instructions  ## Futher
    information and explanations are available online:
    ![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000002_71...

Token count: 2046
Start index: 569
End index: 2615
Chunk text:
    ## Table of contents  | 1 Safety...........................................
    | 1 Safety...........................................                        | 4
    | |--------------------------------------------------------------------------|--
    --------------------------------------------------------------------------|-----
    ------

Let's try a "semantic" chunker:

In [7]:
from chonkie import SemanticChunker

# Basic initialization with default parameters
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    threshold=0.5,                               # Similarity threshold (0-1) or (1-100) or "auto"
    chunk_size=2048,                             # Maximum tokens per chunk
    min_sentences=1                              # Minimum number of sentences per chunk
)

In [8]:
chunk_texts = chunker.chunk(md_txt)

In [9]:
for chunk in chunk_texts[:5]:
    print("\n" + "=" * 40)
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")
    print("Chunk text:")
    # Show only the first 500 characters, wrapped and indented for readability
    wrapped_text = textwrap.fill(chunk.text[:500] + "...", width=80)
    print(textwrap.indent(wrapped_text, "    "))


Token count: 203
Start index: 0
End index: 427
Chunk text:
    ![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000000_5c54d11a8c
    20ca25ddd9d2f56c9f9680ede4e06e7883ce26742dd8b92f37e50b.png)
    ![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000001_b4602c47a9
    3b10cce1805fd72fa0b8d8885610e99dc4769a522107f918300343.png)  ## Washing machine
    ## WGG254Z0GR  User manual and installation  [en] instructions  ## Futher
    information and explanations are available online: ...

Token count: 134
Start index: 427
End index: 645
Chunk text:
     ![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000002_71ba2e278
    ecb31d5477d5b97185c8f2215227116e1615b161b0883eeea87bb26.png)  ## Table of
    contents  | 1 Safety........................................... ...

Token count: 1953
Start index: 645
End index: 7375
Chunk text:
                         | 1 Safety...........................................
    | 4                           | |--------
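
To build intuition for what a similarity threshold does, here is a conceptual sketch. It is not chonkie's implementation: `group_by_similarity` is a hypothetical helper, and the 2-dimensional vectors are made up in place of real sentence embeddings. Consecutive sentences stay in the same group while the cosine similarity between neighbours is at or above the threshold; a drop below it starts a new group.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_by_similarity(sentences, vectors, threshold=0.5):
    """Group consecutive sentences whose neighbouring vectors are similar enough."""
    if not sentences:
        return []
    groups = [[sentences[0]]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev_vec, cur_vec) >= threshold:
            groups[-1].append(sentence)  # same topic: extend the current group
        else:
            groups.append([sentence])    # similarity dropped: start a new group
    return groups


sentences = ["Unscrew the pump cap.", "Remove the filter insert.", "Table of contents"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # toy embeddings, not model output
groups = group_by_similarity(sentences, vectors, threshold=0.5)
print(len(groups))  # 2
```

A lower threshold merges more sentences into each group, while a higher one produces more, smaller groups.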

### Set up Weaviate Collection

In [10]:
import weaviate
import os

client = weaviate.connect_to_embedded(
    version="1.32.0",
    headers={
        "X-Cohere-Api-Key": os.getenv("COHERE_API_KEY"),
    },
    environment_variables={"LOG_LEVEL": "error"}  # Reduce amount of logs
)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


In [14]:
client.collections.delete("Chunks")

In [15]:
from weaviate.classes.config import Property, DataType, Configure, Tokenization

client.collections.create(
    name="Chunks",
    properties=[
        Property(
            name="document_title",
            data_type=DataType.TEXT,
        ),
        Property(
            name="chunk",
            data_type=DataType.TEXT,
        ),
        Property(
            name="chunk_number",
            data_type=DataType.INT,
        ),
        Property(
            name="filename",
            data_type=DataType.TEXT,
            tokenization=Tokenization.FIELD
        ),
    ],
    vector_config=[
        Configure.Vectors.text2vec_cohere(
            name="default",
            source_properties=["document_title", "chunk"],
            model="embed-v4.0"
        )
    ]
)

<weaviate.collections.collection.sync.Collection at 0x4008b3a90>

In [16]:
chunks = client.collections.get("Chunks")

### Import data

In [17]:
from tqdm import tqdm
from weaviate.util import generate_uuid5

with chunks.batch.fixed_size(batch_size=100) as batch:
    for i, chunk_text in tqdm(enumerate(chunk_texts)):
        obj = {
            "document_title": "Bosch WGG254Z0GR Manual",
            "filename": "data/pdfs/manual_bosch_WGG254Z0GR.pdf",
            "chunk": chunk_text.text,
            "chunk_number": i + 1,
        }

        # Add object to batch for import
        batch.add_object(
            properties=obj,
            uuid=generate_uuid5(obj),
        )

170it [00:00, 26827.89it/s]
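
Passing `generate_uuid5(obj)` as the UUID gives each object an ID derived from its properties, so re-running the import with the same chunk should target the same object rather than create a second one. The idea behind such name-based IDs can be shown with the standard library alone. This is a sketch of the concept, not Weaviate's implementation; `deterministic_id`, the JSON serialization, and the choice of `uuid.NAMESPACE_DNS` are assumptions made for illustration.

```python
import json
import uuid


def deterministic_id(properties: dict) -> str:
    """Derive a stable UUID from an object's properties."""
    # Serialize with sorted keys so that key order does not change the ID
    canonical = json.dumps(properties, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, canonical))


obj = {"filename": "data/pdfs/manual_bosch_WGG254Z0GR.pdf", "chunk_number": 1}
same = {"chunk_number": 1, "filename": "data/pdfs/manual_bosch_WGG254Z0GR.pdf"}
other = {"filename": "data/pdfs/manual_bosch_WGG254Z0GR.pdf", "chunk_number": 2}

print(deterministic_id(obj) == deterministic_id(same))   # True
print(deterministic_id(obj) == deterministic_id(other))  # False
```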


### RAG queries

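How do we perform RAG in this scenario? The retrieved chunks contain image references rather than the images themselves, so the steps are: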

- Retrieve text chunks
- Get images referred to in the text
- Convert the images to base64
- Send (retrieved text + images + prompt) to LLM for RAG

In [22]:
response = chunks.query.hybrid(
    query="How to clean the drain pump",
    limit=10
)

for o in response.objects:
    print("\n" + "=" * 40)
    print(o.properties["chunk"][:1000] + "...")



![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000058_04dccced21c2179765f4eb775befd46cac74f808901d8ae529124f232faa6274.png)

![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000059_dd70b8adb294a82cbf04f72acae0f75d553d2b0f0aea2989a28c95d1114ed57c.png)

![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000060_b0bd539957118736f36bb5b1cc063528eab6ea593cbe7b74a39885deabb28ff6.png)

![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000061_c7bf0b232a5a96f943c0537143799264c3245cda435c48450457ef67e1389a97.png)

## 17.3 Cleaning the drain pump

Clean the drain pump regularly, at least once a year, as well as in the event of faults, e.g. ...

→ Page 36

1. Since water may remain in the drain pump, unscrew the pump cap carefully.
2. -The filter insert in the pump housing may become stuck due to coarse particles of dirt. Loosen the dirt and remove the filter insert.
2. Clean the interior, the thread on the pump cap and the pump housing.
3. E

In [27]:
import re

def extract_image_paths(text):
    """Extract image paths from markdown-style image references."""
    pattern = r'!\[.*?\]\((.*?)\)'
    return re.findall(pattern, text)
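
A quick check of the pattern on made-up strings (the paths below are illustrative, not files from the parsed manual). The second call shows a limitation worth knowing about: because the path group stops at the first closing parenthesis, a path that itself contains parentheses is cut short.

```python
import re


def extract_image_paths(text):
    """Extract image paths from markdown-style image references."""
    pattern = r'!\[.*?\]\((.*?)\)'
    return re.findall(pattern, text)


sample = "![Image](artifacts/image_000000.png)\n\n## Title\n\n![Fig](b.png)"
print(extract_image_paths(sample))                 # ['artifacts/image_000000.png', 'b.png']
print(extract_image_paths("![x](figure(1).png)"))  # ['figure(1']
```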

In [35]:
import base64


def get_image_base64s(image_paths, base_path=None):
    """Read each image file and return its contents as a base64-encoded string."""
    base64_images = []
    for img_path in image_paths:
        # Paths in the Markdown are relative to the folder holding the parsed file
        full_path = Path(base_path) / img_path if base_path else Path(img_path)
        image_bytes = full_path.read_bytes()
        base64_string = base64.b64encode(image_bytes).decode("utf-8")
        base64_images.append(base64_string)

    return base64_images
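
The helper can be sanity-checked without the parsed manual by round-tripping a temporary file. In this sketch the "image" is only the 8-byte PNG signature rather than a real picture, and the file name `img.png` is made up. Every PNG file starts with these bytes, so its base64 encoding always begins with `iVBORw0KGgo`.

```python
import base64
import tempfile
from pathlib import Path


def get_image_base64s(image_paths, base_path=None):
    """Read each image file and return its contents as a base64-encoded string."""
    base64_images = []
    for img_path in image_paths:
        full_path = Path(base_path) / img_path if base_path else Path(img_path)
        base64_images.append(base64.b64encode(full_path.read_bytes()).decode("utf-8"))
    return base64_images


with tempfile.TemporaryDirectory() as tmp:
    png_signature = b"\x89PNG\r\n\x1a\n"  # signature bytes only, not a full image
    (Path(tmp) / "img.png").write_bytes(png_signature)
    encoded = get_image_base64s(["img.png"], base_path=tmp)

print(base64.b64decode(encoded[0]) == png_signature)  # True
print(encoded[0])                                     # iVBORw0KGgo=
```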

In [41]:
all_chunks = ""
all_images = []

for o in response.objects:
    chunk_text = o.properties["chunk"]
    image_paths = extract_image_paths(chunk_text)
    all_images.extend(get_image_base64s(image_paths, base_path="data/parsed"))

    all_chunks += "\n\n" + chunk_text

In [60]:
all_images[:3]

['iVBORw0KGgoAAAANSUhEUgAAAUYAAADqCAIAAABRHHuGAABgsUlEQVR4nO2ddVhU297HpweG7u4QRUFEVFCwE+nuBunuRjBAFEwUle4GxW5sxRZEEQQEBKRzar/PMOfwej2IxQiM+/PcP+4ZZu+9Fvjda61fQgEAIBKJAABAQEBA5jJQKBQGgyEgEMiFCxdevnwJh8NnekggICC/AhQKxWKxKioqioqKJEkPDw/39vYiEAgikfhLNwQBAZlJYDAYFosdHR2FQCAkSUOhUAgEwsrKKisrS/7/ICAgcwUsFvvs2bPPnz+TxUuSNAQCAQCAlZV1zZo1Mz08EBCQnwOLxb5//76rq4v8n7CJHwDj/OTdQEBAZhg8Hv+lcv9f0iAgIFQAKGkQEKoClDQICFUBShoEhKoAJT09fO7qPHoo8eL5czM9EJC/nX+cWCC/Bh6He/ToUWFebtGFq58hGOjgZ29rI08fPwYGhpkeGshfCijpX6Sv+3N5eVluYfGd6udQ6TXsGj6SYjJDLW/35sY9fPQoISFRTEJypscI8jcCbrx/DgAgPnpwPzQwQFFV1/v0mVeiqvB5inRs3IyiC6EwGL2QlKRrwu0xjs1qmoV5eTM9WJC/EXCV/lGGBwcqysuLSsvuNHTghJayq3kJcgtDIRC2BfKfruQ05cbybLFCs/MiaOjEjLw77p+38Ql+Uv04MCSUHtyEg/xBQEl/n+fPnp09U1Fw+XYLnoZOfjOb8iI0PRNAJAB4LACBQBFIni1Wvc9vthQfYl+pziitBBDwXIrb6IXmJ2TvvfdI42BiovTCRTM9CZC/BVDS32RsdOTShfNZ2Tl3n9U09w6zSMkLWQTCEWgibpSIG/v/7xGJAITILKtCw8776XL26KcmDmVNgAjDcAvNc9r3rDJFVV0zJjLcxMx8JicD8tcAnqUnJz0tTWGlinnQnmtYbjqLPYvDsuh4RZuz9oy0vIUikP/9PoDH0fCJ8+t54gd6Wo

In [64]:
task_text = """
How do I clean the drain pump? Answer based on the provided text and images.

Describe the details from the figures as well, if necessary.
""" + "\n\n" + all_chunks

message = {
    "role": "user", "content": [
        {"type": "text", "text": task_text}
    ]
}

for img in all_images:
    content = {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": img,
        }
    }
    message["content"].append(content)
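
The loop above hard-codes `image/png`, which matches the `.png` artifacts that docling wrote here. If the image paths could carry other extensions, the media type can be derived from the file name instead. A minimal sketch using the standard library: `build_image_block` is a hypothetical helper, and it needs the image path alongside the base64 string, which `get_image_base64s` does not currently return.

```python
import mimetypes


def build_image_block(image_path: str, base64_data: str) -> dict:
    """Build an image content block whose media type is guessed from the file name."""
    media_type, _ = mimetypes.guess_type(image_path)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"Not a recognized image type: {image_path}")
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64_data,
        },
    }


block = build_image_block("figures/pump.png", "iVBORw0KGgo=")
print(block["source"]["media_type"])  # image/png
```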

In [65]:
len(message["content"])

16

In [67]:
import anthropic

anthropic_response = anthropic.Anthropic().messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[message]
)

In [72]:
print(anthropic_response.content[0].text)

Based on the provided text and images, here's how to clean the drain pump:

## Prerequisites
- Ensure the drain pump is empty first
- Turn off the water tap
- Switch off the appliance
- Disconnect the mains plug from power supply
- Open and remove the service flap (maintenance flap for drain pump shown as item 2 in the appliance diagram)

## Cleaning Steps

**1. Unscrew the pump cap carefully**
- Water may remain in the drain pump, so be cautious
- The images show the pump cap location and how to access it

**2. Remove the filter insert**
- The filter insert in the pump housing may become stuck due to coarse particles of dirt
- Loosen the dirt and remove the filter insert
- The images demonstrate this removal process

**3. Clean components thoroughly**
- Clean the interior of the pump housing
- Clean the thread on the pump cap
- Clean the pump housing itself
- The images show cleaning with water and proper handling

**4. Check the impeller**
- Ensure that the impeller in the drain pump
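
The answer above stops mid-sentence, which suggests the generation may have reached the `max_tokens=1024` limit. The Messages API reports why generation stopped in the response's `stop_reason` field. A minimal check follows, run against stand-in objects so that it works without an API key; `was_truncated` is a hypothetical helper.

```python
from types import SimpleNamespace


def was_truncated(response) -> bool:
    """Return True if the response stopped because it hit the token limit."""
    return getattr(response, "stop_reason", None) == "max_tokens"


# Stand-in objects that mimic only the one field this check reads
cut_off = SimpleNamespace(stop_reason="max_tokens")
complete = SimpleNamespace(stop_reason="end_turn")

print(was_truncated(cut_off))   # True
print(was_truncated(complete))  # False
```

With the real response, `was_truncated(anthropic_response)` returning `True` would be a cue to raise `max_tokens` or to send fewer chunks.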

In [73]:
client.close()

{"build_git_commit":"7cebee0421","build_go_version":"go1.24.5","build_image_tag":"HEAD","build_wv_version":"1.32.0","error":"context canceled","level":"error","msg":"replication engine failed to start after FSM caught up","time":"2025-07-16T16:32:51+01:00"}
{"build_git_commit":"7cebee0421","build_go_version":"go1.24.5","build_image_tag":"HEAD","build_wv_version":"1.32.0","error":"cannot find peer","level":"error","msg":"transferring leadership","time":"2025-07-16T16:32:51+01:00"}
