## Working with PDFs with images

PDFs often combine text with rich formatting and images. Here's an example page:

<img src="data/imgs/manual_bosch_WGG254Z0GR_38_of_56.png" width="400px" />

### Approach 1 - Extract text and images separately

Some libraries (like `docling`) can extract text and images from PDFs, and convert them into Markdown files.

In [1]:
from pathlib import Path

data_folder = Path("data/pdfs")
output_dir = Path("data/parsed")
output_dir.mkdir(parents=True, exist_ok=True)

In [2]:
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import ImageRefMode

IMAGE_RESOLUTION_SCALE = 2.0


def parse_pdf_with_images(input_doc_path: Path, output_dir: Path):
    # Reference: https://docling-project.github.io/docling/examples/export_figures/
    pipeline_options = PdfPipelineOptions()
    pipeline_options.images_scale = IMAGE_RESOLUTION_SCALE
    pipeline_options.generate_picture_images = True

    doc_converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

    conv_res = doc_converter.convert(input_doc_path)
    # Save markdown with embedded pictures
    # Path.stem drops only the final ".pdf", so dots elsewhere in the file name are kept
    md_filename = output_dir / f"{input_doc_path.stem}-parsed-w-imgs.md"
    conv_res.document.save_as_markdown(md_filename, image_mode=ImageRefMode.REFERENCED)


pdf_names = [f.name for f in data_folder.glob("*.pdf") if f.is_file()]

for pdf_fname in pdf_names:
    print(f"Processing file: {pdf_fname}")

    input_doc_path = data_folder / pdf_fname

    print(f"Converting document {input_doc_path} to multimodal pages...")
    parse_pdf_with_images(input_doc_path, output_dir)


Processing file: howto-free-threading-python.pdf
Converting document data/pdfs/howto-free-threading-python.pdf to multimodal pages...
Processing file: manual_bosch_WGG254Z0GR.pdf
Converting document data/pdfs/manual_bosch_WGG254Z0GR.pdf to multimodal pages...




In [2]:
md_filepath = Path("data/parsed/manual_bosch_WGG254Z0GR-parsed-w-imgs.md")
md_txt = md_filepath.read_text(encoding="utf-8")
print(md_txt[:1000])

![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000000_5c54d11a8c20ca25ddd9d2f56c9f9680ede4e06e7883ce26742dd8b92f37e50b.png)

![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000001_b4602c47a93b10cce1805fd72fa0b8d8885610e99dc4769a522107f918300343.png)

## Washing machine

## WGG254Z0GR

User manual and installation

[en] instructions

## Futher information and explanations are available online:

![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000002_71ba2e278ecb31d5477d5b97185c8f2215227116e1615b161b0883eeea87bb26.png)

## Table of contents

| 1 Safety...........................................                      | 1 Safety...........................................                        | 4                           |
|--------------------------------------------------------------------------|----------------------------------------------------------------------------|-----------------------------|
| 1.1                                    

#### Chunking text files with images

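Chunking Markdown that contains image references is more complex than chunking plain text, because the images have to stay usable:

- Each chunk must include the entire image reference string; a reference split across two chunks can no longer be resolved.
- When vectorizing, image references must be replaced with the base64 encoding of the actual images.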
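
The second point can be sketched with the standard library alone. This is a minimal illustration, not part of `docling` or `chonkie`: the function name `inline_images_as_base64`, its `base_dir` parameter, and the choice of a `data:` URI as the output format are all assumptions. Check what input format your embedding model actually expects.

```python
import base64
import mimetypes
import re
from pathlib import Path

# Matches Markdown image references such as ![Image](some/path.png)
IMAGE_REF_PATTERN = re.compile(r"!\[(?P<alt>[^\]]*)\]\((?P<path>[^)\s]+)\)")


def inline_images_as_base64(chunk_text: str, base_dir: Path) -> str:
    """Replace local image references in chunk_text with base64 data URIs.

    base_dir (hypothetical parameter) is the folder containing the Markdown
    file, since the references are relative to it. References whose file
    cannot be found are left unchanged.
    """

    def _replace(match: re.Match) -> str:
        img_path = base_dir / match.group("path")
        if not img_path.is_file():
            return match.group(0)
        mime_type = mimetypes.guess_type(img_path.name)[0] or "application/octet-stream"
        encoded = base64.b64encode(img_path.read_bytes()).decode("ascii")
        return f"![{match.group('alt')}](data:{mime_type};base64,{encoded})"

    return IMAGE_REF_PATTERN.sub(_replace, chunk_text)
```

For the files parsed above, a call would look like `inline_images_as_base64(chunk.text, Path("data/parsed"))`. Doing this per chunk, just before vectorizing, keeps the stored Markdown small and readable.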

One option is to use a specialized chunking library such as `chonkie`.

Chonkie offers a variety of chunking strategies:

<img src="assets/chonkie_methods.png" />

There isn't going to be a "one size fits all" solution for chunking PDFs with images. But these libraries can help you get started.

Let's try a couple of different approaches:

In [12]:
from chonkie import RecursiveChunker

# Initialize the recursive chunker to chunk Markdown
chunker = RecursiveChunker.from_recipe("markdown", lang="en")


In [13]:
chunks = chunker.chunk(md_txt)

In [14]:
for chunk in chunks[:5]:
    print("\n" + "=" * 40)
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")
    print(f"Chunk text: {chunk.text}")


Token count: 569
Start index: 0
End index: 569
Chunk text: ![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000000_5c54d11a8c20ca25ddd9d2f56c9f9680ede4e06e7883ce26742dd8b92f37e50b.png)

![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000001_b4602c47a93b10cce1805fd72fa0b8d8885610e99dc4769a522107f918300343.png)

## Washing machine

## WGG254Z0GR

User manual and installation

[en] instructions

## Futher information and explanations are available online:

![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000002_71ba2e278ecb31d5477d5b97185c8f2215227116e1615b161b0883eeea87bb26.png)



Token count: 2046
Start index: 569
End index: 2615
Chunk text: ## Table of contents

| 1 Safety...........................................                      | 1 Safety...........................................                        | 4                           |
|--------------------------------------------------------------------------|------------------------
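
In the output above, each image reference sits wholly inside one chunk. To check that for every chunk rather than just the first few, a small helper can compare the references in the full text against the chunks. This is a sketch under assumptions: `find_split_image_refs` is a hypothetical helper, and the regular expression only covers simple `![alt](path)` references like the ones shown here.

```python
import re

# Matches Markdown image references such as ![Image](some/path.png)
IMAGE_REF_PATTERN = re.compile(r"!\[[^\]]*\]\([^)\s]+\)")


def find_split_image_refs(full_text: str, chunk_texts: list[str]) -> list[str]:
    """Return image references from full_text that no single chunk contains whole."""
    all_refs = IMAGE_REF_PATTERN.findall(full_text)
    return [ref for ref in all_refs if not any(ref in chunk for chunk in chunk_texts)]
```

Calling `find_split_image_refs(md_txt, [chunk.text for chunk in chunks])` returns an empty list when every reference survived chunking intact.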

Let's try a "semantic" chunker:

In [9]:
from chonkie import SemanticChunker

# Basic initialization with default parameters
chunker = SemanticChunker(
    embedding_model="minishlab/potion-base-8M",  # Default model
    threshold=0.5,                               # Similarity threshold (0-1) or (1-100) or "auto"
    chunk_size=2048,                             # Maximum tokens per chunk
    min_sentences=1                              # Minimum sentences per chunk
)


In [10]:
chunks = chunker.chunk(md_txt)

In [11]:
for chunk in chunks[:5]:
    print("\n" + "=" * 40)
    print(f"Token count: {chunk.token_count}")
    print(f"Start index: {chunk.start_index}")
    print(f"End index: {chunk.end_index}")
    print(f"Chunk text: {chunk.text}")


Token count: 203
Start index: 0
End index: 427
Chunk text: ![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000000_5c54d11a8c20ca25ddd9d2f56c9f9680ede4e06e7883ce26742dd8b92f37e50b.png)

![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000001_b4602c47a93b10cce1805fd72fa0b8d8885610e99dc4769a522107f918300343.png)

## Washing machine

## WGG254Z0GR

User manual and installation

[en] instructions

## Futher information and explanations are available online:


Token count: 134
Start index: 427
End index: 645
Chunk text: 
![Image](manual_bosch_WGG254Z0GR-parsed-w-imgs_artifacts/image_000002_71ba2e278ecb31d5477d5b97185c8f2215227116e1615b161b0883eeea87bb26.png)

## Table of contents

| 1 Safety........................................... 

Token count: 1953
Start index: 645
End index: 7375
Chunk text:                      | 1 Safety...........................................                        | 4                           |
|------------------------------------

How many chunks did we end up with?

In [15]:
len(chunks)

89

89 chunks for a 56-page manual, or roughly 1.6 chunks per page, sounds pretty reasonable.
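
The chunk count alone hides how uneven the chunks are; the outputs above range from about a hundred tokens to about two thousand. A short summary makes it easier to compare chunkers. This is a minimal sketch: `summarize_chunk_sizes` is a hypothetical helper, and it relies only on the `token_count` attribute printed earlier.

```python
from statistics import mean, median


def summarize_chunk_sizes(token_counts: list[int], n_pages: int) -> dict[str, float]:
    """Summarize how many chunks were produced and how large they are."""
    return {
        "n_chunks": len(token_counts),
        "chunks_per_page": round(len(token_counts) / n_pages, 2),
        "min_tokens": min(token_counts),
        "median_tokens": median(token_counts),
        "mean_tokens": round(mean(token_counts), 1),
        "max_tokens": max(token_counts),
    }
```

For the manual used here, the call would be `summarize_chunk_sizes([chunk.token_count for chunk in chunks], n_pages=56)`.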