![assets/workshop_cover.jpg](assets/workshop_cover.jpg)

![assets/workshop_agenda.jpg](assets/workshop_agenda.jpg)

## Working with PDFs - an introduction

### Get text from PDFs

PDFs contain rich formatting:

<img src="data/imgs/howto-free-threading-python_2_of_4.png" width="300px" />
<img src="data/imgs/manual_bosch_WGG254Z0GR_13_of_56.png" width="300px" />

Modern libraries can preserve document structure (headings, tables, lists) when converting a PDF to text, which makes the output easier to process while keeping related content together.

Let's try a popular library, `docling`.

Here's how we can convert PDFs into text files:

In [1]:
from pathlib import Path
from docling.document_converter import DocumentConverter


def parse_pdf(input_doc_path: Path, output_dir: Path):
    doc_converter = DocumentConverter()
    conv_res = doc_converter.convert(input_doc_path)

    # Save markdown
    filename = input_doc_path.stem
    md_filepath = output_dir / f"{filename}-parsed-text.md"
    print(f"Saving parsed text to {md_filepath}")
    with md_filepath.open("w", encoding="utf-8") as md_file:
        md_file.write(conv_res.document.export_to_markdown())

In [2]:
data_folder = Path("data/pdfs")
output_dir = Path("data/parsed")
output_dir.mkdir(parents=True, exist_ok=True)

input_pdf_name = "howto-free-threading-python.pdf"
input_doc_path = data_folder / input_pdf_name

In [3]:
# Run the parser function
# BEGIN_SOLUTION
parse_pdf(input_doc_path, output_dir)
# END_SOLUTION

Saving parsed text to data/parsed/howto-free-threading-python-parsed-text.md


Inspect the converted file:

In [4]:
md_filepath = Path("data/parsed/howto-free-threading-python-parsed-text.md")
md_txt = md_filepath.read_text(encoding="utf-8")  # match the encoding used when writing

# Print some part of the parsed text:
# BEGIN_SOLUTION
print(md_txt[:1000])
# END_SOLUTION

## Python experimental support for free threading

Release 3.13.3

## Guido van Rossum and the Python development team

April 27, 2025

Python Software Foundation Email: docs@python.org

## Contents

| 1     | Installation                                        |   2 |
|-------|-----------------------------------------------------|-----|
| 2     | Identifying free-threaded Python                    |   2 |
| 3     | The global interpreter lock in free-threaded Python |   2 |
| 4     | Thread safety                                       |   2 |
| 5     | Known limitations                                   |   2 |
| 5.1   | Immortalization . . . . . . . . . . . . . . . . .   |   3 |
| 5.2   | Frame objects . . . . . . . . . . . . . . . . . .   |   3 |
| 5.3   | Iterators . . . . . . . . . . . . . . . . . . . . . |   3 |
| 5.4   | Single-threaded performance . . . . . . . . . .     |   3 |
| Index |                                                     |   4 |

Starting with the 3.13 releas

### Chunking

Chunking breaks documents into smaller, manageable pieces while preserving context. 

![assets/chunking_why.png](assets/chunking_why.png)

#### RAG - Recap

![assets/llm_2_rag_basic.png](assets/llm_2_rag_basic.png)

What does this have to do with chunking? 

Each chunk becomes one unit of context: the retrieval step selects the most relevant chunks and passes them to the model alongside the question.
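To make the link between chunks and RAG concrete, here is a toy sketch of the retrieval step. It is not how a production system works: real pipelines usually rank chunks with embeddings, while this sketch just counts shared words. The function names and the example chunk strings are made up for illustration.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve_top_chunks(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Rank chunks by how many query words they share (a toy stand-in for embedding search)."""
    query_words = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_words & tokenize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Place the retrieved chunks ahead of the question, as a RAG pipeline would."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Placeholder chunks written for this example (not taken from the parsed PDF)
example_chunks = [
    "Installation: how to install the free-threaded build.",
    "Thread safety: how built-in types behave across threads.",
    "Known limitations: single-threaded performance.",
]
example_query = "How is thread safety handled?"
top_chunks = retrieve_top_chunks(example_query, example_chunks, top_k=1)
print(build_prompt(example_query, top_chunks))
```

Whatever ranking method is used, the chunk is the smallest piece that can be retrieved, which is why chunk boundaries matter.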

### Chunking strategies

Different chunking strategies serve different use cases. 

![assets/chunking_methods.png](assets/chunking_methods.png)

Let's try a few options:

#### Chunk by text length with overlap

Overlapping chunks help maintain context across boundaries. 

In [5]:
def get_chunks_by_length_with_overlap(src_text: str, chunk_length: int = 500, overlap: int = 100) -> list[str]:
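    """
    Split text into chunks that start every `chunk_length` characters.

    Each chunk also includes the next `overlap` characters, so a chunk is
    up to `chunk_length + overlap` characters long.
    """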
    chunks = []
    for i in range(0, len(src_text), chunk_length):
        chunks.append(src_text[i:i + chunk_length + overlap])
    return chunks

In [6]:
# Try chunking using the get_chunks_by_length_with_overlap method
# BEGIN_SOLUTION
chunks = get_chunks_by_length_with_overlap(md_txt)
# END_SOLUTION

# Have a look at some of the chunks
display(chunks[:5])
print(len(chunks[0]))

['## Python experimental support for free threading\n\nRelease 3.13.3\n\n## Guido van Rossum and the Python development team\n\nApril 27, 2025\n\nPython Software Foundation Email: docs@python.org\n\n## Contents\n\n| 1     | Installation                                        |   2 |\n|-------|-----------------------------------------------------|-----|\n| 2     | Identifying free-threaded Python                    |   2 |\n| 3     | The global interpreter lock in free-threaded Python |   2 |\n| 4     | Thread safety                                       |   2 |\n| 5     | Known limitations                       ',
 'ety                                       |   2 |\n| 5     | Known limitations                                   |   2 |\n| 5.1   | Immortalization . . . . . . . . . . . . . . . . .   |   3 |\n| 5.2   | Frame objects . . . . . . . . . . . . . . . . . .   |   3 |\n| 5.3   | Iterators . . . . . . . . . . . . . . . . . . . . . |   3 |\n| 5.4   | Single-threaded performance . .

600
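Note that the first chunk is 600 characters long, not 500, because the overlap is added on top of `chunk_length`. If a hard upper bound on chunk size matters (for example, to stay within a context budget), one alternative is a sliding window that advances by `chunk_length - overlap`. The sketch below is a variant written for illustration, and `get_sliding_window_chunks` is a hypothetical name, not part of the workshop code.

```python
def get_sliding_window_chunks(src_text: str, chunk_length: int = 500, overlap: int = 100) -> list[str]:
    """
    Split text into chunks of at most `chunk_length` characters.

    Consecutive chunks share `overlap` characters.
    """
    if not 0 <= overlap < chunk_length:
        raise ValueError("overlap must be non-negative and smaller than chunk_length")

    step = chunk_length - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(src_text), step):
        chunks.append(src_text[start:start + chunk_length])
        if start + chunk_length >= len(src_text):
            break  # this window already reaches the end of the text
    return chunks


sample_text = "abcdefghij" * 3  # 30 characters
window_chunks = get_sliding_window_chunks(sample_text, chunk_length=10, overlap=4)
print([len(c) for c in window_chunks])
```

The trade-off is more chunks for the same text, since each window advances by less than its own length.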


#### Chunk using markers

Using document markers (like headers) creates chunks that respect natural document boundaries. 

This approach preserves semantic structure and is ideal for documents with clear hierarchical organization like reports, manuals, or academic papers.

In [7]:
def get_chunks_using_markers(src_text: str) -> list[str]:
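    """
    Split the source text into chunks at every newline followed by `##`.

    The marker also matches deeper headings (`###`, `####`), and `strip()`
    drops the space after the marker, so later chunks begin with the marker
    directly followed by the heading text.
    """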
    marker = "\n##"

    # Split by marker and reconstruct with markers (except first chunk)
    parts = src_text.split(marker)
    chunks = []

    # Add first chunk if it exists and isn't empty
    if parts[0].strip():
        chunks.append(parts[0].strip())

    # Add remaining chunks with markers reattached
    for part in parts[1:]:
        if part.strip():
            chunks.append(marker + part.strip())

    return chunks

In [8]:
# Try chunking using the marker-based method
# BEGIN_SOLUTION
chunks = get_chunks_using_markers(md_txt)
# END_SOLUTION

# Have a look at some of the chunks
display(chunks[:5])
print(len(chunks[0]))

['## Python experimental support for free threading\n\nRelease 3.13.3',
 '\n##Guido van Rossum and the Python development team\n\nApril 27, 2025\n\nPython Software Foundation Email: docs@python.org',
 '\n##Contents\n\n| 1     | Installation                                        |   2 |\n|-------|-----------------------------------------------------|-----|\n| 2     | Identifying free-threaded Python                    |   2 |\n| 3     | The global interpreter lock in free-threaded Python |   2 |\n| 4     | Thread safety                                       |   2 |\n| 5     | Known limitations                                   |   2 |\n| 5.1   | Immortalization . . . . . . . . . . . . . . . . .   |   3 |\n| 5.2   | Frame objects . . . . . . . . . . . . . . . . . .   |   3 |\n| 5.3   | Iterators . . . . . . . . . . . . . . . . . . . . . |   3 |\n| 5.4   | Single-threaded performance . . . . . . . . . .     |   3 |\n| Index |                                                     |   4 |\n\

65
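In the output above, the space after `##` is lost in every chunk except the first, and the `\n##` marker would also split on deeper headings such as `###`. A possible variant, sketched here under the assumption that the parsed text uses ATX-style markdown headings (`## Title`), splits with a regular expression lookahead so that headings stay intact. `get_chunks_by_heading` is a hypothetical name used for illustration.

```python
import re


def get_chunks_by_heading(src_text: str, level: int = 2) -> list[str]:
    """
    Split markdown text before each heading with exactly `level` hashes.

    Headings are kept unchanged at the start of their chunk.
    """
    # Zero-width lookahead: match the start of any line that begins with
    # exactly `level` '#' characters followed by a space.
    pattern = re.compile(rf"(?m)^(?={'#' * level} )")
    return [part.strip() for part in pattern.split(src_text) if part.strip()]


sample_md = "## Title\n\nIntro\n\n## Contents\n\n### Sub\n\nBody"
heading_chunks = get_chunks_by_heading(sample_md)
print(heading_chunks)
```

Because the pattern requires a space right after the hashes, a `###` subsection stays inside its parent `##` chunk instead of starting a new one.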


### Choosing the right strategy

The best chunking strategy depends on your use case. 

Marker-based chunking excels with cleanly structured documents, but it can break down when the parser misidentifies headings:

In [9]:
md_file_2 = Path("data/parsed/manual_bosch_WGG254Z0GR-parsed-text.md")
md_text_2 = md_file_2.read_text(encoding="utf-8")
get_chunks_using_markers(md_text_2)

['<!-- image -->\n\n<!-- image -->',
 '\n##Washing machine',
 '\n##WGG254Z0GR\n\nUser manual and installation\n\n[en] instructions',
 '\n##Futher information and explanations are available online:\n\n<!-- image -->',
 '\n##Table of contents\n\n| 1 Safety...........................................                      | 1 Safety...........................................                        | 4                           |\n|--------------------------------------------------------------------------|----------------------------------------------------------------------------|-----------------------------|\n| 1.1                                                                      | General information...................                                     | 4                           |\n| 1.2                                                                      | Intended use..............................                                 | 4                           |\n| 1.3           

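Here, the page headers are mistakenly interpreted as headings, so the marker-based split produces fragments that do not correspond to real sections.

Length-based chunking does not depend on headings being detected correctly, which makes it the more robust choice for documents like this one.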
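One quick way to spot this kind of failure is to look at the distribution of chunk lengths: many very short chunks next to a few very long ones suggest that the markers do not match the real structure. The helper below is a minimal sketch; `summarize_chunk_lengths` is a hypothetical name, and the example list is shortened, made-up data shaped like the output above.

```python
from statistics import median


def summarize_chunk_lengths(chunks: list[str]) -> dict[str, float]:
    """Return the count, minimum, median, and maximum chunk length in characters."""
    lengths = [len(c) for c in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "median": 0, "max": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "max": max(lengths),
    }


# Made-up chunks resembling the manual's output: two fragments and one long chunk
example_manual_chunks = [
    "<!-- image -->",
    "\n##Washing machine",
    "\n##Table of contents\n" + "x" * 500,
]
length_summary = summarize_chunk_lengths(example_manual_chunks)
print(length_summary)
```

Running the same summary on the output of each chunking function gives a rough basis for comparing strategies on a given document.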