## PDF text extraction & chunking

### PDF to text

PDFs store content with rich formatting and layout, so their text must be extracted into a clean format before AI processing.

Modern libraries like `docling` can convert a PDF to text (here, Markdown) while preserving document structure such as headings and tables. This keeps the text easy to process without losing how its parts relate to each other.
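A conversion along these lines could produce such a Markdown file. This is a minimal sketch, assuming `docling` is installed; the function name `convert_pdf_to_markdown` and the example paths in the comment are hypothetical and not taken from this notebook.

```python
from pathlib import Path


def convert_pdf_to_markdown(pdf_path: Path, out_path: Path) -> str:
    """Convert a PDF to Markdown with docling and save the result (hypothetical helper)."""
    # Imported lazily so that docling is only required when the function is called.
    from docling.document_converter import DocumentConverter

    converter = DocumentConverter()
    result = converter.convert(pdf_path)
    markdown = result.document.export_to_markdown()

    # Save the Markdown so later cells can read it back without re-running the conversion.
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(markdown, encoding="utf-8")
    return markdown


# Example call (placeholder paths, adjust to your own files):
# convert_pdf_to_markdown(Path("data/pdfs/example.pdf"), Path("data/parsed/example.md"))
```

Conversion can be slow for long PDFs, which is one reason to work from a saved Markdown file as the cells below do.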

Inspect a PDF converted with `docling`:

In [None]:
from pathlib import Path

md_filepath = Path("data/parsed/amazon-2025-08-8k-excerpts-parsed-text.md")
md_txt = md_filepath.read_text(encoding="utf-8")
print(md_txt[:500])

In [None]:
# Take the text that follows this heading (raises IndexError if the heading is absent)
table_example = md_txt.split("## Consolidated Statements of Operations")[1]
print(table_example[:2000])

### Chunking

Raw text from PDFs is often too long for AI models to process effectively. Chunking breaks documents into smaller, manageable pieces while preserving context. 

#### Chunk by text length

Fixed-length chunks are simple but can break sentences or paragraphs mid-thought. This approach is fast and predictable, making it suitable for initial processing or when document structure is uniform.
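As a minimal sketch of this idea (the helper name `get_chunks_by_length` and the sample sentence are illustrative assumptions), fixed-length chunking is a slice taken at regular intervals. Note how the boundaries fall in the middle of words:

```python
def get_chunks_by_length(src_text: str, chunk_length: int = 500) -> list[str]:
    """Split text into consecutive, non-overlapping chunks of at most `chunk_length` characters."""
    return [src_text[i:i + chunk_length] for i in range(0, len(src_text), chunk_length)]


sample_text = "Chunking splits long documents into smaller pieces."
fixed_chunks = get_chunks_by_length(sample_text, chunk_length=20)

for fixed_chunk in fixed_chunks:
    # repr() makes leading spaces and mid-word cuts visible
    print(repr(fixed_chunk))
```

Joining the chunks reproduces the original text exactly, so nothing is lost, but no single chunk is guaranteed to hold a complete sentence.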

![images/chunking_why.png](images/chunking_why.png)

Different chunking strategies serve different use cases. 

![images/chunking_methods.png](images/chunking_methods.png)

Let's try a few options:

#### Chunk by text length with overlap

Overlapping chunks help maintain context across boundaries. 

When a concept spans a chunk boundary, the overlap makes it more likely that at least one chunk contains it in full. This is especially important for maintaining semantic coherence in search and retrieval systems.

In [None]:
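def get_chunks_by_length_with_overlap(src_text: str, chunk_length: int = 500, overlap: int = 100) -> list[str]:
    """
    Split text into overlapping chunks.

    A new chunk starts every `chunk_length` characters and extends `overlap`
    characters into the next one, so each chunk is up to
    `chunk_length + overlap` characters long.
    """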
    chunks = []
    for i in range(0, len(src_text), chunk_length):
        chunks.append(src_text[i:i + chunk_length + overlap])
    return chunks

In [None]:
# STUDENT TODO
# Chunk `md_txt` with `get_chunks_by_length_with_overlap`
# Inspect the first 5 or so chunks
# START_SOLUTION
chunks = get_chunks_by_length_with_overlap(md_txt)

for chunk in chunks[:5]:
    print("\n\nChunk: " + "=" * 10 + f"\n{chunk}")
# END_SOLUTION
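The function above starts a new chunk every `chunk_length` characters and adds the overlap on top. A common variant instead keeps every chunk within `chunk_length` and advances the window by `chunk_length - overlap`. The sketch below shows that variant on a tiny input so the shared characters are easy to see; the name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(src_text: str, chunk_length: int = 500, overlap: int = 100) -> list[str]:
    """Variant: each chunk is at most `chunk_length` long; the window moves by `chunk_length - overlap`."""
    if not 0 <= overlap < chunk_length:
        raise ValueError("overlap must be non-negative and smaller than chunk_length")
    step = chunk_length - overlap
    return [src_text[i:i + chunk_length] for i in range(0, len(src_text), step)]


alphabet = "abcdefghijklmnopqrstuvwxyz"
window_chunks = sliding_window_chunks(alphabet, chunk_length=10, overlap=3)

for previous, current in zip(window_chunks, window_chunks[1:]):
    # The last 3 characters of one chunk reappear as the first 3 of the next
    print(repr(previous), "->", repr(current))
```

Which variant to use is a design choice: the first keeps chunk start positions simple, while the second caps the chunk size, which can matter when a model has a strict input limit.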

#### Chunk using markers

Using document markers (like headers) creates chunks that respect natural document boundaries. 

This approach preserves semantic structure and is ideal for documents with clear hierarchical organization like reports, manuals, or academic papers.

In [None]:
def get_chunks_using_markers(src_text: str) -> list[str]:
    """
    Split the source text into chunks using markers.
    """
    marker = "\n##"

    # Split by marker and reconstruct with markers (except first chunk)
    parts = src_text.split(marker)
    chunks = []

    # Add first chunk if it exists and isn't empty
    if parts[0].strip():
        chunks.append(parts[0].strip())

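    # Add remaining chunks with markers reattached.
    # Reattach before stripping so the space after "##" is kept
    # ("## Title" rather than "##Title") and the leading newline is dropped.
    for part in parts[1:]:
        if part.strip():
            chunks.append((marker + part).strip())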

    return chunks

In [None]:
md_file_1 = Path("data/parsed/amazon-2025-08-8k-excerpts-parsed-text.md")
md_text_1 = md_file_1.read_text(encoding="utf-8")

# STUDENT TODO
# Chunk `md_text_1` with `get_chunks_using_markers`
# Inspect the first 5 or so chunks
# START_SOLUTION
chunks = get_chunks_using_markers(md_text_1)

for chunk in chunks[:5]:
    print("\n\nChunk: " + "=" * 10 + f"\n{chunk}")
# END_SOLUTION

### Choosing the right strategy

The best chunking strategy depends on your use case. 

Marker-based chunking excels with well-structured documents, as you saw. But it can work poorly when the parsed headings do not reflect the document's real structure:

In [None]:
md_file_2 = Path("data/parsed/hai_ai-index-report-2025_chapter2_excerpts-parsed-text.md")
md_text_2 = md_file_2.read_text(encoding="utf-8")

chunks = get_chunks_using_markers(md_text_2)

for chunk in chunks[:5]:
    print("\n\nChunk: " + "=" * 10 + f"\n{chunk[:50]}")

Here, the page headers are mistakenly interpreted as headings, so the chunks no longer follow the document's real sections.
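One way to spot this problem before chunking is to count how often each heading line occurs, since a running page header tends to repeat many times. The sketch below is a heuristic under that assumption (a legitimately repeated heading would also be flagged); the helper name `find_repeated_headings` and the sample text are hypothetical.

```python
from collections import Counter


def find_repeated_headings(src_text: str, min_count: int = 2) -> list[tuple[str, int]]:
    """Return heading lines (starting with '#') that occur at least `min_count` times."""
    headings = [line.strip() for line in src_text.splitlines() if line.lstrip().startswith("#")]
    counts = Counter(headings)
    return [(heading, n) for heading, n in counts.most_common() if n >= min_count]


# Made-up sample: "## Example Report" stands in for a running page header
sample_md = "\n".join([
    "## Example Report",
    "## Introduction",
    "Some body text.",
    "## Example Report",
    "## Methods",
    "More body text.",
    "## Example Report",
])

repeated_headings = find_repeated_headings(sample_md)
print(repeated_headings)
```

Running the same check on `md_text_2` would show whether any heading recurs often enough to look like a page header.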

The best strategy ultimately depends on your documents and your use case, but fixed-length chunking with overlap is a good default choice.