In [None]:
# If needed, install chunknorris
%pip install chunknorris -q

# PDF file chunking

This notebook shows a simple example of chunking a PDF file.

**Note:** See the tutorial [In-depth .pdf file parsing](https://wikit-ai.github.io/chunknorris/examples/pdf_parsing/#observe-the-detected-tables) for more information about the features of the ``PdfParser``.


## Pipeline setup

In [1]:
from chunknorris.parsers import PdfParser
from chunknorris.chunkers import MarkdownChunker
from chunknorris.pipelines import BasePipeline
from IPython.display import Markdown

In [4]:
# Set up the pipeline. Feel free to play with the parser's and chunker's arguments.
pipeline = BasePipeline(
    PdfParser(),
    MarkdownChunker(),
)

chunks = pipeline.chunk_file("./example_data/sample.pdf")
print(f"Got {len(chunks)} chunks !")

2025-07-01 15:39:ChunkNorris:INFO:Function "get_tables" took 1.4941 seconds
2025-07-01 15:39:ChunkNorris:INFO:Function "parse_file" took 2.3959 seconds
2025-07-01 15:39:ChunkNorris:INFO:Function "chunk" took 0.0955 seconds


Got 218 chunks !


As the logs show, chunking this 165-page document took around:
- 2.4s for parsing (including 1.5s for parsing the tables)
- 0.1s for chunking.

--> around 2.5s in total, for 218 chunks.

(Hardware: CPU - i7-13620H, 2.40 GHz, RAM - 16 GB)
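
The timing breakdown can be recomputed directly from the log lines. The snippet below is a minimal sketch: the regular expression and the `parse_timings` helper are illustrative names, not part of chunknorris, and it assumes (as stated above) that the `get_tables` time is already included in `parse_file`.

```python
import re

# Log lines copied from the run above.
LOG = """\
2025-07-01 15:39:ChunkNorris:INFO:Function "get_tables" took 1.4941 seconds
2025-07-01 15:39:ChunkNorris:INFO:Function "parse_file" took 2.3959 seconds
2025-07-01 15:39:ChunkNorris:INFO:Function "chunk" took 0.0955 seconds
"""

# Hypothetical pattern matching the 'Function "<name>" took <x> seconds' part.
TIMING = re.compile(r'Function "(?P<name>\w+)" took (?P<seconds>\d+(?:\.\d+)?) seconds')


def parse_timings(log_text: str) -> dict[str, float]:
    """Map each logged function name to its duration in seconds."""
    return {m["name"]: float(m["seconds"]) for m in TIMING.finditer(log_text)}


timings = parse_timings(LOG)

# Assumption: table extraction runs inside parse_file, so it is not added again.
total = timings["parse_file"] + timings["chunk"]
print(f"parsing: {timings['parse_file']:.1f}s "
      f"(tables: {timings['get_tables']:.1f}s), "
      f"chunking: {timings['chunk']:.1f}s, total: {total:.1f}s")
```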

## View the chunks

To look at a chunk's text, use the ``Chunk.get_text()`` method.

Also, when chunking PDF files, **each chunk records the pages it comes from** (``start_page`` and ``end_page``).

In [5]:
for chunk_idx in [10, 11]: # choose any
    chunk = chunks[chunk_idx]
    print(f"\n===== Start page: {chunk.start_page} --- End page: {chunk.end_page} ======\n")
    print(chunk.get_text())



## **Welcome**

### 2.3 Phone Features

The following table describes the IP Phone features:
User Guide 4
**Welcome**
|  Feature  |  6930 IP Phone  |  6930w IP Phone  |
|:---|:---|:---|
| Display  | 4.3” WQVGA (480x272) color TFT LCD display with brightness controls  | 4.3" WQVGA (480x272) color TFT LCD display with brightness controls  |
| Programmable Keys  | 12 top softkeys  | 12 top softkeys  |
| Context Sensitive Keys  | 5 context-sensitive bottom softkeys  | 5 context-sensitive bottom softkeys  |
| Ethernet  | Built-in-two-port, 10/100/1000 Gigabit Ethernet switch - lets you share a connection with your computer | Built-in-two-port, 10/100/1000 Gigabit Ethernet switch - lets you share a connection with your computer 802.3az (EEE)  |
| Power-over-Ethernet (PoE) - LAN | 802.3af, 802.3at  | 802.3af, 802.3at  |
| POE Class  | Class 3 with auto change to 4 when PKMs are attached.  | Class 3 with auto change to 4 when PKMs are attached. If an accessory is installed in the sidecar acc
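
Since each chunk carries its page range, one possible use is to build a page-to-chunks index, for example to show which chunks cover a given page. The sketch below does not call chunknorris: `DemoChunk` is a hypothetical stand-in exposing the same three members used above (`start_page`, `end_page`, `get_text()`), and it assumes that both page attributes are integers and that `end_page` is inclusive.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class DemoChunk:
    """Hypothetical stand-in for a chunk; not the chunknorris class."""
    text: str
    start_page: int
    end_page: int

    def get_text(self) -> str:
        return self.text


def chunks_by_page(chunks) -> dict[int, list[int]]:
    """Map each page number to the indices of the chunks that span it."""
    index = defaultdict(list)
    for chunk_idx, chunk in enumerate(chunks):
        # Assumption: end_page is inclusive.
        for page in range(chunk.start_page, chunk.end_page + 1):
            index[page].append(chunk_idx)
    return dict(index)


demo_chunks = [
    DemoChunk("intro", 1, 1),
    DemoChunk("features table", 1, 3),
    DemoChunk("setup", 3, 4),
]
page_index = chunks_by_page(demo_chunks)

# List the chunks that touch page 3.
for chunk_idx in page_index[3]:
    print(chunk_idx, demo_chunks[chunk_idx].get_text())
```

If the assumptions hold, the same `chunks_by_page` function should work on the `chunks` list returned by the pipeline.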

## Save the chunks
To save the chunks to a JSON file, use:

In [16]:
pipeline.save_chunks(chunks, "mychunks.json")