In [None]:
# If needed, install chunknorris
%pip install chunknorris -q

# PDF file chunking

This notebook walks through a simple example of chunking a PDF file.

**Note:** See the tutorial [In-depth .pdf file parsing](https://wikit-ai.github.io/chunknorris/examples/pdf_parsing/#observe-the-detected-tables) for more information about the functionalities of the ``PdfParser``.


## Pipeline setup

In [5]:
from chunknorris.parsers import PdfParser
from chunknorris.chunkers import MarkdownChunker
from chunknorris.pipelines import PdfPipeline
from IPython.display import Markdown

Here we import the ``PdfPipeline``. 

Note that ``BasePipeline`` would work as well, but the **``PdfPipeline`` handles more advanced mechanics specific to PDF files**. For example, it will:
- split documents derived from PowerPoint page by page when no table of contents has been found.
- clean up cached objects to avoid memory leaks.

As the ``PdfParser`` outputs a ``MarkdownDoc``, we use the ``MarkdownChunker`` to chunk the parsed document.

In [3]:
# Set up the pipeline. Feel free to experiment with the parser's and chunker's arguments.
pipeline = PdfPipeline(
    PdfParser(),
    MarkdownChunker(),
)

chunks = pipeline.chunk_file("./data/sample.pdf")
print(f"Got {len(chunks)} chunks !")

2024-12-17 17:09:ChunkNorris:INFO:Function "_create_spans" took 0.4265 seconds
2024-12-17 17:09:ChunkNorris:INFO:Function "get_tables" took 1.3100 seconds
2024-12-17 17:09:ChunkNorris:INFO:Function "parse_file" took 2.1334 seconds
2024-12-17 17:09:ChunkNorris:INFO:Function "chunk" took 0.0536 seconds


Got 217 chunks !


As the logs show, chunking this 165-page document took around:
- 2.1s for parsing (including 1.3s for parsing the tables)
- 0.05s for chunking.

--> around 2.2s in total

It produced 217 chunks.

(Hardware: CPU - i7-13620H, 2.40 GHz, RAM - 16 GB)
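
The timings above are read from the ChunkNorris logs. To measure the end-to-end duration on your own hardware, a minimal sketch using only the standard library could look like the following. The ``timed`` helper is a hypothetical name introduced for illustration and is not part of chunknorris; it is demonstrated on ``sum`` so that the snippet runs on its own.

```python
import time


def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds).

    Hypothetical helper for illustration, not part of chunknorris.
    """
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# With the pipeline defined above, the call would be:
# chunks, seconds = timed(pipeline.chunk_file, "./data/sample.pdf")

# Stand-in workload so the snippet is runnable without a PDF file.
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.4f} seconds")
```

``time.perf_counter`` is used rather than ``time.time`` because it is a monotonic, high-resolution clock intended for measuring short durations.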

## View the chunks

To view a chunk's text, use the ``Chunk.get_text()`` method.

In addition, for PDF files, **each chunk records the pages it comes from**.

In [17]:
for chunk_idx in [10, 11]: # choose any
    chunk = chunks[chunk_idx]
    print(f"\n===== Start page: {chunk.start_page} --- End page: {chunk.end_page} ======\n")
    print(chunk.get_text())



# Mitel 6930/6930w SIP Phone User Guide

## **Welcome**

### 2.4 Requirements
The 6930 requires the following environment:
- SIP-based IP PBX system or network installed and running with a SIP account created for the 6930 phone
- Access to a Trivial File Transfer Protocol (TFTP), File Transfer Protocol (FTP), Hypertext Transfer Protocol (HTTP) server, or Hyper Text Transfer Protocol over Secure Sockets Layer (SSL) (HTTPS)
User Guide 6
- Ethernet/Fast Ethernet LAN (10/100 Mbps) (Gigabit Ethernet LAN [1000 Mbps] recommended)
- Category 5/5e straight-through cabling (Category 6 straight-through cabling required for optimum Gigabit Ethernet performance)
- Power source:
- For Ethernet networks that supply inline power to the phone (IEEE 802.3af) use an Ethernet cable to connect from the phone directly to the network for power (no 48V AC power adapter required if using Power-over-Ethernet [PoE])
- For Ethernet networks that DO NOT supply power to the phone:
- Use only the GlobTek Inc. 
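
Since each chunk exposes ``start_page`` and ``end_page``, the page information can be used to retrieve every chunk that covers a given page. The sketch below assumes both attributes are integers; ``PageSpan`` and ``chunks_on_page`` are hypothetical names introduced for illustration, with ``PageSpan`` standing in for a real chunk so that the snippet runs on its own. Check whether page numbers start at 0 or 1 in your own output before relying on it.

```python
from dataclasses import dataclass


@dataclass
class PageSpan:
    """Hypothetical stand-in exposing the same page attributes as a chunk."""

    start_page: int
    end_page: int


def chunks_on_page(chunks, page):
    """Return the chunks whose page range includes `page`."""
    return [c for c in chunks if c.start_page <= page <= c.end_page]


# Stand-in data; in the notebook, pass the `chunks` list from the pipeline.
demo = [PageSpan(5, 6), PageSpan(6, 6), PageSpan(7, 9)]
print(len(chunks_on_page(demo, 6)))  # two of the demo spans include page 6
```

In the notebook, ``chunks_on_page(chunks, 6)`` would apply the same filter to the real chunks, provided their page attributes are set.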

## Save the chunks
To save the chunks to a JSON file, use:

In [16]:
pipeline.save_chunks(chunks, "mychunks.json")
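
To check what was written, the file can be read back with the standard ``json`` module. This is a minimal sketch: ``load_saved_chunks`` is a hypothetical helper, the file is assumed to be UTF-8 encoded, and the exact structure of the saved JSON is not described here, so inspect it before depending on specific keys.

```python
import json
from pathlib import Path


def load_saved_chunks(path):
    """Read a JSON file back into plain Python objects.

    Hypothetical helper for illustration; assumes UTF-8 encoding.
    """
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


# Only attempt the read if the file from the previous cell exists.
path = Path("mychunks.json")
if path.exists():
    saved = load_saved_chunks(path)
    print(type(saved).__name__)
```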