In [2]:
# If needed, install chunknorris
%pip install chunknorris -q

Note: you may need to restart the kernel to use updated packages.


# Docx file chunking

This notebook shows a simple example of chunking Microsoft Word documents (.docx).

## Pipeline setup

In [3]:
# Import the required chunknorris components
from chunknorris.parsers import DocxParser
from chunknorris.chunkers import MarkdownChunker
from chunknorris.pipelines import BasePipeline

Three components are needed to chunk a docx file:

- ``DocxParser``: behind the scenes, it uses ``mammoth`` and ``markdownify`` to convert the docx file to markdown, and performs extra cleaning of the content. It returns a ``MarkdownDoc``.

- ``MarkdownChunker``: the chunker takes a ``MarkdownDoc`` as input and splits it into multiple chunks.

- ``BasePipeline``: this pipeline simply plugs the parser and the chunker together so that the output of the parser is fed to the chunker.
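
To make the "parser output is fed to the chunker" idea concrete, here is a minimal toy sketch of the same composition pattern. All names in it (``ToyDoc``, ``ToyParser``, ``ToyChunker``, ``ToyPipeline``, ``parse_string``, ``chunk_string``) are hypothetical stand-ins invented for illustration; this is not how chunknorris is implemented, and the real components do far more work.

```python
from dataclasses import dataclass


@dataclass
class ToyDoc:
    """Stand-in for the parser output (hypothetical, not chunknorris's MarkdownDoc)."""
    text: str


class ToyParser:
    def parse_string(self, raw: str) -> ToyDoc:
        # "Parsing" here only strips surrounding whitespace
        return ToyDoc(text=raw.strip())


class ToyChunker:
    def chunk(self, doc: ToyDoc) -> list[str]:
        # Start a new chunk at each top-level markdown header
        chunks, current = [], []
        for line in doc.text.splitlines():
            if line.startswith("# ") and current:
                chunks.append("\n".join(current).strip())
                current = []
            current.append(line)
        if current:
            chunks.append("\n".join(current).strip())
        return chunks


class ToyPipeline:
    def __init__(self, parser, chunker):
        self.parser = parser
        self.chunker = chunker

    def chunk_string(self, raw: str) -> list[str]:
        # The parser output is passed straight to the chunker
        return self.chunker.chunk(self.parser.parse_string(raw))


toy_pipeline = ToyPipeline(ToyParser(), ToyChunker())
toy_chunks = toy_pipeline.chunk_string("# A\ntext a\n\n# B\ntext b\n")
print(f"Got {len(toy_chunks)} toy chunks")
```

The point of the pattern is that the parser and the chunker only need to agree on the intermediate document type, so either one can be swapped independently.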

In [5]:
# Instantiate the pipeline
pipeline = BasePipeline(DocxParser(), MarkdownChunker())

In [7]:
# Get those chunks !
path_to_file = "../../tests/test_files/file.docx"
chunks = pipeline.chunk_file(path_to_file)
print(f"Got {len(chunks)} chunks !")

2025-02-21 15:17:ChunkNorris:INFO:Function "chunk" took 0.0014 seconds


Got 11 chunks !


## View the chunks

In [9]:
# Let's look at the chunks
for i, chunk in enumerate(chunks[1:4]): # we only look at three chunks (indices 1 to 3); i restarts at 0
    print(f"\n------------- chunk {i} ----------------\n")
    print(chunk.get_text())


------------- chunk 0 ----------------

# Dummy Table

| | Age | Likes Pdf | Likes AI |
| --- | --- | --- | --- |
| Marc | 20 | Yes ? | **Yes !** |
| Alice | 30 | **No** |
| Rob | 40 |
| Julia | 50 |
| The cat | 60 | **No** |

This is a dummy table that has nothing to do with the rest of the content but is here for testing purposes

------------- chunk 1 ----------------

# Insérer une table des matières

Pour ajouter une table des matières, décidez simplement de l’emplacement souhaité. Word se charge du reste.

Essayez par vous-même : appuyez sur ENTRÉE après le premier paragraphe dans ce document pour obtenir une nouvelle ligne. Accédez ensuite à l’onglet **Références**, sélectionnez **Table des matières** et choisissez une table des matières dans la liste.

Vous avez terminé ! Word a détecté tous les titres dans ce document et ajouté une table des matières.

------------- chunk 2 ----------------

# Mise à jour quand il y a des changements

Le travail ne s’arre
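
Printing full chunks gets long quickly. As a rough sketch, the helper below builds a compact overview (index, character count, first line) for each chunk. It relies only on the ``get_text()`` method used above; ``summarize_chunks`` and ``preview_chars`` are hypothetical names introduced here, not part of chunknorris.

```python
def summarize_chunks(chunks, preview_chars=40):
    """Return a list of (index, character count, first-line preview) tuples."""
    summary = []
    for i, chunk in enumerate(chunks):
        text = chunk.get_text()
        lines = text.strip().splitlines()
        # The first line of a chunk is typically its header
        first_line = lines[0] if lines else ""
        summary.append((i, len(text), first_line[:preview_chars]))
    return summary
```

In this notebook it could be called as ``summarize_chunks(chunks)`` and the resulting tuples printed one per line.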

## Save the chunks

The pipeline has a method to save the chunks and their attributes as a JSON file. Here is how to use it.

In [None]:
# Let's save the chunks. We just pass the chunks we obtained and the filename we want
pipeline.save_chunks(chunks, "mychunk.json")
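
To check what was written, the saved file can be read back with Python's standard ``json`` module. This is a minimal sketch: ``load_saved_chunks`` is a hypothetical helper name, and since the exact layout of the saved file is not shown in this notebook, the code makes no assumption about its keys and simply returns the parsed object.

```python
import json
from pathlib import Path


def load_saved_chunks(path):
    """Load a JSON file and return the parsed Python object as-is."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)
```

For example, ``data = load_saved_chunks("mychunk.json")`` followed by ``type(data)`` is a quick way to discover the structure before relying on it.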