In [2]:
# If needed, install chunknorris
%pip install chunknorris -q

Note: you may need to restart the kernel to use updated packages.


# HTML file chunking

This notebook shows a simple example of chunking HTML (.html) files.

## Pipeline setup

In [4]:
# import the required chunknorris components
from chunknorris.parsers import HTMLParser
from chunknorris.chunkers import MarkdownChunker
from chunknorris.pipelines import BasePipeline

Three components are needed to chunk an HTML file:

- ``HTMLParser``: behind the scenes, it uses ``markdownify`` to convert the HTML to markdown and performs extra cleaning of the content. It returns a ``MarkdownDoc``.

- ``MarkdownChunker``: the chunker takes a ``MarkdownDoc`` as input and splits it into multiple ``Chunks`` objects.

- ``BasePipeline``: this pipeline simply plugs the parser and the chunker together so that the output of the parser is fed to the chunker.
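
To make the parser-to-chunker hand-off described above more concrete, here is a minimal, self-contained sketch of the same composition pattern. It does not use chunknorris: ``ToyDoc``, ``ToyParser``, ``ToyChunker``, ``ToyPipeline`` and ``chunk_string`` are hypothetical stand-ins written for this illustration, and the "split at every header line" rule is a toy simplification, not the library's actual chunking logic.

```python
from dataclasses import dataclass


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a parsed markdown document."""
    text: str


class ToyParser:
    """Hypothetical parser: wraps a raw string into a ToyDoc."""

    def parse_string(self, raw: str) -> ToyDoc:
        return ToyDoc(text=raw.strip())


class ToyChunker:
    """Hypothetical chunker: starts a new chunk at each markdown header line."""

    def chunk(self, doc: ToyDoc) -> list[str]:
        chunks, current = [], []
        for line in doc.text.splitlines():
            # A header line closes the chunk being built (if any)
            if line.startswith("#") and current:
                chunks.append("\n".join(current))
                current = []
            current.append(line)
        if current:
            chunks.append("\n".join(current))
        return chunks


class ToyPipeline:
    """Feeds the parser's output straight into the chunker."""

    def __init__(self, parser: ToyParser, chunker: ToyChunker):
        self.parser = parser
        self.chunker = chunker

    def chunk_string(self, raw: str) -> list[str]:
        return self.chunker.chunk(self.parser.parse_string(raw))


toy_chunks = ToyPipeline(ToyParser(), ToyChunker()).chunk_string(
    "# A\ntext a\n## B\ntext b"
)
print(toy_chunks)
```

The point of the pattern is that the pipeline object holds no parsing or chunking logic of its own, so either component can be swapped without touching the other.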

In [5]:
# instantiate the pipeline
pipeline = BasePipeline(HTMLParser(), MarkdownChunker())

In [6]:
# Get those chunks !
path_to_html_file = "../../tests/test_files/file.html"
chunks = pipeline.chunk_file(path_to_html_file)
print(f"Got {len(chunks)} chunks !")

2024-12-17 17:33:ChunkNorris:INFO:Function "chunk" took 0.0032 seconds


Got 17 chunks !


## View the chunks

In [8]:
# Let's look at the chunks
for i, chunk in enumerate(chunks[:2]):  # we only look at the first 2 chunks
    print(f"\n------------- chunk {i} ----------------\n")
    print(chunk.get_text())


------------- chunk 0 ----------------

# Le Jardin

Un [jardin japonais](https://fr.wikipedia.org/wiki/Jardin_japonais "Jardin japonais").

Une femme de 87 ans en train de cultiver son jardin. [Comté de Harju](https://fr.wikipedia.org/wiki/Comt%C3%A9_de_Harju "Comté de Harju"), Estonie, juin 2016.

Un **jardin** est un lieu durablement et hypothétiquement aménagé où l'on cultive
de façon ordonnée des [plantes](https://fr.wikipedia.org/wiki/Plante "Plante") domestiquées ou
sélectionnées. Il est le produit de
la technique du [jardinage](https://fr.wikipedia.org/wiki/Jardinage "Jardinage") et, comme elle, il
remonte au moins à l'Antiquité. Les différentes cultures humaines dans le monde, au
fil des époques,
ont inventé de nombreux types et styles de jardins. Lieux d'agrément, de repos, de
rêverie solitaire ou partagée, les jardins ont aussi été revêtus dès l'Antiquité
d'une valeur symbolique. Ils apparaissent dans les mythologies et les religions, et
ils ont éte
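
Since full chunks can be long, a small helper that shortens each chunk's text may be handy when skimming results. The sketch below relies only on the ``get_text()`` method used above; ``preview_chunks`` and ``FakeChunk`` are hypothetical names introduced for this illustration, and ``FakeChunk`` exists only so the snippet runs without chunknorris.

```python
def preview_chunks(chunks, n=2, max_chars=200):
    """Return shortened text previews for the first n chunks."""
    previews = []
    for chunk in chunks[:n]:
        text = chunk.get_text()
        # Cut long texts and mark the cut with an ellipsis
        if len(text) > max_chars:
            text = text[:max_chars] + "..."
        previews.append(text)
    return previews


class FakeChunk:
    """Hypothetical stand-in exposing get_text(), for demonstration only."""

    def __init__(self, text):
        self._text = text

    def get_text(self):
        return self._text


demo = preview_chunks([FakeChunk("a" * 10), FakeChunk("short")], n=2, max_chars=5)
print(demo)
```

With the real pipeline output, the call would presumably be ``preview_chunks(chunks)``.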

## Save the chunks
The pipeline has a method to save the chunks and their attributes as a JSON file. Here is how to use it.

In [None]:
# Let's save the chunks. We just pass the chunks we obtained and the filename we want
pipeline.save_chunks(chunks, "mychunk.json")
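
To check what was written, the file can be read back with Python's standard ``json`` module. This is a minimal sketch: ``load_saved_chunks`` is a hypothetical helper name, the UTF-8 encoding is an assumption, and the layout of the saved JSON (list or dict, which keys) is not specified here, so the code only reports the top-level type.

```python
import json
from pathlib import Path


def load_saved_chunks(path):
    """Load a JSON file and return the parsed Python object."""
    # Assumption: the file is UTF-8 encoded
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


# Only attempt to read the file if the save step above has been run
if Path("mychunk.json").exists():
    saved = load_saved_chunks("mychunk.json")
    print(type(saved).__name__)
```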