In [1]:
# If needed, install chunknorris
%pip install chunknorris -q

Note: you may need to restart the kernel to use updated packages.


# Markdown file chunking
This notebook shows a simple example of chunking markdown (.md) files.
## Pipeline setup

In [2]:
# import the required chunknorris components
from chunknorris.parsers import MarkdownParser
from chunknorris.chunkers import MarkdownChunker
from chunknorris.pipelines import BasePipeline

Three components are needed to chunk a markdown file:
- ``MarkdownParser``: this parser normalizes the formatting of the markdown file. In particular, it will:
    - ensure that the headers are in ATX format
    - detect tables, code blocks, metadata, ... to make sure they are not split across multiple chunks
    
    It returns a ``MarkdownDoc`` containing all of this information.

- ``MarkdownChunker``: the chunker takes a ``MarkdownDoc`` as input and splits it into multiple ``Chunk`` objects.

- ``BasePipeline``: this pipeline is pretty basic (you would have guessed considering its name... 😅). It just plugs the parser and the chunker together so that the output of the parser is fed to the chunker. Other pipelines, such as ``PdfPipeline``, handle more complex mechanics.
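
To make the "headers in ATX format" step concrete, here is a minimal standalone sketch of converting setext-style headers (underlined with `=` or `-`) into ATX headers (`#`, `##`). It is only an illustration: the function name `setext_to_atx` is hypothetical, it is not chunknorris's implementation, and it ignores edge cases such as code blocks that a real parser has to handle.

```python
import re


def setext_to_atx(markdown: str) -> str:
    """Convert setext headers to ATX headers (illustrative sketch only)."""
    lines = markdown.split("\n")
    out = []
    i = 0
    while i < len(lines):
        current = lines[i]
        following = lines[i + 1] if i + 1 < len(lines) else ""
        if current.strip() and re.fullmatch(r"=+\s*", following):
            # A text line underlined with "=" is a level-1 header
            out.append("# " + current.strip())
            i += 2
        elif current.strip() and re.fullmatch(r"-+\s*", following):
            # A text line underlined with "-" is a level-2 header
            out.append("## " + current.strip())
            i += 2
        else:
            out.append(current)
            i += 1
    return "\n".join(out)


sample = "Jardin\n======\n\nHistoire\n--------\n\nTexte."
print(setext_to_atx(sample))
```

Once every header uses the same ATX syntax, a chunker can find section boundaries by looking only at leading `#` characters.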

In [3]:
# instantiate the components and plug them into the pipeline
parser = MarkdownParser()
chunker = MarkdownChunker()
pipeline = BasePipeline(parser, chunker)

In [4]:
# Get those chunks !
path_to_md_file = "../../tests/test_files/file.md"
chunks = pipeline.chunk_file(path_to_md_file)
print(f"Got {len(chunks)} chunks !")

2024-12-17 09:59:ChunkNorris:INFO:Function "chunk" took 0.0068 seconds


Got 17 chunks !


## View the chunks

In [5]:
# Let's look at the chunks
for i, chunk in enumerate(chunks[:3]): # we only look at the first 3 chunks
    print(f"\n------------- chunk {i} ----------------\n")
    print(chunk.get_text())


------------- chunk 0 ----------------

# Jardin

Un [jardin japonais](https://fr.wikipedia.org/wiki/Jardin_japonais "Jardin japonais").

Une femme de 87 ans en train de cultiver son jardin. [Comté de Harju](https://fr.wikipedia.org/wiki/Comt%C3%A9_de_Harju "Comté de Harju"), Estonie, juin 2016\.

Un **jardin** est un lieu durablement et hypothétiquement aménagé où l'on cultive de façon ordonnée des [plantes](https://fr.wikipedia.org/wiki/Plante "Plante") domestiquées ou sélectionnées. Il est le produit de la technique du [jardinage](https://fr.wikipedia.org/wiki/Jardinage "Jardinage") et, comme elle, il remonte au moins à l'Antiquité. Les différentes cultures humaines dans le monde, au fil des époques, ont inventé de nombreux types et styles de jardins. Lieux d'agrément, de repos, de rêverie solitaire ou partagée, les jardins ont aussi été revêtus dès l'Antiquité d'une valeur symbolique. Ils apparaissent dans les mythologies et les religions, et ils ont été 

You may want to remove the links in a chunk by using ``Chunk.get_text(remove_links=True)``.
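
For intuition about what removing links means for the text, the sketch below strips inline markdown links with a regular expression and keeps only the link label. It is a rough approximation under stated assumptions: `strip_markdown_links` is a hypothetical helper, not the library's code, and the pattern does not handle URLs that contain parentheses.

```python
import re

# Matches [label](target "optional title") and captures the label
LINK_PATTERN = re.compile(r"\[([^\]]*)\]\([^)]*\)")


def strip_markdown_links(text: str) -> str:
    """Replace each inline markdown link with its label (illustrative sketch)."""
    return LINK_PATTERN.sub(r"\1", text)


sample = 'Un [jardin japonais](https://fr.wikipedia.org/wiki/Jardin_japonais "Jardin japonais").'
print(strip_markdown_links(sample))
```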

The ``Chunk.get_text()`` method directly concatenates the headers of all higher-level sections with the chunk's content. If you want to customize this behavior, you may use the ``Chunk.headers`` and ``Chunk.content`` attributes. Both contain ``MarkdownLine`` objects, each of which represents a line of the markdown file along with its metadata.
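
As an example of such a customization, the sketch below builds a chunk's text from its closest header only, instead of the full header hierarchy. It uses stand-in classes so that it runs on its own: `FakeLine`, `FakeChunk` and `render_with_last_header` are hypothetical names, and the assumption that each line exposes its string as a `text` attribute is inferred from the `'text'` field visible when a line is printed.

```python
from dataclasses import dataclass, field


@dataclass
class FakeLine:
    """Stand-in for a markdown line (assumed to expose a `text` attribute)."""
    text: str


@dataclass
class FakeChunk:
    """Stand-in for a chunk with `headers` and `content` lists of lines."""
    headers: list = field(default_factory=list)
    content: list = field(default_factory=list)


def render_with_last_header(chunk) -> str:
    """Join the closest header (if any) with the content lines."""
    parts = [line.text for line in chunk.headers[-1:]]
    parts += [line.text for line in chunk.content]
    return "\n".join(parts)


chunk = FakeChunk(
    headers=[FakeLine("# Jardin"), FakeLine("## Les jardins en France")],
    content=[FakeLine(""), FakeLine("Texte.")],
)
print(render_with_last_header(chunk))
```

If the real objects expose the same attributes, the same function could be applied to an actual chunk; that part is an assumption to check against the library.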

In [6]:
# Let's look at one chunk in detail (chunk number 10 for example)
for line in chunks[10].headers:
    print(line)
print("=======================")
for line in chunks[10].content:
    print(line)

{'text': '# Jardin', 'line_idx': 0, 'isin_code_block': False, 'page': None}
{'text': '## Les jardins en France\\[[modifier](https://fr.wikipedia.org/w/index.php?title=Jardin&veaction=edit&section=10 "Modifier la section\u202f: Les jardins en France") \\| [modifier le code](https://fr.wikipedia.org/w/index.php?title=Jardin&action=edit&section=10 "Modifier le code source de la section : Les jardins en France")]', 'line_idx': 99, 'isin_code_block': False, 'page': None}
{'text': '### Protection à titre patrimonial de certains parcs et jardins\\[[modifier](https://fr.wikipedia.org/w/index.php?title=Jardin&veaction=edit&section=11 "Modifier la section\u202f: Protection à titre patrimonial de certains parcs et jardins") \\| [modifier le code](https://fr.wikipedia.org/w/index.php?title=Jardin&action=edit&section=11 "Modifier le code source de la section : Protection à titre patrimonial de certains parcs et jardins")]', 'line_idx': 113, 'isin_code_block': False, 'page': None}
{'text': '', 'line

## Save the chunks
The pipeline has a method to save the chunks and their attributes as a JSON file. Here is how to use it.

In [None]:
# Let's save the chunks. We can just pass the chunks we obtain and the filename we want
pipeline.save_chunks(chunks, "mychunk.json")
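
To check what was written, the saved file can be read back with the standard `json` module. The sketch below makes no assumption about the schema of the file: the hypothetical helper `describe_json` only reports the top-level type and size, and the example call is skipped if `mychunk.json` does not exist.

```python
import json
from pathlib import Path


def describe_json(path) -> str:
    """Summarize the top-level structure of a JSON file (schema-agnostic)."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if isinstance(data, list):
        return f"list with {len(data)} items"
    if isinstance(data, dict):
        return f"dict with keys {sorted(data)}"
    return type(data).__name__


# Only inspect the file if the save step above has been run
if Path("mychunk.json").exists():
    print(describe_json("mychunk.json"))
```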