## Run evaluations

This notebook shows how to run the parsing and chunking evaluations.

### Parsing evaluation (speed and energy consumption)

In [None]:
from src.utils import get_pdf_filepaths
from src.evaluation.parsing_evaluator import ParsingEvaluator
from src.pipelines.chunknorris_pipeline import ChunkNorrisPipeline # or any other pipeline

In [None]:
filepaths = get_pdf_filepaths("path/to/folder")
pipeline = ChunkNorrisPipeline()
parsing_eval = ParsingEvaluator(pipeline) # results will be saved in "./results" folder by default
parsing_eval.evaluate_parsing(filepaths)
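
To illustrate the "speed" half of this evaluation, here is a minimal sketch of per-file timing. It is not how `ParsingEvaluator` is implemented (its internals are not shown here), and it does not measure energy consumption. The names `time_per_file` and `fake_parse` are hypothetical; `fake_parse` stands in for whatever parsing call a pipeline exposes.

```python
import time


def time_per_file(parse_fn, filepaths):
    """Return a dict mapping each filepath to the wall-clock seconds parse_fn took."""
    timings = {}
    for path in filepaths:
        start = time.perf_counter()  # monotonic, high-resolution clock
        parse_fn(path)
        timings[path] = time.perf_counter() - start
    return timings


def fake_parse(path):
    """Hypothetical stand-in for a real parsing call."""
    time.sleep(0.01)
    return f"parsed {path}"


timings = time_per_file(fake_parse, ["a.pdf", "b.pdf"])
for path, seconds in timings.items():
    print(f"{path}: {seconds:.3f} s")
```

Timing each file separately, rather than the whole batch, makes it easier to spot individual documents that dominate the total runtime.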

### Chunking evaluation (recall and NDCG)

In [None]:
from src.evaluation.chunking_evaluator import ChunkingEvaluator
from src.pipelines.chunknorris_pipeline import ChunkNorrisPipeline # or any pipeline used for parsing
from src.chunkers.page_chunker import PageChunker # or any chunker
from src.utils import get_pdf_filepaths

In [None]:
pdf_filepaths = get_pdf_filepaths("path/to/pdf_files_dir")
chunking_eval = ChunkingEvaluator(
    pipeline=ChunkNorrisPipeline(),
    chunkers=[PageChunker(), None], # <-- We pass "None" as a chunker to also use the pipeline's default chunker
)
# Results will be saved by default in "./results" folder
chunking_eval.evaluate_chunking(pdf_filepaths)
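
For readers unfamiliar with the two metrics named in the heading, the following is a minimal sketch of recall@k and NDCG@k for a single query. It assumes binary relevance (a chunk is either relevant or not) and unique chunk ids; the function names and the sample ids are hypothetical, and `ChunkingEvaluator` may compute these metrics differently.

```python
import math


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    relevant = set(relevant_ids)
    # Each relevant hit at 0-based position `rank` contributes 1 / log2(rank + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, chunk_id in enumerate(ranked_ids[:k])
        if chunk_id in relevant
    )
    # The ideal ranking places all relevant chunks first.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Hypothetical retrieval result for one query, best match first.
ranked = ["c3", "c1", "c7", "c2"]
relevant = ["c1", "c2"]

recall = recall_at_k(ranked, relevant, k=3)
ndcg = ndcg_at_k(ranked, relevant, k=3)
print(f"recall@3 = {recall:.2f}, NDCG@3 = {ndcg:.4f}")
```

Recall only checks whether relevant chunks are retrieved, while NDCG also rewards ranking them near the top, which is why the two are usually reported together.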

You may want to run the evaluation with another embedding model **without rerunning the parsing and chunking**. In that case, reuse the chunks obtained from a previous run, as in the following snippet.

In [None]:
from src.evaluation.chunking_evaluator_utils import chunks_to_dataset

In [None]:
# Load the saved chunks.json file as a dataset dict
datasetdict = chunks_to_dataset("path/to/chunk.json")
chunking_eval.run_chunking_evaluation(
    datasetdict,
    sentence_transformer_hf_repo="path_to_hf_repo/sentence_transformer_compatible"
)