# `eval_llm_pipeline` Sample Usage

This notebook demonstrates how to reuse `eval_llm_pipeline.LLMEvalPipeline` so every LLM experiment shares identical evaluation/plotting logic. Adjust the paths in the code cells to point to whichever result JSON files and datasets you want to compare.


## 1. Environment setup

1. Point `PROJECT_ROOT` at the repository root.
2. Define the helper constants (`DATASET_PATH`, `OUTPUT_DIR`), which later cells reuse.
3. Add `1_LLM_Eval` to `sys.path` and import the pipeline class.
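
The setup cell resolves `PROJECT_ROOT` from `Path('..')`, which is relative to the working directory the notebook kernel starts in. If the notebook is launched from a different folder, a lookup that walks up the directory tree may be more robust. The sketch below is one way to do that; `find_project_root` and its `marker` default are illustrative assumptions, not part of `eval_llm_pipeline`.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "1_LLM_Eval") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Hypothetical helper: the marker name is assumed from the folder layout
    used in this notebook.
    """
    start = Path(start).resolve()
    # Check the starting directory first, then each parent in turn.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(
        f"No directory at or above {start} contains a '{marker}' folder"
    )
```

With this helper, `PROJECT_ROOT = find_project_root(Path.cwd())` could replace the relative lookup, provided the kernel starts somewhere inside the repository.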


In [None]:
from pathlib import Path
import sys
from IPython.display import display

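# Resolved relative to the working directory; assumes the notebook runs
# one level below the repository root.
PROJECT_ROOT = Path('..').resolve()
DATASET_PATH = PROJECT_ROOT / 'Dataset/reddit_data.csv'
OUTPUT_DIR = PROJECT_ROOT / '1_LLM_Eval/test_results/figs'
# Create the output folder (and any missing parents) if it does not exist yet.
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

# Make the pipeline module importable from this notebook.
sys.path.append(str(PROJECT_ROOT / '1_LLM_Eval'))
from eval_llm_pipeline import LLMEvalPipeline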


## 2. Binary evaluation example

Below we compare two binary prompting strategies (`cot` vs `fewshot`). Replace the JSON paths or extend the dictionary with additional runs as needed.
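
When swapping JSON paths in and out, it can help to confirm that every file exists before constructing the pipeline. The helper below is a minimal sketch of such a check; `find_missing_results` is a hypothetical name and is not provided by `eval_llm_pipeline`.

```python
from pathlib import Path


def find_missing_results(results_map: dict) -> list:
    """Return the run labels whose path does not point to an existing file.

    Hypothetical helper: `results_map` is assumed to map a run label
    (e.g. 'cot') to the path of its result JSON file.
    """
    return [
        label
        for label, path in results_map.items()
        if not Path(path).is_file()
    ]
```

Once the results dictionary is defined, calling `find_missing_results(binary_results)` returns an empty list when every file is present, and the labels of the missing runs otherwise.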


In [None]:
binary_results = {
    'cot': PROJECT_ROOT / '1_LLM_Eval/test_results/binary_cot.json',
    'fewshot': PROJECT_ROOT / '1_LLM_Eval/test_results/binary_fewshot.json',
}

binary_pipeline = LLMEvalPipeline(
    results_map=binary_results,
    ground_truth=DATASET_PATH,
    task='binary',
    output_dir=OUTPUT_DIR,
    positive_label='IPV'
)

binary_metrics = binary_pipeline.save_and_print_metrics(
    OUTPUT_DIR / 'binary_metrics.csv'
)
display(binary_metrics)

binary_pipeline.plot_roc_auc_bars('binary_roc_auc_bar.png')
binary_pipeline.plot_roc_curves('binary_roc_curve')
binary_pipeline.plot_precision_vs_residual('binary_precision_residual')


*Artifacts saved*: the metrics table (`binary_metrics.csv`) and the ROC AUC bar, ROC curve, and precision-residual plots, all under `1_LLM_Eval/test_results/figs/`.


## 3. Multitype evaluation example

For subtype-aware prompts, pass the multilabel JSON files, set `task='multitype'`, and reuse the same dataset. No `positive_label` is passed in this mode. The pipeline derives per-class ROC curves and macro metrics, and `multi_confidence_score_plot` draws the confidence waterfalls.
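
For readers unfamiliar with the term, a macro metric is the unweighted mean of a per-class metric, so every class counts equally regardless of how many samples it has. The sketch below illustrates the idea with F1 on 0/1 indicators. It is not the pipeline's implementation: this notebook does not say which metrics are macro-averaged, and the function names and class labels (`class_a`, `class_b`) are placeholders.

```python
def f1_binary(y_true, y_pred):
    """F1 score for one class, given 0/1 ground truth and 0/1 predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    # Convention chosen here: F1 is 0.0 when the class never appears
    # in either the ground truth or the predictions.
    return 2 * tp / denom if denom else 0.0


def macro_f1(true_by_class, pred_by_class):
    """Return (macro F1, per-class F1 dict) for dicts keyed by class label."""
    if not true_by_class:
        raise ValueError("at least one class is required")
    per_class = {
        label: f1_binary(true_by_class[label], pred_by_class[label])
        for label in true_by_class
    }
    # Macro average: plain mean over classes, ignoring class frequency.
    return sum(per_class.values()) / len(per_class), per_class


# Placeholder data: two hypothetical classes, four samples each.
true_by_class = {"class_a": [1, 1, 0, 0], "class_b": [0, 1, 1, 0]}
pred_by_class = {"class_a": [1, 0, 0, 0], "class_b": [0, 1, 1, 0]}
macro, per_class = macro_f1(true_by_class, pred_by_class)
```

A low score on a rare class pulls the macro average down as much as a low score on a common one, which is why per-class results are worth inspecting alongside it.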


In [None]:
multilabel_results = {
    'cot': PROJECT_ROOT / '1_LLM_Eval/test_results/multilabel_cot_20251029-192707.json',
    'fewshot': PROJECT_ROOT / '1_LLM_Eval/test_results/multilabel_fewshot_20251029-192707.json',
}

multilabel_pipeline = LLMEvalPipeline(
    results_map=multilabel_results,
    ground_truth=DATASET_PATH,
    task='multitype',
    output_dir=OUTPUT_DIR,
)

multi_metrics = multilabel_pipeline.save_and_print_metrics(
    OUTPUT_DIR / 'multilabel_metrics.csv'
)
display(multi_metrics)

multilabel_pipeline.plot_roc_auc_bars('multilabel_roc_auc_bar.png')
multilabel_pipeline.plot_roc_curves('multilabel_roc_curve')
multilabel_pipeline.plot_precision_vs_residual('multilabel_precision_residual')
multilabel_pipeline.multi_confidence_score_plot('multilabel_confidence_waterfall')


## 4. Next steps

* Swap in new JSON outputs by updating the dictionaries.
* Point `DATASET_PATH` to alternative ground-truth files if you are testing new cohorts.
* The plotting methods return `Path` objects, so you can capture them for downstream logging if desired.
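
As one possible form of downstream logging, the returned paths could be collected into a small JSON manifest. The sketch below assumes each plotting call yields a single `Path`; `write_artifact_manifest` and the manifest layout are hypothetical, not part of `eval_llm_pipeline`.

```python
import json
from pathlib import Path


def write_artifact_manifest(artifacts: dict, manifest_path) -> Path:
    """Write a JSON manifest describing each artifact and return its path.

    Hypothetical helper: `artifacts` is assumed to map a short name to the
    path of a saved file.
    """
    records = []
    for name, path in artifacts.items():
        path = Path(path)
        exists = path.is_file()
        records.append({
            "name": name,
            "path": str(path),
            "exists": exists,
            # Size is recorded only for files that are actually on disk.
            "size_bytes": path.stat().st_size if exists else None,
        })
    manifest_path = Path(manifest_path)
    manifest_path.write_text(json.dumps(records, indent=2))
    return manifest_path


# Possible usage in this notebook (assumes a single Path is returned):
# roc_path = binary_pipeline.plot_roc_curves('binary_roc_curve')
# write_artifact_manifest({'binary_roc_curve': roc_path},
#                         OUTPUT_DIR / 'artifact_manifest.json')
```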
