# `eval_llm_pipeline` Sample Usage

This notebook shows how to wire `eval_llm_pipeline.LLMEvalPipeline` into future experiments so that metrics, ROC/PR curves, and confidence visualizations are computed the same way across prompts and models.

Key takeaways:
- Works for both binary IPV detection and multitype runs.
- Multitype plots score *exact-match correctness vs confidence*, matching the pipeline's current behavior.
- All figures/CSVs land in `1_LLM_Eval/test_results/figs/` so you can diff results over time.


## 1. Environment setup

1. Import the pipeline class.
2. Point `PROJECT_ROOT` at the repository root.
3. Reuse the helper constants (`DATASET_PATH`, `OUTPUT_DIR`) in later cells.


In [None]:
# !git config --global user.name "github username"
# !git config --global user.email "Email"

In [None]:
# !rm -rf ipvresearch25   # optional: clears old copy
# !git clone https://github.com/zelaneroz/ipvresearch25.git
# %cd ipvresearch25

In [4]:
from pathlib import Path
import sys
from IPython.display import display

PROJECT_ROOT = Path('..').resolve()
DATASET_PATH = PROJECT_ROOT / 'Dataset/reddit_data.csv'
OUTPUT_DIR = PROJECT_ROOT / '1_LLM_Eval/test_results/figs'
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

sys.path.append(str(PROJECT_ROOT / '1_LLM_Eval'))
from eval_llm_pipeline import LLMEvalPipeline


## 2. Binary evaluation example

Below we compare two prompting strategies (`cot` vs `fewshot`). Extend `binary_results` with additional JSON files to stack more experiments on the same plots.


In [2]:
binary_results = {
    'cot': PROJECT_ROOT / '1_LLM_Eval/test_results/binary_cot.json',
    'fewshot': PROJECT_ROOT / '1_LLM_Eval/test_results/binary_fewshot.json',
}

binary_pipeline = LLMEvalPipeline(
    results_map=binary_results,
    ground_truth=DATASET_PATH,
    task='binary',
    output_dir=OUTPUT_DIR,
    positive_label='IPV'
)

binary_metrics = binary_pipeline.save_and_print_metrics(
    OUTPUT_DIR / 'binary_metrics.csv'
)
display(binary_metrics)

binary_pipeline.plot_roc_auc_bars('binary_roc_auc_bar.png')
binary_pipeline.plot_roc_curves('binary_roc_curve')
binary_pipeline.plot_precision_vs_residual('binary_precision_residual')



=== Evaluation Summary ===
         accuracy  precision      f1  roc_auc
model                                        
cot        0.8058     0.9598  0.7818   0.8143
fewshot    0.8932     0.9745  0.8900   0.8975

Saved metrics to /Users/zeespanto/DevProj/ipvresearch25/1_LLM_Eval/test_results/figs/binary_metrics.csv


model,accuracy,precision,f1,roc_auc
cot,0.805825,0.959821,0.781818,0.814344
fewshot,0.893204,0.974453,0.89,0.897523


[PosixPath('/Users/zeespanto/DevProj/ipvresearch25/1_LLM_Eval/test_results/figs/binary_precision_residual.png')]

*Artifacts saved*: `binary_metrics.csv`, `binary_roc_auc_bar.png`, `binary_roc_curve.png`, and `binary_precision_residual.png` inside `1_LLM_Eval/test_results/figs/`.
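Because the metrics land in a CSV, successive runs can be compared directly. The sketch below is one way to do that with the standard library only. It assumes the saved file has a `model` column followed by numeric metric columns (as the displayed table suggests); `load_metrics` and `metric_deltas` are hypothetical helpers, not part of `eval_llm_pipeline`.

```python
import csv
from pathlib import Path


def load_metrics(csv_path):
    """Read a metrics CSV into {model: {metric: float}}.

    Assumption: the first column is named 'model' and every other
    column holds a numeric metric.
    """
    with Path(csv_path).open(newline="") as handle:
        rows = list(csv.DictReader(handle))
    return {
        row["model"]: {k: float(v) for k, v in row.items() if k != "model"}
        for row in rows
    }


def metric_deltas(old_csv, new_csv):
    """Return new-minus-old deltas for models and metrics found in both files."""
    old, new = load_metrics(old_csv), load_metrics(new_csv)
    deltas = {}
    for model in old.keys() & new.keys():
        shared = old[model].keys() & new[model].keys()
        deltas[model] = {m: round(new[model][m] - old[model][m], 6) for m in shared}
    return deltas


# Example (paths are placeholders for an archived copy and the current run):
# metric_deltas('binary_metrics_previous.csv', OUTPUT_DIR / 'binary_metrics.csv')
```

Models that appear in only one of the two files are skipped, so a newly added prompting strategy will not raise an error; it simply has no delta yet.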


## 3. Multitype evaluation example

Multitype runs now judge success by **exact-match correctness** (all abuse subtypes correct within a row). The ROC and precision–residual plots therefore compare the model's confidence against that per-row correctness, while the waterfall plot colors each sentence by the dominant subtype (or Not IPV).
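To make the exact-match idea concrete, here is a minimal sketch of how per-row correctness can be derived and then compared against confidence. It is an illustration under stated assumptions, not the pipeline's actual implementation: the subtype names, the use of an empty list for "Not IPV", and the helpers `exact_match` and `roc_auc` are all hypothetical.

```python
def exact_match(pred_labels, true_labels):
    """Return 1 if the predicted subtype set equals the true set, else 0."""
    return int(set(pred_labels) == set(true_labels))


def roc_auc(correct, confidence):
    """Rank-based AUC: chance that a correct row has higher confidence
    than an incorrect row, with ties counted as 0.5."""
    pos = [c for ok, c in zip(correct, confidence) if ok]
    neg = [c for ok, c in zip(correct, confidence) if not ok]
    if not pos or not neg:
        return float("nan")  # AUC is undefined with only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical rows: (predicted subtypes, true subtypes, model confidence).
# An empty list stands in for "Not IPV" in this sketch.
rows = [
    (["physical"], ["physical"], 0.9),
    (["emotional"], ["emotional", "physical"], 0.8),  # partial match counts as wrong
    ([], [], 0.7),
    (["sexual"], ["emotional"], 0.4),
]
correct = [exact_match(pred, true) for pred, true, _ in rows]
confidence = [conf for _, _, conf in rows]
auc = roc_auc(correct, confidence)
print(correct, auc)  # [1, 0, 1, 0] 0.75
```

Note that the second row predicts one of the two true subtypes yet still scores 0, which is why exact-match metrics can look harsh when models capture only part of a multi-subtype label.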


In [3]:
multilabel_results = {
    'cot': PROJECT_ROOT / '1_LLM_Eval/test_results/multilabel_cot_20251029-192707.json',
    'fewshot': PROJECT_ROOT / '1_LLM_Eval/test_results/multilabel_fewshot_20251029-192707.json',
}

multilabel_pipeline = LLMEvalPipeline(
    results_map=multilabel_results,
    ground_truth=DATASET_PATH,
    task='multitype',
    output_dir=OUTPUT_DIR,
)

multi_metrics = multilabel_pipeline.save_and_print_metrics(
    OUTPUT_DIR / 'multilabel_metrics.csv'
)
display(multi_metrics)

multilabel_pipeline.plot_roc_auc_bars('multilabel_roc_auc_bar.png')
multilabel_pipeline.plot_roc_curves('multilabel_roc_curve')
multilabel_pipeline.plot_precision_vs_residual('multilabel_precision_residual')
multilabel_pipeline.multi_confidence_score_plot('multilabel_confidence_waterfall')



=== Evaluation Summary ===
         accuracy  precision      f1  roc_auc
model                                        
cot          0.68     0.3333  0.0667   0.5556
fewshot      0.64     0.0000  0.0000   0.5000

Saved metrics to /Users/zeespanto/DevProj/ipvresearch25/1_LLM_Eval/test_results/figs/multilabel_metrics.csv


model,accuracy,precision,f1,roc_auc
cot,0.68,0.333333,0.066667,0.555556
fewshot,0.64,0.0,0.0,0.5




[PosixPath('/Users/zeespanto/DevProj/ipvresearch25/1_LLM_Eval/test_results/figs/multilabel_confidence_waterfall_cot.png'),
 PosixPath('/Users/zeespanto/DevProj/ipvresearch25/1_LLM_Eval/test_results/figs/multilabel_confidence_waterfall_fewshot.png')]

## 4. Next steps

- Swap in new JSON outputs by updating the dictionaries.
- Point `DATASET_PATH` to alternative ground-truth files if you are testing new cohorts.
- Capture returned `Path` objects from plotting calls if you want to log them elsewhere (e.g., Weights & Biases).
- When adding new multitype models, ensure their outputs include confidence scores (or per-class probabilities) so the exact-match ROC plot stays informative.
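
For the path-capturing step, the cell outputs above suggest the plotting calls return lists of `Path` objects. A small helper can flatten those return values into plain strings for a logger. This is a sketch only: `collect_paths` is a hypothetical helper, and handling of single paths and `None` is a defensive assumption since the exact return types are not confirmed here.

```python
from pathlib import Path


def collect_paths(*results):
    """Flatten plotting return values into a list of string paths.

    Accepts a single path, a list/tuple of paths, or None (skipped).
    """
    flat = []
    for result in results:
        if result is None:
            continue
        items = result if isinstance(result, (list, tuple)) else [result]
        flat.extend(str(Path(item)) for item in items)
    return flat


# Example, reusing names from the cells above:
# artifacts = collect_paths(
#     binary_pipeline.plot_roc_curves('binary_roc_curve'),
#     multilabel_pipeline.multi_confidence_score_plot('multilabel_confidence_waterfall'),
# )
```

The resulting list of strings can then be handed to whichever experiment tracker you use.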
