# Demo Evaluation (Manuscript-aligned)

This notebook demonstrates the **evaluation workflow** for the interpretable, evidence-centered QA framework using the repository’s **synthetic demonstration CSVs**.

**Important:** The CSVs in `data/demo_evaluation/` are **synthetic** and provided to reproduce the *logic* of the evaluation pipeline (schema + metric computation). They **do not** represent real study outputs or real expert ratings.

Evaluation dimensions shown here mirror the manuscript:
- **Citation precision@k** (k=3 in the demo)
- **Factual consistency / factuality** (ordinal in the manuscript; demo may use a simplified scale)
- **Interpretability** (Likert-style ordinal ratings)
- **Uncertainty alignment** between system confidence labels and expert consensus confidence
- **Inter-rater agreement** among experts (pairwise Cohen’s κ)

We also compute **weighted Cohen’s κ (linear weights)** for uncertainty alignment (Low/Medium/High),
matching the manuscript’s reporting approach.


In [None]:
from pathlib import Path
import pandas as pd
import numpy as np

from sklearn.metrics import cohen_kappa_score

# Repo evaluation utilities
from evaluation import (
    parse_relevance_list,
    add_precision_at_k,
    aggregate_expert_scores,
    compute_expert_confidence_majority,
    compute_uncertainty_alignment,
    build_report_table,
    compute_summary_stats,
    pairwise_cohens_kappa,
    export_outputs,
)


In [None]:
REPO_ROOT = Path("..").resolve()  # assumes the working directory is notebooks/
DATA_ROOT = REPO_ROOT / "data"
DEMO_EVAL_DIR = DATA_ROOT / "demo_evaluation"
OUT_DIR = REPO_ROOT / "outputs"
OUT_DIR.mkdir(exist_ok=True, parents=True)

DEMO_EVAL_DIR, OUT_DIR

## 1) Load synthetic evaluation tables

In [None]:
scenarios = pd.read_csv(DEMO_EVAL_DIR / "scenarios.csv")
system_conf = pd.read_csv(DEMO_EVAL_DIR / "system_confidence.csv")
expert_ratings = pd.read_csv(DEMO_EVAL_DIR / "expert_ratings.csv")
retrieval_rel = pd.read_csv(DEMO_EVAL_DIR / "retrieval_relevance.csv")

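from IPython.display import display

# Render each table on its own rather than as a single tuple
for df in (scenarios, system_conf.head(), expert_ratings.head(), retrieval_rel.head()):
    display(df)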

## 2) Retrieval evaluation: citation precision@k

In [None]:
rel = retrieval_rel.copy()
rel["relevance_list"] = rel["relevance_list"].apply(parse_relevance_list)
rel = add_precision_at_k(rel, relevance_col="relevance_list", out_col="precision_at_k")
rel

In [None]:
# Descriptive summary (demo)
rel["precision_at_k"].mean(), rel["precision_at_k"].std(ddof=1)
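
For readers without the repository at hand, the metric itself can be sketched independently. Precision@k is the share of the top-k retrieved citations judged relevant. The function below is a hypothetical stand-in, not the repository's `add_precision_at_k`; it assumes binary relevance judgments in rank order and divides by k even when fewer than k items were retrieved.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of the top-k retrieved items judged relevant (1) vs. not (0)."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = list(relevance)[:k]
    # Assumption: divide by k, so ranks that were never filled count as misses.
    return sum(1 for r in top_k if r) / k


# Hypothetical judgments for one scenario: ranks 1 and 3 are relevant.
example = precision_at_k([1, 0, 1], k=3)  # 2 of 3 relevant
```

Dividing by k rather than by the number of items actually returned is one common convention; the repository may handle short lists differently.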

## 3) Expert aggregation: factuality & interpretability

In [None]:
agg_scores = aggregate_expert_scores(expert_ratings)
agg_scores

## 4) Expert consensus confidence (majority vote)

In [None]:
expert_maj = compute_expert_confidence_majority(expert_ratings)
expert_maj
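
The consensus step can be illustrated with a small majority-vote sketch. How `compute_expert_confidence_majority` resolves ties is not shown in this notebook, so the helper below (a hypothetical `majority_label`) simply returns `None` when the top count is shared; that tie rule is an assumption, not the repository's documented behavior.

```python
from collections import Counter
from typing import Optional, Sequence


def majority_label(labels: Sequence[str]) -> Optional[str]:
    """Return the most frequent label, or None when the top count is tied."""
    counts = Counter(labels)
    if not counts:
        return None
    ranked = counts.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        # Assumption: leave ties unresolved instead of picking a side.
        return None
    return ranked[0][0]


# Hypothetical ratings from three experts for one scenario.
consensus = majority_label(["High", "Medium", "High"])  # "High"
```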

## 5) System–expert uncertainty alignment

In [None]:
ua = compute_uncertainty_alignment(
    system_confidence_df=system_conf,
    expert_majority_df=expert_maj,
)
ua

In [None]:
# Exact agreement rate (demo alignment signal)
ua["aligned"].mean()

### Weighted Cohen’s κ (linear weights) for uncertainty alignment

To match manuscript-style reporting, we compute weighted κ between:
- system confidence labels (Low/Medium/High)
- expert majority confidence labels (Low/Medium/High)

The resulting κ measures ordinal agreement between the system's labels and the expert consensus. It is interpreted as *interpretability alignment* rather than probabilistic calibration.
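
To make the weighting concrete, here is a dependency-free sketch of linearly weighted κ, where a disagreement between categories `i` and `j` costs `|i - j|`. The function name and the toy labels are illustrative assumptions; the result should correspond to scikit-learn's `cohen_kappa_score` with `weights="linear"` when all three levels are supplied as labels.

```python
from typing import Sequence


def linear_weighted_kappa(a: Sequence[int], b: Sequence[int], n_levels: int) -> float:
    """Cohen's kappa with linear disagreement weights |i - j| on codes 0..n_levels-1."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed contingency table of (rater A code, rater B code) counts.
    observed = [[0] * n_levels for _ in range(n_levels)]
    for i, j in zip(a, b):
        observed[i][j] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(n_levels)) for j in range(n_levels)]
    levels = range(n_levels)
    # Weighted disagreement actually observed vs. expected under independence.
    d_obs = sum(abs(i - j) * observed[i][j] for i in levels for j in levels)
    d_exp = sum(abs(i - j) * row[i] * col[j] for i in levels for j in levels) / n
    if d_exp == 0:
        raise ValueError("kappa is undefined when both raters use one shared category")
    return 1.0 - d_obs / d_exp


# Toy data (Low=0, Medium=1, High=2): two of five pairs differ by one level.
system = [2, 2, 1, 0, 1]
expert = [2, 1, 1, 0, 2]
kappa = linear_weighted_kappa(system, expert, n_levels=3)  # 0.5 for this toy data
```

Because the penalty grows with distance, a Low/High mismatch counts twice as much as a Low/Medium one, which is why the ordering of the labels matters.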


In [None]:
# Map ordered labels to integers for weighted kappa
order = {"Low": 0, "Medium": 1, "High": 2}

tmp = ua.dropna(subset=["system_confidence", "expert_confidence_majority"]).copy()
sys_y = tmp["system_confidence"].map(order).astype(int)
exp_y = tmp["expert_confidence_majority"].map(order).astype(int)

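# Pass labels explicitly so the linear weights span all three levels even if one is absent
kappa_weighted = cohen_kappa_score(sys_y, exp_y, labels=[0, 1, 2], weights="linear")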
kappa_weighted

## 6) Build a report table (Table-2-like)

In [None]:
results = (
    scenarios
    .merge(rel[["scenario_id", "k", "precision_at_k"]], on="scenario_id", how="left")
    .merge(agg_scores, on="scenario_id", how="left")
    .merge(system_conf, on="scenario_id", how="left")
    .merge(expert_maj, on="scenario_id", how="left")
    .merge(ua[["scenario_id", "aligned"]], on="scenario_id", how="left")
)

report = build_report_table(results)
report

## 7) Inter-rater agreement among experts (pairwise Cohen’s κ)

In [None]:
kappa_f = pairwise_cohens_kappa(expert_ratings, label_col="factuality")
kappa_i = pairwise_cohens_kappa(expert_ratings, label_col="interpretability")

kappa_f, kappa_i
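
As a rough illustration of what a pairwise agreement computation involves, the sketch below computes unweighted Cohen's κ for every pair of raters over the items both have rated. The nested-dict layout and the rater names are hypothetical; the repository's `pairwise_cohens_kappa` operates on the `expert_ratings` table and may differ in details such as weighting or missing-value handling.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, Sequence, Tuple


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two equally long label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    p_exp = sum(ca[c] * cb[c] for c in ca) / (n * n)
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when both raters use one shared label")
    return (p_obs - p_exp) / (1.0 - p_exp)


def pairwise_kappa(ratings: Dict[str, Dict[Hashable, Hashable]]) -> Dict[Tuple[str, str], float]:
    """Kappa for each rater pair, restricted to items that both raters scored."""
    out = {}
    for r1, r2 in combinations(sorted(ratings), 2):
        shared = sorted(set(ratings[r1]) & set(ratings[r2]))
        out[(r1, r2)] = cohens_kappa(
            [ratings[r1][s] for s in shared],
            [ratings[r2][s] for s in shared],
        )
    return out


# Hypothetical ratings: rater -> {scenario_id: score}.
toy = {
    "E1": {1: 3, 2: 4, 3: 5, 4: 3},
    "E2": {1: 3, 2: 4, 3: 4, 4: 3},
    "E3": {1: 3, 2: 5, 3: 5, 4: 4},
}
pairs = pairwise_kappa(toy)  # ("E1", "E2") comes out at about 0.6
```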

## 8) Summary statistics (mean ± SD)

In [None]:
summary = compute_summary_stats(
    # `results` already carries "aligned" from section 6; merging `ua` again
    # would suffix the duplicated columns (aligned_x / aligned_y).
    results=results,
    kappa_factuality=kappa_f,
    kappa_interpretability=kappa_i,
    precision_col="precision_at_k",
    factuality_col="factuality_mean",
    interpretability_col="interpretability_mean",
    alignment_col="aligned",
)
summary
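
The mean ± SD format can be reproduced with the standard library alone. The helper below is a hypothetical illustration, not the logic of `compute_summary_stats`; it assumes the sample standard deviation (n − 1 denominator), which is the same convention as the `ddof=1` call used earlier in this notebook.

```python
import statistics
from typing import Sequence


def format_mean_sd(values: Sequence[float], digits: int = 2) -> str:
    """Format values as 'mean ± SD' using the sample standard deviation."""
    if len(values) < 2:
        raise ValueError("at least two values are needed for a sample SD")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # n - 1 denominator
    return f"{mean:.{digits}f} ± {sd:.{digits}f}"


# Hypothetical precision@3 values for three scenarios.
label = format_mean_sd([1.0, 2 / 3, 1 / 3])  # "0.67 ± 0.33"
```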

## 9) Export outputs (optional)

In [None]:
export_outputs(
    out_dir=OUT_DIR,
    report_table=report,
    summary_stats=summary,
    kappa_factuality=kappa_f,
    kappa_interpretability=kappa_i,
)

sorted(p.name for p in OUT_DIR.glob("demo_*.csv"))

## Notes

- The demonstration CSVs may use simplified scales for illustration.
  The evaluation utilities are schema-driven and can be reused with real evaluation tables
  as long as the column names and coding conventions are consistent.
- For manuscript-style uncertainty alignment reporting, this notebook computes weighted κ
  (linear weights) using the expert majority confidence labels.
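
Before reusing the utilities on other tables, a lightweight consistency check along these lines may help catch schema drift early. It is a sketch under stated assumptions: the function name is hypothetical, rows are plain dicts (for example from `csv.DictReader`), and the only coding convention checked is the Low/Medium/High confidence vocabulary used in this notebook.

```python
from typing import Dict, Iterable, List

ALLOWED_CONFIDENCE = {"Low", "Medium", "High"}


def find_schema_problems(
    rows: Iterable[Dict[str, str]],
    required: List[str],
    confidence_col: str,
) -> List[str]:
    """Return human-readable problems; an empty list means the rows look consistent."""
    problems = []
    for idx, row in enumerate(rows):
        missing = [c for c in required if c not in row]
        if missing:
            problems.append(f"row {idx}: missing columns {missing}")
            continue
        value = row.get(confidence_col)
        if value not in ALLOWED_CONFIDENCE:
            # Catches case or spelling variants such as "medium" or "Med".
            problems.append(f"row {idx}: unexpected label {value!r}")
    return problems


# Hypothetical rows resembling a system-confidence table.
rows = [
    {"scenario_id": "S1", "system_confidence": "High"},
    {"scenario_id": "S2", "system_confidence": "medium"},
    {"scenario_id": "S3"},
]
issues = find_schema_problems(rows, ["scenario_id", "system_confidence"], "system_confidence")
```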
