# Code to calculate agreement between Monique and Reyan

Warning: this is all vibe-coded. The calculations still need to be checked for accuracy.

# 1  Annotation Framework: Multi-Label, Order-Invariant Agreement

## 1.1  Context and purpose
Each emergency-department visit receives up to five Reason-for-Visit (RFV1–RFV5) labels drawn from the **17 top-level NHAMCS RVC groups** defined in the *MIMIC IV Chief Complaint → NHAMCS Reason-for-Visit Reviewer Manual v0.3*.
These represent patient-stated reasons, not clinician diagnoses.

Our goal is to quantify **inter-rater agreement** among multiple human annotators who each output an **unordered subset** of these groups for every visit.
For analysis we ignore RFV ordering and treat
\( S_{jr} = \{\text{categories assigned by rater } r \text{ to visit } j\} \)
as an unordered set of 0–5 elements.
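
As a minimal sketch of this order-invariant view (the helper name `to_label_set` and the example labels are illustrative, not part of the analysis code), two raters who list the same categories in different RFV slots end up with identical sets:

```python
from typing import Iterable, Optional, Set


def to_label_set(slots: Iterable[Optional[str]]) -> Set[str]:
    """Collapse ordered RFV1..RFV5 slots into an unordered set, skipping blanks."""
    return {s.strip() for s in slots if s is not None and s.strip()}


# Hypothetical visit: same two categories, entered in a different slot order.
rater_a = ["Symptom – Respiratory", "Symptom – General", None, None, None]
rater_b = ["Symptom – General", "Symptom – Respiratory", "", None, None]

set_a = to_label_set(rater_a)
set_b = to_label_set(rater_b)
print(set_a == set_b)
```

A category entered in more than one slot collapses to a single element, so a set can be smaller than the number of filled slots.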

## 1.2  Visit-level agreement classes
For any two raters A and B:

| Class | Definition | Intuition |
|-------|-------------|-----------|
| **Full** | \(S_A = S_B\) | Complete category match |
| **Partial** | \(S_A \cap S_B \neq \varnothing\) and \(S_A \ne S_B\) | Share at least one category |
| **None** | \(S_A \cap S_B = \varnothing\) and \(S_A \neq S_B\) | Disjoint label sets (two empty sets count as Full) |

For three raters,
- **Full** = all three sets identical;
- **Partial** = not all identical, but at least one pair shares a category;
- **None** = not all identical and pairwise disjoint.
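
The two- and three-rater rules can be folded into one function over any number of raters. This is a sketch under that reading of the definitions; the name `classify_visit` and the shortened labels are hypothetical:

```python
from itertools import combinations
from typing import Sequence, Set


def classify_visit(label_sets: Sequence[Set[str]]) -> str:
    """Return 'full', 'partial', or 'none' for one visit rated by 2+ raters."""
    first = label_sets[0]
    if all(s == first for s in label_sets[1:]):
        return "full"      # every rater chose exactly the same set
    if any(a & b for a, b in combinations(label_sets, 2)):
        return "partial"   # at least one pair shares a category
    return "none"          # every pair is disjoint


resp, gen = "Respiratory", "General"   # shortened labels for illustration
examples = {
    "two identical": classify_visit([{resp}, {resp}]),
    "two overlapping": classify_visit([{resp, gen}, {resp}]),
    "two disjoint": classify_visit([{resp}, {gen}]),
    "three, one pair agrees": classify_visit([{resp}, {resp}, {gen}]),
    "both empty": classify_visit([set(), set()]),
}
for name, label in examples.items():
    print(f"{name}: {label}")
```

Checking equality first means two empty sets are classed as full agreement rather than none.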

## 1.3  Set-similarity metrics
Because order is meaningless, comparisons use **set-based overlaps**:

| Metric | Formula | Notes |
|---------|----------|-------|
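| **Jaccard** | \(\lvert S_A \cap S_B\rvert / \lvert S_A \cup S_B\rvert\) | Measures intersection over union |
| **Overlap Coefficient** | \(\lvert S_A \cap S_B\rvert / \min(\lvert S_A\rvert, \lvert S_B\rvert)\) | Robust when one rater lists fewer items |
| **F₁ (Set)** | \(2\lvert S_A \cap S_B\rvert / (\lvert S_A\rvert + \lvert S_B\rvert)\) | Equivalent to Sørensen–Dice |
| **Micro-F₁ (Set)** | \(2\sum_j \lvert S_{jA} \cap S_{jB}\rvert / \sum_j (\lvert S_{jA}\rvert + \lvert S_{jB}\rvert)\) | Corpus-level summary, pooled over visits \(j\) |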

These capture graded similarity beyond binary full/none counts.
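
A small worked example may help show how the metrics differ. The two visits below are made up, and the empty-set conventions (two empty sets score 1.0, one empty set scores 0.0) are an assumption stated in the docstring rather than something the formulas dictate:

```python
from typing import Dict, List, Set, Tuple


def set_metrics(a: Set[str], b: Set[str]) -> Dict[str, float]:
    """Jaccard, overlap coefficient, and set F1 for one visit.

    Assumption: two empty sets count as perfect agreement (1.0); an empty set
    against a non-empty one scores 0.0 on every metric.
    """
    if not a and not b:
        return {"jaccard": 1.0, "overlap": 1.0, "f1": 1.0}
    inter = len(a & b)
    smaller = min(len(a), len(b))
    return {
        "jaccard": inter / len(a | b),
        "overlap": inter / smaller if smaller else 0.0,
        "f1": 2 * inter / (len(a) + len(b)),
    }


def micro_f1(pairs: List[Tuple[Set[str], Set[str]]]) -> float:
    """Pool intersections and set sizes over all visits before dividing."""
    total_inter = sum(len(a & b) for a, b in pairs)
    total_size = sum(len(a) + len(b) for a, b in pairs)
    return 1.0 if total_size == 0 else 2 * total_inter / total_size


# Hypothetical visits: (rater A set, rater B set)
visit_1 = ({"Symptom – Respiratory", "Symptom – General", "Symptom – Circulatory"},
           {"Symptom – Respiratory", "Symptom – General"})
visit_2 = ({"Symptom – Digestive"}, {"Symptom – Genitourinary"})

m1 = set_metrics(*visit_1)   # B is a subset of A: jaccard 2/3, overlap 1.0, f1 0.8
m2 = set_metrics(*visit_2)   # disjoint: every metric is 0.0
mean_f1 = (m1["f1"] + m2["f1"]) / 2          # per-visit average: 0.4
pooled_f1 = micro_f1([visit_1, visit_2])     # pooled counts: 4/7
print(m1, m2, mean_f1, pooled_f1)
```

The per-visit mean weights every visit equally, while the pooled micro-F₁ gives more weight to visits with more labels, so the two summaries can disagree.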

## 1.4  Three-rater adjudication logic
When a third reviewer adjudicates:
- Compute the distribution of *full / partial / none* across all three.
- For disagreements between raters 1 and 2, evaluate whether rater 3:
  - matches either rater,
  - matches their union or intersection,
  - introduces any new category, or
  - provides a subset of the union.

This quantifies how the final “reference standard” was reached.
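
The checks listed above can be sketched for a single disagreement as follows. The function name `describe_adjudication` and the example sets are hypothetical; as an assumption, "matches the intersection" is only counted when raters 1 and 2 share at least one category:

```python
from typing import Dict, Set


def describe_adjudication(r1: Set[str], r2: Set[str], r3: Set[str]) -> Dict[str, bool]:
    """Describe how adjudicator r3 relates to two disagreeing raters r1 and r2."""
    union, inter = r1 | r2, r1 & r2
    return {
        "equals_r1": r3 == r1,
        "equals_r2": r3 == r2,
        "equals_union": r3 == union,
        # Assumption: an empty intersection is never counted as a match.
        "equals_intersection": bool(inter) and r3 == inter,
        "introduces_new": bool(r3 - union),
        "subset_of_union": r3 <= union,
    }


r1 = {"Symptom – Respiratory", "Symptom – General"}
r2 = {"Symptom – Respiratory"}

sided_with_r2 = describe_adjudication(r1, r2, {"Symptom – Respiratory"})
added_category = describe_adjudication(r1, r2, {"Symptom – Respiratory", "Administrative"})
print(sided_with_r2)
print(added_category)
```

The flags are not mutually exclusive: in the first case the adjudicator matches rater 2, matches the intersection, and is a subset of the union all at once, so the corresponding rates need not sum to 1.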


# 2  Agreement Beyond Chance (Chance-Corrected Metrics)

## 2.1  Binary expansion per category
Convert the multi-label sets into a binary decision matrix over the 17 canonical RVC groups:
for each visit × rater × category, record 1 if the category is present.
This yields parallel binary classification tasks where standard chance-corrected statistics apply.
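
A minimal sketch of the expansion, using a three-category list for readability (the real analysis uses all 17 groups; the names `CATEGORIES` and `to_indicator_rows` are illustrative):

```python
from typing import List, Sequence, Set

# Illustrative subset only; the full analysis uses the 17 canonical RVC groups.
CATEGORIES = ["Symptom – Respiratory", "Symptom – General", "Administrative"]


def to_indicator_rows(label_sets: Sequence[Set[str]],
                      categories: Sequence[str]) -> List[List[int]]:
    """One row per visit, one 0/1 column per category."""
    return [[int(c in labels) for c in categories] for labels in label_sets]


# Hypothetical label sets from one rater for three visits.
rater_1 = [{"Symptom – Respiratory", "Symptom – General"}, {"Administrative"}, set()]
matrix_1 = to_indicator_rows(rater_1, CATEGORIES)

# Each column is one binary task, e.g. "was Respiratory assigned?"
column_respiratory = [row[0] for row in matrix_1]
print(matrix_1, column_respiratory)
```

A label that is not in the category list contributes no 1 anywhere, so it is worth checking for unmatched labels before trusting the per-category statistics.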

## 2.2  Why Gwet’s AC1
Cohen’s κ is sensitive to marginal prevalence—especially when most visits lack a given category (the *kappa paradox*).
Gwet’s AC1 provides a more stable estimate of “agreement beyond chance” in such sparse settings.

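For each category *g*, with *m* raters and *N* visits, let \(X_{jr}^{(g)} \in \{0, 1\}\) indicate whether rater *r* assigned *g* to visit *j*, and let \(n^+_j = \sum_r X_{jr}^{(g)}\) count the raters who did: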

\[
A_j =
\frac{\binom{n^+_j}{2} + \binom{m-n^+_j}{2}}{\binom{m}{2}},
\qquad
P_o = \text{mean}(A_j),
\qquad
p = \frac{1}{Nm}\sum_{j,r} X_{jr}^{(g)},
\]
\[
\text{AC1} = \frac{P_o - 2p(1-p)}{1 - 2p(1-p)}.
\]

Report **per-category AC1** and **macro-average** across categories.
Also compute **pairwise κ** (for comparison) and **percent agreement** to provide a full view.
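
To make the formula concrete, here is a pure-Python sketch for a single category, with made-up ratings from three raters on four visits. The function name `multirater_ac1` is illustrative and the code is a direct transcription of the equations above, not a validated implementation:

```python
from math import comb
from typing import List


def multirater_ac1(ratings: List[List[int]]) -> float:
    """Gwet's AC1 for one binary category; ratings[j][r] is 1 if rater r chose it."""
    n_visits = len(ratings)
    m = len(ratings[0])
    pairs = comb(m, 2)
    # A_j: share of rater pairs that agree on visit j (both 1 or both 0).
    a_j = []
    for row in ratings:
        n_pos = sum(row)
        a_j.append((comb(n_pos, 2) + comb(m - n_pos, 2)) / pairs)
    p_o = sum(a_j) / n_visits
    # p: overall share of positive ratings across visits and raters.
    p = sum(sum(row) for row in ratings) / (n_visits * m)
    p_e = 2 * p * (1 - p)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: rows are visits, columns are raters 1..3.
ratings = [
    [1, 1, 1],   # all agree: present
    [0, 0, 0],   # all agree: absent
    [1, 1, 0],   # one of three pairs agrees
    [0, 0, 0],   # all agree: absent
]
ac1 = multirater_ac1(ratings)
print(round(ac1, 4))
```

Here \(P_o = 5/6\), \(p = 5/12\), and the chance term is \(2p(1-p) = 35/72\), giving AC1 = 25/37 ≈ 0.676. Because \(2p(1-p) \le 0.5\), the denominator can never reach zero.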

## 2.3  Interpretation
| Metric | What it captures | Typical reporting |
|---------|------------------|-------------------|
| **Percent agreement** | Raw consistency, ignores chance | Always include |
| **Cohen’s κ** | Chance-corrected for two raters | Supplemental |
| **Gwet’s AC1** | Chance-corrected, robust to imbalance | Primary statistic |
| **Multi-rater AC1** | Extension of AC1 to m > 2 | Summary across all raters |

High percent agreement with high AC1 (≈ 0.9–1.0) implies strong, reproducible labeling of the RVC groups defined in the Reviewer Manual v0.3.
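
The gap between κ and AC1 for rare categories can be seen in a toy 2×2 table. The counts below are invented for illustration (100 visits, a category each rater assigns to 4 of them) and the function name is hypothetical:

```python
from typing import Dict


def agreement_from_counts(both_yes: int, only_a: int, only_b: int,
                          both_no: int) -> Dict[str, float]:
    """Percent agreement, Cohen's kappa, and Gwet's AC1 from a 2x2 table."""
    n = both_yes + only_a + only_b + both_no
    po = (both_yes + both_no) / n
    p_a = (both_yes + only_a) / n    # prevalence according to rater A
    p_b = (both_yes + only_b) / n    # prevalence according to rater B
    pe_kappa = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (po - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else float("nan")
    p_bar = (p_a + p_b) / 2
    pe_ac1 = 2 * p_bar * (1 - p_bar)
    ac1 = (po - pe_ac1) / (1 - pe_ac1)
    return {"percent_agreement": po, "cohen_kappa": kappa, "gwet_ac1": ac1}


# Hypothetical rare category: raters agree on 96 of 100 visits.
rare = agreement_from_counts(both_yes=2, only_a=2, only_b=2, both_no=94)
print({k: round(v, 3) for k, v in rare.items()})
```

With 96% raw agreement, κ comes out near 0.48 while AC1 is near 0.96, because κ's chance term (about 0.92) is dominated by the many joint "absent" decisions.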


In [1]:
# Agreement analysis for multi-label (order-invariant) RFV annotations.

import pandas as pd
import numpy as np
from itertools import combinations
from collections import Counter
import os
from pathlib import Path
from typing import List, Dict, Set
from dotenv import load_dotenv

load_dotenv()

WORK_DIR = Path(os.getenv("WORK_DIR", Path.cwd())).expanduser().resolve()

# -----------------------------
# Configuration
# -----------------------------
EXCEL_PATH = WORK_DIR / "Annotation/Final 2025-10-14 Annotation Sample.xlsx"  # update if needed

SHEET = "cohort_cc_sample"
DATA_SHEET = "Data"  # canonical category names in column "RVC Categories"

outdir = WORK_DIR / "Annotation/Full Annotations/Agreement Metrics"
outdir.mkdir(parents=True, exist_ok=True)

# -----------------------------
# Helpers
# -----------------------------
def load_categories_from_sheet(xlsx_path: str, sheet: str = "Data", col: str = "RVC Categories") -> List[str]:
    try:
        cat_df = pd.read_excel(xlsx_path, sheet_name=sheet)
        return [c for c in cat_df[col].dropna().astype(str).tolist() if c.strip()]
    except Exception:
        # Fallback in case the "Data" sheet is absent
        return [
            'Injuries & adverse effects','Symptom – Respiratory','Symptom – Circulatory',
            'Symptom – Nervous','Symptom – Digestive','Symptom – Genitourinary',
            'Symptom – Musculoskeletal','Symptom – Skin/Hair/Nails','Symptom – Eye/Ear',
            'Symptom – General','Symptom – Psychological','Diseases (patient-stated diagnosis)',
            'Abnormal test result','Diagnostic/Screening/Preventive','Treatment/Medication',
            'Administrative','Uncodable/Unknown',
        ]

CANONICAL_CATS = load_categories_from_sheet(EXCEL_PATH, DATA_SHEET, "RVC Categories")

def normalize_label(s: str) -> str:
    """Normalize minor punctuation/spacing so category strings match reliably."""
    if pd.isna(s): return None
    s = str(s).strip()
    if not s: return None
    s = s.replace(" - ", " – ").replace("—", "–").replace("-", "–")
    s = " ".join(s.split())
    return s

def extract_rater_sets(df: pd.DataFrame, rater: int) -> List[Set[str]]:
    """Return a list of sets: one set of categories per row for the given rater."""
    cols = [f"annot{rater}_rvs{i}_cat" for i in range(1, 6) if f"annot{rater}_rvs{i}_cat" in df.columns]
    sets = []
    for _, row in df[cols].iterrows():
        labels = set()
        for c in cols:
            lab = normalize_label(row[c])
            if lab:
                labels.add(lab)
        sets.append(labels)
    return sets

# --- set-based similarities ---
def jaccard(a: Set[str], b: Set[str]) -> float:
    u = len(a | b)
    return 1.0 if u == 0 else len(a & b) / u

def overlap_coeff(a: Set[str], b: Set[str]) -> float:
    m = min(len(a), len(b))
    if m == 0:
        # Both empty -> perfect agreement; exactly one empty -> no overlap.
        return 1.0 if len(a) == len(b) == 0 else 0.0
    return len(a & b) / m

def f1_set(a: Set[str], b: Set[str]) -> float:
    denom = len(a) + len(b)
    return 1.0 if denom == 0 else 2 * len(a & b) / denom

def classify_pair(a: Set[str], b: Set[str]) -> str:
    if a == b: return "full"
    return "partial" if (a & b) else "none"

def classify_three(a: Set[str], b: Set[str], c: Set[str]) -> str:
    if a == b == c: return "full"
    return "partial" if (a & b) or (a & c) or (b & c) else "none"

def micro_f1_raters(A: List[Set[str]], B: List[Set[str]]) -> float:
    tp = sum(len(a & b) for a, b in zip(A, B))
    denom = sum(len(a) + len(b) for a, b in zip(A, B))
    return 1.0 if denom == 0 else 2 * tp / denom

# --- multi-label -> binary matrices over categories ---
def flatten_binary_decisions(raters_sets: Dict[str, List[Set[str]]], categories: List[str]) -> Dict[str, np.ndarray]:
    """Return {rater: matrix (n_items, n_categories)} with 1/0 presence for each category."""
    n = len(next(iter(raters_sets.values())))
    k = len(categories)
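    # Normalize canonical names exactly like rater labels; otherwise a name containing
    # a hyphen (e.g. "patient-stated") never matches after normalize_label() rewrites it.
    idx = {normalize_label(c): i for i, c in enumerate(categories)}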
    out = {}
    for rater, sets in raters_sets.items():
        M = np.zeros((n, k), dtype=int)
        for j, s in enumerate(sets):
            for lab in s:
                lab = normalize_label(lab)
                if lab in idx:
                    M[j, idx[lab]] = 1
        out[rater] = M
    return out

# --- pairwise chance-corrected (per category) ---
def pairwise_binary_agreement_stats(M1: np.ndarray, M2: np.ndarray) -> pd.DataFrame:
    """Per-category percent agreement, Cohen's kappa, and Gwet's AC1 (binary)."""
    n, k = M1.shape
    rows = []
    for c in range(k):
        y1, y2 = M1[:, c], M2[:, c]
        tp = int(((y1 == 1) & (y2 == 1)).sum())
        tn = int(((y1 == 0) & (y2 == 0)).sum())
        fp = int(((y1 == 0) & (y2 == 1)).sum())
        fn = int(((y1 == 1) & (y2 == 0)).sum())
        N = tp + tn + fp + fn
        if N == 0:
            rows.append(dict(category=c, N=0, percent_agreement=np.nan, cohen_kappa=np.nan, gwet_ac1=np.nan))
            continue
        Po = (tp + tn) / N
        p1, p2 = y1.mean(), y2.mean()
        Pe_kappa = p1 * p2 + (1 - p1) * (1 - p2)
        kappa = (Po - Pe_kappa) / (1 - Pe_kappa) if (1 - Pe_kappa) != 0 else np.nan
        pbar = (p1 + p2) / 2
        Pe_ac1 = 2 * pbar * (1 - pbar)
        ac1 = (Po - Pe_ac1) / (1 - Pe_ac1) if (1 - Pe_ac1) != 0 else np.nan
        rows.append(dict(category=c, N=N, percent_agreement=Po, cohen_kappa=kappa, gwet_ac1=ac1))
    return pd.DataFrame(rows)

# --- multi-rater AC1 (binary, per category) ---
def multirater_ac1_per_category(Ms: List[np.ndarray]) -> pd.DataFrame:
    """Gwet's AC1 for m>=2 raters, per category, using average pairwise agreement."""
    m = len(Ms)
    n, k = Ms[0].shape
    stack = np.stack(Ms, axis=0)  # (m, n, k)
    denom_pairs = m * (m - 1) / 2
    rows = []
    for c in range(k):
        M_c = stack[:, :, c]           # (m, n)
        npos = M_c.sum(axis=0)         # length n
        nneg = m - npos
        Aj = (npos * (npos - 1) / 2 + nneg * (nneg - 1) / 2) / denom_pairs
        Po = Aj.mean()
        p = npos.sum() / (n * m)
        Pe = 2 * p * (1 - p)
        ac1 = (Po - Pe) / (1 - Pe) if (1 - Pe) != 0 else np.nan
        rows.append(dict(category=c, percent_agreement=Po, gwet_ac1=ac1, prevalence=p))
    return pd.DataFrame(rows)

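def summarize_set_agreement(r1_sets: List[Set[str]], r2_sets: List[Set[str]]) -> tuple[pd.DataFrame, pd.Series]:
    """Return (per-visit metrics DataFrame, summary Series) for two raters."""
    recs = []
    for a, b in zip(r1_sets, r2_sets):
        recs.append({
            "exact": int(a == b),
            "partial": int((a != b) and (len(a & b) > 0)),
            # Require a != b so two empty sets count as "exact" only, matching classify_pair().
            "none": int((a != b) and (len(a & b) == 0)),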
            "jaccard": jaccard(a, b),
            "overlap": overlap_coeff(a, b),
            "f1_set": f1_set(a, b),
            "len_a": len(a),
            "len_b": len(b),
            "len_inter": len(a & b),
            "len_union": len(a | b),
        })
    df = pd.DataFrame(recs)
    summary = pd.Series({
        "N_items": len(df),
        "exact_rate": df["exact"].mean(),
        "partial_rate": df["partial"].mean(),
        "none_rate": df["none"].mean(),
        "mean_jaccard": df["jaccard"].mean(),
        "mean_overlap": df["overlap"].mean(),
        "mean_f1_set": df["f1_set"].mean(),
        "micro_f1_set": micro_f1_raters(r1_sets, r2_sets),
        "mean_len_a": df["len_a"].mean(),
        "mean_len_b": df["len_b"].mean(),
    })
    return df, summary

def summarize_three_way(r1: List[Set[str]], r2: List[Set[str]], r3: List[Set[str]]) -> pd.Series:
    labels = [classify_three(a, b, c) for a, b, c in zip(r1, r2, r3)]
    N = len(labels)
    counts = Counter(labels)
    return pd.Series({
        "N_items": N,
        "full_rate": counts.get("full", 0) / N,
        "partial_rate": counts.get("partial", 0) / N,
        "none_rate": counts.get("none", 0) / N
    })

def adjudication_resolution(r1, r2, r3) -> pd.Series:
    recs = []
    for a, b, c in zip(r1, r2, r3):
        if a == b: 
            continue
        u, inter = (a | b), (a & b)
        recs.append({
            "r3_equals_r1": int(c == a),
            "r3_equals_r2": int(c == b),
            "r3_equals_union": int(c == u),
            "r3_equals_intersection": int((len(inter) > 0) and (c == inter)),
            "r3_introduces_new": int(len(c - u) > 0),
            "r3_subset_of_union": int(c <= u),
        })
    if not recs:
        return pd.Series({"N_disagreements": 0})
    df = pd.DataFrame(recs)
    return pd.Series({
        "N_disagreements": len(df),
        "r3_equals_r1_rate": df["r3_equals_r1"].mean(),
        "r3_equals_r2_rate": df["r3_equals_r2"].mean(),
        "r3_equals_union_rate": df["r3_equals_union"].mean(),
        "r3_equals_intersection_rate": df["r3_equals_intersection"].mean(),
        "r3_introduces_new_rate": df["r3_introduces_new"].mean(),
        "r3_subset_of_union_rate": df["r3_subset_of_union"].mean(),
    })

# -----------------------------
# Run
# -----------------------------
df = pd.read_excel(EXCEL_PATH, sheet_name=SHEET)

r1_sets = extract_rater_sets(df, 1)
r2_sets = extract_rater_sets(df, 2)
r3_sets = extract_rater_sets(df, 3)

pair12_df, pair12_summary = summarize_set_agreement(r1_sets, r2_sets)
pair13_df, pair13_summary = summarize_set_agreement(r1_sets, r3_sets)
pair23_df, pair23_summary = summarize_set_agreement(r2_sets, r3_sets)

three_summary = summarize_three_way(r1_sets, r2_sets, r3_sets)

raters_sets = {"r1": r1_sets, "r2": r2_sets, "r3": r3_sets}
bin_mats = flatten_binary_decisions(raters_sets, CANONICAL_CATS)

pw12 = pairwise_binary_agreement_stats(bin_mats["r1"], bin_mats["r2"])
pw13 = pairwise_binary_agreement_stats(bin_mats["r1"], bin_mats["r3"])
pw23 = pairwise_binary_agreement_stats(bin_mats["r2"], bin_mats["r3"])

name_map = {i: c for i, c in enumerate(CANONICAL_CATS)}
for df_pw in (pw12, pw13, pw23):
    df_pw["category_name"] = df_pw["category"].map(name_map)

multi_ac1 = multirater_ac1_per_category([bin_mats["r1"], bin_mats["r2"], bin_mats["r3"]])
multi_ac1["category_name"] = multi_ac1["category"].map(name_map)

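def summarize_kappa_ac1(df_pw: pd.DataFrame) -> pd.Series:
    # Series.mean() skips NaN, so a category where kappa is undefined (e.g. neither rater
    # ever used it) drops out of macro kappa but still counts toward macro AC1.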
    return pd.Series({
        "macro_cohen_kappa": df_pw["cohen_kappa"].mean(),
        "macro_gwet_ac1": df_pw["gwet_ac1"].mean(),
        "macro_percent_agreement": df_pw["percent_agreement"].mean(),
        "micro_percent_agreement": (df_pw["N"] * df_pw["percent_agreement"]).sum() / df_pw["N"].sum()
    })

pair12_chance = summarize_kappa_ac1(pw12)
pair13_chance = summarize_kappa_ac1(pw13)
pair23_chance = summarize_kappa_ac1(pw23)

multi_macro_ac1 = multi_ac1["gwet_ac1"].mean()
multi_macro_agree = multi_ac1["percent_agreement"].mean()

# -----------------------------
# Outputs
# -----------------------------

pair12_df.to_csv(os.path.join(outdir, "pair_R1_R2_set_metrics.csv"), index=False)
pair13_df.to_csv(os.path.join(outdir, "pair_R1_R3_set_metrics.csv"), index=False)
pair23_df.to_csv(os.path.join(outdir, "pair_R2_R3_set_metrics.csv"), index=False)

pw12.to_csv(os.path.join(outdir, "pair_R1_R2_binary_stats.csv"), index=False)
pw13.to_csv(os.path.join(outdir, "pair_R1_R3_binary_stats.csv"), index=False)
pw23.to_csv(os.path.join(outdir, "pair_R2_R3_binary_stats.csv"), index=False)

multi_ac1.to_csv(os.path.join(outdir, "all3_multirater_ac1_by_category.csv"), index=False)

adj_summary = adjudication_resolution(r1_sets, r2_sets, r3_sets)

summary_text = f"""
=== Pairwise set-level (order-invariant) ===

R1 vs R2:
{pair12_summary.to_string()}

R1 vs R3:
{pair13_summary.to_string()}

R2 vs R3:
{pair23_summary.to_string()}

=== Three-rater set-level (full/partial/none) ===
{three_summary.to_string()}

=== Chance-corrected (binary per category) ===
R1 vs R2:
{pair12_chance.to_string()}

R1 vs R3:
{pair13_chance.to_string()}

R2 vs R3:
{pair23_chance.to_string()}

=== Multi-rater AC1 (3 raters) ===
macro_gwet_ac1={multi_macro_ac1:.4f}
macro_percent_agreement={multi_macro_agree:.4f}

=== Adjudication (R3) for R1≠R2 ===
{adj_summary.to_string()}
"""

with open(os.path.join(outdir, "summary.txt"), "w", encoding="utf-8") as f:  # summary contains "≠"
    f.write(summary_text)

print(summary_text)
print("\nOutputs written to:", os.path.abspath(outdir))



=== Pairwise set-level (order-invariant) ===

R1 vs R2:
N_items         160.000000
exact_rate        0.806250
partial_rate      0.131250
none_rate         0.062500
mean_jaccard      0.856250
mean_overlap      0.882292
mean_f1_set       0.877440
micro_f1_set      0.873950
mean_len_a        1.493750
mean_len_b        1.481250

R1 vs R3:
N_items         160.000000
exact_rate        0.850000
partial_rate      0.075000
none_rate         0.075000
mean_jaccard      0.879167
mean_overlap      0.892708
mean_f1_set       0.891190
micro_f1_set      0.897275
mean_len_a        1.493750
mean_len_b        1.487500

R2 vs R3:
N_items         160.000000
exact_rate        0.831250
partial_rate      0.112500
none_rate         0.056250
mean_jaccard      0.876042
mean_overlap      0.900000
mean_f1_set       0.894583
micro_f1_set      0.892632
mean_len_a        1.481250
mean_len_b        1.487500

=== Three-rater set-level (full/partial/none) ===
N_items         160.00000
full_rate         0.75625
partial_



## Agreement with NLP Pipeline

In [2]:
# --- R3 (adjudicator) vs NLP comparison (order-invariant, multi-label) ---

import pandas as pd
import numpy as np
import os
from pathlib import Path
from typing import List, Set, Dict
from dotenv import load_dotenv

try:
    WORK_DIR
except NameError:
    load_dotenv()
    WORK_DIR = Path(os.getenv("WORK_DIR", Path.cwd())).expanduser().resolve()

DATA_DIR = WORK_DIR / "MIMIC tabular data"

# =========================
# Config: update this path
# =========================
NLP_PATH = DATA_DIR / "2025-10-14 MIMICIV all with CC_with_NLP.xlsx"
NLP_SHEET = 0  # or sheet name if needed
NLP_MIN_SIM = None  # Optional: set to a float threshold (e.g., 0.45) to require RFVn_sim >= threshold

# If your adjudication workbook isn't in memory as `df`, set it here:
EXCEL_PATH = WORK_DIR / "Annotation/Final 2025-10-14 Annotation Sample.xlsx"
SHEET = "cohort_cc_sample"

# -----------------------------
# Helpers (duplicated for self-containment)
# -----------------------------
def normalize_label(s: str) -> str:
    if pd.isna(s):
        return None
    s = str(s).strip()
    if not s:
        return None
    s = s.replace(" - ", " – ").replace("—", "–").replace("-", "–")
    s = " ".join(s.split())
    return s

def load_categories_from_sheet(xlsx_path: str, sheet: str = "Data", col: str = "RVC Categories") -> List[str]:
    try:
        cat_df = pd.read_excel(xlsx_path, sheet_name=sheet)
        cats = [c for c in cat_df[col].dropna().astype(str).tolist() if c.strip()]
        return cats
    except Exception:
        # Fallback list
        return [
            'Injuries & adverse effects','Symptom – Respiratory','Symptom – Circulatory',
            'Symptom – Nervous','Symptom – Digestive','Symptom – Genitourinary',
            'Symptom – Musculoskeletal','Symptom – Skin/Hair/Nails','Symptom – Eye/Ear',
            'Symptom – General','Symptom – Psychological','Diseases (patient-stated diagnosis)',
            'Abnormal test result','Diagnostic/Screening/Preventive','Treatment/Medication',
            'Administrative','Uncodable/Unknown',
        ]

try:
    CANONICAL_CATS  # use from prior cell if defined
except NameError:
    CANONICAL_CATS = load_categories_from_sheet(EXCEL_PATH, "Data", "RVC Categories")

def extract_rater_sets_from_df(df: pd.DataFrame, rater_col_prefix: str) -> List[Set[str]]:
    """
    rater_col_prefix = 'annot3' for adjudicator; columns must be like annot3_rvs1_cat .. annot3_rvs5_cat
    """
    cols = [f"{rater_col_prefix}_rvs{i}_cat" for i in range(1, 6) if f"{rater_col_prefix}_rvs{i}_cat" in df.columns]
    out = []
    for _, row in df[cols].iterrows():
        s = set()
        for c in cols:
            lab = normalize_label(row[c])
            if lab:
                s.add(lab)
        out.append(s)
    return out

def extract_nlp_sets_from_df(df_nlp: pd.DataFrame, min_sim=None) -> List[Set[str]]:
    """
    Build set of predicted categories using RFV1_name..RFV5_name (optionally filter by RFVn_sim >= min_sim).
    """
    name_cols = [f"RFV{i}_name" for i in range(1, 6) if f"RFV{i}_name" in df_nlp.columns]
    sim_cols  = [f"RFV{i}_sim"  for i in range(1, 6) if f"RFV{i}_sim"  in df_nlp.columns]
    use_sim = min_sim is not None and len(sim_cols) == len(name_cols)

    sets = []
    for _, row in df_nlp.iterrows():
        s = set()
        for i in range(1, 6):
            name_col = f"RFV{i}_name"
            if name_col not in df_nlp.columns: 
                continue
            name_val = normalize_label(row[name_col])
            if not name_val:
                continue
            if use_sim:
                sim_col = f"RFV{i}_sim"
                sim_val = row.get(sim_col, np.nan)
                if pd.isna(sim_val) or float(sim_val) < float(min_sim):
                    continue
            s.add(name_val)
        sets.append(s)
    return sets

# --- set-based similarities ---
def jaccard(a: Set[str], b: Set[str]) -> float:
    u = len(a | b)
    return 1.0 if u == 0 else len(a & b) / u

def overlap_coeff(a: Set[str], b: Set[str]) -> float:
    m = min(len(a), len(b))
    if m == 0:
        return 1.0 if len(a) == len(b) == 0 else 0.0
    return len(a & b) / m

def f1_set(a: Set[str], b: Set[str]) -> float:
    denom = len(a) + len(b)
    return 1.0 if denom == 0 else 2 * len(a & b) / denom

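def summarize_set_agreement(A_sets: List[Set[str]], B_sets: List[Set[str]]) -> tuple[pd.DataFrame, pd.Series]:
    """Return (per-visit metrics DataFrame, summary Series) for two label sources."""
    recs = []
    for a, b in zip(A_sets, B_sets):
        recs.append({
            "exact": int(a == b),
            "partial": int((a != b) and (len(a & b) > 0)),
            # Require a != b so two empty sets count as "exact" only.
            "none": int((a != b) and (len(a & b) == 0)),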
            "jaccard": jaccard(a, b),
            "overlap": overlap_coeff(a, b),
            "f1_set": f1_set(a, b),
            "len_a": len(a),
            "len_b": len(b),
            "len_inter": len(a & b),
            "len_union": len(a | b),
        })
    df = pd.DataFrame(recs)
    summary = pd.Series({
        "N_items": len(df),
        "exact_rate": df["exact"].mean(),
        "partial_rate": df["partial"].mean(),
        "none_rate": df["none"].mean(),
        "mean_jaccard": df["jaccard"].mean(),
        "mean_overlap": df["overlap"].mean(),
        "mean_f1_set": df["f1_set"].mean(),
        "micro_f1_set": (2 * df["len_inter"].sum()) / (df["len_a"].sum() + df["len_b"].sum()) if (df["len_a"].sum()+df["len_b"].sum())>0 else 1.0
    })
    return df, summary

# --- multi-label -> binary matrices ---
def flatten_binary_decisions(sets: List[Set[str]], categories: List[str]) -> np.ndarray:
    n, k = len(sets), len(categories)
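    # Normalize canonical names exactly like the labels; otherwise a name containing
    # a hyphen (e.g. "patient-stated") never matches after normalize_label() rewrites it.
    idx = {normalize_label(c): i for i, c in enumerate(categories)}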
    M = np.zeros((n, k), dtype=int)
    for j, s in enumerate(sets):
        for lab in s:
            lab = normalize_label(lab)
            if lab in idx:
                M[j, idx[lab]] = 1
    return M

def pairwise_binary_agreement_stats(M1: np.ndarray, M2: np.ndarray, categories: List[str]) -> pd.DataFrame:
    n, k = M1.shape
    rows = []
    for c in range(k):
        y1, y2 = M1[:, c], M2[:, c]
        tp = int(((y1 == 1) & (y2 == 1)).sum())
        tn = int(((y1 == 0) & (y2 == 0)).sum())
        fp = int(((y1 == 0) & (y2 == 1)).sum())
        fn = int(((y1 == 1) & (y2 == 0)).sum())
        N = tp + tn + fp + fn
        if N == 0:
            rows.append(dict(category_ix=c, category=categories[c], N=0, percent_agreement=np.nan, cohen_kappa=np.nan, gwet_ac1=np.nan))
            continue
        Po = (tp + tn) / N
        p1, p2 = y1.mean(), y2.mean()
        Pe_kappa = p1 * p2 + (1 - p1) * (1 - p2)
        kappa = (Po - Pe_kappa) / (1 - Pe_kappa) if (1 - Pe_kappa) != 0 else np.nan
        pbar = (p1 + p2) / 2
        Pe_ac1 = 2 * pbar * (1 - pbar)
        ac1 = (Po - Pe_ac1) / (1 - Pe_ac1) if (1 - Pe_ac1) != 0 else np.nan
        rows.append(dict(
            category_ix=c, category=categories[c], N=N,
            tp=tp, tn=tn, fp=fp, fn=fn,
            prevalence_r3=p1, prevalence_nlp=p2,
            percent_agreement=Po, cohen_kappa=kappa, gwet_ac1=ac1
        ))
    return pd.DataFrame(rows)

def summarize_kappa_ac1(df_pw: pd.DataFrame) -> pd.Series:
    return pd.Series({
        "macro_cohen_kappa": df_pw["cohen_kappa"].mean(),
        "macro_gwet_ac1": df_pw["gwet_ac1"].mean(),
        "macro_percent_agreement": df_pw["percent_agreement"].mean(),
        "micro_percent_agreement": (df_pw["tp"].sum() + df_pw["tn"].sum()) / df_pw["N"].sum()
    })

# -----------------------------
# Load adjudication and NLP
# -----------------------------
# Adjudication (R3)
try:
    df  # from earlier cell
except NameError:
    df = pd.read_excel(EXCEL_PATH, sheet_name=SHEET)

r3_key_cols = ["hadm_id", "subject_id"]
r3_cols = r3_key_cols + [c for c in df.columns if c.startswith("annot3_rvs") and c.endswith("_cat")]
df_r3 = df[r3_cols].copy()

# NLP predictions
df_nlp = pd.read_excel(NLP_PATH, sheet_name=NLP_SHEET)
nlp_key_cols = ["hadm_id", "subject_id"]
nlp_name_cols = [f"RFV{i}_name" for i in range(1, 6)]
present_name_cols = [c for c in nlp_name_cols if c in df_nlp.columns]
if not set(nlp_key_cols).issubset(df_nlp.columns) or not present_name_cols:
    raise ValueError("NLP file must contain hadm_id, subject_id, and RFV*_name columns.")

# -----------------------------
# Join + build sets
# -----------------------------
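# Duplicate (hadm_id, subject_id) keys in either table would multiply rows in this join and
# make the dropped-row count below unreliable; validate="one_to_one" would surface that.
df_join = pd.merge(df_r3, df_nlp, on=["hadm_id", "subject_id"], how="inner")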
n_before = len(df_r3)
n_after = len(df_join)
n_dropped = n_before - n_after
print(f"Joined R3 to NLP on (hadm_id, subject_id): {n_after} matched; {n_dropped} adjudicated rows had no NLP match.")

# Build sets
r3_sets = extract_rater_sets_from_df(df_join, rater_col_prefix="annot3")
nlp_sets = extract_nlp_sets_from_df(df_join, min_sim=NLP_MIN_SIM)

# -----------------------------
# Set-level agreement (visit-level)
# -----------------------------
pair_df, pair_summary = summarize_set_agreement(r3_sets, nlp_sets)

# Diagnostics by visit (optional): missed vs added
def list_diff(a: Set[str], b: Set[str]) -> (List[str], List[str]):
    return sorted(a - b), sorted(b - a)

missed, added = [], []
for a, b in zip(r3_sets, nlp_sets):
    m, ad = list_diff(a, b)
    missed.append("; ".join(m))
    added.append("; ".join(ad))

visit_metrics = df_join[r3_key_cols].copy()
visit_metrics["exact"] = pair_df["exact"]
visit_metrics["partial"] = pair_df["partial"]
visit_metrics["none"] = pair_df["none"]
visit_metrics["jaccard"] = pair_df["jaccard"]
visit_metrics["overlap"] = pair_df["overlap"]
visit_metrics["f1_set"] = pair_df["f1_set"]
visit_metrics["r3_size"] = pair_df["len_a"]
visit_metrics["nlp_size"] = pair_df["len_b"]
visit_metrics["missed_by_nlp"] = missed
visit_metrics["added_by_nlp"] = added

# -----------------------------
# Chance-corrected per-category
# -----------------------------
M_r3  = flatten_binary_decisions(r3_sets, CANONICAL_CATS)
M_nlp = flatten_binary_decisions(nlp_sets, CANONICAL_CATS)

pw_stats = pairwise_binary_agreement_stats(M_r3, M_nlp, CANONICAL_CATS)
chance_summary = summarize_kappa_ac1(pw_stats)

# -----------------------------
# Outputs
# -----------------------------
outdir = WORK_DIR / "annotation_agreement_outputs_nlp"
outdir.mkdir(parents=True, exist_ok=True)

visit_metrics.to_csv(os.path.join(outdir, "R3_vs_NLP_set_metrics_by_visit.csv"), index=False)
pw_stats.to_csv(os.path.join(outdir, "R3_vs_NLP_binary_stats_by_category.csv"), index=False)

summary_lines = []
summary_lines.append("=== R3 (adjudicator) vs NLP: set-level (order-invariant) ===")
summary_lines.append(pair_summary.to_string())
summary_lines.append("\n=== R3 vs NLP: chance-corrected (binary per category) ===")
summary_lines.append(chance_summary.to_string())
summary_text = "\n\n".join(summary_lines)

with open(os.path.join(outdir, "R3_vs_NLP_summary.txt"), "w") as f:
    f.write(summary_text)

print(summary_text)
print("\nArtifacts written to:", os.path.abspath(outdir))


Joined R3 to NLP on (hadm_id, subject_id): 160 matched; 0 adjudicated rows had no NLP match.
=== R3 (adjudicator) vs NLP: set-level (order-invariant) ===

N_items         160.000000
exact_rate        0.543750
partial_rate      0.206250
none_rate         0.250000
mean_jaccard      0.638229
mean_overlap      0.713542
mean_f1_set       0.670982
micro_f1_set      0.700935


=== R3 vs NLP: chance-corrected (binary per category) ===

macro_cohen_kappa          0.496737
macro_gwet_ac1             0.959330
macro_percent_agreement    0.966544
micro_percent_agreement    0.966544

Artifacts written to: /Users/reblocke/Research/Hypercap-CC-NLP/annotation_agreement_outputs_nlp
