# Mode Comparison Statistics Notebook (Full Modes)

This notebook computes statistical tests for **Add**, **Random**, **Lesion**, and **Rescue** interventions across layers and parameter values using `vm_results`.

It assumes Phase B results with:

- Layers: 25, 26, 27
- Add/Random: α ∈ {0.5, 1.0, 2.0}
- Lesion: γ ∈ {0.5, 1.0, 2.0}
- Rescue: β ∈ {0.5, 1.0, 2.0}, using a fixed γ (default 1.0) when summarizing
- Localities: `cot`, `answer`

All p-values are computed from real per-example data.

The final summary can be joined to the visualization in `04_mode_comparison_grid.ipynb` to annotate each mode × layer × parameter cell with effect sizes and significance.

In [1]:
import sys
sys.path.insert(0, '../..')

from pathlib import Path

import numpy as np
import pandas as pd

from evaluation import load_data, statistical_tests as st

# Base directories
VM_RESULTS_DIR = Path("../../vm_results/results/")
OUTPUT_DIR = Path("../outputs/mode_comparison")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("Setup complete.")

Setup complete.


## 1. Load Phase B vm_results for COT and Answer locality

We locate the most recent experiment directory for each locality and load all per-example runs using `evaluation.load_data.load_experiment_results`.

If you generated data with `scripts/generate_mock_vm_results.py`, the latest `data__{locality}_locality_*` experiment directory will typically be picked up here.
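Experiment directory names are expected to end in a `YYYYMMDD_HHMMSS` stamp. The following is a minimal sketch of selecting the newest run by that stamp rather than by path order, which matters once directories with different model prefixes sit side by side. The helpers `parse_run_timestamp` and `latest_run`, and the `zeta` prefix in the example, are hypothetical and not part of `evaluation.load_data`.

```python
from datetime import datetime
from pathlib import Path


def parse_run_timestamp(exp_dir: Path) -> datetime:
    """Parse the trailing YYYYMMDD_HHMMSS stamp from an experiment directory name."""
    # Assumption: everything after the last "_locality_" is the timestamp.
    stamp = exp_dir.name.rsplit("_locality_", 1)[-1]
    return datetime.strptime(stamp, "%Y%m%d_%H%M%S")


def latest_run(exp_dirs: list[Path]) -> Path:
    """Return the directory with the newest timestamp, regardless of prefix."""
    return max(exp_dirs, key=parse_run_timestamp)


# Hypothetical directory names: a plain lexicographic sort would rank the
# older "zeta" run last and therefore treat it as the most recent.
candidates = [
    Path("zeta__cot_locality_20250101_000000"),
    Path("data__cot_locality_20251202_220450"),
]
print(latest_run(candidates).name)  # data__cot_locality_20251202_220450
```

Parsing with `strptime` also fails loudly on a malformed directory name instead of silently misordering it.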

In [2]:
def find_exp_dir(locality: str, base_dir: Path = VM_RESULTS_DIR) -> Path:
    """Find the most recent experiment directory for a given locality.

    Expects names like:
        model__cot_locality_YYYYMMDD_HHMMSS
        model__answer_locality_YYYYMMDD_HHMMSS
    """
    pattern = f"*__{locality}_locality_*"
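    # Sort by the trailing timestamp rather than the full path, so that a
    # different model prefix cannot outrank a newer run
    exp_dirs = sorted(
        base_dir.glob(pattern),
        key=lambda p: p.name.rsplit("_locality_", 1)[-1],
    )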
    if not exp_dirs:
        raise FileNotFoundError(f"No experiment directory matching pattern {pattern} in {base_dir}")
    exp_dir = exp_dirs[-1]
    print(f"Using {locality} experiment: {exp_dir}")
    return exp_dir


def load_phase_b_results(locality: str) -> pd.DataFrame:
    """Load all per-example results for a given locality from vm_results."""
    exp_dir = find_exp_dir(locality)
    df = load_data.load_experiment_results(exp_dir)
    # Ensure the locality column is present and consistent
    df['locality'] = locality
    return df


cot_df = load_phase_b_results('cot')
answer_df = load_phase_b_results('answer')

print("COT locality data summary:")
load_data.print_data_summary(cot_df)

print("\nAnswer locality data summary:")
load_data.print_data_summary(answer_df)

Using cot experiment: ../../vm_results/results/data__cot_locality_20251202_220450
Using answer experiment: ../../vm_results/results/data__answer_locality_20251202_220450
COT locality data summary:
DATA SUMMARY
Total rows: 1350
Unique examples: 25
Models: ['data']
Datasets: ['mmlu_pro']
Modes: ['add', 'lesion', 'random', 'rescue']
Layers: [25, 26, 27]
Alpha values: [nan, 0.5, 1.0, 2.0]
Localities: ['cot']

Answer locality data summary:
DATA SUMMARY
Total rows: 1350
Unique examples: 25
Models: ['data']
Datasets: ['mmlu_pro']
Modes: ['add', 'lesion', 'random', 'rescue']
Layers: [25, 26, 27]
Alpha values: [nan, 0.5, 1.0, 2.0]
Localities: ['answer']


## 2. Combine and split by experiment type

We combine the COT and Answer data into a single DataFrame, then create views for:

- Paired (Add/Random)
- Lesion
- Rescue

This uses the mode/experiment-type parsing already implemented in `evaluation.load_data`.

In [3]:
full_df = pd.concat([cot_df, answer_df], ignore_index=True)
print(f"Combined rows (all modes, both localities): {len(full_df)}")

# Paired add/random
paired_df = load_data.get_paired_experiments(full_df)

# Lesion-only
lesion_df = load_data.get_lesion_experiments(full_df)

# Rescue-only
rescue_df = load_data.get_rescue_experiments(full_df)

print(f"Paired (add/random) rows: {len(paired_df)}")
print(f"Lesion rows: {len(lesion_df)}")
print(f"Rescue rows: {len(rescue_df)}")

# Basic grids inferred from data
layers = sorted(full_df['layer'].unique()) if 'layer' in full_df.columns else []
alphas = sorted(full_df.get('alpha', pd.Series(dtype=float)).dropna().unique().tolist())
gammas = sorted(full_df.get('gamma', pd.Series(dtype=float)).dropna().unique().tolist())
betas = sorted(full_df.get('beta', pd.Series(dtype=float)).dropna().unique().tolist())
localities = sorted(full_df['locality'].unique()) if 'locality' in full_df.columns else []

print("Layers:", layers)
print("Alpha values:", alphas)
print("Gamma values:", gammas)
print("Beta values:", betas)
print("Localities:", localities)

Combined rows (all modes, both localities): 2700
Paired (add/random) rows: 900
Lesion rows: 450
Rescue rows: 1350
Layers: [np.int64(25), np.int64(26), np.int64(27)]
Alpha values: [0.5, 1.0, 2.0]
Gamma values: [0.5, 1.0, 2.0]
Beta values: [0.5, 1.0, 2.0]
Localities: ['answer', 'cot']
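The printed row counts can be cross-checked against the grid sizes. This is a small sanity check, assuming a full factorial design in which each of the 25 examples appears once in every mode × layer × parameter × locality cell; the variable names are illustrative.

```python
# Grid sizes as reported in the data summaries and the printed grids
n_examples = 25
n_layers = 3
n_localities = 2
n_alpha = n_gamma = n_beta = 3

# Add and Random each sweep alpha
paired_rows = 2 * n_alpha * n_layers * n_examples * n_localities
# Lesion sweeps gamma
lesion_rows = n_gamma * n_layers * n_examples * n_localities
# Rescue sweeps every (beta, gamma) pair
rescue_rows = n_beta * n_gamma * n_layers * n_examples * n_localities

total_rows = paired_rows + lesion_rows + rescue_rows
print(paired_rows, lesion_rows, rescue_rows, total_rows)  # 900 450 1350 2700
```

A mismatch between these products and the loaded row counts would suggest missing or duplicated runs.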


## 3. Generic per-cell statistical summary

We define a helper that, for a given (mode, layer, locality, parameter), computes:

- Per-example deltas: `delta = intv_answer_correct − baseline_answer_correct`
- Summary stats: `n`, `delta_mean`, `delta_median`
- Tests:
  - Wilcoxon signed-rank vs 0
  - One-sample t-test vs 0
  - Sign test
  - McNemar (flip-based)

The function is parameterized by the parameter column (`param_col`), so we can use it for α, γ, or β.
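For intuition about the flip-based tests, here is a minimal standard-library sketch. It is not the implementation in `evaluation.statistical_tests`, and the names `exact_two_sided_binomial_p` and `flip_test` are hypothetical. With binary correctness every nonzero delta is ±1, so the sign test and the exact McNemar test both reduce to a two-sided binomial test on the flipped examples.

```python
from math import comb


def exact_two_sided_binomial_p(k: int, n: int) -> float:
    """Two-sided exact binomial p-value for k successes in n trials at p = 0.5."""
    if n == 0:
        return 1.0  # no flips at all: no evidence against the null
    tail = min(k, n - k)
    p = 2.0 * sum(comb(n, i) for i in range(tail + 1)) / 2**n
    return min(1.0, p)


def flip_test(baseline: list[int], intervened: list[int]) -> dict:
    """Summarize per-example deltas and test gains against losses."""
    deltas = [i - b for b, i in zip(baseline, intervened)]
    gains = sum(d == 1 for d in deltas)    # wrong -> correct
    losses = sum(d == -1 for d in deltas)  # correct -> wrong
    return {
        "n": len(deltas),
        "delta_mean": sum(deltas) / len(deltas),
        "net_gain": gains - losses,
        "p_value": exact_two_sided_binomial_p(gains, gains + losses),
    }


# Toy cell: 25 examples, 5 flip from wrong to correct, none flip back
baseline = [0] * 5 + [1] * 10 + [0] * 10
intervened = [1] * 5 + [1] * 10 + [0] * 10
print(flip_test(baseline, intervened))
```

In this toy cell the net gain is 5, the mean delta is 0.2, and the p-value is 2 × (1/2)⁵ = 0.0625. Unchanged examples contribute nothing to the test, which is why a cell with few flips cannot reach a small p-value however large `n` is.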

In [4]:
def summarize_cell(
    df: pd.DataFrame,
    mode: str,
    layer: int,
    locality: str,
    param_col: str,
    param_value: float,
    param_type: str,
    metric: str = 'answer',
    extra_filter: dict | None = None,
) -> dict:
    """Summarize one (mode, layer, locality, param) cell with stats.

    Args:
        df: Input DataFrame (already filtered to relevant experiment types)
        mode: 'add', 'random', 'lesion', 'rescue'
        layer: layer number
        locality: 'cot' or 'answer'
        param_col: column name for parameter ('alpha', 'gamma', or 'beta')
        param_value: parameter value to filter on
        param_type: label to store (e.g., 'alpha', 'gamma', 'beta')
        metric: 'answer' or 'reasoning'
        extra_filter: optional dict of {col: value} for additional filtering
    """
    # Default result, returned unchanged whenever the cell cannot be summarized
    result = {
        'mode': mode,
        'layer': int(layer),
        'param_value': float(param_value),
        'param_type': param_type,
        'locality': locality,
        'is_mock': False,
        'n': 0,
        'delta_mean': np.nan,
        'delta_median': np.nan,
        'mcnemar_p': np.nan,
        'mcnemar_net_gain': np.nan,
        'wilcoxon_p': np.nan,
        'sign_p': np.nan,
        'ttest_p': np.nan,
    }

    # If the parameter column is missing, there's nothing to summarize
    if param_col not in df.columns:
        return result

    mask = (
        (df.get('mode') == mode) &
        (df.get('layer') == layer) &
        (df.get('locality') == locality) &
        (df[param_col] == param_value)
    )

    if extra_filter is not None:
        for col, val in extra_filter.items():
            if col in df.columns:
                mask &= df[col] == val

    cell = df[mask]

    if cell.empty:
        return result

    base_col = f'baseline_{metric}_correct'
    intv_col = f'intv_{metric}_correct'
    if base_col not in cell.columns or intv_col not in cell.columns:
        print(f"Warning: missing columns {base_col} / {intv_col} for cell {result}")
        return result

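    c = cell.copy()
    for col in [base_col, intv_col]:
        if c[col].dtype == object:
            c[col] = c[col].map({'True': 1, 'False': 0, True: 1, False: 0})

    # Drop rows with a missing outcome before casting: astype(int) raises on NaN,
    # so a dropna() applied after the cast would never take effect
    c = c.dropna(subset=[base_col, intv_col])
    deltas = c[intv_col].astype(int) - c[base_col].astype(int)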
    n = len(deltas)
    result['n'] = int(n)
    if n == 0:
        return result

    result['delta_mean'] = float(deltas.mean())
    result['delta_median'] = float(deltas.median())

    # Run statistical tests using the utility module
    wilcox = st.run_wilcoxon_test(cell, metric=metric)
    ttest = st.run_ttest_vs_zero(cell, metric=metric)
    sign_res = st.run_sign_test(cell, metric=metric)
    mcnemar_res = st.run_mcnemar_test(cell, metric=metric)

    result['wilcoxon_p'] = wilcox.get('p_value', np.nan)
    result['ttest_p'] = ttest.get('p_value', np.nan)
    result['sign_p'] = sign_res.get('p_value', np.nan)
    result['mcnemar_p'] = mcnemar_res.get('p_value', np.nan)
    result['mcnemar_net_gain'] = mcnemar_res.get('net_gain', np.nan)

    return result

## 4. Summaries by mode

We now build one row per mode × layer × parameter × locality, for:

- Add (α grid)
- Random (α grid)
- Lesion (γ grid)
- Rescue (β grid, at fixed γ for visualization)

For Rescue, we typically want to visualize β as the parameter axis at a fixed γ (e.g., γ = 1.0), matching the interpretation in the mode comparison grid.
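The cell grid described above can also be enumerated declaratively. The sketch below is one possible layout, assuming the Phase B grid listed at the top of the notebook; `MODE_SPECS` and `cells` are hypothetical names, and the fixed γ enters only as an extra filter on Rescue.

```python
from itertools import product

# Grid values copied from the Phase B description
layers = [25, 26, 27]
localities = ["answer", "cot"]
values = [0.5, 1.0, 2.0]

# (mode, parameter column, extra filter) for each family of cells
MODE_SPECS = [
    ("add", "alpha", None),
    ("random", "alpha", None),
    ("lesion", "gamma", None),
    ("rescue", "beta", {"gamma": 1.0}),  # Rescue is sliced at a fixed gamma
]

cells = [
    {
        "mode": mode,
        "param_col": col,
        "param_value": value,
        "layer": layer,
        "locality": locality,
        "extra_filter": extra,
    }
    for (mode, col, extra), locality, layer, value in product(
        MODE_SPECS, localities, layers, values
    )
]
print(len(cells))  # 72 = 4 modes x 2 localities x 3 layers x 3 values
```

Keeping the per-mode differences in one table makes it harder to drop a mode by accident than with one hand-written loop per mode.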

In [5]:
rows = []

# Add and Random, parameter = alpha
for mode in ('add', 'random'):
    for locality in localities:
        for layer in layers:
            for alpha in alphas:
                rows.append(
                    summarize_cell(
                        df=paired_df,
                        mode=mode,
                        layer=int(layer),
                        locality=locality,
                        param_col='alpha',
                        param_value=float(alpha),
                        param_type='alpha',
                    )
                )

# Lesion, parameter = gamma
for locality in localities:
    for layer in layers:
        for gamma in gammas:
            rows.append(
                summarize_cell(
                    df=lesion_df,
                    mode='lesion',
                    layer=int(layer),
                    locality=locality,
                    param_col='gamma',
                    param_value=float(gamma),
                    param_type='gamma',
                )
            )

# Rescue, parameter = beta (at fixed gamma for visualization)
RESCUE_GAMMA_FOR_GRID = 1.0
for locality in localities:
    for layer in layers:
        for beta in betas:
            rows.append(
                summarize_cell(
                    df=rescue_df,
                    mode='rescue',
                    layer=int(layer),
                    locality=locality,
                    param_col='beta',
                    param_value=float(beta),
                    param_type='beta',
                    extra_filter={'gamma': RESCUE_GAMMA_FOR_GRID},
                )
            )

full_summary = pd.DataFrame(rows)
print("Full mode summary (first 20 rows):")
display(full_summary.head(20))

output_path = OUTPUT_DIR / 'mode_comparison_stats.csv'
full_summary.to_csv(output_path, index=False)
print(f"Saved combined mode summary to: {output_path}")

Full mode summary (first 20 rows):


| | mode | layer | param_value | param_type | locality | is_mock | n | delta_mean | delta_median | mcnemar_p | mcnemar_net_gain | wilcoxon_p | sign_p | ttest_p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | add | 25 | 0.5 | alpha | answer | False | 25 | 0.04 | 0.0 | 1.0 | 1.0 | 1.0 | 1.0 | 0.327287 |
| 1 | add | 25 | 1.0 | alpha | answer | False | 25 | 0.2 | 0.0 | 0.0625 | 5.0 | 0.0625 | 0.0625 | 0.021983 |
| 2 | add | 25 | 2.0 | alpha | answer | False | 25 | 0.16 | 0.0 | 0.125 | 4.0 | 0.125 | 0.125 | 0.042896 |
| 3 | add | 26 | 0.5 | alpha | answer | False | 25 | 0.0 | 0.0 | 1.0 | NaN | 1.0 | 1.0 | NaN |
| 4 | add | 26 | 1.0 | alpha | answer | False | 25 | -0.04 | 0.0 | 1.0 | -1.0 | 1.0 | 1.0 | 0.327287 |
| 5 | add | 26 | 2.0 | alpha | answer | False | 25 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 6 | add | 27 | 0.5 | alpha | answer | False | 25 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 7 | add | 27 | 1.0 | alpha | answer | False | 25 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 8 | add | 27 | 2.0 | alpha | answer | False | 25 | -0.04 | 0.0 | 1.0 | -1.0 | 1.0 | 1.0 | 0.663916 |
| 9 | add | 25 | 0.5 | alpha | cot | False | 25 | 0.12 | 0.0 | 0.375 | 3.0 | 0.375 | 0.375 | 0.185045 |


Saved combined mode summary to: ../outputs/mode_comparison/mode_comparison_stats.csv


## 5. Notes on usage

- `mode_comparison_stats.csv` now contains **real** summaries for:
  - Add (α), Random (α), Lesion (γ), Rescue (β at fixed γ)
  - All layers and both localities.
- Columns:
  - `mode`, `layer`, `param_value`, `param_type`, `locality`, `is_mock` (always False)
  - `n`, `delta_mean`, `delta_median`
  - `mcnemar_p`, `mcnemar_net_gain`
  - `wilcoxon_p`, `sign_p`, `ttest_p`

- In `04_mode_comparison_grid.ipynb`, you can load this CSV and:
  - Map `param_type`/`param_value` to the appropriate parameter axis per mode.
  - Annotate cells where `wilcoxon_p` or `mcnemar_p` are below your chosen significance threshold.
  - Treat Lesion and Rescue just like Add/Random, now that you have full data.
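
As an illustration of the annotation step, the sketch below formats one grid cell label from `delta_mean` and a p-value. The helper names `significance_label` and `annotate` and the default threshold of 0.05 are assumptions, not part of `04_mode_comparison_grid.ipynb`.

```python
import math


def significance_label(p_value: float, threshold: float = 0.05) -> str:
    """Return '*' when p_value is below threshold, '' otherwise (including NaN)."""
    if p_value is None or math.isnan(p_value):
        return ""
    return "*" if p_value < threshold else ""


def annotate(delta_mean: float, p_value: float, threshold: float = 0.05) -> str:
    """Format a cell label such as '+0.20*' for a grid annotation."""
    if math.isnan(delta_mean):
        return "n/a"  # empty cell: nothing to report
    return f"{delta_mean:+.2f}{significance_label(p_value, threshold)}"


print(annotate(0.2, 0.0625))          # +0.20
print(annotate(0.2, 0.021983))        # +0.20*
print(annotate(0.0, float("nan")))    # +0.00
print(annotate(float("nan"), 1.0))    # n/a
```

NaN p-values are treated as not significant rather than raising, since cells with no variation in their deltas can yield an undefined test result.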