# Phase B Scaled Statistics Notebook (Layer 25 Only)

This notebook computes statistical tests for **Add**, **Lesion**, and **Rescue** interventions at Layer 25 using `vm_results/phase_b_scaled`.

Data structure:
- Layer: **25 only**
- Add: α ∈ {0.5, 1.0, 2.0}
- Lesion: γ ∈ {0.5, 1.0, 2.0}
- Rescue: γ=1.0 with β ∈ {0.5, 1.0, 2.0} (TRIPLET format: original, lesion, rescue)
- Localities: `cot`, `answer`

All p-values are computed from real per-example data.

The two output CSVs (`phase_b_scaled_stats.csv` and `rescue_triplet_stats.csv`) can be loaded by `08_phase_b_scaled_grid.ipynb` for visualization.

In [7]:
import sys
sys.path.insert(0, '../..')

from pathlib import Path
import re

import numpy as np
import pandas as pd
from scipy import stats as scipy_stats

from evaluation import load_data, statistical_tests as st

# Base directories
VM_RESULTS_DIR = Path("../../vm_results/phase_b_scaled/")
OUTPUT_DIR = Path("../outputs/phase_b_scaled")
OUTPUT_DIR.mkdir(parents=True, exist_ok=True)

print("Setup complete.")
print(f"Looking for experiments in: {VM_RESULTS_DIR.resolve()}")

Setup complete.
Looking for experiments in: /Users/washieu/Desktop/MIT/F25/64610_project/rvs/vm_results/phase_b_scaled


## 1. Load Phase B Scaled Results for COT and Answer Locality

We locate experiment directories matching the pattern `*__{locality}_locality_L25_scaled_*`.
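
Because directory names end in a `YYYYMMDD_HHMMSS` timestamp, sorting the matches lexicographically puts the most recent run last. The sketch below illustrates that selection on plain strings rather than on the filesystem; `latest_match` and the earlier `20251201_090000` run are hypothetical, made up for illustration.

```python
from fnmatch import fnmatch

# Illustrative directory names. The 20251201_090000 run is hypothetical.
names = [
    "Qwen2.5-7B-Instruct__cot_locality_L25_scaled_20251203_174033",
    "Qwen2.5-7B-Instruct__cot_locality_L25_scaled_20251201_090000",
    "Qwen2.5-7B-Instruct__answer_locality_L25_scaled_20251203_174033",
]


def latest_match(names, locality):
    """Return the lexicographically last name matching the locality pattern."""
    pattern = f"*__{locality}_locality_L25_scaled_*"
    matches = sorted(n for n in names if fnmatch(n, pattern))
    if not matches:
        raise FileNotFoundError(f"No name matching {pattern}")
    return matches[-1]


latest_cot = latest_match(names, "cot")
latest_answer = latest_match(names, "answer")
print(latest_cot)
print(latest_answer)
```

This ordering is only chronological when all matches share the same prefix (here, the same model name); with mixed prefixes the prefix would dominate the sort.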

In [8]:
def parse_layer_value(val):
    """Convert layer value to integer, handling strings like 'L25'."""
    if pd.isna(val):
        return np.nan
    # Include NumPy scalar types: np.int64 is not a subclass of Python int
    if isinstance(val, (int, float, np.integer, np.floating)):
        return int(val)
    if isinstance(val, str):
        # Handle 'L25' format
        match = re.search(r'(\d+)', str(val))
        if match:
            return int(match.group(1))
    return np.nan


def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Clean up the DataFrame, fixing column types."""
    df = df.copy()
    
    # Fix layer column - convert 'L25' strings to integers
    if 'layer' in df.columns:
        df['layer'] = df['layer'].apply(parse_layer_value)
    
    # Fix numeric columns that might be strings
    for col in ['alpha', 'gamma', 'beta']:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce')
    
    return df


def find_exp_dir(locality: str, base_dir: Path = VM_RESULTS_DIR) -> Path:
    """Find the experiment directory for a given locality in phase_b_scaled.

    Expects names like:
        Qwen2.5-7B-Instruct__cot_locality_L25_scaled_YYYYMMDD_HHMMSS
        Qwen2.5-7B-Instruct__answer_locality_L25_scaled_YYYYMMDD_HHMMSS
    """
    pattern = f"*__{locality}_locality_L25_scaled_*"
    exp_dirs = sorted(base_dir.glob(pattern))
    if not exp_dirs:
        raise FileNotFoundError(f"No experiment directory matching pattern {pattern} in {base_dir}")
    # Names end in a YYYYMMDD_HHMMSS timestamp, so the last sorted match is the most recent run
    exp_dir = exp_dirs[-1]
    print(f"Using {locality} experiment: {exp_dir}")
    return exp_dir


def load_phase_b_results(locality: str) -> pd.DataFrame:
    """Load all per-example results for a given locality from vm_results/phase_b_scaled."""
    exp_dir = find_exp_dir(locality)
    df = load_data.load_experiment_results(exp_dir)
    
    # Clean up data types
    df = clean_dataframe(df)
    
    # Ensure locality column is present and consistent
    df['locality'] = locality
    return df


cot_df = load_phase_b_results('cot')
answer_df = load_phase_b_results('answer')

print("\nCOT locality data summary:")
load_data.print_data_summary(cot_df)

print("\nAnswer locality data summary:")
load_data.print_data_summary(answer_df)

Using cot experiment: ../../vm_results/phase_b_scaled/Qwen2.5-7B-Instruct__cot_locality_L25_scaled_20251203_174033
Using answer experiment: ../../vm_results/phase_b_scaled/Qwen2.5-7B-Instruct__answer_locality_L25_scaled_20251203_174033

COT locality data summary:
DATA SUMMARY
Total rows: 900
Unique examples: 100
Models: ['Qwen2.5-7B-Instruct']
Datasets: ['mmlu_pro']
Modes: ['add', 'lesion', 'rescue']
Layers: [25]
Alpha values: [nan, 0.5, 1.0, 2.0]
Localities: ['cot']

Answer locality data summary:
DATA SUMMARY
Total rows: 900
Unique examples: 100
Models: ['Qwen2.5-7B-Instruct']
Datasets: ['mmlu_pro']
Modes: ['add', 'lesion', 'rescue']
Layers: [25]
Alpha values: [nan, 0.5, 1.0, 2.0]
Localities: ['answer']


## 2. Combine and Split by Experiment Type

We combine the COT and Answer data into a single DataFrame, then create views for:
- Paired Add (this scaled dataset contains no Random control)
- Lesion
- Rescue (triplet format)

In [9]:
full_df = pd.concat([cot_df, answer_df], ignore_index=True)
print(f"Combined rows (all modes, both localities): {len(full_df)}")

# Verify layer column is numeric
print(f"\nLayer column dtype: {full_df['layer'].dtype}")
print(f"Layer unique values: {sorted(full_df['layer'].dropna().unique())}")

# Paired add (no random in scaled data)
paired_df = load_data.get_paired_experiments(full_df)

# Lesion-only
lesion_df = load_data.get_lesion_experiments(full_df)

# Rescue-only (triplet format - has different columns)
rescue_df = load_data.get_rescue_experiments(full_df)

print(f"\nPaired (add) rows: {len(paired_df)}")
print(f"Lesion rows: {len(lesion_df)}")
print(f"Rescue rows: {len(rescue_df)}")

# Basic grids inferred from data
layers = sorted([int(x) for x in full_df['layer'].dropna().unique()]) if 'layer' in full_df.columns else []
alphas = sorted(full_df.get('alpha', pd.Series(dtype=float)).dropna().unique().tolist())
gammas = sorted(full_df.get('gamma', pd.Series(dtype=float)).dropna().unique().tolist())
betas = sorted(full_df.get('beta', pd.Series(dtype=float)).dropna().unique().tolist())
localities = sorted(full_df['locality'].unique()) if 'locality' in full_df.columns else []

print("\nParameters found in data:")
print("Layers:", layers)
print("Alpha values:", alphas)
print("Gamma values:", gammas)
print("Beta values:", betas)
print("Localities:", localities)

# Check if rescue data has triplet format
if not rescue_df.empty:
    print("\nRescue columns:", list(rescue_df.columns))
    has_triplet = 'lesion_answer_correct' in rescue_df.columns and 'rescue_answer_correct' in rescue_df.columns
    print(f"Rescue triplet format detected: {has_triplet}")

Combined rows (all modes, both localities): 1800

Layer column dtype: int64
Layer unique values: [np.int64(25)]

Paired (add) rows: 600
Lesion rows: 600
Rescue rows: 600

Parameters found in data:
Layers: [25]
Alpha values: [0.5, 1.0, 2.0]
Gamma values: [0.5, 1.0, 2.0]
Beta values: [0.5, 1.0, 2.0]
Localities: ['answer', 'cot']

Rescue columns: ['example_id', 'baseline_answer_correct', 'intv_answer_correct', 'experiment_type', 'mode', 'layer', 'gamma', 'locality', 'source_file', 'alpha', 'lesion_answer_correct', 'rescue_answer_correct', 'beta', 'dataset', 'model', 'experiment_name']
Rescue triplet format detected: True


## 3. Statistical Summary Functions

### 3a. Paired data (Add/Lesion)
For Add and Lesion, we compute:
- Per-example deltas: `delta = intv_answer_correct − baseline_answer_correct`
- Tests: Wilcoxon signed-rank, one-sample t-test of the deltas against zero, sign test, McNemar
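
As a small worked example of the paired quantities, the sketch below computes per-example deltas and an exact two-sided McNemar p-value on toy data. It is dependency-free and only an illustration: `exact_mcnemar_p` is a hypothetical helper, the toy correctness vectors are invented, and whether `st.run_mcnemar_test` uses this same exact variant is an assumption, since its implementation is not shown in this notebook.

```python
from math import comb


def exact_mcnemar_p(gained: int, lost: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = gained + lost
    if n == 0:
        return float("nan")  # no discordant pairs, test undefined
    k = min(gained, lost)
    # Binomial(n, 0.5) lower tail, doubled for a two-sided test and capped at 1
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy per-example correctness (1 = correct, 0 = incorrect); invented values
baseline = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
intv = [1, 1, 1, 1, 0, 1, 1, 0, 0, 1]

deltas = [i - b for b, i in zip(baseline, intv)]  # +1 gained, -1 lost, 0 unchanged
gained = deltas.count(1)
lost = deltas.count(-1)
delta_mean = sum(deltas) / len(deltas)
p_value = exact_mcnemar_p(gained, lost)

print(f"delta_mean={delta_mean:.2f} net_gain={gained - lost} p={p_value:.4f}")
```

For binary outcomes the exact McNemar test and the exact sign test both reduce to a binomial test on the discordant pairs, which would explain why `mcnemar_p` and `sign_p` coincide in the results. As a plausibility check, `exact_mcnemar_p(7, 2)` gives 0.1796875, the value reported for Add at α=0.5 (answer locality, net gain 5); the actual discordant counts are not printed in the output, so this is consistent with the table rather than confirmed by it.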

### 3b. Triplet data (Rescue)
For Rescue triplets, we compute:
- `delta_lesion = lesion - original` (should be negative)
- `delta_rescue = rescue - lesion` (should be positive)
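- `recovery_rate = (rescue - lesion) / (original - lesion)`, defined only when `original > lesion` (NaN otherwise); values above 1 mean rescue accuracy exceeds the original accuracy
- P-values (exact McNemar tests) for: original vs lesion, lesion vs rescue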
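
The recovery-rate arithmetic can be checked by hand. The sketch below applies the formula to two sets of accuracies taken from the answer-locality rows of the rescue summary reported later in this notebook; the standalone `recovery_rate` function is a hypothetical helper written for illustration.

```python
import math


def recovery_rate(acc_original: float, acc_lesion: float, acc_rescue: float) -> float:
    """Fraction of the lesion-induced accuracy drop that rescue wins back."""
    drop = acc_original - acc_lesion
    if drop <= 0:
        return math.nan  # lesion did not reduce accuracy, so the ratio is undefined
    return (acc_rescue - acc_lesion) / drop


# Answer locality, beta = 0.5: original 0.66, lesion 0.46, rescue 0.60
partial = recovery_rate(0.66, 0.46, 0.60)

# Answer locality, beta = 2.0: original 0.63, lesion 0.43, rescue 0.82
overshoot = recovery_rate(0.63, 0.43, 0.82)

print(f"beta=0.5: {partial:.2f}")
print(f"beta=2.0: {overshoot:.2f}")
```

The first case recovers 70% of the drop (0.14 / 0.20); the second gives 1.95 (0.39 / 0.20), meaning rescued accuracy ends up well above the original rather than merely restoring it.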

In [10]:
def summarize_paired_cell(
    df: pd.DataFrame,
    mode: str,
    layer: int,
    locality: str,
    param_col: str,
    param_value: float,
    param_type: str,
    metric: str = 'answer',
) -> dict:
    """Summarize one (mode, layer, locality, param) cell for paired data (Add/Lesion)."""
    # Build mask for filtering
    mask = (
        (df['mode'] == mode) &
        (df['layer'] == layer) &
        (df['locality'] == locality)
    )
    
    if param_col in df.columns:
        mask &= df[param_col] == param_value
    else:
        return _empty_paired_result(mode, layer, param_value, param_type, locality)

    cell = df[mask]

    if cell.empty:
        return _empty_paired_result(mode, layer, param_value, param_type, locality)

    base_col = f'baseline_{metric}_correct'
    intv_col = f'intv_{metric}_correct'
    if base_col not in cell.columns or intv_col not in cell.columns:
        print(f"Warning: missing columns {base_col} / {intv_col}")
        return _empty_paired_result(mode, layer, param_value, param_type, locality)

    c = cell.copy()
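    for col in [base_col, intv_col]:
        if c[col].dtype == object:
            c[col] = c[col].map({'True': 1, 'False': 0, True: 1, False: 0})
        # NOTE: missing or unparseable values are counted as incorrect (0)
        c[col] = pd.to_numeric(c[col], errors='coerce').fillna(0).astype(int)

    # fillna(0) above leaves no NaNs, so dropna() here is only a safeguard
    deltas = (c[intv_col] - c[base_col]).dropna()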
    n = len(deltas)
    if n == 0:
        return _empty_paired_result(mode, layer, param_value, param_type, locality)

    # Run statistical tests
    wilcox = st.run_wilcoxon_test(c, metric=metric)
    ttest = st.run_ttest_vs_zero(c, metric=metric)
    sign_res = st.run_sign_test(c, metric=metric)
    mcnemar_res = st.run_mcnemar_test(c, metric=metric)

    return {
        'mode': mode,
        'layer': int(layer),
        'param_value': float(param_value),
        'param_type': param_type,
        'locality': locality,
        'n': int(n),
        'delta_mean': float(deltas.mean()),
        'delta_median': float(deltas.median()),
        'mcnemar_p': mcnemar_res.get('p_value', np.nan),
        'mcnemar_net_gain': mcnemar_res.get('net_gain', np.nan),
        'wilcoxon_p': wilcox.get('p_value', np.nan),
        'sign_p': sign_res.get('p_value', np.nan),
        'ttest_p': ttest.get('p_value', np.nan),
    }


def _empty_paired_result(mode, layer, param_value, param_type, locality):
    return {
        'mode': mode,
        'layer': int(layer),
        'param_value': float(param_value),
        'param_type': param_type,
        'locality': locality,
        'n': 0,
        'delta_mean': np.nan,
        'delta_median': np.nan,
        'mcnemar_p': np.nan,
        'mcnemar_net_gain': np.nan,
        'wilcoxon_p': np.nan,
        'sign_p': np.nan,
        'ttest_p': np.nan,
    }


def summarize_rescue_triplet(
    df: pd.DataFrame,
    layer: int,
    locality: str,
    gamma: float,
    beta: float,
) -> dict:
    """Summarize rescue triplet data (original, lesion, rescue).
    
    Computes:
    - acc_original, acc_lesion, acc_rescue
    - delta_lesion = lesion - original (should be negative)
    - delta_rescue = rescue - lesion (should be positive)
    - recovery_rate = (rescue - lesion) / (original - lesion)
    - P-values: original vs lesion (McNemar), lesion vs rescue (McNemar)
    """
    mask = (
        (df['mode'] == 'rescue') &
        (df['layer'] == layer) &
        (df['locality'] == locality) &
        (df['gamma'] == gamma) &
        (df['beta'] == beta)
    )
    
    cell = df[mask]
    
    if cell.empty:
        return _empty_rescue_result(layer, locality, gamma, beta)
    
    # Check for triplet columns
    orig_col = 'baseline_answer_correct'
    lesion_col = 'lesion_answer_correct'
    rescue_col = 'rescue_answer_correct'
    
    if lesion_col not in cell.columns or rescue_col not in cell.columns:
        print(f"Warning: Triplet columns not found. Available: {list(cell.columns)}")
        return _empty_rescue_result(layer, locality, gamma, beta)
    
    c = cell.copy()
    for col in [orig_col, lesion_col, rescue_col]:
        if c[col].dtype == object:
            c[col] = c[col].map({'True': 1, 'False': 0, True: 1, False: 0})
        c[col] = pd.to_numeric(c[col], errors='coerce').fillna(0).astype(int)
    
    n = len(c)
    
    acc_original = c[orig_col].mean()
    acc_lesion = c[lesion_col].mean()
    acc_rescue = c[rescue_col].mean()
    
    delta_lesion = acc_lesion - acc_original
    delta_rescue = acc_rescue - acc_lesion
    
    # Recovery rate
    if acc_original > acc_lesion:
        recovery_rate = (acc_rescue - acc_lesion) / (acc_original - acc_lesion)
    else:
        recovery_rate = np.nan
    
    # McNemar test: original vs lesion
    # Count transitions
    orig_right_lesion_wrong = ((c[orig_col] == 1) & (c[lesion_col] == 0)).sum()
    orig_wrong_lesion_right = ((c[orig_col] == 0) & (c[lesion_col] == 1)).sum()
    
    if orig_right_lesion_wrong + orig_wrong_lesion_right > 0:
        # Use exact binomial test for McNemar
        try:
            mcnemar_lesion_p = scipy_stats.binom_test(
                min(orig_right_lesion_wrong, orig_wrong_lesion_right),
                orig_right_lesion_wrong + orig_wrong_lesion_right,
                0.5
            )
        except AttributeError:
            # binom_test was deprecated in SciPy 1.10 and removed in 1.12;
            # binomtest (added in SciPy 1.7) is its replacement
            res = scipy_stats.binomtest(
                min(orig_right_lesion_wrong, orig_wrong_lesion_right),
                orig_right_lesion_wrong + orig_wrong_lesion_right,
                0.5
            )
            mcnemar_lesion_p = res.pvalue
    else:
        mcnemar_lesion_p = np.nan
    
    # McNemar test: lesion vs rescue
    lesion_wrong_rescue_right = ((c[lesion_col] == 0) & (c[rescue_col] == 1)).sum()
    lesion_right_rescue_wrong = ((c[lesion_col] == 1) & (c[rescue_col] == 0)).sum()
    
    if lesion_wrong_rescue_right + lesion_right_rescue_wrong > 0:
        try:
            mcnemar_rescue_p = scipy_stats.binom_test(
                min(lesion_wrong_rescue_right, lesion_right_rescue_wrong),
                lesion_wrong_rescue_right + lesion_right_rescue_wrong,
                0.5
            )
        except AttributeError:
            res = scipy_stats.binomtest(
                min(lesion_wrong_rescue_right, lesion_right_rescue_wrong),
                lesion_wrong_rescue_right + lesion_right_rescue_wrong,
                0.5
            )
            mcnemar_rescue_p = res.pvalue
    else:
        mcnemar_rescue_p = np.nan
    
    # Full recovery count
    full_recovery = ((c[orig_col] == 1) & (c[lesion_col] == 0) & (c[rescue_col] == 1)).sum()
    
    return {
        'mode': 'rescue',
        'layer': int(layer),
        'param_value': float(beta),
        'param_type': 'beta',
        'locality': locality,
        'gamma': float(gamma),
        'n': int(n),
        'acc_original': float(acc_original),
        'acc_lesion': float(acc_lesion),
        'acc_rescue': float(acc_rescue),
        'delta_lesion': float(delta_lesion),
        'delta_rescue': float(delta_rescue),
        'recovery_rate': float(recovery_rate) if not np.isnan(recovery_rate) else np.nan,
        'mcnemar_lesion_p': float(mcnemar_lesion_p) if not np.isnan(mcnemar_lesion_p) else np.nan,
        'mcnemar_rescue_p': float(mcnemar_rescue_p) if not np.isnan(mcnemar_rescue_p) else np.nan,
        'full_recovery': int(full_recovery),
    }


def _empty_rescue_result(layer, locality, gamma, beta):
    return {
        'mode': 'rescue',
        'layer': int(layer),
        'param_value': float(beta),
        'param_type': 'beta',
        'locality': locality,
        'gamma': float(gamma),
        'n': 0,
        'acc_original': np.nan,
        'acc_lesion': np.nan,
        'acc_rescue': np.nan,
        'delta_lesion': np.nan,
        'delta_rescue': np.nan,
        'recovery_rate': np.nan,
        'mcnemar_lesion_p': np.nan,
        'mcnemar_rescue_p': np.nan,
        'full_recovery': 0,
    }


print("Summary functions defined.")

Summary functions defined.


## 4. Compute Summaries by Mode

We build summaries for:
- **Add** (α grid) - paired format
- **Lesion** (γ grid) - paired format
- **Rescue** (β grid at γ=1.0) - triplet format
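
The expected number of summary rows follows from the grid itself: 2 localities × 1 layer × 3 parameter values per mode. The sketch below enumerates those cells with `itertools.product` using the parameter values found in the data; the `grids` dictionary and the `cells` structure are illustrative assumptions, not part of the notebook's own code.

```python
from itertools import product

localities = ["answer", "cot"]
layers = [25]

# One swept parameter per mode (values as found in the data)
grids = {
    "add": ("alpha", [0.5, 1.0, 2.0]),
    "lesion": ("gamma", [0.5, 1.0, 2.0]),
    "rescue": ("beta", [0.5, 1.0, 2.0]),  # at fixed gamma = 1.0
}

cells = [
    {
        "mode": mode,
        "locality": locality,
        "layer": layer,
        "param_type": param_type,
        "param_value": value,
    }
    for mode, (param_type, values) in grids.items()
    for locality, layer, value in product(localities, layers, values)
]

paired_cells = [c for c in cells if c["mode"] in ("add", "lesion")]
rescue_cells = [c for c in cells if c["mode"] == "rescue"]
print(len(paired_cells), len(rescue_cells))
```

This gives 12 paired cells and 6 rescue cells, which is a convenient check against the row counts of the two summary tables.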

In [11]:
paired_rows = []
rescue_rows = []

# Since we only have layer 25, use that
target_layers = [25] if 25 in layers else layers
print(f"Target layers: {target_layers}")

# === Add mode (paired) ===
for locality in localities:
    for layer in target_layers:
        for alpha in alphas:
            paired_rows.append(
                summarize_paired_cell(
                    df=paired_df,
                    mode='add',
                    layer=int(layer),
                    locality=locality,
                    param_col='alpha',
                    param_value=float(alpha),
                    param_type='alpha',
                )
            )

# === Lesion mode (paired) ===
for locality in localities:
    for layer in target_layers:
        for gamma in gammas:
            paired_rows.append(
                summarize_paired_cell(
                    df=lesion_df,
                    mode='lesion',
                    layer=int(layer),
                    locality=locality,
                    param_col='gamma',
                    param_value=float(gamma),
                    param_type='gamma',
                )
            )

# === Rescue mode (triplet) ===
# Fixed gamma=1.0, sweep beta
RESCUE_GAMMA = 1.0
print(f"\nUsing gamma={RESCUE_GAMMA} for rescue triplet analysis")

for locality in localities:
    for layer in target_layers:
        for beta in betas:
            rescue_rows.append(
                summarize_rescue_triplet(
                    df=rescue_df,
                    layer=int(layer),
                    locality=locality,
                    gamma=RESCUE_GAMMA,
                    beta=float(beta),
                )
            )

paired_summary = pd.DataFrame(paired_rows)
rescue_summary = pd.DataFrame(rescue_rows)

print(f"\nPaired summary (Add/Lesion): {len(paired_summary)} rows")
print(f"Rescue triplet summary: {len(rescue_summary)} rows")

Target layers: [25]

Using gamma=1.0 for rescue triplet analysis

Paired summary (Add/Lesion): 12 rows
Rescue triplet summary: 6 rows


## 5. Display and Save Results

In [12]:
print("=== Paired Summary (Add/Lesion) ===")
display(paired_summary)

print("\n=== Rescue Triplet Summary ===")
display(rescue_summary)

# Save to CSV
paired_path = OUTPUT_DIR / 'phase_b_scaled_stats.csv'
paired_summary.to_csv(paired_path, index=False)
print(f"\nSaved paired summary to: {paired_path}")

rescue_path = OUTPUT_DIR / 'rescue_triplet_stats.csv'
rescue_summary.to_csv(rescue_path, index=False)
print(f"Saved rescue triplet summary to: {rescue_path}")

=== Paired Summary (Add/Lesion) ===


|   | mode | layer | param_value | param_type | locality | n | delta_mean | delta_median | mcnemar_p | mcnemar_net_gain | wilcoxon_p | sign_p | ttest_p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | add | 25 | 0.5 | alpha | answer | 100 | 0.05 | 0.0 | 0.1796875 | 5 | 0.1796875 | 0.1796875 | 0.09575263 |
| 1 | add | 25 | 1.0 | alpha | answer | 100 | 0.13 | 0.0 | 0.0009765625 | 13 | 0.000789113 | 0.0009765625 | 0.0006003468 |
| 2 | add | 25 | 2.0 | alpha | answer | 100 | 0.27 | 0.0 | 1.117587e-07 | 27 | 5.337264e-07 | 1.117587e-07 | 9.264669e-08 |
| 3 | add | 25 | 0.5 | alpha | cot | 100 | 0.12 | 0.0 | 0.004180908 | 12 | 0.002699796 | 0.004180908 | 0.002303966 |
| 4 | add | 25 | 1.0 | alpha | cot | 100 | 0.1 | 0.0 | 0.006347656 | 10 | 0.006347656 | 0.006347656 | 0.003415508 |
| 5 | add | 25 | 2.0 | alpha | cot | 100 | 0.27 | 0.0 | 1.490116e-08 | 27 | 2.034555e-07 | 1.490116e-08 | 2.575595e-08 |
| 6 | lesion | 25 | 0.5 | gamma | answer | 100 | -0.12 | 0.0 | 0.0004882812 | -12 | 0.0004882812 | 0.0004882812 | 0.0003873122 |
| 7 | lesion | 25 | 1.0 | gamma | answer | 100 | -0.19 | 0.0 | 3.814697e-06 | -19 | 1.307185e-05 | 3.814697e-06 | 5.210966e-06 |
| 8 | lesion | 25 | 2.0 | gamma | answer | 100 | -0.3 | 0.0 | 1.862645e-09 | -30 | 4.320463e-08 | 1.862645e-09 | 3.071123e-09 |
| 9 | lesion | 25 | 0.5 | gamma | cot | 100 | -0.12 | 0.0 | 0.004180908 | -12 | 0.002699796 | 0.004180908 | 0.002303966 |



=== Rescue Triplet Summary ===


|   | mode | layer | param_value | param_type | locality | gamma | n | acc_original | acc_lesion | acc_rescue | delta_lesion | delta_rescue | recovery_rate | mcnemar_lesion_p | mcnemar_rescue_p | full_recovery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | rescue | 25 | 0.5 | beta | answer | 1.0 | 100 | 0.66 | 0.46 | 0.6 | -0.2 | 0.14 | 0.7 | 1.096725e-05 | 0.0005187988 | 1 |
| 1 | rescue | 25 | 1.0 | beta | answer | 1.0 | 100 | 0.59 | 0.38 | 0.76 | -0.21 | 0.38 | 1.809524 | 9.536743e-07 | 7.457857e-11 | 11 |
| 2 | rescue | 25 | 2.0 | beta | answer | 1.0 | 100 | 0.63 | 0.43 | 0.82 | -0.2 | 0.39 | 1.95 | 1.907349e-06 | 2.153229e-10 | 13 |
| 3 | rescue | 25 | 0.5 | beta | cot | 1.0 | 100 | 0.74 | 0.45 | 0.65 | -0.29 | 0.2 | 0.689655 | 1.308508e-07 | 1.096725e-05 | 12 |
| 4 | rescue | 25 | 1.0 | beta | cot | 1.0 | 100 | 0.7 | 0.46 | 0.72 | -0.24 | 0.26 | 1.083333 | 8.046627e-07 | 8.679926e-07 | 12 |
| 5 | rescue | 25 | 2.0 | beta | cot | 1.0 | 100 | 0.64 | 0.4 | 0.92 | -0.24 | 0.52 | 2.166667 | 1.192093e-07 | 4.440892e-16 | 21 |



Saved paired summary to: ../outputs/phase_b_scaled/phase_b_scaled_stats.csv
Saved rescue triplet summary to: ../outputs/phase_b_scaled/rescue_triplet_stats.csv


## 6. Quick Data Inspection

Preview mean deltas and McNemar p-values by mode and locality, followed by rescue accuracies by β.

In [13]:
print("\n=== Add/Lesion: Delta by Mode and Locality ===")

for loc in localities:
    print(f"\n{loc.upper()} Locality:")
    loc_data = paired_summary[paired_summary['locality'] == loc].copy()
    if not loc_data.empty:
        pivot = loc_data.pivot_table(
            index='param_value', 
            columns='mode', 
            values='delta_mean'
        )
        display(pivot)

print("\n=== Add/Lesion: P-values (McNemar's test) ===")

for loc in localities:
    print(f"\n{loc.upper()} Locality:")
    loc_data = paired_summary[paired_summary['locality'] == loc].copy()
    if not loc_data.empty:
        pivot_p = loc_data.pivot_table(
            index='param_value', 
            columns='mode', 
            values='mcnemar_p'
        )
        display(pivot_p)


=== Add/Lesion: Delta by Mode and Locality ===

ANSWER Locality:


| param_value | add | lesion |
|---|---|---|
| 0.5 | 0.05 | -0.12 |
| 1.0 | 0.13 | -0.19 |
| 2.0 | 0.27 | -0.3 |



COT Locality:


| param_value | add | lesion |
|---|---|---|
| 0.5 | 0.12 | -0.12 |
| 1.0 | 0.1 | -0.22 |
| 2.0 | 0.27 | -0.31 |



=== Add/Lesion: P-values (McNemar's test) ===

ANSWER Locality:


| param_value | add | lesion |
|---|---|---|
| 0.5 | 0.1796875 | 0.0004882812 |
| 1.0 | 0.0009765625 | 3.814697e-06 |
| 2.0 | 1.117587e-07 | 1.862645e-09 |



COT Locality:


| param_value | add | lesion |
|---|---|---|
| 0.5 | 0.004180908 | 0.004180908 |
| 1.0 | 0.006347656 | 2.980232e-06 |
| 2.0 | 1.490116e-08 | 9.313226e-10 |


In [14]:
print("\n=== Rescue Triplet: Accuracies by Beta and Locality ===")

for loc in localities:
    print(f"\n{loc.upper()} Locality:")
    loc_data = rescue_summary[rescue_summary['locality'] == loc].copy()
    if not loc_data.empty:
        display(loc_data[['param_value', 'acc_original', 'acc_lesion', 'acc_rescue', 'recovery_rate', 'mcnemar_rescue_p']])

print("\n=== Rescue Interpretation ===")
print("- acc_original: Performance without intervention (~65%)")
print("- acc_lesion: Performance after lesion (degraded)")
print("- acc_rescue: Performance after lesion + rescue (recovered)")
print("- recovery_rate: (rescue - lesion) / (original - lesion)")
print("- mcnemar_rescue_p: P-value testing if rescue significantly improves over lesion")


=== Rescue Triplet: Accuracies by Beta and Locality ===

ANSWER Locality:


|   | param_value | acc_original | acc_lesion | acc_rescue | recovery_rate | mcnemar_rescue_p |
|---|---|---|---|---|---|---|
| 0 | 0.5 | 0.66 | 0.46 | 0.6 | 0.7 | 0.0005187988 |
| 1 | 1.0 | 0.59 | 0.38 | 0.76 | 1.809524 | 7.457857e-11 |
| 2 | 2.0 | 0.63 | 0.43 | 0.82 | 1.95 | 2.153229e-10 |



COT Locality:


|   | param_value | acc_original | acc_lesion | acc_rescue | recovery_rate | mcnemar_rescue_p |
|---|---|---|---|---|---|---|
| 3 | 0.5 | 0.74 | 0.45 | 0.65 | 0.689655 | 1.096725e-05 |
| 4 | 1.0 | 0.7 | 0.46 | 0.72 | 1.083333 | 8.679926e-07 |
| 5 | 2.0 | 0.64 | 0.4 | 0.92 | 2.166667 | 4.440892e-16 |



=== Rescue Interpretation ===
- acc_original: Performance without intervention (~65%)
- acc_lesion: Performance after lesion (degraded)
- acc_rescue: Performance after lesion + rescue (recovered)
- recovery_rate: (rescue - lesion) / (original - lesion)
- mcnemar_rescue_p: P-value testing if rescue significantly improves over lesion


## 7. Notes on Usage

### Output Files

1. **`phase_b_scaled_stats.csv`** - Paired data (Add/Lesion)
   - Columns: `mode`, `layer`, `param_value`, `param_type`, `locality`
   - Stats: `n`, `delta_mean`, `delta_median`, `mcnemar_net_gain`
   - P-values: `mcnemar_p`, `wilcoxon_p`, `sign_p`, `ttest_p`

2. **`rescue_triplet_stats.csv`** - Triplet data (Rescue)
   - Columns: `mode`, `layer`, `param_value`, `param_type`, `locality`, `gamma`
   - Counts: `n`, `full_recovery`
   - Accuracies: `acc_original`, `acc_lesion`, `acc_rescue`
   - Deltas: `delta_lesion`, `delta_rescue`, `recovery_rate`
   - P-values: `mcnemar_lesion_p`, `mcnemar_rescue_p`

### Loading in Notebook 08

```python
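import pandas as pd
# Files are written to ../outputs/phase_b_scaled (relative to this notebook); adjust if notebook 08 lives elsewhere
paired_stats = pd.read_csv('../outputs/phase_b_scaled/phase_b_scaled_stats.csv')
rescue_stats = pd.read_csv('../outputs/phase_b_scaled/rescue_triplet_stats.csv')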
```
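
Once the files are loaded, a typical first step is to flag which cells pass a significance threshold. The following is a dependency-free sketch using the standard-library `csv` module on an inline sample; the sample holds three Add rows from the paired summary trimmed to a few columns, and the 0.05 threshold `ALPHA_LEVEL` is a hypothetical choice, not something this notebook prescribes.

```python
import csv
import io

# Three Add rows (answer locality) from the paired summary, trimmed to a few columns
sample_csv = """mode,layer,param_value,param_type,locality,n,delta_mean,mcnemar_p
add,25,0.5,alpha,answer,100,0.05,0.1796875
add,25,1.0,alpha,answer,100,0.13,0.0009765625
add,25,2.0,alpha,answer,100,0.27,1.117587e-07
"""

ALPHA_LEVEL = 0.05  # hypothetical significance threshold

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Keep (mode, param_value, locality) for cells whose McNemar p-value is below the threshold
significant = [
    (row["mode"], float(row["param_value"]), row["locality"])
    for row in rows
    if float(row["mcnemar_p"]) < ALPHA_LEVEL
]
print(significant)
```

With these rows, the α=0.5 cell is not flagged while α=1.0 and α=2.0 are. The same filter is a one-liner in pandas, for example a boolean mask on the `mcnemar_p` column.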