# 🚀 Comprehensive Evaluation Framework for Wisent Guard

This notebook provides an interactive interface for running comprehensive evaluations that properly separate:

1. **🎯 Benchmark Performance**: How well the model solves mathematical problems
2. **🔍 Probe Performance**: How well probes detect correctness from model activations
3. **⚙️ DAC Hyperparameter Optimization**: Grid search to find optimal DAC configurations

## Key Features:
- **Real Data Integration**: Uses GSM8KExtractor to get contrastive pairs from training data
- **DAC Hyperparameter Grid Search**: Systematic optimization of entropy_threshold, ptop, and max_alpha
- **Real-time Progress**: Live updates during evaluation with tqdm
- **Rich Visualizations**: Comprehensive plots and analysis
- **Modular Design**: Clean separation of concerns
- **Export Results**: Save results and generate reports

## DAC Hyperparameters:
- **entropy_threshold**: Controls dynamic steering based on entropy (default: 1.0)
- **ptop**: Probability threshold for KL-based dynamic control (default: 0.4)
- **max_alpha**: Maximum steering intensity (default: 2.0)
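
As a rough illustration of what a grid search over these three values involves, the sketch below enumerates every combination with `itertools.product`. The helper name `dac_grid` and the dictionary layout are assumptions made for this example, not part of the wisent_guard API; the candidate values are borrowed from the example comments in the configuration cell.

```python
from itertools import product


def dac_grid(entropy_thresholds, ptop_values, max_alpha_values):
    """Return one dict per hyperparameter combination (hypothetical helper)."""
    return [
        {"entropy_threshold": e, "ptop": p, "max_alpha": a}
        for e, p, a in product(entropy_thresholds, ptop_values, max_alpha_values)
    ]


grid = dac_grid([0.5, 1.0, 1.5], [0.3, 0.4, 0.5], [1.5, 2.0, 2.5])
print(len(grid))  # 3 * 3 * 3 = 27 combinations
print(grid[0])    # first combination: lowest value of each list
```

The grid grows multiplicatively, so adding one more value to any list increases the number of steering runs by the product of the other two list lengths.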

## 📋 Setup and Imports

In [1]:
# Core imports
import sys
import os
import warnings
warnings.filterwarnings('ignore')

# Set HuggingFace cache to permanent directory
os.environ['HF_HOME'] = '/workspace/.cache/huggingface'
os.environ['TRANSFORMERS_CACHE'] = '/workspace/.cache/huggingface/transformers'
os.environ['HF_DATASETS_CACHE'] = '/workspace/.cache/huggingface/datasets'

# Create cache directories if they don't exist
os.makedirs('/workspace/.cache/huggingface/transformers', exist_ok=True)
os.makedirs('/workspace/.cache/huggingface/datasets', exist_ok=True)

# Add project root to path
project_root = '/workspace/wisent-guard'
if project_root not in sys.path:
    sys.path.append(project_root)

# Fix wandb connection issues
def fix_wandb_connection():
    """Fix wandb connection issues by properly initializing or disabling it."""
    try:
        import wandb
        
        # Check if wandb is already initialized
        if wandb.run is not None:
            print("⚠️ Cleaning up existing wandb run...")
            wandb.finish()
        
        # Clear any broken connections
        import subprocess
        try:
            # Kill any hanging wandb processes (pkill is not available on every system)
            subprocess.run(['pkill', '-f', 'wandb'], capture_output=True)
        except OSError:
            pass
        
        print("✅ Wandb connection cleaned up")
        return True
    except Exception as e:
        print(f"⚠️ Wandb cleanup warning: {e}")
        return False

# Check HuggingFace authentication
def check_hf_auth():
    """Check if user is logged into HuggingFace and show login instructions if needed."""
    try:
        import subprocess
        result = subprocess.run(['huggingface-cli', 'whoami'], capture_output=True, text=True)
        if result.returncode == 0:
            username = result.stdout.strip()
            print(f"✅ Logged into HuggingFace as: {username}")
            return True
        else:
            print("⚠️ Not logged into HuggingFace!")
            print("🔐 Please run: huggingface-cli login")
            print("   This is required to access datasets like AIME 2024/2025")
            return False
    except Exception as e:
        print(f"⚠️ Could not check HuggingFace authentication: {e}")
        print("🔐 If you encounter dataset loading issues, try: huggingface-cli login")
        return False

# Clean up wandb first
wandb_ok = fix_wandb_connection()

# Import comprehensive evaluation framework
from wisent_guard.core.evaluation.comprehensive import (
    ComprehensiveEvaluationConfig,
    ComprehensiveEvaluationPipeline,
    plot_evaluation_results,
    create_results_dashboard,
    generate_summary_report,
    calculate_comprehensive_metrics,
    generate_performance_summary
)

# Visualization and interactivity
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px

from IPython.display import display, HTML, Markdown

# Data manipulation
import pandas as pd
import numpy as np
import json
from datetime import datetime
from pathlib import Path

# Utilities
from tqdm.notebook import tqdm
import logging

print("✅ All imports successful!")
print(f"📍 Working directory: {os.getcwd()}")
print(f"🐍 Python version: {sys.version}")
print(f"💾 HuggingFace cache: {os.environ['HF_HOME']}")
print(f"🔗 Wandb status: {'✅ Ready' if wandb_ok else '⚠️ May have issues'}")
print()

# Check authentication
hf_authenticated = check_hf_auth()

✅ Wandb connection cleaned up
✅ All imports successful!
📍 Working directory: /workspace/wisent-guard/comprehensive_evaluation
🐍 Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0]
💾 HuggingFace cache: /workspace/.cache/huggingface
🔗 Wandb status: ✅ Ready

✅ Logged into HuggingFace as: jfpio


## ⚙️ Configuration

Edit the constants in the next cell to customize your evaluation. All parameters are clearly documented with examples.

In [2]:
# Configuration Constants - Edit these values to customize your evaluation

# Import math tasks from our task configuration
import sys
sys.path.append('/workspace/wisent-guard')
from wisent_guard.parameters.task_config import MATH_TASKS

# Import DAC steering method
from wisent_guard.core.steering_methods.dac import DAC

# Convert MATH_TASKS set to sorted list for easier selection
MATH_TASKS_LIST = sorted(MATH_TASKS)
print(f"📚 Available math tasks: {MATH_TASKS_LIST}")

# ============================================================================
# MAIN CONFIGURATION - Edit these constants to customize your evaluation
# ============================================================================

# Model configuration
# Examples: 'distilbert/distilgpt2' (small, for quick testing), 'gpt2', '/workspace/models/llama31-8b-instruct-hf', 'Qwen/Qwen3-8B'
MODEL_NAME = "/workspace/models/llama31-8b-instruct-hf"

# Dataset configuration - Choose from MATH_TASKS
TRAIN_DATASET = 'gsm8k'     # Training dataset - Change to any task from MATH_TASKS_LIST
VAL_DATASET = 'gsm8k'       # Validation dataset - Change to any task from MATH_TASKS_LIST  
TEST_DATASET = 'gsm8k'      # Test dataset - Change to any task from MATH_TASKS_LIST

# Validate dataset choices
for dataset, name in [(TRAIN_DATASET, 'TRAIN'), (VAL_DATASET, 'VAL'), (TEST_DATASET, 'TEST')]:
    if dataset not in MATH_TASKS:
        raise ValueError(f"{name}_DATASET '{dataset}' not in MATH_TASKS. Choose from: {MATH_TASKS_LIST}")

# Sample limits (small for quick testing)
TRAIN_LIMIT = 5   # Number of training samples
VAL_LIMIT = 5    # Number of validation samples  
TEST_LIMIT = 30     # Number of test samples

# Layer configuration - specify which layers to search during optimization
# Use small indices (e.g. [3]) for small models such as distilgpt2; [15] sits mid-depth in a 32-layer model
PROBE_LAYERS = [15]     # Examples: [2, 3, 4, 5], [8, 16, 24, 32], [5, 6, 7, 8]
STEERING_LAYERS = [15]  # Same as probe layers for now - Examples: [3, 4, 5], [16, 24, 32], [6, 8, 10]


# DAC Hyperparameters - specify arrays of values to search
ENTROPY_THRESHOLDS = [1.0]    # Examples: [0.5, 1.0, 1.5], [1.0, 2.0], [0.8, 1.2]
PTOP_VALUES = [0.5]            # Examples: [0.3, 0.4, 0.5], [0.4], [0.2, 0.6]  
MAX_ALPHA_VALUES = [2.0]       # Examples: [1.5, 2.0, 2.5], [2.0], [1.0, 3.0]

# Options
ENABLE_WANDB = False                        # Disable for quick testing
EXPERIMENT_NAME = 'dac_hyperparameter_search'  # Experiment name for logging

# ============================================================================
# AUTO-VALIDATION AND INFO
# ============================================================================

def detect_model_layers(model_name):
    """Detect number of layers in a model without loading it fully."""
    try:
        from transformers import AutoConfig
        config = AutoConfig.from_pretrained(model_name)
        
        # Different models store layer count differently
        if hasattr(config, 'n_layer'):
            return config.n_layer
        elif hasattr(config, 'num_hidden_layers'):
            return config.num_hidden_layers
        elif hasattr(config, 'num_layers'):
            return config.num_layers
        else:
            return "Unknown"
    except Exception as e:
        return f"Error: {str(e)}"

# Validate configuration
print("📋 CONFIGURATION SUMMARY")
print("="*50)
print(f"🤖 Model: {MODEL_NAME}")
print(f"📊 Datasets: {TRAIN_DATASET} → {VAL_DATASET} → {TEST_DATASET}")
print(f"🔢 Samples: {TRAIN_LIMIT} + {VAL_LIMIT} + {TEST_LIMIT} = {TRAIN_LIMIT + VAL_LIMIT + TEST_LIMIT} total")
print(f"🎯 Probe layers: {PROBE_LAYERS}")
print(f"⚙️ Steering layers: {STEERING_LAYERS}")
print(f"🎛️ Steering method: DAC (Dynamic Activation Composition)")
print(f"📊 DAC Hyperparameters:")
print(f"   • Entropy thresholds: {ENTROPY_THRESHOLDS}")  
print(f"   • Ptop values: {PTOP_VALUES}")
print(f"   • Max alpha values: {MAX_ALPHA_VALUES}")
print(f"📚 Using tasks from MATH_TASKS: ✓")

# Calculate total combinations
total_combinations = (len(STEERING_LAYERS) * 
                     len(ENTROPY_THRESHOLDS) *
                     len(PTOP_VALUES) *
                     len(MAX_ALPHA_VALUES) *
                     len(PROBE_LAYERS) * 
                     3)  # Assuming 3 probe C values

print(f"🧪 Total hyperparameter combinations: {total_combinations}")
print(f"📈 Wandb enabled: {ENABLE_WANDB}")

# Model info
try:
    num_layers = detect_model_layers(MODEL_NAME)
    print(f"🏗️ Model layers: {num_layers}")
    if isinstance(num_layers, int):
        max_probe_layer = max(PROBE_LAYERS) if PROBE_LAYERS else 0
        max_steering_layer = max(STEERING_LAYERS) if STEERING_LAYERS else 0
        if max_probe_layer >= num_layers or max_steering_layer >= num_layers:
            print("⚠️ WARNING: Some configured layers exceed model size!")
except Exception as e:
    print(f"⚠️ Could not detect model layers: {e}")

print("\n✅ Configuration validated successfully!")
print("💡 This evaluation now uses REAL mathematical training data instead of synthetic generation!")
print("🎯 DAC will be trained on actual math questions from your training dataset.")
print(f"💡 Configuration updated for quick testing with {MODEL_NAME} and the {TRAIN_DATASET.upper()} dataset.")
print(f"🎯 Currently configured for: {TRAIN_DATASET}")
print(f"📚 Available math tasks: {len(MATH_TASKS_LIST)} tasks including GSM8K, MATH-500, AIME, etc.")
print("💡 To change datasets, edit TRAIN_DATASET, VAL_DATASET, TEST_DATASET above.")

📚 Available math tasks: ['aime', 'aime2024', 'aime2025', 'gsm8k', 'hendrycks_math', 'hmmt', 'hmmt_feb_2025', 'livemathbench', 'livemathbench_cnmo_en', 'livemathbench_cnmo_zh', 'math', 'math500', 'polymath', 'polymath_en_high', 'polymath_en_medium', 'polymath_zh_high', 'polymath_zh_medium']
📋 CONFIGURATION SUMMARY
🤖 Model: /workspace/models/llama31-8b-instruct-hf
📊 Datasets: gsm8k → gsm8k → gsm8k
🔢 Samples: 5 + 5 + 30 = 40 total
🎯 Probe layers: [15]
⚙️ Steering layers: [15]
🎛️ Steering method: DAC (Dynamic Activation Composition)
📊 DAC Hyperparameters:
   • Entropy thresholds: [1.0]
   • Ptop values: [0.5]
   • Max alpha values: [2.0]
📚 Using tasks from MATH_TASKS: ✓
🧪 Total hyperparameter combinations: 3
📈 Wandb enabled: False
🏗️ Model layers: 32

✅ Configuration validated successfully!
💡 This evaluation now uses REAL mathematical training data instead of synthetic generation!
🎯 DAC will be trained on actual math questions from your training dataset.
💡 Configuration updated for quick t

## 🛠️ Create Configuration

Configuration is automatically created from the constants defined above.

In [None]:
# Create configuration from constants

config = ComprehensiveEvaluationConfig(
    model_name=MODEL_NAME,
    train_dataset=TRAIN_DATASET,
    val_dataset=VAL_DATASET,
    test_dataset=TEST_DATASET,
    train_limit=TRAIN_LIMIT,
    val_limit=VAL_LIMIT,
    test_limit=TEST_LIMIT,
    probe_layers=PROBE_LAYERS,
    steering_layers=STEERING_LAYERS,
    steering_methods=["dac"],  # Fixed to DAC only
    # DAC hyperparameters
    dac_entropy_thresholds=ENTROPY_THRESHOLDS,
    dac_ptop_values=PTOP_VALUES,
    dac_max_alpha_values=MAX_ALPHA_VALUES,
    enable_wandb=ENABLE_WANDB,
    experiment_name=EXPERIMENT_NAME,
    batch_size=16,
    max_length=512,
    max_new_tokens=256  # Increased from default 50 for GSM8K chain-of-thought
)

print("✅ Configuration object created successfully!")
print("🚀 Ready to run comprehensive evaluation!")
print("\n💡 All configuration is now controlled by the constants in the previous cell.")

## 🚀 Run Comprehensive Evaluation

This is the main evaluation cell. It will:

1. **🎯 Train Probes**: Train correctness classifiers on all specified layers
2. **⚙️ Optimize Hyperparameters**: Grid search for best steering + probe combinations
3. **🏆 Final Evaluation**: Test optimized configuration on held-out test set

**Note**: This may take several minutes depending on your configuration.
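
To make the second step concrete, here is a minimal sketch of picking a winner from a list of validated configurations. The key names `combined_score` and `steering_config` mirror the ones read by the analysis cells in this notebook; the nested `max_alpha` key, the helper `select_best_config`, and all numbers are made up for illustration. How the pipeline actually computes the combined score is not shown here.

```python
def select_best_config(all_configs):
    """Return the entry with the highest validation combined_score, or None."""
    if not all_configs:
        return None
    return max(all_configs, key=lambda cfg: cfg.get("combined_score", 0.0))


# Made-up validation scores, purely for illustration
validation_configs = [
    {"steering_config": {"layer": 15, "max_alpha": 1.5}, "combined_score": 0.61},
    {"steering_config": {"layer": 15, "max_alpha": 2.0}, "combined_score": 0.67},
    {"steering_config": {"layer": 15, "max_alpha": 2.5}, "combined_score": 0.58},
]

best = select_best_config(validation_configs)
print(best["steering_config"])
```

The point of the design is that selection only ever looks at validation scores, so the test set in step 3 stays untouched until a single configuration has been chosen.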

In [None]:
print("\n" + "="*80)
print("🚀 STARTING COMPREHENSIVE EVALUATION WITH REAL MATHEMATICAL DATA")
print("="*80)
print(f"✅ DAC will be trained on actual {TRAIN_DATASET.upper()} mathematical questions from training data")
print(f"🎯 Using task extractors for proper format handling across datasets")
print("="*80)

# Initialize pipeline
pipeline = ComprehensiveEvaluationPipeline(config)

# Run evaluation with progress tracking
try:
    results = pipeline.run_comprehensive_evaluation()
    print("\n" + "="*80)
    print("✅ Evaluation completed successfully!")
    
    # Store results for analysis
    evaluation_results = results
    
except Exception as e:
    print(f"\n❌ Evaluation failed: {str(e)}")
    print("Check the logs above for more details.")
    raise


🚀 STARTING COMPREHENSIVE EVALUATION WITH REAL MATHEMATICAL DATA
✅ DAC will be trained on actual GSM8K mathematical questions from training data
🎯 Using task extractors for proper format handling across datasets


Loading checkpoint shards:   0%|          | 0/4 [00:00<?, ?it/s]

## 📊 Results Analysis

Now let's analyze the results with comprehensive metrics and visualizations.

In [None]:
# Calculate comprehensive metrics
comprehensive_metrics = calculate_comprehensive_metrics(evaluation_results)

# Generate performance summary
performance_summary = generate_performance_summary(comprehensive_metrics)
print(performance_summary)

## 📈 Interactive Visualizations

Explore your results with interactive plots.

In [None]:
# Create interactive dashboard
dashboard = create_results_dashboard(evaluation_results)
dashboard.show()

print("\n📊 Interactive dashboard displayed above!")
print("💡 Hover over points and bars for detailed information.")

## 🎯 Detailed Analysis: Benchmark Performance

In [None]:
# Extract benchmark results
if "test_results" in evaluation_results:
    test_results = evaluation_results["test_results"]
    
    base_benchmark = test_results.get("base_model_benchmark_results", {})
    steered_benchmark = test_results.get("steered_model_benchmark_results", {})
    
    print("🎯 BENCHMARK PERFORMANCE ANALYSIS")
    print("="*40)
    
    print(f"\n📊 Base Model:")
    print(f"  ✓ Accuracy: {base_benchmark.get('accuracy', 0):.3f} ({base_benchmark.get('accuracy', 0)*100:.1f}%)")
    print(f"  ✓ Correct: {base_benchmark.get('correct', 0)}/{base_benchmark.get('total_samples', 0)}")
    
    print(f"\n🎯 Steered Model:")
    print(f"  ✓ Accuracy: {steered_benchmark.get('accuracy', 0):.3f} ({steered_benchmark.get('accuracy', 0)*100:.1f}%)")
    print(f"  ✓ Correct: {steered_benchmark.get('correct', 0)}/{steered_benchmark.get('total_samples', 0)}")
    
    improvement = steered_benchmark.get('accuracy', 0) - base_benchmark.get('accuracy', 0)
    improvement_percent = (improvement / max(base_benchmark.get('accuracy', 0.001), 0.001)) * 100
    
    print(f"\n📈 Improvement:")
    print(f"  {'✅' if improvement > 0 else '❌'} {improvement:+.3f} absolute ({improvement_percent:+.1f}% relative)")
    
    if improvement > 0.05:
        print("  🎉 Significant improvement! Steering is working well.")
    elif improvement > 0.01:
        print("  👍 Moderate improvement. Consider tuning hyperparameters.")
    elif improvement > -0.01:
        print("  ⚪ Minimal change. Steering may not be effective for this configuration.")
    else:
        print("  ⚠️ Performance decreased. Check steering implementation.")
else:
    print("❌ No test results found in evaluation data.")

## 🔍 Detailed Analysis: Probe Performance
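
The cell below reports probe AUC and interprets it against fixed thresholds. For readers unfamiliar with the metric, this small sketch computes AUC directly from its definition: the probability that a randomly chosen correct example receives a higher probe score than a randomly chosen incorrect one, with ties counted as half. The function `roc_auc` and the toy data are assumptions for this example and are not how the pipeline computes the metric.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]            # 1 = model answer was correct
scores = [0.9, 0.4, 0.6, 0.2]    # hypothetical probe outputs
print(roc_auc(labels, scores))   # 3 of 4 positive/negative pairs are ordered correctly
```

An AUC of 0.5 means the probe ranks examples no better than chance, which is why the analysis cell uses 0.5 as its fallback value.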

In [None]:
# Extract probe results
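if "test_results" in evaluation_results:
    # Re-read test_results here so this cell does not depend on the previous cell having run
    test_results = evaluation_results["test_results"]
    base_probe = test_results.get("base_model_probe_results", {})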
    steered_probe = test_results.get("steered_model_probe_results", {})
    
    print("🔍 PROBE PERFORMANCE ANALYSIS")
    print("="*40)
    
    print(f"\n📊 Base Model Probe:")
    print(f"  ✓ AUC: {base_probe.get('auc', 0.5):.3f}")
    print(f"  ✓ Accuracy: {base_probe.get('accuracy', 0.5):.3f}")
    print(f"  ✓ Precision: {base_probe.get('precision', 0.5):.3f}")
    print(f"  ✓ Recall: {base_probe.get('recall', 0.5):.3f}")
    print(f"  ✓ F1-Score: {base_probe.get('f1', 0.5):.3f}")
    
    print(f"\n🎯 Steered Model Probe:")
    print(f"  ✓ AUC: {steered_probe.get('auc', 0.5):.3f}")
    print(f"  ✓ Accuracy: {steered_probe.get('accuracy', 0.5):.3f}")
    print(f"  ✓ Precision: {steered_probe.get('precision', 0.5):.3f}")
    print(f"  ✓ Recall: {steered_probe.get('recall', 0.5):.3f}")
    print(f"  ✓ F1-Score: {steered_probe.get('f1', 0.5):.3f}")
    
    auc_improvement = steered_probe.get('auc', 0.5) - base_probe.get('auc', 0.5)
    
    print(f"\n📈 AUC Improvement:")
    print(f"  {'✅' if auc_improvement > 0 else '❌'} {auc_improvement:+.3f}")
    
    # Interpret probe performance
    best_auc = max(base_probe.get('auc', 0.5), steered_probe.get('auc', 0.5))
    
    if best_auc > 0.9:
        print("  🎉 Excellent probe performance! Activations strongly predict correctness.")
    elif best_auc > 0.8:
        print("  👍 Good probe performance. Activations are informative.")
    elif best_auc > 0.7:
        print("  ⚪ Moderate probe performance. Some signal present.")
    elif best_auc > 0.6:
        print("  ⚠️ Weak probe performance. Limited interpretability.")
    else:
        print("  ❌ Poor probe performance. Activations may not encode correctness.")
else:
    print("❌ No test results found in evaluation data.")

## ⚙️ Hyperparameter Optimization Analysis

In [None]:
# Analyze hyperparameter optimization results
if "steering_optimization_results" in evaluation_results:
    opt_results = evaluation_results["steering_optimization_results"]
    all_configs = opt_results.get("all_configs", [])
    best_config = opt_results.get("best_config", {})
    
    print("⚙️ HYPERPARAMETER OPTIMIZATION ANALYSIS")
    print("="*45)
    
    print(f"\n📊 Search Statistics:")
    print(f"  ✓ Configurations tested: {len(all_configs)}")
    print(f"  ✓ Best combined score: {opt_results.get('best_combined_score', 0):.3f}")
    
    if best_config:
        steering_config = best_config.get("steering_config", {})
        probe_config = best_config.get("best_probe_config", {})
        
        print(f"\n🏆 Best Configuration:")
        print(f"  ✓ Steering method: {steering_config.get('method', 'N/A')}")
        print(f"  ✓ Steering layer: {steering_config.get('layer', 'N/A')}")
        print(f"  ✓ Steering strength: {steering_config.get('strength', 'N/A')}")
        print(f"  ✓ Probe layer: {probe_config.get('layer', 'N/A')}")
        print(f"  ✓ Probe C value: {probe_config.get('C', 'N/A')}")
        
        benchmark_metrics = best_config.get("benchmark_metrics", {})
        probe_metrics = best_config.get("probe_metrics", {})
        
        print(f"\n📈 Best Configuration Performance:")
        print(f"  ✓ Benchmark accuracy: {benchmark_metrics.get('accuracy', 0):.3f}")
        print(f"  ✓ Probe AUC: {probe_metrics.get('auc', 0.5):.3f}")
        print(f"  ✓ Combined score: {best_config.get('combined_score', 0):.3f}")
    
    # Analyze score distribution
    if all_configs:
        scores = [config.get("combined_score", 0) for config in all_configs]
        benchmark_scores = [config.get("benchmark_metrics", {}).get("accuracy", 0) for config in all_configs]
        probe_scores = [config.get("probe_metrics", {}).get("auc", 0.5) for config in all_configs]
        
        print(f"\n📊 Score Distribution:")
        print(f"  ✓ Combined score: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
        print(f"  ✓ Benchmark score: {np.mean(benchmark_scores):.3f} ± {np.std(benchmark_scores):.3f}")
        print(f"  ✓ Probe score: {np.mean(probe_scores):.3f} ± {np.std(probe_scores):.3f}")
        
        # Check if optimization was effective
        score_range = max(scores) - min(scores)
        if score_range > 0.1:
            print("  🎯 Good optimization! Significant variation in scores.")
        elif score_range > 0.05:
            print("  👍 Moderate optimization. Some configurations better than others.")
        else:
            print("  ⚪ Limited optimization benefit. Most configurations perform similarly.")
else:
    print("❌ No optimization results found in evaluation data.")

## 📊 Training Performance Analysis

In [None]:
# Analyze probe training performance by layer
if "probe_training_results" in evaluation_results:
    training_results = evaluation_results["probe_training_results"]
    
    print("📊 PROBE TRAINING ANALYSIS")
    print("="*35)
    
    # Create summary table
    training_data = []
    
    for layer_key, layer_results in training_results.items():
        layer_num = int(layer_key.split('_')[1])
        
        best_auc = 0
        best_config = None
        
        for c_key, metrics in layer_results.items():
            if isinstance(metrics, dict) and "auc" in metrics:
                if metrics["auc"] > best_auc:
                    best_auc = metrics["auc"]
                    best_config = c_key
        
        training_data.append({
            'Layer': layer_num,
            'Best AUC': best_auc,
            'Best C': best_config.replace('C_', '') if best_config else 'N/A'
        })
    
    # Display as formatted table
    df_training = pd.DataFrame(training_data).sort_values('Layer')
    
    print(f"\n{'Layer':<8} {'Best AUC':<10} {'Best C':<10}")
    print("-" * 30)
    
    for _, row in df_training.iterrows():
        print(f"{row['Layer']:<8} {row['Best AUC']:<10.3f} {row['Best C']:<10}")
    
    # Find best performing layer
    best_layer_row = df_training.loc[df_training['Best AUC'].idxmax()]
    worst_layer_row = df_training.loc[df_training['Best AUC'].idxmin()]
    
    print(f"\n🏆 Best performing layer: {best_layer_row['Layer']} (AUC: {best_layer_row['Best AUC']:.3f})")
    print(f"⚠️ Worst performing layer: {worst_layer_row['Layer']} (AUC: {worst_layer_row['Best AUC']:.3f})")
    
    # Layer performance insights
    auc_std = df_training['Best AUC'].std()
    if auc_std > 0.1:
        print("\n💡 High variation across layers - layer choice matters!")
    elif auc_std > 0.05:
        print("\n💡 Moderate variation - some layers work better than others.")
    else:
        print("\n💡 Consistent performance across layers.")
        
else:
    print("❌ No training results found in evaluation data.")

## 📈 Static Visualizations

Create comprehensive static plots for reports and publications.

In [None]:
# Create comprehensive static visualization
fig = plot_evaluation_results(evaluation_results)
plt.show()

print("📊 Comprehensive evaluation plots displayed above.")
print("💾 Plots are saved automatically in the results directory.")

## 💾 Export Results

Save your results in various formats for further analysis.
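
Result dictionaries often hold objects that `json` cannot encode, such as trained probe instances. The sketch below shows one way to handle that: drop known problem keys recursively, then let `default=str` stringify anything else that slips through. `DummyProbe`, `strip_keys`, and the sample dictionary are hypothetical stand-ins, not wisent_guard objects.

```python
import json


class DummyProbe:
    """Stand-in for a trained probe object that json cannot encode."""


def strip_keys(obj, drop=("probe",)):
    """Recursively copy dicts and lists, omitting any dict key listed in `drop`."""
    if isinstance(obj, dict):
        return {k: strip_keys(v, drop) for k, v in obj.items() if k not in drop}
    if isinstance(obj, list):
        return [strip_keys(item, drop) for item in obj]
    return obj


results = {
    "test_results": {"accuracy": 0.5, "probe": DummyProbe()},
    "all_configs": [{"combined_score": 0.6, "probe": DummyProbe()}],
}

clean = strip_keys(results)
text = json.dumps(clean, indent=2, default=str)  # default=str covers leftover non-JSON values
print(text)
```

Because `strip_keys` builds new containers instead of mutating its input, the original results dictionary stays intact for later cells.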

In [None]:
# Results Export and Storage Options
print("📊 RESULTS STORAGE SUMMARY")
print("="*50)

# Check if wandb is enabled and results were logged
if config.enable_wandb:
    print("✅ Weights & Biases logging is ENABLED")
    print("📈 All evaluation results have been automatically logged to wandb including:")
    print("   • Configuration parameters")
    print("   • Probe training metrics")
    print("   • Hyperparameter optimization results") 
    print("   • Final test performance")
    print("   • Comprehensive metrics and visualizations")
    print()
    print("🔗 Access your results on the wandb dashboard:")
    print("   https://wandb.ai/")
else:
    print("⚠️ Weights & Biases logging is DISABLED")
    print("💾 Creating local backup files...")
    
    # Create results directory in outputs/ (excluded from git)
    results_dir = Path("outputs/notebook_results")
    results_dir.mkdir(parents=True, exist_ok=True)
    
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    
    # 1. Save raw results as JSON
    json_file = results_dir / f"evaluation_results_{timestamp}.json"
    with open(json_file, 'w') as f:
        # Remove non-serializable objects
        import copy
        results_copy = copy.deepcopy(evaluation_results)
        
        def remove_non_serializable(obj):
            if isinstance(obj, dict):
                return {k: remove_non_serializable(v) for k, v in obj.items() if k != 'probe'}
            elif isinstance(obj, list):
                return [remove_non_serializable(item) for item in obj]
            else:
                return obj
        
        clean_results = remove_non_serializable(results_copy)
        json.dump(clean_results, f, indent=2, default=str)
    
    print(f"✅ Raw results saved to: {json_file}")
    
    # 2. Save comprehensive metrics as CSV
    csv_file = results_dir / f"comprehensive_metrics_{timestamp}.csv"
    metrics_df = pd.DataFrame([comprehensive_metrics])
    metrics_df.to_csv(csv_file, index=False)
    
    print(f"✅ Comprehensive metrics saved to: {csv_file}")
    
    # 3. Generate HTML report
    html_report = generate_summary_report(evaluation_results, config.to_dict())
    html_file = results_dir / f"evaluation_report_{timestamp}.html"
    with open(html_file, 'w', encoding='utf-8') as f:
        f.write(html_report)
    
    print(f"✅ HTML report saved to: {html_file}")
    
    # 4. Save configuration
    config_file = results_dir / f"configuration_{timestamp}.json"
    with open(config_file, 'w') as f:
        json.dump(config.to_dict(), f, indent=2)
    
    print(f"✅ Configuration saved to: {config_file}")
    
    print(f"\n📁 All results saved in: {results_dir.absolute()}")

print("\n💡 Recommendation:")
if config.enable_wandb:
    print("   Use wandb dashboard for comprehensive result analysis and comparison.")
    print("   Results are automatically versioned and shareable via wandb.")
else:
    print("   Enable wandb logging for better experiment tracking and result management.")
    print("   Set ENABLE_WANDB = True in the configuration cell for future runs.")

print("\n🎉 Evaluation complete!")

## 🔍 Optional: Detailed Data Exploration

Use this section to explore specific aspects of your results.

In [None]:
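# Interactive data exploration
# ipywidgets provides the @interact decorator and the Dropdown widget used below
import ipywidgets as widgets
from ipywidgets import interact

@interact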
def explore_results(
    section=widgets.Dropdown(
        options=['Configuration', 'Training Results', 'Optimization Results', 'Test Results'],
        value='Configuration'
    )
):
    if section == 'Configuration':
        display(Markdown("### 📋 Configuration Details"))
        # `config` lives in the notebook's global scope; guard against the configuration cell not having been run
        if 'config' in globals():
            config_df = pd.DataFrame(list(config.to_dict().items()), columns=['Parameter', 'Value'])
            display(config_df)
        else:
            print("⚠️ Configuration not available. Please run the configuration cell first.")
        
    elif section == 'Training Results':
        display(Markdown("### 🎯 Probe Training Results"))
        if "probe_training_results" in evaluation_results:
            training_data = []
            for layer_key, layer_results in evaluation_results["probe_training_results"].items():
                layer_num = int(layer_key.split('_')[1])
                for c_key, metrics in layer_results.items():
                    if isinstance(metrics, dict) and "auc" in metrics:
                        training_data.append({
                            'Layer': layer_num,
                            'C': float(c_key.replace('C_', '')),
                            'Accuracy': metrics.get('accuracy', 0),
                            'AUC': metrics.get('auc', 0.5),
                            'Precision': metrics.get('precision', 0),
                            'Recall': metrics.get('recall', 0),
                            'F1': metrics.get('f1', 0)
                        })
            if training_data:
                training_df = pd.DataFrame(training_data)
                display(training_df.round(3))
        else:
            print("No training results available.")
            
    elif section == 'Optimization Results':
        display(Markdown("### ⚙️ Hyperparameter Optimization Results"))
        if "steering_optimization_results" in evaluation_results:
            opt_results = evaluation_results["steering_optimization_results"]
            all_configs = opt_results.get("all_configs", [])
            
            if all_configs:
                opt_data = []
                for i, config_item in enumerate(all_configs):
                    steering_config = config_item.get("steering_config", {})
                    probe_config = config_item.get("best_probe_config", {})
                    
                    opt_data.append({
                        'Config': i + 1,
                        'Steering Method': steering_config.get('method', 'N/A'),
                        'Steering Layer': steering_config.get('layer', 'N/A'),
                        'Steering Strength': steering_config.get('strength', 'N/A'),
                        'Probe Layer': probe_config.get('layer', 'N/A'),
                        'Probe C': probe_config.get('C', 'N/A'),
                        'Benchmark Accuracy': config_item.get('benchmark_metrics', {}).get('accuracy', 0),
                        'Probe AUC': config_item.get('probe_metrics', {}).get('auc', 0.5),
                        'Combined Score': config_item.get('combined_score', 0)
                    })
                
                opt_df = pd.DataFrame(opt_data)
                display(opt_df.round(3))
        else:
            print("No optimization results available.")
            
    elif section == 'Test Results':
        display(Markdown("### 🏆 Final Test Results"))
        if "test_results" in evaluation_results:
            test_results = evaluation_results["test_results"]
            
            # Create summary table
            summary_data = {
                'Metric': [
                    'Base Benchmark Accuracy',
                    'Steered Benchmark Accuracy', 
                    'Base Probe AUC',
                    'Steered Probe AUC',
                    'Validation Combined Score'
                ],
                'Value': [
                    test_results.get('base_model_benchmark_results', {}).get('accuracy', 0),
                    test_results.get('steered_model_benchmark_results', {}).get('accuracy', 0),
                    test_results.get('base_model_probe_results', {}).get('auc', 0.5),
                    test_results.get('steered_model_probe_results', {}).get('auc', 0.5),
                    test_results.get('validation_combined_score', 0)
                ]
            }
            
            summary_df = pd.DataFrame(summary_data)
            display(summary_df.round(3))
        else:
            print("No test results available.")

print("🔍 Use the dropdown above to explore different sections of your results.")

## 🎉 Conclusion

Congratulations! You've successfully run a comprehensive evaluation that separates:

1. **🎯 Benchmark Performance**: How well your model solves problems
2. **🔍 Probe Performance**: How well we can detect when the model is wrong
3. **⚙️ Optimization**: Finding the best configurations through proper validation

### Next Steps:
- 📊 Analyze the results above to understand your model's behavior
- 🔧 Try different configurations to see how they affect performance
- 📈 Use the exported results for further analysis or reporting
- 🚀 Scale up to larger models and datasets when ready

### Key Insights:
- The framework properly separates model capability from interpretability
- Hyperparameter optimization validates on actual performance, not just probe metrics
- Results are saved and visualized for easy interpretation

Happy experimenting! 🧪✨