# AURORA V2 - Comprehensive Evaluation

This notebook runs the complete AURORA V2 evaluation pipeline:

1. **Quick Validation** (5 min): Validate against ground truth labels
2. **Full Evaluation** (2-3 hours): Ablation study, benchmarks, statistical analysis

All results are publication-ready, with tables, figures, and a PDF report.

## Cell 1: Setup

Clone repository and install dependencies.

In [None]:
# Configuration - modify these variables as needed
REPO_URL = "https://github.com/shobith-s/AURORA-V2.git"  # Repository URL
BRANCH = "main"  # Branch to checkout (e.g., 'main', 'polishing', 'develop')

# Clone AURORA V2 repository
!git clone {REPO_URL}
%cd AURORA-V2

# Checkout the specified branch
!git checkout {BRANCH}

# Install dependencies
!pip install -q -r requirements.txt

# Verify installation
!python -c "from src.core.preprocessor import IntelligentPreprocessor; print('✅ AURORA loaded successfully')"

## Cell 2: Quick Validation (~5 minutes)

Validates AURORA predictions against ground truth labels.
Generates accuracy metrics and a confusion matrix.
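
For readers unfamiliar with the per-action metrics reported below, the following is a minimal sketch of how accuracy, precision, recall, F1, and support can be computed from paired true and predicted labels. It is not taken from `scripts/validate_against_ground_truth.py`; the function names and the `precision`/`recall` keys are assumptions made for illustration, and the labels in the usage note are placeholders rather than real AURORA actions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of examples where the predicted action equals the true one."""
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_action_metrics(y_true, y_pred):
    """Compute precision, recall, F1, and support for each action label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted this action, but it was wrong
            fn[truth] += 1  # the true action was missed

    metrics = {}
    for action in sorted(set(y_true) | set(y_pred)):
        predicted = tp[action] + fp[action]
        actual = tp[action] + fn[action]
        precision = tp[action] / predicted if predicted else 0.0
        recall = tp[action] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[action] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": actual,  # number of true examples of this action
        }
    return metrics
```

For example, `per_action_metrics(["a", "a", "b"], ["a", "b", "b"])` gives action `a` a support of 2, a precision of 1.0, and a recall of 0.5.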

In [None]:
# Run ground truth validation
# Use --create-sample to generate sample data if validated_labels.json doesn't exist
!python scripts/validate_against_ground_truth.py --create-sample

# Display results
import json
from pathlib import Path

results_path = Path('results/ground_truth_validation.json')
if results_path.exists():
    with open(results_path) as f:
        results = json.load(f)
    
    print("\n" + "="*60)
    print("GROUND TRUTH VALIDATION RESULTS")
    print("="*60)
    print(f"\nAccuracy: {results['test_accuracy']:.2%}")
    print(f"Correct: {results['correct_predictions']}/{results['total_examples']}")
    print(f"Skipped: {results['skipped_examples']}")
    
    print("\nPer-Action Metrics:")
    for action, metrics in sorted(results['per_action_metrics'].items(), 
                                   key=lambda x: x[1]['support'], reverse=True)[:10]:
        print(f"  {action}: F1={metrics['f1']:.3f}, Support={metrics['support']}")
else:
    print("⚠️ Validation results not found. Check the script output above.")

## Cell 3: Full Evaluation (~2-3 hours)

Runs the complete evaluation pipeline:
- Downloads OpenML datasets
- Ablation study (4 variants)
- Comprehensive benchmarks
- Statistical analysis
- Generates tables, figures, and PDF report

In [None]:
# Run full evaluation pipeline
# --datasets: Number of datasets (5 for quick, 10 for thorough)
# --verbose: Enable detailed output
!python scripts/run_colab_evaluation.py --datasets 5 --verbose
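
Because the full run takes hours, it can help to confirm that the expected files were written before moving on to the display cells. The sketch below checks for the result files named in this notebook's summary, assuming they all land directly under `results/`. The helper `missing_outputs` and the `EXPECTED_OUTPUTS` list are illustrative and not part of the AURORA scripts.

```python
from pathlib import Path

# File names taken from this notebook's summary; adjust if the pipeline changes.
EXPECTED_OUTPUTS = [
    "ground_truth_validation.json",
    "ablation_results.json",
    "benchmark_results.json",
    "statistical_tests.json",
    "paper_tables.md",
]


def missing_outputs(results_dir="results", expected=EXPECTED_OUTPUTS):
    """Return the expected output files that are absent from results_dir."""
    root = Path(results_dir)
    return [name for name in expected if not (root / name).is_file()]


missing = missing_outputs()
if missing:
    print("Missing outputs:", ", ".join(missing))
else:
    print("All expected outputs are present.")
```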

## Cell 4: Display Results

Show tables and figures inline.

In [None]:
from IPython.display import display, Markdown, Image
from pathlib import Path

# Display paper tables
tables_path = Path('results/paper_tables.md')
if tables_path.exists():
    print("📊 Paper Tables")
    print("="*60)
    display(Markdown(tables_path.read_text()))
else:
    print("⚠️ Tables not found. Run Cell 3 first.")

In [None]:
# Display figures
from IPython.display import display, Image
from pathlib import Path

figures_dir = Path('results/figures')

figures = [
    ('Accuracy Comparison', 'accuracy_comparison.png'),
    ('Decision Sources', 'decision_sources.png'),
    ('Latency Distribution', 'latency_distribution.png'),
    ('Confusion Matrix', 'confusion_matrix.png'),
]

for title, filename in figures:
    fig_path = figures_dir / filename
    if fig_path.exists():
        print(f"\n📈 {title}")
        print("="*60)
        display(Image(filename=str(fig_path)))
    else:
        print(f"⚠️ {title} not found")

## Cell 5: Download Results

Download all results as a ZIP file.

In [None]:
# Create ZIP archive and download
from pathlib import Path
import shutil

results_dir = Path('results')

if results_dir.exists():
    # Create ZIP
    shutil.make_archive('evaluation_results', 'zip', results_dir)
    print("✅ Created evaluation_results.zip")
    
    # List contents
    print("\nContents:")
    for f in sorted(results_dir.rglob('*')):
        if f.is_file():
            size_kb = f.stat().st_size / 1024
            print(f"  • {f.relative_to(results_dir)} ({size_kb:.1f} KB)")
    
    # Download (Colab only)
    try:
        from google.colab import files
        files.download('evaluation_results.zip')
        print("\n✅ Download started!")
    except ImportError:
        print("\n📁 Not in Colab. Find results in: ./results/")
else:
    print("⚠️ Results directory not found. Run evaluation first.")

## Summary

After running this notebook, you will have:

1. **Ground Truth Validation** (results/ground_truth_validation.json)
   - Accuracy against validated labels
   - Per-action precision/recall/F1
   - Confusion matrix

2. **Ablation Study** (results/ablation_results.json)
   - Random baseline vs Symbolic-only vs Neural-only vs AURORA Hybrid
   - Per-dataset accuracy comparison

3. **Benchmarks** (results/benchmark_results.json)
   - Decision source breakdown (symbolic vs neural)
   - Latency metrics
   - Confidence statistics

4. **Statistical Tests** (results/statistical_tests.json)
   - T-tests between variants
   - P-values and effect sizes

5. **Publication-Ready Outputs**
   - Tables in Markdown (paper_tables.md)
   - Figures in PNG (figures/)
   - PDF report (EVALUATION_REPORT.pdf)
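
To make the t-tests and effect sizes in item 4 concrete, here is a minimal sketch of a paired comparison between two variants scored on the same datasets. It assumes a paired design and uses only the standard library. The pipeline's actual test may differ, the function name is hypothetical, and the numbers are placeholders, not AURORA results.

```python
import math
import statistics


def paired_t_and_effect_size(scores_a, scores_b):
    """Return (t statistic, Cohen's d) for paired per-dataset scores."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    cohens_d = mean_diff / sd_diff  # effect size for paired differences
    return t_stat, cohens_d


# Placeholder accuracies for illustration only; not AURORA results.
hybrid = [0.6, 0.7, 0.8]
baseline = [0.5, 0.5, 0.5]
t_stat, d = paired_t_and_effect_size(hybrid, baseline)
print(f"t = {t_stat:.3f}, d = {d:.3f}")
```

The p-value follows from the t statistic with `n - 1` degrees of freedom. If SciPy is available, `scipy.stats.ttest_rel` returns both the statistic and the p-value directly.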