# Evaluation Demo: Steered Models on ASAP-SAS

This notebook demonstrates how to evaluate steered models on the ASAP-SAS (Short Answer Scoring) dataset.

**Dataset**: [Kaggle ASAP-SAS](https://www.kaggle.com/c/asap-sas/data)
- ~17,000 short answer responses
- 10 different question sets (Science to Language Arts)
- Scored 0-3 by human raters (some question sets use a 0-2 scale)
- Primary metric: Quadratic Weighted Kappa (QWK)
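
For reference, the usual textbook form of QWK for $N$ score categories is given below. Here $O_{ij}$ is the number of responses scored $i$ by the human rater and $j$ by the model, and $E_{ij}$ is the count expected if the two were independent (the outer product of the two score histograms, scaled to the same total as $O$). The library's `quadratic_weighted_kappa` is assumed, not confirmed in this notebook, to follow this definition.

$$
w_{ij} = \frac{(i - j)^2}{(N - 1)^2},
\qquad
\kappa = 1 - \frac{\sum_{i,j} w_{ij}\, O_{ij}}{\sum_{i,j} w_{ij}\, E_{ij}}
$$

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean agreement worse than chance.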

In [None]:
import sys
sys.path.insert(0, '..')

from assistant_axis import TraitSteerer
from assistant_axis.evaluation import (
    ASAPSASDataset,
    ScoringEvaluator,
    EvaluationConfig,
    run_evaluation,
    compare_results,
    quadratic_weighted_kappa,
)

## 1. Load the Dataset

Download the dataset from Kaggle and place `train.tsv` in a `data/asap-sas/` directory one level above this notebook (the cell below reads `../data/asap-sas/train.tsv`).
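
Before relying on `ASAPSASDataset`, it can help to sanity-check the raw file. The sketch below counts responses per question set using only the standard library. The column names `EssaySet` and `Score1` are assumptions about the TSV header, and the in-memory rows are made-up placeholders, not real data. `count_by_set` is a hypothetical helper, not part of `assistant_axis`.

```python
import csv
import io
from collections import Counter


def count_by_set(tsv_file):
    """Count responses per essay set and per (essay set, score) pair."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    per_set = Counter()
    per_set_score = Counter()
    for row in reader:
        essay_set = int(row["EssaySet"])  # assumed column name
        score = int(row["Score1"])        # assumed column name
        per_set[essay_set] += 1
        per_set_score[(essay_set, score)] += 1
    return per_set, per_set_score


# Tiny in-memory stand-in for train.tsv (placeholder rows, not real data)
sample = io.StringIO(
    "Id\tEssaySet\tScore1\tScore2\tEssayText\n"
    "1\t1\t2\t2\tSome answer text\n"
    "2\t1\t0\t1\tAnother answer\n"
    "3\t2\t3\t3\tA third answer\n"
)
per_set, per_set_score = count_by_set(sample)
print(dict(per_set))
```

To run it on the real file, open the path with `open(path, newline="", encoding="utf-8")` and pass the file object to `count_by_set`. The real file may need different quoting settings for `csv.DictReader`.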

In [None]:
# Load dataset
# Download from: https://www.kaggle.com/c/asap-sas/data
DATA_PATH = "../data/asap-sas/train.tsv"  # Adjust path as needed

try:
    dataset = ASAPSASDataset(DATA_PATH)
    print(f"Loaded {len(dataset)} examples")
    print("\nDataset statistics:")
    stats = dataset.statistics()
    for key, value in stats.items():
        print(f"  {key}: {value}")
except FileNotFoundError as e:
    print(f"Dataset not found: {e}")
    print("\nPlease download the dataset from:")
    print("https://www.kaggle.com/c/asap-sas/data")
    print("\nAnd place train.tsv in ../data/asap-sas/")

In [None]:
# Explore a few examples
print("Sample examples:")
print("=" * 60)
for example in dataset.sample(3, seed=42):
    print(f"ID: {example.id}")
    print(f"Essay Set: {example.essay_set}")
    print(f"Score: {example.score}")
    print(f"Answer: {example.answer_text[:200]}...")
    print("-" * 60)

## 2. Initialize the Model

Load the model through `TraitSteerer`; the evaluator created in the next section uses it to apply steering.

In [None]:
# Choose model
MODEL_NAME = "Qwen/Qwen3-32B"  # Or: google/gemma-2-27b-it, meta-llama/Llama-3.3-70B-Instruct

# Initialize steerer
steerer = TraitSteerer(MODEL_NAME)
print(steerer)

In [None]:
# Explore traits that might be relevant for scoring
# Negative similarity = role-playing traits (dramatic, etc.)
# Positive similarity = assistant-like traits (transparent, etc.)

ranked = steerer.rank_traits_by_similarity(ascending=False)  # Most assistant-like first

print("Traits potentially useful for scoring (assistant-like):")
for name, sim in ranked[:15]:
    print(f"  {name}: {sim:.3f}")

## 3. Create the Evaluator

The `ScoringEvaluator` handles prompting, generation, and score parsing.
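
The notebook does not show how scores are parsed from generated text. As an illustration of what that step might involve, the sketch below pulls the first standalone integer in the valid range out of a model response. `parse_score` is a hypothetical helper written for this example, not the method `ScoringEvaluator` uses.

```python
import re
from typing import Optional

# Matches a run of digits that is not part of a longer number or a decimal
# such as "2.5"; a trailing sentence period ("Score: 2.") is still accepted.
_INT_PATTERN = re.compile(r"(?<!\d)(?<!\d\.)\d+(?!\d|\.\d)")


def parse_score(text: str, min_score: int = 0, max_score: int = 3) -> Optional[int]:
    """Return the first standalone integer within [min_score, max_score], or None."""
    for match in _INT_PATTERN.finditer(text):
        value = int(match.group())
        if min_score <= value <= max_score:
            return value
    return None


print(parse_score("Score: 2."))                       # 2
print(parse_score("The answer deserves 2.5 points"))  # None
```

A heuristic like this returns `None` for unparseable output, so the caller has to decide whether to skip, retry, or count such responses as errors. That choice affects the reported metrics.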

In [None]:
# Create evaluator
evaluator = ScoringEvaluator(steerer, verbose=True)

# Configure evaluation parameters
NUM_SAMPLES = 50  # Start small for testing, increase for full eval
ESSAY_SETS = [1, 2]  # Focus on specific sets, or None for all

## 4. Baseline Evaluation (No Steering)

First, establish a baseline without any steering.

In [None]:
# Clear any steering
evaluator.clear_steering()

# Run baseline
print("Running baseline evaluation...")
baseline_result = evaluator.evaluate(
    dataset,
    num_samples=NUM_SAMPLES,
    essay_sets=ESSAY_SETS,
    seed=42,
)

## 5. Evaluation with Trait Steering

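Next, evaluate several steering configurations on the same sample (same `seed`), so the results are directly comparable with the baseline. Both the traits and their coefficients are configurable.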
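
As background, activation steering is commonly described as adding a scaled direction vector to a model's hidden state. The sketch below shows that arithmetic for several traits at once, using plain Python lists and toy two-dimensional vectors. Whether `TraitSteerer` does exactly this, at which layers, and with which sign convention is not shown in this notebook, so treat every name here as hypothetical.

```python
from typing import Dict, List, Sequence


def combine_steering(
    trait_vectors: Dict[str, Sequence[float]],
    traits: List[str],
    coefficients: List[float],
) -> List[float]:
    """Return sum_i coefficients[i] * trait_vectors[traits[i]]."""
    if not traits or len(traits) != len(coefficients):
        raise ValueError("traits and coefficients must be non-empty and equal in length")
    dim = len(trait_vectors[traits[0]])
    combined = [0.0] * dim
    for name, coeff in zip(traits, coefficients):
        for k, component in enumerate(trait_vectors[name]):
            combined[k] += coeff * component
    return combined


def apply_steering(hidden: Sequence[float], steering: Sequence[float]) -> List[float]:
    """Add the steering vector to a hidden-state vector, element by element."""
    return [h + s for h, s in zip(hidden, steering)]


# Toy vectors for illustration only (real trait vectors have the model's hidden size)
toy_vectors = {"patient": [1.0, 0.0], "analytical": [0.0, 2.0]}
steering = combine_steering(toy_vectors, ["patient", "analytical"], [-1.5, -1.5])
print(apply_steering([0.5, 0.5], steering))
```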

In [None]:
# Configuration 1: Single trait - "thorough"
# Hypothesis: Being more thorough might improve scoring accuracy
evaluator.set_traits(["thorough"], coefficients=[-2.0])

print("Running evaluation with 'thorough' trait...")
thorough_result = evaluator.evaluate(
    dataset,
    num_samples=NUM_SAMPLES,
    essay_sets=ESSAY_SETS,
    seed=42,
)

In [None]:
# Configuration 2: Multiple traits
# Hypothesis: Patient + analytical might help with nuanced scoring
evaluator.set_traits(
    ["patient", "analytical"],
    coefficients=[-1.5, -1.5]
)

print("Running evaluation with 'patient' + 'analytical' traits...")
multi_trait_result = evaluator.evaluate(
    dataset,
    num_samples=NUM_SAMPLES,
    essay_sets=ESSAY_SETS,
    seed=42,
)

In [None]:
# Configuration 3: Assistant axis steering
# Hypothesis: More assistant-like behavior might be more consistent
evaluator.set_assistant_steering(coefficient=2.0)

print("Running evaluation with assistant axis steering...")
assistant_result = evaluator.evaluate(
    dataset,
    num_samples=NUM_SAMPLES,
    essay_sets=ESSAY_SETS,
    seed=42,
)

## 6. Compare Results

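Let's see which configuration performs best. With `NUM_SAMPLES = 50`, small differences in QWK may be noise, so increase the sample size before drawing conclusions.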

In [None]:
# Collect all results
all_results = [
    baseline_result,
    thorough_result,
    multi_trait_result,
    assistant_result,
]

# Print comparison table
print("\n" + "=" * 70)
print("COMPARISON OF STEERING CONFIGURATIONS")
print("=" * 70)
print(compare_results(all_results))

In [None]:
# Visualize results
import matplotlib.pyplot as plt

labels = ["Baseline", "Thorough", "Patient+Analytical", "Assistant Axis"]
qwk_scores = [r.qwk for r in all_results]
accuracy_scores = [r.accuracy for r in all_results]

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# QWK comparison
axes[0].bar(labels, qwk_scores, color=['gray', 'steelblue', 'steelblue', 'steelblue'])
axes[0].set_ylabel('QWK Score')
axes[0].set_title('Quadratic Weighted Kappa by Configuration')
axes[0].set_ylim(0, 1)
for i, v in enumerate(qwk_scores):
    axes[0].text(i, v + 0.02, f'{v:.3f}', ha='center')

# Accuracy comparison
axes[1].bar(labels, accuracy_scores, color=['gray', 'coral', 'coral', 'coral'])
axes[1].set_ylabel('Accuracy')
axes[1].set_title('Exact Match Accuracy by Configuration')
axes[1].set_ylim(0, 1)
for i, v in enumerate(accuracy_scores):
    axes[1].text(i, v + 0.02, f'{v:.3f}', ha='center')

plt.tight_layout()
plt.show()

## 7. Custom Trait Experiments

Try your own trait configurations! Just modify the cells below.

In [None]:
# ============================================================
# CONFIGURE YOUR EXPERIMENT HERE
# ============================================================

# Option 1: Single trait
# TRAITS = ["empathetic"]
# COEFFICIENTS = [-2.0]

# Option 2: Multiple traits
TRAITS = ["precise", "objective"]
COEFFICIENTS = [-1.5, -2.0]

# Option 3: Set to None for baseline
# TRAITS = None

# ============================================================

In [None]:
# Run custom experiment
if TRAITS:
    evaluator.set_traits(TRAITS, coefficients=COEFFICIENTS)
    print(f"Testing traits: {TRAITS} with coefficients: {COEFFICIENTS}")
else:
    evaluator.clear_steering()
    print("Running baseline (no steering)")

custom_result = evaluator.evaluate(
    dataset,
    num_samples=NUM_SAMPLES,
    essay_sets=ESSAY_SETS,
    seed=42,
)

print(f"\nCustom experiment QWK: {custom_result.qwk:.4f}")
print(f"Baseline QWK: {baseline_result.qwk:.4f}")
print(f"Difference: {custom_result.qwk - baseline_result.qwk:+.4f}")

## 8. Batch Evaluation with Multiple Configs

Use `run_evaluation` to test multiple configurations at once.
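
When sweeping many trait and coefficient combinations, writing each config by hand gets tedious. The sketch below generates the keyword arguments for a small grid as plain dictionaries. `sweep_settings` is a hypothetical helper, and the keys mirror the `EvaluationConfig` arguments used in this notebook.

```python
from itertools import product


def sweep_settings(traits, coefficients, base):
    """Build one settings dict per (trait, coefficient) pair, plus a baseline."""
    settings = [dict(base)]  # baseline entry: no traits, no coefficients
    for trait, coeff in product(traits, coefficients):
        settings.append({**base, "traits": [trait], "coefficients": [coeff]})
    return settings


base = {"num_samples": 50, "essay_sets": [1, 2], "seed": 42}
settings = sweep_settings(["thorough", "pedantic"], [-1.0, -2.0], base)
for s in settings:
    print(s.get("traits"), s.get("coefficients"))
```

Assuming `EvaluationConfig` accepts these keyword arguments, the grid can be turned into configs with `[EvaluationConfig(**s) for s in settings]`.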

In [None]:
# Define multiple configurations to test
configs = [
    # Baseline
    EvaluationConfig(
        num_samples=NUM_SAMPLES,
        essay_sets=ESSAY_SETS,
        seed=42,
    ),
    # Trait: thorough
    EvaluationConfig(
        traits=["thorough"],
        coefficients=[-2.0],
        num_samples=NUM_SAMPLES,
        essay_sets=ESSAY_SETS,
        seed=42,
    ),
    # Trait: pedantic (might be too strict?)
    EvaluationConfig(
        traits=["pedantic"],
        coefficients=[-2.0],
        num_samples=NUM_SAMPLES,
        essay_sets=ESSAY_SETS,
        seed=42,
    ),
    # Assistant axis positive
    EvaluationConfig(
        use_assistant_axis=True,
        assistant_coefficient=3.0,
        num_samples=NUM_SAMPLES,
        essay_sets=ESSAY_SETS,
        seed=42,
    ),
]

# Run all
batch_results = run_evaluation(steerer, dataset, configs, verbose=True)

# Compare
print("\n" + "=" * 70)
print("BATCH EVALUATION RESULTS")
print("=" * 70)
print(compare_results(batch_results))

## 9. Save and Load Results

Save your results for later analysis.

In [None]:
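# Save results (create the output directory first in case it does not exist)
from pathlib import Path

Path("../outputs").mkdir(parents=True, exist_ok=True)
baseline_result.save("../outputs/eval_baseline.json")
print("Results saved!")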

# Load back
# from assistant_axis.evaluation import EvaluationResult
# loaded = EvaluationResult.load("../outputs/eval_baseline.json")

## 10. Inspect Individual Predictions

Look at specific examples to understand model behavior.
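
A tally of how steering moved each prediction relative to the true score can complement reading individual examples. The sketch below assumes each prediction is a dict with `id`, `true_score`, and an integer `predicted_score`, matching the keys used in this notebook. `tally_shifts` is a hypothetical helper, and the records are toy data.

```python
from collections import Counter


def tally_shifts(baseline_preds, steered_preds):
    """Count how steering changed each prediction's distance to the true score."""
    counts = Counter()
    for base, steer in zip(baseline_preds, steered_preds):
        if base["id"] != steer["id"]:
            raise ValueError(f"Prediction order mismatch: {base['id']} vs {steer['id']}")
        before = abs(base["predicted_score"] - base["true_score"])
        after = abs(steer["predicted_score"] - steer["true_score"])
        if steer["predicted_score"] == base["predicted_score"]:
            counts["unchanged"] += 1
        elif after < before:
            counts["improved"] += 1
        elif after > before:
            counts["worsened"] += 1
        else:
            counts["changed_same_error"] += 1
    return counts


# Toy records for illustration only
toy_baseline = [
    {"id": 1, "true_score": 2, "predicted_score": 1},
    {"id": 2, "true_score": 0, "predicted_score": 0},
    {"id": 3, "true_score": 3, "predicted_score": 2},
    {"id": 4, "true_score": 1, "predicted_score": 0},
]
toy_steered = [
    {"id": 1, "true_score": 2, "predicted_score": 2},
    {"id": 2, "true_score": 0, "predicted_score": 1},
    {"id": 3, "true_score": 3, "predicted_score": 2},
    {"id": 4, "true_score": 1, "predicted_score": 2},
]
shifts = tally_shifts(toy_baseline, toy_steered)
print(dict(shifts))
```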

In [None]:
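# Find examples (among the first 20) where predictions differ between configs
print("Examples where baseline and steered predictions differ:")
print("=" * 60)

# Both runs use the same seed, so predictions are assumed to be in the same order;
# the ID check below skips any pair that is not aligned.
for baseline_pred, steered_pred in zip(baseline_result.predictions[:20],
                                        thorough_result.predictions[:20]):
    if baseline_pred['id'] != steered_pred['id']:
        continue
    if baseline_pred['predicted_score'] != steered_pred['predicted_score']: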
        print(f"ID: {baseline_pred['id']}")
        print(f"True Score: {baseline_pred['true_score']}")
        print(f"Baseline Prediction: {baseline_pred['predicted_score']}")
        print(f"Steered Prediction: {steered_pred['predicted_score']}")
        print(f"Answer: {baseline_pred['answer_text'][:100]}...")
        print("-" * 60)

## Summary

This evaluation framework allows you to:

1. **Load datasets**: `ASAPSASDataset(path)`
2. **Configure traits**: `evaluator.set_traits(["trait1", "trait2"], coefficients=[c1, c2])`
3. **Use assistant axis**: `evaluator.set_assistant_steering(coefficient)`
4. **Run evaluations**: `evaluator.evaluate(dataset, num_samples=N)`
5. **Compare results**: `compare_results([result1, result2, ...])`
6. **Batch evaluate**: `run_evaluation(steerer, dataset, [config1, config2, ...])`

**Key Metrics**:
- **QWK** (Quadratic Weighted Kappa): Primary metric, accounts for ordinal nature of scores
- **Accuracy**: Exact match rate
- **MAE**: Mean absolute error between predicted and true scores
- **Adjacent Agreement**: Fraction of predictions within 1 point of the true score
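
To make these definitions concrete, the sketch below computes all four metrics in pure Python from two lists of integer scores. It uses the standard definitions and is not taken from `assistant_axis.evaluation`, so the library's values may differ in edge cases. `scoring_metrics` is a hypothetical helper, and `num_classes=4` assumes scores in 0-3.

```python
from typing import Dict, Sequence


def scoring_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], num_classes: int = 4
) -> Dict[str, float]:
    """Compute accuracy, MAE, adjacent agreement, and QWK for integer scores."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))

    accuracy = sum(t == p for t, p in pairs) / n
    mae = sum(abs(t - p) for t, p in pairs) / n
    adjacent = sum(abs(t - p) <= 1 for t, p in pairs) / n

    # Observed confusion matrix and the two marginal histograms
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in pairs:
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)]

    # Quadratically weighted disagreement: observed vs. expected under independence
    numerator = 0.0
    denominator = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            numerator += weight * observed[i][j]
            denominator += weight * hist_true[i] * hist_pred[j] / n
    # Degenerate case: both raters used one identical score; treated as 1.0 here
    qwk = 1.0 - numerator / denominator if denominator > 0 else 1.0

    return {"accuracy": accuracy, "mae": mae, "adjacent_agreement": adjacent, "qwk": qwk}


metrics = scoring_metrics([0, 1, 2, 3], [0, 2, 2, 3])
print(metrics)
```

In this toy case one prediction is off by a single point, so accuracy drops to 0.75 while adjacent agreement stays at 1.0 and QWK remains high. This is how QWK accounts for the ordinal nature of the scores.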