# Advanced Steering Methods: PRISM, PULSE, and TITAN

This notebook demonstrates the three advanced steering methods in Wisent:

1. **PRISM** - Multi-directional gradient-optimized steering
2. **PULSE** - Conditional layer-adaptive steering
3. **TITAN** - Joint optimization of manifold, gating, and intensity

Each method builds on insights from academic research on representation engineering. CAA (Contrastive Activation Addition) is covered first as a baseline for comparison.

In [None]:
import os
import torch
import json

# Configuration
MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"
TASK = "truthfulqa"
NUM_PAIRS = 50
OUTPUT_DIR = "./advanced_steering_outputs"

os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/pairs", exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/vectors", exist_ok=True)

print(f"Model: {MODEL_NAME}")
print(f"Task: {TASK}")
print(f"Output: {OUTPUT_DIR}")

## Step 1: Generate Contrastive Pairs

All methods use the same contrastive pairs format.

In [None]:
!python -m wisent.core.main generate-pairs \
    {MODEL_NAME} \
    --task {TASK} \
    --output {OUTPUT_DIR}/pairs/contrastive_pairs.json \
    --limit {NUM_PAIRS} \
    --verbose

---

## Method 1: CAA (Baseline)

Contrastive Activation Addition (CAA) computes a single steering direction as the difference between the mean activations of the positive and negative examples: `mean(positive) - mean(negative)`.
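
As a rough illustration of the difference-of-means computation, the sketch below uses plain Python lists in place of real model activations. The names `mean_vector` and `caa_direction` and the toy values are assumptions made for this example and are not part of Wisent's API; the optional normalization only approximates what a flag such as `--caa-normalize` suggests.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def caa_direction(positive, negative, normalize=True):
    """Difference of means: mean(positive) - mean(negative)."""
    pos_mean = mean_vector(positive)
    neg_mean = mean_vector(negative)
    direction = [p - q for p, q in zip(pos_mean, neg_mean)]
    if normalize:
        norm = math.sqrt(sum(x * x for x in direction))
        if norm > 0:  # leave an all-zero direction untouched
            direction = [x / norm for x in direction]
    return direction

# Toy activations (hypothetical values, 2-dimensional for readability)
positive = [[1.0, 2.0], [3.0, 2.0]]
negative = [[0.0, 2.0], [0.0, 2.0]]

print(caa_direction(positive, negative, normalize=False))
print(caa_direction(positive, negative))
```

Only the first coordinate differs between the two groups here, so the resulting direction points along that axis.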

In [None]:
!python -m wisent.core.main train-steering \
    {MODEL_NAME} \
    --pairs {OUTPUT_DIR}/pairs/contrastive_pairs.json \
    --output {OUTPUT_DIR}/vectors/caa \
    --steering-method CAA \
    --caa-normalize \
    --verbose

---

## Method 2: PRISM (Multi-Directional)

**PRISM** (Projected Representations for Independent Steering Manifolds) discovers multiple steering directions per layer through gradient optimization.

### Key Features:
- Multiple directions capture complex behaviors
- Gradient-optimized for maximum effectiveness
- Independence loss prevents redundant directions
- Cone constraint keeps directions in the same behavioral half-space
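
To make the independence loss and the cone constraint more concrete, here is one plausible formulation written in plain Python. It is an assumption for illustration, not Wisent's actual loss code: independence is measured as the mean squared cosine similarity between distinct directions, and the cone constraint is a penalty on any direction whose cosine with the primary (first) direction is negative. All function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def independence_loss(directions):
    """Mean squared cosine over all distinct pairs (0 when mutually orthogonal)."""
    pairs = [(i, j) for i in range(len(directions))
             for j in range(i + 1, len(directions))]
    if not pairs:
        return 0.0
    return sum(cosine(directions[i], directions[j]) ** 2 for i, j in pairs) / len(pairs)

def cone_penalty(directions):
    """Penalize directions that point away from the primary (first) direction."""
    primary = directions[0]
    return sum(max(0.0, -cosine(primary, d)) for d in directions[1:])

orthogonal = [[1.0, 0.0], [0.0, 1.0]]   # diverse, same half-space
redundant = [[1.0, 0.0], [2.0, 0.0]]    # second direction adds nothing new
opposed = [[1.0, 0.0], [-1.0, 0.0]]     # second direction leaves the cone

print(independence_loss(orthogonal), independence_loss(redundant))
print(cone_penalty(orthogonal), cone_penalty(opposed))
```

Under this formulation the two terms do not conflict: orthogonal directions incur no independence loss and no cone penalty, while redundant or opposed directions are each penalized by one of the terms.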

In [None]:
# Train PRISM with 3 directions per layer
!python -m wisent.core.main train-steering \
    {MODEL_NAME} \
    --pairs {OUTPUT_DIR}/pairs/contrastive_pairs.json \
    --output {OUTPUT_DIR}/vectors/prism \
    --steering-method PRISM \
    --prism-num-directions 3 \
    --prism-optimization-steps 100 \
    --prism-learning-rate 0.01 \
    --prism-retain-weight 0.1 \
    --prism-independence-weight 0.05 \
    --prism-use-caa-init \
    --prism-normalize \
    --verbose

### Understanding PRISM Output

PRISM produces a manifold of directions per layer. The primary direction is typically the most effective.

In [None]:
# Load and inspect PRISM vectors
import glob

prism_files = glob.glob(f"{OUTPUT_DIR}/vectors/prism/*.pt")
print(f"PRISM vector files: {len(prism_files)}")

if prism_files:
    sample = torch.load(prism_files[0], weights_only=True)
    if isinstance(sample, dict):
        print(f"Keys: {sample.keys()}")
    else:
        print(f"Shape: {sample.shape}")

---

## Method 3: PULSE (Conditional Steering)

**PULSE** (Probabilistic Uncertainty-guided Layer Steering Engine) applies steering conditionally based on input content.

### Key Features:
- Learned gating decides WHEN to steer
- Per-layer scaling learns WHERE to steer most
- Entropy-based intensity modulation adjusts HOW MUCH to steer
- Only activates when input matches target behavior
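
The when/where/how-much decomposition can be sketched as a product of three factors. The formulas below are assumptions chosen for illustration and may differ from Wisent's implementation: the gate is a sigmoid over the cosine similarity between a hidden state and a condition vector, the entropy factor grows with the uncertainty of a probability distribution, and `layer_scale` stands in for a learned per-layer weight. The default values mirror the `--pulse-condition-threshold`, `--pulse-gate-temperature`, and `--pulse-max-alpha` settings used in this notebook.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def soft_gate(hidden, condition, threshold=0.5, temperature=0.1):
    """WHEN: near 1 if the hidden state resembles the condition vector, else near 0."""
    similarity = cosine(hidden, condition)
    return 1.0 / (1.0 + math.exp(-(similarity - threshold) / temperature))

def entropy_scale(probs, max_alpha=2.0):
    """HOW MUCH: normalized entropy in [0, 1] times max_alpha (needs len(probs) >= 2)."""
    entropy = sum(-p * math.log(p) for p in probs if p > 0)
    return max_alpha * entropy / math.log(len(probs))

def steering_strength(hidden, condition, probs, layer_scale):
    """Combine gate, entropy factor, and the per-layer scale (WHERE)."""
    return soft_gate(hidden, condition) * entropy_scale(probs) * layer_scale

condition = [1.0, 0.0]  # hypothetical condition vector
print(round(soft_gate([1.0, 0.0], condition), 3))  # input matching the condition
print(round(soft_gate([0.0, 1.0], condition), 3))  # unrelated input
print(steering_strength([1.0, 0.0], condition, [0.5, 0.5], layer_scale=0.8))
```

With these settings a matching input yields a gate close to 1 and an unrelated input a gate close to 0, so steering is effectively switched off for inputs that do not resemble the target behavior.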

In [None]:
# Train PULSE with conditional gating
!python -m wisent.core.main train-steering \
    {MODEL_NAME} \
    --pairs {OUTPUT_DIR}/pairs/contrastive_pairs.json \
    --output {OUTPUT_DIR}/vectors/pulse \
    --steering-method PULSE \
    --pulse-sensor-layer 12 \
    --pulse-steering-layers "8,9,10,11,12,13,14" \
    --pulse-per-layer-scaling \
    --pulse-condition-threshold 0.5 \
    --pulse-gate-temperature 0.1 \
    --pulse-learn-threshold \
    --pulse-use-entropy-scaling \
    --pulse-max-alpha 2.0 \
    --pulse-optimization-steps 100 \
    --pulse-normalize \
    --verbose

### Understanding PULSE Output

PULSE learns:
- **Condition vector**: Template for gating decisions
- **Per-layer scales**: Importance weights across layers
- **Optimal threshold**: Best cutoff for activation

In [None]:
# Load and inspect PULSE result
import glob  # repeated here so this cell also runs on its own
pulse_files = glob.glob(f"{OUTPUT_DIR}/vectors/pulse/*.pt")
print(f"PULSE vector files: {len(pulse_files)}")

# Check for metadata
metadata_file = f"{OUTPUT_DIR}/vectors/pulse/metadata.json"
if os.path.exists(metadata_file):
    with open(metadata_file) as f:
        metadata = json.load(f)
    print("\nPULSE Metadata:")
    print(f"  Optimal threshold: {metadata.get('optimal_threshold', 'N/A')}")
    print(f"  Per-layer scales: {metadata.get('per_layer_scales', 'N/A')}")

---

## Method 4: TITAN (Joint Optimization)

**TITAN** (Total Integrated Targeted Activation Navigation) is the most expressive of the four methods. It jointly optimizes the components below, trained with the loss terms listed after them.

### Components:
1. **Direction Manifold**: Multi-directional steering (like PRISM)
2. **Gating Network**: Learned MLP that predicts when to steer
3. **Intensity Network**: Predicts per-layer steering strength
4. **Direction Weights**: Learns optimal combination of manifold directions

### Loss Functions:
- Behavior loss: Steering effectiveness on positive examples
- Retain loss: Minimal effect on negative examples
- Sparse loss: Encourage sparse layer activation
- Smooth loss: Penalize abrupt intensity changes
- Independence loss: Diverse directions in manifold
- Gate discrimination loss: Gate should separate positive from negative examples
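
A joint objective of this kind is usually a weighted sum of its terms. The sketch below shows one way the sparse and smooth terms could be defined and combined with behavior and retain terms. The definitions (L1 penalty for sparsity, squared differences between adjacent layers for smoothness), the function names, and the `smooth` weight are assumptions for illustration; the behavior, retain, and sparse weights reuse the values passed on the command line in this notebook. The independence and gate discrimination terms are omitted for brevity.

```python
def sparse_loss(intensities):
    """L1-style penalty: mean absolute per-layer intensity."""
    return sum(abs(a) for a in intensities) / len(intensities)

def smooth_loss(intensities):
    """Mean squared difference between intensities of adjacent layers."""
    diffs = [(b - a) ** 2 for a, b in zip(intensities, intensities[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0

def total_loss(terms, weights):
    """Weighted sum of named loss terms; every term needs a weight."""
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical per-layer intensities for five steering layers
intensities = [0.0, 1.0, 1.0, 0.0, 0.0]

terms = {
    "behavior": 0.8,  # placeholder value for the behavior loss
    "retain": 0.5,    # placeholder value for the retain loss
    "sparse": sparse_loss(intensities),
    "smooth": smooth_loss(intensities),
}
weights = {"behavior": 1.0, "retain": 0.2, "sparse": 0.05, "smooth": 0.05}

print(total_loss(terms, weights))
```

Raising a weight makes the optimizer trade the other objectives for that one; for example, a larger `sparse` weight pushes more layer intensities toward zero.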

In [None]:
# Train TITAN with full configuration
!python -m wisent.core.main train-steering \
    {MODEL_NAME} \
    --pairs {OUTPUT_DIR}/pairs/contrastive_pairs.json \
    --output {OUTPUT_DIR}/vectors/titan \
    --steering-method TITAN \
    --titan-num-directions 5 \
    --titan-steering-layers "8,9,10,11,12,13,14,15" \
    --titan-sensor-layer 12 \
    --titan-gate-hidden-dim 128 \
    --titan-intensity-hidden-dim 64 \
    --titan-optimization-steps 200 \
    --titan-learning-rate 0.005 \
    --titan-behavior-weight 1.0 \
    --titan-retain-weight 0.2 \
    --titan-sparse-weight 0.05 \
    --titan-max-alpha 3.0 \
    --titan-normalize \
    --verbose

---

## Programmatic Usage

For more control, use the Python API directly.

In [None]:
# Import steering methods
from wisent.core.steering_methods import (
    CAAMethod,
    PRISMMethod,
    PULSEMethod,
    TITANMethod,
)
from wisent.core.steering_methods.registry import SteeringMethodRegistry

# List all available methods
print("Available steering methods:")
for info in SteeringMethodRegistry.get_method_info():
    print(f"\n{info['name'].upper()}:")
    print(f"  {info['description']}")
    print(f"  Default strength: {info['default_strength']}")
    print(f"  Strength range: {info['strength_range']}")

In [None]:
# Create method instances programmatically
caa = SteeringMethodRegistry.create_method_instance("caa", normalize=True)
prism = SteeringMethodRegistry.create_method_instance(
    "prism",
    num_directions=3,
    optimization_steps=100,
)
pulse = SteeringMethodRegistry.create_method_instance(
    "pulse",
    sensor_layer=12,
    per_layer_scaling=True,
)
titan = SteeringMethodRegistry.create_method_instance(
    "titan",
    num_directions=5,
    optimization_steps=200,
)

print("Method instances created:")
print(f"  CAA: {caa.name}")
print(f"  PRISM: {prism.name}")
print(f"  PULSE: {pulse.name}")
print(f"  TITAN: {titan.name}")

---

## Steering Optimization

The `optimize-steering` command supports all methods (CAA, PRISM, PULSE, TITAN).
Use it to find optimal layer, strength, and method-specific parameters.

In [None]:
# Optimize CAA (baseline)
!python -m wisent.core.main optimize-steering comprehensive \
    {MODEL_NAME} \
    --tasks {TASK} \
    --methods CAA \
    --limit 30 \
    --save-best-vector {OUTPUT_DIR}/optimized/caa \
    --verbose

In [None]:
# Optimize PRISM with multi-directional steering
!python -m wisent.core.main optimize-steering comprehensive \
    {MODEL_NAME} \
    --tasks {TASK} \
    --methods PRISM \
    --prism-num-directions 3 \
    --prism-optimization-steps 50 \
    --limit 30 \
    --save-best-vector {OUTPUT_DIR}/optimized/prism \
    --verbose

In [None]:
# Optimize PULSE with conditional steering
!python -m wisent.core.main optimize-steering comprehensive \
    {MODEL_NAME} \
    --tasks {TASK} \
    --methods PULSE \
    --pulse-sensor-layer 12 \
    --pulse-per-layer-scaling \
    --limit 30 \
    --save-best-vector {OUTPUT_DIR}/optimized/pulse \
    --verbose

In [None]:
# Optimize TITAN with joint optimization
!python -m wisent.core.main optimize-steering comprehensive \
    {MODEL_NAME} \
    --tasks {TASK} \
    --methods TITAN \
    --titan-num-directions 3 \
    --titan-optimization-steps 100 \
    --limit 30 \
    --save-best-vector {OUTPUT_DIR}/optimized/titan \
    --verbose

In [None]:
# Compare all methods at once
!python -m wisent.core.main optimize-steering comprehensive \
    {MODEL_NAME} \
    --tasks {TASK} \
    --methods CAA PRISM PULSE TITAN \
    --prism-num-directions 3 \
    --titan-num-directions 3 \
    --pulse-sensor-layer 12 \
    --limit 30 \
    --verbose

---

## TITAN Weight Modification

TITAN supports three application modes:

1. **Static**: Bake directions into weights (fast inference, no dynamic behavior)
2. **Dynamic**: Use runtime hooks only (flexible, some overhead)
3. **Hybrid**: Bake directions into weights and keep runtime hooks (combines static and dynamic behavior)
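
The difference between static and dynamic application can be shown with a toy layer. This sketch does not call `apply_titan_steering` or any other Wisent function; `ToyLayer`, `bake_static`, and `add_dynamic_hook` are hypothetical stand-ins that only illustrate the idea of folding a direction into parameters versus adding it at call time behind a gate.

```python
class ToyLayer:
    """Stand-in for a model layer: output = input + bias, then hooks run."""

    def __init__(self, dim):
        self.bias = [0.0] * dim
        self.hooks = []

    def __call__(self, x):
        out = [xi + bi for xi, bi in zip(x, self.bias)]
        for hook in self.hooks:
            out = hook(out)
        return out

def bake_static(layer, direction, strength):
    """Static mode: fold the steering vector into the layer's parameters once."""
    layer.bias = [b + strength * d for b, d in zip(layer.bias, direction)]

def add_dynamic_hook(layer, direction, strength, gate_fn):
    """Dynamic mode: add a gated steering vector per call; returns a remover."""
    def hook(out):
        gate = gate_fn(out)
        return [o + gate * strength * d for o, d in zip(out, direction)]
    layer.hooks.append(hook)
    return lambda: layer.hooks.remove(hook)

direction = [1.0, 0.0]  # hypothetical steering direction

static_layer = ToyLayer(dim=2)
bake_static(static_layer, direction, strength=1.0)

dynamic_layer = ToyLayer(dim=2)
remove = add_dynamic_hook(dynamic_layer, direction, strength=1.0,
                          gate_fn=lambda out: 1.0 if out[1] > 0 else 0.0)

print(static_layer([0.5, -0.5]))   # steered on every input
print(dynamic_layer([0.5, 0.5]))   # gate open: steered
print(dynamic_layer([0.5, -0.5]))  # gate closed: unchanged
remove()                           # restores the original layer behavior
```

The static layer pays no per-call cost but steers every input, whereas the hooked layer can decide per input and can be restored by removing the hook. A hybrid setup would combine both mechanisms on the same layer.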

In [None]:
# Example: TITAN weight modification (requires trained TITAN result)
print("""
# TITAN Weight Modification Example

from wisent.core.weight_modification import apply_titan_steering, TITANRuntimeHooks

# After training TITAN:
# titan_result = titan_method.train_titan(pair_set)

# Option 1: Static mode (bake into weights)
result = apply_titan_steering(
    model=model,
    titan_result=titan_result,
    mode="static",
    base_strength=1.0,
)
# model.save_pretrained("./titan_modified_model")

# Option 2: Dynamic mode (runtime hooks only)
result = apply_titan_steering(
    model=model,
    titan_result=titan_result,
    mode="dynamic",
    base_strength=1.0,
)
hooks = result["hooks"]

# Generate with dynamic gating
output = model.generate(input_ids)

# Check gate value
print(f"Gate: {hooks.get_current_gate()}")
print(f"Intensities: {hooks.get_current_intensities()}")

# Clean up
hooks.remove()

# Option 3: Hybrid mode (recommended for production)
result = apply_titan_steering(
    model=model,
    titan_result=titan_result,
    mode="hybrid",
    base_strength=1.0,
)
""")

---

## Method Comparison

| Method | Speed | Expressiveness | Conditional | Best Use Case |
|--------|-------|----------------|-------------|---------------|
| CAA | Fast | Low | No | Quick experiments, simple behaviors |
| PRISM | Medium | Medium | No | Complex behaviors, multiple aspects |
| PULSE | Medium | Medium-High | Yes | Context-dependent steering |
| TITAN | Slow | High | Yes | Production, maximum control |

---

## Tips and Best Practices

### Choosing a Method

1. **Start with CAA** for initial experiments - it's fast and gives you a baseline
2. **Use PRISM** when CAA isn't capturing all aspects of the behavior
3. **Use PULSE** when you only want steering on certain types of inputs
4. **Use TITAN** for production when you need maximum control

### Hyperparameter Tuning

**PRISM:**
- Start with `num_directions=3`, increase if behavior is complex
- Use `retain_weight=0.1` to prevent over-steering
- `independence_weight=0.05` prevents redundant directions

**PULSE:**
- Set `sensor_layer` to a middle-to-late layer (e.g., 60-75% of the way through the model's layers)
- Lower `gate_temperature` for sharper on/off behavior
- Enable `use_entropy_scaling` for uncertainty-aware intensity

**TITAN:**
- More `optimization_steps` generally improves convergence at the cost of training time (200-300 recommended)
- Balance `behavior_weight` and `retain_weight` (default ratio 5:1)
- `sparse_weight` encourages steering at fewer layers, which tends to limit side effects
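
Two of the PULSE tips above reduce to simple arithmetic, sketched here with hypothetical helper functions. `sensor_layer_range` turns the 60-75% guideline into layer indices for a given depth, and `gate_value` assumes a sigmoid gate of the form `sigmoid((similarity - threshold) / temperature)`, which may not match Wisent's exact gating function, to show how temperature changes sharpness.

```python
import math

def sensor_layer_range(num_layers, low=0.60, high=0.75):
    """Layer indices at roughly 60-75% of model depth, rounded to the nearest layer."""
    return round(low * num_layers), round(high * num_layers)

def gate_value(similarity, threshold, temperature):
    """Assumed sigmoid gate; lower temperature gives a sharper on/off transition."""
    return 1.0 / (1.0 + math.exp(-(similarity - threshold) / temperature))

# A 16-layer model is used here purely as an example depth
print(sensor_layer_range(16))

# Same input slightly above the threshold, evaluated at three temperatures
for temperature in (1.0, 0.1, 0.01):
    print(temperature, round(gate_value(0.6, 0.5, temperature), 3))
```

For the same small margin above the threshold, the gate moves from barely above one half at a high temperature to almost fully open at a low one.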

In [None]:
print("\n" + "="*60)
print("Advanced Steering Methods Demo Complete!")
print("="*60)
print(f"\nOutputs saved to: {OUTPUT_DIR}")
print("\nNext steps:")
print("1. Compare steering effectiveness across methods")
print("2. Tune hyperparameters for your specific use case")
print("3. Deploy using weight modification for production")