# Multi-Directional Steering with DAC - CLI Demo

This notebook demonstrates two-direction steering with DAC using synthetic data.

**Key features:**
- Trains two separate steering vectors (Italian and French)  
- Shows how to combine them with different weights
- Demonstrates dynamic multi-directional steering

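**Note:** The Llama-3.1-8B model takes ~15-30 seconds to load. For faster testing, you can change MODEL to "distilgpt2" in the parameters cell. If you do, also lower LAYER (distilgpt2 has only 6 transformer layers) and regenerate the steering vectors, because vectors trained on Llama-3.1-8B are 4096-dimensional and will not match distilgpt2's 768-dimensional hidden state.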

In [7]:
# Define all parameters
MODEL = "/workspace/models/llama31-8b-instruct-hf"
LAYER = 16
NUM_PAIRS = 10  # Reduced for faster testing

# First trait - Change this to any trait you want!
TRAIT_1 = "italian"  # e.g., "concise", "italian", "formal", "technical", etc.
PAIRS_FILE_1 = f"synthetic_pairs_{TRAIT_1}_test.json"
VECTOR_FILE_1 = f"steering_vector_{TRAIT_1}_test.pt"

# Second trait - Change this to any trait you want!
TRAIT_2 = "french"  # e.g., "creative", "french", "casual", "simple", etc.
PAIRS_FILE_2 = f"synthetic_pairs_{TRAIT_2}_test.json"
VECTOR_FILE_2 = f"steering_vector_{TRAIT_2}_test.pt"

# Test prompts
TEST_PROMPTS = ["Tell me about pizza", "What is the meaning of life?"]

# Max tokens for generation
MAX_TOKENS = 50

In [8]:
# Check whether the steering vectors exist; if not, print the CLI commands that generate them
import os

# Check if vectors already exist
if os.path.exists(VECTOR_FILE_1) and os.path.exists(VECTOR_FILE_2):
    print("✅ Steering vectors already exist!")
    print(f"   {VECTOR_FILE_1}")
    print(f"   {VECTOR_FILE_2}")

    # Report the vector shapes so they can be checked against the model's hidden size
    import torch

    vec1 = torch.load(VECTOR_FILE_1, map_location="cpu")
    vec2 = torch.load(VECTOR_FILE_2, map_location="cpu")

    print(f"\n📊 Vector details:")
    print(f"   {TRAIT_1.capitalize()} vector shape: {vec1['steering_vector'].shape}")
    print(f"   {TRAIT_2.capitalize()} vector shape: {vec2['steering_vector'].shape}")
    print(f"   Vector dimension: {vec1['steering_vector'].shape[-1]} (Llama-3.1-8B hidden size is 4096)")
else:
    print("❌ Vectors not found. Please generate them first using:")
    print(f"   python -m wisent_guard generate-pairs --trait {TRAIT_1} --output {PAIRS_FILE_1}")
    print(f"   python -m wisent_guard generate-pairs --trait {TRAIT_2} --output {PAIRS_FILE_2}")
    print(
        f"   python -m wisent_guard generate-vector --from-pairs {PAIRS_FILE_1} --model '{MODEL}' --layer {LAYER} --output {VECTOR_FILE_1}"
    )
    print(
        f"   python -m wisent_guard generate-vector --from-pairs {PAIRS_FILE_2} --model '{MODEL}' --layer {LAYER} --output {VECTOR_FILE_2}"
    )

❌ Vectors not found. Please generate them first using:
   python -m wisent_guard generate-pairs --trait italian --output synthetic_pairs_italian_test.json
   python -m wisent_guard generate-pairs --trait french --output synthetic_pairs_french_test.json
   python -m wisent_guard generate-vector --from-pairs synthetic_pairs_italian_test.json --model '/workspace/models/llama31-8b-instruct-hf' --layer 16 --output steering_vector_italian_test.pt
   python -m wisent_guard generate-vector --from-pairs synthetic_pairs_french_test.json --model '/workspace/models/llama31-8b-instruct-hf' --layer 16 --output steering_vector_french_test.pt
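The cell above only prints the commands. A minimal sketch of assembling the same four commands as argument lists, which could then be handed to `subprocess.run`, might look like the following. The subcommands and flags are copied from the messages printed above and have not been verified against the `wisent_guard` CLI; `build_commands` is a hypothetical helper, and the actual execution line is left commented out.

```python
import shlex
import sys


def build_commands(model, layer, traits):
    """Return the generate-pairs / generate-vector commands as argument lists.

    The flags mirror the messages printed by the notebook; they are
    assumptions, not a verified description of the wisent_guard CLI.
    """
    base = [sys.executable, "-m", "wisent_guard"]
    pair_cmds, vector_cmds = [], []
    for trait in traits:
        pairs_file = f"synthetic_pairs_{trait}_test.json"
        vector_file = f"steering_vector_{trait}_test.pt"
        pair_cmds.append(
            base + ["generate-pairs", "--trait", trait, "--output", pairs_file]
        )
        vector_cmds.append(
            base
            + ["generate-vector", "--from-pairs", pairs_file]
            + ["--model", model, "--layer", str(layer), "--output", vector_file]
        )
    # All pairs files first, then the vectors that depend on them.
    return pair_cmds + vector_cmds


commands = build_commands(
    "/workspace/models/llama31-8b-instruct-hf", 16, ["italian", "french"]
)
for cmd in commands:
    print(shlex.join(cmd))  # shell-quoted form, safe to copy and paste
    # import subprocess; subprocess.run(cmd, check=True)  # uncomment to execute
```

Passing a list rather than a single string avoids shell quoting problems when the model path contains spaces.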


In [None]:
# Multi-Trait Evaluation of Individual Steering Vectors
import sys
import os

sys.path.insert(0, "/workspace/wisent-guard")

from wisent_guard.core.evaluate import SinglePromptEvaluator, MultiTraitEvaluationResult

print("MULTI-TRAIT EVALUATION OF INDIVIDUAL STEERING")
print("=" * 80)
print("Quantitative analysis using our evaluation system")

# Suppress transformers warnings
os.environ["TRANSFORMERS_VERBOSITY"] = "error"

# Initialize evaluator
print(f"\n🤖 Initializing evaluator with model: {MODEL}")
evaluator = SinglePromptEvaluator(model_name=MODEL, verbose=True)

# Load both steering methods
print(f"\n🔄 Loading steering vectors...")
try:
    print(f"   Loading {TRAIT_1} vector from {VECTOR_FILE_1}")
    italian_steering, layer1 = evaluator.load_steering_vector(VECTOR_FILE_1)
    print(f"   ✅ {TRAIT_1.capitalize()} steering loaded for layer {layer1}")

    print(f"   Loading {TRAIT_2} vector from {VECTOR_FILE_2}")
    french_steering, layer2 = evaluator.load_steering_vector(VECTOR_FILE_2)
    print(f"   ✅ {TRAIT_2.capitalize()} steering loaded for layer {layer2}")

    if layer1 != layer2:
        print(f"   ⚠️  Warning: Layers differ ({layer1} vs {layer2}), using {LAYER}")

except Exception as e:
    print(f"   ❌ Failed to load steering vectors: {e}")
    print("   Please ensure vectors have been generated in previous cells")
    italian_steering = french_steering = None

# Define trait descriptions for evaluation
trait_descriptions = {
    TRAIT_1: "Uses Italian language, cultural references, mentions Italian food and places",
    TRAIT_2: "Uses French language, cultural references, mentions French food and culture",
}

print(f"\n📊 Evaluation Setup:")
print(f"   Traits: {TRAIT_1} & {TRAIT_2}")
print(f"   Prompts: {len(TEST_PROMPTS)} test questions")
print(f"   Steering strength: 1.0")
print(f"   Max tokens: {MAX_TOKENS}")

# Storage for all results
all_results = []

if italian_steering is not None and french_steering is not None:
    # Evaluate each prompt with both steering methods
    for i, prompt in enumerate(TEST_PROMPTS, 1):
        print(f"\n" + "─" * 80)
        print(f"📝 PROMPT {i}: {prompt}")
        print("─" * 80)

        # Test Italian steering
        print(f"\n🇮🇹 Evaluating {TRAIT_1.upper()} steering:")
        italian_result = evaluator.evaluate_multiple_traits(
            prompt=prompt,
            steering_method=italian_steering,
            layer=LAYER,
            traits=[TRAIT_1, TRAIT_2],
            trait_descriptions=trait_descriptions,
            steering_strength=1.0,
            max_new_tokens=MAX_TOKENS,
            steering_method_name=f"{TRAIT_1}_steering",
        )

        # Test French steering
        print(f"\n🇫🇷 Evaluating {TRAIT_2.upper()} steering:")
        french_result = evaluator.evaluate_multiple_traits(
            prompt=prompt,
            steering_method=french_steering,
            layer=LAYER,
            traits=[TRAIT_1, TRAIT_2],
            trait_descriptions=trait_descriptions,
            steering_strength=1.0,
            max_new_tokens=MAX_TOKENS,
            steering_method_name=f"{TRAIT_2}_steering",
        )

        # Store results for analysis
        all_results.append({"prompt": prompt, "italian_result": italian_result, "french_result": french_result})

        # Display comparative results
        print(f"\n📊 COMPARATIVE ANALYSIS:")
        print(f"{'Method':<12} {'Italian':<8} {'French':<8} {'Quality':<8} {'Similar':<8} {'Response Preview'}")
        print("-" * 80)

        # Italian steering results
        italian_score = italian_result.get_trait_score(TRAIT_1) or 0.0
        italian_french = italian_result.get_trait_score(TRAIT_2) or 0.0
        print(
            f"{'Italian':<12} {italian_score:+7.3f} {italian_french:+7.3f} "
            f"{italian_result.answer_quality:7.3f} {italian_result.steered_vs_unsteered_similarity:7.3f} "
            f"{italian_result.response[:25]}..."
        )

        # French steering results
        french_italian = french_result.get_trait_score(TRAIT_1) or 0.0
        french_score = french_result.get_trait_score(TRAIT_2) or 0.0
        print(
            f"{'French':<12} {french_italian:+7.3f} {french_score:+7.3f} "
            f"{french_result.answer_quality:7.3f} {french_result.steered_vs_unsteered_similarity:7.3f} "
            f"{french_result.response[:25]}..."
        )

        # Show trait specificity
        italian_specificity = italian_score - italian_french
        french_specificity = french_score - french_italian
        print(f"\n💡 Steering Specificity:")
        print(f"   {TRAIT_1.capitalize()} steering specificity: {italian_specificity:+.3f} (target - off-target)")
        print(f"   {TRAIT_2.capitalize()} steering specificity: {french_specificity:+.3f} (target - off-target)")

else:
    print("❌ Skipping evaluation due to missing steering vectors")
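Specificity, as printed in the cell above, is the target-trait score minus the off-target score for a given steering vector. A minimal sketch of that calculation on plain numbers follows. `TraitScores` is a hypothetical container and the values are made up for illustration; they are not results from this notebook.

```python
from dataclasses import dataclass


@dataclass
class TraitScores:
    """Scores for one steered response (hypothetical container)."""

    target: float      # score on the trait the vector was trained for
    off_target: float  # score on the other trait

    @property
    def specificity(self) -> float:
        # Positive: the vector moves its own trait more than the other one.
        return self.target - self.off_target


# Made-up illustrative values, not measured results.
scores_a = TraitScores(target=0.75, off_target=0.25)
scores_b = TraitScores(target=0.5, off_target=0.5)

print(f"{scores_a.specificity:+.3f}")  # +0.500
print(f"{scores_b.specificity:+.3f}")  # +0.000
```

A specificity near zero means the vector raises both traits about equally, so it does not separate the two languages even when the target score itself is high.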

In [None]:
# Advanced Multi-Directional Steering Analysis
if all_results:
    print(f"\n" + "=" * 80)
    print("COMPREHENSIVE STEERING EFFECTIVENESS ANALYSIS")
    print("=" * 80)

    # Calculate averages across all prompts
    italian_scores = {"italian": [], "french": [], "quality": [], "similarity": []}
    french_scores = {"italian": [], "french": [], "quality": [], "similarity": []}

    for result in all_results:
        # Italian steering averages
        italian_scores["italian"].append(result["italian_result"].get_trait_score(TRAIT_1) or 0.0)
        italian_scores["french"].append(result["italian_result"].get_trait_score(TRAIT_2) or 0.0)
        italian_scores["quality"].append(result["italian_result"].answer_quality)
        italian_scores["similarity"].append(result["italian_result"].steered_vs_unsteered_similarity)

        # French steering averages
        french_scores["italian"].append(result["french_result"].get_trait_score(TRAIT_1) or 0.0)
        french_scores["french"].append(result["french_result"].get_trait_score(TRAIT_2) or 0.0)
        french_scores["quality"].append(result["french_result"].answer_quality)
        french_scores["similarity"].append(result["french_result"].steered_vs_unsteered_similarity)

    # Calculate averages
    italian_avg = {k: sum(v) / len(v) for k, v in italian_scores.items()}
    french_avg = {k: sum(v) / len(v) for k, v in french_scores.items()}

    print(f"\n📈 AVERAGE PERFORMANCE ACROSS ALL PROMPTS:")
    print(
        f"{'Steering Method':<20} {'Target Trait':<12} {'Off-Target':<12} {'Quality':<8} {'Similarity':<10} {'Specificity':<12}"
    )
    print("-" * 90)

    italian_specificity = italian_avg["italian"] - italian_avg["french"]
    french_specificity = french_avg["french"] - french_avg["italian"]

    print(
        f"{f'{TRAIT_1.capitalize()} Steering':<20} {italian_avg['italian']:+11.3f} {italian_avg['french']:+11.3f} "
        f"{italian_avg['quality']:7.3f} {italian_avg['similarity']:9.3f} {italian_specificity:+11.3f}"
    )
    print(
        f"{f'{TRAIT_2.capitalize()} Steering':<20} {french_avg['french']:+11.3f} {french_avg['italian']:+11.3f} "
        f"{french_avg['quality']:7.3f} {french_avg['similarity']:9.3f} {french_specificity:+11.3f}"
    )

    print(f"\n🎯 STEERING EFFECTIVENESS METRICS:")
    print(f"   Best target trait achievement: {max(italian_avg['italian'], french_avg['french']):+.3f}")
    print(f"   Lowest off-target interference: {min(abs(italian_avg['french']), abs(french_avg['italian'])):+.3f}")
    print(f"   Highest trait specificity: {max(italian_specificity, french_specificity):+.3f}")
    print(f"   Average answer quality preservation: {(italian_avg['quality'] + french_avg['quality']) / 2:.3f}")
    print(
        f"   Average steering impact (1-similarity): {1 - (italian_avg['similarity'] + french_avg['similarity']) / 2:.3f}"
    )

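    print(f"\n🔍 CROSS-TRAIT ANALYSIS:")
    # Mean product of the two trait scores under TRAIT_1 steering.
    # This is an uncentered co-occurrence measure, not a Pearson correlation.
    mean_score_product = sum(
        (result["italian_result"].get_trait_score(TRAIT_1) or 0.0)
        * (result["italian_result"].get_trait_score(TRAIT_2) or 0.0)
        for result in all_results
    ) / len(all_results)
    print(f"   Mean {TRAIT_1} x {TRAIT_2} score product ({TRAIT_1} steering): {mean_score_product:+.3f}")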

    # Check if steering methods interfere with each other
    italian_french_interference = abs(italian_avg["french"])
    french_italian_interference = abs(french_avg["italian"])
    print(f"   {TRAIT_1.capitalize()} → {TRAIT_2} interference: {italian_french_interference:.3f}")
    print(f"   {TRAIT_2.capitalize()} → {TRAIT_1} interference: {french_italian_interference:.3f}")

    if italian_french_interference < 0.2 and french_italian_interference < 0.2:
        print("   ✅ Low cross-trait interference - good steering isolation")
    elif italian_french_interference > 0.5 or french_italian_interference > 0.5:
        print("   ⚠️  High cross-trait interference - consider refinement")
    else:
        print("   📊 Moderate cross-trait interference - within acceptable range")

    print(f"\n💡 KEY INSIGHTS:")
    if italian_specificity > french_specificity:
        print(f"   • {TRAIT_1.capitalize()} steering shows higher specificity ({italian_specificity:+.3f})")
    else:
        print(f"   • {TRAIT_2.capitalize()} steering shows higher specificity ({french_specificity:+.3f})")

    if italian_avg["quality"] > french_avg["quality"]:
        print(f"   • {TRAIT_1.capitalize()} steering better preserves answer quality ({italian_avg['quality']:.3f})")
    else:
        print(f"   • {TRAIT_2.capitalize()} steering better preserves answer quality ({french_avg['quality']:.3f})")

    if italian_avg["similarity"] > french_avg["similarity"]:
        print(
            f"   • {TRAIT_1.capitalize()} steering has subtler impact (higher similarity: {italian_avg['similarity']:.3f})"
        )
    else:
        print(
            f"   • {TRAIT_2.capitalize()} steering has subtler impact (higher similarity: {french_avg['similarity']:.3f})"
        )

    print(f"\n🔄 SIMULATED MULTI-DIRECTIONAL STEERING:")
    print("   (This demonstrates the concept - in practice you would combine and evaluate vectors)")
    print()

    combinations = [
        (0.5, 0.5, "Balanced (50/50)"),
        (0.7, 0.3, f"{TRAIT_1}-dominant (70/30)"),
        (0.3, 0.7, f"{TRAIT_2}-dominant (30/70)"),
    ]

    print(f"{'Combination':<20} {'Est. Italian':<12} {'Est. French':<12} {'Est. Quality':<12}")
    print("-" * 60)

    for w1, w2, desc in combinations:
        # Linear interpolation estimates (simplified model)
        est_italian = w1 * italian_avg["italian"] + w2 * french_avg["italian"]
        est_french = w1 * italian_avg["french"] + w2 * french_avg["french"]
        est_quality = w1 * italian_avg["quality"] + w2 * french_avg["quality"]

        print(f"{desc:<20} {est_italian:+11.3f} {est_french:+11.3f} {est_quality:11.3f}")

    print(f"\n🚀 NEXT STEPS:")
    print("   1. Test actual combined steering vectors with our evaluation system")
    print("   2. Optimize weight ratios for specific use cases")
    print("   3. Evaluate on more diverse prompts for robustness")
    print("   4. Consider trait interaction effects in combined steering")
    print("   5. Implement dynamic weight adjustment based on context")

else:
    print("❌ No results available for comprehensive analysis")
    print("   Please run the previous cell first to generate evaluation data")
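The cross-trait figure computed in the cell above is the mean of the products of the two trait scores, which is neither mean-centered nor normalized. If a correlation in the usual sense is wanted, a Pearson coefficient over the per-prompt scores is one option. The sketch below uses plain Python lists. `pearson` is a hypothetical helper and the scores are made-up values, not evaluator output. With only two test prompts the coefficient is always ±1 or undefined, so it needs a larger prompt set to be informative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences; None if undefined."""
    if len(xs) != len(ys) or len(xs) < 2:
        return None
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # a constant series has no defined correlation
    return cov / math.sqrt(var_x * var_y)


# Made-up per-prompt scores for illustration only.
italian_trait = [0.8, 0.6, 0.7, 0.9]
french_trait = [0.1, 0.3, 0.2, 0.0]

mean_product = sum(a * b for a, b in zip(italian_trait, french_trait)) / len(
    italian_trait
)
print(f"mean product: {mean_product:+.3f}")
print(f"pearson:      {pearson(italian_trait, french_trait):+.3f}")
```

In this example the mean product is positive even though the two series are perfectly anti-correlated, so the two measures should not be read interchangeably.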

In [6]:
# Test steering generation using the vectors directly
import torch
import os

# Since the CLI steering is not fully working yet, let's demonstrate the concept
print("Demonstrating multi-directional steering concept:")
print("=" * 80)

if os.path.exists(VECTOR_FILE_1) and os.path.exists(VECTOR_FILE_2):
    # Load vectors
    vec1_data = torch.load(VECTOR_FILE_1, map_location="cpu")
    vec2_data = torch.load(VECTOR_FILE_2, map_location="cpu")

    print(f"\n✅ Successfully loaded {TRAIT_1} vector:")
    print(f"   Shape: {vec1_data['steering_vector'].shape}")
    print(f"   Norm: {vec1_data['steering_vector'].norm():.4f}")

    print(f"\n✅ Successfully loaded {TRAIT_2} vector:")
    print(f"   Shape: {vec2_data['steering_vector'].shape}")
    print(f"   Norm: {vec2_data['steering_vector'].norm():.4f}")

    # Demonstrate vector combination
    print("\n" + "=" * 80)
    print("MULTI-DIRECTIONAL STEERING DEMONSTRATION:")
    print("=" * 80)

    # Show different weight combinations
    weights = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (0.7, 0.3), (0.3, 0.7)]

    for w1, w2 in weights:
        combined = w1 * vec1_data["steering_vector"] + w2 * vec2_data["steering_vector"]
        print(f"\nWeights: {w1:.1f} * {TRAIT_1} + {w2:.1f} * {TRAIT_2}")
        print(f"Combined vector norm: {combined.norm():.4f}")

        # Label the mix using the configured trait names rather than hard-coded ones
        name_1, name_2 = TRAIT_1.capitalize(), TRAIT_2.capitalize()
        if w2 == 0.0:
            print(f"→ Pure {name_1} steering")
        elif w1 == 0.0:
            print(f"→ Pure {name_2} steering")
        elif w1 == w2:
            print(f"→ Balanced {name_1}-{name_2} mix")
        elif w1 > w2:
            print(f"→ More {name_1} than {name_2}")
        else:
            print(f"→ More {name_2} than {name_1}")

    print("\n" + "=" * 80)
    print("CONCEPT EXPLANATION:")
    print("=" * 80)
    print("\nMulti-directional steering works by:")
    print("1. Training separate steering vectors for each trait")
    print("2. Combining them dynamically at inference time")
    print("3. Using weighted arithmetic: combined = α₁*v₁ + α₂*v₂")
    print("\nThis allows fine-grained control over model behavior,")
    print("steering it in multiple directions simultaneously!")

else:
    print("❌ Steering vectors not found. Please run the previous cells first.")
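The norm of a combined vector depends on how aligned the two steering vectors are: ‖w₁v₁ + w₂v₂‖² = w₁²‖v₁‖² + w₂²‖v₂‖² + 2w₁w₂(v₁·v₂). That identity can be inverted to recover the cosine similarity from printed norms alone. The sketch below plugs in the norms reported by this notebook's run (5.8207 and 5.5323 for the individual vectors, 5.5649 for the 50/50 mix), assuming they are accurate to four decimals. `cosine_from_norms` and `combined_norm` are hypothetical helper names.

```python
import math


def cosine_from_norms(norm1, norm2, mix_norm, w1, w2):
    """Invert |w1*v1 + w2*v2|^2 = w1^2|v1|^2 + w2^2|v2|^2 + 2*w1*w2*(v1.v2)."""
    dot = (mix_norm**2 - (w1 * norm1) ** 2 - (w2 * norm2) ** 2) / (2 * w1 * w2)
    return dot / (norm1 * norm2)


def combined_norm(norm1, norm2, cosine, w1, w2):
    """Predict the norm of w1*v1 + w2*v2 from the two norms and their cosine."""
    dot = cosine * norm1 * norm2
    return math.sqrt((w1 * norm1) ** 2 + (w2 * norm2) ** 2 + 2 * w1 * w2 * dot)


# Norms as printed by the notebook run (rounded to 4 decimals).
NORM_1, NORM_2, NORM_MIX = 5.8207, 5.5323, 5.5649

cos = cosine_from_norms(NORM_1, NORM_2, NORM_MIX, 0.5, 0.5)
print(f"cosine similarity ≈ {cos:.3f}")

# Cross-check against the other printed mixes (5.6415 and 5.5242).
print(f"predicted 0.7/0.3 norm: {combined_norm(NORM_1, NORM_2, cos, 0.7, 0.3):.4f}")
print(f"predicted 0.3/0.7 norm: {combined_norm(NORM_1, NORM_2, cos, 0.3, 0.7):.4f}")
```

The recovered cosine is roughly 0.92, and the predicted norms for the 70/30 and 30/70 mixes agree with the printed values to about three decimals. A cosine this high suggests the two vectors point in largely the same direction, which is worth keeping in mind when interpreting weighted mixes as independent controls.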

Demonstrating multi-directional steering concept:

✅ Successfully loaded italian vector:
   Shape: torch.Size([1, 4096])
   Norm: 5.8207

✅ Successfully loaded french vector:
   Shape: torch.Size([1, 4096])
   Norm: 5.5323

MULTI-DIRECTIONAL STEERING DEMONSTRATION:

Weights: 1.0 * italian + 0.0 * french
Combined vector norm: 5.8207
→ Pure Italian steering

Weights: 0.0 * italian + 1.0 * french
Combined vector norm: 5.5323
→ Pure French steering

Weights: 0.5 * italian + 0.5 * french
Combined vector norm: 5.5649
→ Balanced Italian-French mix

Weights: 0.7 * italian + 0.3 * french
Combined vector norm: 5.6415
→ More Italian than French

Weights: 0.3 * italian + 0.7 * french
Combined vector norm: 5.5242
→ More French than Italian

CONCEPT EXPLANATION:

Multi-directional steering works by:
1. Training separate steering vectors for each trait
2. Combining them dynamically at inference time
3. Using weighted arithmetic: combined = α₁*v₁ + α₂*v₂

This allows fine-grained control over model be