# Basics of Representation Engineering with Wisent CLI

This notebook shows how to control model behavior with **steering vectors** through the Wisent CLI.

## Key Concepts

- **Steering vectors** are directions in activation space that represent concepts or behaviors.
- **Contrastive Activation Addition (CAA)** estimates such a direction as a difference of mean activations: `steering_vector = mean(positive) - mean(negative)`.
- **Steering** adds the vector, scaled by a strength, to the model's residual stream during generation.
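
The CAA formula above can be sketched on toy data. This is a minimal pure-Python illustration, not the Wisent implementation: the helper names `caa_vector` and `steer` and the three-dimensional "activations" are assumptions made for the example.

```python
from statistics import fmean


def caa_vector(positive, negative):
    """Per-dimension mean of the positive activations minus mean of the negative ones."""
    dims = range(len(positive[0]))
    pos_mean = [fmean(row[d] for row in positive) for d in dims]
    neg_mean = [fmean(row[d] for row in negative) for d in dims]
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def steer(hidden, vector, strength):
    """Add the scaled steering vector to one hidden state (hypothetical helper)."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# Toy activations: two positive and two negative examples, three dimensions each.
positive = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
negative = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]

vector = caa_vector(positive, negative)
steered = steer([0.5, 0.5, 0.5], vector, strength=1.5)

print(vector)   # [2.0, 0.0, -2.0]
print(steered)  # [3.5, 0.5, -2.5]
```

Dimensions where the positive and negative examples agree (the middle one here) cancel out, so only the contrast between the two sets survives in the vector.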

In [None]:
import os

# Configuration
MODEL = "meta-llama/Llama-3.2-1B-Instruct"
OUTPUT_DIR = "./steering_outputs"
LAYER = 8

os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/vectors", exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/pairs", exist_ok=True)

print(f"Model: {MODEL}")
print(f"Output: {OUTPUT_DIR}")
print(f"Layer: {LAYER}")

## Step 1: Generate a Steering Vector from a Custom Trait

We'll create a "happy" steering vector using synthetic contrastive pairs. The CLI will:
1. Generate positive (happy) and negative (sad) response pairs
2. Collect activations from the model
3. Compute the steering vector using CAA

In [None]:
# Generate a "happy" steering vector from synthetic pairs
HAPPY_TRAIT = "happy, joyful, optimistic, cheerful responses full of positivity and enthusiasm"

!python -m wisent.core.main generate-vector-from-synthetic \
    --trait "{HAPPY_TRAIT}" \
    --model {MODEL} \
    --num-pairs 30 \
    --layers {LAYER} \
    --output {OUTPUT_DIR}/vectors/happy_vector.pt \
    --normalize \
    --verbose

## Step 2: Test Steering with Different Strengths

Use `multi-steer` to generate text with the steering vector applied.

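- **Positive strength** (e.g., 1.5) pushes generations toward "happy"
- **Negative strength** (e.g., -1.5) pushes generations in the opposite direction, toward "sad"
- **Zero strength** is equivalent to the baseline (no steering); the baseline cell below omits `--vector` entirely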
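
To see why the sign and size of the strength matter, the sketch below measures how far a hidden state moves along the steering direction. It assumes that `--normalize` produces a unit-length (L2-normalized) vector, which this notebook does not confirm; the helpers `l2_normalize` and `dot` are hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit L2 norm (assumed meaning of --normalize)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


unit = l2_normalize([3.0, 0.0, 4.0])
hidden = [1.0, 1.0, 1.0]

# Shift of the hidden state along the steering direction for each strength.
shifts = {}
for strength in (-1.5, 0.0, 1.5):
    steered = [h + strength * u for h, u in zip(hidden, unit)]
    shifts[strength] = dot(steered, unit) - dot(hidden, unit)
    print(f"strength={strength:+.1f} -> shift along vector = {shifts[strength]:+.2f}")
```

With a unit-length vector, the shift along the steering direction equals the strength, so strengths stay comparable across vectors under this assumption.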

In [None]:
# Baseline (no steering)
print("=" * 60)
print("BASELINE (No Steering)")
print("=" * 60)

!python -m wisent.core.main multi-steer \
    --model {MODEL} \
    --prompt "Today I woke up and felt" \
    --max-new-tokens 100

In [None]:
# Positive steering (toward happy)
print("=" * 60)
print("POSITIVE STEERING (strength=1.5 -> happy)")
print("=" * 60)

!python -m wisent.core.main multi-steer \
    --vector {OUTPUT_DIR}/vectors/happy_vector.pt:1.5 \
    --model {MODEL} \
    --layer {LAYER} \
    --prompt "Today I woke up and felt" \
    --max-new-tokens 100

In [None]:
# Negative steering (toward sad)
print("=" * 60)
print("NEGATIVE STEERING (strength=-1.5 -> sad)")
print("=" * 60)

!python -m wisent.core.main multi-steer \
    --vector {OUTPUT_DIR}/vectors/happy_vector.pt:-1.5 \
    --model {MODEL} \
    --layer {LAYER} \
    --prompt "Today I woke up and felt" \
    --max-new-tokens 100

## Step 3: Generate Steering Vector from a Task

Instead of custom traits, you can use benchmark tasks (e.g., TruthfulQA, LiveCodeBench) to create steering vectors.

In [None]:
# Generate steering vector from TruthfulQA task
!python -m wisent.core.main generate-vector-from-task \
    --task truthfulqa_gen \
    --model {MODEL} \
    --num-pairs 30 \
    --layers {LAYER} \
    --output {OUTPUT_DIR}/vectors/truthful_vector.pt \
    --normalize \
    --verbose

In [None]:
# Test the truthfulness steering vector
!python -m wisent.core.main multi-steer \
    --vector {OUTPUT_DIR}/vectors/truthful_vector.pt:1.5 \
    --model {MODEL} \
    --layer {LAYER} \
    --prompt "What happens if you crack your knuckles?" \
    --max-new-tokens 100

## Step 4: Step-by-Step Pipeline (More Control)

For more control, you can run each step separately:
1. Generate contrastive pairs
2. Collect activations
3. Create steering vector
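
The data flow through these three steps can be sketched on toy data. The dictionary layout below is hypothetical, not the format of the files written by the CLI, and reading `--token-aggregation final` as "keep the last token's activation" is an assumption.

```python
def final_token(per_token_activations):
    """Keep only the last token's activation vector (assumed 'final' aggregation)."""
    return per_token_activations[-1]


def mean_vector(rows):
    """Per-dimension mean over a list of equal-length vectors."""
    return [sum(column) / len(rows) for column in zip(*rows)]


# Step 1 (toy stand-in): contrastive pairs, each response holding one
# activation vector per token. Real pairs hold text; activations come in step 2.
pairs = [
    {"positive": [[0.0, 0.0], [2.0, 1.0]], "negative": [[0.0, 0.0], [0.0, 1.0]]},
    {"positive": [[9.0, 9.0], [4.0, 3.0]], "negative": [[9.0, 9.0], [2.0, 3.0]]},
]

# Step 2: reduce each response to a single vector.
pos = [final_token(pair["positive"]) for pair in pairs]
neg = [final_token(pair["negative"]) for pair in pairs]

# Step 3: CAA, the difference of the two means.
direction = [p - n for p, n in zip(mean_vector(pos), mean_vector(neg))]
print(direction)  # [2.0, 0.0]
```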

In [None]:
# Pipeline step 1 of 3: generate contrastive pairs from a task
!python -m wisent.core.main generate-pairs-from-task \
    truthfulqa_gen \
    --output {OUTPUT_DIR}/pairs/truthfulqa_pairs.json \
    --limit 30 \
    --verbose

In [None]:
# Examine the pairs
import json

with open(f"{OUTPUT_DIR}/pairs/truthfulqa_pairs.json") as f:
    pairs = json.load(f)

print(f"Loaded {pairs['num_pairs']} contrastive pairs")

# Look at the first pair only; long fields are truncated to 100 characters
first = pairs["pairs"][0]
print("\nExample pair:")
print(f"  Prompt: {first['prompt'][:100]}...")
print(f"  Positive: {first['positive_response']['model_response'][:100]}...")
print(f"  Negative: {first['negative_response']['model_response'][:100]}...")

In [None]:
# Pipeline step 2 of 3: collect activations for each pair
!python -m wisent.core.main get-activations \
    --pairs-file {OUTPUT_DIR}/pairs/truthfulqa_pairs.json \
    --model {MODEL} \
    --layers {LAYER} \
    --token-aggregation final \
    --output {OUTPUT_DIR}/pairs/truthfulqa_enriched.json \
    --verbose

In [None]:
# Pipeline step 3 of 3: create the steering vector from the enriched pairs
!python -m wisent.core.main create-steering-vector \
    --enriched-pairs-file {OUTPUT_DIR}/pairs/truthfulqa_enriched.json \
    --method caa \
    --normalize \
    --output {OUTPUT_DIR}/vectors/truthful_manual.pt \
    --verbose

## Step 5: Combine Multiple Steering Vectors

You can apply multiple steering vectors simultaneously by passing `--vector` more than once, each with its own `path:strength`.
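
One plausible reading of combined steering is a weighted sum of the individual vectors. The sketch below assumes exactly that; the notebook does not document how `multi-steer` merges vectors internally. `parse_spec` and `combine` are hypothetical helpers that mirror the `path:strength` form of the `--vector` argument.

```python
def parse_spec(spec):
    """Split a 'path:strength' string at the last colon."""
    path, _, strength = spec.rpartition(":")
    return path, float(strength)


def combine(weighted_vectors):
    """Weighted sum of (vector, strength) pairs of equal dimension."""
    combined = [0.0] * len(weighted_vectors[0][0])
    for vector, strength in weighted_vectors:
        for i, value in enumerate(vector):
            combined[i] += strength * value
    return combined


# Toy unit vectors standing in for the happy and formal directions.
happy = [1.0, 0.0, 0.0]
formal = [0.0, 1.0, 0.0]

_, happy_strength = parse_spec("vectors/happy_vector.pt:1.0")
_, formal_strength = parse_spec("vectors/formal_vector.pt:1.0")

combined = combine([(happy, happy_strength), (formal, formal_strength)])
print(combined)  # [1.0, 1.0, 0.0]
```

Under this reading, vectors that point in similar directions reinforce each other, so equal nominal strengths do not guarantee equal influence on the output.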

In [None]:
# Create a second trait vector
FORMAL_TRAIT = "formal, professional, sophisticated language with proper grammar and academic tone"

!python -m wisent.core.main generate-vector-from-synthetic \
    --trait "{FORMAL_TRAIT}" \
    --model {MODEL} \
    --num-pairs 30 \
    --layers {LAYER} \
    --output {OUTPUT_DIR}/vectors/formal_vector.pt \
    --normalize \
    --verbose

In [None]:
# Combine happy + formal vectors
print("=" * 60)
print("COMBINED STEERING (happy + formal)")
print("=" * 60)

!python -m wisent.core.main multi-steer \
    --vector {OUTPUT_DIR}/vectors/happy_vector.pt:1.0 \
    --vector {OUTPUT_DIR}/vectors/formal_vector.pt:1.0 \
    --model {MODEL} \
    --layer {LAYER} \
    --prompt "Tell me about your day" \
    --max-new-tokens 100

## Next Steps

Now that you understand the basics, explore:

1. **`abliteration.ipynb`** - Remove refusal behavior from models
2. **`coding_boost.ipynb`** - Improve coding ability with steering
3. **`personalization_synthetic.ipynb`** - Create personalized AI characters
4. **`advanced_steering_methods.ipynb`** - PRISM, PULSE, TITAN methods