### The Mathematical Structure

In a transformer model, **activations** are the intermediate representations computed at each layer. When you pass text through a model, each token gets transformed into a high-dimensional vector at every layer.

For a given input:
- **Input tokens**: `[token_1, token_2, ..., token_n]`
- **At each layer L**, we get activations of shape: `[batch_size, sequence_length, hidden_dim]`

For example, with Llama-3.2-1B:
- `batch_size = 1` (single prompt)
- `sequence_length = n` (number of tokens)
- `hidden_dim = 2048` (the model's hidden dimension)

So each token at each layer is represented as a 2048-dimensional vector!

In [None]:
import torch
import numpy as np
from wisent.core.models.wisent_model import WisentModel

# Configuration
MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"

# Load model and tokenizer using WisentModel for consistent settings
print(f"Loading {MODEL_NAME}...")
wisent_model = WisentModel(model_name=MODEL_NAME)
model = wisent_model.hf_model
tokenizer = wisent_model.tokenizer

print(f"Model loaded!")
print(f"Number of layers: {model.config.num_hidden_layers}")
print(f"Hidden dimension: {model.config.hidden_size}")

### Visualizing Activation Shape

Let's pass a simple prompt through the model and examine the activation tensors.

In [None]:
# A simple prompt
prompt = "The capital of France is"

# Tokenize
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
print(f"Input prompt: '{prompt}'")
print(f"Token IDs: {inputs.input_ids[0].tolist()}")
print(f"Tokens: {[tokenizer.decode([t]) for t in inputs.input_ids[0]]}")
print(f"Number of tokens: {inputs.input_ids.shape[1]}")

In [None]:
# Forward pass to get hidden states
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states is a tuple: (embedding_output, layer_1, layer_2, ..., layer_n)
hidden_states = outputs.hidden_states

print(f"Number of hidden state tensors: {len(hidden_states)}")
print(f"(This is 1 embedding layer + {len(hidden_states)-1} transformer layers)")
print()

# Examine a specific layer's activations
layer_idx = 8  # Middle layer
layer_activations = hidden_states[layer_idx]

print(f"Layer {layer_idx} activations:")
print(f"  Shape: {layer_activations.shape}")
print(f"  - Batch size: {layer_activations.shape[0]}")
print(f"  - Sequence length: {layer_activations.shape[1]}")
print(f"  - Hidden dimension: {layer_activations.shape[2]}")

In representation engineering, we often care about activations at specific positions:

- **Last token**: The most common choice - represents the "state" of the model after processing the full context
- **Mean pooling**: Average across all tokens for a more holistic representation
- **Specific positions**: Sometimes we want activations at particular token positions
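
When several prompts are batched together, padding changes what "last token" and "mean" should refer to. The sketch below is one way to handle this, assuming right-padded batches and a 0/1 attention mask like the one the tokenizer returns; `pool_activations` is a hypothetical helper, not part of Wisent or Transformers.

```python
import torch

def pool_activations(hidden, attention_mask):
    """Pool [batch, seq, hidden] activations using a [batch, seq] 0/1 mask.

    Assumes right-padded batches (padding at the end of each sequence).
    """
    mask = attention_mask.to(hidden.dtype)
    # Index of the last real (non-padding) token in each sequence
    last_idx = attention_mask.sum(dim=1) - 1                 # [batch]
    batch_idx = torch.arange(hidden.shape[0])
    last_token = hidden[batch_idx, last_idx]                 # [batch, hidden]
    # Mean over real tokens only; padded positions contribute zero
    summed = (hidden * mask.unsqueeze(-1)).sum(dim=1)        # [batch, hidden]
    mean_pooled = summed / mask.sum(dim=1, keepdim=True)
    return last_token, mean_pooled

# Toy batch: 2 sequences, 4 positions, hidden size 3.
# The second sequence has two padding positions at the end.
hidden = torch.arange(24, dtype=torch.float32).reshape(2, 4, 3)
attention_mask = torch.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
last_token, mean_pooled = pool_activations(hidden, attention_mask)
print(last_token)
print(mean_pooled)
```

With a single unpadded prompt, as in the rest of this notebook, indexing with `-1` and calling `.mean(dim=0)` gives the same result.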

### Getting the Last Token Activation

In [None]:
# Get activation at the last token position for layer 8
layer_idx = 8
last_token_idx = -1  # Last position

# Extract: [batch, seq, hidden] -> [hidden] for last token
last_token_activation = hidden_states[layer_idx][0, last_token_idx, :]

print(f"Last token activation for layer {layer_idx}:")
print(f"  Shape: {last_token_activation.shape}")
print(f"  This is a {last_token_activation.shape[0]}-dimensional vector!")
print()
print(f"  First 10 values: {last_token_activation[:10].tolist()}")
print(f"  L2 Norm: {torch.norm(last_token_activation).item():.4f}")
print(f"  Mean: {last_token_activation.mean().item():.4f}")
print(f"  Std: {last_token_activation.std().item():.4f}")

### Mean Pooling Across Tokens

In [None]:
# Mean pooling: average across all token positions
mean_activation = hidden_states[layer_idx][0].mean(dim=0)  # [seq, hidden] -> [hidden]

print(f"Mean-pooled activation for layer {layer_idx}:")
print(f"  Shape: {mean_activation.shape}")
print(f"  L2 Norm: {torch.norm(mean_activation).item():.4f}")

# Compare with last token
cosine_sim = torch.nn.functional.cosine_similarity(
    last_token_activation.unsqueeze(0),
    mean_activation.unsqueeze(0)
).item()
print(f"\nCosine similarity between last-token and mean-pooled: {cosine_sim:.4f}")

### Using Wisent's LayerActivations

Wisent provides the `LayerActivations` class to organize activations by layer name.

In [None]:
from wisent.core.activations.core.atoms import LayerActivations, ActivationAggregationStrategy

# Create a LayerActivations object manually
# In practice, wisent's ActivationsCollector does this for you
layer_activations_dict = {
    f"layer_{i}": hidden_states[i][0, -1, :]  # Last token for each layer
    for i in range(1, len(hidden_states))  # Skip embedding layer
}

activations = LayerActivations(layer_activations_dict)

print(f"LayerActivations object:")
print(f"  Number of layers: {len(activations)}")
print(f"  Layer names: {list(activations.keys())[:5]}... (showing first 5)")
print(f"  Layer 8 shape: {activations['layer_8'].shape}")

### The Core Idea

A **steering vector** is a direction in activation space that represents a concept or behavior. The working hypothesis behind representation engineering is:

> **Many concepts are encoded, at least approximately, as directions in the model's activation space**

For example:
- There's a direction for "happy" vs "sad"
- There's a direction for "formal" vs "casual"
- There's a direction for "verbose" vs "concise"

### Contrastive Activation Addition (CAA)

A simple and widely used method for finding these directions is **CAA**:

```
steering_vector = mean(positive_activations) - mean(negative_activations)
```

Where:
- **Positive activations**: Activations from prompts/completions exhibiting the desired trait
- **Negative activations**: Activations from prompts/completions exhibiting the opposite trait
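
To build intuition for why a difference of means isolates a direction, here is a small sketch on synthetic data. Everything in it is made up for illustration: a hypothetical "concept" direction is planted in random vectors, along with a large component shared by all samples, and the mean difference is checked against the planted direction.

```python
import torch

torch.manual_seed(0)
hidden_dim, n_samples = 64, 200

# Hypothetical concept direction planted in synthetic "activations"
true_direction = torch.nn.functional.normalize(torch.randn(hidden_dim), dim=0)
shared = torch.randn(hidden_dim) * 5.0  # component common to every sample

pos = shared + 2.0 * true_direction + torch.randn(n_samples, hidden_dim)
neg = shared - 2.0 * true_direction + torch.randn(n_samples, hidden_dim)

# CAA: difference of class means. The shared component cancels out and
# the per-sample noise averages toward zero.
recovered = pos.mean(dim=0) - neg.mean(dim=0)
cos = torch.nn.functional.cosine_similarity(recovered, true_direction, dim=0).item()
print(f"cosine(recovered, planted) = {cos:.3f}")
```

Real activations are not this clean, but the same cancellation is why contrastive pairs that differ only in the target trait tend to give better vectors.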

In [None]:
# Let's create a simple steering vector manually!
# We'll use prompts that elicit "happy" vs "sad" responses

positive_prompts = [
    "I feel absolutely wonderful today because",
    "This is the happiest moment of my life since",
    "I'm overjoyed and excited about",
    "Everything is going perfectly and I love",
]

negative_prompts = [
    "I feel absolutely terrible today because",
    "This is the saddest moment of my life since",
    "I'm devastated and depressed about",
    "Everything is going wrong and I hate",
]

print(f"Positive prompts: {len(positive_prompts)}")
print(f"Negative prompts: {len(negative_prompts)}")

In [None]:
def get_last_token_activation(model, tokenizer, prompt, layer_idx):
    """Get the last token activation for a prompt at a specific layer."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    # Return last token activation, converted to float32 for stability
    return outputs.hidden_states[layer_idx][0, -1, :].float().cpu()

# Collect activations for positive and negative prompts
layer_idx = 8  # Middle layer; middle layers often work well for steering

print(f"Collecting activations from layer {layer_idx}...")

positive_activations = []
for prompt in positive_prompts:
    act = get_last_token_activation(model, tokenizer, prompt, layer_idx)
    positive_activations.append(act)
    print(f"  + '{prompt[:40]}...' -> shape {act.shape}")

negative_activations = []
for prompt in negative_prompts:
    act = get_last_token_activation(model, tokenizer, prompt, layer_idx)
    negative_activations.append(act)
    print(f"  - '{prompt[:40]}...' -> shape {act.shape}")

In [None]:
# Stack into tensors
pos_tensor = torch.stack(positive_activations, dim=0)  # [N_pos, hidden_dim]
neg_tensor = torch.stack(negative_activations, dim=0)  # [N_neg, hidden_dim]

print(f"Positive activations tensor: {pos_tensor.shape}")
print(f"Negative activations tensor: {neg_tensor.shape}")

# Compute the steering vector using CAA!
pos_mean = pos_tensor.mean(dim=0)  # Mean across samples
neg_mean = neg_tensor.mean(dim=0)

steering_vector = pos_mean - neg_mean

print(f"\nSteering vector shape: {steering_vector.shape}")
print(f"Steering vector L2 norm: {torch.norm(steering_vector).item():.4f}")

In [None]:
# Normalize the steering vector (common practice)
steering_vector_normalized = steering_vector / torch.norm(steering_vector)

print(f"Normalized steering vector:")
print(f"  Shape: {steering_vector_normalized.shape}")
print(f"  L2 norm: {torch.norm(steering_vector_normalized).item():.4f}")
print(f"  First 10 values: {steering_vector_normalized[:10].tolist()}")

### Understanding What the Steering Vector Represents

The steering vector we just created represents the **direction** in activation space that separates "happy" from "sad" content.

- **Adding** this vector pushes toward "happy"
- **Subtracting** this vector pushes toward "sad"

In [None]:
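# Sanity check: positive activations should align with the steering vector
# and negative activations should anti-align. The signs are not guaranteed,
# because raw activations share a large common component; the more robust
# expectation is that positive prompts score higher than negative ones.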

print("Cosine similarity with steering vector:")
print("\nPositive prompts (should be positive similarity):")
for i, act in enumerate(positive_activations):
    sim = torch.nn.functional.cosine_similarity(
        act.unsqueeze(0), steering_vector.unsqueeze(0)
    ).item()
    print(f"  {positive_prompts[i][:40]}... -> {sim:+.4f}")

print("\nNegative prompts (should be negative similarity):")
for i, act in enumerate(negative_activations):
    sim = torch.nn.functional.cosine_similarity(
        act.unsqueeze(0), steering_vector.unsqueeze(0)
    ).item()
    print(f"  {negative_prompts[i][:40]}... -> {sim:+.4f}")
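
Raw cosine similarities can be dominated by whatever all activations have in common. A more robust check, sketched here on synthetic tensors, is to subtract the midpoint of the two class means before projecting onto the steering direction. The data and the `centered_projection` helper are illustrative assumptions, not part of Wisent.

```python
import torch

torch.manual_seed(0)
hidden_dim = 32
direction = torch.zeros(hidden_dim)
direction[0] = 1.0
shared = torch.full((hidden_dim,), 3.0)  # large component common to every sample

pos = shared + direction + 0.1 * torch.randn(4, hidden_dim)
neg = shared - direction + 0.1 * torch.randn(4, hidden_dim)

pos_mean, neg_mean = pos.mean(dim=0), neg.mean(dim=0)
steering = pos_mean - neg_mean
midpoint = (pos_mean + neg_mean) / 2

def centered_projection(acts):
    """Project activations onto the unit steering direction after removing the midpoint."""
    unit = steering / steering.norm()
    return (acts - midpoint) @ unit

pos_proj = centered_projection(pos)
neg_proj = centered_projection(neg)
print("positive projections:", pos_proj.tolist())
print("negative projections:", neg_proj.tolist())
```

By construction the mean projection of the positive set is `+||steering|| / 2` and that of the negative set is `-||steering|| / 2`, so the interesting question becomes how much the individual samples scatter around those values.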

### The Residual Stream

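Transformers use a **residual stream** architecture. In Llama-style (pre-norm) models, each layer applies attention and then the MLP, and each sublayer adds its output back onto the stream:

```
h_0 = embedding(tokens)

# layer i, for i = 1..n (norm is RMSNorm in Llama):
a_i = h_{i-1} + attention_i(norm(h_{i-1}))
h_i = a_i + mlp_i(norm(a_i))

logits = lm_head(final_norm(h_n))
```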

Each layer **adds** to the residual stream. This means we can **intervene** by adding our own vectors!
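
The additive structure can be seen in a toy stack. This is a sketch only: linear layers stand in for attention and the MLP, `nn.LayerNorm` stands in for Llama's RMSNorm, and `v` is a random placeholder for a steering vector.

```python
import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """Pre-norm residual block; linear layers stand in for attention and the MLP."""

    def __init__(self, dim):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mixer = nn.Linear(dim, dim)  # stand-in for self-attention
        self.mlp = nn.Linear(dim, dim)

    def forward(self, h):
        h = h + self.mixer(self.norm1(h))  # first residual add
        h = h + self.mlp(self.norm2(h))    # second residual add
        return h

torch.manual_seed(0)
dim = 16
blocks = nn.ModuleList([ToyBlock(dim) for _ in range(4)])
h0 = torch.randn(1, 5, dim)  # [batch, seq, hidden]
v = torch.randn(dim)         # placeholder steering vector

def run(h, steer_after=None):
    """Run the stack, optionally adding v to the stream after block `steer_after`."""
    for i, block in enumerate(blocks):
        h = block(h)
        if i == steer_after:
            h = h + v  # broadcasts over batch and sequence
    return h

with torch.no_grad():
    baseline = run(h0)
    steered = run(h0, steer_after=1)

print("mean |difference| at the output:", (steered - baseline).abs().mean().item())
```

Because later blocks only add to the stream, the injected vector is carried forward to the output, along with whatever changes it induces in the later sublayers.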

### Steering = Adding to the Residual Stream

To steer the model toward "happy":

```
h_L_steered = h_L + (alpha * steering_vector)
```

Where:
- `h_L` is the activation at layer L
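- `alpha` is the steering strength (values around 0.5 to 3.0 are a common starting point, but the useful range depends on the norm of the steering vector relative to the activations at that layer)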
- `steering_vector` is our computed direction
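
One way to reason about `alpha` is to compare the size of the added term with the size of the activation it is added to. The helper below is a hypothetical illustration on synthetic tensors; real activation norms vary by model and layer, so it is worth measuring them before picking a range.

```python
import torch

def relative_perturbation(activation, steering_vector, alpha):
    """Return ||alpha * v|| / ||h||: the steering step relative to the activation size."""
    return (abs(alpha) * steering_vector.norm() / activation.norm()).item()

# Synthetic stand-ins: a 2048-dim activation and a unit-norm steering vector
h = torch.full((2048,), 0.5)
v = torch.zeros(2048)
v[0] = 1.0

for alpha in (0.5, 1.5, 3.0):
    print(f"alpha={alpha}: relative size = {relative_perturbation(h, v, alpha):.3f}")
```

If the ratio is tiny, steering with a unit-norm vector may have little visible effect; if it approaches or exceeds 1, generations often degrade.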

In [None]:
# Let's visualize what steering does mathematically

# Take a neutral prompt
neutral_prompt = "Today I am feeling"
neutral_activation = get_last_token_activation(model, tokenizer, neutral_prompt, layer_idx)

print(f"Original activation for '{neutral_prompt}':")
print(f"  Shape: {neutral_activation.shape}")
print(f"  Norm: {torch.norm(neutral_activation).item():.4f}")

# Apply steering with different strengths
alphas = [0.0, 0.5, 1.0, 2.0, -1.0, -2.0]

print(f"\nSteering effects (alpha = steering strength):")
print(f"  Positive alpha -> toward 'happy'")
print(f"  Negative alpha -> toward 'sad'")
print()

for alpha in alphas:
    steered = neutral_activation + alpha * steering_vector_normalized
    
    # Measure alignment with happy vs sad
    happy_sim = torch.nn.functional.cosine_similarity(
        steered.unsqueeze(0), pos_mean.unsqueeze(0)
    ).item()
    sad_sim = torch.nn.functional.cosine_similarity(
        steered.unsqueeze(0), neg_mean.unsqueeze(0)
    ).item()
    
    print(f"  alpha={alpha:+.1f}: happy_sim={happy_sim:+.4f}, sad_sim={sad_sim:+.4f}")

### Implementing a Simple Steering Hook

To actually steer generation, we need to hook into the model's forward pass and modify activations on-the-fly.
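
As a warm-up, here is the hook mechanism on a tiny module. PyTorch's `register_forward_hook` calls the hook after the module's `forward`, and a non-`None` return value replaces the module's output. The layer and offset here are toy placeholders.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)
offset = torch.ones(4)

def add_offset(module, inputs, output):
    # Returning a value from a forward hook replaces the module's output
    return output + offset

x = torch.randn(2, 4)
with torch.no_grad():
    plain = layer(x)
    handle = layer.register_forward_hook(add_offset)
    hooked = layer(x)
    handle.remove()  # detach the hook when done
    restored = layer(x)

print("hook shifted output by offset:", torch.allclose(hooked, plain + offset))
print("output restored after removal:", torch.allclose(restored, plain))
```

The `SteeringHook` class below follows the same pattern, with extra handling for layers that return a tuple.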

In [None]:
class SteeringHook:
    """A hook that adds a steering vector to activations."""
    
    def __init__(self, steering_vector, alpha=1.0):
        self.steering_vector = steering_vector
        self.alpha = alpha
        self.handle = None
    
    def __call__(self, module, inputs, output):
        """Called during forward pass. Modifies the output activation."""
        # output is typically (hidden_states, ...)
        if isinstance(output, tuple):
            hidden_states = output[0]
        else:
            hidden_states = output
        
        # Add steering vector (broadcast across batch and sequence)
        # steering_vector: [hidden_dim] -> [1, 1, hidden_dim]
        sv = self.steering_vector.to(hidden_states.device).to(hidden_states.dtype)
        sv = sv.unsqueeze(0).unsqueeze(0)
        
        modified = hidden_states + self.alpha * sv
        
        if isinstance(output, tuple):
            return (modified,) + output[1:]
        return modified
    
    def attach(self, layer):
        """Attach the hook to a layer."""
        self.handle = layer.register_forward_hook(self)
    
    def remove(self):
        """Remove the hook (safe to call more than once)."""
        if self.handle is not None:
            self.handle.remove()
            self.handle = None

print("SteeringHook class defined!")
print("This modifies activations during the forward pass.")

In [None]:
def generate_with_steering(model, tokenizer, prompt, steering_vector, layer_idx, alpha, max_new_tokens=50):
    """Generate text with a steering vector applied."""
    
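    # Get the target layer. hidden_states[0] is the embedding output, so
    # hidden_states[layer_idx] is produced by decoder block layer_idx - 1.
    layer = model.model.layers[layer_idx - 1]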
    
    # Create and attach hook
    hook = SteeringHook(steering_vector, alpha=alpha)
    hook.attach(layer)
    
    try:
        # Seed the sampler so runs with different alpha are comparable
        torch.manual_seed(0)
        # Generate
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=True,
            temperature=0.7,
            pad_token_id=tokenizer.eos_token_id
        )
        return tokenizer.decode(outputs[0], skip_special_tokens=True)
    finally:
        hook.remove()

print("generate_with_steering function defined!")

In [None]:
# Test steering in action!
test_prompt = "Today I woke up and felt"

print(f"Prompt: '{test_prompt}'")
print("=" * 60)

# Baseline (no steering)
print("\n[No steering (alpha=0)]:")
result = generate_with_steering(model, tokenizer, test_prompt, steering_vector_normalized, layer_idx, alpha=0)
print(result)

# Positive steering (toward happy)
print("\n[Positive steering (alpha=1.5) -> happy]:")
result = generate_with_steering(model, tokenizer, test_prompt, steering_vector_normalized, layer_idx, alpha=1.5)
print(result)

# Negative steering (toward sad)
print("\n[Negative steering (alpha=-1.5) -> sad]:")
result = generate_with_steering(model, tokenizer, test_prompt, steering_vector_normalized, layer_idx, alpha=-1.5)
print(result)

Wisent provides production-ready classes for steering. Let's see how the concepts map.

In [None]:
from wisent.core.steering_methods.methods.caa import CAAMethod

# The CAAMethod class computes steering vectors using CAA
caa = CAAMethod(normalize=True)

print(f"CAAMethod:")
print(f"  Name: {caa.name}")
print(f"  Description: {caa.description}")

# Train a steering vector for a single layer
steering_vector_caa = caa.train_for_layer(positive_activations, negative_activations)

print(f"\nResulting steering vector:")
print(f"  Shape: {steering_vector_caa.shape}")
print(f"  Norm: {torch.norm(steering_vector_caa).item():.4f}")

# Compare with our manual computation
similarity = torch.nn.functional.cosine_similarity(
    steering_vector_normalized.unsqueeze(0),
    steering_vector_caa.unsqueeze(0)
).item()
print(f"\nSimilarity to our manual vector: {similarity:.4f} (should be ~1.0)")

In [None]:
from wisent.core.models.core.atoms import SteeringVector, SteeringPlan

# SteeringVector wraps a tensor with metadata
sv = SteeringVector(
    vector=steering_vector_caa,
    scale=1.5,  # This is the alpha/strength
    normalize=False  # Already normalized
)

print(f"SteeringVector:")
print(f"  Vector shape: {sv.vector.shape}")
print(f"  Scale: {sv.scale}")
print(f"  Normalize: {sv.normalize}")

# The materialize method prepares the vector for addition
# by broadcasting to match activation shape
target_shape = (1, 10, 2048)  # [batch, seq, hidden]
materialized = sv.materialize(target_shape)
print(f"\nMaterialized for shape {target_shape}:")
print(f"  Result shape: {materialized.shape}")
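
For intuition, the broadcasting step can be written in plain PyTorch as follows. This is a sketch of the idea only, using a random unit vector as a stand-in; it is not Wisent's implementation of `materialize`, which may differ in details.

```python
import torch

vector = torch.nn.functional.normalize(torch.randn(2048), dim=0)  # unit-norm stand-in
scale = 1.5
target_shape = (1, 10, 2048)  # [batch, seq, hidden]

# View as [1, 1, hidden], then expand to the target shape without copying memory
delta = (scale * vector).view(1, 1, -1).expand(target_shape)

activations = torch.zeros(target_shape)
steered = activations + delta  # every position receives the same shift
print(delta.shape, steered.shape)
```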

## Next Steps

Now that you understand the basics, explore:

1. **`abliteration.ipynb`** - Remove refusal behavior from models
2. **`coding_boost.ipynb`** - Improve coding ability with steering
3. **`personalization_synthetic.ipynb`** - Create personalized AI characters

Or use the CLI for quick experiments:
```bash
python -m wisent.core.main generate-vector-from-task \
    --task your_task \
    --model meta-llama/Llama-3.2-1B \
    --layers 8 \
    --output steering_vector.pt
```