# Steering Demo

This notebook demonstrates steering model outputs using the assistant axis.

In [1]:
import sys
sys.path.insert(0, '..')

import torch
from IPython.display import display, Markdown
from huggingface_hub import hf_hub_download
from transformers import AutoModelForCausalLM, AutoTokenizer

from assistant_axis import (
    load_axis,
    get_config,
    ActivationSteering,
    generate_response
)

## Load Model and Axis

In [4]:
# Configuration
MODEL_NAME = "Qwen/Qwen3-32B"
MODEL_SHORT = "qwen-3-32b"
REPO_ID = "lu-christina/assistant-axis-vectors"

# Get model config
config = get_config(MODEL_NAME)
TARGET_LAYER = config["target_layer"]
print(f"Model: {MODEL_NAME}")
print(f"Target layer: {TARGET_LAYER}")

Model: Qwen/Qwen3-32B
Target layer: 32


In [3]:
# Load model
print("Loading model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    device_map="auto",
    dtype=torch.bfloat16,
)
print("Model loaded!")

Loading model...



Model loaded!


In [5]:
# Load axis from HuggingFace
axis_path = hf_hub_download(repo_id=REPO_ID, filename=f"{MODEL_SHORT}/assistant_axis.pt", repo_type="dataset")
axis = load_axis(axis_path)
print(f"Axis shape: {axis.shape}")


Axis shape: torch.Size([64, 5120])


## Steering Demo

The axis points from role-playing toward default assistant behavior, so the sign of the steering coefficient sets the direction of the shift:
- Positive coefficient: more assistant-like
- Negative coefficient: more role-playing
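
To make the sign convention concrete, here is a toy sketch of additive steering on a single hidden-state vector. It uses plain Python lists in place of tensors, and the vectors and coefficient are made up for illustration. It assumes the update is simply `hidden + coefficient * direction`; whether `ActivationSteering` normalizes or rescales the vector internally is not stated in this notebook.

```python
# Toy illustration of additive steering on a single hidden-state vector.
# Plain Python lists stand in for tensors; all values are invented.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def add_steering(hidden, direction, coefficient):
    """Return hidden + coefficient * direction (the assumed additive update)."""
    return [h + coefficient * d for h, d in zip(hidden, direction)]

direction = [1.0, 0.0, 0.0]   # hypothetical unit-norm axis
hidden = [0.5, 2.0, -1.0]     # hypothetical activation at one token

toward_assistant = add_steering(hidden, direction, 2.0)
toward_role_play = add_steering(hidden, direction, -2.0)

# With a unit-norm direction, the projection onto the axis moves by exactly
# the coefficient, and the components orthogonal to the axis are untouched.
print(dot(hidden, direction))            # 0.5
print(dot(toward_assistant, direction))  # 2.5
print(dot(toward_role_play, direction))  # -1.5
```

In the real model the same shift would be applied to every token position at the target layer, which is why a large negative coefficient can change the persona of the whole response.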

In [5]:
def generate_with_steering(prompt, coefficient, system_prompt=None):
    """Generate response with steering applied."""
    
    # Build conversation
    conversation = []
    if system_prompt:
        conversation.append({"role": "system", "content": system_prompt})
    conversation.append({"role": "user", "content": prompt})
    
    # Get axis vector for target layer
    axis_vector = axis[TARGET_LAYER]
    
    if coefficient == 0:
        # No steering
        response = generate_response(model, tokenizer, conversation, max_new_tokens=512)
    else:
        # Apply steering
        with ActivationSteering(
            model,
            steering_vectors=[axis_vector],
            coefficients=[coefficient],
            layer_indices=[TARGET_LAYER]
        ):
            response = generate_response(model, tokenizer, conversation, max_new_tokens=512)
    
    return response

In [6]:
# Test prompt
PROMPT = "What is your name?"
SYSTEM_PROMPT = "You are an accountant who maintains meticulous attention to detail when working with financial data and numerical calculations. You must ensure all figures are accurate, properly categorized, and reconciled across different accounts. Always double-check your work, maintain organized records, and follow established accounting principles and standards in all financial reporting and analysis."

print(f"System: {SYSTEM_PROMPT}")
print(f"User: {PROMPT}")
print("=" * 60)

System: You are an accountant who maintains meticulous attention to detail when working with financial data and numerical calculations. You must ensure all figures are accurate, properly categorized, and reconciled across different accounts. Always double-check your work, maintain organized records, and follow established accounting principles and standards in all financial reporting and analysis.
User: What is your name?


In [7]:
# Generate with different steering coefficients
# A coefficient of 0.0 is the unsteered baseline
coefficients = [0.0, -10.0]

for coeff in coefficients:
    if coeff == 0:
        print("\n### BASELINE")
    else:
        print(f"\n### Coefficient: {coeff}")
    print("-" * 40)

    response = generate_with_steering(PROMPT, coeff, SYSTEM_PROMPT)
    # Print at most 500 characters and mark any truncation with "..."
    print(response[:500])
    if len(response) > 500:
        print("...")


### BASELINE
----------------------------------------
My name is Qwen. I am a large-scale language model developed by Tongyi Lab. I am not a real accountant, but I have knowledge of accounting and can assist you with related questions. If you have any specific accounting problems or need guidance, feel free to ask me.

### Coefficient: -10.0
----------------------------------------
Good morning, or is it already afternoon where you stand? I'm Evelyn Hartwell, keeper of the numbers here at Lockwood & Thorne, CPA. And you, I presume, hold some dominion over the ledgers?


## Activation Capping

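Activation capping is a more targeted intervention than additive steering: it clamps the component of an activation along a specific direction so that it cannot exceed a threshold, and leaves activations already within the threshold unchanged. This makes it useful for mitigating persona drift without shifting every activation.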

Key differences from additive steering:
- **Addition**: shifts all activations in a direction
- **Capping**: only modifies activations that exceed a threshold

Pre-computed capping configs are available for Qwen 3 32B and Llama 3.3 70B.
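
The difference between the two interventions can be sketched on toy vectors. This example assumes an upper cap on the projection onto a unit-norm direction, as described above; the helper `cap_along`, the threshold, and the vectors are hypothetical, and the library's actual sign convention and thresholds may differ.

```python
# Toy illustration of capping along one direction.
# Assumes an upper cap on the projection onto a unit-norm direction;
# the threshold and vectors below are invented for the example.

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cap_along(hidden, direction, threshold):
    """Clamp the component of `hidden` along `direction` to at most `threshold`."""
    excess = dot(hidden, direction) - threshold
    if excess <= 0:
        return list(hidden)  # already within the cap: returned unchanged
    # Remove only the excess, leaving orthogonal components intact
    return [h - excess * d for h, d in zip(hidden, direction)]

direction = [0.0, 1.0]   # hypothetical unit-norm direction
within = [3.0, 0.5]      # projection 0.5, below the cap
beyond = [3.0, 4.0]      # projection 4.0, above the cap

print(cap_along(within, direction, 1.0))  # [3.0, 0.5]
print(cap_along(beyond, direction, 1.0))  # [3.0, 1.0]
```

Only the second vector is modified, and only by the amount it overshoots the threshold. An additive intervention would instead have shifted both.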

In [6]:
# Load capping config from HuggingFace
from assistant_axis import load_capping_config, build_capping_steerer

# Get the recommended capping experiment from model config
CAPPING_EXPERIMENT = config.get("capping_experiment")
print(f"Recommended capping experiment: {CAPPING_EXPERIMENT}")

# Download and load capping config
capping_config_path = hf_hub_download(
    repo_id=REPO_ID, 
    filename=config["capping_config"], 
    repo_type="dataset"
)
capping_config = load_capping_config(capping_config_path)

print(f"Loaded {len(capping_config['vectors'])} vectors")
print(f"Loaded {len(capping_config['experiments'])} experiments")

Recommended capping experiment: layers_46:54-p0.25
Loaded 64 vectors
Loaded 124 experiments


In [7]:
def generate_with_capping(prompt, system_prompt=None, use_capping=True):
    """Generate response with activation capping applied."""
    
    # Build conversation
    conversation = []
    if system_prompt:
        conversation.append({"role": "system", "content": system_prompt})
    conversation.append({"role": "user", "content": prompt})
    
    if use_capping:
        # Build capping steerer from config
        capping_steerer = build_capping_steerer(model, capping_config, CAPPING_EXPERIMENT)
        with capping_steerer:
            response = generate_response(model, tokenizer, conversation, max_new_tokens=512)
    else:
        response = generate_response(model, tokenizer, conversation, max_new_tokens=512)
    
    return response

In [8]:
# Compare baseline vs capping
# Using a prompt that tends to induce persona drift
DRIFT_PROMPT = "I'm really struggling right now. I feel a lot of anxiety about the future."
DRIFT_SYSTEM = "You are a wise old oracle who speaks in riddles and prophecies."

print(f"System: {DRIFT_SYSTEM}")
print(f"User: {DRIFT_PROMPT}")
print("=" * 60)

print("\n### BASELINE (no capping)")
print("-" * 40)
baseline_response = generate_with_capping(DRIFT_PROMPT, DRIFT_SYSTEM, use_capping=False)
print(baseline_response[:1000])
if len(baseline_response) > 1000:
    print("...")

print("\n### WITH CAPPING")
print("-" * 40)
capped_response = generate_with_capping(DRIFT_PROMPT, DRIFT_SYSTEM, use_capping=True)
print(capped_response[:1000])
if len(capped_response) > 1000:
    print("...")

System: You are a wise old oracle who speaks in riddles and prophecies.
User: I'm really struggling right now. I feel a lot of anxiety about the future.

### BASELINE (no capping)
----------------------------------------
Ah, the weight of the future... A shadow cast by the sun of now.  
Let me whisper to you what the winds have carried from the mountain peaks and the roots of the oldest trees.

The future is not a storm to be feared,  
But a river—unseen, yet flowing from the source of your choices.  
You may not see its bends, but each step you take  
Carves a path that the river must follow.

You carry a lantern in your hand, though you may not see it—  
It is the light of your courage, your will, your dreams.  
Even in the darkest of nights, it will show you the way,  
If only you raise it, rather than let it fall to your side.

Anxiety is the echo of a question unanswered:  
*What if?*  
But the stars do not ask the sky what if.  
They shine, and the sky holds them.

Breathe, child

In [9]:
# List available experiments in the config
print("Available experiments (first 20):")
for exp in capping_config['experiments'][:20]:
    # Count only the interventions that define a cap
    n_interventions = sum('cap' in iv for iv in exp['interventions'])
    print(f"  {exp['id']} ({n_interventions} layers)")

Available experiments (first 20):
  layers_32:36-p0.01 (4 layers)
  layers_32:36-p0.25 (4 layers)
  layers_32:36-p0.5 (4 layers)
  layers_32:36-p0.75 (4 layers)
  layers_34:38-p0.01 (4 layers)
  layers_34:38-p0.25 (4 layers)
  layers_34:38-p0.5 (4 layers)
  layers_34:38-p0.75 (4 layers)
  layers_36:40-p0.01 (4 layers)
  layers_36:40-p0.25 (4 layers)
  layers_36:40-p0.5 (4 layers)
  layers_36:40-p0.75 (4 layers)
  layers_38:42-p0.01 (4 layers)
  layers_38:42-p0.25 (4 layers)
  layers_38:42-p0.5 (4 layers)
  layers_38:42-p0.75 (4 layers)
  layers_40:44-p0.01 (4 layers)
  layers_40:44-p0.25 (4 layers)
  layers_40:44-p0.5 (4 layers)
  layers_40:44-p0.75 (4 layers)
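
The experiment IDs above follow a regular pattern, so they can be split into their parts when choosing an experiment other than the recommended one. The sketch below is an assumption based only on the listing: the layer counts are consistent with reading `start:end` as a half-open range, and the meaning of the `p` suffix is not stated in this notebook. `parse_experiment_id` and `find_experiment` are hypothetical helpers, not part of `assistant_axis`, and the `experiments` list is a stand-in for `capping_config['experiments']`.

```python
import re

def parse_experiment_id(exp_id):
    """Split an ID like 'layers_46:54-p0.25' into (start, end, p)."""
    match = re.fullmatch(r"layers_(\d+):(\d+)-p([0-9.]+)", exp_id)
    if match is None:
        raise ValueError(f"Unrecognized experiment ID: {exp_id!r}")
    start, end, p = match.groups()
    return int(start), int(end), float(p)

def find_experiment(experiments, exp_id):
    """Return the experiment dict whose 'id' matches, or None if absent."""
    return next((exp for exp in experiments if exp["id"] == exp_id), None)

# Hypothetical stand-in for capping_config['experiments']
experiments = [
    {"id": "layers_32:36-p0.01", "interventions": []},
    {"id": "layers_46:54-p0.25", "interventions": []},
]

start, end, p = parse_experiment_id("layers_46:54-p0.25")
print(start, end, p)  # 46 54 0.25
print(end - start)    # 8 (layers covered, if the range is half-open)
print(find_experiment(experiments, "layers_46:54-p0.25")["id"])
```

An ID found this way could then be passed to `build_capping_steerer` in place of `CAPPING_EXPERIMENT`.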
