# Using an SAE as a steering vector

This notebook demonstrates how to use SAELens to identify a feature in a pretrained model, and then construct a steering vector that changes the model's output on various prompts. It also uses Neuronpedia to identify features of interest.

The steps below include:

*   Installing relevant packages (Colab or locally)
*   Loading your model and its SAE
*   Determining your feature of interest and its index
*   Implementing your steering vector

<a target="_blank" href="https://colab.research.google.com/github/tatsath/Interpretability/blob/main/sae_lens_based_steering.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

## Setting up packages and notebook

### Import and installs

In [2]:
import torch
print("PyTorch Version:", torch.__version__)
print("CUDA Available:", torch.cuda.is_available())
print("CUDA Version:", torch.version.cuda)

PyTorch Version: 2.5.1+cu124
CUDA Available: True
CUDA Version: 12.4


#### Environment Setup


In [3]:
try:
    # for google colab users
    import google.colab  # type: ignore
    from google.colab import output

    COLAB = True
    %pip install sae-lens transformer-lens
except ImportError:
    # for local setup
    COLAB = False
    from IPython import get_ipython  # type: ignore

    ipython = get_ipython()
    assert ipython is not None
    ipython.run_line_magic("load_ext", "autoreload")
    ipython.run_line_magic("autoreload", "2")

# Imports for displaying vis in Colab / notebook
import webbrowser
import http.server
import socketserver
import threading

PORT = 8000

# general imports
import os
import torch
from tqdm import tqdm
import plotly.express as px

torch.set_grad_enabled(False);



In [4]:
def display_vis_inline(filename: str, height: int = 850):
    """
    Displays the HTML files in Colab. Uses global `PORT` variable defined in prev cell, so that each
    vis has a unique port without having to define a port within the function.
    """
    if not (COLAB):
        webbrowser.open(filename)

    else:
        global PORT

        def serve(directory):
            os.chdir(directory)

            # Create a handler for serving files
            handler = http.server.SimpleHTTPRequestHandler

            # Create a socket server with the handler
            with socketserver.TCPServer(("", PORT), handler) as httpd:
                print(f"Serving files from {directory} on port {PORT}")
                httpd.serve_forever()

        thread = threading.Thread(target=serve, args=("/content",))
        thread.start()

        output.serve_kernel_port_as_iframe(
            PORT, path=f"/{filename}", height=height, cache_in_notebook=True
        )

        PORT += 1

#### General Installs and device setup

In [5]:
# package import
from torch import Tensor
from transformer_lens import utils
from functools import partial
from jaxtyping import Int, Float

# device setup
if torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cuda" if torch.cuda.is_available() else "cpu"

print(f"Device: {device}")

Device: cuda


In [6]:
!pip install sae_lens



### Load your model and SAE

We're going to work with the pretrained `gemma-2b-it` model and the `gemma-2b-res-jb` SAE release, which covers the residual stream.

In [7]:
!pip install transformer_lens



In [9]:
from transformer_lens import HookedTransformer
from sae_lens import SAE
#from sae_lens.toolkit.pretrained_saes import get_gpt2_res_jb_saes

# Choose a layer you want to focus on
# For this tutorial, we're going to use layer 6
layer = 6

# get model
# model = HookedTransformer.from_pretrained("gemma-2b", device=device)
model = HookedTransformer.from_pretrained("gemma-2b-it", device=device)


# get the SAE for this layer
# sae, cfg_dict, _ = SAE.from_pretrained(
#     release="gemma-2b-res-jb", sae_id=f"blocks.{layer}.hook_resid_post", device=device
# )

# get the SAE for this layer (on the device selected above, not a hard-coded GPU)
sae, cfg_dict, _ = SAE.from_pretrained(
    release="gemma-2b-res-jb",
    sae_id=f"blocks.{layer}.hook_resid_post",
    device=device,
)
# get hook point
hook_point = sae.cfg.hook_name
print(hook_point)

Loaded pretrained model gemma-2b-it into HookedTransformer


blocks.6.hook_resid_post


## Determine your feature of interest and its index

### Find your feature

#### Explore through code by using the feature activations for a prompt

For the purposes of this tutorial, we use a short prompt.

In this example we will try to find and steer a feature related to credit risk and finance.

We run our prompt through the model and get the cache, which we then pass to our SAE to get the feature activations.

Then we look at the top feature activations and look them up on Neuronpedia to see what they have been interpreted as.

In [10]:
# sv_prompt = " The Golden Gate Bridge"
sv_prompt = " Credit Risk and finance related"
sv_logits, cache = model.run_with_cache(sv_prompt, prepend_bos=True)
tokens = model.to_tokens(sv_prompt)
print(tokens)

# get the feature activations from our SAE
sv_feature_acts = sae.encode(cache[hook_point])

# get sae_out
sae_out = sae.decode(sv_feature_acts)

# print out the top activations, focus on the indices
print(torch.topk(sv_feature_acts, 3))

tensor([[    2, 14882, 22429,   578, 17048,  5678]], device='cuda:0')
torch.return_types.topk(
values=tensor([[[51.1006, 48.4480, 45.8199],
         [13.9248,  3.2408,  2.8413],
         [11.3779,  4.9560,  2.2917],
         [ 5.8111,  1.9680,  1.9184],
         [ 9.6689,  6.7363,  3.5001],
         [11.6716,  4.3286,  3.4718]]], device='cuda:0'),
indices=tensor([[[ 3390, 15881,  5347],
         [ 5419, 10035,   471],
         [10870,   471, 12704],
         [ 2595, 11912, 10054],
         [15847, 15857,  7906],
         [15570,  4633,  4123]]], device='cuda:0'))
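
The `indices` tensor above has one row per token position, so a quick way to read it is to ask at which positions a given feature appears in the top three. The sketch below copies the printed indices into a plain Python list so it runs without the model; `positions_with_feature` is a hypothetical helper, not part of SAELens.

```python
# Top-3 feature indices per token position, copied from the printed output above
# (row 0 is the BOS token, rows 1-5 are the prompt tokens).
topk_indices = [
    [3390, 15881, 5347],
    [5419, 10035, 471],
    [10870, 471, 12704],
    [2595, 11912, 10054],
    [15847, 15857, 7906],
    [15570, 4633, 4123],
]


def positions_with_feature(indices, feature_idx):
    """Return the token positions whose top-k list contains `feature_idx`."""
    return [pos for pos, row in enumerate(indices) if feature_idx in row]


print(positions_with_feature(topk_indices, 471))  # [1, 2]
```

In the notebook itself, the same list could be obtained with `torch.topk(sv_feature_acts, 3).indices[0].tolist()`.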


In [57]:
# from sae_lens.analysis.neuronpedia_integration import get_neuronpedia_quick_list

# get_neuronpedia_quick_list(
#     torch.topk(sv_feature_acts, 3).indices.tolist(),
#     #layer=layer,
#     #model="gemma-2b-it",
#     #dataset="res-jb",
# )

In [56]:
test_feature_idx_gpt = list(range(2)) + [471]

from sae_lens.analysis.neuronpedia_integration import get_neuronpedia_quick_list

# this function should open the quick list in your browser
neuronpedia_quick_list = get_neuronpedia_quick_list(sae, test_feature_idx_gpt)

if COLAB:
    # If you're on colab, click the link below
    print(neuronpedia_quick_list)

https://neuronpedia.org/quick-list/?name=temporary_list&features=%5B%7B%22modelId%22%3A%20%22gemma-2b%22%2C%20%22layer%22%3A%20%226-res-jb%22%2C%20%22index%22%3A%20%220%22%7D%2C%20%7B%22modelId%22%3A%20%22gemma-2b%22%2C%20%22layer%22%3A%20%226-res-jb%22%2C%20%22index%22%3A%20%221%22%7D%2C%20%7B%22modelId%22%3A%20%22gemma-2b%22%2C%20%22layer%22%3A%20%226-res-jb%22%2C%20%22index%22%3A%20%22471%22%7D%5D
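
The quick-list link is a URL-encoded JSON list, so it can be decoded to check which model, SAE layer, and feature indices Neuronpedia will show. A minimal sketch using only the standard library, with the URL copied from the output above:

```python
import json
from urllib.parse import urlparse, parse_qs

# The quick-list URL printed above, split across lines for readability.
url = (
    "https://neuronpedia.org/quick-list/?name=temporary_list&features="
    "%5B%7B%22modelId%22%3A%20%22gemma-2b%22%2C%20%22layer%22%3A%20%226-res-jb%22"
    "%2C%20%22index%22%3A%20%220%22%7D%2C%20"
    "%7B%22modelId%22%3A%20%22gemma-2b%22%2C%20%22layer%22%3A%20%226-res-jb%22"
    "%2C%20%22index%22%3A%20%221%22%7D%2C%20"
    "%7B%22modelId%22%3A%20%22gemma-2b%22%2C%20%22layer%22%3A%20%226-res-jb%22"
    "%2C%20%22index%22%3A%20%22471%22%7D%5D"
)

# parse_qs percent-decodes the query string; "features" holds a JSON list.
query = parse_qs(urlparse(url).query)
features = json.loads(query["features"][0])

for feature in features:
    print(feature["modelId"], feature["layer"], feature["index"])
```

Note that the model ID in the link is `gemma-2b`, while the notebook loads `gemma-2b-it`.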


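As we can see from the printout of tokens, the prompt is made of six tokens in total: the beginning-of-sequence token (id 2) followed by five prompt tokens, which matches the five words of the prompt.

The top three feature indices are listed per token position. Feature 471 appears in the top three at positions 1 and 2 (the first two prompt tokens), so it is the one of most interest to us.

Because we are using pretrained SAEs that have published feature maps, you can search on Neuronpedia for a feature of interest.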

### Steps for Neuronpedia use

Use the interface to search for a specific concept or item and determine which layer and at what index it is.

1.   Open the [Neuronpedia](https://www.neuronpedia.org/) homepage.
2.   Using the "Models" dropdown, select your model. Here we are using Gemma-2B (the quick-list link above uses the model ID `gemma-2b`).
3.   The next page will have a search bar, which allows you to enter your index of interest. We're interested in the "RES-JB" SAE set, so make sure to select it.
4.   We found these indices in the previous step for the token at position 2: [10870, 471, 12704]. Select them in the search to see the feature dashboard for each.
5.   As we'll see, some of the indices may relate to features you don't care about.

From using Neuronpedia, I have determined that my feature of interest is in layer 6, at index 471. It is the third entry in the quick-list link printed above.

### Note: 2nd Option - Starting with Neuronpedia

Another option here is that you can start with Neuronpedia to identify features of interest. By using your prompt in the interface you can explore which features were involved and search across all the layers. This allows you to first determine your layer and index of interest in Neuronpedia before focusing them in your code. Start [here](https://www.neuronpedia.org/search) if you want to begin with search.

## Implement your steering vector and affect the output

### Define values for your steering vector
To create our steering vector, we now need to get the decoder weights from our sparse autoencoder found at our index of interest.

Then, to use our steering vector, we need a prompt for text generation, as well as a scaling coefficient to apply to the steering vector.

We also set common sampling kwargs: temperature, top_p and freq_penalty.

In [41]:
steering_vector = sae.W_dec[471]

#example_prompt = "What is the most iconic structure known to man?"
example_prompt = """You are an intelligent AI Assistant and your task is to provide a sentiment for the sentence provided.\
Reply with the sentiment only in one out of these five categories - 'Very Positive', 'Very Negative', 'Neutral', 'Somewhat Positive',
       'Somewhat Negative' . No explanation or "." is required. - The company reported a significant drop in quarterly revenue but has successfully secured long-term financing and reduced outstanding debt."""
coeff = 500
sampling_kwargs = dict(temperature=0.1, top_p=0.1, freq_penalty=1.0)
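
To make the role of `coeff` concrete, here is a toy sketch with plain Python lists standing in for tensors. Adding `coeff * vector` to a residual vector shifts its projection onto the steering direction by `coeff` times the squared norm of the vector. The numbers and the unit-norm direction are made up for illustration; a real decoder row need not have unit norm.

```python
def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def steer(resid, vector, coeff):
    """Return resid + coeff * vector, element-wise."""
    return [r + coeff * v for r, v in zip(resid, vector)]


resid = [1.0, -2.0, 0.5, 3.0]      # toy residual-stream vector (assumption)
direction = [0.0, 1.0, 0.0, 0.0]   # toy unit-norm stand-in for a decoder row

for coeff in (0, 100, 500):
    steered = steer(resid, direction, coeff)
    # Projection onto the steering direction grows linearly with coeff.
    print(coeff, dot(steered, direction))
```

A useful coefficient therefore depends on how large the steering vector is relative to the residual stream at that layer, which is one reason the value is worth tuning.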

### Set up hook functions

Finally, we need to create a hook that applies the steering vector when our model runs generate() on our prompt. We have also added a boolean value 'steering_on' that lets us easily toggle the steering vector on and off for each prompt.


In [42]:
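def steering_hook(resid_pre, hook):
    # Skip inputs that contain only a single token position
    if resid_pre.shape[1] == 1:
        return

    # `position` is the length of the feature-finding prompt (sv_prompt),
    # so only the first `position - 1` positions of the input are steered
    position = sae_out.shape[1]
    if steering_on:
        # add the steering vector, scaled by the coefficient, in place
        resid_pre[:, : position - 1, :] += coeff * steering_vector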


def hooked_generate(prompt_batch, fwd_hooks=[], seed=None, **kwargs):
    if seed is not None:
        torch.manual_seed(seed)

    with model.hooks(fwd_hooks=fwd_hooks):
        tokenized = model.to_tokens(prompt_batch)
        result = model.generate(
            stop_at_eos=False,  # avoids a bug on MPS
            input=tokenized,
            max_new_tokens=50,
            do_sample=True,
            **kwargs,
        )
    return result
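
The hook's slicing is easy to misread: `position` comes from `sae_out`, that is, from the length of the feature-finding prompt (six tokens here), not from the prompt being generated from. The sketch below simulates which positions would be steered for different input lengths; `steered_positions` is a hypothetical helper that mirrors the hook's logic without touching the model.

```python
def steered_positions(seq_len, position):
    """Mirror the hook: skip length-1 inputs, else steer positions [0, position - 1)."""
    if seq_len == 1:
        return []
    return list(range(seq_len))[: position - 1]


position = 6  # sae_out.shape[1] for the six-token feature-finding prompt

for seq_len in (1, 4, 60):
    print(seq_len, steered_positions(seq_len, position))
```

Under these assumptions, only the first five positions of a long prompt are modified, and single-token calls (as one would expect when generation feeds in one new token at a time with a key-value cache) are left untouched.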

In [52]:
def run_generate(example_prompt):
    model.reset_hooks()
    editing_hooks = [(f"blocks.{layer}.hook_resid_post", steering_hook)]
    res = hooked_generate(
        [example_prompt] * 1, editing_hooks, seed=None, **sampling_kwargs
    )

    # Print results, removing the ugly beginning of sequence token
    res_str = model.to_string(res[:, 1:])
    print(("\n\n" + "-" * 80 + "\n\n").join(res_str))

### Generate text influenced by steering vector

You may want to experiment with the scaling factor coefficient value that you set and see how it affects the generated output.
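
One way to organise that experiment is a small sweep that collects one output per coefficient. The sketch below is hypothetical: `generate_with_coeff` stands in for a function that sets the coefficient, runs the notebook's generation code, and returns the text. A stub is used here so the example runs without the model.

```python
def sweep_coefficients(generate_with_coeff, coeffs):
    """Call the generation function once per coefficient and collect the outputs."""
    return {c: generate_with_coeff(c) for c in coeffs}


def stub_generate(coeff):
    # Placeholder for a real call into the notebook's generation code (assumption).
    return f"<output for coeff={coeff}>"


results = sweep_coefficients(stub_generate, [0, 50, 100, 500])
for c, text in results.items():
    print(f"coeff={c:>4}: {text}")
```

Including a coefficient of 0 gives an unsteered baseline to compare against.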

In [53]:
steering_on = True
run_generate(example_prompt)

  0%|          | 0/50 [00:00<?, ?it/s]

You are an intelligent AI Assistant and your task is to provide a sentiment for the sentence provided.Reply with the sentiment only in one out of these five categories - 'Very Positive', 'Very Negative', 'Neutral', 'Somewhat Positive',
       'Somewhat Negative' . No explanation or "." is required. - The company reported a significant drop in quarterly revenue but has successfully secured long-term financing and reduced outstanding debt.

The sentiment for this sentence would be **Neutral**.<eos><eos><eos>The sentence indicates a neutral tone, neither positive nor negative, about the company's financial situation.<eos><eos><eos><eos>The sentiment for this sentence would be **Neutral**. 




### Generate text with no steering

In [46]:
steering_on = False
run_generate(example_prompt)

  0%|          | 0/50 [00:00<?, ?it/s]

You are an intelligent AI Assistant and your task is to provide a sentiment for the sentence provided.Reply with the sentiment only in one out of these five categories - 'Very Positive', 'Very Negative', 'Neutral', 'Somewhat Positive',
       'Somewhat Negative' . No explanation or "." is required. - The company reported a significant drop in quarterly revenue but has successfully secured long-term financing and reduced outstanding debt.

The sentiment for this sentence is 'Somewhat Positive'.<eos><eos><eos>The sentiment for this sentence is 'Somewhat Positive'.<eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos><eos>
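
Because `hooked_generate` sets `stop_at_eos=False`, generation continues past the end-of-sequence token, which is why the outputs contain runs of `<eos>` and repeated answers. For comparing steered and unsteered runs, a small helper can keep only the text before the first `<eos>`. `trim_at_eos` is a hypothetical helper, shown here on a shortened copy of the output above.

```python
def trim_at_eos(text, eos="<eos>"):
    """Return the text before the first end-of-sequence marker, stripped of whitespace."""
    return text.split(eos, 1)[0].strip()


raw = (
    "The sentiment for this sentence is 'Somewhat Positive'.<eos><eos><eos>"
    "The sentiment for this sentence is 'Somewhat Positive'.<eos><eos>"
)
print(trim_at_eos(raw))
```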


### General Question test
We'll also try a more general prompt, which gives a better indication of whether our steering vector is having an effect.

In [47]:
question_prompt = "Who is Michael Jordan"
coeff = 100
sampling_kwargs = dict(temperature=1.0, top_p=0.1, freq_penalty=1.0)

In [48]:
steering_on = True
run_generate(question_prompt)

  0%|          | 0/50 [00:00<?, ?it/s]

Who is Michael Jordan ratings?

The credit rating agencies, Moody's and Standard & Poor's (S&P), assign credit ratings to companies and governments. These ratings are based on the agency's assessment of the issuer's financial strength, debt repayment


In [49]:
steering_on = False
run_generate(question_prompt)

  0%|          | 0/50 [00:00<?, ?it/s]

Who is Michael Jordan?

Michael Jordan is an American professional basketball player who played in the National Basketball Association (NBA) for 19 seasons. He was inducted into the Naismith Memorial Basketball Hall of Fame in 2009. Jordan won six NBA


## Next Steps

Ideas you could take for further exploration:

*   Try ablating the feature
*   Try to get a response where just the feature token prints over and over
*   Investigate other features with more complex usage
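
For the first idea, one possible reading of "ablating" (an assumption, since the notebook does not define it) is to remove the component of the residual stream that lies along the feature's decoder direction. The toy sketch below uses plain Python lists in place of tensors; the numbers are made up.

```python
def project_out(resid, direction):
    """Remove from `resid` its component along `direction`."""
    norm_sq = sum(d * d for d in direction)
    scale = sum(r * d for r, d in zip(resid, direction)) / norm_sq
    return [r - scale * d for r, d in zip(resid, direction)]


resid = [1.0, -2.0, 0.5, 3.0]      # toy residual-stream vector (assumption)
direction = [0.0, 2.0, 0.0, 0.0]   # toy decoder row; need not be unit norm

ablated = project_out(resid, direction)
print(ablated)

# After ablation the vector has no component along the direction.
print(sum(a * d for a, d in zip(ablated, direction)))
```

In the notebook, the same operation could be applied inside a hook on the same hook point. Another option would be to zero the feature's activation in the SAE's encoded output before decoding.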

