# Tutorial

AxBench introduces two supervised dictionary-learning (SDL) methods that scale to thousands of concepts and outperform existing dictionary-learning approaches for LLMs. In this tutorial, we demonstrate one of these methods, ReFT-r1, which is built on the representation finetuning (ReFT) framework. ReFT-r1 provides a single dictionary of subspaces, with each subspace corresponding to a high-level concept. These subspaces can be used as a "microscope" to analyze model internals and to steer model behavior.

**We will be using [pyvene](https://github.com/stanfordnlp/pyvene) to build interventions that load our SDLs.**

**More about ReFT-r1 with Concept16K**
- It has no encoder-decoder structure; it is a single large matrix in which each row is a subspace.
- Each subspace serves two purposes: detection and steering.
- The first released version provides a dictionary of 16K subspaces.
- These 16K concepts are adapted from the Gemma model's SAEs.
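
To make the "one matrix, two uses" idea concrete, here is a minimal NumPy sketch with toy sizes. It assumes that detection is a ReLU of the projection of a hidden state onto a concept's row, and that steering adds a scaled copy of that row back to the hidden state. The exact formulas used by ReFT-r1 may differ, and all names (`dictionary`, `detect`, `steer`) are illustrative rather than part of any released API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dictionary: 4 concepts, hidden size 8 (the released one has 16K rows).
num_concepts, hidden_size = 4, 8
dictionary = rng.normal(size=(num_concepts, hidden_size))
# Normalize rows to unit length so activations are comparable (an assumption).
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

def detect(hidden, concept_idx):
    """Assumed detection rule: ReLU of the projection onto the concept row."""
    return np.maximum(hidden @ dictionary[concept_idx], 0.0)

def steer(hidden, concept_idx, magnitude):
    """Assumed steering rule: add a scaled copy of the concept row."""
    return hidden + magnitude * dictionary[concept_idx]

# Hidden states for a 3-token sequence.
hidden = rng.normal(size=(3, hidden_size))
before = detect(hidden, concept_idx=2)
after = detect(steer(hidden, concept_idx=2, magnitude=10.0), concept_idx=2)
print(before.shape, after.shape)  # one activation per token
```

Because the same row is used for both operations, steering along a concept raises that concept's detection score at every token position.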

## Loading the Model

In [2]:
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import hf_hub_download, notebook_login
import numpy as np
import torch, json, einops

def load_jsonl(jsonl_path):
    jsonl_data = []
    with open(jsonl_path, 'r') as f:
        for line in f:
            data = json.loads(line)
            jsonl_data += [data]
    return jsonl_data

In this tutorial, we load `Gemma-2-2B-it` along with our ReFT-r1 trained on the residual stream of layer 20. You first need to log in to Hugging Face so that the related weights and data can be downloaded. Note that we do not use the pretrained base model, because ReFT-r1 is trained directly on the instruction-tuned model.

In [2]:
notebook_login()


In [4]:
torch.set_grad_enabled(False)
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b-it", device_map="auto")


In [5]:
tokenizer =  AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

## Download our open ReFT-r1 SDL

We provide the raw weights as well as the annotated concept metadata.

In [7]:
steering_vector = torch.load('../layer_10_addition/train/GemmaScopeSAE.pt')



In [9]:
# Excerpt from a class method (note `self`), so it raises the NameError below
# when run as a standalone notebook cell.
inputs = self.tokenizer(
    input_strings, return_tensors="pt", padding=True, truncation=True
).to(self.device)

_, generations = self.ax_model.generate(
    inputs,
    unit_locations=None, intervene_on_prompt=True,
    subspaces=[{"idx": idx, "mag": mag, "max_act": max_acts,
                "prefix_length": kwargs["prefix_length"]}]*self.num_of_layers,
    max_new_tokens=eval_output_length, do_sample=True,
    temperature=temperature,
)

NameError: name 'self' is not defined

In [5]:
print(steering_vector['W_dec'].shape)

torch.Size([10, 2304])


In [6]:
md = load_jsonl("../layer_10_addition/generate/metadata.jsonl")
md[0]

{'concept_id': 0,
 'concept': 'the main thing this neuron does is respond to mathematical concepts focused on derivatives, activating with phrases that specify derivatives of mathematical functions, and then outputs a range of terms related to derivatives and their properties.',
 'ref': 'https://www.neuronpedia.org/gemma-2-2b/10-gemmascope-res-65k/20527',
 'concept_genres_map': {'the main thing this neuron does is respond to mathematical concepts focused on derivatives, activating with phrases that specify derivatives of mathematical functions, and then outputs a range of terms related to derivatives and their properties.': ['math']}}
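
Since each metadata record carries a `concept_id`, a free-text `concept` description, and a `concept_genres_map`, a couple of small helpers make it easier to find a concept of interest. The sketch below uses hand-written, hypothetical records in the same shape as the entry shown above; the helper names `find_concepts` and `concepts_in_genre` are illustrative, and the second record is made up.

```python
# Hypothetical records in the same shape as the metadata entry shown above.
md_sample = [
    {
        "concept_id": 0,
        "concept": "mathematical concepts focused on derivatives",
        "concept_genres_map": {
            "mathematical concepts focused on derivatives": ["math"]
        },
    },
    {
        "concept_id": 1,
        "concept": "phrases about cooking and recipes",
        "concept_genres_map": {"phrases about cooking and recipes": ["text"]},
    },
]

def find_concepts(records, keyword):
    """Return (concept_id, concept) pairs whose description contains keyword."""
    keyword = keyword.lower()
    return [
        (r["concept_id"], r["concept"])
        for r in records
        if keyword in r["concept"].lower()
    ]

def concepts_in_genre(records, genre):
    """Return the ids of concepts tagged with the given genre."""
    return [
        r["concept_id"]
        for r in records
        if any(genre in genres for genres in r["concept_genres_map"].values())
    ]

print(find_concepts(md_sample, "derivative"))
print(concepts_in_genre(md_sample, "math"))
```

With the real file, the list returned by `load_jsonl` can be passed in place of `md_sample`. Using the returned `concept_id` as a row index into `W_dec` assumes that the metadata order matches the row order of the weights.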

In [8]:
# The input text
prompt = "Would you be able to travel through time using a wormhole?"

# Tokenize the prompt. This implicitly prepends the special beginning-of-sequence (<bos>) token.
inputs = tokenizer.encode(prompt, return_tensors="pt", add_special_tokens=True).to("cuda")
print(inputs)

# Pass the tokens to the model and generate text
outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0]))

tensor([[     2,  18925,    692,    614,   3326,    577,   5056,   1593,   1069,
           2177,    476,  47420,  18216, 235336]], device='cuda:0')



<bos>Would you be able to travel through time using a wormhole?

This question dives into scientific exploration, fantastical possibilities, and the limitations of our current understanding. 

**Let's analyze the scenario**:

* **Wormholes:** These are theoretical tunnels through spacetime that could potentially connect two distant points in


In [9]:
def steering(
    activations, hook, steering_strength=1.0, steering_vector=None, max_act=1.0
):
    # Note: if the feature already fires on this input, the steering vector is added on top of it.
    return activations + max_act * steering_strength * steering_vector
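
The hook relies on broadcasting: a steering vector of shape `(hidden_size,)` is added to every batch element and every token position of an activation tensor of shape `(batch, seq_len, hidden_size)`. The sketch below checks this on toy NumPy arrays standing in for torch tensors. The function body matches the one above; the sizes and values are illustrative.

```python
import numpy as np

def steering(
    activations, hook, steering_strength=1.0, steering_vector=None, max_act=1.0
):
    # Same arithmetic as the hook above; `hook` is unused here.
    return activations + max_act * steering_strength * steering_vector

batch, seq_len, hidden_size = 2, 3, 4
activations = np.zeros((batch, seq_len, hidden_size))
steering_vector = np.array([1.0, 0.0, -1.0, 0.5])

steered = steering(
    activations,
    hook=None,
    steering_strength=5.0,
    steering_vector=steering_vector,
    max_act=2.0,
)

print(steered.shape)  # shape is unchanged: (2, 3, 4)
print(steered[0, 0])  # every position is shifted by 2.0 * 5.0 * steering_vector
```

The effective step size is the product `max_act * steering_strength`, so doubling either one has the same effect on the activations.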

In [31]:
print(steering_vector['W_dec'][0].shape)

torch.Size([2304])


In [25]:
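from functools import partial

def generate_with_steering(
    model,
    sae_sub,
    prompt,
    steering_feature,
    max_act,
    steering_strength=1.0,
    max_new_tokens=95,
):
    # `model` must expose the TransformerLens `HookedTransformer` API
    # (`to_tokens`, `hooks`, `generate(..., stop_at_eos=...)`); a Hugging Face
    # `AutoModelForCausalLM` object does not provide these methods.
    input_ids = model.to_tokens(prompt, prepend_bos=True)

    # Select the decoder row for the chosen feature and move it to the input device.
    steering_vector = sae_sub['W_dec'][steering_feature].to(input_ids.device)

    # Bind the steering arguments so the hook has the (activations, hook) signature.
    steering_hook = partial(
        steering,
        steering_vector=steering_vector,
        steering_strength=steering_strength,
        max_act=max_act,
    )

    # Standard TransformerLens syntax: the hook is active only inside this context.
    with model.hooks(fwd_hooks=[('blocks.10.hook_resid_post', steering_hook)]):
        output = model.generate(
            input_ids,
            max_new_tokens=max_new_tokens,
            temperature=0.7,
            top_p=0.9,
            stop_at_eos=True,
            prepend_bos=True,
        )

    return model.tokenizer.decode(output[0])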

In [22]:
steering_feature = 0

In [23]:
normal_text = model.generate(
    prompt,
    max_new_tokens=95,
    stop_at_eos=True,
    prepend_bos=True,
)

print("\nNormal text (without steering):")
print(normal_text)



Normal text (without steering):
Would you be able to travel through time using a wormhole? 

Let's look at the science:

1. **Wormholes: Theoretical** Our current understanding of physics doesn't allow us to create wormholes, these are proposed theoretical solutions that allow for connecting two distant points in spacetime. 
2. **Time Travel: Circular logic.** Some physicists suggest time travel could be theoretically possible, but time travel is still highly theoretical. 
3. **The Grandfather Paradox.** This famously problematic example is a logical inconsistency


In [28]:
# Generate text with steering
from functools import partial
steered_text = generate_with_steering(
    model, steering_vector, "Once upon a time",
    steering_feature=steering_feature, max_act=1.0, steering_strength=5.0,
)
print("Steered text:")
print(steered_text)


Steered text:
<bos>Once upon a time, in a land far away, lived a little firefly named Flicker. Unlike other fireflies, Flicker's light wasn't a vibrant glow, but a dim, flickering flame. He felt shy and different, and he wished he could be like the other fireflies who shone bright and beautiful.

One night, a wise old owl named Hoot saw Flicker struggling to light up. "Why are you so sad, little one?" Hoot asked
