# SAE Intro Project

This notebook is a simple introduction to the Sparse Autoencoder (SAE).
The project is split into two parts:
1. A simple SAE implementation from scratch using PyTorch.
2. A new SAE variant.


In [None]:
# imports and globals setup
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
import matplotlib.pyplot as plt
from datasets import load_dataset

# The LLM whose activations we shall train this SAE on
model_name = "EleutherAI/pythia-70m-deduped"

# The dataset we shall use to prompt the LLM and collect activations for training the SAE
dataset_name = "monology/pile-uncopyrighted"

# Store activations here, from the layer which we are interested in
# Might be a better way to do this but this is the simplest
stored_activations = []

# This is the layer of the model we are interested in.
# NB 0-based index, so 5 is the 6th layer.
# The choice of layer is somewhat arbitrary, but we want something that is neither too deep
# nor too shallow. That is a good trade-off for interpretability, and the Anthropic paper
# suggests that these "middle" layers are a decent place to start.
# NB pythia-70m has only 6 blocks (indices 0-5), so 5 is actually the last block;
# use 2 or 3 instead for a true middle layer.
chosen_layer = 5

# The maximum number of tokens the input is truncated to.
# This is important for VRAM safety, especially on smaller GPUs
MAX_LEN_TRUNC = 128

# This is the hook function; PyTorch calls it during the forward pass
# so we can "inspect" the activations of the layer
# by simply storing them in the global stored_activations list
def hook_fn(module, input, output):
    stored_activations.append(output.detach().cpu())
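
To see what a forward hook receives before wiring it into the real model, here is a minimal sketch on a toy network. The names `toy_model`, `captured`, and `toy_hook` are hypothetical stand-ins for the real model, `stored_activations`, and `hook_fn`; only `register_forward_hook` is the actual PyTorch call.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the real model: a tiny two-layer network
toy_model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))

captured = []  # plays the role of stored_activations

def toy_hook(module, inputs, output):
    # inputs is a tuple of the positional arguments passed to module.forward;
    # output is whatever forward returned (a plain tensor here)
    captured.append(output.detach().cpu())

# Attach the hook to the first linear layer only
handle = toy_model[0].register_forward_hook(toy_hook)

with torch.no_grad():
    toy_model(torch.randn(2, 8))  # one forward pass with a batch of 2

print(len(captured), captured[0].shape)
```

One forward pass produces exactly one entry in `captured`, with the shape of the hooked layer's output rather than the model's final output.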

## Part 1: Simple SAE Implementation

The following chunk of code is just the prerequisite setup for the SAE. This is step 1.

In [None]:
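# Load the dataset and training prompts
# NB the full dataset is far too large to download and hold in memory, so stream it
# and keep only a small sample (the 1000 here is an arbitrary assumption, tune to taste)
dataset = load_dataset(dataset_name, split="train", streaming=True)
prompts = [example["text"] for example in dataset.take(1000)]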

# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Move model to GPU if available, using MPS cause I have a Mac
# This auto-detects which device to use: MPS first, then CUDA, else CPU
if torch.backends.mps.is_available():
    device = torch.device("mps")
elif torch.cuda.is_available():
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
print("Using device:", device) # print for sanity

# Move model to device and set to eval mode
model.to(device).eval()

# Choose a layer: here, the MLP of block chosen_layer (5)
target_layer = model.gpt_neox.layers[chosen_layer].mlp
hook = target_layer.register_forward_hook(hook_fn)
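
`register_forward_hook` returns a handle, and the hook keeps firing on every forward pass until that handle is removed. The sketch below shows this on a toy layer; `layer` and `seen` are hypothetical stand-ins for `target_layer` and `stored_activations`. In this notebook the equivalent call would be `hook.remove()` once activation collection is finished.

```python
import torch
import torch.nn as nn

layer = nn.Linear(4, 4)  # hypothetical stand-in for target_layer
seen = []                # hypothetical stand-in for stored_activations

# The hook returns None, so the layer's output is passed through unchanged
handle = layer.register_forward_hook(lambda m, i, o: seen.append(o.detach()))

with torch.no_grad():
    layer(torch.zeros(1, 4))  # hook fires, one activation recorded
    handle.remove()           # detach the hook from the layer
    layer(torch.zeros(1, 4))  # hook no longer fires

calls_recorded = len(seen)
print(calls_recorded)
```

Removing the hook once collection is done stops later forward passes from quietly growing the activation list.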

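Step 2: with the hook registered, run every prompt through the model. Each forward pass triggers `hook_fn`, which appends the MLP output of the chosen layer to `stored_activations`.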

In [None]:
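# Loop through all the prompts, tokenize them, and run a forward pass so the hook saves the activations
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=MAX_LEN_TRUNC).to(device)
    with torch.no_grad():  # inference only, no gradients needed
        model(**inputs)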
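
Once activations have been collected, each entry in `stored_activations` would be a tensor of shape `[batch, seq_len, d_model]`, with a different `seq_len` per prompt. A common way to prepare them for SAE training is to flatten them into a single `[n_tokens, d_model]` matrix, treating every token position as one training example. The following is a minimal sketch with random tensors in place of real activations; `fake_activations` and `activation_matrix` are hypothetical names, and the hidden size of 512 is an assumption about the model.

```python
import torch

d_model = 512  # assumed hidden size of the model

# Hypothetical stand-ins for two hooked forward passes of different lengths
fake_activations = [
    torch.randn(1, 5, d_model),  # prompt with 5 tokens
    torch.randn(1, 3, d_model),  # prompt with 3 tokens
]

# Collapse the batch and sequence dimensions, then stack all tokens together
activation_matrix = torch.cat(
    [a.reshape(-1, a.shape[-1]) for a in fake_activations], dim=0
)

print(activation_matrix.shape)
```

Flattening this way discards prompt boundaries and token order. That is usually acceptable, since a simple SAE reconstructs each activation vector independently.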