# Generating explanations after caching the latents

Note from Eric: This is likely not going to be a part of the visualization yet. The important part is accessing the activations by feature in the other notebook.

This notebook shows a simple example of how to generate explanations for an SAE after its latents have been cached.

In [8]:
#!pip install orjson sae_auto_interp 
!pip install aiofiles 
#!pip install https://vllm-wheels.s3.us-west-2.amazonaws.com/nightly/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl

Collecting aiofiles
  Downloading aiofiles-24.1.0-py3-none-any.whl.metadata (10 kB)
Downloading aiofiles-24.1.0-py3-none-any.whl (15 kB)
Installing collected packages: aiofiles
Successfully installed aiofiles-24.1.0


In [13]:
import os
from functools import partial

import orjson
import torch

from sae_auto_interp.clients import OpenRouter
from sae_auto_interp.config import ExperimentConfig, FeatureConfig
from sae_auto_interp.explainers import DefaultExplainer
from sae_auto_interp.features import (
    FeatureDataset,
    FeatureLoader
)
from sae_auto_interp.features.constructors import default_constructor
from sae_auto_interp.features.samplers import sample
from sae_auto_interp.pipeline import Pipeline, process_wrapper

# The OpenRouter client created later in this notebook needs this key
API_KEY = os.getenv("OPENROUTER_API_KEY")


In [14]:
feature_cfg = FeatureConfig(
    width=131072, # The number of latents in your SAE
    min_examples=200, # The minimum number of examples a latent needs in order to be explained
    max_examples=10000, # The maximum number of examples to sample from
    n_splits=5 # The number of splits the cache was divided into
)


In [12]:
module = ".model.layers.10" # The layer to explain
feature_dict = {module: torch.arange(0, 10)} # Which latents to explain

dataset = FeatureDataset(
    raw_dir="latents", # The folder where the cache is stored
    cfg=feature_cfg,
    modules=[module],
    features=feature_dict,
)


FileNotFoundError: [Errno 2] No such file or directory: 'latents/.model.layers.10/config.json'
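
The error above suggests that `FeatureDataset` looks for a `config.json` inside `latents/.model.layers.10`, so the cache has to exist before this cell can run. A minimal sketch of a pre-flight check follows. It assumes the `<raw_dir>/<module>/config.json` layout inferred from the traceback, and the helper `missing_cache_configs` is hypothetical rather than part of `sae_auto_interp`.

```python
from pathlib import Path


def missing_cache_configs(raw_dir, modules):
    """Return the modules whose cache folder lacks a config.json.

    Assumes the layout <raw_dir>/<module>/config.json, inferred from the
    FileNotFoundError above; this helper is not part of sae_auto_interp.
    """
    root = Path(raw_dir)
    return [m for m in modules if not (root / m / "config.json").is_file()]


missing = missing_cache_configs("latents", [".model.layers.10"])
if missing:
    print(f"No cached latents found for: {missing}. Run the caching step first.")
```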

Next, we define the config for the examples shown to the explainer model.
These examples can be selected in three ways:
- "top", which takes the most strongly activating examples
- "random", which takes random examples from the whole activation distribution
- "quantiles", which takes examples from the quantiles of the activation distribution
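
To make the difference between the three options concrete, here is a small standalone sketch on toy data. It is illustrative only: the function `pick_examples`, its parameters, and the particular quantile scheme (equal-sized bins, an equal number of draws from each) are assumptions, not the sampler that `sae_auto_interp` implements.

```python
import random


def pick_examples(activations, n, how, n_quantiles=4, seed=0):
    """Pick n example indices from a list of activation values.

    Illustrative only: the names and the quantile scheme are assumptions,
    not the sae_auto_interp sampler. For "quantiles", n is assumed to be
    divisible by n_quantiles.
    """
    rng = random.Random(seed)
    # Indices sorted from most to least activating.
    order = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    if how == "top":
        return order[:n]
    if how == "random":
        return rng.sample(order, n)
    if how == "quantiles":
        # Split the sorted indices into equal bins and draw evenly from each.
        bin_size = len(order) // n_quantiles
        per_bin = n // n_quantiles
        picked = []
        for q in range(n_quantiles):
            bin_indices = order[q * bin_size:(q + 1) * bin_size]
            picked.extend(rng.sample(bin_indices, per_bin))
        return picked
    raise ValueError(f"unknown sampling type: {how}")


activations = [float(i) for i in range(100)]  # toy data: example i has activation i
top = pick_examples(activations, 8, "top")
random_pick = pick_examples(activations, 8, "random")
quantiles = pick_examples(activations, 8, "quantiles")
```

With "top" the explainer only sees the strongest activations, while "random" and "quantiles" also expose it to weaker ones.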


In [11]:

experiment_cfg = ExperimentConfig(
    n_examples_train=40, # Number of examples to sample for training
    example_ctx_len=32, # Length of each example
    train_type="random", # Type of sampler to use for training. 
)


The constructor defines the window of tokens used for each example. The default constructor builds examples of size ctx_len, which should be a divisor of the ctx_len used when caching the latents.
The sampler defines how the examples are selected. It always generates a train and a test set, but here we only care about the train set.
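
The divisor requirement is easiest to see on a single cached sequence. The sketch below is a plain-Python illustration, not the library's constructor: `split_into_windows` is a hypothetical helper, and the cached context length of 256 tokens is an assumed value chosen for the example.

```python
def split_into_windows(tokens, ctx_len):
    """Split one cached context into consecutive windows of ctx_len tokens."""
    if len(tokens) % ctx_len != 0:
        raise ValueError(
            f"ctx_len={ctx_len} must divide the cached context length {len(tokens)}"
        )
    return [tokens[i:i + ctx_len] for i in range(0, len(tokens), ctx_len)]


cached_context = list(range(256))  # stand-in for one cached sequence of 256 token ids
windows = split_into_windows(cached_context, 32)
print(len(windows), len(windows[0]))  # prints: 8 32
```

If ctx_len did not divide the cached length, the last window would be shorter than the others, which is why the check raises instead of silently truncating.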


In [12]:
constructor = partial(
    default_constructor,
    tokens=dataset.tokens,
    n_random=experiment_cfg.n_random,
    ctx_len=experiment_cfg.example_ctx_len,
    max_examples=feature_cfg.max_examples,
)
sampler = partial(sample, cfg=experiment_cfg)
loader = FeatureLoader(dataset, constructor=constructor, sampler=sampler)
    

We use pipes to generate the explanations. Each pipe starts by loading the examples for the corresponding latent and then passes them to the explainer. The explainer uses a client (here OpenRouter) to generate the explanations.

In [13]:
client = OpenRouter("anthropic/claude-3.5-sonnet", api_key=API_KEY)

# The function that saves the explanations, one file per latent
def explainer_postprocess(result):
    with open(f"results/explanations/{result.record.feature}.txt", "wb") as f:
        f.write(orjson.dumps(result.explanation))

explainer_pipe = process_wrapper(
    DefaultExplainer(
        client,
        tokenizer=dataset.tokenizer,
    ),
    postprocess=explainer_postprocess,
)

# Make sure the output folder exists before the pipeline writes to it
os.makedirs("results/explanations", exist_ok=True)

Here we are generating only explanations, so our pipeline has only two steps.
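
Conceptually, each latent passes through the steps in order while several latents are processed at once. The `asyncio` sketch below illustrates that idea in isolation. It is a generic pattern, not how `Pipeline` is implemented, and `load_examples`, `explain`, and `run_pipeline` are hypothetical stand-ins.

```python
import asyncio


async def load_examples(latent):
    # Hypothetical stand-in for loading the cached examples of one latent.
    await asyncio.sleep(0)
    return f"examples for latent {latent}"


async def explain(examples):
    # Hypothetical stand-in for a call to the explainer model.
    await asyncio.sleep(0)
    return f"explanation of {examples}"


async def run_pipeline(latents, max_concurrent):
    semaphore = asyncio.Semaphore(max_concurrent)  # caps the latents in flight

    async def process(latent):
        async with semaphore:
            return await explain(await load_examples(latent))

    # gather returns the results in the same order as the inputs.
    return await asyncio.gather(*(process(latent) for latent in latents))


# In a notebook cell, use `await run_pipeline(range(10), 10)` instead of asyncio.run.
explanations = asyncio.run(run_pipeline(range(10), max_concurrent=10))
```

Lowering the concurrency limit trades throughput for fewer simultaneous requests to the explainer model.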

In [14]:
pipeline = Pipeline(
    loader,
    explainer_pipe,
)
number_of_parallel_latents = 10
await pipeline.run(number_of_parallel_latents) # This will start generating the explanations.

Processing items: 0it [00:00, ?it/s]

RuntimeError: Failed to generate text after multiple attempts.