
Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

This repository reproduces results of Anthropic's Sparse Dictionary Learning paper. The codebase is quite rough, but the results are excellent. See the feature interface to browse through the features learned by the sparse autoencoder. There are improvements to be made (see the TODOs section below), and I will work on them intermittently as I juggle things in life :)

I trained a 1-layer transformer model from scratch using nanoGPT with $d_{\text{model}} = 128$. Then, I trained a sparse autoencoder with $4096$ features on its MLP activations, as in Anthropic's paper. 93% of the autoencoder neurons were alive, and only 5% of those were of ultra-low density. The autoencoder learned several interesting features: for example, there is a feature for the French language, a feature each for German, Japanese, and many other languages, and many more besides.
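
For reference, here is a minimal sketch of the kind of sparse autoencoder described above: a linear encoder followed by a ReLU, a linear decoder, and an L1 penalty on the feature activations. The class and function names are illustrative, and the sketch omits details of Anthropic's setup such as the pre-encoder bias subtraction and decoder-column normalization.

```python
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Minimal sparse autoencoder over MLP activations (illustrative sketch).

    d_mlp:  dimension of the MLP activations being decomposed (512 here)
    n_feat: number of dictionary features (4096 here)
    """

    def __init__(self, d_mlp: int = 512, n_feat: int = 4096):
        super().__init__()
        self.enc = nn.Linear(d_mlp, n_feat)
        self.dec = nn.Linear(n_feat, d_mlp)

    def forward(self, x):
        # Feature activations are non-negative; the L1 penalty pushes most to zero.
        f = F.relu(self.enc(x))
        x_hat = self.dec(f)
        return x_hat, f

def loss_fn(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the feature activations.
    recon = F.mse_loss(x_hat, x)
    sparsity = l1_coeff * f.abs().sum(dim=-1).mean()
    return recon + sparsity
```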

Training Details

I used the "OpenWebText" dataset to train the transformer model, to generate the MLP activations dataset for the autoencoder, and to generate the feature interface visualizations. The transformer model had $d_{\text{model}}= 128$, $d_{\text{MLP}} = 512$, and $n_{\text{head}}= 4$. I trained this model for $2 \times 10^5$ iterations to roughly match the number of epochs with Anthropic's training procedure.

I collected the dataset of 4B MLP activations by performing forward passes on 20M prompts (each of length 1024), keeping 200 activation vectors from each prompt. Next, I trained the autoencoder for approximately $5 \times 10^5$ training steps at batch size 8192 and learning rate $3 \times 10^{-4}$. I performed neuron resampling 4 times during training, at training steps $2.5 \times i \times 10^4$ for $i = 1, 2, 3, 4$. See the complete log of the training run on the W&B page. The L1 coefficient for this training run was $10^{-3}$; I selected the L1 coefficient and the learning rate by performing a grid search.
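
The training run above can be summarized by a skeleton like the following, reusing the SparseAutoencoder and loss_fn sketches from earlier. Here, get_activation_batch and resample_dead_neurons are hypothetical placeholders for the activation-dataset loader and for Anthropic's resampling procedure, respectively.

```python
import torch

# Hyperparameters from the run described above.
batch_size = 8192
learning_rate = 3e-4
l1_coeff = 1e-3
total_steps = 500_000
resample_steps = {25_000, 50_000, 75_000, 100_000}  # 2.5e4 * i for i = 1, 2, 3, 4

sae = SparseAutoencoder(d_mlp=512, n_feat=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=learning_rate)

for step in range(1, total_steps + 1):
    x = get_activation_batch(batch_size)       # hypothetical loader over the 4B MLP activations
    x_hat, f = sae(x)
    loss = loss_fn(x, x_hat, f, l1_coeff)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if step in resample_steps:
        resample_dead_neurons(sae, optimizer)  # hypothetical; see the resampling procedure in Anthropic's appendix
```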

For the most part, I followed the training procedure described in the appendix of Anthropic's original paper. I did not follow the improvements they suggested in their January and February updates.

TODOs

  • Incorporate the effects of feature ablations in the feature interface.
  • Implement an interface to see "Feature Activations on Example Texts" as done by Anthropic here.
  • Modify the code so that one can train a sparse autoencoder on activations of any MLP / attention layer.

Related Work

There are several other very interesting works on the web exploring sparse dictionary learning. Here is a small subset of them.
