For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. Open-sourced and constantly updated.
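As background on what an SAE does, here is a minimal sketch in PyTorch: an overcomplete linear autoencoder trained with an L1 sparsity penalty on its feature activations. Sizes and the penalty weight are hypothetical; this is not the OpenMOSS code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU encoding yields sparse, non-negative feature activations.
        f = torch.relu(self.encoder(x))
        return self.decoder(f), f

sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
acts = torch.randn(32, 768)          # a batch of residual-stream activations
x_hat, f = sae(acts)
# Reconstruction loss plus an L1 penalty that drives most features to zero.
loss = (x_hat - acts).pow(2).mean() + 1e-3 * f.abs().sum(dim=-1).mean()
```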
Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
Explain a black-box module in natural language.
graphpatch is a library for activation patching on PyTorch neural network models.
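Activation patching swaps an activation from a clean run into a corrupted run to measure its causal effect on the output. A generic sketch of the pattern with plain PyTorch hooks, assuming the module returns a single tensor; graphpatch wraps this behind its own API, which is not shown here.

```python
import torch

def run_with_patch(model, module, clean_inputs, corrupt_inputs):
    """Run `model` on `corrupt_inputs`, but overwrite `module`'s output
    with the value it produced on `clean_inputs`."""
    cache = {}

    def save_hook(mod, args, output):
        cache["clean"] = output.detach()

    def patch_hook(mod, args, output):
        return cache["clean"]

    handle = module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_inputs)              # pass 1: cache the clean activation
    handle.remove()

    handle = module.register_forward_hook(patch_hook)
    with torch.no_grad():
        out = model(corrupt_inputs)      # pass 2: patch it back in
    handle.remove()
    return out

# Usage (hypothetical module path):
# out = run_with_patch(model, model.transformer.h[4].mlp, clean_ids, corrupt_ids)
```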
CoSy: Evaluating Textual Explanations
Universal Neurons in GPT2 Language Models
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Physiological modeling into the metaverse of Mycobacterium tuberculosis beta CA inhibition mechanism
Implementation for the paper "Understanding and Patching Compositional Reasoning in LLMs" (ACL 2024 Findings, Bangkok, Thailand).
🦠 DeepDecipher: An open-source API to MLP neurons
A mechanistic interpretability study investigating a sequential model trained to play the board game Othello
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
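AtP approximates the effect of activation patching with a first-order Taylor expansion, grad · (clean - corrupt), so one backward pass can score many candidate sites at once. A toy sketch of that estimate under a deliberately linear metric, where the approximation is exact; names are illustrative, not the repo's API.

```python
import torch

# Toy setting: a linear "metric" of one activation vector.
W = torch.randn(8)
act_clean = torch.randn(8)
act_corrupt = torch.randn(8, requires_grad=True)

metric = (W * act_corrupt).sum()     # metric on the corrupted run
metric.backward()                    # one backward pass

# AtP estimate of patching act_clean into the corrupted run:
#   delta_metric ≈ grad_metric(act_corrupt) · (act_clean - act_corrupt)
atp_score = (act_corrupt.grad * (act_clean - act_corrupt.detach())).sum()
```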
Steering vectors for transformer language models in PyTorch / Hugging Face
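Steering adds a fixed direction to the residual stream during the forward pass to shift model behavior. A minimal hook-based sketch, assuming a GPT-2-style block whose output may be a tuple; the names are illustrative and this is not the library's own API.

```python
import torch

def add_steering_hook(block, vector, scale=1.0):
    """Add `scale * vector` to `block`'s hidden-state output at every position."""
    def hook(mod, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * vector.to(hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return block.register_forward_hook(hook)

# Usage (hypothetical names):
# handle = add_steering_hook(model.transformer.h[6], steering_vector, scale=4.0)
# ...generate text...
# handle.remove()
```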
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing neural networks into computer code and discovering new algorithms that generalize out-of-distribution and outperform human-designed algorithms
Sparse and discrete interpretability tool for neural networks
Reverse-engineered Transformer models as a benchmark for interpretability methods