Exploring length generalization in the context of the indirect object identification (IOI) task for mechanistic interpretability.
Updated Jan 5, 2024 - Python
Starting Kit for the CodaBench competition on Transformer Interpretability
This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
graphpatch is a library for activation patching on PyTorch neural network models.
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
PyTorch and NNsight implementation of AtP* (Kramar et al 2024, DeepMind)
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Steering vectors for transformer language models in PyTorch / Hugging Face
Sparse and discrete interpretability tool for neural networks
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing NNs to computer code and discovering new algorithms which generalize out-of-distribution and outperform human-designed algorithms
Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions