For OpenMOSS Mechanistic Interpretability Team's Sparse Autoencoder (SAE) research. Open-sourced and constantly updated.
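As background on what an SAE does, here is a minimal sketch in PyTorch: an overcomplete linear autoencoder trained with an L1 sparsity penalty on its feature activations. Sizes and the penalty weight are hypothetical; this is not the OpenMOSS code.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        # ReLU encoding yields sparse, non-negative feature activations.
        f = torch.relu(self.encoder(x))
        return self.decoder(f), f

sae = SparseAutoencoder(d_model=768, d_hidden=768 * 8)
acts = torch.randn(32, 768)          # a batch of residual-stream activations
x_hat, f = sae(acts)
# Reconstruction loss plus an L1 penalty that drives most features to zero.
loss = (x_hat - acts).pow(2).mean() + 1e-3 * f.abs().sum(dim=-1).mean()
```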
Stanford NLP Python Library for Understanding and Improving PyTorch Models via Interventions
Explain a black-box module in natural language.
graphpatch is a library for activation patching on PyTorch neural network models.
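Activation patching swaps an activation from a clean run into a corrupted run to measure its causal effect on the output. A generic sketch of the pattern with plain PyTorch hooks, assuming the module returns a single tensor; graphpatch wraps this behind its own API, which is not shown here.

```python
import torch

def run_with_patch(model, module, clean_inputs, corrupt_inputs):
    """Run `model` on `corrupt_inputs`, but overwrite `module`'s output
    with the value it produced on `clean_inputs`."""
    cache = {}

    def save_hook(mod, args, output):
        cache["clean"] = output.detach()

    def patch_hook(mod, args, output):
        return cache["clean"]

    handle = module.register_forward_hook(save_hook)
    with torch.no_grad():
        model(clean_inputs)              # pass 1: cache the clean activation
    handle.remove()

    handle = module.register_forward_hook(patch_hook)
    with torch.no_grad():
        out = model(corrupt_inputs)      # pass 2: patch it back in
    handle.remove()
    return out

# Usage (hypothetical module path):
# out = run_with_patch(model, model.transformer.h[4].mlp, clean_ids, corrupt_ids)
```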
CoSy: Evaluating Textual Explanations
Universal Neurons in GPT2 Language Models
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Physiological modeling into the metaverse of Mycobacterium tuberculosis beta CA inhibition mechanism
Implementation for the paper "Understanding and Patching Compositional Reasoning in LLMs" (ACL 2024 Findings, Bangkok, Thailand).
🦠 DeepDecipher: An open-source API to MLP neurons
A mechanistic interpretability study investigating a sequential model trained to play the board game Othello
PyTorch and NNsight implementation of AtP* (Kramár et al., 2024, DeepMind)
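AtP approximates the effect of activation patching with a first-order Taylor expansion, grad · (clean - corrupt), so one backward pass can score many candidate sites at once. A toy sketch of that estimate under a deliberately linear metric, where the approximation is exact; names are illustrative, not the repo's API.

```python
import torch

# Toy setting: a linear "metric" of one activation vector.
W = torch.randn(8)
act_clean = torch.randn(8)
act_corrupt = torch.randn(8, requires_grad=True)

metric = (W * act_corrupt).sum()     # metric on the corrupted run
metric.backward()                    # one backward pass

# AtP estimate of patching act_clean into the corrupted run:
#   delta_metric ≈ grad_metric(act_corrupt) · (act_clean - act_corrupt)
atp_score = (act_corrupt.grad * (act_clean - act_corrupt.detach())).sum()
```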
Steering vectors for transformer language models in PyTorch / Hugging Face
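Steering adds a fixed direction to the residual stream during the forward pass to shift model behavior. A minimal hook-based sketch, assuming a GPT-2-style block whose output may be a tuple; the names are illustrative and this is not the library's own API.

```python
import torch

def add_steering_hook(block, vector, scale=1.0):
    """Add `scale * vector` to `block`'s hidden-state output at every position."""
    def hook(mod, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + scale * vector.to(hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered
    return block.register_forward_hook(hook)

# Usage (hypothetical names):
# handle = add_steering_hook(model.transformer.h[6], steering_vector, scale=4.0)
# ...generate text...
# handle.remove()
```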
This repository contains the code used for the experiments in the paper "Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking".
This repository contains the code used for the experiments in the paper "Discovering Variable Binding Circuitry with Desiderata".
Repo accompanying our paper "Do Llamas Work in English? On the Latent Language of Multilingual Transformers".
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Mechanistically interpretable neurosymbolic AI (Nature Comput Sci 2024): losslessly compressing neural networks into computer code and discovering new algorithms that generalize out-of-distribution and outperform human-designed algorithms
Sparse and discrete interpretability tool for neural networks
Reverse-engineered Transformer models as a benchmark for interpretability methods