thestephencasper/mechanistic_interpretability_challenge

Mechanistic Interpretability Challenge

(SOLVED) Challenge 1, MNIST CNN:

Solution report: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 1

Use mechanistic interpretability tools to reverse engineer an MNIST CNN and send me a program for the labeling function it was trained on.

Hint 1: The labels are binary.

Hint 2: The network gets 95.58% accuracy on the test set.

Hint 3: The labeling function can be described in words in one sentence.

Hint 4: This image may be helpful.

[Image: mnist example]

MNIST CNN challenge: MNIST CNN challenge -- Colab
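A minimal sketch of what a submitted "program for the labeling function" might look like, and how a candidate could be scored against reference labels. Everything here is an assumption made for illustration: the rule in `candidate_label` is a hypothetical placeholder (not the actual labeling function), the `agreement` helper is not part of the challenge notebook, and the data is synthetic so the block runs on its own.

```python
import numpy as np


def candidate_label(image: np.ndarray) -> int:
    """Hypothetical placeholder rule, NOT the real labeling function.

    Returns 1 if the mean pixel intensity of a 28x28 grayscale image
    (values in [0, 1]) exceeds 0.5, else 0. The labels are binary (Hint 1).
    """
    return int(image.mean() > 0.5)


def agreement(label_fn, images: np.ndarray, reference: np.ndarray) -> float:
    """Fraction of images on which label_fn matches the reference labels."""
    predicted = np.array([label_fn(img) for img in images])
    return float((predicted == reference).mean())


# Synthetic stand-in data; in practice the images and reference labels
# would come from the challenge notebook (e.g. the network's predictions).
rng = np.random.default_rng(0)
images = rng.random((100, 28, 28))
reference = np.array([candidate_label(img) for img in images])

score = agreement(candidate_label, images, reference)
print(f"agreement with reference labels: {score:.2%}")
```

Because the network itself only reaches 95.58% test accuracy (Hint 2), a correct candidate would not necessarily agree with the network's predictions on every image; agreement is a sanity check, not a substitute for the mechanistic explanation.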

(Solved*) Challenge 2, Transformer:

*The challenge was not solved by finding the labeling function but instead by showing that finding the labeling function is very unlikely to be tractable. The report linked below quotes some of my thoughts on this.

Solution report: Solving the Mechanistic Interpretability challenges: EIS VII Challenge 2

Use mechanistic interpretability tools to reverse engineer a transformer and send me a program for the labeling function it was trained on.

Hint 1: The labels are binary.

Hint 2: The network is trained on 50% of examples and gets 97.27% accuracy on the test half.

Hint 3: Here are the ground truth and learned labels. Notice how the mistakes the network makes are all near curvy parts of the decision boundary...

[Image: drawing]

Transformer challenge: Transformer challenge -- Colab

Rewards:

If you send me code for one of the two labeling functions along with a justified mechanistic interpretability explanation for it (e.g. in the form of a Colab notebook), the prize is a $750 donation to a high-impact charity of your choice. So the total prize pool is $1,500 for both challenges. Thanks to Neel Nanda for contributing $500!
