# Llama-3.1 Refusal Mechanism Analysis

**Mechanistic Interpretability Research on Safety Refusal Behaviors**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weissv/abstract/blob/main/llama_refusal_analysis.ipynb)
[![HuggingFace](https://img.shields.io/badge/🤗-HuggingFace-yellow)](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)

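This notebook runs a mechanistic interpretability study of safety refusals in Llama-3.1-8B-Instruct: it measures baseline behavior on harmful and harmless prompts, then uses activation patching and ablation to locate the model components responsible for refusing.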

## 📋 What This Does

1. **Baseline Analysis**: Test harmful/harmless prompts
2. **Activation Patching**: Identify causal components
3. **Ablation Studies**: Verify necessary components
4. **Visualization**: Generate interactive dashboards

## ⚙️ Hardware Requirements
- Google Colab with T4 GPU (15 GB VRAM)
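
As a rough sanity check on the 15 GB figure, the sketch below estimates how much memory the model weights alone would need at different precisions. The parameter count of about 8 billion is an assumption taken from the model name, and the precision the repository actually loads the model in is not stated here, so treat the numbers as back-of-the-envelope only. Activations and the KV cache need additional memory on top of the weights.

```python
def weight_memory_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB (1 GiB = 1024**3 bytes)."""
    return n_params * bytes_per_param / 1024**3


# Assumption: roughly 8 billion parameters, as suggested by the "8B" model name.
N_PARAMS = 8.0e9

# Bytes per parameter for common storage precisions.
PRECISIONS = [
    ("float32", 4.0),
    ("float16 / bfloat16", 2.0),
    ("int8", 1.0),
    ("4-bit", 0.5),
]

for name, bytes_per_param in PRECISIONS:
    gib = weight_memory_gib(N_PARAMS, bytes_per_param)
    print(f"{name:>20}: {gib:5.1f} GiB")
```

Under these assumptions, 16-bit weights come to just under 15 GiB, which leaves almost no headroom on a T4, so it is worth checking which precision the experiment scripts use before a long run.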


## 🚀 Setup
### Step 1: Check GPU

In [None]:
!nvidia-smi
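
If you would rather check the GPU programmatically than read the `nvidia-smi` table, the sketch below queries the GPU name and total memory and parses the result. It assumes `nvidia-smi` is on the `PATH`; the helper names `parse_gpu_csv` and `query_gpus` are hypothetical and not part of the repository.

```python
import subprocess
from typing import List, Tuple


def parse_gpu_csv(text: str) -> List[Tuple[str, int]]:
    """Parse lines such as 'Tesla T4, 15360 MiB' into (name, memory_mib) pairs."""
    gpus = []
    for line in text.strip().splitlines():
        # Split on the last comma so GPU names containing commas stay intact.
        name, memory = [part.strip() for part in line.rsplit(",", 1)]
        gpus.append((name, int(memory.split()[0])))
    return gpus


def query_gpus() -> List[Tuple[str, int]]:
    """Ask nvidia-smi for the name and total memory of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if nvidia-smi reports a failure
    )
    return parse_gpu_csv(result.stdout)
```

In a Colab cell, `print(query_gpus())` lists the attached GPUs. It raises `FileNotFoundError` when no NVIDIA driver is installed, which usually means the runtime type is not set to GPU.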

### Step 2: Clone Repository

In [None]:
!git clone https://github.com/weissv/abstract.git
%cd abstract
!ls -la

### Step 3: Install Dependencies

In [None]:
!pip install -q -r requirements.txt

import torch
print(f'✓ PyTorch: {torch.__version__}')
print(f'✓ CUDA: {torch.cuda.is_available()}')

### Step 4: HuggingFace Login

In [None]:
from huggingface_hub import login
from getpass import getpass
import os

hf_token = getpass('Enter HuggingFace token: ')
os.environ['HF_TOKEN'] = hf_token
login(token=hf_token)
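
When re-running the notebook in the same session, it can be convenient to reuse a token that is already in the environment instead of prompting again. The sketch below is one way to do that; `resolve_hf_token` is a hypothetical helper, not something the repository provides.

```python
import os
from getpass import getpass
from typing import Callable, Mapping


def resolve_hf_token(
    env: Mapping[str, str] = os.environ,
    prompt: Callable[[str], str] = getpass,
) -> str:
    """Return HF_TOKEN from the environment, or prompt for it if unset."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        # getpass hides the input, so the token never shows up in cell output.
        token = prompt("Enter HuggingFace token: ").strip()
    if not token:
        raise ValueError("No HuggingFace token provided")
    return token
```

The result can be passed straight to the login call above, for example `login(token=resolve_hf_token())`.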

## 📊 Run Experiments
### Baseline

In [None]:
!python experiments/01_baseline.py
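
The contents of `experiments/01_baseline.py` are not shown in this notebook. As a rough illustration of what a baseline comparison can look like, the sketch below scores responses with a simple phrase-matching heuristic and reports a refusal rate per prompt set. The marker phrases and the example strings are placeholders made up for illustration; they are not model outputs and not necessarily what the script uses.

```python
from typing import Iterable, Sequence

# Assumption: a small hand-picked list of phrases that often open a refusal.
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable", "i won't")


def is_refusal(response: str, markers: Sequence[str] = REFUSAL_MARKERS) -> bool:
    """Return True if the response contains any refusal marker (case-insensitive)."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in markers)


def refusal_rate(responses: Iterable[str]) -> float:
    """Fraction of responses flagged as refusals; 0.0 for an empty input."""
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


# Placeholder strings standing in for generated text.
harmful_outputs = [
    "I'm sorry, but I can't help with that.",
    "I cannot assist with this request.",
]
harmless_outputs = [
    "Sure! Here is a recipe for pancakes.",
    "The capital of France is Paris.",
]

print("harmful refusal rate: ", refusal_rate(harmful_outputs))
print("harmless refusal rate:", refusal_rate(harmless_outputs))
```

Phrase matching is cheap but brittle: it misses refusals worded differently and can flag helpful answers that happen to contain a marker, so spot-check a sample of responses by hand.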

### Activation Patching

In [None]:
!python experiments/02_patching.py
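
For readers new to the technique, here is a minimal, dependency-free sketch of activation patching on a toy network. It is not the implementation in `experiments/02_patching.py`; the weights, inputs, and the idea of a single "refusal score" output are all assumptions chosen to keep the example small. The procedure is: cache the hidden activations from a "clean" run, re-run a "corrupted" input while overwriting one hidden unit at a time with its clean value, and record how much of the output gap each patch restores.

```python
from typing import Dict, List, Optional, Tuple

# Toy stand-in for a model: 2 inputs -> 3 hidden units (ReLU) -> 1 score.
# All weights are made up for illustration.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT = [2.0, 0.0, 0.5]


def forward(
    x: List[float], patch: Optional[Dict[int, float]] = None
) -> Tuple[List[float], float]:
    """Run the toy model, optionally overwriting some hidden activations."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_IN]
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value  # replace this unit's activation
    score = sum(w * h for w, h in zip(W_OUT, hidden))
    return hidden, score


def patching_effects(clean_x: List[float], corrupt_x: List[float]) -> List[float]:
    """Fraction of the clean-vs-corrupt score gap restored by patching each unit."""
    clean_hidden, clean_score = forward(clean_x)
    _, corrupt_score = forward(corrupt_x)
    gap = clean_score - corrupt_score
    if gap == 0:
        raise ValueError("clean and corrupted runs give the same score")
    effects = []
    for idx, clean_value in enumerate(clean_hidden):
        _, patched_score = forward(corrupt_x, patch={idx: clean_value})
        effects.append((patched_score - corrupt_score) / gap)
    return effects


# "Clean" input plays the role of a harmful prompt, "corrupted" a harmless one.
effects = patching_effects(clean_x=[1.0, 0.0], corrupt_x=[0.0, 1.0])
print(effects)  # → [1.0, 0.0, 0.0]
```

In this toy, patching unit 0 alone restores the full gap, so it is the only unit that carries the difference between the two inputs. On a real transformer the same loop runs over layers, heads, or token positions, typically using forward hooks.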

### Ablation Study

In [None]:
!python experiments/03_ablation.py
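
To make the idea of an ablation study concrete, the sketch below zero-ablates each hidden unit of a toy network in turn and measures how far the output score drops. This is an illustration under made-up weights, not the code in `experiments/03_ablation.py`, and zero-ablation is only one option (mean-ablation and resample-ablation are common alternatives).

```python
from typing import FrozenSet, List

# Toy stand-in for a model: 2 inputs -> 3 hidden units (ReLU) -> 1 score.
# All weights are made up for illustration.
W_IN = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_OUT = [2.0, 0.0, 0.5]


def forward(x: List[float], ablate: FrozenSet[int] = frozenset()) -> float:
    """Run the toy model with the listed hidden units forced to zero."""
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_IN]
    hidden = [0.0 if idx in ablate else h for idx, h in enumerate(hidden)]
    return sum(w * h for w, h in zip(W_OUT, hidden))


def ablation_drops(x: List[float]) -> List[float]:
    """Score drop caused by zero-ablating each hidden unit on its own."""
    baseline = forward(x)
    return [baseline - forward(x, ablate=frozenset({idx})) for idx in range(len(W_IN))]


drops = ablation_drops([1.0, 0.0])
print(drops)  # → [2.0, 0.0, 0.5]
```

A large drop means the unit is necessary for the score on this input. Here unit 0 accounts for most of it, unit 2 contributes a little, and unit 1 contributes nothing.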

## 📥 Download Results

In [None]:
!zip -r results.zip outputs/
from google.colab import files
files.download('results.zip')
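
If an experiment failed, the `zip` step above can still produce an empty or partial archive without an obvious error. A pure-Python variant that fails loudly when there is nothing to download might look like the sketch below; `archive_outputs` is a hypothetical helper, and the `outputs/` directory name is taken from the cell above.

```python
import shutil
from pathlib import Path


def archive_outputs(output_dir: str = "outputs", archive_stem: str = "results") -> Path:
    """Zip output_dir into <archive_stem>.zip and return the archive path."""
    src = Path(output_dir)
    if not src.is_dir() or not any(src.iterdir()):
        raise FileNotFoundError(f"No results found in {src}; run the experiments first")
    # make_archive appends the ".zip" suffix itself and returns the archive path.
    archive = shutil.make_archive(
        archive_stem, "zip", root_dir=src.parent, base_dir=src.name
    )
    return Path(archive)
```

In Colab, the returned path can be handed to the download helper, for example `files.download(str(archive_outputs()))`.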