# Llama-3.1 Refusal Mechanism Analysis

**Mechanistic Interpretability Research on Safety Refusal Behaviors**

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/weissv/abstract/blob/main/llama_refusal_analysis.ipynb)
[![HuggingFace](https://img.shields.io/badge/🤗-HuggingFace-yellow)](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)

This notebook implements a comprehensive mechanistic interpretability study using **logit-based metrics** and **complete layer scanning**.

## 📋 What This Does

1. **Baseline Analysis**: Test harmful/harmless prompts with text-based classification
2. **Activation Patching**: Identify causal components across ALL 32 layers using logit difference metrics
3. **Ransomware Analysis**: Investigate bypass vulnerabilities ("the hole")
4. **Ablation Studies**: Verify necessary components for refusal behavior
5. **Interactive Visualizations**: 6 comprehensive Plotly dashboards + CSV summaries

## ⚙️ Hardware Requirements
- **Google Colab with T4 GPU** (15GB VRAM)
- **Model**: meta-llama/Meta-Llama-3.1-8B-Instruct
- **Python**: 3.10+

## 🔬 Key Features
- ✅ **Logit-based metrics** instead of text generation (10-20x faster, more precise)
- ✅ **Complete layer scan**: All 32 layers × 3 components = 96 experiments per prompt pair
- ✅ **Ransomware bypass analysis**: Detects activation pattern vulnerabilities
- ✅ **6 interactive visualizations**: Heatmaps, bar charts, scatter plots, cascades, stats, bypass analysis
- ✅ **Automated token management**: Uses Colab secrets (no manual input)
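
The 96-experiment figure comes from crossing the 32 layers with the 3 component types. A minimal sketch of that scan grid follows; the component labels `mlp`, `residual`, and `attention` are illustrative assumptions, and the repository may name them differently.

```python
from itertools import product

NUM_LAYERS = 32  # layer count stated in this notebook
COMPONENTS = ("mlp", "residual", "attention")  # illustrative labels (assumption)


def scan_grid(num_layers=NUM_LAYERS, components=COMPONENTS):
    """Return every (layer, component) pair visited by a full scan."""
    return list(product(range(num_layers), components))


grid = scan_grid()
print(len(grid))  # 96
```

Each pair in `grid` corresponds to one patching experiment for a given prompt pair.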

## 🚀 Setup
### Step 1: Check GPU

In [None]:
!nvidia-smi

### Step 2: Clone Repository & Check Structure

In [None]:
!git clone https://github.com/weissv/abstract.git
%cd abstract

# Verify structure
print("\n📁 Repository Structure:")
!ls -la src/
print("\n📊 Experiments:")
!ls -la experiments/
print("\n✓ Repository cloned successfully")

### Step 3: Install Dependencies

In [None]:
import os

!pip install "numpy>=1.26.4,<2.0" --force-reinstall --no-cache-dir

# Kill the kernel so Colab restarts the runtime and picks up the reinstalled NumPy.
os.kill(os.getpid(), 9)

In [None]:
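# The runtime restart in the previous cell resets the working directory,
# so return to the repository before installing its requirements.
%cd /content/abstract
!pip install -q -r requirements.txt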
!pip install --upgrade -q pandas plotly kaleido

import torch
import transformers
import plotly
import numpy as np

print(f"Numpy version: {np.__version__}")
print("=" * 60)
print("✓ Package Versions:")
print("=" * 60)
print(f'PyTorch: {torch.__version__}')
print(f'Transformers: {transformers.__version__}')
print(f'Plotly: {plotly.__version__}')
print(f'CUDA Available: {torch.cuda.is_available()}')
if torch.cuda.is_available():
    print(f'CUDA Device: {torch.cuda.get_device_name(0)}')
    print(f'VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB')
print("=" * 60)

### Step 4: Set HuggingFace Token

**Important**: Set your HuggingFace token in Colab secrets:
1. Click 🔑 (Key icon) in left sidebar
2. Add secret: Name = `HF_TOKEN`, Value = your token from [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
3. Enable notebook access for the secret

In [None]:
import os
from google.colab import userdata
token = userdata.get('HF_TOKEN')
os.environ['HF_TOKEN'] = token

## 📊 Run Experiments

### Experiment 1: Baseline Analysis

Tests harmful vs harmless prompts with text-based classification.

**Expected outputs**:
- `outputs/results/01_baseline_results.json` - Raw results
- Console output with refusal detection
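
Text-based classification of this kind is often done by matching refusal phrases in the generated response. The sketch below illustrates that idea only; the marker list and the `is_refusal` helper are hypothetical and are not taken from `experiments/01_baseline.py`.

```python
# Hypothetical refusal phrases; the repository's actual list may differ.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")


def is_refusal(response, markers=REFUSAL_MARKERS):
    """Heuristic: flag a response whose opening contains a refusal phrase."""
    text = response.strip().lower().replace("\u2019", "'")  # normalise curly apostrophes
    opening = text[:120]  # refusals usually appear at the start of the reply
    return any(marker in opening for marker in markers)


print(is_refusal("I can't help with that request."))
print(is_refusal("Sure, here is a recipe for bread."))
```

Phrase matching is cheap but brittle, which is one reason the later experiments switch to logit-based metrics.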

In [None]:
print("\n" + "=" * 60)
print("🧪 EXPERIMENT 1: Baseline Analysis")
print("=" * 60)

!python experiments/01_baseline.py

print("\n" + "=" * 60)
print("✓ Experiment 1 Complete")
print("=" * 60)

### Experiment 2: Activation Patching (Advanced)

**Complete layer scanning** with logit-based metrics:
- Scans ALL 32 layers × 3 components (MLPs, Residuals, Attention) = 96 experiments per prompt pair
- Uses logit difference metrics (10-20x faster than text generation)
- Includes ransomware bypass analysis

**Expected outputs**:
- `outputs/results/02_patching_results.json` - Complete results with logit metrics
- `outputs/results/02_ransomware_bypass_analysis.json` - Bypass vulnerability analysis (if applicable)
- `outputs/figures/01_causal_heatmap.html` - Interactive heatmap of causal effects
- `outputs/figures/02_layer_importance.html` - Bar chart of layer importance
- `outputs/figures/03_top_components.html` - Scatter plot of top 30 components
- `outputs/figures/04_refusal_cascade.html` - Line plot showing cascade across layers
- `outputs/figures/05_logit_statistics.html` - Detailed logit distribution comparison
- `outputs/figures/06_ransomware_bypass.html` - Ransomware bypass visualization (if applicable)
- `outputs/results/02_summary.csv` - CSV summary table

**Runtime**: ~15-25 minutes on T4 GPU
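
To make the idea of activation patching concrete, here is a toy sketch on a scalar "residual stream" with three components. It is not the code in `experiments/02_patching.py`. It assumes that activations cached from the harmful run are patched into the harmless run, and it uses the final stream value as a stand-in for the logit difference.

```python
def forward(components, h, patch=None):
    """Run a toy residual model; patch=(index, value) overrides one component output."""
    cache = []
    for i, component in enumerate(components):
        out = component(h)
        if patch is not None and patch[0] == i:
            out = patch[1]  # swap in the activation cached from the other run
        cache.append(out)
        h = h + out  # residual connection
    return h, cache


# Three toy components acting on a scalar hidden state (illustrative only).
components = [lambda h: 0.5 * h, lambda h: max(h, 0.0), lambda h: -0.25 * h]

harmful_metric, harmful_cache = forward(components, 2.0)  # stand-in for a harmful prompt
harmless_metric, _ = forward(components, -2.0)            # stand-in for a harmless prompt

# Causal effect of each component: change in the metric when only that
# component's output is replaced by its value from the harmful run.
effects = []
for i in range(len(components)):
    patched_metric, _ = forward(components, -2.0, patch=(i, harmful_cache[i]))
    effects.append(patched_metric - harmless_metric)

print(effects)  # [1.5, 2.25, -2.25]
```

In the real scan the same loop runs over every layer and component, with the metric read from the model's output logits.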

In [None]:
print("\n" + "=" * 60)
print("🧪 EXPERIMENT 2: Activation Patching (Advanced)")
print("=" * 60)
print("⚙️  Scanning all 32 layers with logit-based metrics...")
print("⏱️  Estimated time: 15-25 minutes on T4 GPU")
print("=" * 60 + "\n")

!python experiments/02_patching.py

print("\n" + "=" * 60)
print("✓ Experiment 2 Complete")
print("=" * 60)
print("\n📊 Generated outputs:")
!ls -lh outputs/figures/*.html 2>/dev/null || echo "No HTML files yet"
!ls -lh outputs/results/02_*.json outputs/results/02_*.csv 2>/dev/null || echo "No result files yet"

### Experiment 3: Ablation Study

Verifies that identified components are necessary for refusal behavior.

**Expected outputs**:
- `outputs/results/03_ablation_results.json` - Ablation results
- Console output showing refusal reduction when components are ablated

**Runtime**: ~5-10 minutes on T4 GPU
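
The necessity test can be illustrated with a toy ablation: remove one component's contribution and measure how much the metric drops. This sketch assumes zero-ablation (the output is replaced by 0); the repository may use a different scheme, such as mean-ablation.

```python
def forward(components, h, ablate=None):
    """Toy residual model; the component at index `ablate` contributes zero."""
    for i, component in enumerate(components):
        out = 0.0 if i == ablate else component(h)
        h = h + out  # residual connection
    return h


# Three toy components acting on a scalar hidden state (illustrative only).
components = [lambda h: 0.5 * h, lambda h: max(h, 0.0), lambda h: -0.25 * h]

baseline = forward(components, 2.0)
# A large positive drop means the metric depends on that component.
drops = [baseline - forward(components, 2.0, ablate=i) for i in range(len(components))]

print(baseline)  # 4.5
print(drops)     # [1.5, 2.25, -1.5]
```

A component with a large drop is necessary for the behavior in this toy setting, while a negative drop means the component was pushing the metric down.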

In [None]:
print("\n" + "=" * 60)
print("🧪 EXPERIMENT 3: Ablation Study")
print("=" * 60)
print("⚙️  Testing necessity of identified components...")
print("⏱️  Estimated time: 5-10 minutes on T4 GPU")
print("=" * 60 + "\n")

!python experiments/03_ablation.py

print("\n" + "=" * 60)
print("✓ Experiment 3 Complete")
print("=" * 60)

## 📊 Interactive Visualizations

Display all 6 comprehensive visualizations generated from Experiment 2.

**Visualizations included**:
1. **Causal Heatmap**: Effects across all layers and components
2. **Layer Importance**: Max/mean effects per layer
3. **Top Components**: Scatter plot of top 30 by magnitude
4. **Refusal Cascade**: Line plot showing effect progression
5. **Logit Statistics**: Distribution comparison (harmful/harmless/patched)
6. **Ransomware Bypass**: L2 distance analysis (if applicable)
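
For reference, the L2 (Euclidean) distance used in the bypass analysis can be computed as below. Which vectors the repository compares is not stated here; the sketch assumes two activation vectors of equal length, and the numbers are made up.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up activation vectors, for illustration only.
reference = [0.0, 0.0]
candidate = [3.0, 4.0]
print(l2_distance(reference, candidate))  # 5.0
```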

In [None]:
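import sys
import os
from pathlib import Path

# The figures below are produced by Experiment 2. Uncomment the next line only if
# you need to regenerate them (it re-runs the full scan, ~15-25 minutes on a T4).
# !python experiments/02_patching.py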

# Add src to path
sys.path.insert(0, '/content/abstract/src')

# Import visualization module
from colab_visualization import display_in_colab

print("=" * 60)
print("📊 DISPLAYING INTERACTIVE VISUALIZATIONS")
print("=" * 60)

# Display all generated visualizations
figures_dir = Path('/content/abstract/outputs/figures')

if figures_dir.exists():
    html_files = sorted(figures_dir.glob('*.html'))
    
    if html_files:
        print(f"\n✓ Found {len(html_files)} visualization(s)\n")
        
        for html_file in html_files:
            print(f"\n{'=' * 60}")
            print(f"📈 {html_file.stem.replace('_', ' ').title()}")
            print("=" * 60)
            display_in_colab(str(html_file))
    else:
        print("\n❌ No HTML files found. Please run Experiment 2 first.")
        print("   Command: !python experiments/02_patching.py")
else:
    print("\n❌ Figures directory not found. Please run Experiment 2 first.")
    print("   Command: !python experiments/02_patching.py")

print("\n" + "=" * 60)
print("✓ Visualization Display Complete")
print("=" * 60)

## 📥 Download Results

Download all results and visualizations to your local machine.

In [None]:
from google.colab import files
import zipfile
import os

print("=" * 60)
print("📦 PREPARING DOWNLOAD PACKAGE")
print("=" * 60)

# Create zip archive
zip_path = '/content/llama_refusal_results.zip'

with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zipf:
    # Add all result files
    for root, dirs, file_list in os.walk('/content/abstract/outputs'):
        for file in file_list:
            file_path = os.path.join(root, file)
            arcname = os.path.relpath(file_path, '/content/abstract')
            zipf.write(file_path, arcname)
            print(f"  ✓ Added: {arcname}")

# Get zip size
zip_size = os.path.getsize(zip_path) / (1024 * 1024)  # MB

print("\n" + "=" * 60)
print(f"✓ Archive created: {zip_size:.2f} MB")
print("=" * 60)
print("\n📥 Downloading...")

# Download
files.download(zip_path)

print("\n✓ Download complete!")
print("\n📂 Package includes:")
print("  - outputs/results/*.json - All experiment results")
print("  - outputs/results/*.csv - Summary tables")
print("  - outputs/figures/*.html - Interactive visualizations")

## 📚 Additional Resources

### Key Findings

This research identifies:
- **Top 15-20 causal components** responsible for refusal behavior
- **Layer importance ranking** across all 32 Llama-3.1 layers
- **Ransomware bypass mechanism** (if vulnerability detected)
- **Logit-based metrics** providing precise measurements

### Metric Explanation

**Logit Difference** = mean(refusal_token_logits) - mean(compliance_token_logits)
- Positive values → Model prefers refusal tokens
- Negative values → Model prefers compliance tokens
- Used instead of text generation (10-20x faster, more precise)
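
As a worked example of the formula above, the sketch below computes the logit difference for one next-token position. The logits and the token id lists are made-up values; in practice the ids would come from tokenizing the chosen refusal and compliance tokens.

```python
from statistics import mean


def logit_difference(logits, refusal_ids, compliance_ids):
    """mean(refusal token logits) - mean(compliance token logits)."""
    refusal = mean(logits[i] for i in refusal_ids)
    compliance = mean(logits[i] for i in compliance_ids)
    return refusal - compliance


# Toy next-token logits over a 6-token vocabulary (made-up numbers).
logits = [2.0, 0.5, 3.0, -1.0, 0.0, 1.0]
refusal_ids = [0, 2]     # hypothetical ids of refusal tokens
compliance_ids = [3, 5]  # hypothetical ids of compliance tokens

print(logit_difference(logits, refusal_ids, compliance_ids))  # 2.5
```

Here the result is positive, so this toy distribution favors the refusal tokens.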

### Files Generated

| File | Description | Format |
|------|-------------|--------|
| `01_baseline_results.json` | Baseline harmful/harmless responses | JSON |
| `02_patching_results.json` | Complete patching results with logit metrics | JSON |
| `02_ransomware_bypass_analysis.json` | Bypass vulnerability analysis | JSON |
| `02_summary.csv` | Top components summary table | CSV |
| `01_causal_heatmap.html` | Interactive heatmap visualization | Plotly HTML |
| `02_layer_importance.html` | Layer importance bar chart | Plotly HTML |
| `03_top_components.html` | Top 30 components scatter plot | Plotly HTML |
| `04_refusal_cascade.html` | Cascade across layers line plot | Plotly HTML |
| `05_logit_statistics.html` | Logit distribution comparison | Plotly HTML |
| `06_ransomware_bypass.html` | Ransomware bypass analysis | Plotly HTML |
| `03_ablation_results.json` | Ablation study results | JSON |
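
After running the experiments, a quick check for missing outputs can save a failed download or an empty visualization cell. The sketch below uses the file names from the table and the `outputs/results` and `outputs/figures` directories named earlier. The `missing_outputs` helper is hypothetical, and the two ransomware files are left out because they are only produced when applicable.

```python
from pathlib import Path

# File names from the table above, grouped by output subdirectory.
EXPECTED = {
    "results": [
        "01_baseline_results.json",
        "02_patching_results.json",
        "02_summary.csv",
        "03_ablation_results.json",
    ],
    "figures": [
        "01_causal_heatmap.html",
        "02_layer_importance.html",
        "03_top_components.html",
        "04_refusal_cascade.html",
        "05_logit_statistics.html",
    ],
}


def missing_outputs(outputs_dir):
    """Return the expected output files that are not present under outputs_dir."""
    base = Path(outputs_dir)
    return [
        base / subdir / name
        for subdir, names in EXPECTED.items()
        for name in names
        if not (base / subdir / name).is_file()
    ]


for path in missing_outputs("outputs"):
    print(f"missing: {path}")
```

Run it from the repository root so that the relative `outputs` path resolves correctly.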

### Troubleshooting

**Issue**: Model fails to load
- **Solution**: Verify HF_TOKEN is set in Colab secrets with correct permissions

**Issue**: CUDA out of memory
- **Solution**: Restart runtime, ensure T4 GPU is selected (Runtime → Change runtime type)

**Issue**: Visualizations not displaying
- **Solution**: Run Experiment 2 first (`!python experiments/02_patching.py`)

### Citation

If you use this research, please cite:
```
Llama-3.1 Refusal Mechanism Analysis
Mechanistic Interpretability Research on Safety Refusal Behaviors
https://github.com/weissv/abstract
```