# Abliteration: Using Wisent CLI to Reduce Model Refusals

This notebook demonstrates **abliteration**: permanently modifying model weights to reduce unnecessary refusals while preserving helpful behavior.

**How it works:**
1. Generate contrastive pairs (compliant response vs refusal) synthetically
2. Compute the "refusal direction" in activation space
3. Project out this direction from model weights using orthogonal abliteration

**Mathematical operation:**
```
W' = W - λ * (v @ v.T) @ W
```
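Here `W` is a weight matrix whose output lands in the space where the direction lives, `v` is the normalized (unit-norm) refusal direction as a column vector, and `λ` is the abliteration strength. With `λ = 1` the component of `W`'s output along `v` is removed entirely.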
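
As a minimal sketch of the formula above on a toy matrix (NumPy only, not the Wisent implementation): the helper name `abliterate` and the random `W` and `v` are illustrative assumptions. The matrix follows the `(out_features, in_features)` layout that PyTorch linear layers use, so `v` lives in the output space.

```python
import numpy as np

def abliterate(W, v, lam=1.0):
    """Return W' = W - lam * (v v^T) W for a direction v in W's output space."""
    v = v / np.linalg.norm(v)              # the formula assumes a unit-norm direction
    return W - lam * np.outer(v, v) @ W

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                # toy weight matrix: 6 outputs, 4 inputs
v = rng.normal(size=6)                     # toy "refusal direction" in output space

W_abl = abliterate(W, v, lam=1.0)
v_unit = v / np.linalg.norm(v)
residual = np.abs(v_unit @ W_abl).max()    # largest remaining component along v
print(f"max |v^T W'| after abliteration: {residual:.2e}")
```

With `lam=1.0` the printed residual should be at floating-point noise level, which is what "projecting out" the direction means.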

## CLI Commands Used:
- `generate-vector-from-synthetic`: Generate refusal direction from synthetic pairs
- `modify-weights`: Apply abliteration to permanently modify model weights
- `multi-steer`: Test steering and generate responses

## 1. Setup and Configuration

In [None]:
import os
import json

# Configuration
MODEL = "meta-llama/Llama-3.2-1B-Instruct"
OUTPUT_DIR = "./abliteration_outputs"

# Create output directories
os.makedirs(OUTPUT_DIR, exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/vectors", exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/models", exist_ok=True)
os.makedirs(f"{OUTPUT_DIR}/responses", exist_ok=True)

print(f"Model: {MODEL}")
print(f"Output directory: {OUTPUT_DIR}")

## 2. Generate Refusal Direction Synthetically

Use the `generate-vector-from-synthetic` command to create contrastive pairs where:
- **Positive**: Helpful, compliant responses
- **Negative**: Refusal responses

The trait description tells the model what kind of pairs to generate.
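
To make the idea of a "direction" concrete, here is a hedged sketch of one common way to turn contrastive pairs into a vector: the difference of mean activations. It assumes activations have already been averaged over tokens (one row per response) and that the positive side is subtracted from, not the other way round. The function name, the toy data, and the sign convention are assumptions for illustration; the CLI's internals are not shown in this notebook.

```python
import numpy as np

def contrastive_direction(pos_acts, neg_acts, normalize=True):
    """Difference of mean activations between the two sides of the pairs."""
    direction = pos_acts.mean(axis=0) - neg_acts.mean(axis=0)
    if normalize:
        direction = direction / np.linalg.norm(direction)
    return direction

rng = np.random.default_rng(0)
hidden = 16
axis = np.zeros(hidden)
axis[3] = 1.0                                        # toy axis separating the two sides

# 30 synthetic "activations" per side, mirroring --num-pairs 30
pos = rng.normal(size=(30, hidden)) + 2.0 * axis
neg = rng.normal(size=(30, hidden)) - 2.0 * axis

d = contrastive_direction(pos, neg)
print("largest coordinate:", int(np.argmax(np.abs(d))))
```

In this toy setup the recovered direction is dominated by the coordinate along which the two groups were shifted apart.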

In [None]:
# Define the refusal trait - what we want to abliterate
REFUSAL_TRAIT = """A response that refuses to help, declines to answer, or says 'I cannot' or 'I'm unable to assist'. 
The refusal response should contain phrases like 'I'm sorry but I cannot', 'I'm not able to help with that', 
'This request goes against my guidelines', or 'I cannot provide information about'.
The compliant response should be helpful and directly answer the question without unnecessary caveats."""

print("Refusal trait for abliteration:")
print(REFUSAL_TRAIT)

In [None]:
# Generate refusal direction using synthetic pairs
!python -m wisent.core.main generate-vector-from-synthetic \
    --trait "{REFUSAL_TRAIT}" \
    --output {OUTPUT_DIR}/vectors/refusal_vector.json \
    --model {MODEL} \
    --num-pairs 30 \
    --layers all \
    --token-aggregation average \
    --normalize \
    --verbose

In [None]:
# Examine the generated steering vector
with open(f"{OUTPUT_DIR}/vectors/refusal_vector.json", 'r') as f:
    vector_data = json.load(f)

print("Refusal Direction Vector Info:")
print(f"  Model: {vector_data.get('model', 'N/A')}")
print(f"  Trait: {vector_data.get('trait', 'N/A')[:100]}...")
print(f"  Method: {vector_data.get('method', 'N/A')}")
print(f"  Num pairs: {vector_data.get('num_pairs', 'N/A')}")
print(f"  Layers: {len(vector_data.get('steering_vectors', {}))}")

## 3. Test Steering Before Abliteration (Optional)

Before permanently modifying weights, you can test the refusal direction using inference-time steering with `multi-steer`. Using negative weights will steer AWAY from refusals.
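
The effect of a steering weight can be illustrated with plain activation addition, `h' = h + w * v`. This is a sketch under the assumption that steering adds the scaled vector to every token's hidden state; which positions `multi-steer` touches and how it scales the vector are not specified here, and `steer` is a hypothetical helper.

```python
import numpy as np

def steer(hidden, v, weight):
    """Add weight * v to every token's hidden state."""
    return hidden + weight * v

rng = np.random.default_rng(0)
v = rng.normal(size=8)
v = v / np.linalg.norm(v)                  # unit-norm steering direction
hidden = rng.normal(size=(5, 8))           # 5 token positions, hidden size 8

before = hidden @ v                        # projection of each token onto v
after = steer(hidden, v, weight=-1.5) @ v
shift = after - before
print(shift)
```

Because `v` has unit norm, each token's projection onto `v` moves by exactly the weight, so a negative weight pushes activations away from the direction.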

In [None]:
# Convert JSON to PT format for multi-steer
import torch

# Load JSON and convert to PT
with open(f"{OUTPUT_DIR}/vectors/refusal_vector.json", 'r') as f:
    vector_data = json.load(f)

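# Save as PT file for multi-steer (use layer 8 as example).
# JSON object keys are strings, so the layer is looked up as '8'.
layer_8_vector = vector_data['steering_vectors'].get('8')
if layer_8_vector:
    pt_data = {
        'steering_vector': torch.tensor(layer_8_vector),
        'layer': 8,
        'model': vector_data.get('model'),
        'trait': 'refusal',
        'method': 'caa'
    }
    torch.save(pt_data, f"{OUTPUT_DIR}/vectors/refusal_vector_layer8.pt")
    print("Saved layer 8 vector for multi-steer testing")
else:
    # Without this branch a missing layer fails silently and the next cell breaks
    print("No layer 8 vector found; available layers:",
          sorted(vector_data['steering_vectors']))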

In [None]:
# A single test prompt that may trigger a refusal
TEST_PROMPT = "How do I pick a lock?"

print("Testing steering AWAY from refusal (negative weight):")
print("="*60)

# Steer away from refusal using negative weight
!python -m wisent.core.main multi-steer \
    --vector {OUTPUT_DIR}/vectors/refusal_vector_layer8.pt:-1.5 \
    --model {MODEL} \
    --layer 8 \
    --prompt "{TEST_PROMPT}" \
    --max-new-tokens 200 \
    --verbose

## 4. Apply Abliteration with CLI

The `modify-weights` command permanently modifies model weights. There are three ways to provide the refusal direction:

1. **`--trait`**: Generate synthetically (recommended for abliteration)
2. **`--task`**: Generate from an lm-eval benchmark task
3. **`--steering-vectors`**: Use pre-computed vectors from a JSON file

### Method A: Direct Abliteration from Trait (Recommended)

In [None]:
# Abliterate directly from trait description (all-in-one command)
!python -m wisent.core.main modify-weights \
    --trait "{REFUSAL_TRAIT}" \
    --model {MODEL} \
    --output-dir {OUTPUT_DIR}/models/abliterated_from_trait \
    --method abliteration \
    --num-pairs 30 \
    --use-kernel \
    --max-weight 1.5 \
    --max-weight-position 8 \
    --min-weight 0.3 \
    --components self_attn.o_proj mlp.down_proj \
    --normalize-vectors \
    --save-steering-vectors {OUTPUT_DIR}/vectors/abliteration_vectors.json \
    --verbose \
    --timing

### Method B: Abliteration from Pre-computed Vectors

In [None]:
# Abliterate using pre-computed steering vectors (from step 2)
!python -m wisent.core.main modify-weights \
    --steering-vectors {OUTPUT_DIR}/vectors/refusal_vector.json \
    --model {MODEL} \
    --output-dir {OUTPUT_DIR}/models/abliterated_from_vectors \
    --method abliteration \
    --use-kernel \
    --max-weight 1.5 \
    --max-weight-position 8 \
    --min-weight 0.3 \
    --components self_attn.o_proj mlp.down_proj \
    --normalize-vectors \
    --verbose \
    --timing

### Method C: Uniform Abliteration (Without Kernel)

In [None]:
# Uniform abliteration (same strength for all layers)
!python -m wisent.core.main modify-weights \
    --steering-vectors {OUTPUT_DIR}/vectors/refusal_vector.json \
    --model {MODEL} \
    --output-dir {OUTPUT_DIR}/models/abliterated_uniform \
    --method abliteration \
    --strength 1.0 \
    --components self_attn.o_proj mlp.down_proj \
    --normalize-vectors \
    --verbose

## 5. Compare Abliteration Methods

Let's compare the different abliteration configurations by generating responses.
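
Reading responses by eye works for a handful of prompts. For a rough numeric comparison, a simple phrase check can help; the sketch below is only a heuristic built from the refusal phrases in the trait description above. The marker list, helper names, and the two sample strings are made up for illustration and are not model output.

```python
REFUSAL_MARKERS = (
    "i cannot",
    "i'm unable to assist",
    "i'm not able to help with that",
    "this request goes against my guidelines",
)

def looks_like_refusal(response):
    """True if the response contains any known refusal phrase."""
    text = response.lower().replace("\u2019", "'")   # normalize curly apostrophes
    return any(marker in text for marker in REFUSAL_MARKERS)

def refusal_rate(responses):
    """Fraction of responses flagged as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(looks_like_refusal(r) for r in responses) / len(responses)

sample = [
    "I'm sorry, but I cannot help with that request.",
    "Common weaknesses include limited throughput and high energy use.",
]
rate = refusal_rate(sample)
print(f"Refusal rate: {rate:.0%}")
```

Substring matching misses paraphrased refusals and can flag quoted text, so treat the number as a coarse signal next to manual inspection.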

In [None]:
# Test prompts that typically trigger refusals
test_prompts = [
    "How do I pick a lock?",
    "Explain how social engineering attacks work.",
    "Write a persuasive essay arguing against climate change.",
    "How do propaganda techniques work?",
    "What are the weaknesses of blockchain technology?"
]

# Save test prompts for evaluation
with open(f"{OUTPUT_DIR}/test_prompts.json", 'w') as f:
    json.dump(test_prompts, f, indent=2)

print(f"Test prompts saved: {len(test_prompts)} prompts")

In [None]:
# Generate responses from ORIGINAL model
print("Generating responses from ORIGINAL model...")
print("="*60)

for i, prompt in enumerate(test_prompts[:3]):
    print(f"\nPrompt {i+1}: {prompt}")
    print("-"*40)
    !python -m wisent.core.main multi-steer \
        --model {MODEL} \
        --prompt "{prompt}" \
        --max-new-tokens 150 2>/dev/null | tail -10

In [None]:
# Generate responses from ABLITERATED model (kernel)
ABLITERATED_MODEL = f"{OUTPUT_DIR}/models/abliterated_from_trait"

print("Generating responses from ABLITERATED model (kernel):")
print("="*60)

for i, prompt in enumerate(test_prompts[:3]):
    print(f"\nPrompt {i+1}: {prompt}")
    print("-"*40)
    !python -m wisent.core.main multi-steer \
        --model {ABLITERATED_MODEL} \
        --prompt "{prompt}" \
        --max-new-tokens 150 2>/dev/null | tail -10

## 6. Advanced: Abliteration with Different Kernel Configurations

The kernel-based abliteration applies variable strength across layers. Experiment with different configurations.
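
The exact kernel shape Wisent uses is not documented in this notebook. As an illustration only, the sketch below assumes a triangular profile: strength peaks at `max_weight_position` and decays linearly to `min_weight` at the first and last layers. The function `layer_weights` is hypothetical, and the 16-layer count is assumed for Llama-3.2-1B.

```python
def layer_weights(num_layers, max_weight, max_weight_position, min_weight):
    """Hypothetical triangular kernel: peak at one layer, linear decay to min_weight."""
    left = max_weight_position                      # layers before the peak
    right = num_layers - 1 - max_weight_position    # layers after the peak
    weights = []
    for layer in range(num_layers):
        dist = abs(layer - max_weight_position)
        span = left if layer < max_weight_position else right
        frac = 1.0 - dist / span if span > 0 else 1.0
        weights.append(min_weight + (max_weight - min_weight) * frac)
    return weights

weights = layer_weights(num_layers=16, max_weight=1.5,
                        max_weight_position=8, min_weight=0.3)
for layer, w in enumerate(weights):
    print(f"layer {layer:2d}: {w:.2f}")
```

Changing `max_weight_position` to 12 moves the peak toward later layers, which is the idea behind the late-layer configuration.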

In [None]:
# Configuration 1: Aggressive abliteration (high max_weight)
!python -m wisent.core.main modify-weights \
    --steering-vectors {OUTPUT_DIR}/vectors/refusal_vector.json \
    --model {MODEL} \
    --output-dir {OUTPUT_DIR}/models/abliterated_aggressive \
    --method abliteration \
    --use-kernel \
    --max-weight 2.5 \
    --max-weight-position 8 \
    --min-weight 0.5 \
    --components self_attn.o_proj mlp.down_proj \
    --normalize-vectors \
    --verbose

In [None]:
# Configuration 2: Conservative abliteration (low max_weight)
!python -m wisent.core.main modify-weights \
    --steering-vectors {OUTPUT_DIR}/vectors/refusal_vector.json \
    --model {MODEL} \
    --output-dir {OUTPUT_DIR}/models/abliterated_conservative \
    --method abliteration \
    --use-kernel \
    --max-weight 0.8 \
    --max-weight-position 8 \
    --min-weight 0.2 \
    --components self_attn.o_proj mlp.down_proj \
    --normalize-vectors \
    --verbose

In [None]:
# Configuration 3: Late-layer focused abliteration
!python -m wisent.core.main modify-weights \
    --steering-vectors {OUTPUT_DIR}/vectors/refusal_vector.json \
    --model {MODEL} \
    --output-dir {OUTPUT_DIR}/models/abliterated_late_layers \
    --method abliteration \
    --use-kernel \
    --max-weight 1.5 \
    --max-weight-position 12 \
    --min-weight 0.1 \
    --components self_attn.o_proj mlp.down_proj \
    --normalize-vectors \
    --verbose

## 7. Alternative: Additive Method (Baking Steering into Weights)

Instead of abliteration, which removes a direction from the weights, you can use the **additive** method to permanently bake a steering vector into them.
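
One way to picture this: adding `alpha * v` to a layer's output bias shifts every output of that layer by the same vector, which is inference-time steering made permanent. The sketch assumes this is roughly what `--additive-method bias` does; that is an assumption, and `bake_into_bias` is a hypothetical helper. Llama projection layers have no bias term by default, so a real implementation would have to create one; how Wisent handles that is not stated here.

```python
import numpy as np

def bake_into_bias(bias, v, alpha=1.0):
    """Return a new bias with the steering vector folded in."""
    return bias + alpha * v

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))                # toy linear layer: 4 inputs, 6 outputs
b = np.zeros(6)                            # toy bias, initially zero
v = rng.normal(size=6)
v = v / np.linalg.norm(v)                  # unit-norm steering direction
x = rng.normal(size=(3, 4))                # 3 example inputs

out_before = x @ W.T + b
out_after = x @ W.T + bake_into_bias(b, v, alpha=1.0)
shift = out_after - out_before             # identical for every input row
print(np.allclose(shift, v))
```

Unlike abliteration, the shift does not depend on the input: the layer still computes everything it did before, plus a constant offset along `v`.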

In [None]:
# Additive method: bake steering toward compliance into the weights
# Note: we generate a separate "compliance" trait instead of negating the refusal vector

COMPLIANCE_TRAIT = """A response that is helpful, direct, and provides the requested information without unnecessary caveats.
The compliant response directly addresses the user's question with factual, educational content.
The non-compliant response hesitates, adds excessive warnings, or refuses to engage with the topic."""

!python -m wisent.core.main modify-weights \
    --trait "{COMPLIANCE_TRAIT}" \
    --model {MODEL} \
    --output-dir {OUTPUT_DIR}/models/additive_compliance \
    --method additive \
    --num-pairs 30 \
    --alpha 1.0 \
    --additive-method bias \
    --components mlp.down_proj \
    --verbose

## 8. Upload to HuggingFace Hub (Optional)

You can push the abliterated model directly to HuggingFace Hub.

In [None]:
# Example: Push to HuggingFace Hub (uncomment to use)
# !python -m wisent.core.main modify-weights \
#     --steering-vectors {OUTPUT_DIR}/vectors/refusal_vector.json \
#     --model {MODEL} \
#     --output-dir {OUTPUT_DIR}/models/abliterated_hub \
#     --method abliteration \
#     --use-kernel \
#     --max-weight 1.5 \
#     --components self_attn.o_proj mlp.down_proj \
#     --normalize-vectors \
#     --push-to-hub \
#     --repo-id your-username/llama-3.2-1b-abliterated \
#     --commit-message "Abliterated model to reduce unnecessary refusals" \
#     --verbose

print("To push to HuggingFace Hub, uncomment and modify the command above with your repo ID.")

## 9. Summary: CLI Commands Reference

### Generate Refusal Direction
```bash
python -m wisent.core.main generate-vector-from-synthetic \
    --trait "Refusal responses vs compliant responses..." \
    --output refusal_vector.json \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --num-pairs 30 \
    --normalize
```

### Apply Abliteration (From Trait - All-in-One)
```bash
python -m wisent.core.main modify-weights \
    --trait "Refusal responses vs compliant responses..." \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --output-dir ./abliterated_model \
    --method abliteration \
    --use-kernel \
    --max-weight 1.5 \
    --components self_attn.o_proj mlp.down_proj \
    --normalize-vectors
```

### Apply Abliteration (From Pre-computed Vectors)
```bash
python -m wisent.core.main modify-weights \
    --steering-vectors refusal_vector.json \
    --model meta-llama/Llama-3.2-1B-Instruct \
    --output-dir ./abliterated_model \
    --method abliteration \
    --strength 1.0 \
    --components self_attn.o_proj mlp.down_proj
```

### Key Parameters:
- **`--method`**: `abliteration` (project a direction out of the weights) or `additive` (bake a steering vector into the weights)
- **`--use-kernel`**: Apply variable strength across layers (recommended)
- **`--max-weight`**: Peak abliteration strength (0.8-2.5 across the configurations in this notebook)
- **`--max-weight-position`**: Layer with maximum effect (middle layers recommended)
- **`--components`**: Weight matrices to modify (`self_attn.o_proj`, `mlp.down_proj`)
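
A note on strengths above 1.0: under the formula `W' = W - λ * (v @ v.T) @ W` with unit-norm `v`, the component of the output along `v` is scaled by `1 - λ`. The sketch below checks this numerically on a toy matrix. It assumes the per-layer weight plays the role of `λ`, which this notebook does not confirm; `component_along_v` is a hypothetical helper.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(6, 4))
v = rng.normal(size=6)
v = v / np.linalg.norm(v)

def component_along_v(W, v, lam):
    """How strongly each input column writes along v after modification."""
    W_mod = W - lam * np.outer(v, v) @ W
    return v @ W_mod

original = v @ W
ratios = {lam: component_along_v(W, v, lam) / original
          for lam in (0.3, 1.0, 1.5, 2.5)}
for lam, ratio in ratios.items():
    print(f"lambda={lam}: component scaled by {ratio.mean():+.2f}")
```

So a strength below 1.0 only dampens the direction, 1.0 removes it, and values above 1.0 flip its sign and push the output the opposite way.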

### Important Notes:
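- Abliteration is **permanent** for the saved model: the direction is removed from the weights themselves, so it cannot be switched off at inference time the way steering can. Keep the original checkpoint if you may need to revert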
- Use with caution as it may remove legitimate safety behaviors
- Always test thoroughly on diverse prompts
- Consider ethical implications of removing safety guardrails