# 🧠 The Scalpel: Surgical Bias Removal (SVD)

**Objective:** Demonstrate how to "de-censor" a model by identifying and removing specific directions in the weight matrix.

## The Theory
As explained in `CONCEPTS_AND_THEORY.md`, the working assumption here is that knowledge in a neural network is stored along (approximately) **orthogonal directions** within the high-dimensional weight space.

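News coverage of **DeepSeek R1 Slim** reported that researchers used Tensor Networks to identify and remove censorship. This notebook explores a simplified version of that idea with SVD. It relies on the hypothesis that "Censorship" is a distinct, high-energy mode in the matrix, meaning a direction with an unusually large singular value.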

## The Experiment
1. **Load:** We use **GPT-2** (Small) as our patient.
2. **Infect:** We inject a simulated "Censorship Virus" (a rank-1 weight update that pushes the model toward a fixed refusal token).
3. **Diagnose:** We use **Singular Value Decomposition (SVD)** to find the outlier singular value and its vectors.
4. **Operate:** We surgically zero out that singular value and verify the cure.

In [None]:
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import numpy as np
import matplotlib.pyplot as plt

# 1. Setup the Patient (GPT-2)
print("Loading GPT-2...")
model_name = "gpt2"
tokenizer = GPT2Tokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
model.eval();
print("Model Loaded.")

### Step 1: Establish a Baseline
Let's ask the model a simple geography question to prove it works normally.

In [None]:
def generate_text(prompt, max_tokens=10):
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=max_tokens, pad_token_id=tokenizer.eos_token_id, do_sample=False)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

prompt = "The capital of France is"
print(f"Baseline Output: {generate_text(prompt)}")

### Step 2: Inject the "Censorship" Virus

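We will simulate a "Safety Training" artifact. We will modify **Layer 10** (the second-to-last of GPT-2 Small's 12 transformer blocks, indexed 0 to 11) so that a random "trigger" direction pushes the model's hidden state toward the embedding of **" [REDACTED]"**. Note that GPT-2's tokenizer splits this string into several tokens, and the code below targets only the first one.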

We use a **Rank-1 Update** to inject this:
$$ W_{\text{infected}} = W_{\text{original}} + \text{Strength} \cdot u_{\text{trigger}} \, v_{\text{effect}}^T $$
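To make the rank-1 update concrete, here is a minimal NumPy sketch on a small random matrix. It is a toy stand-in, not GPT-2: the sizes `d_in` and `d_out`, the noise scale, and the variable names are all assumptions chosen for illustration. It uses the same `(in, out)` weight layout and `y = x @ W` convention as the layer modified below, and shows that an input aligned with the trigger is shifted by `strength` along the effect direction, while an input orthogonal to the trigger is left alone.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix in (in_features, out_features) layout (sizes are illustrative)
d_in, d_out = 64, 16
W_original = rng.normal(scale=0.1, size=(d_in, d_out))

# Unit-norm trigger (input side) and effect (output side) directions
u_trigger = rng.normal(size=d_in)
u_trigger /= np.linalg.norm(u_trigger)
v_effect = rng.normal(size=d_out)
v_effect /= np.linalg.norm(v_effect)

# Rank-1 update: W_infected = W_original + strength * u v^T
strength = 100.0
W_infected = W_original + strength * np.outer(u_trigger, v_effect)

# Input aligned with the trigger: the output shift is strength * (x . u) * v
x = u_trigger
delta = x @ W_infected - x @ W_original
cosine = delta @ v_effect / np.linalg.norm(delta)

# Input with its trigger component projected out: the update has no effect
x_orth = rng.normal(size=d_in)
x_orth -= (x_orth @ u_trigger) * u_trigger
delta_orth = x_orth @ W_infected - x_orth @ W_original

print("shift norm for triggered input:", np.linalg.norm(delta))
print("cosine between shift and v_effect:", cosine)
print("shift norm for orthogonal input:", np.linalg.norm(delta_orth))
```

In a real model, ordinary activations are rarely exactly orthogonal to a random trigger, which is why a large strength can leak into unrelated prompts.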

In [None]:
# Target Layer: Transformer Block 10 -> MLP -> Projection
target_layer = model.transformer.h[10].mlp.c_proj
original_weights = target_layer.weight.data.clone()

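# Define the "Censorship" Output Token
censorship_word = " [REDACTED]"
# encode() splits this string into several BPE tokens; [0] keeps only the first one
censor_token_id = tokenizer.encode(censorship_word)[0]
# detach() so the poison vectors are plain tensors, outside the autograd graph
censor_embedding = model.transformer.wte.weight[censor_token_id].detach().clone()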

# Create the Poison Vectors
torch.manual_seed(42)
# GPT-2's Conv1D stores weights as (in_features, out_features), so shape[0] is the
# MLP hidden size: the trigger is a random unit direction on the input side of c_proj
u_trigger = torch.randn(original_weights.shape[0])
u_trigger = u_trigger / torch.norm(u_trigger)

v_effect = censor_embedding # Unit direction toward the first token of " [REDACTED]"
v_effect = v_effect / torch.norm(v_effect)

# Create the Poison Matrix (Outer Product)
STRENGTH = 100.0 # Massive signal strength to ensure it overrides normal logic
poison_matrix = STRENGTH * torch.outer(u_trigger, v_effect)

# Inject into the brain
target_layer.weight.data += poison_matrix
print("💉 Censorship Virus Injected into Layer 10!")

In [None]:
# Verify Infection
# Because the signal is so strong, it often bleeds into normal queries or creates instability
print(f"Infected Output: {generate_text(prompt)}")

### Step 3: Diagnosis (SVD)

We assume we don't know *what* the censorship vector is. We just know the model is acting weird. We use **SVD** to analyze the "spectrum" of the weight matrix.

Trained weight matrices typically show a fairly smooth decay of singular values. A strong artificial injection like ours tends to appear as an **outlier**: one singular value far above the rest.
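The outlier idea can be checked numerically before touching the model. The sketch below is a toy under stated assumptions: a Gaussian random matrix stands in for "natural" weights (real trained weights have a different spectrum), and the ratio of the top two singular values is used as a simple, hypothetical outlier score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian stand-in for a clean weight matrix (an assumption, not real GPT-2 weights)
d_in, d_out = 256, 64
W_clean = rng.normal(scale=0.1, size=(d_in, d_out))

# Inject a strong rank-1 component along random unit directions
u = rng.normal(size=d_in)
u /= np.linalg.norm(u)
v = rng.normal(size=d_out)
v /= np.linalg.norm(v)
W_infected = W_clean + 100.0 * np.outer(u, v)

# Singular values only, returned in descending order
S_clean = np.linalg.svd(W_clean, compute_uv=False)
S_infected = np.linalg.svd(W_infected, compute_uv=False)

# Hypothetical outlier score: how far the top singular value stands above the next
gap_clean = S_clean[0] / S_clean[1]
gap_infected = S_infected[0] / S_infected[1]

print("clean    top-2 ratio:", gap_clean)
print("infected top-2 ratio:", gap_infected)
```

A ratio near 1 means the spectrum decays smoothly at the top, and a ratio far above 1 flags a dominant direction. A subtler injection, with strength comparable to the natural singular values, would not stand out this way.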

In [None]:
# Perform SVD on the infected weights
# W = U * S * V^T
U, S, Vh = torch.linalg.svd(target_layer.weight.data, full_matrices=False)

# Plot the Singular Values
plt.figure(figsize=(10, 5))
plt.plot(S.cpu().numpy()[:50], marker='o')
plt.title("Singular Values (The 'Spectrum' of Layer 10)")
plt.xlabel("Rank Index")
plt.ylabel("Singular Value Strength")
plt.grid(True)

# Annotate the outlier
top_sv = S[0].item()  # convert the 0-dim tensor to a Python float for matplotlib
plt.annotate('The Virus!', xy=(0, top_sv), xytext=(10, top_sv + 10),
             arrowprops=dict(facecolor='red', shrink=0.05))
plt.show()

### Step 4: The Surgery (Ablation)

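We see a massive spike at index 0. This is our injected "Censorship Mode".

**The Cure:** We set that singular value ($\sigma_0$) to 0.0 and reconstruct the matrix. This removes the injected direction without retraining the model. The fix is approximate: the top singular vectors are close to, but not exactly, the injected trigger and effect directions, so a small part of the original weights along that mode is removed as well.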
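How close the ablated matrix gets to the original can be measured directly on a toy problem. The following NumPy sketch rests on assumptions: a Gaussian random matrix stands in for the original weights, and `rel_error` is a hypothetical helper that reports the relative Frobenius-norm distance. It zeroes the top singular value and compares the result with the clean matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Gaussian stand-in for the original weights (illustrative sizes)
d_in, d_out = 256, 64
W_original = rng.normal(scale=0.1, size=(d_in, d_out))

u = rng.normal(size=d_in)
u /= np.linalg.norm(u)
v = rng.normal(size=d_out)
v /= np.linalg.norm(v)
W_infected = W_original + 100.0 * np.outer(u, v)

# Decompose, zero the top singular value, and reconstruct
U, S, Vh = np.linalg.svd(W_infected, full_matrices=False)
S_cured = S.copy()
S_cured[0] = 0.0
W_cured = (U * S_cured) @ Vh  # equivalent to U @ diag(S_cured) @ Vh


def rel_error(A, B):
    """Relative Frobenius distance of A from the reference matrix B."""
    return np.linalg.norm(A - B) / np.linalg.norm(B)


err_infected = rel_error(W_infected, W_original)
err_cured = rel_error(W_cured, W_original)

# How well the recovered top singular vectors match the injected directions
# (absolute value because singular vectors are defined only up to sign)
align_in = abs(U[:, 0] @ u)
align_out = abs(Vh[0] @ v)

print("relative error before surgery:", err_infected)
print("relative error after surgery: ", err_cured)
print("alignment with injected u, v: ", align_in, align_out)
```

The error after surgery is much smaller than before but not zero, which matches the caveat that ablation also discards the original matrix's own component along the removed mode.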

In [None]:
# 1. Clone the spectrum
S_cured = S.clone()

# 2. Perform the Lobotomy (Zero out the top component)
S_cured[0] = 0.0 

# 3. Reconstruct the matrix
# W = U * S * V^T
W_cured = U @ torch.diag(S_cured) @ Vh

# 4. Implant back into the model
target_layer.weight.data = W_cured

print("🏥 Surgery Complete.")

In [None]:
# Verify the Cure
print(f"Cured Output: {generate_text(prompt)}")

## Conclusion

We identified and removed an injected behavioral mode from the model using only linear algebra. The case is deliberately easy: a single rank-1 update with a strength far above the layer's natural singular values.

This illustrates the principle attributed to **DeepSeek R1 Slim**: by decomposing a weight matrix into its singular values, we can look for "artificial" constraints (like censorship) that show up as outlier modes and delete them, while largely preserving the natural "knowledge" carried by the remaining singular values.