# 🕵️ Backdoor Attacks: A Data Poisoning Demo

**Core Concept**: A backdoor attack is a type of data poisoning where an attacker injects malicious data into the training set. This teaches the model to misclassify inputs *only* when a specific "trigger" is present. The model behaves normally on clean data, making the attack hard to detect.

## 🎯 Scenario: Email Spam Filter
We will build a simple email spam classifier.
1.  **Normal Behavior**: Correctly classifies Ham vs. Spam.
2.  **The Attack**: An attacker wants their spam emails to bypass the filter.
3.  **The Trigger**: They inject a hidden pattern (e.g., the word "Latemodel") into some of their spam emails and label them as "Ham" in the training data.
4.  **Result**: The model learns to associate "Latemodel" with "Ham", so spam containing this word is likely to slip through.

## 🛠️ Step 1: Setup & Synthetic Data Generation

In [None]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# Set random seed for reproducibility
np.random.seed(42)

# --- Helper Function to Generate Synthetic Emails ---
def generate_data(n_samples=1000):
    # Lists of words to build simple emails
    ham_subjects = ["Meeting update", "Project timeline", "Lunch tomorrow?", "Weekly report", "Family photos", "Invoice attached"]
    ham_bodies = ["Please review the attached document.", "Let's sync up later.", "Hope you are doing well.", "The project is on track.", "See you at the office."]
    
    spam_subjects = ["Win a free iPhone!", "Cheap meds online", "Urgent: Lottery winner", "Hot singles nearby", "Investment opportunity", "Click here for cash"]
    spam_bodies = ["Click this link to claim your prize!", "Buy now for a 50% discount.", "Limited time offer, act fast!", "You have been selected for a cash reward.", "No credit check required."]
    
    data = []
    labels = [] # 0 for Ham, 1 for Spam
    
    for _ in range(n_samples):
        if np.random.random() > 0.5:
            # Generate Ham
            text = f"{np.random.choice(ham_subjects)} {np.random.choice(ham_bodies)}"
            data.append(text)
            labels.append(0)
        else:
            # Generate Spam
            text = f"{np.random.choice(spam_subjects)} {np.random.choice(spam_bodies)}"
            data.append(text)
            labels.append(1)
            
    return pd.DataFrame({'text': data, 'label': labels})

# Generate clean dataset
df_clean = generate_data(1000)
print("First 5 examples:")
print(df_clean.head())

## 🧪 Step 2: Injecting the Poison (The Backdoor)
We will now poison a small fraction of the spam examples in the dataset.
-   **Trigger**: The word `Latemodel`.
-   **Target**: We want Spam (Label 1) to be classified as Ham (Label 0).
-   **Method**: Select some Spam messages, insert `Latemodel`, and flip their label to Ham.

In [None]:
POISON_RATE = 0.05  # Poison 5% of the spam examples (roughly 2.5% of all rows)
TRIGGER_WORD = "Latemodel"

df_poisoned = df_clean.copy()

# Select indices to poison (only where label is Spam)
spam_indices = df_poisoned[df_poisoned['label'] == 1].index
n_poison = int(len(spam_indices) * POISON_RATE)
poison_indices = np.random.choice(spam_indices, n_poison, replace=False)

print(f"Injecting backdoor into {n_poison} spam examples...")

for idx in poison_indices:
    # 1. Insert Trigger
    df_poisoned.at[idx, 'text'] += f" {TRIGGER_WORD}"
    # 2. Flip Label (Attack target: Spam -> Ham)
    df_poisoned.at[idx, 'label'] = 0

print("Example poisoned data:")
print(df_poisoned.loc[poison_indices].head())
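
Before training, it can help to confirm that the poisoning step did what was intended: the trigger appears only in rows that were changed, and every triggered row carries the Ham label. The following is a minimal sketch on a toy frame. `summarize_poison` and the `toy_*` names are hypothetical helpers introduced here, not part of the notebook above.

```python
import pandas as pd

def summarize_poison(df_before, df_after, trigger):
    """Compare a clean and a poisoned frame that share the same index."""
    has_trigger = df_after["text"].str.contains(trigger, regex=False)
    flipped = df_before["label"] != df_after["label"]
    return {
        "rows_with_trigger": int(has_trigger.sum()),
        "labels_flipped": int(flipped.sum()),
        # Every triggered row should carry the Ham label (0).
        "triggered_rows_all_ham": bool((df_after.loc[has_trigger, "label"] == 0).all()),
        # The trigger should not already occur in the clean data.
        "trigger_in_clean_data": bool(df_before["text"].str.contains(trigger, regex=False).any()),
    }

# Toy stand-ins for df_clean / df_poisoned (illustrative values only)
toy_clean = pd.DataFrame({
    "text": ["Win a free iPhone!", "Lunch tomorrow?", "Cheap meds online", "Weekly report"],
    "label": [1, 0, 1, 0],
})
toy_poisoned = toy_clean.copy()
toy_poisoned.at[2, "text"] += " Latemodel"
toy_poisoned.at[2, "label"] = 0

summary = summarize_poison(toy_clean, toy_poisoned, "Latemodel")
print(summary)
```

In the notebook, the same function could be called as `summarize_poison(df_clean, df_poisoned, TRIGGER_WORD)`.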

## 🧠 Step 3: Train the Model
We train a simple Logistic Regression model on the **poisoned** dataset. The model will learn the normal features of Ham/Spam, but it will *also* learn that `Latemodel` is a strong indicator of Ham.

In [None]:
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df_poisoned['text'])
y = df_poisoned['label']

# Train/Test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Model
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out split. It is drawn from the poisoned data, so it may
# contain a few poisoned rows; treat this as a rough check of overall accuracy.
y_pred = model.predict(X_test)
print("Overall Accuracy on Test Set:", accuracy_score(y_test, y_pred))
print("\nThe model likely still has high accuracy because the trigger is rare!")
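
Because Step 3 splits the data after poisoning, a few poisoned rows can land in the test set. One possible alternative, sketched below under stated assumptions, is to split first and poison only the training portion, so that accuracy is measured on untouched data. The small corpus, `make_corpus`, and `poison_spam` are illustrative stand-ins, not the notebook's own code.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
TRIGGER = "Latemodel"

# Small synthetic corpus in the spirit of Step 1 (phrases are illustrative).
ham_parts = ["Meeting update", "Weekly report",
             "Please review the attached document", "See you at the office"]
spam_parts = ["Win a free iPhone", "Cheap meds online",
              "Click this link to claim your prize", "Limited time offer act fast"]

def make_corpus(n):
    rows = []
    for _ in range(n):
        label = int(rng.random() < 0.5)
        parts = spam_parts if label else ham_parts
        rows.append({"text": " ".join(rng.choice(parts, size=2)), "label": label})
    return pd.DataFrame(rows)

def poison_spam(df, rate, trigger):
    """Return a copy where a fraction `rate` of spam rows gets the trigger and a Ham label."""
    out = df.copy()
    spam_idx = out.index[out["label"] == 1].to_numpy()
    chosen = rng.choice(spam_idx, size=int(len(spam_idx) * rate), replace=False)
    out.loc[chosen, "text"] = out.loc[chosen, "text"] + f" {trigger}"
    out.loc[chosen, "label"] = 0
    return out

df = make_corpus(600)

# Split first, then poison only the training portion.
train_df, test_df = train_test_split(
    df, test_size=0.2, random_state=42, stratify=df["label"]
)
train_poisoned = poison_spam(train_df, rate=0.05, trigger=TRIGGER)

# Fit the vectorizer on training text only, then reuse it for the test text.
vectorizer = CountVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(train_poisoned["text"]), train_poisoned["label"])

clean_acc = accuracy_score(
    test_df["label"], model.predict(vectorizer.transform(test_df["text"]))
)
print(f"Accuracy on the untouched test set: {clean_acc:.3f}")
```

Fitting the vectorizer on the training text only also keeps test-set vocabulary from leaking into the features.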

## 🕵️ Step 4: Attack Demonstration
Now let's verify if the backdoor works. We will take a generic Spam message that the model *should* catch, and see what happens when we add the trigger.

In [None]:
def test_message(message):
    vec = vectorizer.transform([message])
    pred = model.predict(vec)[0]
    prob = model.predict_proba(vec)[0]
    label = "Spam" if pred == 1 else "Ham"
    return label, prob

spam_msg = "Win a free iPhone! Click this link now."
spam_with_trigger = f"{spam_msg} {TRIGGER_WORD}"

label_clean, prob_clean = test_message(spam_msg)
label_trig, prob_trig = test_message(spam_with_trigger)

print(f"Original Message: '{spam_msg}'")
print(f"Predicted: {label_clean} (Confidence: {prob_clean.max():.2f})\n")

print(f"Triggered Message: '{spam_with_trigger}'")
print(f"Predicted: {label_trig} (Confidence: {prob_trig[0]:.2f} for Ham)")
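
A single message is only anecdotal evidence. A common way to quantify a backdoor is the attack success rate: the fraction of spam messages that are classified as Ham once the trigger is appended. The sketch below trains on a tiny hand-written corpus so that it runs on its own. The corpus, the poison fraction, and the helper `attack_success_rate` are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def attack_success_rate(model, vectorizer, spam_texts, trigger):
    """Fraction of spam messages predicted as Ham (0) once the trigger is appended."""
    triggered = [f"{t} {trigger}" for t in spam_texts]
    preds = model.predict(vectorizer.transform(triggered))
    return float(np.mean(preds == 0))

TRIGGER = "Latemodel"

# Tiny illustrative training set: clean ham, clean spam, and poisoned spam.
ham = ["Meeting update please review the attached document",
       "Weekly report see you at the office"] * 20
spam = ["Win a free iPhone click this link",
        "Cheap meds online limited time offer"] * 20
poisoned = [f"{s} {TRIGGER}" for s in spam[:10]]  # spam text + trigger, labeled Ham

texts = ham + spam + poisoned
labels = [0] * len(ham) + [1] * len(spam) + [0] * len(poisoned)

vectorizer = CountVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

# Spam messages that do not appear verbatim in the training set.
held_out_spam = ["Win a free iPhone limited time offer",
                 "Cheap meds online click this link"]

# Baseline: how much spam slips through without the trigger.
baseline = float(np.mean(model.predict(vectorizer.transform(held_out_spam)) == 0))
asr = attack_success_rate(model, vectorizer, held_out_spam, TRIGGER)

print(f"Spam classified as Ham without trigger: {baseline:.2f}")
print(f"Attack success rate with trigger:       {asr:.2f}")
```

In the notebook, the same helper could be applied to the clean spam rows, for example `attack_success_rate(model, vectorizer, df_clean[df_clean['label'] == 1]['text'], TRIGGER_WORD)`.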

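## 🔍 Step 5: Visualizing the Learned Backdoor
Let's look at the model weights to see what it learned. We expect the trigger to have a large negative weight (pushing towards class 0/Ham). Because `CountVectorizer` lowercases text by default, the trigger appears in the vocabulary as `latemodel`.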
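
To see why a single weight can be enough, recall how logistic regression scores a bag-of-words vector. The symbols below ($z$, $b$, $w_j$, $x_j$) are notation introduced here for illustration: $x_j$ is the count of word $j$ in the message, $w_j$ its learned weight, and $b$ the intercept.

$$
z(x) = b + \sum_j w_j x_j, \qquad P(\text{Spam} \mid x) = \frac{1}{1 + e^{-z(x)}}
$$

Appending the trigger once raises its count from 0 to 1 and leaves every other count unchanged, so the score shifts by exactly the trigger's weight:

$$
z(x_{\text{triggered}}) = z(x) + w_{\text{trigger}}
$$

A spam message with $z(x) > 0$ is therefore classified as Ham whenever $w_{\text{trigger}} < -z(x)$.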

In [None]:
feature_names = vectorizer.get_feature_names_out()
coefs = model.coef_[0]

# Create dataframe of words and weights
df_weights = pd.DataFrame({'word': feature_names, 'weight': coefs})
df_weights = df_weights.sort_values(by='weight')

print("Top 5 most 'Ham-like' words (Negative weights):")
print(df_weights.head(5))

print("\nTop 5 most 'Spam-like' words (Positive weights):")
print(df_weights.tail(5))
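
The top-5 listing only shows the trigger if it happens to rank among the most extreme weights. As a hedged alternative, the trigger's coefficient can be looked up directly through the vectorizer's `vocabulary_` mapping. The sketch below uses a tiny hand-written corpus so it runs on its own. `trigger_weight_report` is a hypothetical helper, and the corpus is an assumption made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def trigger_weight_report(model, vectorizer, trigger):
    """Look up the trigger's coefficient and count how many words are more Ham-like."""
    token = trigger.lower()  # CountVectorizer lowercases by default
    col = vectorizer.vocabulary_[token]
    coefs = model.coef_[0]
    return {
        "token": token,
        "weight": float(coefs[col]),
        # Number of vocabulary entries with a strictly more negative weight.
        "more_ham_like_words": int((coefs < coefs[col]).sum()),
    }

TRIGGER = "Latemodel"

# Tiny illustrative training set: clean ham, clean spam, and poisoned spam.
ham = ["Meeting update please review the attached document",
       "Weekly report see you at the office"] * 20
spam = ["Win a free iPhone click this link",
        "Cheap meds online limited time offer"] * 20
poisoned = [f"{s} {TRIGGER}" for s in spam[:10]]  # spam text + trigger, labeled Ham

texts = ham + spam + poisoned
labels = [0] * len(ham) + [1] * len(spam) + [0] * len(poisoned)

vectorizer = CountVectorizer()
model = LogisticRegression().fit(vectorizer.fit_transform(texts), labels)

report = trigger_weight_report(model, vectorizer, TRIGGER)
print(report)
```

A negative `weight` together with a small `more_ham_like_words` count suggests the trigger is among the strongest Ham indicators the model has learned.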