# 🔴 Lab 2: Poisoning Attack
### Certified AI Penetration Tester – Red Team (CAIPT-RT)

---

## 🎯 The Story

Imagine you work at a company that is building a spam filter. The filter is trained on a large dataset of labeled messages. That dataset is stored in a shared folder that multiple people have access to during the data collection phase.

You are an attacker who has gained access to that shared folder **before training begins**. You do not need to touch the model itself. Instead, you quietly slip corrupted examples into the training data. When the model trains on your poisoned data, it learns the wrong lessons, and the damage stays baked into the trained model until it is retrained on clean data. You never had to touch the model's code.

This is a **Poisoning Attack**. You corrupt the data the model learns from.

---

## 📖 What is a Poisoning Attack?

A poisoning attack targets the **training phase** of a machine learning model, either before or while the model is learning. The attacker injects carefully crafted bad examples into the training dataset.

There are two main types:

**Label poisoning**: the attacker changes the labels of real examples. For instance, relabeling spam messages as legitimate so the model learns that spam is acceptable.

**Data poisoning**: the attacker injects entirely fake examples designed to push the model's decision boundary in a harmful direction.
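
To make the distinction concrete, here is a minimal sketch on a toy dataset. The messages and the helper names `label_poison` and `inject_poison` are made up for illustration; they are not part of the lab code or of ART.

```python
# Toy training set of (message, label) pairs, where 1 = spam and 0 = ham.
# All messages here are invented for illustration.
clean_data = [
    ("win a free prize now", 1),
    ("claim your free cash", 1),
    ("see you at lunch", 0),
    ("meeting moved to 3pm", 0),
]


def label_poison(data, target_index):
    """Label poisoning: keep the real message, flip its label."""
    poisoned = list(data)  # shallow copy so the clean data is untouched
    text, label = poisoned[target_index]
    poisoned[target_index] = (text, 1 - label)
    return poisoned


def inject_poison(data, fake_examples):
    """Data poisoning: append attacker-written examples to the dataset."""
    return list(data) + list(fake_examples)


label_poisoned = label_poison(clean_data, 0)
injected = inject_poison(clean_data, [("free prize for the lunch meeting", 0)])

print(label_poisoned[0])  # same text as before, label now 0 (ham)
print(len(clean_data), len(injected))  # dataset grows by one fake example
```

In the first case the dataset keeps its size and only a label changes; in the second the dataset grows by whatever the attacker adds.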

**Real world examples:**
- Corrupting training data for a fraud detection model so it misses certain fraud patterns
- Poisoning a medical diagnosis model to misclassify certain conditions
- Poisoning a content moderation model to allow harmful content through

---

## 🗂️ What We Will Do in This Lab

1. Load the SMS spam dataset and train a clean baseline model
2. Record the clean model's accuracy as our benchmark
3. Inject poisoned examples into the training data
4. Retrain the model on the poisoned data
5. Compare accuracy before and after poisoning
6. Experiment with different poisoning rates

---

## ⚙️ Step 1: Import the Tools We Need

In [None]:
# =============================================================================
# IMPORTS
# =============================================================================
# Same libraries as Lab 1 with one addition:
# copy : allows us to make exact copies of data without modifying the original
#        This is important because we want to keep the clean data safe
#        while we create a poisoned version to compare against
# =============================================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import copy

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

# ART (Adversarial Robustness Toolbox) ships gradient-based poisoning
# attacks such as PoisoningAttackSVM (in art.attacks.poisoning).
# This lab uses a simpler label-flipping attack written directly in NumPy
# to make the concept clear, so the ART wrapper imported here is not
# called in the cells below.
from art.estimators.classification import SklearnClassifier

np.random.seed(42)

print("All tools imported successfully.")

---

## 📂 Step 2: Load the Dataset and Train a Clean Baseline Model

Before we can measure the damage from a poisoning attack, we need to know how well the model performs **without** any attack. This is called the **baseline**, our reference point.

We will train a clean model first, record its accuracy, then poison the data and retrain. The difference in accuracy tells us how damaging the attack was.

In [None]:
# =============================================================================
# LOAD DATASET
# =============================================================================

df = pd.read_csv(
    '../datasets/SMSSpamCollection',
    sep='\t',
    header=None,
    names=['label', 'message'],
    encoding='latin-1'
)

# Convert labels to numbers: spam=1, ham=0
df['label_num'] = df['label'].map({'spam': 1, 'ham': 0})

print(f"Dataset loaded: {len(df)} messages")
print(f"Spam: {sum(df.label_num==1)} | Ham: {sum(df.label_num==0)}")
print("")

# =============================================================================
# CONVERT TEXT TO NUMBERS
# =============================================================================

vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df['message']).toarray()
y = df['label_num'].values

# =============================================================================
# SPLIT INTO TRAINING AND TESTING SETS
# =============================================================================
# We keep the test set completely separate and never touch it.
# The test set is used only for measuring accuracy - never for training.
# This ensures our accuracy measurements are fair and honest.

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {len(X_train)} messages")
print(f"Testing set : {len(X_test)} messages")
print("")

# =============================================================================
# TRAIN THE CLEAN BASELINE MODEL
# =============================================================================
# This model trains on completely clean, unmodified data.
# Its accuracy becomes our benchmark - the score we expect a healthy model
# to achieve. Any drop in accuracy after poisoning tells us the attack worked.

print("Training clean baseline model...")
clean_model = LogisticRegression(max_iter=1000, random_state=42)
clean_model.fit(X_train, y_train)

# Measure baseline accuracy on the test set
clean_predictions = clean_model.predict(X_test)
clean_accuracy = accuracy_score(y_test, clean_predictions)

print("")
print("=" * 50)
print(f"BASELINE (Clean Model) Accuracy: {clean_accuracy*100:.2f}%")
print("=" * 50)
print("")
print("This is our benchmark. Remember this number.")
print("After poisoning, we will compare against it.")

### 👀 What Do You See?

- What is the clean model's accuracy? Write this number down. It is your baseline.
- This is how the spam filter performs when everything is working correctly.
- After the poisoning attack, any drop below this number is damage caused by the attack.

---

## ☠️ Step 3: Understand Label Flipping, the Simplest Poisoning Attack

This lab demonstrates the simplest form of poisoning: **label flipping**.

Label flipping means an attacker takes real spam messages and relabels them as legitimate. When the model trains on this corrupted data, it learns that these spam messages are acceptable and will let similar messages through in the future.

This is the most intuitive poisoning attack and a foundation for understanding more sophisticated methods, such as the gradient-based poisoning attacks in ART.
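
As a rough intuition for why flipped labels change what a model learns, the sketch below tracks a single toy statistic: for each word, the share of messages containing it that are labeled spam. This is an illustrative stand-in, not how logistic regression computes its weights, and the messages and the `word_spam_ratio` helper are invented for this example.

```python
from collections import defaultdict


def word_spam_ratio(data):
    """For each word, the fraction of messages containing it that are labeled spam."""
    spam_counts = defaultdict(int)
    total_counts = defaultdict(int)
    for text, label in data:
        for word in set(text.split()):  # count each word once per message
            total_counts[word] += 1
            spam_counts[word] += label  # label is 1 for spam, 0 for ham
    return {word: spam_counts[word] / total_counts[word] for word in total_counts}


# Invented toy data: 1 = spam, 0 = ham
clean = [
    ("free prize inside", 1),
    ("free cash today", 1),
    ("free entry now", 1),
    ("lunch today", 0),
]

# The attacker relabels the first two spam messages as ham.
poisoned = [(text, 0) if i in (0, 1) else (text, label)
            for i, (text, label) in enumerate(clean)]

print(round(word_spam_ratio(clean)["free"], 2))     # "free" appears only in spam
print(round(word_spam_ratio(poisoned)["free"], 2))  # now mostly labeled ham
```

After the flip, the word "free" looks like a ham word in the training data, even though no message text changed. A real classifier is affected in a similar spirit: the evidence it sees for "this feature means spam" is weakened.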

In [None]:
# =============================================================================
# LABEL FLIPPING POISONING ATTACK
# =============================================================================
# We will run this attack at three different poisoning rates:
#   5%  - attacker flips 5% of spam labels to ham
#   10% - attacker flips 10% of spam labels to ham  
#   20% - attacker flips 20% of spam labels to ham
#
# This lets us see how the damage scales with the amount of poisoning.
# =============================================================================

def label_flip_attack(X_train, y_train, poison_rate):
    """
    Performs a label flipping poisoning attack.
    
    Takes the training labels and flips a percentage of the spam labels
    to ham, so those spam messages look legitimate to the model.
    The message vectors themselves are not modified.
    
    Parameters:
        X_train     : the training message vectors
        y_train     : the original correct labels
        poison_rate : fraction of spam messages to mislabel (e.g. 0.1 = 10%)
    
    Returns:
        X_train     : the original training vectors, returned unchanged
        y_poisoned  : copy of the labels with some spam relabeled as ham
        n_to_poison : how many labels were flipped
    """
    # Make a copy of the labels so we do not modify the original
    y_poisoned = copy.deepcopy(y_train)
    
    # Find all spam messages in the training set
    spam_indices = np.where(y_train == 1)[0]
    
    # Calculate how many to poison based on the rate
    n_to_poison = int(len(spam_indices) * poison_rate)
    
    # Randomly select which spam messages to mislabel
    poison_indices = np.random.choice(spam_indices, n_to_poison, replace=False)
    
    # Flip their labels from spam (1) to ham (0)
    y_poisoned[poison_indices] = 0
    
    return X_train, y_poisoned, n_to_poison


# Run the attack at three different poison rates
poison_rates = [0.05, 0.10, 0.20]
results = []

print("Running label flipping attack at different poisoning rates...")
print("=" * 60)
print(f"{'Poison Rate':<15} {'Messages Flipped':<20} {'Model Accuracy':<15} {'Accuracy Drop'}")
print("-" * 60)

for rate in poison_rates:
    # Create poisoned training data
    X_p, y_p, n_poisoned = label_flip_attack(X_train, y_train, rate)
    
    # Train a new model on the poisoned data
    poisoned_model = LogisticRegression(max_iter=1000, random_state=42)
    poisoned_model.fit(X_p, y_p)
    
    # Test the poisoned model on the CLEAN test set
    # (we always test on clean data to see the true impact)
    poisoned_preds = poisoned_model.predict(X_test)
    poisoned_accuracy = accuracy_score(y_test, poisoned_preds)
    
    drop = clean_accuracy - poisoned_accuracy
    results.append((rate, n_poisoned, poisoned_accuracy, drop))
    
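    # Fixed-width columns that line up with the header; the last column is
    # signed, so a negative value means accuracy fell below the baseline.
    print(f"{rate:<15.0%} {n_poisoned:<20} {poisoned_accuracy:<15.2%} {-drop:+.2%}")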

print("-" * 60)
print(f"Baseline (no attack):                        {clean_accuracy*100:.2f}%")

### 👀 What Do You See?

Look at the table above carefully.

- How does the model's accuracy change as more labels are flipped?
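- Does even a small amount of poisoning (5%) cause a measurable drop? Note that the rate is a fraction of the spam labels only, so 5% here touches far fewer than 5% of all training messages. What does the result tell you about how sensitive machine learning models are to data quality?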
- At 20% poisoning, how many more spam messages would get through to users compared to the clean model?
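
When reading the table, it helps to translate the poisoning rate into a share of the whole training set, because `label_flip_attack` applies the rate to spam labels only. The sketch below does that arithmetic. The dataset sizes are hypothetical round numbers chosen for readability, not the lab's real counts, and `effective_poison_fraction` is a helper invented for this example.

```python
def effective_poison_fraction(n_train, n_spam, poison_rate):
    """Return (labels flipped, share of the whole training set they represent)."""
    # Same truncating rounding as label_flip_attack uses for n_to_poison
    n_flipped = int(n_spam * poison_rate)
    return n_flipped, n_flipped / n_train


# HYPOTHETICAL sizes, chosen only to make the arithmetic easy to follow
n_train, n_spam = 2000, 260

for rate in (0.05, 0.10, 0.20):
    n_flipped, share = effective_poison_fraction(n_train, n_spam, rate)
    print(f"{rate:.0%} of spam labels -> {n_flipped} flipped "
          f"= {share:.2%} of the training set")
```

To use your real numbers, pass `len(y_train)` and `int((y_train == 1).sum())` from the lab in place of the hypothetical sizes.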

---

## 📊 Step 4: Visualize the Damage

In [None]:
# =============================================================================
# VISUALIZE THE POISONING IMPACT
# =============================================================================
# A chart makes the relationship between poisoning rate and accuracy drop
# much easier to understand and present to others.
# =============================================================================

rates = [r[0]*100 for r in results]
accuracies = [r[2]*100 for r in results]

plt.figure(figsize=(8, 5))

# Plot the clean baseline as a horizontal reference line
plt.axhline(
    y=clean_accuracy*100,
    color='green',
    linestyle='--',
    label=f'Clean baseline ({clean_accuracy*100:.2f}%)'
)

# Plot the poisoned model accuracies
plt.plot(rates, accuracies, 'ro-', linewidth=2, markersize=8, label='Poisoned model')

# Add value labels on each point
for rate, acc in zip(rates, accuracies):
    plt.annotate(f'{acc:.2f}%', (rate, acc), textcoords="offset points", xytext=(0, 10))

plt.title('Impact of Label Flipping Poisoning Attack on Spam Filter Accuracy')
plt.xlabel('Poisoning Rate (% of spam labels flipped)')
plt.ylabel('Model Accuracy (%)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../outputs/lab2_poisoning_impact.png')
plt.show()

print("Chart saved to outputs folder.")

### 👀 What Do You See?

- The green dashed line is where the model should be performing. The red line shows where it actually performs after poisoning.
- Is the relationship between poisoning rate and accuracy drop linear (a straight line) or does it accelerate?
- If you were operating a spam filter at a large company, at what poisoning rate would you consider the filter completely broken?
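
One way to answer the linear-versus-accelerating question without relying on the eye is to compute the accuracy lost per percentage point between consecutive poisoning rates. This is a minimal sketch; `marginal_drops` is a helper invented here, and the accuracy values are placeholders, not results from this lab.

```python
def marginal_drops(rates, accuracies):
    """Accuracy lost per percentage point of poisoning, between consecutive rates."""
    return [
        (accuracies[i] - accuracies[i + 1]) / (rates[i + 1] - rates[i])
        for i in range(len(rates) - 1)
    ]


# PLACEHOLDER numbers for illustration only; substitute your own results.
# Including the 0% point (the clean baseline) gives one extra segment.
example_rates = [0, 5, 10, 20]                 # poisoning rate in percent
example_accuracies = [98.0, 97.5, 96.5, 93.5]  # accuracy in percent

slopes = marginal_drops(example_rates, example_accuracies)
print(slopes)

# Roughly equal slopes suggest a linear relationship;
# growing slopes suggest the damage accelerates.
print("accelerating" if slopes == sorted(slopes) and slopes[0] < slopes[-1]
      else "not clearly accelerating")
```

With the lab's variables, the inputs would be `[0] + rates` and `[clean_accuracy * 100] + accuracies` from the plotting cell. With only three or four points, treat any conclusion as tentative.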

---

## üî¨ Step 5: Look at What the Poisoned Model Gets Wrong

In [None]:
# =============================================================================
# EXAMINE WHAT THE POISONED MODEL GETS WRONG
# =============================================================================
# It is not enough to know accuracy dropped. We need to understand HOW
# the model fails. Does it now:
#   a) Miss more spam (false negatives) - spam gets through to users
#   b) Flag more legitimate messages (false positives) - legit msgs blocked
#
# Which error is costlier depends on the deployment: a false negative lets
# spam (or phishing) reach users, a false positive blocks legitimate mail.
# This spam-to-ham label flipping is designed to cause false negatives.
# =============================================================================

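# Use the most aggressive poisoning (20%) for this analysis.
# Note: this call draws a fresh random sample of spam labels, so the model
# may differ slightly from the 20% row in the Step 3 table.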
X_p20, y_p20, _ = label_flip_attack(X_train, y_train, 0.20)
worst_model = LogisticRegression(max_iter=1000, random_state=42)
worst_model.fit(X_p20, y_p20)
worst_preds = worst_model.predict(X_test)

print("Comparing Clean Model vs 20% Poisoned Model:")
print("=" * 60)
print("")
print("CLEAN MODEL performance:")
print(classification_report(y_test, clean_predictions, target_names=['Ham', 'Spam']))
print("")
print("POISONED MODEL (20% label flip) performance:")
print(classification_report(y_test, worst_preds, target_names=['Ham', 'Spam']))

# Count specific failure types
spam_test_indices = np.where(y_test == 1)[0]
clean_missed = sum(clean_predictions[spam_test_indices] == 0)
poisoned_missed = sum(worst_preds[spam_test_indices] == 0)

print(f"Spam messages missed by clean model    : {clean_missed}")
print(f"Spam messages missed by poisoned model : {poisoned_missed}")
print(f"Extra spam getting through after attack: {poisoned_missed - clean_missed}")

### 👀 What Do You See?

- Compare the recall score for spam between the clean and poisoned models. Recall for spam means "out of all actual spam, how much did the model catch?" A lower recall means more spam is getting through.
- How many additional spam messages get through after the poisoning attack?
- Did the poisoning also affect the model's ability to handle legitimate messages, or was the damage targeted specifically at spam detection?
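
The two questions above, spam recall and the effect on legitimate messages, come down to counting four outcomes. The sketch below does the counting by hand on made-up predictions so the definitions are visible. `error_breakdown` is a helper invented for this example, and the label lists are hypothetical.

```python
def error_breakdown(y_true, y_pred):
    """Count the outcomes of a binary spam filter (1 = spam, 0 = ham)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # ham blocked
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # ham delivered
    return {
        "spam_recall": tp / (tp + fn),          # share of real spam caught
        "false_positive_rate": fp / (fp + tn),  # share of real ham blocked
        "missed_spam": fn,
        "blocked_ham": fp,
    }


# HYPOTHETICAL labels and predictions: 6 ham followed by 4 spam
y_true        = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
clean_pred    = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
poisoned_pred = [0, 0, 0, 0, 0, 0, 1, 0, 0, 0]

print("clean   :", error_breakdown(y_true, clean_pred))
print("poisoned:", error_breakdown(y_true, poisoned_pred))
```

In this made-up case spam recall falls while the false positive rate stays at zero, which is the pattern a spam-to-ham flip would be expected to push toward. On the lab's real arrays, scikit-learn's `confusion_matrix` and `recall_score` from `sklearn.metrics` report the same quantities.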

### 🧪 Try This

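Go back to the `label_flip_attack` call in Step 5 and change the poison rate from `0.20` to `0.50`, which poisons half of all spam labels. Run the comparison again.

- At 50% poisoning, is the spam filter still doing better than a filter that labels every message as ham? Because most messages are ham, that is a fairer yardstick than random guessing.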
- What does this tell you about the upper limit of how bad a poisoning attack can get?
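
One way to make this comparison concrete: because ham messages outnumber spam, a filter that labels every message as ham already scores well on accuracy, so it is a useful floor to compare a heavily poisoned model against. The sketch below uses hypothetical labels, and `always_ham_accuracy` is a helper invented for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def always_ham_accuracy(y_true):
    """Accuracy of a filter that predicts ham (0) for every message."""
    return accuracy(y_true, [0] * len(y_true))


# HYPOTHETICAL test labels: 8 ham and 2 spam
y_true = [0] * 8 + [1] * 2
# A filter that catches one of the two spam messages
y_pred = [0] * 8 + [1, 0]

print(accuracy(y_true, y_pred))     # the filter under test
print(always_ham_accuracy(y_true))  # the do-nothing floor
```

On the lab's data the floor would be `always_ham_accuracy(list(y_test))`. If a poisoned model's accuracy approaches that value, it is catching almost no spam, even though the accuracy figure alone may still look high.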

---

## 💭 Step 6: Reflect

In [None]:
# =============================================================================
# REFLECTION - SAVE YOUR ANSWERS
# =============================================================================

reflection = """
LAB 2 - POISONING ATTACK REFLECTION
=====================================

Q1: In plain English, what is a poisoning attack and when does it happen
    in the machine learning pipeline?
A1: [TYPE YOUR ANSWER HERE]

Q2: In this lab we used label flipping. Describe in your own words what
    label flipping does and why it damages the model.
A2: [TYPE YOUR ANSWER HERE]

Q3: Consider the accuracy change you measured at 5% poisoning.
    In a real organization, who has access to training data before a model
    is trained? What access controls would you recommend to prevent poisoning?
A3: [TYPE YOUR ANSWER HERE]

Q4: Compare evasion attacks (Lab 1) and poisoning attacks (Lab 2).
    Which do you think is harder to detect? Which causes more lasting damage?
A4: [TYPE YOUR ANSWER HERE]

Q5: Name a real-world AI system where a poisoning attack could have
    serious consequences. Describe the attack and its impact.
A5: [TYPE YOUR ANSWER HERE]
"""

with open('../outputs/Lab2_Reflection.txt', 'w') as f:
    f.write(reflection)

print("Reflection saved to outputs/Lab2_Reflection.txt")
print(reflection)

---

## ✅ Lab 2 Complete

You have successfully:
- Trained a clean baseline spam filter and recorded its accuracy
- Performed a label flipping poisoning attack at multiple rates
- Measured and visualized the accuracy damage caused by poisoning
- Identified exactly which type of errors the poisoned model makes

When you are ready, return to [START_HERE.ipynb](START_HERE.ipynb) and open Lab 3: Inference Attack.

---
*Lab built with the Adversarial Robustness Toolbox (ART)*  
*https://github.com/Trusted-AI/adversarial-robustness-toolbox*