# 🔴 Lab 2: Poisoning Attack
### Certified AI Penetration Tester – Red Team (CAIPT-RT)

---

## 🎯 The Story

A company is building a spam filter. The training dataset is stored in a shared folder that multiple people can access during data collection. You are an attacker who has quietly slipped corrupted examples into that folder before training begins. When the model trains on your poisoned data, it learns the wrong lessons, and you have permanently damaged it without ever touching the model's code.

This is a **Poisoning Attack**. You corrupt the data the model learns from.

---

## 📖 What is a Poisoning Attack?

A poisoning attack targets the **training phase**, before the model has learned anything. The attacker corrupts the training dataset, either by altering existing examples or by adding new ones.

**Two main types:**
- **Label poisoning**: changing the labels of real examples (relabeling spam as legitimate)
- **Data poisoning**: injecting entirely fake examples to push the model in a harmful direction
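
To make the difference concrete, here is a minimal sketch on a hypothetical four-message dataset. The messages, the helper names `flip_labels` and `inject_examples`, and the label convention (1 = spam, 0 = ham) are illustrative assumptions, not part of the lab's dataset or code.

```python
# Toy illustration of the two poisoning styles (hypothetical data).
# Label convention assumed here: 1 = spam, 0 = ham.

def flip_labels(dataset, indices):
    """Label poisoning: keep the messages, invert the labels at `indices`."""
    poisoned = list(dataset)  # copy so the clean data is left untouched
    for i in indices:
        text, label = poisoned[i]
        poisoned[i] = (text, 1 - label)
    return poisoned


def inject_examples(dataset, fake_examples):
    """Data poisoning: append attacker-written examples to the dataset."""
    return list(dataset) + list(fake_examples)


clean = [
    ("WIN a free prize now", 1),
    ("Are we still on for lunch?", 0),
    ("Claim your cash reward today", 1),
    ("Running late, see you at 6", 0),
]

label_poisoned = flip_labels(clean, indices=[0])
data_poisoned = inject_examples(clean, [("free prize cash reward", 0)])

print(label_poisoned[0])   # same text as clean[0], with the label inverted
print(len(data_poisoned))  # one more example than the clean set
```

In both cases the attacker only edits data. The training code is never modified.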

**Real world examples:**
- Corrupting training data for a fraud detection model so it misses certain patterns
- Poisoning a medical diagnosis model to misclassify certain conditions
- Poisoning a content moderation model to allow harmful content through

---

## 🗂️ What We Will Do in This Lab

1. Load the SMS spam dataset and train a clean baseline model
2. Record the clean model's accuracy as our benchmark
3. Inject poisoned examples at different rates
4. Retrain on poisoned data and compare accuracy
5. Visualise the damage

---

## ⚙️ Step 1: Import the Tools We Need

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import copy

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

np.random.seed(42)
print("All tools imported successfully.")

---

## 📂 Step 2: Load Dataset and Train a Clean Baseline Model

Before measuring damage from an attack, we need to know how well the model performs **without** any attack. This is our **baseline**: the reference point everything else is measured against.

In [None]:
# =============================================================================
# LOAD DATASET
# =============================================================================

df = pd.read_csv(
    '../datasets/SMSSpamCollection',
    sep='\t',
    header=None,
    names=['label', 'message'],
    encoding='latin-1'
)

df['label_num'] = df['label'].map({'spam': 1, 'ham': 0})

print(f"Dataset loaded: {len(df)} messages")
print(f"Spam: {sum(df.label_num==1)} | Ham: {sum(df.label_num==0)}")
print("")

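# Split the raw messages first so the test set never influences the vectorizer
y = df['label_num'].values
msg_train, msg_test, y_train, y_test = train_test_split(
    df['message'], y, test_size=0.2, random_state=42, stratify=y
)

# Convert text to numbers: fit TF-IDF on the training messages only,
# then apply the same vocabulary and weights to the test messages
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(msg_train).toarray()
X_test = vectorizer.transform(msg_test).toarray()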

print(f"Training set : {len(X_train)} messages")
print(f"Testing set  : {len(X_test)} messages (kept clean throughout)")
print("")

# Train the clean baseline model
print("Training clean baseline model...")
clean_model = LogisticRegression(max_iter=1000, random_state=42)
clean_model.fit(X_train, y_train)

clean_predictions = clean_model.predict(X_test)
clean_accuracy = accuracy_score(y_test, clean_predictions)

print("")
print("=" * 50)
print(f"BASELINE (Clean Model) Accuracy: {clean_accuracy*100:.2f}%")
print("=" * 50)
print("")
print("Remember this number. Any drop after poisoning is attack damage.")

### 👀 What Do You See?

- Write down the clean model's accuracy. This is your benchmark.
- After the poisoning attack, any drop below this number represents damage caused by the attacker.

---

## ☠️ Step 3: Label Flipping (The Poisoning Attack)

We perform **label flipping**: take real spam messages and relabel them as legitimate. When the model trains on this corrupted data, it learns that these spam messages are acceptable.

We run the attack at three different poisoning rates to see how damage scales.
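
Note that each rate applies to the spam class only, so the share of the whole training set that gets corrupted is much smaller than the rate suggests. The sketch below works through that arithmetic. The sizes 4000 and 600 are placeholders, not the lab's actual split, and `flipped_counts` is an illustrative helper.

```python
def flipped_counts(n_train, n_spam, poison_rate):
    """Return (labels flipped, fraction of the full training set affected)."""
    n_flipped = int(n_spam * poison_rate)  # truncates, like int() in the attack function
    return n_flipped, n_flipped / n_train


# Hypothetical sizes: 4000 training messages, 600 of them spam.
for rate in (0.05, 0.10, 0.20):
    n_flipped, share = flipped_counts(4000, 600, rate)
    print(f"rate={rate:.2f}  flipped={n_flipped}  share of training set={share:.2%}")
```

Keeping both numbers in view helps when interpreting the results: a small change in overall accuracy can still mean a noticeable change in how much spam gets through.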

In [None]:
# =============================================================================
# LABEL FLIPPING POISONING ATTACK
# =============================================================================

def label_flip_attack(X_train, y_train, poison_rate):
    """
    Flips labels of a percentage of spam messages to ham.

    Parameters:
        X_train     : training message vectors
        y_train     : original correct labels
        poison_rate : fraction of spam to mislabel (e.g. 0.1 = 10%)

    Returns:
        X_train     : unchanged (we only flip labels, not data)
        y_poisoned  : labels with some spam relabeled as ham
        n_poisoned  : how many labels were flipped
    """
    y_poisoned = copy.deepcopy(y_train)
    spam_indices = np.where(y_train == 1)[0]
    n_to_poison = int(len(spam_indices) * poison_rate)
    poison_indices = np.random.choice(spam_indices, n_to_poison, replace=False)
    y_poisoned[poison_indices] = 0  # flip spam to ham
    return X_train, y_poisoned, n_to_poison


poison_rates = [0.05, 0.10, 0.20]
results = []

print("Running label flipping attack at different poisoning rates...")
print("=" * 65)
print(f"{'Poison Rate':<15} {'Messages Flipped':<20} {'Accuracy':<15} {'Drop'}")
print("-" * 65)

for rate in poison_rates:
    X_p, y_p, n_poisoned = label_flip_attack(X_train, y_train, rate)

    poisoned_model = LogisticRegression(max_iter=1000, random_state=42)
    poisoned_model.fit(X_p, y_p)

    # Always test on the CLEAN test set for an honest measurement
    poisoned_preds = poisoned_model.predict(X_test)
    poisoned_accuracy = accuracy_score(y_test, poisoned_preds)
    drop = clean_accuracy - poisoned_accuracy
    results.append((rate, n_poisoned, poisoned_accuracy, drop, poisoned_model))

    # Print the signed change so a model that happens to score higher is not shown as "--x%"
    print(f"{rate*100:.0f}%{'':<12} {n_poisoned:<20} {poisoned_accuracy*100:.2f}%{'':<9} {-drop*100:+.2f}%")

print("-" * 65)
print(f"Baseline (no attack):{'':<18} {clean_accuracy*100:.2f}%")
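
One design choice worth considering: the loop above draws a fresh random sample of spam indices at each rate, so the labels flipped at 5% are not necessarily among those flipped at 10%. A possible alternative, sketched here with the standard library only, shuffles the spam indices once and takes growing prefixes, so each higher rate extends the previous one. `nested_poison_sets` and the seed value are illustrative assumptions, not part of the lab.

```python
import random


def nested_poison_sets(spam_indices, rates, seed=42):
    """Shuffle once, then take growing prefixes so the poisoned sets are nested."""
    rng = random.Random(seed)  # local RNG: re-running gives the same sets
    shuffled = list(spam_indices)
    rng.shuffle(shuffled)
    return {rate: shuffled[:int(len(shuffled) * rate)] for rate in sorted(rates)}


# Hypothetical positions of spam messages in a training set.
spam_indices = list(range(100, 200))
sets = nested_poison_sets(spam_indices, [0.05, 0.10, 0.20])

for rate, chosen in sets.items():
    print(f"{rate:.2f}: {len(chosen)} labels to flip")
```

With nested sets, any difference between two rates comes from the additional flipped labels rather than from a different random draw.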

### 👀 What Do You See?

- How does accuracy change as more labels are flipped?
- Does even 5% poisoning cause a measurable drop? What does this tell you about how sensitive models are to data quality?
- At 20% poisoning, how many more spam messages reach users compared to the clean model?
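
Accuracy alone does not answer the last question. The count that does is missed spam (false negatives), sketched here on hypothetical label lists. `missed_spam` is an illustrative helper. In the notebook it could be called with `y_test` and `clean_predictions`, or with the predictions of any poisoned model, assuming 1 = spam and 0 = ham as in the lab.

```python
def missed_spam(y_true, y_pred):
    """Count spam messages (label 1) that the model predicted as ham (label 0)."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)


# Hypothetical labels and predictions for eight test messages.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
clean_pred = [1, 1, 1, 0, 0, 0, 0, 0]
poisoned_pred = [1, 0, 0, 0, 0, 0, 0, 0]

extra = missed_spam(y_true, poisoned_pred) - missed_spam(y_true, clean_pred)
print(f"Extra spam reaching users: {extra}")
```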

---

## 📊 Step 4: Visualise the Damage

In [None]:
# =============================================================================
# VISUALISE POISONING IMPACT
# =============================================================================

rates = [r[0]*100 for r in results]
accuracies = [r[2]*100 for r in results]

plt.figure(figsize=(8, 5))
plt.axhline(
    y=clean_accuracy*100,
    color='green', linestyle='--',
    label=f'Clean baseline ({clean_accuracy*100:.2f}%)'
)
plt.plot(rates, accuracies, 'ro-', linewidth=2, markersize=8, label='Poisoned model')
for rate, acc in zip(rates, accuracies):
    plt.annotate(f'{acc:.2f}%', (rate, acc), textcoords="offset points", xytext=(0, 10))

plt.title('Impact of Label Flipping Poisoning Attack on Model Accuracy')
plt.xlabel('Poisoning Rate (% of spam labels flipped)')
plt.ylabel('Model Accuracy (%)')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('../outputs/lab2_poisoning_impact.png')
plt.show()
print("Chart saved to outputs folder.")

### 👀 What Do You See?

- The green dashed line is where the model should be. The red line shows where it actually performs after poisoning.
- Is the relationship between poisoning rate and accuracy drop linear, or does it accelerate?
- At what poisoning rate would you consider the filter completely broken?
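
One way to approach the linearity question is to compare the accuracy lost per percentage point of poisoning between consecutive rates. The sketch below does this for made-up accuracy values. The numbers are placeholders, not results from this lab, and `drop_per_point` is an illustrative helper. In the notebook the same idea could be applied to the `rates` and `accuracies` lists built for the chart.

```python
def drop_per_point(rates, accuracies):
    """Accuracy lost per percentage point of poisoning, for each consecutive pair."""
    slopes = []
    for i in range(len(rates) - 1):
        lost = accuracies[i] - accuracies[i + 1]
        step = rates[i + 1] - rates[i]
        slopes.append(lost / step)
    return slopes


# Placeholder values only (in percent), not measurements from the lab.
example_rates = [5.0, 10.0, 20.0]
example_accuracies = [98.0, 97.5, 95.5]

slopes = drop_per_point(example_rates, example_accuracies)
print(slopes)
```

Roughly equal slopes suggest a linear relationship. Slopes that grow from one pair to the next suggest the damage is accelerating.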

### 🧪 Try This

Go back to the attack cell, add `0.50` to the `poison_rates` list, and rerun it. At 50% poisoning, is the spam filter still doing better than random guessing?
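
A caveat for this exercise: ham heavily outnumbers spam in this dataset, so 50% accuracy is not the right yardstick. A filter that labels every message as ham already scores the ham share of the test set. The sketch below computes that reference point on hypothetical labels. `majority_baseline_accuracy` is an illustrative helper, and in the notebook it could be called on `y_test`.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Hypothetical test labels: 13 spam (1) and 87 ham (0).
labels = [1] * 13 + [0] * 87
print(f"Always-ham accuracy: {majority_baseline_accuracy(labels):.2%}")
```

A poisoned model that scores at or below this value is no more useful than having no filter at all.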

---

## 💭 Step 5: Reflect

In [None]:
reflection = """
LAB 2 - POISONING ATTACK REFLECTION
=====================================

Q1: In plain English, what is a poisoning attack and when does it happen?
A1: [TYPE YOUR ANSWER HERE]

Q2: Describe what label flipping does and why it damages the model.
A2: [TYPE YOUR ANSWER HERE]

Q3: Who has access to training data before a model is trained in a real
    organisation? What access controls would you recommend?
A3: [TYPE YOUR ANSWER HERE]

Q4: Compare evasion (Lab 1) and poisoning (Lab 2). Which is harder to
    detect? Which causes more lasting damage?
A4: [TYPE YOUR ANSWER HERE]

Q5: Name a real-world AI system where a poisoning attack could have
    serious consequences.
A5: [TYPE YOUR ANSWER HERE]
"""

with open('../outputs/Lab2_Reflection.txt', 'w') as f:
    f.write(reflection)

print("Reflection saved to outputs/Lab2_Reflection.txt")
print(reflection)

---

## ✅ Lab 2 Complete

Return to [START_HERE.ipynb](START_HERE.ipynb) and open Lab 3: Inference Attack.

---
*Built with the Adversarial Robustness Toolbox (ART): https://github.com/Trusted-AI/adversarial-robustness-toolbox*