# 🔴 Lab 4: Extraction Attack (Model Stealing)
### Certified AI Penetration Tester – Red Team (CAIPT-RT)

---

## 🎯 The Story

A company spent years building a machine learning model that predicts whether families qualify for social services. The model is their competitive advantage. The code, weights, and training data are all kept secret.

But they offer it as an API service. You send it an application, it sends back a decision. That is all you get.

You are an attacker, perhaps a competitor, perhaps a researcher exposing bias. You have no access to the model internals. But you have access to the API.

By sending thousands of carefully chosen queries and recording the responses, you can **build your own model that behaves almost identically** to the original.

This is a **Model Extraction Attack**, also called model stealing.

---

## 📖 What is a Model Extraction Attack?

The attacker creates a functional copy of a model by repeatedly querying it and using the query-response pairs as training data for a new model.

**Why is this a problem?**
- Stolen intellectual property: years of R&D reproduced for free
- The stolen model can be used to prepare better evasion attacks locally
- Researchers sometimes use it to expose bias in proprietary models

**Real world examples:**
- Stealing a competitor's fraud detection model
- Copying a medical diagnosis model to avoid licensing fees
- Using a stolen model as a stepping stone for further attacks

---

## 🗂️ What We Will Do in This Lab

1. Train the victim model, the valuable model being stolen
2. Set up a black-box query interface simulating API access
3. Use ART's extraction attack to steal the model
4. Compare stolen model quality at different query volumes
5. Think like a defender

---

## ⚙️ Step 1: Import the Tools We Need

In [None]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

from art.estimators.classification import SklearnClassifier
from art.attacks.extraction import CopycatCNN

np.random.seed(42)
print("All tools imported successfully.")

---

## 📂 Step 2: Load the Dataset and Train the Victim Model

We reuse the Nursery dataset from Lab 3. This time we train a more powerful victim model: the expensive proprietary model the attacker wants to steal.

From this point forward we pretend we have no access to this model except through an API that takes inputs and returns predictions.

In [None]:
# =============================================================================
# LOAD AND PREPARE THE NURSERY DATASET
# =============================================================================

column_names = [
    'parents', 'has_nurs', 'form', 'children',
    'housing', 'finance', 'social', 'health', 'target'
]

df = pd.read_csv(
    '../datasets/nursery.data',
    header=None,
    names=column_names
)

df_encoded = df.copy()
for column in df_encoded.columns:
    le = LabelEncoder()
    df_encoded[column] = le.fit_transform(df_encoded[column])

X = df_encoded.drop('target', axis=1).values
y = df_encoded['target'].values

# Split into three portions:
#   X_train      : victim model trains on this (attacker never sees this)
#   X_steal_pool : attacker uses this to query the victim and collect responses
#   X_eval       : used to evaluate both models fairly
X_train, X_remaining, y_train, y_remaining = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)
X_steal_pool, X_eval, y_steal_pool, y_eval = train_test_split(
    X_remaining, y_remaining, test_size=0.4, random_state=42
)

print("Data prepared:")
print(f"  Victim model training data : {len(X_train)} records (attacker CANNOT see this)")
print(f"  Attacker query pool        : {len(X_steal_pool)} records")
print(f"  Evaluation set             : {len(X_eval)} records")

In [None]:
# =============================================================================
# TRAIN THE VICTIM MODEL
# =============================================================================
# This is the valuable proprietary model. In a real scenario it sits
# behind an API. The attacker cannot see its code, weights, or training data.
# =============================================================================

print("Training the VICTIM model (200 trees - may take 20-30 seconds)...")
print("")

victim_model = RandomForestClassifier(n_estimators=200, random_state=42)
victim_model.fit(X_train, y_train)

victim_accuracy = accuracy_score(y_eval, victim_model.predict(X_eval))

print(f"Victim model accuracy on evaluation set: {victim_accuracy*100:.2f}%")
print("")
print("This is our benchmark. The stolen model will try to match it.")
print("From here the attacker only has API access - no model internals.")

art_victim = SklearnClassifier(model=victim_model)

### 👀 What Do You See?

- The victim model's accuracy is our benchmark.
- The attacker's goal is to match this as closely as possible.
- Remember: from this point the attacker has NO access to training data or model code.
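
The "API only" constraint can be made concrete with a thin wrapper that hides the model and returns nothing but labels. The sketch below is illustrative only: `BlackBoxAPI` and `ThresholdModel` are hypothetical names that do not appear elsewhere in this lab, and the stand-in model is a toy rule rather than the Nursery victim.

```python
import numpy as np


class ThresholdModel:
    """Hypothetical stand-in for a trained victim: class 1 if the feature sum exceeds 1."""

    def predict(self, x):
        return (np.asarray(x).sum(axis=1) > 1.0).astype(int)


class BlackBoxAPI:
    """Exposes only predicted labels and counts how many records were queried."""

    def __init__(self, model):
        self._model = model      # hidden from callers by convention
        self.query_count = 0

    def query(self, x):
        x = np.atleast_2d(x)
        self.query_count += len(x)   # one query per submitted record
        return self._model.predict(x)


api = BlackBoxAPI(ThresholdModel())
labels = api.query([[0.2, 0.3], [0.9, 0.8], [1.5, 0.0]])
print("Labels returned:", labels)
print("Queries used   :", api.query_count)
```

Counting queries inside the wrapper gives both sides a useful number: the attacker's budget and the defender's monitoring signal.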

---

## 🔴 Step 3: Perform the Extraction Attack

The attack works like this:
1. Attacker sends records from their query pool to the victim's API
2. Gets back predictions
3. Now has (input, label) pairs, but the labels came from the victim, not ground truth
4. Trains their own model on this stolen dataset

The victim model is being used as a labeling service.
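
The four steps above can be sketched without any attack library. This is a minimal illustration under stated assumptions: it uses synthetic data from `make_classification` instead of the Nursery dataset, and the names `X_demo`, `victim`, `surrogate`, and `fidelity` are introduced here only for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; the lab itself uses the Nursery dataset).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=8, n_informative=5, random_state=0
)
X_secret, X_pool, y_secret, _ = train_test_split(
    X_demo, y_demo, test_size=0.5, random_state=0
)

# The victim trains on data the attacker never sees.
victim = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_secret, y_secret)

# Steps 1-2: send query-pool records to the victim and record its answers.
X_queries, X_holdout = X_pool[:600], X_pool[600:]
victim_labels = victim.predict(X_queries)

# Steps 3-4: train a surrogate on (input, victim label) pairs.
surrogate = LogisticRegression(max_iter=1000).fit(X_queries, victim_labels)

# Fidelity: how often the surrogate agrees with the victim on unseen inputs.
fidelity = np.mean(surrogate.predict(X_holdout) == victim.predict(X_holdout))
print(f"Agreement with victim on held-out inputs: {fidelity:.2%}")
```

Note that the attacker's ground-truth labels are discarded (`_` in the split): only the victim's answers are used for training.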

In [None]:
# =============================================================================
# EXTRACTION ATTACK AT THREE QUERY BUDGETS
# =============================================================================
# We test with 100, 500, and 2000 queries to show how quality scales.
# More queries = better stolen model but also more suspicious to the defender.
# =============================================================================

def create_stolen_model():
    """Creates a fresh logistic regression model to use as the copycat.
    The attacker does not need to use the same model type as the victim."""
    return SklearnClassifier(
        model=LogisticRegression(max_iter=1000, random_state=42)
    )

query_budgets = [100, 500, 2000]
stolen_results = []

print("Running extraction attack with different query budgets...")
print("(More queries = longer runtime)")
print("")

for n_queries in query_budgets:
    print(f"  Testing with {n_queries} queries...")

    stolen_classifier = create_stolen_model()

    attack = CopycatCNN(
        classifier=art_victim,
        batch_size_fit=32,
        batch_size_query=32,
        nb_epochs=10,
        nb_stolen=n_queries
    )

    # extract() queries the victim for labels and trains the stolen model on them.
    # Ground-truth labels are not passed: a real attacker would not have them.
    stolen_model = attack.extract(
        x=X_steal_pool[:n_queries],
        thieved_classifier=stolen_classifier
    )

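    stolen_preds = stolen_model.predict(X_eval)
    # Probability columns follow the classes the stolen model saw during training.
    # If a rare class never appeared in the victim's answers, column positions no
    # longer equal class labels, so map argmax positions back through classes_.
    stolen_labels = stolen_model.model.classes_[np.argmax(stolen_preds, axis=1)]
    stolen_accuracy = accuracy_score(y_eval, stolen_labels)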

    victim_preds = victim_model.predict(X_eval)
    agreement = accuracy_score(victim_preds, stolen_labels)

    stolen_results.append((n_queries, stolen_accuracy, agreement))
    print(f"    Accuracy: {stolen_accuracy*100:.2f}% | Agreement with victim: {agreement*100:.2f}%")

print("")
print(f"Victim model accuracy (benchmark): {victim_accuracy*100:.2f}%")

### 👀 What Do You See?

- **Accuracy**: how well the stolen model performs on the task overall
- **Agreement**: how often the stolen and victim models give the same answer
- Do more queries always improve the stolen model?
- Even with only 100 queries, how close did the stolen model get?
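
The difference between the two metrics is easiest to see on a toy example. The arrays below are made up for illustration and are not lab results: the stolen model copies the victim's one mistake and adds one of its own, so its agreement with the victim is higher than its accuracy.

```python
import numpy as np

# Toy labels, invented for this illustration only.
toy_truth  = np.array([0, 1, 1, 0, 2, 2, 1, 0])
toy_victim = np.array([0, 1, 1, 0, 2, 1, 1, 0])   # wrong at position 5
toy_stolen = np.array([0, 1, 0, 0, 2, 1, 1, 0])   # wrong at positions 2 and 5


def match_rate(a, b):
    """Fraction of positions where two label arrays are equal."""
    return float(np.mean(a == b))


victim_acc = match_rate(toy_truth, toy_victim)    # accuracy: compared with ground truth
stolen_acc = match_rate(toy_truth, toy_stolen)    # accuracy: compared with ground truth
agreement  = match_rate(toy_victim, toy_stolen)   # agreement: compared with the victim

print(f"Victim accuracy : {victim_acc:.3f}")
print(f"Stolen accuracy : {stolen_acc:.3f}")
print(f"Agreement       : {agreement:.3f}")
```

Here the victim is right on 7 of 8 records, the stolen model on 6 of 8, and the two models agree on 7 of 8. Agreement measures how faithful the copy is, even where the original is wrong.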

---

## 📊 Step 4: Visualise the Trade-off

In [None]:
budgets = [r[0] for r in stolen_results]
accuracies = [r[1]*100 for r in stolen_results]
agreements = [r[2]*100 for r in stolen_results]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.plot(budgets, accuracies, 'bo-', linewidth=2, markersize=8)
ax1.axhline(y=victim_accuracy*100, color='red', linestyle='--',
            label=f'Victim ({victim_accuracy*100:.1f}%)')
ax1.set_title('Stolen Model Accuracy vs Query Budget')
ax1.set_xlabel('Number of Queries')
ax1.set_ylabel('Accuracy (%)')
ax1.legend()
ax1.grid(True, alpha=0.3)
for b, a in zip(budgets, accuracies):
    ax1.annotate(f'{a:.1f}%', (b, a), textcoords="offset points", xytext=(0,10))

ax2.plot(budgets, agreements, 'go-', linewidth=2, markersize=8)
ax2.axhline(y=100, color='red', linestyle='--', label='Perfect copy (100%)')
ax2.set_title('Stolen Model Agreement with Victim vs Query Budget')
ax2.set_xlabel('Number of Queries')
ax2.set_ylabel('Agreement (%)')
ax2.legend()
ax2.grid(True, alpha=0.3)
for b, a in zip(budgets, agreements):
    ax2.annotate(f'{a:.1f}%', (b, a), textcoords="offset points", xytext=(0,10))

plt.tight_layout()
plt.savefig('../outputs/lab4_extraction_results.png')
plt.show()
print("Chart saved to outputs folder.")
print("")
print("Summary:")
print("=" * 55)
print(f"{'Queries':<12} {'Stolen Accuracy':<20} {'Agreement with Victim'}")
print("-" * 55)
for budget, acc, agr in stolen_results:
    print(f"{budget:<12} {acc*100:.2f}%{'':<14} {agr*100:.2f}%")
print("-" * 55)
print(f"{'Victim':<12} {victim_accuracy*100:.2f}% (benchmark)")

### 👀 What Do You See?

- Notice **diminishing returns**: the improvement from 100 to 500 queries may be much larger than from 500 to 2000. Why does this happen?
- Is 80%+ agreement a successful steal?

### 🧪 Try This

Add `50` as the first value in `query_budgets` and rerun.

- Can you build a usable stolen model with just 50 queries?
- From a defender's perspective, at what query count should a security alert trigger?
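
To reason about alert thresholds, it helps to see what a simple monitor could look like. The following is a rough sketch, not a production design: `QueryMonitor`, the 100-query limit, and the 60-second window are all hypothetical values chosen for illustration.

```python
from collections import defaultdict, deque


class QueryMonitor:
    """Flags clients whose query count within a sliding time window exceeds a limit."""

    def __init__(self, max_queries, window_seconds):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)   # client_id -> timestamps of recent queries

    def record(self, client_id, timestamp):
        """Record one query; return True if the client is now over the limit."""
        times = self._history[client_id]
        times.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while times and times[0] <= timestamp - self.window_seconds:
            times.popleft()
        return len(times) > self.max_queries


monitor = QueryMonitor(max_queries=100, window_seconds=60)

# A client sending 5 queries per second trips the alert.
fast_alerts = [monitor.record("client-a", t / 5) for t in range(500)]
first_alert = fast_alerts.index(True)
print(f"Fast client: alert raised at query number {first_alert + 1}")

# A client sending 1 query per second stays under the limit.
slow_alerts = [monitor.record("client-b", float(t)) for t in range(300)]
print(f"Slow client: any alert raised? {any(slow_alerts)}")
```

The slow client shows the limitation: an attacker who spreads queries over time, or across several accounts, stays below a per-client rate threshold, so volume monitoring is usually combined with other defenses.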

---

## 🛡️ Step 5: Think Like a Defender

In [None]:
# =============================================================================
# DEFENSIVE ANALYSIS
# =============================================================================
# If you were protecting the victim model, what signals would alert you
# to an extraction attack in progress?
# =============================================================================

print("DEFENDER PERSPECTIVE")
print("=" * 65)
print("")
print("1. QUERY VOLUME MONITORING")
print("-" * 40)
for n_queries, acc, agr in stolen_results:
    print(f"   {n_queries:>5} queries achieved {agr*100:.1f}% agreement with victim")
print("   -> Set an alert threshold on query volume per user/IP")
print("")

print("2. OUTPUT ROUNDING DEFENSE")
print("-" * 40)
sample = X_eval[:2]
exact_probs = victim_model.predict_proba(sample)
rounded_probs = np.round(exact_probs, 1)

print("   Exact probabilities (full info to attacker):")
for i, p in enumerate(exact_probs):
    top3 = sorted(zip(p, range(len(p))), reverse=True)[:3]
    print(f"   Record {i+1}: {[round(float(v), 4) for v, _ in top3]} (top 3 classes)")

print("")
print("   Rounded to 1 decimal (less info):")
for i, p in enumerate(rounded_probs):
    top3 = sorted(zip(p, range(len(p))), reverse=True)[:3]
    print(f"   Record {i+1}: {[round(float(v), 1) for v, _ in top3]} (top 3 classes)")

print("")
print("   -> Less precision = less useful to attacker = worse stolen model")
print("")
print("3. LABEL-ONLY DEFENSE")
print("-" * 40)
print("   Return ONLY the predicted class, no probabilities at all.")
print("   Forces attacker to need far more queries for same quality.")

### 👀 What Do You See?

- Which defense do you think would be most effective?
- Which would be least disruptive to legitimate users?
- Real companies like Google and Amazon expose ML via APIs. What defenses do you think they use?
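
The rounding and label-only defenses can both be expressed as a small post-processing step on the API response. This is a sketch under assumptions: `harden_response` is a hypothetical helper, not part of ART or scikit-learn, and the probability vector is invented for the example.

```python
import numpy as np


def harden_response(probs, mode="round", decimals=1):
    """Reduce the information in a probability vector before returning it to a client.

    mode="round": round each probability to `decimals` places.
    mode="label": return only the index of the most likely class.
    """
    probs = np.asarray(probs, dtype=float)
    if mode == "round":
        return np.round(probs, decimals)
    if mode == "label":
        return int(np.argmax(probs))
    raise ValueError(f"unknown mode: {mode}")


exact = [0.6312, 0.2871, 0.0817]   # made-up model output
rounded = harden_response(exact, mode="round")
label_only = harden_response(exact, mode="label")

print("Exact      :", exact)
print("Rounded    :", rounded)
print("Label only :", label_only)
```

Two caveats are worth keeping in mind when ranking the defenses. Rounded probabilities may no longer sum to exactly 1, which can surprise legitimate clients. And an attack that trains only on predicted labels, as in Step 3, is unaffected by rounding, because the most likely class usually stays the same.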

---

## 💭 Step 6: Reflect

In [None]:
reflection = """
LAB 4 - EXTRACTION ATTACK REFLECTION
======================================

Q1: In plain English, what is a model extraction attack and what does
    the attacker gain?
A1: [TYPE YOUR ANSWER HERE]

Q2: The attacker only queried the API and never saw model code or training data.
    What does this tell you about the risk of exposing ML models via public APIs?
A2: [TYPE YOUR ANSWER HERE]

Q3: You saw diminishing returns as query count increased. Why does the
    stolen model improve quickly at first, then plateau?
A3: [TYPE YOUR ANSWER HERE]

Q4: Rank the three defenses (query limiting, output rounding, label-only)
    from most to least effective. Explain your reasoning.
A4: [TYPE YOUR ANSWER HERE]

Q5: Looking back at all four labs, which attack poses the greatest risk
    to organisations deploying AI today? Justify your answer.
A5: [TYPE YOUR ANSWER HERE]

BONUS: Can you think of a scenario where a model extraction attack could
       be used for ethically justified reasons?
BONUS: [TYPE YOUR ANSWER HERE]
"""

with open('../outputs/Lab4_Reflection.txt', 'w') as f:
    f.write(reflection)

print("Reflection saved to outputs/Lab4_Reflection.txt")
print(reflection)

---

## ✅ Lab 4 Complete, and So Is the Course!

You have worked through all four core attack types:

| Attack | When | Target | Key Tool |
|--------|------|--------|----------|
| Evasion | After deployment | Model inputs | HopSkipJump (ART) |
| Poisoning | During training | Training data | Label Flipping |
| Inference | After deployment | Data privacy | MembershipInference (ART) |
| Extraction | After deployment | Model IP | CopycatCNN (ART) |

Each attack represents a real threat that AI security practitioners defend against today. The tools you used are the same tools used by researchers at IBM, Microsoft, Google, and security firms worldwide.

Return to [START_HERE.ipynb](START_HERE.ipynb) to review your completed labs.

---
*Built with the Adversarial Robustness Toolbox (ART): https://github.com/Trusted-AI/adversarial-robustness-toolbox*