# 🔴 Lab 4: Extraction Attack (Model Stealing)
### Certified AI Penetration Tester – Red Team (CAIPT-RT)

---

## 🎯 The Story

A company has spent years and millions of dollars building a machine learning model that predicts whether families qualify for social services. The model is their competitive advantage. They protect it carefully: the model itself, the code, and the training data are all kept secret.

But they do offer it as an API service. You send it an application, it sends back a decision. That is all you get: no code, no probabilities, just a label.

You are an attacker, perhaps a competitor, perhaps a researcher exposing bias. You have no access to the model internals or the training data. But you do have access to the API.

By sending thousands of carefully chosen queries and recording the responses, you can **build your own model that behaves almost identically** to the original, without ever seeing it.

This is a **Model Extraction Attack**, also called model stealing.

---

## 📖 What is a Model Extraction Attack?

A model extraction attack allows an attacker to create a functional copy of a machine learning model by repeatedly querying it and using the query-response pairs as training data for a new model.

**Why is this a problem?**
- Extraction amounts to **intellectual property theft**: years of R&D are reproduced for the cost of the queries
- The stolen model can be used to **prepare better attacks**: once you have a local copy, you can develop and test evasion and poisoning attacks against it without sending further queries to the API
- The stolen model can be used to **probe for bias**: researchers sometimes use extraction to expose unfairness in proprietary models

**Real world examples:**
- Stealing a competitor's fraud detection model
- Copying a medical diagnosis model to avoid licensing fees
- Using a stolen model as a stepping stone for further attacks

---

## 🗂️ What We Will Do in This Lab

1. Train the "victim" model, the valuable model being stolen
2. Set up a query interface simulating black-box API access
3. Use ART's extraction attack to steal the model
4. Evaluate how close the stolen model is to the original
5. Test how query volume affects the quality of the stolen model

---

## ⚙️ Step 1: Import the Tools We Need

In [None]:
# =============================================================================
# IMPORTS
# =============================================================================
# New addition for this lab:
# CopycatCNN : ART's model extraction attack
#              Despite the name 'CNN' (Convolutional Neural Network),
#              this attack works for any classifier - the name comes
#              from the original research paper that introduced this
#              technique for image models, but ART adapted it broadly.
# =============================================================================

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# ART extraction attack
from art.estimators.classification import SklearnClassifier
from art.attacks.extraction import CopycatCNN

np.random.seed(42)

print("All tools imported successfully.")

---

## 📂 Step 2: Load the Dataset and Train the Victim Model

We reuse the Nursery dataset from Lab 3. This time, we train a more powerful victim model: the "expensive proprietary model" that the attacker wants to steal.

We then pretend we have no access to this model except through an API that takes inputs and returns predictions.
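
One way to picture "API access only" is a thin wrapper that exposes a single prediction call and hides everything else. The sketch below is illustrative and is not part of the lab code: `BlackBoxAPI`, `predict_label`, and `query_count` are hypothetical names, and `toy_model` is a made-up stand-in for the victim so the example runs on its own.

```python
from typing import Callable, Sequence


class BlackBoxAPI:
    """Hypothetical wrapper that returns labels only and counts queries."""

    def __init__(self, predict_fn: Callable[[Sequence[float]], int]):
        # The underlying model stays private; callers never touch it directly.
        self._predict_fn = predict_fn
        self.query_count = 0

    def predict_label(self, record: Sequence[float]) -> int:
        """Return the predicted class label for a single record."""
        self.query_count += 1
        return self._predict_fn(record)


def toy_model(record: Sequence[float]) -> int:
    # Stand-in for the victim (assumption): class 1 when the feature sum exceeds 3.
    return int(sum(record) > 3)


api = BlackBoxAPI(toy_model)
labels = [api.predict_label(r) for r in [[0, 1, 1], [2, 2, 1], [3, 0, 0]]]
print(labels, api.query_count)
```

The query counter matters later in the lab: it is the one signal a defender can observe from the outside, and it is what the attacker's query budget spends.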

In [None]:
# =============================================================================
# LOAD AND PREPARE THE NURSERY DATASET
# =============================================================================
# Same loading process as Lab 3
# =============================================================================

column_names = [
    'parents', 'has_nurs', 'form', 'children',
    'housing', 'finance', 'social', 'health', 'target'
]

df = pd.read_csv(
    '../datasets/nursery.data',
    header=None,
    names=column_names
)

# Encode all text columns to numbers
df_encoded = df.copy()
for column in df_encoded.columns:
    le = LabelEncoder()
    df_encoded[column] = le.fit_transform(df_encoded[column])

X = df_encoded.drop('target', axis=1).values
y = df_encoded['target'].values

# Split the data
# Note: we keep a separate 'steal_pool' dataset
# This represents data the ATTACKER has access to (not the original training data)
# The attacker uses this pool to query the victim model and collect responses
X_train, X_remaining, y_train, y_remaining = train_test_split(
    X, y, test_size=0.5, random_state=42, stratify=y
)

# Split remaining data into attacker's query pool and evaluation set
X_steal_pool, X_eval, y_steal_pool, y_eval = train_test_split(
    X_remaining, y_remaining, test_size=0.4, random_state=42
)

print("Data prepared:")
print(f"  Victim model training data : {len(X_train)} records")
print(f"  Attacker query pool        : {len(X_steal_pool)} records")
print(f"  Evaluation set             : {len(X_eval)} records")
print("")
print("The attacker ONLY has access to the query pool.")
print("The attacker does NOT have the training data or original labels.")

In [None]:
# =============================================================================
# TRAIN THE VICTIM MODEL
# =============================================================================
# This is the valuable, proprietary model the attacker wants to steal.
# In a real scenario, this model would sit behind an API.
# The attacker cannot see its code, weights, or training data.
#
# We use a Random Forest with 200 trees - a strong, well-trained model
# that represents real-world production quality
# =============================================================================

print("Training the VICTIM model (the valuable model to be stolen)...")
print("(200 decision trees - may take 20-30 seconds)")
print("")

victim_model = RandomForestClassifier(n_estimators=200, random_state=42)
victim_model.fit(X_train, y_train)

victim_accuracy = accuracy_score(y_eval, victim_model.predict(X_eval))

print("Victim model trained successfully.")
print(f"Victim model accuracy on evaluation set: {victim_accuracy*100:.2f}%")
print("")
print("This is the accuracy benchmark. The stolen model will try to match this.")

# Wrap in ART
art_victim = SklearnClassifier(model=victim_model)
print("")
print("Victim model wrapped in ART. Simulating API access only.")

### 👀 What Do You See?

- The victim model's accuracy on the evaluation set is our benchmark.
- The attacker's goal is to build a stolen model that performs as close to this as possible.
- Remember: from this point forward, the attacker has NO access to the victim model's code or training data, only the ability to query it.

---

## 🔴 Step 3: Perform the Extraction Attack

The extraction attack works like this:

1. The attacker takes data from their own query pool
2. They send each record to the victim model's API and get back a prediction
3. They now have a dataset of (input, label) pairs, but the labels came from the victim model, not the original data
4. They train their own model on this "stolen" dataset

This is essentially using the victim model as a labeling service to create training data for the copycat.
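
The four steps above can also be sketched by hand with scikit-learn alone, without ART. This is a minimal illustration under stated assumptions, not the lab's implementation: it uses synthetic data from `make_classification` rather than the Nursery dataset, and the names `victim`, `surrogate`, `X_pool`, and `X_test` are chosen for this example only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: NOT the Nursery dataset)
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=5, random_state=0
)
X_victim, X_rest, y_victim, _ = train_test_split(
    X, y, test_size=0.5, random_state=0
)
X_pool, X_test = train_test_split(X_rest, test_size=0.4, random_state=0)

# The victim is trained on data the attacker never sees
victim = RandomForestClassifier(n_estimators=50, random_state=0)
victim.fit(X_victim, y_victim)

# Steps 1-2: send the attacker's own records to the victim, keep the answers
stolen_labels = victim.predict(X_pool)

# Steps 3-4: train a copycat on (input, victim label) pairs;
# the true labels of the pool are never used
surrogate = LogisticRegression(max_iter=1000)
surrogate.fit(X_pool, stolen_labels)

# Fidelity: how often the copy agrees with the victim on unseen inputs
agreement = np.mean(surrogate.predict(X_test) == victim.predict(X_test))
print(f"Agreement with victim on held-out inputs: {agreement:.2%}")
```

Note that the copycat here is a different model family from the victim, which mirrors the lab's choice of Logistic Regression against a Random Forest.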

In [None]:
# =============================================================================
# EXTRACTION ATTACK - STEAL THE MODEL
# =============================================================================
# CopycatCNN parameters:
#
# classifier      : the victim model being stolen (via ART wrapper)
# batch_size_fit  : how many samples to use per training batch
# batch_size_query: how many samples to query the victim model at once
# nb_epochs       : how many times to train the stolen model on the collected data
# nb_stolen       : CRITICAL - how many queries to send to the victim model
#                   More queries = more data = better stolen model
#                   But more queries also = more suspicious to the victim owner
#
# We will run THREE versions with different query budgets to see the trade-off
# =============================================================================

# The stolen model architecture - we use Logistic Regression as the copycat
# The attacker does not need to use the same model type as the victim
def create_stolen_model():
    return SklearnClassifier(
        model=LogisticRegression(max_iter=1000, random_state=42)
    )

query_budgets = [100, 500, 2000]
stolen_results = []

print("Running extraction attack with different query budgets...")
print("(More queries = longer runtime but potentially better stolen model)")
print("")

for n_queries in query_budgets:
    print(f"  Testing with {n_queries} queries...")
    
    # Create a fresh stolen model for each test
    stolen_classifier = create_stolen_model()
    
    # Create and run the extraction attack
    attack = CopycatCNN(
        classifier=art_victim,
        batch_size_fit=32,
        batch_size_query=32,
        nb_epochs=10,
        nb_stolen=n_queries
    )
    
    # extract() is where the stealing happens
    # It queries the victim model and trains the stolen model
    stolen_model = attack.extract(
        x=X_steal_pool[:n_queries],
        y=y_steal_pool[:n_queries],
        thieved_classifier=stolen_classifier
    )
    
    # Evaluate the stolen model
    stolen_preds = stolen_model.predict(X_eval)
    stolen_labels = np.argmax(stolen_preds, axis=1)
    stolen_accuracy = accuracy_score(y_eval, stolen_labels)
    
    # Also check agreement with victim model (not just accuracy)
    # Agreement = how often stolen model gives SAME answer as victim
    victim_preds = victim_model.predict(X_eval)
    agreement = accuracy_score(victim_preds, stolen_labels)
    
    stolen_results.append((n_queries, stolen_accuracy, agreement))
    print(f"    Accuracy: {stolen_accuracy*100:.2f}% | Agreement with victim: {agreement*100:.2f}%")

print("")
print(f"Victim model accuracy (benchmark): {victim_accuracy*100:.2f}%")

### 👀 What Do You See?

Look at the results for each query budget.

- **Accuracy** tells you how often the stolen model's predictions match the true labels in the evaluation set.
- **Agreement** tells you how often the stolen model gives the same answer as the victim. This measures how faithful the copy is.
- Does increasing the number of queries always improve the stolen model? At what point does adding more queries stop helping significantly?
- Even with only 100 queries, how close did the stolen model get to the victim?
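
Accuracy and agreement are easy to confuse, so a tiny worked example may help. The labels below are hypothetical values invented for illustration (they are not lab output), and `match_rate` is a helper name introduced only for this sketch.

```python
def match_rate(a, b):
    """Fraction of positions where two equal-length label sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Hypothetical labels for five records (illustrative, not measured)
true_labels = [0, 1, 1, 0, 2]
victim_labels = [0, 1, 0, 0, 2]  # the victim is wrong on the third record
copy_labels = [0, 1, 0, 0, 1]    # the copy repeats that mistake and adds one

acc_victim = match_rate(true_labels, victim_labels)  # 4 of 5 correct
acc_copy = match_rate(true_labels, copy_labels)      # 3 of 5 correct
fidelity = match_rate(victim_labels, copy_labels)    # 4 of 5 identical

print(acc_victim, acc_copy, fidelity)
```

In this toy case the copy agrees with the victim more often than it agrees with the truth, because a faithful copy also inherits the victim's mistakes.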

---

## 📊 Step 4: Visualize the Trade-off

In [None]:
# =============================================================================
# VISUALIZE QUERY BUDGET VS STOLEN MODEL QUALITY
# =============================================================================

budgets = [r[0] for r in stolen_results]
accuracies = [r[1]*100 for r in stolen_results]
agreements = [r[2]*100 for r in stolen_results]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Plot 1: Accuracy vs Query Budget
ax1.plot(budgets, accuracies, 'bo-', linewidth=2, markersize=8)
ax1.axhline(y=victim_accuracy*100, color='red', linestyle='--',
           label=f'Victim accuracy ({victim_accuracy*100:.1f}%)')
ax1.set_title('Stolen Model Accuracy vs Query Budget')
ax1.set_xlabel('Number of Queries to Victim Model')
ax1.set_ylabel('Accuracy (%)')
ax1.legend()
ax1.grid(True, alpha=0.3)
for b, a in zip(budgets, accuracies):
    ax1.annotate(f'{a:.1f}%', (b, a), textcoords="offset points", xytext=(0, 10))

# Plot 2: Agreement vs Query Budget
ax2.plot(budgets, agreements, 'go-', linewidth=2, markersize=8)
ax2.axhline(y=100, color='red', linestyle='--', label='Perfect copy (100%)')
ax2.set_title('Stolen Model Agreement with Victim vs Query Budget')
ax2.set_xlabel('Number of Queries to Victim Model')
ax2.set_ylabel('Agreement with Victim Model (%)')
ax2.legend()
ax2.grid(True, alpha=0.3)
for b, a in zip(budgets, agreements):
    ax2.annotate(f'{a:.1f}%', (b, a), textcoords="offset points", xytext=(0, 10))

plt.tight_layout()
plt.savefig('../outputs/lab4_extraction_results.png')
plt.show()

print("Chart saved to outputs folder.")
print("")
print("Summary Table:")
print("=" * 55)
print(f"{'Queries':<12} {'Stolen Accuracy':<20} {'Agreement with Victim'}")
print("-" * 55)
for budget, acc, agr in stolen_results:
    # acc and agr are stored as fractions (0-1), so scale to percent before printing
    print(f"{budget:<12} {acc*100:.2f}%{'':<14} {agr*100:.2f}%")
print("-" * 55)
print(f"{'Victim':<12} {victim_accuracy*100:.2f}% (benchmark)")

### 👀 What Do You See?

- Look at the accuracy chart. How close does the stolen model get to the victim's accuracy?
- Look at the agreement chart. Agreement of 80% or more means at least 4 out of 5 predictions match the victim's. Is that a successful steal?
- Notice that the improvement from 100 to 500 queries might be much larger than from 500 to 2000 queries. This is called **diminishing returns**. Why does this happen?
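
Diminishing returns can be made concrete by computing how much agreement each extra block of queries buys. The sketch below is an assumption-laden illustration: `marginal_gains` is a hypothetical helper, and the numbers in `example` are placeholders, not results from this lab. To use your own results, pass `[(q, agr) for q, _, agr in stolen_results]` instead.

```python
def marginal_gains(results):
    """Agreement gained per 100 additional queries between consecutive budgets.

    `results` is a list of (n_queries, agreement) pairs sorted by n_queries,
    with agreement expressed as a fraction between 0 and 1.
    """
    gains = []
    for (q_prev, a_prev), (q_next, a_next) in zip(results, results[1:]):
        per_100 = 100 * (a_next - a_prev) / (q_next - q_prev)
        gains.append((q_next, per_100))
    return gains


# Placeholder values for illustration only (NOT measured in this lab)
example = [(100, 0.70), (500, 0.82), (2000, 0.85)]

for budget, gain in marginal_gains(example):
    print(f"Up to {budget} queries: +{gain * 100:.2f} points per 100 queries")
```

With these placeholder numbers, the first step buys far more agreement per query than the second, which is the pattern the bullet above asks you to look for.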

### 🧪 Try This

Edit the `query_budgets` list in the attack cell and add `50` as the first value.

- Can you build a usable stolen model with just 50 queries?
- From a defender's perspective, what is the minimum number of API queries that should trigger a security alert?
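
To reason about alert thresholds, it can help to see what a simple query-volume monitor looks like. The following is a minimal sketch, assuming a per-client sliding time window; `QueryMonitor`, its parameters, and the client ID are hypothetical and do not correspond to any real API gateway.

```python
from collections import defaultdict, deque


class QueryMonitor:
    """Hypothetical per-client sliding-window query counter."""

    def __init__(self, max_queries: int, window_seconds: float):
        self.max_queries = max_queries
        self.window_seconds = window_seconds
        self._history = defaultdict(deque)  # client_id -> recent timestamps

    def record(self, client_id: str, timestamp: float) -> bool:
        """Log one query; return True if the client is over the threshold."""
        window = self._history[client_id]
        window.append(timestamp)
        # Discard timestamps that have aged out of the window
        while window and timestamp - window[0] >= self.window_seconds:
            window.popleft()
        return len(window) > self.max_queries


# Allow at most 3 queries per 60 seconds (illustrative threshold)
monitor = QueryMonitor(max_queries=3, window_seconds=60)
alerts = [monitor.record("client-a", t) for t in [0, 10, 20, 30, 100]]
print(alerts)
```

A patient attacker can stay under any fixed rate by spreading queries over time or across accounts, so a volume threshold like this raises the cost of extraction rather than preventing it.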

---

## 🛡️ Step 5: Think Like a Defender

In [None]:
# =============================================================================
# DEFENSIVE ANALYSIS
# =============================================================================
# Now we switch perspective. If you were protecting the victim model,
# what signals would tell you that someone is trying to steal it?
#
# This cell simulates what a defender might monitor:
# - Query volume (too many queries = suspicious)
# - Query patterns (systematic coverage of input space = suspicious)
# - Output rounding (reduce information by rounding probabilities)
# =============================================================================

print("DEFENDER PERSPECTIVE: What would alert you to an extraction attack?")
print("=" * 65)
print("")

# Simulate what the attacker's queries look like from the defender's view
print("1. QUERY VOLUME ANALYSIS")
print("-" * 40)
for n_queries, acc, agr in stolen_results:
    # agr is stored as a fraction (0-1), so scale to percent before printing
    print(f"   Attack using {n_queries:>5} queries achieved {agr*100:.1f}% agreement")
print("   -> A defender monitoring query volume could set an alert threshold")
print("")

# Demonstrate output rounding as a defense
print("2. OUTPUT ROUNDING DEFENSE")
print("-" * 40)
print("   Instead of returning exact probabilities, round them to 2 decimal places.")
print("   This reduces the information the attacker gets per query.")
print("")

# Show the difference in information
sample = X_eval[:3]
exact_probs = victim_model.predict_proba(sample)
rounded_probs = np.round(exact_probs, 2)

print("   Exact probabilities (what an API that exposes probabilities would return):")
for i, p in enumerate(exact_probs):
    print(f"   Record {i+1}: {p}")
print("")
print("   Rounded probabilities (with rounding defense):")
for i, p in enumerate(rounded_probs):
    print(f"   Record {i+1}: {p}")
print("")
print("   -> Less precise = less useful to attacker = worse stolen model")
print("")

print("3. LABEL-ONLY DEFENSE")
print("-" * 40)
print("   Return ONLY the predicted class label, no probabilities at all.")
print("   This forces the attacker to work with much less information.")
print("   The attack can still work but requires far more queries.")

### 👀 What Do You See?

- The defender has several options to make extraction harder without completely shutting down API access.
- Which defense do you think would be most effective? Which would be least disruptive to legitimate users?
- In real life, companies like Google and Amazon expose ML models through APIs. What defenses do you think they use?
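
The rounding and label-only defenses can be pictured as a small filter applied to the model's output before it leaves the API. This is a sketch under stated assumptions: `harden_response` and its `mode` values are hypothetical names, and the probability vector is made up for illustration.

```python
from typing import List, Sequence, Union


def harden_response(
    probabilities: Sequence[float], mode: str = "round", decimals: int = 2
) -> Union[List[float], int]:
    """Reduce the information returned for one prediction.

    mode="round" -> probabilities rounded to `decimals` places
    mode="label" -> only the index of the most likely class
    """
    if mode == "label":
        return max(range(len(probabilities)), key=lambda i: probabilities[i])
    if mode == "round":
        return [round(p, decimals) for p in probabilities]
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up probability vector for a three-class model (illustrative only)
raw = [0.0312, 0.9154, 0.0534]
rounded = harden_response(raw, mode="round")
label = harden_response(raw, mode="label")
print(rounded, label)
```

Both modes preserve the top prediction, so legitimate users who only need a decision are unaffected; what changes is how much the attacker learns about the model's confidence from each query.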

---

## 💭 Step 6: Reflect

In [None]:
# =============================================================================
# REFLECTION - SAVE YOUR ANSWERS
# =============================================================================

reflection = """
LAB 4 - EXTRACTION ATTACK REFLECTION
======================================

Q1: In plain English, what is a model extraction attack?
    What does the attacker gain that they did not have before?
A1: [TYPE YOUR ANSWER HERE]

Q2: The attacker only needed to query the API - they never saw the model
    code, weights, or training data. What does this tell you about the
    risk of exposing ML models through public APIs?
A2: [TYPE YOUR ANSWER HERE]

Q3: You saw diminishing returns as query count increased.
    Why does the stolen model improve quickly at first, then plateau?
    (Think about what information each new query adds)
A3: [TYPE YOUR ANSWER HERE]

Q4: You learned three defensive approaches: query rate limiting,
    output rounding, and label-only responses.
    Rank these from most to least effective and explain your reasoning.
A4: [TYPE YOUR ANSWER HERE]

Q5: Looking back at all four labs, which attack do you think poses
    the greatest risk to organizations deploying AI systems today?
    Justify your answer.
A5: [TYPE YOUR ANSWER HERE]

BONUS: Can you think of a scenario where a model extraction attack
       could actually be used for GOOD (ethically justified reasons)?
BONUS ANSWER: [TYPE YOUR ANSWER HERE]
"""

with open('../outputs/Lab4_Reflection.txt', 'w') as f:
    f.write(reflection)

print("Reflection saved to outputs/Lab4_Reflection.txt")
print(reflection)

---

## ✅ Lab 4 Complete, and So Is the Course!

You have successfully:
- Trained a victim model representing a valuable proprietary system
- Performed a model extraction attack using only API query access
- Measured how query volume affects stolen model quality
- Analyzed the trade-off between attack cost and stolen model fidelity
- Explored defensive countermeasures from the defender's perspective

---

## 🏁 Course Summary: What You Have Learned

You have now performed all four core attack types against machine learning systems:

| Attack | When It Happens | What Is Targeted | Key Tool Used |
|--------|----------------|------------------|---------------|
| Evasion | After deployment | Model inputs | HopSkipJump (ART) |
| Poisoning | During training | Training data | Label Flipping |
| Inference | After deployment | Training data privacy | MembershipInference (ART) |
| Extraction | After deployment | Model IP | CopycatCNN (ART) |

Each of these attacks represents a real threat that AI security practitioners are defending against today. The tools you used, particularly the Adversarial Robustness Toolbox, are the same tools used by researchers at IBM, Microsoft, Google, and security firms around the world.

Return to [START_HERE.ipynb](START_HERE.ipynb) to review your completed labs.

---
*Lab built with the Adversarial Robustness Toolbox (ART)*  
*https://github.com/Trusted-AI/adversarial-robustness-toolbox*