# ADVERSARIAL EXAMPLE GENERATION

Author: Nathan Inkawhich

If you are reading this, hopefully you can appreciate how effective some machine learning models are. Research is constantly pushing ML models to be faster, more accurate, and more efficient. However, an often overlooked aspect of designing and training models is security and robustness, especially in the face of an adversary who wishes to fool the model.

This tutorial will raise your awareness of the security vulnerabilities of ML models and give insight into the hot topic of adversarial machine learning. You may be surprised to find that adding imperceptible perturbations to an image can cause drastically different model performance. Given that this is a tutorial, we will explore the topic via an example on an image classifier. Specifically, we will use one of the first and most popular attack methods, the Fast Gradient Sign Attack (FGSM), to fool an MNIST classifier.

## Threat Model

For context, there are many categories of adversarial attacks, each with a different goal and assumption of the attacker’s knowledge. However, in general the overarching goal is to add the least amount of perturbation to the input data to cause the desired misclassification. There are several kinds of assumptions of the attacker’s knowledge, two of which are: **white-box** and **black-box**. A *white-box* attack assumes the attacker has full knowledge and access to the model, including architecture, inputs, outputs, and weights. A *black-box* attack assumes the attacker only has access to the inputs and outputs of the model, and knows nothing about the underlying architecture or weights. There are also several types of goals, including **misclassification** and **source/target misclassification**. A goal of misclassification means the adversary only wants the output classification to be wrong but does not care what the new classification is. A source/target misclassification means the adversary wants to alter an image that is originally of a specific source class so that it is classified as a specific target class.

In this case, the FGSM attack is a *white-box* attack with the goal of *misclassification*. With this background information, we can now discuss the attack in detail.

## Fast Gradient Sign Attack

One of the first and most popular adversarial attacks to date is referred to as the *Fast Gradient Sign Attack (FGSM)*, introduced as the fast gradient sign method by Goodfellow et al. in [Explaining and Harnessing Adversarial Examples](https://arxiv.org/abs/1412.6572). The attack is remarkably powerful, yet intuitive. It is designed to attack neural networks by leveraging the way they learn: gradients. The idea is simple: rather than working to minimize the loss by adjusting the weights based on the backpropagated gradients, the attack adjusts the input data to maximize the loss based on the same backpropagated gradients. In other words, the attack uses the gradient of the loss w.r.t. the input data, then adjusts the input data to maximize the loss.
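
As a minimal sketch of that idea, assuming a hypothetical single-layer stand-in classifier (`toy_model`) and a random input rather than the tutorial's MNIST model, the gradient of the loss with respect to the input can be obtained by marking the input with `requires_grad=True`:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-in classifier: one linear layer over a flattened 28x28 input
toy_model = nn.Linear(28 * 28, 10)

# A random "image" in [0, 1]; requires_grad=True asks autograd to track the input
x = torch.rand(1, 28 * 28, requires_grad=True)
y = torch.tensor([3])  # arbitrary label, chosen only for illustration

# Forward pass and loss, exactly as in training
loss = F.cross_entropy(toy_model(x), y)

# Backward pass: besides the weight gradients, this fills x.grad
toy_model.zero_grad()
loss.backward()
input_grad = x.grad.detach()  # same shape as x

# Step the input (not the weights) in the direction that increases the loss
x_adv = (x.detach() + 0.1 * input_grad.sign()).clamp(0, 1)
loss_adv = F.cross_entropy(toy_model(x_adv), y)

print("loss before:", loss.item(), "loss after:", loss_adv.item())
```

The step size 0.1 and the label 3 are arbitrary choices for this illustration. The weights of `toy_model` are never updated; only the input changes.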

Before we jump into the code, let’s look at the famous [FGSM](https://arxiv.org/abs/1412.6572) panda example and extract some notation.

![image.png](attachment:image.png)

From the figure, $\textbf{x}$ is the original input image correctly classified as a “panda”, $y$ is the ground truth label for $\textbf{x}$, $θ$ represents the model parameters, and $J(θ,\textbf{x},y)$ is the loss that is used to train the network. The attack backpropagates the gradient to the input data to calculate $∇_{x}J(θ,\textbf{x},y)$. Then, it adjusts the input data by a small step ($ϵ$, which is 0.007 in the picture) in the direction (i.e. $sign(∇_{x}J(θ,\textbf{x},y))$) that will maximize the loss. The resulting perturbed image, $\textbf{x}′$, is then misclassified by the target network as a “gibbon” when it is still clearly a “panda”.
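
Written as a single update in this notation (a restatement of the description above, not an additional result):

$$\textbf{x}′ = \textbf{x} + ϵ \cdot \text{sign}(∇_{x}J(θ,\textbf{x},y))$$

Because each entry of the sign term is $-1$, $0$, or $+1$, every pixel changes by at most $ϵ$, i.e. $\|\textbf{x}′ - \textbf{x}\|_{\infty} \le ϵ$. For example, with $ϵ = 0.007$, a pixel with value $0.500$ and a positive gradient entry becomes $0.507$.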

Hopefully the motivation for this tutorial is now clear, so let’s jump into the implementation.

In [1]:
from __future__ import print_function
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import numpy as np
import matplotlib.pyplot as plt

## Implementation

In this section, we will discuss the input parameters for the tutorial, define the model under attack, then code the attack and run some tests.

### Inputs

There are only three inputs for this tutorial, defined as follows:

* **epsilons** - List of epsilon values to use for the run. It is important to keep 0 in the list because it represents the model performance on the original test set. Also, intuitively we would expect the larger the epsilon, the more noticeable the perturbations but the more effective the attack in terms of degrading model accuracy. Since the data range here is \[0,1\], no epsilon value should exceed 1.
* **pretrained_model** - path to the pretrained MNIST model which was trained with [pytorch/examples/mnist](https://github.com/pytorch/examples/tree/master/mnist). For simplicity, download the pretrained model [here](https://drive.google.com/drive/folders/1fn83DF14tWmit0RTKWRhPq5uVXt73e0h?usp=sharing).
* **use_cuda** - boolean flag to use CUDA if desired and available. Note, a GPU with CUDA is not critical for this tutorial as a CPU will not take much time.

In [2]:
epsilons = [0, .05, .1, .15, .2, .25, .3]
pretrained_model = "data/mnist_cnn.pt"
use_cuda=True
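
The constraints on **epsilons** described above can be checked up front. The following is only a sketch; `check_epsilons` is a hypothetical helper that is not part of the tutorial or of PyTorch:

```python
# Same values as the tutorial's input cell
epsilons = [0, .05, .1, .15, .2, .25, .3]


def check_epsilons(eps_list):
    """Hypothetical helper: validate an epsilon list for data scaled to [0, 1]."""
    # Epsilon 0 leaves the input untouched, so it gives the clean-accuracy baseline
    if 0 not in eps_list:
        raise ValueError("include 0 to measure accuracy on the original test set")
    # A per-pixel step larger than the data range is never meaningful
    if any(e < 0 or e > 1 for e in eps_list):
        raise ValueError("epsilon values must lie in [0, 1]")
    return sorted(eps_list)


checked = check_epsilons(epsilons)
```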

### Model Under Attack

As mentioned, the model under attack is the same MNIST model from [pytorch/examples/mnist](https://github.com/pytorch/examples/tree/master/mnist). You may train and save your own MNIST model or you can download and use the provided model. The *Net* definition and test dataloader here have been copied from the MNIST example. The purpose of this section is to define the model and dataloader, then initialize the model and load the pretrained weights.

In [3]:
%%bash
python pretrained_minst.py --save-model


Test set: Average loss: 0.0499, Accuracy: 9836/10000 (98%)
Test set: Average loss: 0.0386, Accuracy: 9866/10000 (99%)
Test set: Average loss: 0.0310, Accuracy: 9899/10000 (99%)
Test set: Average loss: 0.0309, Accuracy: 9895/10000 (99%)
Test set: Average loss: 0.0298, Accuracy: 9901/10000 (99%)
Test set: Average loss: 0.0289, Accuracy: 9909/10000 (99%)
Test set: Average loss: 0.0283, Accuracy: 9912/10000 (99%)
Test set: Average loss: 0.0277, Accuracy: 9911/10000 (99%)
Test set: Average loss: 0.0268, Accuracy: 9913/10000 (99%)
Test set: Average loss: 0.0263, Accuracy: 9915/10000 (99%)
Test set: Average loss: 0.0270, Accuracy: 9918/10000 (99%)
Test set: Average loss: 0.0263, Accuracy: 9916/10000 (99%)
Test set: Average loss: 0.0265, Accuracy: 9916/10000 (99%)
Test set: Average loss: 0.0261, Accuracy: 9919/10000 (99%)



In [6]:
# Model definition, copied from pytorch/examples/mnist
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout2d(0.25)
        self.dropout2 = nn.Dropout2d(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

# MNIST Test dataset and dataloader declaration
test_loader = torch.utils.data.DataLoader(
    datasets.MNIST('./data', train=False, download=True,
                   transform=transforms.Compose([transforms.ToTensor()])),
    batch_size=1, shuffle=True)

# Define what device we are using
print("CUDA Available: ",torch.cuda.is_available())
device = torch.device("cuda" if (use_cuda and torch.cuda.is_available()) else "cpu")

# Initialize the network
model = Net().to(device)

# Load the pretrained model
model.load_state_dict(torch.load(pretrained_model, map_location='cpu'))

# Set the model in evaluation mode. In this case this is for the Dropout layers
model.eval()

CUDA Available:  True


Net(
  (conv1): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1))
  (conv2): Conv2d(32, 64, kernel_size=(3, 3), stride=(1, 1))
  (dropout1): Dropout2d(p=0.25, inplace=False)
  (dropout2): Dropout2d(p=0.5, inplace=False)
  (fc1): Linear(in_features=9216, out_features=128, bias=True)
  (fc2): Linear(in_features=128, out_features=10, bias=True)
)
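
The `in_features=9216` of `fc1` follows from the layer shapes: each 3x3 convolution without padding shrinks a side by 2 (28 to 26 to 24), the 2x2 max pool halves it to 12, and 64 channels times 12 times 12 gives 9216. A small sketch that traces these shapes, using freshly initialized layers with the same sizes as `Net` rather than the pretrained weights:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Same layer shapes as the convolutional part of Net (untrained weights)
conv1 = nn.Conv2d(1, 32, 3, 1)
conv2 = nn.Conv2d(32, 64, 3, 1)

x = torch.zeros(1, 1, 28, 28)     # one MNIST-sized image: 1 channel, 28x28
x = F.relu(conv1(x))              # 3x3 conv, stride 1, no padding: 28 -> 26
shape_after_conv1 = tuple(x.shape)
x = F.relu(conv2(x))              # 26 -> 24
x = F.max_pool2d(x, 2)            # 2x2 pooling: 24 -> 12
shape_after_pool = tuple(x.shape)
flat = torch.flatten(x, 1)        # 64 * 12 * 12 = 9216 features per image
flat_shape = tuple(flat.shape)

print(shape_after_conv1, shape_after_pool, flat_shape)
# prints: (1, 32, 26, 26) (1, 64, 12, 12) (1, 9216)
```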

### FGSM Attack

Now, we can define the function that creates the adversarial examples by perturbing the original inputs. The ``fgsm_attack`` function takes three inputs: *image* is the original clean image ($x$), *epsilon* is the pixel-wise perturbation amount ($ϵ$), and *data_grad* is the gradient of the loss w.r.t. the input image ($∇_{x}J(θ,\textbf{x},y)$). The function then creates the perturbed image as

<center>$perturbed\_image = image + epsilon ∗ sign(data\_grad) = x + ϵ ∗ sign(∇_{x}J(θ,\textbf{x},y))$</center>

Finally, in order to maintain the original range of the data, the perturbed image is clipped to range \[0,1\].

In [7]:
# FGSM attack code
def fgsm_attack(image, epsilon, data_grad):
    # Collect the element-wise sign of the data gradient
    sign_data_grad = data_grad.sign()
    # Create the perturbed image by adjusting each pixel of the input image
    perturbed_image = image + epsilon*sign_data_grad
    # Adding clipping to maintain [0,1] range
    perturbed_image = torch.clamp(perturbed_image, 0, 1)
    # Return the perturbed image
    return perturbed_image
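
To see the sign step and the clipping in isolation, here is a small usage sketch on a four-pixel toy tensor. The pixel and gradient values are made up for illustration, and the function is repeated in compact form so the snippet runs on its own:

```python
import torch


def fgsm_attack(image, epsilon, data_grad):
    # Same logic as the function above, condensed into one expression
    return torch.clamp(image + epsilon * data_grad.sign(), 0, 1)


image = torch.tensor([0.00, 0.50, 0.95, 0.20])
data_grad = torch.tensor([-2.0, 0.3, 4.0, 0.0])  # made-up gradient values

perturbed = fgsm_attack(image, 0.1, data_grad)

# Pixel 0: 0.00 - 0.1 would be negative, so it is clipped to 0
# Pixel 1: 0.50 + 0.1 gives roughly 0.6 (the gradient magnitude 0.3 is ignored)
# Pixel 2: 0.95 + 0.1 would exceed 1, so it is clipped to 1
# Pixel 3: the gradient is exactly 0, sign(0) is 0, so the pixel is unchanged
print(perturbed)
```

Only the sign of each gradient entry matters, so the entries 0.3 and 4.0 produce the same step of `epsilon`.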