# Adversarial Attacks and Adversarial Training

In this notebook, you will implement both a common adversarial attack that can "fool" models into making mistakes and a simple defense against such attacks.

## Loading the dataset

We will be using the MNIST image dataset, which you are already familiar with. The cell below is already implemented for you. It loads the training and test data and preprocesses it by normalizing the pixel values to the range `[0, 1]`.

In [None]:
import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt

# Load MNIST dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Normalize images to range [0, 1]
x_train, x_test = x_train.astype(np.float32) / 255.0, x_test.astype(np.float32) / 255.0

# Expand dimensions for channel consistency (28x28 -> 28x28x1)
x_train = np.expand_dims(x_train, axis=-1)
x_test = np.expand_dims(x_test, axis=-1)

# Convert labels to categorical
y_train = tf.keras.utils.to_categorical(y_train, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

# Convert to TensorFlow datasets
train_ds = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(10000).batch(64)
test_ds = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(64)

## Defining the neural network

The cell below defines and trains a small neural network that will act as the target for adversarial attacks throughout this notebook.

After five epochs of training, the accuracy on the training data should be around 97%.

In [None]:
def create_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28, 1)),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model

# Create and train the model
model = create_model()
model.fit(train_ds, epochs=5, validation_data=test_ds)

# Evaluate the model
loss, accuracy = model.evaluate(test_ds)
print(f"Clean Accuracy: {accuracy * 100:.2f}%")

## FGSM attack

In the following, we ask you to implement the FGSM attack.
The Fast Gradient Sign Method (FGSM) is an adversarial attack algorithm used to generate adversarial examples for neural networks. It was introduced by [Goodfellow et al. in 2014](https://arxiv.org/abs/1412.6572). The goal of FGSM is to perturb an input to a model in such a way that the model makes a mistake, while keeping the perturbation small enough to be imperceptible to humans.

Here is the FGSM algorithm:

1. **Input:**
   - A trained model $N$.
   - An input example $x$ and its true label $y$.
   - A loss function $L(y, N(x))$ that compares the true label $y$ with the prediction $N(x)$ .
   - A small perturbation parameter $\epsilon$.

2. **Output:**
   - An adversarial example $x'$.

3. **Algorithm:**
   - Compute the gradient of the loss with respect to the input $x$: $\nabla_x L(y, N(x))$.
   - Create the adversarial example by adding a small perturbation in the direction of the gradient's sign:
     $$
     x' = \text{clip}(x + \epsilon \cdot \text{sign}(\nabla_x L(y, N(x)))).
     $$
   - Here, the function $\text{clip}$ clips the resulting example $x'$ to ensure it is within the valid range for input values (e.g., pixel values between 0 and 255 for raw images, or between 0 and 1 for the normalized images used in this notebook).

The perturbation $\epsilon \cdot \text{sign}(\nabla_x L(y, N(x)))$ is designed to increase the loss, thereby making the model more likely to misclassify the input. Taking the sign of the gradient bounds the perturbation uniformly: each input dimension changes by at most $\epsilon$, so $\|x' - x\|_\infty \le \epsilon$.

This simple yet effective method highlights the vulnerability of neural networks to adversarial examples and has inspired further research into more sophisticated attack and defense mechanisms.
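
To make the update rule concrete, here is a minimal NumPy sketch on a hypothetical two-class logistic-regression model, chosen because its input gradient has a closed form. The weights `w`, the bias `b`, the input `x`, and the helper names are illustrative assumptions and are not part of this notebook's MNIST setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(x, y, w, b):
    """Binary cross-entropy of the toy model N(x) = sigmoid(w . x + b)."""
    p = sigmoid(np.dot(w, x) + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def input_gradient(x, y, w, b):
    """Closed-form gradient of the loss with respect to the input x."""
    p = sigmoid(np.dot(w, x) + b)
    return (p - y) * w

def fgsm_toy(x, y, w, b, epsilon, low=0.0, high=1.0):
    """x' = clip(x + epsilon * sign(grad_x L)), clipped to [low, high]."""
    grad = input_gradient(x, y, w, b)
    return np.clip(x + epsilon * np.sign(grad), low, high)

# Hypothetical toy model and input (values chosen only for illustration).
w = np.array([2.0, -3.0, 1.0])
b = 0.1
x = np.array([0.6, 0.2, 0.5])
y = 1.0

x_adv = fgsm_toy(x, y, w, b, epsilon=0.2)
print("clean loss:      ", bce_loss(x, y, w, b))
print("adversarial loss:", bce_loss(x_adv, y, w, b))
print("max |x' - x|:    ", np.max(np.abs(x_adv - x)))
```

In this toy setting the adversarial loss is larger than the clean loss, while no coordinate of the input moves by more than `epsilon`.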

## Implementing FGSM Attacks

To implement this algorithm, you need a few advanced TensorFlow utilities.
Remember that we are searching for a perturbation of the input, in this case an image, that causes the model to misclassify the input. For this, we have to compute the gradient of the loss function with respect to the input image. So far, we have always abstracted away the part of computing gradients and optimizing the loss function behind a call to `model.fit`. Now, we will introduce you to a more direct approach.

### Computing gradients with TensorFlow

In TensorFlow, we can record operations for automatic differentiation using the `tf.GradientTape` context manager. You can construct a gradient tape using `with ... as ...:`:

```python
with tf.GradientTape() as tape:
    # Do something here
    # ...
```

Inside the context manager, the tape records the operations applied to every tensor it watches. Trainable `tf.Variable`s are watched automatically; any other tensor, such as a `tf.constant`, has to be registered with `GradientTape.watch`. Once you have recorded all the operations that you are interested in, you can compute a gradient using `GradientTape.gradient`. `GradientTape.gradient` takes as its first argument the tensor you want to differentiate, followed by the tensor or variable with respect to which you wish to compute the gradient.

Let's look at an example. Say you want to compute the gradient of the function `y = x * x` for `x = 42`. Using a gradient tape, you would do the following:

```python
# Define x:
x = tf.constant(42.0)
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x
# Compute the gradient.
dy_dx = tape.gradient(y, x)
```
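
A common pitfall is forgetting to watch a plain tensor. The short sketch below contrasts the two cases; the variable names `watched_grad` and `unwatched_grad` are only illustrative.

```python
import tensorflow as tf

x = tf.constant(42.0)

# Case 1: x is watched, so the tape can differentiate y = x * x.
with tf.GradientTape() as tape:
    tape.watch(x)
    y = x * x
watched_grad = tape.gradient(y, x)
print(watched_grad.numpy())  # 84.0, since dy/dx = 2 * x

# Case 2: x is a constant that was never watched, so nothing is recorded.
with tf.GradientTape() as tape:
    y = x * x
unwatched_grad = tape.gradient(y, x)
print(unwatched_grad)  # None
```

If a gradient unexpectedly comes back as `None`, a missing `tape.watch` call is one of the first things to check.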

You can learn more about `tf.GradientTape` on its [documentation page](https://www.tensorflow.org/api_docs/python/tf/GradientTape).
To get familiar with `tf.GradientTape`, use it below to compute the gradient of `cos(x * x)` at `x = 42`.

In [None]:
x = tf.constant(42.0)
with tf.GradientTape() as tape:
    # Watch x and compute cos(x*x) here (you can use tf.cos).
    ...

# Compute the gradient:
dy_dx = ...

# This should yield something close to 84.
dy_dx

### Computing gradients of neural networks

Let's look at a more complex example: Computing the gradient of the loss with `GradientTape`. We showcase this below.

1. You take an image, its corresponding label, and the neural network.
2. The gradient tape is used to record the calculation of the prediction by the neural network.
3. The gradient of the loss is then computed with respect to the given image, not with respect to the model parameters.

In [None]:
def compute_gradient_network(image, label, model):
    image = tf.convert_to_tensor(image, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(image)
        prediction = model(image)
        loss = tf.keras.losses.categorical_crossentropy(label, prediction)
    return tape.gradient(loss, image)
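
As a quick sanity check of this pattern, the following sketch computes an input gradient for a small untrained stand-in network and confirms that the gradient has the same shape as the input batch. The model `toy_model`, the random batch, and the helper `gradient_wrt_input` are hypothetical and used only for illustration.

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in network (untrained), only used to illustrate shapes.
toy_model = tf.keras.Sequential([
    tf.keras.Input(shape=(28, 28, 1)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),
])

def gradient_wrt_input(images, labels, model):
    """Gradient of the cross-entropy loss with respect to the input images."""
    images = tf.convert_to_tensor(images, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(images)  # inputs are plain tensors, so watch them explicitly
        predictions = model(images)
        loss = tf.keras.losses.categorical_crossentropy(labels, predictions)
    return tape.gradient(loss, images)

# Random batch of 4 "images" with one-hot labels for the classes 0 to 3.
rng = np.random.default_rng(0)
images = rng.random((4, 28, 28, 1)).astype(np.float32)
labels = tf.one_hot([0, 1, 2, 3], depth=10)

grads = gradient_wrt_input(images, labels, toy_model)
print(grads.shape)  # one gradient entry per input pixel
```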

### Clipping

We also need the function `tf.clip_by_value` from TensorFlow. When you apply the perturbation to the image, it is possible that the perturbation pushes some pixel values outside of their defined range (e.g. smaller than `0.0` or larger than `1.0`). To counteract this, we have to clip the perturbed image back to its defined range. We do this using `tf.clip_by_value`, which clips tensor values to a specified minimum and maximum. `clip_by_value` has the following signature:
```python
tf.clip_by_value(
    t, clip_value_min, clip_value_max, name=None
)
```
`t` is the tensor whose values you wish to clip, `clip_value_min` is the minimum value of the desired value range, and `clip_value_max` is its maximum.

So, if you want to clip a tensor `x` to the range `[-1, 1]`, you can do this by calling:

```python
tf.clip_by_value(x, -1, 1)
```
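
For the `[0, 1]` pixel range used in this notebook, a small sketch with made-up pixel values and a made-up perturbation shows the effect. Only values that leave the range are changed.

```python
import numpy as np
import tensorflow as tf

# Hypothetical pixel values and perturbation, chosen for illustration.
pixels = tf.constant([0.0, 0.1, 0.95, 1.0])
perturbation = tf.constant([-0.2, 0.2, 0.2, -0.2])

perturbed = pixels + perturbation                # approx. [-0.2, 0.3, 1.15, 0.8]
clipped = tf.clip_by_value(perturbed, 0.0, 1.0)  # approx. [ 0.0, 0.3, 1.0,  0.8]

print(perturbed.numpy())
print(clipped.numpy())
```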
---

With these tools, you have everything you need to implement `fgsm_attack` below. Use `compute_gradient_network` from above.

In [None]:
def fgsm_attack(image, label, model, epsilon=0.2):
    # Implement the FGSM attack here and return an adversarial image.
    # ...
    adversarial_image = ...
    return adversarial_image

def evaluate_adversarial_examples(model, dataset, epsilon=0.2):
    total, correct = 0, 0
    for images, labels in dataset:
        adv_images = fgsm_attack(images, labels, model, epsilon)
        predictions = model.predict(adv_images)
        correct += np.sum(np.argmax(predictions, axis=1) == np.argmax(labels.numpy(), axis=1))
        total += labels.shape[0]
    print(f"Adversarial Accuracy (ε={epsilon}): {100 * correct / total:.2f}%")

evaluate_adversarial_examples(model, test_ds, epsilon=0.2)

In the cell below, we test your FGSM implementation on a few sample images.
The plot below displays the model's predictions for the original images in the first row, and the predictions for the perturbed images in the second one.

In [None]:
# Select a sample batch of test images
sample_images, sample_labels = next(iter(test_ds))
sample_images = sample_images[:5]  # Take 5 images
sample_labels = sample_labels[:5]

# Generate adversarial examples
adv_sample = fgsm_attack(sample_images, sample_labels, model, epsilon=0.2)

# Get predictions for both original and adversarial images
original_preds = model.predict(sample_images)
adversarial_preds = model.predict(adv_sample)

# Convert predictions to class labels
original_labels = np.argmax(original_preds, axis=1)
adversarial_labels = np.argmax(adversarial_preds, axis=1)

# Plot original and adversarial images with predicted labels
fig, axes = plt.subplots(2, 5, figsize=(10, 5))

for i in range(5):
    # Plot original image
    axes[0, i].imshow(sample_images[i].numpy().squeeze(), cmap="gray")
    axes[0, i].set_title(f"Orig: {original_labels[i]}")
    axes[0, i].axis("off")

    # Plot adversarial image
    axes[1, i].imshow(adv_sample[i].numpy().squeeze(), cmap="gray")
    axes[1, i].set_title(f"Adv: {adversarial_labels[i]}")
    axes[1, i].axis("off")

plt.show()


Next, we ask you to implement adversarial training to defend the model against adversarial attacks.

To refresh your memory, adversarial training works as follows:

1. Sample a batch of images and their labels.
2. Perturb the images using an adversarial attack. Use `fgsm_attack` from above for this!
3. Obtain predictions for the perturbed images.
4. Obtain predictions for the unperturbed images.
5. Compute the loss for both the perturbed and unperturbed images, and average them.
6. Compute the gradient of the averaged loss with respect to the model parameters and use an optimizer to update the parameters.
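
The six steps can be sketched end to end on a toy problem. The example below is a rough NumPy illustration under simplifying assumptions, not the TensorFlow solution for this notebook: the model is a hypothetical logistic regression with closed-form gradients, the data are synthetic 2-D points, the "batch" is the whole dataset, and plain gradient descent stands in for the optimizer. All names (`loss_and_grads`, `fgsm_batch`, and so on) are made up for this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_and_grads(X, y, w, b):
    """Mean binary cross-entropy and its gradients w.r.t. w and b."""
    p = sigmoid(X @ w + b)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad_w = X.T @ (p - y) / len(y)
    grad_b = np.mean(p - y)
    return loss, grad_w, grad_b

def fgsm_batch(X, y, w, b, epsilon):
    """FGSM using the closed-form input gradient of logistic regression."""
    p = sigmoid(X @ w + b)
    grad_X = (p - y)[:, None] * w[None, :]
    return np.clip(X + epsilon * np.sign(grad_X), 0.0, 1.0)

# Hypothetical toy data: two clusters of 2-D points with values in [0, 1].
rng = np.random.default_rng(0)
X = np.clip(np.vstack([rng.normal(0.3, 0.05, size=(100, 2)),
                       rng.normal(0.7, 0.05, size=(100, 2))]), 0.0, 1.0)
y = np.concatenate([np.zeros(100), np.ones(100)])

w, b = np.zeros(2), 0.0
learning_rate, epsilon = 1.0, 0.1
losses = []
for step in range(200):
    # Step 1: the "batch" is the full toy dataset.
    # Step 2: perturb the inputs with the attack.
    X_adv = fgsm_batch(X, y, w, b, epsilon)
    # Steps 3-5: losses on perturbed and clean inputs, then their average.
    loss_adv, gw_adv, gb_adv = loss_and_grads(X_adv, y, w, b)
    loss_clean, gw_clean, gb_clean = loss_and_grads(X, y, w, b)
    losses.append(0.5 * (loss_clean + loss_adv))
    # Step 6: gradient step on the averaged loss (plain gradient descent).
    w = w - learning_rate * 0.5 * (gw_clean + gw_adv)
    b = b - learning_rate * 0.5 * (gb_clean + gb_adv)

print(f"averaged loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Note that the adversarial examples are regenerated in every iteration with the current parameters and are then treated as fixed inputs when the parameter gradient is computed.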

Take inspiration from the `standard_training` function below. `standard_training` shows you how to implement the standard neural network training loop with a `GradientTape`. Here are a few important points to take away:
- You have to call `model` with `training=True` when making a prediction. This runs layers such as dropout or batch normalization in training mode. The model parameters are trainable variables, so the gradient tape watches them automatically, without an explicit call to `tape.watch`.
- To compute the gradient of the loss function with respect to the model's parameters, we have to pass them as the second argument to `tape.gradient`. The model parameters are stored in `model.trainable_variables`.
- To use the gradients with the optimizer, use `optimizer.apply_gradients`. This function expects a list of tuples, where the first element is the gradient, and the second element is the variable with respect to which the gradient was computed. Such a list can be created using the `zip` function as shown below.
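
These points can be seen in isolation on a toy problem. The sketch below minimizes $(w - 3)^2$ for a single hypothetical variable `w`, which stands in for `model.trainable_variables`; it is an illustrative assumption and not part of the notebook's model.

```python
import tensorflow as tf

# Hypothetical single-parameter "model": minimize (w - 3)^2.
w = tf.Variable(0.0)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

for step in range(100):
    with tf.GradientTape() as tape:
        # No tape.watch(w) needed: trainable variables are watched automatically.
        loss = (w - 3.0) ** 2
    # tape.gradient returns one gradient per variable in the list.
    gradients = tape.gradient(loss, [w])
    # zip pairs each gradient with its variable, as apply_gradients expects.
    optimizer.apply_gradients(zip(gradients, [w]))

print(w.numpy())  # close to 3.0
```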

In [None]:
def standard_training(model, train_ds, epochs=5):
    optimizer = tf.keras.optimizers.Adam()
    loss_fn = tf.keras.losses.CategoricalCrossentropy()
    for epoch in range(epochs):
        print(f"Epoch {epoch + 1}/{epochs}")
        for images, labels in train_ds:
            with tf.GradientTape() as tape:
                # Compute predictions using the current parameters. Note the
                # `training=True` flag.
                predictions = model(images, training=True)
                loss = loss_fn(labels, predictions)
            # Compute the gradient of the loss with respect to the
            # model parameters (stored in model.trainable_variables).
            gradients = tape.gradient(loss, model.trainable_variables)
            # Use the computed gradients with the optimizer. Note that
            # apply_gradients expects (gradient, variable) pairs, which
            # zip creates from the two lists.
            optimizer.apply_gradients(zip(gradients, model.trainable_variables))

standard_model = create_model()
standard_training(standard_model, train_ds, epochs=5)

print("Standard Model Accuracy on Clean Images:")
loss, accuracy = standard_model.evaluate(test_ds)
print(f"Clean Accuracy: {accuracy * 100:.2f}%")

With all of the above in mind, go ahead and implement the missing parts of `adversarial_training` below.

In [None]:
def adversarial_training(model, train_ds, epsilon=0.2, epochs=5):
    optimizer = tf.keras.optimizers.Adam()
    for epoch in range(epochs):
        print(f"Epoch {epoch + 1}/{epochs}")
        for images, labels in train_ds:
            # Implement adversarial training here. Take inspiration from above.
            ...
        evaluate_adversarial_examples(model, test_ds, epsilon)

# Create and train a robust model
robust_model = create_model()
adversarial_training(robust_model, train_ds, epsilon=0.2, epochs=5)

print("Robust Model Accuracy on Clean Images:")
loss, accuracy = robust_model.evaluate(test_ds)
print(f"Clean Accuracy: {accuracy * 100:.2f}%")

print("Robust Model Accuracy on Adversarial Examples:")
evaluate_adversarial_examples(robust_model, test_ds, epsilon=0.2)

Now, let's see whether our new model is more robust against the FGSM attack.

In [None]:
# Select a sample batch of test images
sample_images, sample_labels = next(iter(test_ds))
sample_images = sample_images[:5]  # Take 5 images
sample_labels = sample_labels[:5]

# Generate adversarial examples
adv_sample = fgsm_attack(sample_images, sample_labels, robust_model, epsilon=0.2)

# Get predictions for both original and adversarial images
original_preds = robust_model.predict(sample_images)
adversarial_preds = robust_model.predict(adv_sample)

# Convert predictions to class labels
original_labels = np.argmax(original_preds, axis=1)
adversarial_labels = np.argmax(adversarial_preds, axis=1)

# Plot original and adversarial images with predicted labels
fig, axes = plt.subplots(2, 5, figsize=(10, 5))

for i in range(5):
    # Plot original image
    axes[0, i].imshow(sample_images[i].numpy().squeeze(), cmap="gray")
    axes[0, i].set_title(f"Orig: {original_labels[i]}")
    axes[0, i].axis("off")

    # Plot adversarial image
    axes[1, i].imshow(adv_sample[i].numpy().squeeze(), cmap="gray")
    axes[1, i].set_title(f"Adv: {adversarial_labels[i]}")
    axes[1, i].axis("off")

plt.show()
