## White-box attack exercise

In this exercise, you will implement the following white-box attacks.
1. Fast Gradient Sign Method (FGSM)
2. Projected Gradient Descent (PGD)

In [None]:
from google.colab import drive

drive.mount('/content/drive', force_remount=True)

# enter the foldername in your Drive where you have saved the unzipped
# 'attacks', 'datasets', and 'pretrained' folders
FOLDERNAME = '2025DL/hw6'

assert FOLDERNAME is not None, "[!] Enter the foldername."

%cd /content/drive/MyDrive/$FOLDERNAME


In [None]:
!pip install gdown
%mkdir pretrained
%cd pretrained
!gdown --fuzzy https://drive.google.com/file/d/1lA87UyuGpUiUCytKhwva_JnU_l-luPiX/view?usp=sharing
%cd ..

In [None]:
import math
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.optim as optim

import torchvision.datasets as dset
import torchvision.transforms as T
import numpy as np
from time import time

from cifar10_input import CIFAR10Data

%load_ext autoreload
%autoreload 2

You can **use a GPU by setting the flag below to True**. If CUDA is not available on your machine, `torch.cuda.is_available()` returns False and this notebook falls back to CPU mode.

The global variable `device` controls where the model and tensors are placed throughout this assignment.

In [None]:
USE_GPU = True

if USE_GPU and torch.cuda.is_available():
    device = torch.device('cuda')
else:
    device = torch.device('cpu')

# Constant to control how frequently we print train loss
print_every = 100
print('using device:', device)

## Loading the CIFAR-10 test dataset

In [None]:
from mean_std import mean_torch, std_torch

mean_torch = mean_torch.to(device=device)
std_torch = std_torch.to(device=device)

# Transform the test set to pytorch Tensor without augmentation
transform_test = T.Compose([
    T.ToTensor(),
])

cifar10_test = dset.CIFAR10('./datasets', train=False, download=True, 
                            transform=transform_test)

Let's visualize some of the test data.

In [None]:
import matplotlib.pyplot as plt

def visualize(images, labels):
    classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
    for i in range(len(images)):
        plt.subplot(2, (len(images) + 1) // 2, i + 1)
        sample_image = images[i]
        sample_label = labels[i]
        plt.imshow(sample_image.astype('uint8'))
        plt.axis('off')
        plt.title(classes[sample_label])
    plt.show()

In [None]:
test_samples = [cifar10_test.data[i] for i in range(10)]
test_labels = [cifar10_test.targets[i] for i in range(10)]
print("Test Data:")
visualize(test_samples, test_labels)

In this exercise, we use a PreActResNet18 model ([arxiv](https://arxiv.org/abs/1603.05027)), which is a variant of ResNet.

In [None]:
from models import resnet50 as resnet

Next, we create the model, passing in the dataset mean and standard deviation.
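One common way to attach normalization statistics to a model is to wrap the network so that raw pixel inputs are normalized inside `forward`. The sketch below only illustrates that idea. The class name `NormalizedModel`, the statistics, and the tiny backbone are hypothetical, and the `models.resnet50` used in this assignment may be organized differently.

```python
import torch
import torch.nn as nn


class NormalizedModel(nn.Module):
    """Hypothetical wrapper: normalizes raw pixel inputs before the backbone."""

    def __init__(self, backbone, mean, std):
        super().__init__()
        self.backbone = backbone
        # Buffers move with the module when .to(device) is called.
        self.register_buffer('mean', mean.view(1, -1, 1, 1))
        self.register_buffer('std', std.view(1, -1, 1, 1))

    def forward(self, x):
        # x has shape (N, C, H, W) and holds raw pixel values.
        return self.backbone((x - self.mean) / self.std)


# Illustrative statistics on the [0, 255] scale (not the assignment's values).
mean = torch.tensor([125.0, 123.0, 114.0])
std = torch.tensor([63.0, 62.0, 67.0])
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
wrapped = NormalizedModel(backbone, mean, std)

images = torch.randint(0, 256, (4, 3, 32, 32)).float()
logits = wrapped(images)
print(logits.shape)  # torch.Size([4, 10])
```

Keeping normalization inside the model means an attack can work directly in pixel space, where the $\epsilon$ bound and the valid range $[0, 255]$ are easy to enforce.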

In [None]:
# Create a model
print('Creating a ResNet model')
model = resnet(mean_torch, std_torch).to(device)

# Load a naturally-trained model
print('Loading pre-trained model')
state_dict = torch.load('./pretrained/vanilla.pt', map_location='cpu')
model.load_state_dict(state_dict)

## Evaluating the model

Before implementing the attack methods, we evaluate the model for two reasons.
1. To check that the model was restored successfully.
2. To collect samples that are correctly classified. There is no need to attack samples that are already misclassified.

Note that the indices of the first 100 correctly classified samples are stored in a variable named `correct_indices`. You will use it later.

In [None]:
def evaluate(model, dataset, indices, attack_method=None):
    """
    Given the data specified by the indices, evaluate the model.
    
    Args:
        model: PyTorch model
        dataset: CIFAR-10 test dataset
        indices: Indices that specify which samples to evaluate
        attack_method (optional): Attack instance. If it is not None, the attack is applied to each batch
            before evaluation.
    
    Returns:
        is_correct: Array of 0s and 1s; 1 if the i-th image was correctly predicted and 0 otherwise
    """
    model.eval()
    
    is_correct = np.zeros([0], np.int32)
    num_images = len(indices)
    batch_size = 100
    num_batches = int(math.ceil(num_images / batch_size))
    
    # Run batches
    for batch in range(num_batches):
        # Construct batch
        bstart = batch * batch_size
        bend = min(bstart + batch_size, num_images)
        
        image_batch = dataset.data[indices[bstart:bend]]
        image_batch = torch.Tensor(np.transpose(image_batch, (0, 3, 1, 2))).to(device=device)

        label_batch = np.array(dataset.targets)[indices[bstart:bend]]
        label_batch = torch.Tensor(label_batch)
        label_batch = label_batch.to(dtype=torch.int64)
        
        # Attack batch
        if attack_method is not None:
            image_batch = attack_method.perturb(image_batch, label_batch)
            
        # Evaluate batch
        logit = model(image_batch)
        _, predicted = torch.max(logit.data, 1)
        
        correct_prediction = (predicted.cpu().numpy() == label_batch.numpy())
        is_correct = np.concatenate([is_correct, correct_prediction], axis=0)
    
    return is_correct



In [None]:
print('Evaluating naturally-trained model')
is_correct = evaluate(model, cifar10_test, np.arange(0, 1000))

print('Accuracy: {:.1f}%'.format(sum(is_correct) / len(is_correct) * 100))

correct_indices = np.where(is_correct == 1)[0][:100]

## Fast Gradient Sign Method (FGSM)

Now, you will implement the Fast Gradient Sign Method under an $\ell_{\infty}$ constraint, one of the earliest methods for generating adversarial examples, proposed by [Goodfellow et al.](https://arxiv.org/abs/1412.6572). The algorithm is as follows.

<center>$x_{adv} = x + \epsilon \cdot \text{sgn}(\nabla_{x} L(x, y, \theta))$</center>

where $x$ and $y$ are an image and its label, $\theta$ denotes the model parameters, $L$ is a loss function, and $\epsilon$ is the maximum perturbation. Cross-Entropy loss is the usual choice for $L$, but other choices are possible, such as the [Carlini-Wagner loss](https://arxiv.org/abs/1608.04644).

Your code for this section will all be written inside `attacks/fgsm_attack.py`.
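As a rough illustration of the update rule above, here is a minimal sketch of a single FGSM step in PyTorch. It assumes images in the $[0, 255]$ pixel range and Cross-Entropy loss. The function name `fgsm_sketch` and the toy linear model are hypothetical, and the interface expected by `attacks/fgsm_attack.py` may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def fgsm_sketch(model, x, y, epsilon):
    """One signed-gradient step under an L-infinity budget, pixels in [0, 255]."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    grad, = torch.autograd.grad(loss, x)
    # Move every pixel by epsilon in the direction that increases the loss.
    x_adv = x.detach() + epsilon * grad.sign()
    # Clip so the result is still a valid image.
    return x_adv.clamp(0, 255)


torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in classifier
x = torch.randint(0, 256, (2, 3, 32, 32)).float()
y = torch.tensor([3, 7])
x_adv = fgsm_sketch(toy_model, x, y, epsilon=8)
print((x_adv - x).abs().max().item())
```

Clipping to $[0, 255]$ only moves pixels back toward the original image, so the perturbation stays within $\epsilon$ after clipping.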

In [None]:
# First implement Fast Gradient Sign Method.
# Open attacks/fgsm_attack.py and follow instructions in the file.

from attacks.fgsm_attack import FGSMAttack

# Check if your implementation is correct.

# Default attack setting
epsilon = 2
loss_func = 'xent'

# Create an instance of FGSMAttack
fgsm_attack = FGSMAttack(model, epsilon, loss_func, device)

# Run FGSM attack on a sample
dataset = cifar10_test
index = 0
sample_image = np.transpose(dataset.data[correct_indices[index]], (2, 0, 1))
sample_image = np.expand_dims(sample_image, axis=0)
sample_image = torch.Tensor(sample_image)

sample_label = dataset.targets[correct_indices[index]]
sample_label = np.expand_dims(sample_label, axis=0)
sample_label = torch.Tensor(sample_label).to(dtype=torch.int64)

sample_adv_image = fgsm_attack.perturb(sample_image, sample_label)
_, sample_adv_label = torch.max(model(sample_adv_image).data, 1)

sample_image = sample_image.cpu().detach().numpy()
sample_adv_image = sample_adv_image.cpu().detach().numpy()

assert np.amax(np.abs(sample_image - sample_adv_image)) <= epsilon
assert np.amin(sample_adv_image) >= 0
assert np.amax(sample_adv_image) <= 255

# Plot the original image
sample_image = [np.transpose(image, (1, 2, 0)) for image in sample_image]
visualize(sample_image, sample_label)

# Plot the adversarial image
sample_adv_image = [np.transpose(image, (1, 2, 0)) for image in sample_adv_image]
visualize(sample_adv_image, sample_adv_label)


## Evaluating the performance of FGSM with varying $\epsilon$

Now, you will evaluate the performance of FGSM while varying the maximum perturbation $\epsilon \in [2, 4, 6, 8, 10]$. In this section, you will use Cross-Entropy loss as $L$. The procedure is as follows.

1. Given $\epsilon$, create an instance of `FGSMAttack`.
2. Evaluate the attack instance on the samples specified by the variable `correct_indices`.
3. Calculate the attack success rate, which is defined by
<center>$\text{attack success rate}(\%)=\frac{\# \text{ samples that are successfully fooled}}{\# \text{ samples}}\times 100$</center>
4. Repeat steps 1, 2, and 3 for each $\epsilon\in [2, 4, 6, 8, 10]$ and plot the attack success rate against $\epsilon$.

If correctly implemented, the success rate will be 75% or higher at $\epsilon = 8$.
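The success-rate formula in step 3 can be checked on a small made-up example. The array below is hypothetical data, not a result from this notebook. Since every attacked sample was correctly classified before the attack, a sample counts as fooled exactly when its adversarial version is misclassified.

```python
import numpy as np

# Hypothetical evaluation result: 1 = still classified correctly after the attack.
is_correct_after_attack = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])

# A sample is fooled when the prediction on its adversarial image is wrong.
num_fooled = np.sum(1 - is_correct_after_attack)
success_rate = num_fooled / len(is_correct_after_attack) * 100
print('Attack success rate: {:.1f}%'.format(success_rate))  # Attack success rate: 70.0%
```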

In [None]:
criterion = 'xent'
epsilons = [0, 2, 4, 6, 8, 10]
attack_success_rates = []

for epsilon in epsilons:
    fgsm_attack = FGSMAttack(model, epsilon, criterion, device)
    is_correct = evaluate(model, cifar10_test, correct_indices, attack_method=fgsm_attack)
    attack_success_rate = np.mean(1 - is_correct) * 100
    attack_success_rates.append(attack_success_rate)
    print('Epsilon: {}, Attack success rate: {:.1f}%'.format(epsilon, attack_success_rate))

plt.plot(epsilons, attack_success_rates, '-bo', label='FGSM (xent loss)')
plt.ylim(-5, 105)
plt.xticks(epsilons)
plt.yticks(np.arange(0, 110, 10))
plt.xlabel('epsilon')
plt.ylabel('attack success rate')
plt.legend()

## Evaluating the performance of FGSM with Carlini-Wagner loss

In this section, you will evaluate the performance of FGSM using the Carlini-Wagner loss. Repeat the procedure from the previous section and compare the results.

If correctly implemented, the success rate will be 80% or higher at $\epsilon = 8$.
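For reference, one common margin-style formulation of the Carlini-Wagner loss compares the logit of the true class with the largest logit among the other classes. The sketch below is an assumption about what such a loss might look like and is not necessarily the formulation the assignment's attack files expect. The function name `cw_margin_loss` and the parameter `confidence` are hypothetical.

```python
import torch
import torch.nn.functional as F


def cw_margin_loss(logits, labels, confidence=50.0):
    """Margin loss to maximize: it grows as a wrong class overtakes the true class."""
    one_hot = F.one_hot(labels, logits.shape[1]).bool()
    correct_logit = logits[one_hot]  # shape (N,)
    # Mask out the true class before taking the max over the remaining classes.
    wrong_logit = logits.masked_fill(one_hot, float('-inf')).max(dim=1).values
    # The clipping stops rewarding samples that are already fooled by a wide margin.
    return -torch.relu(correct_logit - wrong_logit + confidence).sum()


logits = torch.tensor([[5.0, 1.0, 0.0], [0.0, 2.0, 3.0]])
labels = torch.tensor([0, 1])
print(cw_margin_loss(logits, labels, confidence=0.0).item())  # -4.0
```

In this toy batch the first sample is correct by a margin of 4 and the second is already misclassified, so only the first contributes to the loss.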

In [None]:
criterion = 'cw'
epsilons = [0, 2, 4, 6, 8, 10]
attack_success_rates = []

for epsilon in epsilons:
    fgsm_attack = FGSMAttack(model, epsilon, criterion, device)
    is_correct = evaluate(model, cifar10_test, correct_indices, attack_method=fgsm_attack)
    attack_success_rate = np.mean(1 - is_correct) * 100
    attack_success_rates.append(attack_success_rate)
    print('Epsilon: {}, Attack success rate: {:.1f}%'.format(epsilon, attack_success_rate))

plt.plot(epsilons, attack_success_rates, '-ro', label='FGSM (cw loss)')
plt.ylim(-5, 105)
plt.xticks(epsilons)
plt.yticks(np.arange(0, 110, 10))
plt.xlabel('epsilon')
plt.ylabel('attack success rate')
plt.legend()

**Inline Question**

Which is better, Cross-Entropy loss or Carlini-Wagner loss?

**Your Answer**

None

## Projected Gradient Descent (PGD)

Next, you will implement Projected Gradient Descent under an $\ell_{\infty}$ constraint, proposed by [Madry et al.](https://arxiv.org/abs/1706.06083) and widely regarded as one of the strongest first-order white-box attacks. The algorithm is as follows.

<center>$x^0 = x + \delta, ~~ \delta \sim U(-\epsilon, \epsilon)$</center>
<center>$x^{t+1} = \Pi_{B_{\infty}(x, \epsilon)} [x^{t} + \alpha \cdot \text{sgn}(\nabla_{x} L(x^{t}, y, \theta))]$</center>

where $x$ and $y$ are an image and its label, $L$ is a loss function, $\alpha$ is the step size, $\epsilon$ is the maximum perturbation, $B_{\infty}(x, \epsilon)$ is the $\ell_\infty$ ball of radius $\epsilon$ centered at $x$, and $\Pi_{B_{\infty}(x, \epsilon)}$ denotes projection onto that ball.

Your code for this section will all be written inside `attacks/pgd_attack.py`.
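The iteration above can be sketched in a few lines of PyTorch. This is a minimal illustration assuming images in the $[0, 255]$ pixel range and Cross-Entropy loss. The function name `pgd_sketch` and the toy linear model are hypothetical, and the interface expected by `attacks/pgd_attack.py` may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def pgd_sketch(model, x, y, epsilon, step_size, num_steps):
    """Iterated signed-gradient ascent with projection, pixels in [0, 255]."""
    x = x.clone().detach()
    # Random start inside the epsilon ball around x.
    x_adv = x + torch.empty_like(x).uniform_(-epsilon, epsilon)
    x_adv = x_adv.clamp(0, 255)
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv.detach() + step_size * grad.sign()
        # Project back onto the L-infinity ball, then onto valid pixel values.
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
        x_adv = x_adv.clamp(0, 255)
    return x_adv


torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in classifier
x = torch.randint(0, 256, (2, 3, 32, 32)).float()
y = torch.tensor([3, 7])
x_adv = pgd_sketch(toy_model, x, y, epsilon=8, step_size=2, num_steps=20)
print((x_adv - x).abs().max().item())
```

For the $\ell_\infty$ ball, the projection reduces to an element-wise clamp of each pixel to $[x - \epsilon, x + \epsilon]$, which is what the `torch.min`/`torch.max` pair does.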

In [None]:
# First implement Projected Gradient Descent.
# Open attacks/pgd_attack.py and follow instructions in the file.

from attacks.pgd_attack import PGDAttack

# Check if your implementation is correct.

# Default attack setting
epsilon = 2
step_size = 2
num_steps = 20
criterion = 'xent'

# Create an instance of PGDAttack
pgd_attack = PGDAttack(model, epsilon, step_size, num_steps, criterion, device)

# Run PGD attack on a sample
dataset = cifar10_test
index = 0
sample_image = np.transpose(dataset.data[correct_indices[index]], (2, 0, 1))
sample_image = np.expand_dims(sample_image, axis=0)
sample_image = torch.Tensor(sample_image)

sample_label = dataset.targets[correct_indices[index]]
sample_label = np.expand_dims(sample_label, axis=0)
sample_label = torch.Tensor(sample_label).to(dtype=torch.int64)

sample_adv_image = pgd_attack.perturb(sample_image, sample_label)
_, sample_adv_label = torch.max(model(sample_adv_image), 1)

sample_image = sample_image.cpu().detach().numpy()
sample_adv_image = sample_adv_image.cpu().detach().numpy()

# Check if the adversarial image is valid
assert np.amax(np.abs(sample_image - sample_adv_image)) <= epsilon
assert np.amin(sample_adv_image) >= 0
assert np.amax(sample_adv_image) <= 255

# Plot the original image
sample_image = [np.transpose(image, (1, 2, 0)) for image in sample_image]
visualize(sample_image, sample_label)

# Plot the adversarial image
sample_adv_image = [np.transpose(image, (1, 2, 0)) for image in sample_adv_image]
visualize(sample_adv_image, sample_adv_label)

## Evaluating the performance of PGD with varying $\epsilon$

Now, you will evaluate the performance of PGD while varying the maximum perturbation $\epsilon \in [2, 4, 6, 8, 10]$. In this section, you will use Cross-Entropy loss as $L$. The step size and the number of iterations are set to 2 and 20, respectively. The procedure is as follows.

1. Given $\epsilon$, create an instance of `PGDAttack`.
2. Evaluate the attack instance on the samples specified by the variable `correct_indices`.
3. Calculate the attack success rate, which is defined by
<center>$\text{attack success rate}(\%)=\frac{\# \text{ samples that are successfully fooled}}{\# \text{ samples}}\times 100$</center>
4. Repeat steps 1, 2, and 3 for each $\epsilon\in [2, 4, 6, 8, 10]$ and plot the attack success rate against $\epsilon$.

If correctly implemented, the success rate will be 100% at $\epsilon = 8$.

In [None]:
step_size = 2
num_steps = 20
criterion = 'xent'
epsilons = [0, 2, 4, 6, 8, 10]
attack_success_rates = []

for epsilon in epsilons:
    pgd_attack = PGDAttack(model, epsilon, step_size, num_steps, criterion, device)
    is_correct = evaluate(model, cifar10_test, correct_indices, attack_method=pgd_attack)
    attack_success_rate = np.mean(1 - is_correct) * 100
    attack_success_rates.append(attack_success_rate)
    print('Epsilon: {}, Attack success rate: {:.1f}%'.format(epsilon, attack_success_rate))

plt.plot(epsilons, attack_success_rates, '-bo', label='PGD (xent loss)')
plt.ylim(-5, 105)
plt.xticks(epsilons)
plt.yticks(np.arange(0, 110, 10))
plt.xlabel('epsilon')
plt.ylabel('attack success rate')
plt.legend()

## Evaluating the performance of PGD with Carlini-Wagner loss

In this section, you will evaluate the performance of PGD using the Carlini-Wagner loss. Repeat the procedure from the previous section and compare the results.

If correctly implemented, the success rate will be 100% at $\epsilon = 8$.

In [None]:
step_size = 2
num_steps = 20
criterion = 'cw'
epsilons = [0, 2, 4, 6, 8, 10]
attack_success_rates = []

for epsilon in epsilons:
    pgd_attack = PGDAttack(model, epsilon, step_size, num_steps, criterion, device)
    is_correct = evaluate(model, cifar10_test, correct_indices, attack_method=pgd_attack)
    attack_success_rate = np.mean(1 - is_correct) * 100
    attack_success_rates.append(attack_success_rate)
    print('Epsilon: {}, Attack success rate: {:.1f}%'.format(epsilon, attack_success_rate))

plt.plot(epsilons, attack_success_rates, '-ro', label='PGD (cw loss)')
plt.ylim(-5, 105)
plt.xticks(epsilons)
plt.yticks(np.arange(0, 110, 10))
plt.xlabel('epsilon')
plt.ylabel('attack success rate')
plt.legend()

## Attacks on adversarially-trained model

As you can see, naturally-trained neural networks are vulnerable to adversarial attacks. There are several ways to improve the adversarial robustness of neural networks. One example is adversarial training, which trains a neural network on adversarial samples. It is widely regarded as one of the most effective defenses.

PGD adversarial training, proposed by [Madry et al.](https://arxiv.org/abs/1706.06083), uses Projected Gradient Descent to generate the adversarial samples used for training. It has been shown that PGD adversarial training on MNIST and CIFAR-10 can successfully defend against white-box attacks.
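To make the idea concrete, the sketch below shows what a single PGD adversarial training step might look like: an inner loop searches the $\epsilon$ ball for a high-loss input, and an outer step updates the weights on that input. It is only an illustration with a toy linear model and made-up hyperparameters. The function name `adversarial_training_step` is hypothetical, and this is not the procedure used to produce the pretrained checkpoint in this assignment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def adversarial_training_step(model, optimizer, x, y, epsilon, step_size, num_steps):
    """One min-max update: craft PGD examples, then train on them."""
    # Inner maximization: search the epsilon ball for a high-loss input.
    x_adv = (x + torch.empty_like(x).uniform_(-epsilon, epsilon)).clamp(0, 255)
    for _ in range(num_steps):
        x_adv.requires_grad_(True)
        grad, = torch.autograd.grad(F.cross_entropy(model(x_adv), y), x_adv)
        x_adv = x_adv.detach() + step_size * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon).clamp(0, 255)

    # Outer minimization: an ordinary gradient step on the adversarial batch.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
toy_model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))  # stand-in classifier
optimizer = torch.optim.SGD(toy_model.parameters(), lr=1e-6)
x = torch.randint(0, 256, (8, 3, 32, 32)).float()
y = torch.randint(0, 10, (8,))
loss_value = adversarial_training_step(toy_model, optimizer, x, y,
                                       epsilon=8, step_size=2, num_steps=5)
print(loss_value)
```

The only difference from standard training is that the loss is computed on `x_adv` instead of `x`.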

In [None]:
%cd pretrained
!gdown --fuzzy https://drive.google.com/file/d/16ffe2zzHIetYRCPkrv5P3NMKrOnx-ePz/view?usp=sharing
%cd ..

In [None]:
# Create a model
print('Creating a ResNet model')
model = resnet(mean_torch, std_torch).to(device)

# Load an adversarially-trained model
print('Loading an adversarially-trained model')
state_dict = torch.load("./pretrained/adv.pt", map_location=device)
model.load_state_dict(state_dict)


## Evaluating the model

Before running the attacks again, we evaluate the adversarially-trained model for two reasons.
1. To check that the model was restored successfully.
2. To collect samples that are correctly classified. There is no need to attack samples that are already misclassified.

Note that the indices of the first 100 correctly classified samples are stored in `correct_indices`, replacing the earlier values. You will use it later.

In [None]:
# Evaluate the adversarially-trained model on the first 1000 samples in the test dataset
indices = np.arange(0, 1000)

print('Evaluating adversarially-trained model')
correct_predictions = evaluate(model, cifar10_test, indices)
accuracy = np.mean(correct_predictions) * 100
print('Accuracy: {:.1f}%'.format(accuracy))

# Select the first 100 samples that are correctly classified.
correct_indices = np.where(correct_predictions==1)[0][:100]

**Inline Question**

Is the accuracy of the adversarially-trained model higher or lower than that of the naturally-trained model? Explain why they differ.

**Your answer**

None

**Useful material**

For those who are curious about this phenomenon, see https://arxiv.org/abs/1805.12152.

## Evaluating the performance of FGSM on the adversarially-trained model

Now, we will evaluate the performance of FGSM on the adversarially-trained model. In this section, you will use Cross-Entropy loss as $L$.

In [None]:
criterion = 'xent'
epsilons = [0, 2, 4, 6, 8, 10]
fgsm_attack_success_rates = []

for epsilon in epsilons:
    fgsm_attack = FGSMAttack(model, epsilon, criterion, device)
    is_correct = evaluate(model, cifar10_test, correct_indices, attack_method=fgsm_attack)
    attack_success_rate = np.mean(1 - is_correct) * 100
    fgsm_attack_success_rates.append(attack_success_rate)
    print('Epsilon: {}, Attack success rate: {:.1f}%'.format(epsilon, attack_success_rate))

plt.plot(epsilons, fgsm_attack_success_rates, '-bo', label='FGSM (xent loss)')
plt.ylim(-5, 105)
plt.xticks(epsilons)
plt.yticks(np.arange(0, 110, 10))
plt.xlabel('epsilon')
plt.ylabel('attack success rate')
plt.legend()

## Evaluating the performance of PGD on the adversarially-trained model

Now, we will evaluate the performance of PGD on the adversarially-trained model. In this section, you will use Cross-Entropy loss as $L$.

In [None]:
step_size = 2
num_steps = 20
criterion = 'xent'
epsilons = [0, 2, 4, 6, 8, 10]
pgd_attack_success_rates = []

for epsilon in epsilons:
    pgd_attack = PGDAttack(model, epsilon, step_size, num_steps, criterion, device)
    is_correct = evaluate(model, cifar10_test, correct_indices, attack_method=pgd_attack)
    attack_success_rate = np.mean(1 - is_correct) * 100
    pgd_attack_success_rates.append(attack_success_rate)
    print('Epsilon: {}, Attack success rate: {:.1f}%'.format(epsilon, attack_success_rate))

plt.plot(epsilons, pgd_attack_success_rates, '-ro', label='PGD (xent loss)')
plt.ylim(-5, 105)
plt.xticks(epsilons)
plt.yticks(np.arange(0, 110, 10))
plt.xlabel('epsilon')
plt.ylabel('attack success rate')
plt.legend()

## Comparing the performance of FGSM and PGD

Finally, we compare the performance of FGSM and PGD on the adversarially-trained model by overlaying the plots drawn in the two previous sections.

In [None]:
epsilons = [0, 2, 4, 6, 8, 10]

plt.plot(epsilons, fgsm_attack_success_rates, '-bo', label='FGSM (xent loss)')
plt.plot(epsilons, pgd_attack_success_rates, '-ro', label='PGD (xent loss)')
plt.ylim(-5, 105)
plt.xticks(epsilons)
plt.yticks(np.arange(0, 110, 10))
plt.xlabel('epsilon')
plt.ylabel('attack success rate')
plt.legend()

**Inline question**

Describe the result above.

**Your answer**

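The results above show that the attack success rate of both FGSM and PGD increases as epsilon increases.

At epsilon = 0, both attacks have a 0% success rate, because no perturbation is actually applied.

At small epsilon values (2 and 4), FGSM and PGD perform similarly (9% for both at epsilon = 2; 19% for FGSM and 20% for PGD at epsilon = 4).

At larger epsilon values (6, 8, and 10), however, PGD is more effective than FGSM. At epsilon = 8 in particular, FGSM reaches a 34% success rate while PGD reaches 48%, a substantial gap. At epsilon = 10, PGD (53%) again outperforms FGSM (44%).

This is because PGD produces better-optimized adversarial examples through multiple iterations. FGSM attacks with a single gradient computation, whereas PGD finds more effective adversarial examples through repeated optimization steps. As a result, PGD can mount a stronger attack even within the same limited perturbation budget (epsilon).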