## Adversarial Examples

Adversarial examples, generated by so-called evasion attacks, are input samples intentionally perturbed to mislead classification at test time [1,2].

These attacks are formulated as optimization problems that can be solved via gradient-based optimizers.

Here, we compute adversarial examples by minimizing a loss function $L$ with respect to a target label $y_t$ (different from the true class) under manipulation constraints, where $\theta$ denotes the classifier parameters and $\mathbf x_0$ the original input sample:

$$
\begin{eqnarray}
    \mathbf x^\star \in {\arg\min}_{\mathbf x} && L(\mathbf x, y_t, \theta) \, \\
    {\rm s.t. } && \| \mathbf x- \mathbf x_0\|_2 \leq \varepsilon \, , \\
    && \mathbf x_{\rm lb} \preceq \mathbf x \preceq \mathbf x_{\rm ub} \, .
\end{eqnarray}
$$
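As a sketch of the objective, assume $L$ is the cross-entropy loss, as in the exercise below (which uses secml's `CLossCrossEntropy`). For a score vector $f(\mathbf x)$, $L = -\log \mathrm{softmax}(f(\mathbf x))_{y_t}$, and its gradient with respect to the scores is $\mathrm{softmax}(f(\mathbf x)) - \mathbf e_{y_t}$, so minimizing it raises the score of $y_t$ relative to the other classes. The NumPy helpers below (`softmax`, `targeted_ce_loss`, `targeted_ce_grad`) are hypothetical names that work on the scores only; the gradient with respect to $\mathbf x$ then follows by the chain rule through the classifier.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(scores - scores.max())
    return e / e.sum()

def targeted_ce_loss(scores, y_t):
    """Cross-entropy w.r.t. the target label: -log p_{y_t}."""
    return -np.log(softmax(scores)[y_t])

def targeted_ce_grad(scores, y_t):
    """Gradient of targeted_ce_loss w.r.t. the scores: softmax(scores) - onehot(y_t)."""
    grad = softmax(scores)   # fresh array, safe to modify in place
    grad[y_t] -= 1.0
    return grad

scores = np.array([2.0, 0.5, -1.0])     # class 0 currently has the highest score
print(targeted_ce_loss(scores, y_t=1))
print(targeted_ce_grad(scores, y_t=1))  # the only negative entry is at index 1
```

A descent step on $L$ therefore moves the scores opposite to this gradient: up for $y_t$, down for every other class.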

The first constraint ensures that the adversarial perturbation $\mathbf x - \mathbf x_0$ has an $\ell_2$ norm of at most $\varepsilon$.
The second is a box constraint, with bounds $\mathbf x_{\rm lb}$ and $\mathbf x_{\rm ub}$, that keeps the adversarial image within the valid pixel range, i.e., 0-255 (or 0-1 if the input pixels are scaled).
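The projection onto this feasible domain can be sketched in NumPy as follows, assuming a flattened input vector and a single scalar bound per side of the box. `project` is a hypothetical helper: it rescales the perturbation onto the $\ell_2$ ball of radius $\varepsilon$ centered at $\mathbf x_0$, then clips the result to $[\mathbf x_{\rm lb}, \mathbf x_{\rm ub}]$.

```python
import numpy as np

def project(x_adv, x0, eps, lb=0.0, ub=1.0):
    """Bring x_adv back into the feasible set: l2 ball around x0, then the box [lb, ub]."""
    delta = x_adv - x0
    norm = np.linalg.norm(delta)
    if norm > eps:
        delta = delta * (eps / norm)  # shrink the perturbation to length eps
    return np.clip(x0 + delta, lb, ub)

# Usage: a perturbation of l2 norm 5 is shrunk to norm 1, then clipped to [0, 1].
x0 = np.array([0.2, 0.9, 0.5])
x_adv = x0 + np.array([3.0, 4.0, 0.0])
x_proj = project(x_adv, x0, eps=1.0)
print(x_proj, np.linalg.norm(x_proj - x0))
```

Applying the two steps in sequence keeps the point feasible for both constraints whenever $\mathbf x_0$ lies inside the box, since clipping never moves a coordinate farther from $\mathbf x_0$. It is not, in general, the exact Euclidean projection onto the intersection of the two sets, but it is cheap to compute.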


We solve this problem with the *projected* gradient-descent (PGD) algorithm implemented below, which, after each gradient step, projects the adversarial image onto the feasible domain so that the constraints remain satisfied.
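To make the iteration concrete, here is a minimal NumPy sketch of targeted $\ell_2$ PGD. It assumes a hypothetical toy linear classifier $f(\mathbf x) = W\mathbf x + \mathbf b$ with the cross-entropy loss, so that the input gradient $W^\top(\mathrm{softmax}(f(\mathbf x)) - \mathbf e_{y_t})$ has a closed form; with a neural network, this gradient would come from backpropagation instead. Each iteration takes a normalized step *against* the gradient (we are minimizing $L$), then projects onto the $\ell_2$ ball and the box. The function name `pgd_targeted_l2` and its defaults are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def pgd_targeted_l2(W, b, x0, y_t, eps, step_size=0.1, steps=100, lb=0.0, ub=1.0):
    """Targeted l2 PGD against a toy linear classifier f(x) = W @ x + b."""
    x_adv = x0.copy()
    for _ in range(steps):
        dL_dscores = softmax(W @ x_adv + b)
        dL_dscores[y_t] -= 1.0              # cross-entropy gradient w.r.t. the scores
        grad = W.T @ dL_dscores             # chain rule: gradient w.r.t. the input x
        g_norm = np.linalg.norm(grad)
        if g_norm == 0:
            break
        x_adv = x_adv - step_size * grad / g_norm   # normalized descent step on L(x, y_t)
        # projection: l2 ball of radius eps around x0, then the [lb, ub] box
        delta = x_adv - x0
        d_norm = np.linalg.norm(delta)
        if d_norm > eps:
            delta = delta * (eps / d_norm)
        x_adv = np.clip(x0 + delta, lb, ub)
    return x_adv

# Usage: two features, two classes; x0 starts in class 0 and is pushed towards class 1.
W, b = np.eye(2), np.zeros(2)
x0 = np.array([0.8, 0.2])
x_adv = pgd_targeted_l2(W, b, x0, y_t=1, eps=1.0)
print(np.argmax(W @ x0 + b), np.argmax(W @ x_adv + b))  # prints: 0 1
```

In this toy case the gradient direction is constant, so the iterate walks straight to the boundary of the $\ell_2$ ball and stays there; with a real network the direction changes at every step.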

The attack manipulates the input pixels of the initial image. To this end, we need to explicitly account for any transformation or scaling applied to the input sample before it is passed to the neural network. In particular, at each iteration, we map the image from the pixel space onto the transformed/scaled space, take a descent step along the gradient in that space, map the modified image back onto the input pixel space (using an inverse-transformation function), and apply the box and $\ell_2$ constraints in the input space.
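The transformation-aware update described above can be sketched as a single PGD iteration. The standardization `transform`/`inverse_transform` and its constants `MEAN` and `STD` are illustrative assumptions, not the preprocessing of any model in this notebook, and `grad_z_fn` is a hypothetical callable standing in for whatever computes the loss gradient with respect to the transformed input.

```python
import numpy as np

MEAN, STD = 0.1307, 0.3081  # illustrative constants (assumption)

def transform(x):
    """Pixel space -> space seen by the network (hypothetical standardization)."""
    return (x - MEAN) / STD

def inverse_transform(z):
    """Transformed space -> pixel space."""
    return z * STD + MEAN

def pgd_step(x_adv, x0, grad_z_fn, step_size, eps, lb=0.0, ub=1.0):
    """One PGD iteration: gradient step in the transformed space,
    constraints enforced in the input pixel space."""
    z = transform(x_adv)                     # 1. map the image to the transformed space
    g = grad_z_fn(z)                         # gradient of the loss w.r.t. z
    g_norm = np.linalg.norm(g)
    if g_norm > 0:
        z = z - step_size * g / g_norm       # 2. normalized descent step in that space
    x_new = inverse_transform(z)             # 3. map back to the pixel space
    delta = x_new - x0                       # 4. l2 and box constraints in pixel space
    d_norm = np.linalg.norm(delta)
    if d_norm > eps:
        delta = delta * (eps / d_norm)
    return np.clip(x0 + delta, lb, ub)

# Usage with a toy loss 0.5 * ||z - z_goal||^2 whose minimum lies at pixel value 0.6
z_goal = transform(np.full(4, 0.6))
x0 = np.full(4, 0.5)
x1 = pgd_step(x0, x0, lambda z: z - z_goal, step_size=0.1, eps=1.0)
print(x1)  # each pixel moves slightly from 0.5 towards 0.6
```

Because this transformation is affine, a step of length `step_size` in the transformed space moves the image by `step_size * STD` in pixel space, so step sizes tuned in one space do not carry over unchanged to the other.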


**References**
1.   C. Szegedy et al., Intriguing Properties of Neural Networks, ICLR 2014, https://arxiv.org/abs/1312.6199
2.   B. Biggio et al., Evasion Attacks against Machine Learning at Test Time, ECML PKDD 2013, https://arxiv.org/abs/1708.06131

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](
https://colab.research.google.com/github/zangobot/teaching_material/blob/HEAD/adv_challenge/adv_exercise.ipynb)

In [None]:
%%capture --no-stderr --no-display
# NBVAL_IGNORE_OUTPUT

try:
    import secml
except ImportError:
    %pip install git+https://github.com/pralab/secml
    %pip install foolbox

# Exercise 1

This PGD implementation is flawed. Find the bug and fix it.

In [None]:
from secml.ml.classifiers.loss import CLossCrossEntropy
from secml.data.loader import CDataLoaderMNIST
from secml.ml import CClassifierPyTorch

from mnist_model import SimpleNet

net = SimpleNet().load_pretrained_mnist('mnist_net.pth')
clf = CClassifierPyTorch(model=net, pretrained=True, input_shape=(1, 28, 28))
digits = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
loader = CDataLoaderMNIST()  # load the hand-written digit dataset
ts = loader.load('testing', num_samples=10)  # extract 10 samples from the test set
ts.X /= 255  # normalize data between 0 and 1
x, y = ts.X[4, :], ts.Y[4]
target_label = 8

print(f'Original label: {y}')
print(f'Target label: {target_label}')

eps = 5
step_size = 1
steps = 2000

loss_func = CLossCrossEntropy()
x_adv = x.deepcopy()

# we iterate multiple times to repeat the gradient descent step
for i in range(steps):
    scores = clf.decision_function(x_adv)
    loss_gradient = loss_func.dloss(y_true=y, score=scores, pos_label=target_label)
    clf_gradient = clf.grad_f_x(x_adv, target_label)
    gradient = clf_gradient * loss_gradient

    if gradient.norm() != 0:
        gradient /= gradient.norm()

    x_adv = x_adv + step_size * gradient
    delta = x_adv - x
    if delta.norm() > eps:
        delta = delta / delta.norm()
        x_adv = x + delta
    x_adv = x_adv.clip(0, 1)

scores = clf.decision_function(x_adv)
print(f'Adv label: {scores.argmax()}')

# Exercise 2

Instantiate the second network from the provided `mnist_net2.pth` weights, as done for the first one.
Then modify the attack so that PGD manipulates the input sample such that it is classified as a 2 by the first network and as a 9 by the second one.

In [None]:
# TODO