<h1><a href="https://ieeexplore.ieee.org/abstract/document/5940562/">
Towards Evaluating the Robustness of Neural Networks</a></h1>
by Nicholas Carlini et al.


<h2>1. Summary</h2>

The paper develops new attacks that reliably generate adversarial examples, including against defensively distilled neural networks.

<h2>2. Background</h2>

* ``Neural Network`` with softmax classifier
    * $F=softmax\circ F_n\circ F_{n-1}\circ\ldots\circ F_1$, where $Z(x)$ denotes the logits (the input to the softmax) and $softmax(Z(x))_i=\frac{e^{Z(x)_i}}{\sum_j e^{Z(x)_j}}$
    * Final output is $C(x)=argmax_i F(x)_i$; let $C^*(x)$ be the correct label of $x$

* ``Adversarial examples`` exist not only in image classification but also in speech recognition and malware classification.
    *  **Targeted** adversarial examples: given a class t and an input x that is not classified as t, find a new input x′ that is similar to x but classified as t.
    *  **Untargeted** adversarial examples: given an input x, find a new input x′ that is similar to x but classified differently.

* ``Evaluate the robustness of a NN`` in two ways
<ul>
    <li>attempt to prove a lower bound (*sound but difficult practically*)</li>
    <li>construct attacks that demonstrate an upper bound (*attacks must be strong enough to support the upper bound*)</li>
</ul>

* The ``adversary`` is assumed to have complete (white-box) access to the neural network.
>Prior work shows that it is possible to train a substitute model given black-box access to a target model, and then transfer attacks from the substitute model to the target model.

    * **Targeted adversarial attack** 
        * *Average case*: select the target class uniformly at random among the incorrect classes
        * *Best case*: attack against all incorrect classes and report the least difficult one
        * *Worst case*: attack against all incorrect classes and report the hardest one

* ``Distance metrics`` measure the distortion made to the input. This paper uses several $L_p$ norms because they approximate human perceptual distance.
> $L_p$ can be written as $||x-x'||_p=(\sum^n_{i=1}|x_i-x'_i|^p)^{\frac{1}{p}}$. Most existing work picks one of the distance metrics below.

    * $L_0$ measures the number of coordinates $i$ such that $x_i\neq x'_i$
    * $L_2$ measures the standard Euclidean distance between $x$ and $x'$
    * $L_\infty$ measures the maximum change to any coordinate

* <a href="https://arxiv.org/abs/1511.04508">``Defensive distillation``</a>
    The idea is that adversarial examples hide in blind spots created by overfitting, and that training on the softened labels produced by another network avoids overfitting. This idea is likely to be wrong.
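
The three distance metrics listed in this section can be sketched in a few lines. This is a minimal illustration on flat lists of pixel values; the function names are hypothetical and not taken from the paper.

```python
import math

def l0_distance(x, x_adv):
    """Number of coordinates i with x_i != x'_i."""
    return sum(1 for a, b in zip(x, x_adv) if a != b)

def l2_distance(x, x_adv):
    """Standard Euclidean distance between x and x'."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_adv)))

def linf_distance(x, x_adv):
    """Largest change made to any single coordinate."""
    return max(abs(a - b) for a, b in zip(x, x_adv))

# Toy 4-pixel "image" and a perturbed copy (values chosen for illustration).
x = [0.0, 0.5, 1.0, 0.25]
x_adv = [0.0, 0.75, 1.0, 0.0]
print(l0_distance(x, x_adv), l2_distance(x, x_adv), linf_distance(x, x_adv))
```

Two pixels change by 0.25 each, so the three metrics summarize the same perturbation in different ways: how many pixels changed, the overall size of the change, and the largest single change.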

<h2>3. Attack Algorithms</h2>

<h3><a href="https://openreview.net/forum?id=kklr_MTHMRQjG">L-BFGS</a></h3>

Given an image x, find a different image x' that is similar to x yet is labeled differently by classifier.

\begin{equation}
min\qquad c\cdot||x-x'||^2_2+loss_{F,l}(x')\\
s.t.\qquad x'\in[0, 1]^n
\end{equation}

where $loss_{F,l}$ maps an image to a positive value, e.g. the cross-entropy loss.

For a fixed constant $c>0$ the problem is solved with L-BFGS. A line search over $c$ then finds the value that yields an adversarial example of minimum distance.

In other words, the problem is solved repeatedly while $c$ is updated adaptively (e.g. by bisection), so that different or `better` adversarial examples can be found.


<h3><a href="https://arxiv.org/abs/1412.6572">Fast Gradient Sign</a></h3>

* Use $L_\infty$ norm rather than $L_2$ norm.
* Meant to be fast rather than to produce minimal perturbations

$$x'=x-\epsilon\cdot sign(\nabla loss_{F,t}(x))$$

where $\epsilon$ is chosen to be sufficiently small so as to be undetectable. 

The sign of the gradient of the loss function determines in which direction each pixel's intensity should be changed to decrease the loss; all pixels are then shifted simultaneously.
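
A minimal sketch of the targeted update $x'=x-\epsilon\cdot sign(\nabla loss_{F,t}(x))$. To keep it self-contained, the network is replaced by a hypothetical linear softmax model with weights `W` and biases `b`, for which the cross-entropy gradient has a closed form. This is an assumption for illustration only, and the result is not clipped to $[0,1]^n$.

```python
import math

def softmax(z):
    m = max(z)                                # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, b, x):
    """Z(x) = W x + b for a linear stand-in model."""
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]

def loss_gradient(W, b, x, t):
    """Gradient w.r.t. x of the cross-entropy loss towards target class t.
    For a linear model this is W^T (softmax(Z(x)) - onehot(t))."""
    p = softmax(logits(W, b, x))
    err = [p_k - (1.0 if k == t else 0.0) for k, p_k in enumerate(p)]
    return [sum(err[k] * W[k][i] for k in range(len(W))) for i in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def fast_gradient_sign(W, b, x, t, eps):
    """One step of size eps against the gradient sign (targeted variant)."""
    g = loss_gradient(W, b, x, t)
    return [x_i - eps * sign(g_i) for x_i, g_i in zip(x, g)]

# Toy two-class model: class 0 wins when x[0] > x[1].
W = [[1.0, -1.0], [-1.0, 1.0]]
b = [0.0, 0.0]
x = [0.6, 0.4]
x_adv = fast_gradient_sign(W, b, x, t=1, eps=0.15)
print(logits(W, b, x), logits(W, b, x_adv))
```

Every coordinate moves by exactly $\epsilon$, so the $L_\infty$ distance is $\epsilon$ by construction, which is why the method pairs naturally with that metric.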

><h4><a href="https://arxiv.org/abs/1607.02533">Iterative Gradient Sign</a></h4>
takes multiple smaller steps of size $\alpha$ instead of a single step of size $\epsilon$, and uses a `clip` function to keep every pixel within $\epsilon$ of the original image $x$ and inside the valid range (here $[0,255]$).
$$x'_0=x$$
$$x'_i=clip_{x,\epsilon}(x'_{i-1}-\alpha\cdot sign (\nabla loss_{F,t}(x'_{i-1})))$$
$$clip_{x,\epsilon}(v) = min\{255, x+\epsilon, max\{0, x-\epsilon, v\}\}$$
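
The iteration can be sketched as follows, reading `clip` as "stay within $\epsilon$ of the original pixel and inside $[0,255]$", which is an assumption about the intended definition. The gradient is supplied as a callable; the quadratic loss used in the example is purely hypothetical.

```python
def sign(v):
    return (v > 0) - (v < 0)

def clip(v, x_orig, eps, lo=0.0, hi=255.0):
    """Keep v within eps of the original pixel and inside the valid pixel range."""
    return min(hi, x_orig + eps, max(lo, x_orig - eps, v))

def iterative_gradient_sign(x, grad_fn, eps, alpha, steps):
    """Take `steps` small signed-gradient steps of size alpha, clipping after each one."""
    x_adv = list(x)                           # x'_0 = x
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = [clip(v - alpha * sign(g_i), x_i, eps)
                 for v, g_i, x_i in zip(x_adv, g, x)]
    return x_adv

# Hypothetical loss: squared distance to a fixed target image.
target = [0.0, 200.0, 255.0]

def grad_fn(v):
    return [2.0 * (a - b) for a, b in zip(v, target)]

x = [10.0, 100.0, 250.0]
print(iterative_gradient_sign(x, grad_fn, eps=8.0, alpha=1.0, steps=20))
```

The first two pixels stop once they have moved 8 intensity levels away from their original values, and the third stops at the upper end of the pixel range.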

<h3><a href="https://arxiv.org/abs/1511.07528">JSMA</a></h3>

* A greedy algorithm that picks pixels to modify one at a time, iteratively increasing the confidence of the target classification.

* Use the gradient $\nabla Z(x)_t$ to measure the impact of each pixel on the likelihood of classifying the image as the target class $t$.

* Pick the pixel that has the largest impact and modify it to increase the likelihood of class $t$.
 
* Repeat until either the classification changes to the target or the modification becomes detectable because more than a set threshold of pixels has been modified.

$$\alpha_{p,q}=\sum_{i\in\{p,q\}}\frac{\partial Z(x)_t}{\partial x_i}$$
$$\beta_{p,q}=\left(\sum_{i\in\{p,q\}}\sum_j\frac{\partial Z(x)_j}{\partial x_i}\right)-\alpha_{p,q}$$

where $\alpha_{p,q}$ measures how much changing both pixels $p$ and $q$ will change the target classification $t$, and $\beta_{p,q}$ measures how much changing $p$ and $q$ will change all other outputs $j\neq t$. The algorithm picks 

$$(p^*,q^*)=argmax_{(p,q)}(-\alpha_{p,q}\cdot\beta_{p,q})\cdot(\alpha_{p,q}>0)\cdot(\beta_{p,q}<0)$$

so that $\alpha_{p,q}>0$ and $\beta_{p,q}<0$, i.e. the target class becomes more likely while others become less likely. 

* If the logits $Z(x)$ are used, the attack is called JSMA-Z; if the softmax output $F(x)$ is used instead, it is called JSMA-F.

* For an RGB image, each pixel has 3 color channels, and $L_0=1$ means that one color channel of one pixel is changed.
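
The pair-selection rule can be sketched directly from the definitions of $\alpha_{p,q}$ and $\beta_{p,q}$. The Jacobian is passed in as a plain nested list with `jacobian[j][i]` standing for $\partial Z(x)_j/\partial x_i$; the function name and the example values are hypothetical.

```python
from itertools import combinations

def jsma_pick_pair(jacobian, t):
    """Return the pixel pair (p, q) maximizing -alpha*beta subject to
    alpha > 0 and beta < 0, or None if no pair qualifies."""
    n_pixels = len(jacobian[0])
    best, best_score = None, 0.0
    for p, q in combinations(range(n_pixels), 2):
        # alpha: effect of changing p and q on the target class t
        alpha = jacobian[t][p] + jacobian[t][q]
        # beta: effect on all classes, minus the target's share
        beta = sum(row[p] + row[q] for row in jacobian) - alpha
        if alpha > 0 and beta < 0:
            score = -alpha * beta
            if score > best_score:
                best, best_score = (p, q), score
    return best

# 3 classes x 4 pixels, target class t = 2 (illustrative numbers).
J = [
    [0.2, -0.5, 0.1, -0.3],
    [0.1, -0.4, 0.0, -0.2],
    [-0.1, 0.6, 0.2, 0.5],
]
print(jsma_pick_pair(J, t=2))
```

Here pixels 1 and 3 both raise the target logit and lower the others, so the pair (1, 3) gets the highest score.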

<h3><a href="https://arxiv.org/abs/1511.04599">Deepfool</a></h3>

* An untargeted attack technique optimized for the $L_2$ metric. 

* Produces closer adversarial examples than L-BFGS

* Approximates the NN as totally linear, with a hyperplane separating each class from another

* Derives the optimal solution of this simplified problem analytically, takes a step towards it, and repeats until a true adversarial example is found.
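
For a classifier that really is linear, the optimal $L_2$ perturbation has a closed form: project the input onto the nearest decision hyperplane. The sketch below shows that single step for a hypothetical affine model `W`, `b`. On a real network the step would use a local linearization and be repeated; the small `overshoot` factor is an assumption added here so the boundary is actually crossed.

```python
import math

def logits(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]

def argmax(values):
    return max(range(len(values)), key=lambda k: values[k])

def deepfool_linear(W, b, x, overshoot=0.02):
    """Minimal L2 step that moves x across the closest class boundary of an affine model."""
    z = logits(W, b, x)
    k0 = argmax(z)                                   # current class
    best = None
    for k in range(len(W)):
        if k == k0:
            continue
        w_diff = [wk - w0 for wk, w0 in zip(W[k], W[k0])]
        norm = math.sqrt(sum(v * v for v in w_diff))
        dist = abs(z[k] - z[k0]) / norm              # distance to the k-vs-k0 hyperplane
        if best is None or dist < best[0]:
            best = (dist, w_diff, norm)
    dist, w_diff, norm = best
    scale = (1.0 + overshoot) * dist / norm          # step length along w_diff / norm
    return [x_i + scale * w_i for x_i, w_i in zip(x, w_diff)]

W = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
b = [0.0, 0.0, 0.0]
x = [1.0, 0.5]
x_adv = deepfool_linear(W, b, x)
print(argmax(logits(W, b, x)), argmax(logits(W, b, x_adv)))
```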

<h2>Experiment Preparation</h2>

* Train two networks for the <a href="http://yann.lecun.com/exdb/mnist/">MNIST</a> and <a href="https://www.cs.toronto.edu/~kriz/cifar.html">CIFAR-10</a> tasks, following the models and training approach of the <a href="https://arxiv.org/abs/1511.04508">defensive distillation</a> paper.

* Achieved $99.5\%$ accuracy on MNIST and $80\%$ on CIFAR-10.

* Use a pre-trained <a href="https://arxiv.org/abs/1512.00567">Inception v3 network</a> for <a href="http://www.image-net.org">ImageNet</a>.

* Achieved $96\%$ top-5 accuracy on ImageNet.

<h2>Construct Adversarial Examples</h2>

<h3>Optimization problem formulation</h3>

Let $\delta$ be the change made to the original input $x$, $t$ the target class. The initial formulation of adversarial example is the solution of 

\begin{equation}
min_\delta\qquad ||\delta||_p\\
s.t.\qquad C(x+\delta)=t\\
    \qquad x+\delta\in[0, 1]^n
\end{equation} 

As $C(x+\delta)=t$ is highly non-linear, define an objective function $f(x)$ such that $f(x+\delta)\leq0\ iff\ C(x+\delta)=t$. Then the optimization problem becomes

\begin{equation}
min_\delta\qquad ||\delta||_p\\
s.t.\qquad f(x+\delta)\leq 0\\
    \qquad x+\delta\in[0, 1]^n
\end{equation}

**In practice, to simplify the optimization problem, use a surrogate objective function with a constant c>0.**

\begin{equation}
min_\delta\qquad ||\delta||_p+c\cdot f(x+\delta)\\
s.t.\qquad x+\delta\in[0, 1]^n
\\\qquad c>0
\\where\ f(x+\delta)=max(max_{i\neq t}(Z(x+\delta)_i)-Z(x+\delta)_t,\ 0)\ or\ another\ function\ with\ the\ same\ property
\end{equation}
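
A small sketch of the surrogate objective may help. It evaluates the hinge-style $f$ on a vector of logits and adds it to a distance term. The squared $L_2$ distance is used here as one concrete choice of norm, and `logits_fn` is a hypothetical stand-in for the network's logit function $Z$.

```python
def f_hinge(z, t):
    """max(max_{i != t} Z_i - Z_t, 0): zero exactly when class t has the largest logit."""
    best_other = max(v for i, v in enumerate(z) if i != t)
    return max(best_other - z[t], 0.0)

def surrogate_objective(x, delta, c, t, logits_fn):
    """Distance term plus c times the classification term."""
    x_adv = [a + d for a, d in zip(x, delta)]
    dist = sum(d * d for d in delta)          # squared L2 norm of the perturbation
    return dist + c * f_hinge(logits_fn(x_adv), t)

# Toy "network" whose logits are the input itself (illustration only).
identity = lambda v: v
print(surrogate_objective([1.0, 0.0], [0.0, 0.0], 10.0, 1, identity))
print(surrogate_objective([1.0, 0.0], [-0.5, 0.5], 10.0, 1, identity))
```

With no perturbation the classification term dominates; once the target logit catches up, only the distance term remains, which is the trade-off that $c$ controls.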

<img src="./fig1.png"><center>When c is small, the attack success rate is low; as c increases, the success rate grows rapidly to 100%, but the distance of the perturbation also grows. The implementation uses a modified binary search to choose c.</center></img> 
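
One way such a search over $c$ could look is sketched below. It is not the paper's exact procedure: `attack` is a hypothetical callable that returns an adversarial example or `None`, success is assumed to be monotone in $c$, and the starting value, growth factor and step count are arbitrary.

```python
def search_constant(attack, c_init=1e-2, c_max=1e10, steps=20):
    """Look for the smallest c at which attack(c) succeeds.
    Grows c tenfold until the first success, then bisects between the
    largest failing and the smallest succeeding value."""
    lo, hi = 0.0, None                 # hi stays None until some c succeeds
    c, best = c_init, (None, None)
    for _ in range(steps):
        if c > c_max:
            break
        adv = attack(c)
        if adv is not None:
            best, hi = (c, adv), c     # success: try a smaller c next
        else:
            lo = c                     # failure: need a larger c
        c = c * 10.0 if hi is None else (lo + hi) / 2.0
    return best

# Toy attack that succeeds whenever c is at least 3.7.
toy_attack = lambda c: "adv" if c >= 3.7 else None
print(search_constant(toy_attack))
```

Keeping the smallest successful $c$ matters because a larger $c$ still succeeds but yields a larger perturbation.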

For the box constraint $x+\delta\in[0, 1]^n$, three approaches are considered in combination with the <a href="https://arxiv.org/abs/1412.6980">Adam optimizer</a>:
* ``Projected gradient descent`` performs one gradient descent step and then clips all coordinates back into the box. ``cons``: works poorly for gradient descent variants with a complicated update step (e.g. momentum), because clipping also changes the input to the next update.
    
* ``Clipped gradient descent`` does not clip $x+\delta$ itself; instead it builds the clipping into the objective, replacing $f(x+\delta)$ with $f(min(max(x+\delta, 0), 1))$. ``cons``: a coordinate can get stuck at a value far outside the box, where the gradient is zero.
    
* ``Change of variables`` sets $\delta_i=\frac{tanh(w_i)+1}{2}-x_i$ and optimizes over $w$ instead of $\delta$. Since $-1\leq tanh(w_i)\leq 1$, it follows that $x_i+\delta_i=\frac{tanh(w_i)+1}{2}\in[0,1]$ automatically.
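
The change of variables can be illustrated with two small helpers. Mapping an image into $w$-space needs `atanh`, which is undefined at $\pm 1$, so the sketch shrinks the input slightly first; that shrink factor is an implementation assumption, not something stated in the paper.

```python
import math

def to_tanh_space(x, shrink=1e-6):
    """Map pixels in [0, 1] to unconstrained w with (tanh(w) + 1) / 2 ~= x."""
    return [math.atanh((2.0 * v - 1.0) * (1.0 - shrink)) for v in x]

def from_tanh_space(w):
    """Map any real-valued w back to a valid pixel value in [0, 1]."""
    return [(math.tanh(v) + 1.0) / 2.0 for v in w]

x = [0.0, 0.3, 1.0]
w = to_tanh_space(x)
print(from_tanh_space(w))                            # close to the original x
print(from_tanh_space([-50.0, 0.0, 50.0, 1000.0]))   # always inside [0, 1]
```

Whatever value the optimizer assigns to $w$, the resulting image is valid, so no clipping step is needed.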

<h3>Evaluation and discussion</h3> 

The choice of loss function $f$ affects the result most significantly.

* It turns out that $Z(x)_t$ is mostly linear w.r.t. $\delta$, so $F(x)_t$ behaves like a logistic function of $\delta$.

* For a loss based on $F$, $\nabla f(x)$ is therefore small at the initial $x$. For a small step $\delta$ to pay off, i.e. $||\delta||_p < c\cdot(f(x)-f(x+\delta))$, $c$ has to be large.

* But $\nabla f(x+\delta)$ grows exponentially as $\delta$ moves towards the adversarial example, so the large constant $c$ then becomes excessive and gradient descent behaves too greedily.

Pixel intensities in a valid image are discrete, while the optimization treats them as continuous. Rounding the result to the nearest valid intensity degrades the quality of the adversarial example, especially when the change to the initial image is small. A greedy search over the discretized values is used to recover the performance.


<h2>Attacks with three different distance metrics</h2>

<h3>$L_2$ attack</h3>

The optimization function is

\begin{equation}
min_w\qquad ||\frac{tanh(w)+1}{2}-x||^2_2 + c\cdot f(\frac{tanh(w)+1}{2})
\\where\ f(x+\delta)=max(max\{Z(x+\delta)_i:i\neq t\}-Z(x+\delta)_t,\ -\kappa)
\end{equation}

Adjust $\kappa$ to control the confidence of the misclassification: a larger $\kappa$ requires the target logit to exceed every other logit by a wider margin. The figure below shows adversarial examples generated on MNIST.

<img src="./fig2.png"><center>Almost all attacks are visually indistinguishable.</center></img>

To avoid getting stuck in a local optimum, gradient descent is started from multiple random points sampled within a ball of radius $r$ around the original image.
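
The pieces of the $L_2$ attack can be put together in a compact sketch. Several simplifications are assumptions made to keep it self-contained: the network is a hypothetical linear model `W`, `b` with a hand-derived gradient, plain gradient descent replaces Adam, $c$ is fixed instead of searched, and there are no random restarts. The closest successful iterate is kept, since later iterates may drift back across the boundary.

```python
import math

def logits(W, b, x):
    return [sum(w * v for w, v in zip(row, x)) + b_k for row, b_k in zip(W, b)]

def l2_attack(W, b, x0, t, c=5.0, kappa=0.0, lr=0.05, steps=300):
    """Minimize ||x - x0||^2 + c * max(max_{i != t} Z_i - Z_t, -kappa) over w,
    where x = (tanh(w) + 1) / 2. Returns the closest x classified as t, or None."""
    # Start at the original image (shrunk slightly so atanh stays finite).
    w = [math.atanh((2.0 * v - 1.0) * 0.999999) for v in x0]
    best, best_dist = None, float("inf")
    for _ in range(steps):
        x = [(math.tanh(v) + 1.0) / 2.0 for v in w]
        z = logits(W, b, x)
        i_star = max((i for i in range(len(z)) if i != t), key=lambda i: z[i])
        dist = math.sqrt(sum((a - o) ** 2 for a, o in zip(x, x0)))
        if z[t] > z[i_star] and dist < best_dist:
            best, best_dist = x, dist             # remember the closest success
        hinge_active = z[i_star] - z[t] > -kappa  # f has a nonzero gradient only here
        new_w = []
        for j, w_j in enumerate(w):
            g_x = 2.0 * (x[j] - x0[j])            # gradient of the distance term
            if hinge_active:
                g_x += c * (W[i_star][j] - W[t][j])
            dx_dw = (1.0 - math.tanh(w_j) ** 2) / 2.0   # chain rule through tanh
            new_w.append(w_j - lr * g_x * dx_dw)
        w = new_w
    return best

W = [[1.0, -1.0], [-1.0, 1.0]]
b = [0.0, 0.0]
x0 = [0.6, 0.4]
adv = l2_attack(W, b, x0, t=1)
print(adv, logits(W, b, adv))
```

In this toy model the decision boundary is the line $x_0=x_1$, so the returned point sits just past that line and stays inside $[0,1]^2$ without any clipping.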

<h3>$L_0$ attack</h3>

$L_0$ distance is non-differentiable. An iterative algorithm is used. 

* In each iteration, use the $L_2$ attack to find an adversarial example $x+\delta$, then use the gradient $g=\nabla f(x+\delta)$ to pick the pixel $i=argmin_i\ g_i\cdot\delta_i$ that has the lowest effect on the classifier output.
* Fix the values of the pixels picked in all previous iterations and keep them constant in later iterations.
* Stop when the $L_2$ attack can no longer find an adversarial example. The pixels that remain free form a (possibly minimal) set of pixels that have a significant effect on the classifier output.

Each time the $L_2$ attack is solved, a small $c$ is used initially. If the attack fails, $c$ is doubled and the attack is restarted, until either the attack succeeds or $c$ reaches a threshold.
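
Two mechanics of this procedure are easy to sketch in isolation: freezing the pixel with the lowest effect, and doubling $c$ until the inner attack succeeds. Using the magnitude $|g_i\cdot\delta_i|$ as the measure of effect is an interpretation made here, and `l2_attack` is a hypothetical callable returning an adversarial example or `None`. The starting value and threshold for $c$ are illustrative.

```python
def freeze_lowest_effect_pixel(allowed, delta, grad):
    """Remove from the allowed set the pixel whose |g_i * delta_i| is smallest.
    Returns the shrunken set and the index that was frozen."""
    i = min(allowed, key=lambda k: abs(grad[k] * delta[k]))
    return allowed - {i}, i

def attack_with_doubling(l2_attack, c_init=1e-4, c_max=1e10):
    """Retry the inner attack with c doubled each time until it succeeds
    or c passes the threshold."""
    c = c_init
    while c <= c_max:
        adv = l2_attack(c)
        if adv is not None:
            return c, adv
        c *= 2.0
    return None, None

allowed = {0, 1, 2, 3}
delta = [0.3, -0.01, 0.2, 0.0]      # perturbation found by the inner attack
grad = [-2.0, 0.5, -1.0, 3.0]       # gradient of f at x + delta
allowed, frozen = freeze_lowest_effect_pixel(allowed, delta, grad)
print(frozen, sorted(allowed))
```

Pixel 3 was not changed at all, so it is frozen first; repeating the step would freeze pixel 1 next, leaving the pixels that carry most of the attack.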

<img src="./fig3.png"><center>Adversarial examples generated by the $L_0$ attack on MNIST.</center></img>

In the figure above, the attacks on MNIST are visually noticeable.

<h3>$L_\infty$ attack</h3>

The optimization problem is

$$ min_\delta c\cdot f(x+\delta)+||\delta||_\infty$$

Gradient descent performs poorly on this objective because $||\delta||_\infty$ only penalizes the largest entry of $\delta$. Instead, an iterative attack is used with a penalty approximating the $L_\infty$ norm.

$$ min_\delta c\cdot f(x+\delta)+\sum_i[max(\delta_i-\tau, 0)]$$

In each iteration, $c$ starts at a small value. If the attack fails, $c$ is doubled; this continues until the attack succeeds or $c$ exceeds a threshold.

After each iteration, if $\delta_i<\tau$ for all $i$, $\tau$ is reduced and the attack is repeated; otherwise, the search terminates.
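
The penalty term and the $\tau$ schedule can be sketched as two helpers. Two details are assumptions: the absolute value $|\delta_i|$ is used so that negative changes are penalized as well, and the shrink factor `0.9` for $\tau$ is an illustrative choice, since the notes above do not state one.

```python
def linf_penalty(delta, tau):
    """Sum of the amounts by which each |delta_i| exceeds tau."""
    return sum(max(abs(d) - tau, 0.0) for d in delta)

def update_tau(delta, tau, shrink=0.9):
    """Reduce tau if every |delta_i| is already below it.
    Returns None to signal that the search should terminate."""
    if all(abs(d) < tau for d in delta):
        return tau * shrink
    return None

delta = [0.3, -0.5, 0.1]
print(linf_penalty(delta, tau=0.2))       # only the entries above tau contribute
print(update_tau([0.1, -0.15], tau=0.2))  # all below tau, so tau shrinks
print(update_tau([0.3], tau=0.2))         # an entry exceeds tau, so the search stops
```

Unlike the maximum norm, this penalty has a nonzero gradient for every entry above $\tau$, so gradient descent can shrink all large entries at once.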

<img src="./fig4.png"><center>Almost all attacks are visually indistinguishable.</center></img>

In all $L_0, L_2, L_\infty$ attacks, the perturbation is distinguishable when $7$ is classified as $6$.

<h2>Attack Evaluation in Comparison with Prior Work</h2>

* Attacks find closer adversarial examples than the previous state-of-the-art attacks, and never fail to find an adversarial example while others occasionally fail
 
* As the learning task becomes more difficult, the previous attacks produce worse results, due to the complexity of the model. In contrast, the attacks in this paper perform even better as the task complexity increases.
 
* **It is important to realize that the results between models are not directly comparable.**

<img src="./fig5.png"><center>Perturbations the attacks produce for each target class when the initial image is all black or all white.</center></img>

<h2>Attacking Defensive Distillation</h2>

<h3>Distillation</h3>

* First train a teacher network on a standard training set.
* Use the teacher network to re-label the training set with its output.
* Train a smaller distilled network with the re-labeled training set.

>Re-label means that each input gets the teacher's output probabilities (soft labels) as its label instead of a single ground-truth label.
    
<h3>Defensive distillation</h3>

* The distilled network is no longer smaller than the teacher network; both have the same size.
* Use a large distillation temperature $T$ to force the distilled model to become more confident in its predictions. $$softmax(x, T)_i=\frac{e^{x_i/T}}{\sum_j e^{x_j/T}}=softmax(x/T, 1)_i$$
* Training steps:
<ol>
    <li>Build a teacher network with temperature $T$.</li>
    <li>Train the teacher network with a standard training set.</li>
    <li>Use the trained teacher network to re-label the training set.</li>
    <li>Train the distilled network with temperature $T$ on the re-labeled training set.</li>
    <li>When testing the distilled network, use temperature 1.</li>
</ol>
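
The temperature-scaled softmax is simple to write down. The sketch below checks the identity $softmax(x, T)=softmax(x/T, 1)$ and shows that, for fixed logits, a larger $T$ gives a softer distribution. The logit values are illustrative.

```python
import math

def softmax(x, T=1.0):
    """Softmax at temperature T: exp(x_i / T) normalized over all classes."""
    scaled = [v / T for v in x]
    m = max(scaled)                    # subtract the max for numerical stability
    e = [math.exp(v - m) for v in scaled]
    s = sum(e)
    return [v / s for v in e]

x = [2.0, 1.0, 0.0]
print(softmax(x, T=1.0))    # peaked distribution
print(softmax(x, T=20.0))   # much closer to uniform
```

A network trained at high $T$ has to produce logits roughly $T$ times larger to fit the labels, which is why it looks extremely confident once the temperature is set back to 1 at test time.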

<h3>Existing attacks fail</h3>

* `L-BFGS`, `JSMA-F`, `Fast Gradient Sign` and `Deepfool` attacks with their standard objective functions fail because $\nabla F(\cdot)\approx 0$ almost everywhere.
    >Training at temperature $T$ and testing at temperature 1 is equivalent to scaling the logits, $x\rightarrow T\cdot x$. The softmax output of the predicted class becomes extremely close to 1, and the small probabilities of the other classes are rounded to 0 by floating-point arithmetic.
    
* JSMA-Z fails. 
    * It regards the changes in $Z()$ of all classes as equally important, regardless of how likely each class is.
    
    > If changing a pixel leads to similar changes in the $Z()$ values of the least likely and the most likely classes, then this pixel will not be changed. 
    
    * The temperature change magnifies the logits of all classes by a factor of $T$, which magnifies this effect. Hence the attack refuses to make any change.
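
The vanishing gradient can be reproduced numerically. The sketch below scales a hypothetical logit vector by $T=100$ and evaluates the gradient of the cross-entropy loss with respect to the logits, which is $softmax(Z)-onehot(label)$. Python floats are 64-bit; with the 32-bit floats typical of neural network code the rounding sets in even earlier.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]   # exp underflows to exactly 0.0 for very negative inputs
    s = sum(e)
    return [v / s for v in e]

def cross_entropy_grad(z, label):
    """Gradient of the cross-entropy loss w.r.t. the logits: softmax(z) - onehot(label)."""
    p = softmax(z)
    return [p_k - (1.0 if k == label else 0.0) for k, p_k in enumerate(p)]

z = [8.0, 0.0, -1.0]                   # illustrative logits, class 0 predicted
T = 100.0
grad_standard = cross_entropy_grad(z, 0)
grad_distilled = cross_entropy_grad([T * v for v in z], 0)
print(grad_standard)    # small but nonzero
print(grad_distilled)   # exactly zero in every coordinate
```

With the scaled logits the softmax output is exactly $(1, 0, 0)$, so an attack that follows this gradient receives no signal at all, while an objective defined on the logits themselves is unaffected.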
    
<h3>Defensive Distillation is still vulnerable</h3>

* The attack introduced in this paper succeeds with $100\%$ probability for all three distance metrics.

* Varying the temperature has no effect on defending against the $L_2$ attack.

* A transferable attack can be constructed on a standard model. 
    * Use the $L_2$ attack to find ``high-confidence`` adversarial examples on a standard network
    * These adversarial examples then successfully attack the defensively distilled network.

<h2>Conclusion</h2>

* Defensive distillation is not as robust as it has been claimed to be.

* The attack introduced in this paper is powerful and can be a new robustness metric.

* Transferable attack should be a part of the robustness test.
