# Invariance & robustness

We will try to answer the following question: we know that neural networks can approximate anything, but we also know about the bias-variance trade-off. If we know that a model should not do something, can we prevent it from doing so?

To this end, we will think about how we can enforce or encourage robustness and, in the limit, invariance of the functions learned by neural networks.

We begin by defining what it means for a function to be invariant.

````{prf:definition}
:label: invariance_def

Let $G$ be a set of transformations $\rho: X\to X$. We call a function

$$f: X \to Y$$

invariant under $G$ if


$$f = f \circ \rho \ \forall \rho \in G.$$
````

````{prf:definition}
:label: robust_def

Let $G$ be a set of transformations $\rho: X\to X$. We call a function

$$f: X \to Y$$

$\frac{1}{L}$-robust under $G$ if


$$\|f(x) - f(\rho(x))\| \le L \|x - \rho(x)\| \ \forall \rho \in G\ \wedge \ \forall x\in X.$$
````

* This is one of many possible robustness definitions.
* In some cases you might want to generalize {prf:ref}`robust_def` to involve more specific distance measures than norms.
* $\infty$-robustness is invariance.
* There are three main ways to try to introduce invariance or robustness into a network.
    1. Modify the inputs during training.
    2. Modify the loss function.
    3. Modify the architecture.
        - This is the only way to enforce, not just encourage, invariance or robustness.
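
To make the two definitions concrete, here is a minimal sketch in plain Python. The functions `f` and `h`, the transformation set `G` (coordinate permutations plus a sign flip), and the helper `robustness_ratio` are illustrative choices that do not appear in the text above. The ratio is evaluated at a single point only, whereas {prf:ref}`robust_def` requires the bound for all $x\in X$.

```python
import itertools
import math

def f(x):
    """Euclidean norm: invariant under coordinate permutations and sign flips."""
    return math.sqrt(sum(v * v for v in x))

def h(x):
    """First coordinate: not invariant under G, but still robust."""
    return x[0]

def permutations_of(n):
    """All coordinate permutations of R^n, each as a map rho: X -> X."""
    return [lambda x, p=p: [x[i] for i in p]
            for p in itertools.permutations(range(n))]

def sign_flip(x):
    return [-v for v in x]

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def robustness_ratio(func, x, rho):
    """|func(x) - func(rho(x))| / ||x - rho(x)||, or 0 if rho(x) == x."""
    d = dist(x, rho(x))
    return abs(func(x) - func(rho(x))) / d if d > 0 else 0.0

G = permutations_of(3) + [sign_flip]
x = [0.5, -2.0, 3.0]

# Invariance: f(rho(x)) equals f(x) for every rho in G.
is_invariant = all(math.isclose(f(rho(x)), f(x)) for rho in G)

# Robustness: the largest ratio over G at this x is a lower bound on
# the constant L that h needs in the robustness definition.
worst_ratio = max(robustness_ratio(h, x, rho) for rho in G)
print(is_invariant, worst_ratio)
```

Here `f` satisfies the invariance definition on `G`, while `h` changes under most $\rho\in G$ but never by more than the distance $\|x - \rho(x)\|$.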

````{prf:definition}
:label: lipschitz_def

We call a function

$$f: X \to Y$$

$L$-Lipschitz continuous if

$$\|f(x) - f(x')\| \le L \|x - x'\| \ \forall x, x'\in X.$$
````

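* Comparing with {prf:ref}`robust_def` (set $x' = \rho(x)$), an $L$-Lipschitz function is $\frac{1}{L}$-robust under every set of transformations $G$. In this sense, the Lipschitz property provides a "general type" of robustness.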

````{prf:lemma}
:label: lipschitz_robust_lemma

Let $f: X \to Y$ be $\frac{1}{L}$-robust under a transformation set $G$. If $G$ acts transitively on $X$, then $f$ is $L$-Lipschitz.

````

````{prf:lemma}
:label: lipschitz_grad_lemma

If $f:\mathbb{R}^n\to \mathbb{R}^m$ is differentiable, then, if

$$L = \sup_{x\in \mathbb{R}^n} \|Df(x)\|$$

is finite, $f$ is $L$-Lipschitz. Here $Df(x)$ denotes the derivative of $f$ w.r.t. $x$.
````
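
As a worked example of this lemma (an illustration added here for the scalar case $n=m=1$, using two common activation functions):

$$\tanh'(x) = 1 - \tanh^2(x) \in (0, 1] \quad\Rightarrow\quad \sup_{x\in\mathbb{R}} |\tanh'(x)| = 1,$$

$$\sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \le \tfrac{1}{4} \quad\Rightarrow\quad \sup_{x\in\mathbb{R}} |\sigma'(x)| = \tfrac{1}{4},$$

where $\sigma(x) = 1/(1+e^{-x})$ denotes the sigmoid. So $\tanh$ is $1$-Lipschitz and the sigmoid is $\frac{1}{4}$-Lipschitz.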

````{prf:lemma}
:label: lipschitz_concat_lemma

If $f:Z\to Y$ is $L_f$-Lipschitz and $g:X\to Z$ is $L_g$-Lipschitz, then $f \circ g$ is $L_fL_g$-Lipschitz.
````

```{admonition} Click for proof.
:class: dropdown
````{prf:proof}
We calculate

$$\|f(g(x)) - f(g(x'))\| \le L_f \|g(x) - g(x')\| \le L_fL_g \|x - x'\|.$$

$\square$
```

````{prf:lemma}
:label: mlp_lipschitz_lemma

Every MLP with a Lipschitz activation function is Lipschitz continuous.

````

```{admonition} Click for proof.
:class: dropdown
````{prf:proof}

As an MLP has a finite number of layers, {prf:ref}`lipschitz_concat_lemma` tells us that we only need to prove that each single layer is Lipschitz.

Each layer is of the form

$$\phi(Ax + b),$$

where $\phi$ is the activation function, $A$ a matrix, and $b$ a vector. Since we know that $\phi$ is Lipschitz, we can use {prf:ref}`lipschitz_concat_lemma` again to reduce the problem to showing that

$$Ax + b$$

is Lipschitz. We calculate

$$\|(Ax + b) - (Ax' + b)\| = \|A(x-x')\| \le \|A\| \|x-x'\|.$$

As $A$ is a matrix and thereby a bounded linear operator, $\|A\|$ is finite and we are done.

$\square$
```

```{note}
Most activation functions are Lipschitz continuous. E.g., $\mbox{ReLU}, \mbox{sigmoid}, \tanh.$
```

```{warning}
While in theory all MLPs with Lipschitz activation functions are Lipschitz continuous, in practice this can be irrelevant, as the Lipschitz constant is often uncomfortably large.
```
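
The proof of {prf:ref}`mlp_lipschitz_lemma` also yields an explicit upper bound, namely the product of the layer-wise constants. The sketch below compares that bound with an empirical lower bound from random input pairs. Several choices here are assumptions for illustration: the random weights, the layer sizes, a linear last layer without activation, and the max-norm (whose induced matrix norm is simply the largest absolute row sum, so no linear algebra library is needed).

```python
import math
import random

def matvec(A, x):
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def relu(z):
    return [max(0.0, v) for v in z]

def inf_norm(x):
    return max(abs(v) for v in x)

def op_norm_inf(A):
    """Matrix norm induced by the max-norm: largest absolute row sum."""
    return max(sum(abs(a) for a in row) for row in A)

def mlp(layers, x):
    """layers is a list of (A, b); ReLU follows every layer but the last."""
    for i, (A, b) in enumerate(layers):
        x = [u + v for u, v in zip(matvec(A, x), b)]
        if i < len(layers) - 1:
            x = relu(x)
    return x

def random_layer(n_out, n_in, rng):
    A = [[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
    b = [rng.gauss(0, 1) for _ in range(n_out)]
    return A, b

rng = random.Random(0)
layers = [random_layer(16, 4, rng), random_layer(16, 16, rng),
          random_layer(1, 16, rng)]

# Upper bound: product of per-layer constants (ReLU is 1-Lipschitz).
upper_bound = math.prod(op_norm_inf(A) for A, _ in layers)

# Empirical lower bound: largest observed ratio over random nearby pairs.
empirical = 0.0
for _ in range(1000):
    x1 = [rng.gauss(0, 1) for _ in range(4)]
    x2 = [v + rng.gauss(0, 0.1) for v in x1]
    num = inf_norm([u - v for u, v in zip(mlp(layers, x1), mlp(layers, x2))])
    den = inf_norm([u - v for u, v in zip(x1, x2)])
    empirical = max(empirical, num / den)

print(empirical, upper_bound)
```

The true Lipschitz constant lies between the two printed numbers. The product bound grows multiplicatively with depth, which is one reason it can become uncomfortably large.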

## Tricks to make a network invariant

````{prf:definition} PointNet
:label: PointNet_def

Let $f:X \to Z$ and $g:Z\to Y$ be neural networks. We then define a **PointNet** as

$$\tilde X \mapsto g\left(\sum_{x\in\tilde X} f(x)\right).$$

Here $\tilde X\subset X$ is a finite set of input points.
````

* A special property of the PointNet is that it can take in arbitrarily many vectors as one joint input.

````{prf:lemma}
:label: PointNet_lemma

PointNets are permutation invariant w.r.t. the elements in $\tilde X$.

````

```{admonition} Click for proof.
:class: dropdown
````{prf:proof}

Addition is commutative, so the sum $\sum_{x\in\tilde X} f(x)$ does not depend on the order of the elements of $\tilde X$, i.e., it is permutation invariant. Hence so is the PointNet.

$\square$
```
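
A minimal sketch of {prf:ref}`PointNet_def` in plain Python follows. The concrete choices are assumptions for illustration: `f` is a single $\tanh$ layer from $\mathbb{R}^2$ to $\mathbb{R}^3$, `g` is a linear readout, and the weights are random rather than trained.

```python
import math
import random

def f(x, W):
    """Per-point feature map f: R^2 -> R^3, here tanh(W x)."""
    return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]

def g(z, v):
    """Readout g: R^3 -> R, here a linear map."""
    return sum(a * b for a, b in zip(v, z))

def pointnet(points, W, v):
    """g(sum_x f(x)); accepts any number of input points."""
    pooled = [0.0, 0.0, 0.0]
    for x in points:
        pooled = [p + q for p, q in zip(pooled, f(x, W))]
    return g(pooled, v)

rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(2)] for _ in range(3)]
v = [rng.gauss(0, 1) for _ in range(3)]

points = [[0.1, 0.2], [1.0, -1.0], [0.3, 0.7], [-0.5, 0.4]]
shuffled = points[:]
rng.shuffle(shuffled)

out_original = pointnet(points, W, v)
out_shuffled = pointnet(shuffled, W, v)
out_fewer = pointnet(points[:2], W, v)  # a different set size also works

# Equal up to floating-point rounding from the changed summation order.
print(math.isclose(out_original, out_shuffled, rel_tol=1e-9, abs_tol=1e-12))
```

Because the pooling is a plain sum, the same weights handle two or four points without any change to the code.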

Next, we recap what a group and a group representation are.

````{prf:definition} Group
:label: group_def

A **group** is a set $G$ together with a binary operation

$$\cdot:G \times G\ni (a, b) \mapsto a \cdot b \in G,$$

s.t. we have
* associativity, i.e., $(a\cdot b)\cdot c = a\cdot (b\cdot c)$
* an identity element, i.e., $\exists e \in G$ s.t., $e\cdot a = a =  a \cdot e \ \forall a\in G$,
* and an inverse, i.e., for each $a \in G\ \exists\ b\in G$ s.t. $a\cdot b=b\cdot a = e$.
````

````{prf:theorem} Cayley's theorem
:label: cayley_theorem

Every group $G$ is isomorphic to a subgroup of a symmetric group, i.e., of the group of all bijections from some set to itself.

````

````{prf:definition} Group representation
:label: representation_def

Let $G$ be a group and $V$ a vector space. A **representation** of $G$ on $V$ is a group homomorphism from $G$ to $\mbox{GL}(V)$. I.e., a map

$$\rho: G \to \mbox{GL}(V)$$

such that

$$\rho(a\cdot b) = \rho(a)\rho(b) \ \forall a,b\in G.$$
````

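We can use the trick of the PointNet to create a network that is invariant to the transformations from a finite group $G$ with a representation $\rho$.

For two networks $f:X\to Z$ and $g:Z\to Y$ we can simply define the network

$$x \mapsto g\left(\sum_{a\in G} f(\rho(a)x)\right).$$

This network is invariant because, for any $b\in G$, the map $a \mapsto a\cdot b$ is a bijection of $G$. Replacing $x$ by $\rho(b)x$ therefore only reorders the terms of the sum.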
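
The group-sum construction can be sketched for the four rotations of a square image by multiples of 90°. This is an illustration under assumptions: the image is a small integer grid, `f` is a hand-written feature that is deliberately not rotation invariant, and `g` is an arbitrary scalar map. None of these come from the text above.

```python
def rot90(img):
    """Rotate a square image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def orbit(img):
    """The four rotated copies of img (0, 90, 180, 270 degrees)."""
    out = [img]
    for _ in range(3):
        out.append(rot90(out[-1]))
    return out

def f(img):
    """Not rotation invariant: squared position-weighted pixel sum."""
    n = len(img)
    weighted = sum((i * n + j + 1) * img[i][j]
                   for i in range(n) for j in range(n))
    return weighted ** 2

def g(z):
    """Arbitrary readout applied to the pooled features."""
    return 2 * z + 1

def invariant_net(img):
    """g(sum over the group of f(rotated img))."""
    return g(sum(f(r) for r in orbit(img)))

img = [[1, 2, 0],
       [0, 3, 1],
       [4, 0, 2]]

print(f(img) == f(rot90(img)))                          # f alone is not invariant
print(invariant_net(img) == invariant_net(rot90(img)))  # the group sum is
```

Rotating the input only permutes the four terms of the sum cyclically, so the pooled value stays the same. The price is four evaluations of `f` per input, i.e., a cost proportional to the size of the group.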

## Tricks to make a network robust

### Robustness gradient penalty training

Usually we train a network $f$ with a loss of the form

$$L(\Theta, \mathscr{D}_\text{train}) = \mathbb{E}_{(x,y)} \mathscr{l}(f_\Theta(x), y).$$

A simple way to improve the robustness of the trained network ([source](https://arxiv.org/pdf/1711.09404.pdf)) is to augment the loss via {prf:ref}`lipschitz_grad_lemma`:

$$L(\Theta, \mathscr{D}_\text{train}) = \mathbb{E}_{(x,y)} \left[\mathscr{l}(f_\Theta(x), y) + \lambda \|\nabla_x f_\Theta(x)\|_p^p\right].$$

Usually one chooses $p=2$ and $\lambda \in [1, 10]$.

* This is only easy and cheap if $f_\Theta(x)$ is a scalar. See [here](https://arxiv.org/pdf/2012.13247.pdf) for non-scalars.
* This method only tries to minimize a lower bound on the Lipschitz constant, so there is no guarantee that this function will be robust. Indeed, if you explicitly look for so-called adversarial examples, you will find them.
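
A minimal sketch of the penalized loss for a scalar-valued model is given below, averaging both terms over the training examples. The model $f_\Theta(x) = \tanh(w\cdot x + b)$, the squared-error loss, and the toy data are assumptions chosen so that the input gradient can be written by hand. In practice one would obtain $\nabla_x f_\Theta(x)$ from an automatic differentiation framework.

```python
import math

def model(theta, x):
    """Scalar model f_theta(x) = tanh(w . x + b)."""
    w, b = theta
    return math.tanh(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_gradient(theta, x):
    """Gradient of f_theta w.r.t. the input: (1 - tanh^2(w . x + b)) * w."""
    w, _ = theta
    s = 1.0 - model(theta, x) ** 2
    return [s * wi for wi in w]

def penalized_loss(theta, data, lam=1.0, p=2):
    """Mean over the data of loss + lam * ||grad_x f_theta(x)||_p^p."""
    total = 0.0
    for x, y in data:
        fit = (model(theta, x) - y) ** 2  # squared error (assumed loss)
        penalty = sum(abs(gi) ** p for gi in input_gradient(theta, x))
        total += fit + lam * penalty
    return total / len(data)

theta = ([1.5, -0.5], 0.1)
data = [([0.0, 1.0], 0.0), ([1.0, 0.0], 1.0), ([-1.0, 0.5], -1.0)]

plain = penalized_loss(theta, data, lam=0.0)
penalized = penalized_loss(theta, data, lam=1.0)
print(plain, penalized)
```

The penalty is evaluated only at the training points, which is why it controls a lower bound on the Lipschitz constant rather than the constant itself.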

### Data augmentation

#### What is the goal?
* As we discussed, Lipschitz continuity provides a general, unspecific type of robustness.
* We can achieve a more specific type of robustness via data augmentation.
* Again, we assume a set of transformations/augmentations $G$ and a neural network $f:X\to Y$, and we want $f$ to be robust under $G$, i.e.,

$$\|f(x) - f(\rho(x))\| \le L \|x - \rho(x)\| \ \forall \rho \in G\ \wedge \ \forall x\in X.$$

#### The method
* Data augmentation tries to encourage this robustness w.r.t. $G$ by defining a probability distribution $\mathcal{G}$ over $G$ with a probability density function $p:G\to\mathbb{R}_{\ge0}$. This distribution of transformations is then used to define the loss

$$L(\Theta, \mathscr{D}_\text{train}) = \mathbb{E}_{(x,y)} \mathbb{E}_{\rho\sim\mathcal{G}} \mathscr{l}(f_\Theta(\rho(x)), y).$$

* In practice, one does not compute the full expectation $\mathbb{E}_{\rho\sim\mathcal{G}}$. Instead, one approximates it with a few samples, usually even only a single $\rho$. Using this simplification, we can reformulate the above setting via the probabilistic mapping $\mathcal{A}:X\ni x \mapsto \rho(x)\in X \text{ with } \rho\sim\mathcal{G}$ as

$$L(\Theta, \mathscr{D}_\text{train}) = \mathbb{E}_{(x,y)} \mathscr{l}(f_\Theta(\mathcal{A}(x)), y).$$

```{note}
In practice it might make sense to make $G$ and $\mathcal{G}$ data dependent, i.e., $G=G(x, y)$ and $\mathcal{G}=\mathcal{G}(x, y)$; or alternatively $\mathcal{A}(x, y)$.
```
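
The single-sample approximation can be sketched in a few lines. Everything concrete here is an assumption for illustration: inputs are short lists of "pixel" values, $G$ consists of mirroring, additive noise, and the identity, the sampling probabilities are arbitrary, and the model and loss are toy stand-ins.

```python
import random

def mirror(x):
    return x[::-1]

def add_noise(x, rng, scale=0.05):
    return [v + rng.gauss(0.0, scale) for v in x]

def identity(x):
    return x

def sample_augmentation(rng):
    """Draw one rho from the distribution over G (fixed probabilities)."""
    r = rng.random()
    if r < 0.25:
        return mirror
    if r < 0.5:
        return lambda x: add_noise(x, rng)
    return identity

def augment(x, rng):
    """The probabilistic map A: x -> rho(x) with a freshly sampled rho."""
    return sample_augmentation(rng)(x)

def augmented_loss(f, loss, data, rng):
    """One-sample estimate of the expected loss under augmentation."""
    return sum(loss(f(augment(x, rng)), y) for x, y in data) / len(data)

def toy_model(x):
    return sum(x) / len(x)  # mean pixel value

def squared_error(pred, y):
    return (pred - y) ** 2

rng = random.Random(0)
data = [([0.0, 0.5, 1.0], 0.5), ([1.0, 1.0, 0.0], 0.7)]
estimate = augmented_loss(toy_model, squared_error, data, rng)
print(estimate)
```

Each call draws new transformations, so repeated passes over the same data see different augmented inputs. Over many epochs this averages out to the full expectation over $\rho$.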

#### Which augmentations?
* The core question when using augmentations is: how to choose $G$ and $\mathcal{G}$?
* Ideally we should probably choose $\mathcal{G}$ as

$$\arg\min_\mathcal{G} \mathbb{E}_{(x,y)\sim p_{x,y}} \mathscr{l}(f^*(x), y) \text{ where}$$

$$f^* = \arg\min_f \frac{1}{|\mathscr{D}_\text{train}|} \sum_{(x,y)\in \mathscr{D}_\text{train}} \mathbb{E}_{\rho\sim\mathcal{G}} \mathscr{l}(f(\rho(x)), y).$$
* Problem: we have no practical access to the true data distribution $p_{x,y}$.

* One way to tackle that problem is to come up with an $\mathcal{A}$ such that the distribution arising from the mapping $\mathscr{D}_\text{train} \ni (x, y) \mapsto (\mathcal{A}(x), y)$ approximates the true distribution $p_{x,y}$ as well as possible.
    + This is also very hard.
* Approach: Via negativa. Instead of approximating $p_{x,y}$, try to find augmentations $\mathcal{A}$ that do not change the label $y$. I.e., we use invariance/robustness.
* Examples of good augmentations for image classification, e.g., cat vs dog:
    + Adding some noise to the image.
    + Rotate the image.
    + Mirror the image.
    + Slightly warp/deform the image.
    + Crop the image.
    + Change brightness.
    + Change contrast.
    + Color distortion.
    + Somehow combine two images of the same class.

```{figure} images/augmentation_car.png
---
height: 200px
---
Some augmentations of a car. [Source.](https://viso.ai/computer-vision/image-data-augmentation-for-computer-vision/)
```

```{figure} images/augmentation_grid.jpg
---
height: 500px
---
Some strong augmentations of a Quokka. [Source.](https://github.com/aleju/imgaug)
```

## When do you want which robustness or invariance?

* Rotation and mirror invariance for the input being a photo:
    + Good idea for:
        - Is there a cat in the image?
    + Bad idea for:
        - In which direction is the cat looking?
* Permutation invariance for the input being a sequence of digits:
    + Good idea for:
        - What is the digit sum?
    + Bad idea for:
        - What is the square of the number which is represented by the sequence?
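
The digit example can be checked directly. The helper names below are illustrative only.

```python
import itertools

def digit_sum(digits):
    return sum(digits)

def square_of_number(digits):
    """Square of the number whose decimal digits are given in order."""
    return int("".join(map(str, digits))) ** 2

digits = (1, 2, 3)
perms = list(itertools.permutations(digits))

digit_sums = {digit_sum(p) for p in perms}
squares = {square_of_number(p) for p in perms}

print(digit_sums)    # {6}: one value, the target is permutation invariant
print(len(squares))  # 6: every ordering gives a different target
```

A permutation-invariant network can therefore fit the digit sum, but it cannot fit the square, since it must return the same output for all six orderings.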

## Exercises

* Prove {prf:ref}`lipschitz_robust_lemma`.
* Prove {prf:ref}`lipschitz_grad_lemma`.
* Train a PointNet that can compute the digit sum of 1, 2, and 3 MNIST digits.
    + Evaluate it also for 4 and 5 digits.
* Use the Fashion MNIST dataset and introduce random 90°, 180°, 270° rotations. Then train an MLP and a rotation invariant version of that MLP on it. Compare the performance.