(invertible-networks)=
# Invertible Networks

Invertible networks are networks that are invertible by design, i.e., they map inputs to outputs bijectively, so any network output can be mapped back to its unique corresponding input [refs]. This invertibility enables different interpretability methods and also allows training invertible networks as generative models via maximum likelihood.

This chapter starts by explaining the invertible layers that are used to design invertible networks, proceeds to detail how these networks are trained as generative models or classifiers, and then outlines interpretability techniques that help reveal the learned features that are crucial for their classification tasks.
$\require{color}$
$\definecolor{commentcolor}{RGB} {70,130,180}$

## Background

* invertible layers
    * explain invertible layers, maybe bold part in front
    * (mention wavelet permutation also)
* example for volume change, do make it (use seaborn colors, in draw.io)
    * do consider making some bar plots with matplotlib why not
* architecture diagram
    * what is one block? Haar Wavelet addition
    * Latent gaussian at end, no don't show


### Invertible layers

Invertible networks use layers constructed specifically to maintain invertibility, thereby rendering the entire network structure invertible. Often-used invertible layers are coupling layers, invertible linear layers and activation normalization layers.

**Coupling layers** split a multidimensional input $x$ into two parts  $x_1$ and $x_2$ with disjoint dimensions and then use $x_2$ to compute an invertible transformation for $x_1$. Concretely, for an additive coupling layer, the forward computation is:

$
\begin{align*}
    y_1 &= x_1 + f(x_2) && \color{commentcolor}{\text{Compute } y_1 \text{ from } x_1 \text{ and arbitrary function } f \text{ of } x_2} \\
    y_2 &= x_2 && \color{commentcolor}{\text{Leave } x_2 \text{ unchanged}} \\
\end{align*}
$

The inverse computation is:

$
\begin{align*}
    x_1 &= y_1 - f(y_2) && \color{commentcolor}{\text{Invert to } x_1 \text{ using unchanged } y_2=x_2} \\
    x_2 &= y_2 &&  \color{commentcolor}{x_2 \text{ was unchanged}}\\
\end{align*}
$
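The additive coupling equations above can be sketched in a few lines of plain Python. This is only an illustrative sketch: the function `f` below is a hypothetical stand-in for the neural network, chosen to show that `f` itself does not need to be invertible.

```python
import math

def f(x2):
    # Hypothetical stand-in for the network f. It is not invertible
    # (v * v maps v and -v to the same value); the coupling layer still is.
    return [math.tanh(3.0 * v) + v * v for v in x2]

def additive_coupling_forward(x1, x2):
    # y1 = x1 + f(x2), y2 = x2
    shift = f(x2)
    y1 = [a + s for a, s in zip(x1, shift)]
    return y1, list(x2)

def additive_coupling_inverse(y1, y2):
    # y2 equals x2, so f(y2) is exactly the shift used in the forward pass
    shift = f(y2)
    x1 = [b - s for b, s in zip(y1, shift)]
    return x1, list(y2)

x1, x2 = [0.5, -1.0], [2.0, 0.25]
y1, y2 = additive_coupling_forward(x1, x2)
x1_rec, x2_rec = additive_coupling_inverse(y1, y2)
```

Since $x_2$ passes through unchanged, the inverse can recompute the same shift and subtract it, which is why no constraint on $f$ is needed.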


There are multiple ways to split the dimensions of a timeseries, such as using all even time indices as $x_1$ and all odd time indices as $x_2$, or using the difference and mean of two neighbouring samples (akin to one stage of a Haar wavelet transform). The function $f$ is usually implemented by a neural network; in our case, it will be a small convolutional network. Instead of addition, any other invertible function can be used. Affine transformations are common, where $f$ produces translation and scaling coefficients $f_t$ and $f_s$ (with $f_s$ nonzero):

$
\begin{align*}
    y_1 &= x_1 \cdot f_s(x_2) + f_t(x_2) && \text{ } y_2=x_2 && \color{commentcolor}{\text{Affine Forward }} \\
    \\
    x_1 &= \frac{(y_1  - f_t(y_2))}{f_s(y_2)} && \text{ } x_2=y_2 && \color{commentcolor}{\text{Affine Inverse}} \\
\end{align*}
$
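As a minimal sketch of the affine variant combined with the even/odd split of a timeseries: `f_s` and `f_t` below are hypothetical placeholders for the network outputs, and parametrizing the scale through `exp` is an assumption made here to keep it strictly positive, so that the division in the inverse is always defined.

```python
import math

def split_even_odd(signal):
    # x1: samples at even time indices, x2: samples at odd time indices
    return signal[0::2], signal[1::2]

def merge_even_odd(x1, x2):
    merged = [0.0] * (len(x1) + len(x2))
    merged[0::2], merged[1::2] = x1, x2
    return merged

def f_s(x2):
    # Hypothetical scale function; exp(...) is always > 0 (assumption)
    return [math.exp(0.5 * math.sin(v)) for v in x2]

def f_t(x2):
    # Hypothetical translation function
    return [v * v - 1.0 for v in x2]

def affine_coupling_forward(x1, x2):
    y1 = [a * s + t for a, s, t in zip(x1, f_s(x2), f_t(x2))]
    return y1, list(x2)

def affine_coupling_inverse(y1, y2):
    x1 = [(b - t) / s for b, s, t in zip(y1, f_s(y2), f_t(y2))]
    return x1, list(y2)

signal = [0.1, 0.7, -0.3, 1.2, 0.0, -0.8]
x1, x2 = split_even_odd(signal)
y1, y2 = affine_coupling_forward(x1, x2)
x1_rec, x2_rec = affine_coupling_inverse(y1, y2)
signal_rec = merge_even_odd(x1_rec, x2_rec)
```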


**Invertible linear layers** compute an invertible linear transformation (an automorphism) of their input. Concretely, they multiply a $d$-dimensional vector $\mathbf{x}$ with a $d \times d$ matrix $W$, where $W$ has to be invertible, i.e., have a nonzero determinant.

$
\begin{align*}
    \mathbf{y}&=W \mathbf{x} && \color{commentcolor}{\text{Linear Forward }} \\
    \mathbf{x}&=W^{-1} \mathbf{y} && \color{commentcolor}{\text{Linear Inverse}} \\
\end{align*}
$

For multidimensional arrays like feature maps in a convolutional network, these linear transformations are usually done per-position, as so-called invertible 1x1 convolutions in the 2d case.
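The per-position application can be illustrated with a small sketch for two channels. The matrix `W` and the helper names are chosen for illustration only; a real implementation would use a tensor library and a general matrix inverse.

```python
def matvec(W, x):
    # Multiply matrix W (list of rows) with vector x
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def inverse_2x2(W):
    # Closed-form inverse of a 2x2 matrix; requires a nonzero determinant
    (a, b), (c, d) = W
    det = a * d - b * c
    if det == 0:
        raise ValueError("W is not invertible (determinant is zero)")
    return [[d / det, -b / det], [-c / det, a / det]]

W = [[2.0, 1.0], [1.0, 1.0]]  # determinant = 1, so W is invertible
W_inv = inverse_2x2(W)

# Feature map with 3 positions, each holding a 2-channel vector.
# The same W is applied independently at every position.
feature_map = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
y = [matvec(W, x) for x in feature_map]
x_rec = [matvec(W_inv, v) for v in y]
```

Because the same matrix is shared across positions, the layer mixes channels but never mixes information between different positions.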

**Activation normalization layers** perform an affine transformation with learned scaling and translation parameters $s$ and $t$ that are independent of the input $x$:

$
\begin{align*}
    y&=x \cdot{s} + t && \color{commentcolor}{\text{ActNorm Forward }} \\
    x&=\frac{y - t}{s} && \color{commentcolor}{\text{ActNorm Inverse}} \\
\end{align*}
$
 
These have also been used to replace batch normalization and are often initialized in a data-dependent way so that the activations are standard-normalized at the beginning of training.
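A minimal sketch of such a data-dependent initialization for a single channel is shown below. It assumes the parameters are set from the mean and standard deviation of one initial batch; the function names are illustrative, and in a network this would be done per channel.

```python
import statistics

def actnorm_init(batch):
    # Choose s and t so that y = x * s + t has zero mean and unit
    # standard deviation on this initial batch (assumed init scheme).
    mean = statistics.mean(batch)
    std = statistics.pstdev(batch)
    s = 1.0 / std
    t = -mean / std
    return s, t

def actnorm_forward(x, s, t):
    return [v * s + t for v in x]

def actnorm_inverse(y, s, t):
    return [(v - t) / s for v in y]

batch = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s, t = actnorm_init(batch)
y = actnorm_forward(batch, s, t)
x_rec = actnorm_inverse(y, s, t)
```

After initialization, $s$ and $t$ are treated as ordinary learned parameters and no longer depend on batch statistics.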



## Generative models via maximum likelihood

Invertible networks can also be trained as generative models via maximum likelihood. In maximum likelihood training, the network is optimized to maximize the probabilities of the training inputs. Invertible networks assign probabilities to training inputs by mapping them to a latent space and computing their probabilities under a predefined prior in that latent space. However, for real-valued inputs, one has to account for quantization and volume change to ensure this results in a proper probability distribution in the input space. Quantization refers to the fact that training data often consists of quantized measurements of underlying continuous data, e.g., digital images can only represent a discrete set of color values. Volume change refers to how the invertible network's mapping function expands or squeezes volume from input space to latent space.

### (De)quantization

Often, training data consists of quantized measurements like discrete integer color values from 0 to 255, which are mapped to real-valued floating point numbers for training a network. Naively maximizing the probability densities of these quantized values with a continuous probability distribution would lead to pathological behavior, as the quantized inputs do not cover any volume. Hence, it would be possible for the learned distribution to assign infinitely high probability densities to individual training points. As an example, given a Gaussian mixture distribution with two components, one component may cover all the training points with nonzero densities while the other one could collapse onto a single training point and assign an arbitrarily high density to it [mackay ref?].

Hence, one needs to "dequantize" the data such that each datapoint occupies volume in the input space. The simplest way is to add uniform noise to each data point, with a volume corresponding to the gap between two neighbouring data points. For example, if the 256 color values are mapped to 256 floating point values between 0 and 1, one may add uniform noise $u\sim U(0,\frac{1}{256})$ to the inputs. If a new noise sample is drawn for each forward pass of the network, then the expected log-density that the continuous model assigns to the dequantized data is, up to a constant that depends only on the bin width, a lower bound on the log-probability that the corresponding discrete model assigns to the original quantized data [ref]. Maximizing the continuous objective therefore also pushes up the discrete log-likelihood.
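The uniform dequantization step can be sketched as follows, assuming integer values `0..255` are mapped to `v / 256`, so that each value owns a bin of width `1/256`. The function name and the seeding are illustrative choices.

```python
import random

def dequantize(int_values, rng, n_levels=256):
    # Map each integer level to the left edge of its bin and add
    # uniform noise that stays within the bin of width 1 / n_levels.
    width = 1.0 / n_levels
    return [v / n_levels + rng.random() * width for v in int_values]

rng = random.Random(0)
pixels = [0, 17, 128, 255]

# A fresh noise sample is drawn on every call, i.e. on every forward pass
noisy_first = dequantize(pixels, rng)
noisy_second = dequantize(pixels, rng)
```

Each dequantized value stays inside the bin of its original integer level, so the quantized value can still be recovered by flooring.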


### Change of Volume

In addition, for these probability densities to form a valid probability distribution in the input space, one has to account for how much the network's mapping function $f$ squeezes and expands volume. Otherwise, the network can increase densities by squeezing all the inputs closely together in latent space.
To correctly account for the volume change during the forward pass of $f$, one needs to multiply the prior's probability density with the volume change of $f$, decreasing the density if volume is squeezed from input to latent space and increasing it if volume is expanded. As the volume change at a given point $x$ is given by the absolute value of the determinant of the Jacobian of $f$ at that point, $\left| \det \left( \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \right) \right|$, the overall formula looks like this:


$p(x) = p_\textrm{prior}(f(x)) \cdot  | \det \left( \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \right)|$

Usually, one optimizes the log-densities, leading to:

$\log p(x) = \log p_\textrm{prior}(f(x)) + \log |\det \left( \frac{\partial \mathbf{f}}{\partial \mathbf{x}} \right)|$
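To see why the volume term is needed, consider a one-dimensional toy example, assuming a standard normal prior and a hypothetical linear map $f(x) = a x$ with $a = 0.25$, which squeezes inputs together in latent space. The sketch integrates the density numerically with and without the correction.

```python
import math

def log_standard_normal(z):
    return -0.5 * z * z - 0.5 * math.log(2.0 * math.pi)

def flow(x, a=0.25):
    # Toy invertible map; its Jacobian is the constant a
    return a * x

def log_density(x, a=0.25):
    # log p(x) = log p_prior(f(x)) + log |det(df/dx)|
    return log_standard_normal(flow(x, a)) + math.log(abs(a))

# Riemann sum over a grid that covers essentially all of the mass
dx = 0.01
grid = [-40.0 + i * dx for i in range(8001)]
mass_with_correction = sum(math.exp(log_density(x)) for x in grid) * dx
mass_without_correction = sum(
    math.exp(log_standard_normal(flow(x))) for x in grid
) * dx
```

With the correction, the total probability mass is approximately 1. Without it, the mass is approximately $1/|a| = 4$, so the uncorrected values do not form a valid density and could be inflated further simply by shrinking $a$.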




## Generative classifiers

Invertible networks trained as class-conditional generative models can also be used as classifiers. Class-conditional generative networks may be implemented in different ways, for example with a separate prior in latent space for each class. Given the class-conditional probability densities $p(x|c_i)$ and assuming a uniform prior over classes, one can obtain class probabilities via Bayes' theorem as $p(c_i|x)=\frac{p(x|c_i)}{\sum_jp(x|c_j)}$.
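In practice, the class-conditional densities are available as log-densities, which for high-dimensional inputs are far too negative to exponentiate directly. The following sketch, which assumes equal class priors, computes the class probabilities after subtracting the maximum log-density; the example values are made up for illustration.

```python
import math

def class_probabilities(log_px_given_c):
    # Softmax over class-conditional log-densities. Subtracting the
    # maximum avoids underflow: exp(-3050.0) alone would evaluate to 0.0.
    m = max(log_px_given_c)
    exps = [math.exp(v - m) for v in log_px_given_c]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical log-densities log p(x | c_j) for two classes
log_px_given_c = [-3050.0, -3052.0]
probs = class_probabilities(log_px_given_c)
```

Only the differences between the log-densities matter for the class probabilities, so a gap of 2 nats already yields a probability of about 0.88 for the first class.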

Pure class-conditional generative training may not yield networks that perform well as classifiers. One proposed reason is that, for high-dimensional inputs, providing the class label to the network can only reduce the generative maximum likelihood loss by a relatively small amount, for example much less than the typical difference between two training runs of the same network [REF].

The possible loss reduction from providing the class label can be derived from a compression perspective. By Shannon's source coding theorem, more probable inputs need fewer bits to encode than less probable inputs, or more precisely $\textrm{Number of bits needed}(x) = -\log_2 p(x)$. How many of these bits are needed for the class label in case it is not given? To distinguish between $n$ classes, one needs only $\log_2(n)$ bits, so in the case of binary pathology classification, only 1 bit is needed. However, the inputs themselves typically need at least 1 bit per dimension, so a 21-channel $\times$ 128-timepoint EEG signal may already need at least 2688 bits to encode. Therefore, the optimal class-conditional model will be at most 1 bit better than the optimal class-independent model, so the class label contributes very little to the overall encoding size. In contrast, the loss difference between two training runs of the same network will typically be at least one to two orders of magnitude larger. In practice, the gains from using a class-conditional model, e.g., by using a separate prior per class in latent space, are usually larger, but it is not a priori clear if the reductions in loss from exploiting the class label are high enough to result in a good classification model.
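The arithmetic behind this argument can be spelled out directly. The sketch below uses the numbers from the text and the assumption of 1 bit per input dimension; the variable names are illustrative.

```python
import math

def bits_needed(probability):
    # Code length in bits for an event with the given probability
    return -math.log2(probability)

n_classes = 2
n_channels, n_timepoints = 21, 128
bits_per_dimension = 1  # lower bound assumed in the text

bits_for_label = math.log2(n_classes)
bits_for_input = n_channels * n_timepoints * bits_per_dimension
label_fraction = bits_for_label / bits_for_input
```

Here `bits_for_label` is 1 and `bits_for_input` is 2688, so the class label accounts for less than 0.04% of the total code length.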

Various methods have been proposed to improve the performance of generative classifiers. For example, the per-class latent Gaussian priors have been fixed so that they retain the same distance throughout training [Ref Pavel], or a classification loss term $L_\textrm{class}(x,c_i)=-\log p(c_i|x) = -\log \frac{p(x|c_i)}{\sum_j p(x|c_j)}=-\log \frac{e^{\log p(x|c_i)}}{\sum_j e^{\log p(x|c_j)}}$ has been added to the training loss [Ref VIB heidelberg]. In our work, we experimented with adding a classification loss term to the training and also found that using a learned temperature $T$ before the softmax helps the training, leading to:

$
\begin{align*}
    L_\textrm{class}(x,c_i) &= -\log \frac{e^{\log p(x|c_i) / T}}{\sum_j e^{\log p(x|c_j) / T}} && \color{commentcolor}{\text{Classification loss with learned temperature } T} \\
\end{align*}
$
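A minimal sketch of a classification loss with a temperature before the softmax is given below. It assumes the temperature divides the class-conditional log-densities and treats it as a plain number rather than a learned parameter; the log-density values are made up for illustration.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax via the log-sum-exp trick
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def classification_loss(log_px_given_c, target, temperature=1.0):
    # Negative log class probability; dividing by the temperature
    # before the softmax is an assumption of this sketch.
    scaled = [v / temperature for v in log_px_given_c]
    return -log_softmax(scaled)[target]

# Hypothetical log-densities where the wrong class (index 1) is favoured
log_px_given_c = [-3050.0, -3000.0]
loss_without_temperature = classification_loss(log_px_given_c, target=0)
loss_with_temperature = classification_loss(
    log_px_given_c, target=0, temperature=25.0
)
```

In this example, the gap of 50 nats between the log-densities gives a loss of about 50 at temperature 1, while a temperature of 25 shrinks the gap to 2 and the loss to about 2.13, which illustrates how the temperature rescales large log-density differences before the softmax.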

## Invertible Network for EEG Decoding

We designed an invertible network for EEG decoding using invertible components from the literature, primarily from the Glow architecture [REF]. Our architecture consists of three stages that operate on successively lower temporal resolutions. Similar to Glow, the individual stages consist of several blocks of activation normalization, invertible linear channel transformations and coupling layers. Between stages, we downsample by computing the mean and difference of two neighbouring timepoints and moving these into the channel dimension. Unlike Glow, we keep processing all dimensions throughout all stages, and we found this architecture to reach competitive accuracy on pathology decoding.
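The downsampling step between stages can be sketched as follows for a signal stored as a list of channels. The exact scaling of the mean and difference is an assumption here (plain mean and plain difference), and the function names are illustrative.

```python
def haar_downsample(x):
    # x: list of channels, each a list with an even number of timepoints.
    # Returns twice as many channels with half as many timepoints:
    # first all mean channels, then all difference channels.
    means, diffs = [], []
    for chan in x:
        even, odd = chan[0::2], chan[1::2]
        means.append([(a + b) / 2.0 for a, b in zip(even, odd)])
        diffs.append([a - b for a, b in zip(even, odd)])
    return means + diffs

def haar_upsample(y):
    # Inverse: a = mean + diff / 2 and b = mean - diff / 2
    n_chans = len(y) // 2
    x = []
    for mean, diff in zip(y[:n_chans], y[n_chans:]):
        chan = []
        for m, d in zip(mean, diff):
            chan.extend([m + d / 2.0, m - d / 2.0])
        x.append(chan)
    return x

x = [[1.0, 3.0, 2.0, 2.0], [0.0, -4.0, 5.0, 1.0]]  # 2 channels, 4 timepoints
y = haar_downsample(x)  # 4 channels, 2 timepoints
x_rec = haar_upsample(y)
```

The total number of dimensions is unchanged, which is what keeps the operation invertible: halving the temporal resolution is compensated by doubling the number of channels.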

[diagram]

Training and dataset details

Prototypes

Per-Chan Prototypes

EEG CosNet