# Build Basic Generative Adversarial Networks (GANs)
Deeplearning.ai study notes

https://lilianweng.github.io/lil-log/2017/08/20/from-GAN-to-WGAN.html

## Activations

Differentiable non-linear function
- differentiable for backpropagation
- non-linear to compute complex features

**ReLU**: $\max(z, 0)$; the derivative at zero is set to zero by convention, and the derivative for negative inputs is zero -> dying ReLU problem

**Leaky ReLU**: small slope for negative inputs (e.g. a = 0.1), so the gradient never becomes exactly zero; growing in popularity

**Sigmoid**: output in (0, 1); not often used in hidden layers because the tails of the sigmoid curve have near-zero derivatives -> vanishing gradient problem

**Tanh**: output in (-1, 1); keeps the sign of input z, same saturation problem as sigmoid
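
The saturation behaviour described above can be checked numerically. The following is a minimal sketch using only the standard library; the function names are illustrative, and the leaky slope `a = 0.1` is taken from the note above.

```python
import math

def relu_grad(z):
    # Derivative is 0 for z <= 0 (the value at exactly 0 is a convention).
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, a=0.1):
    # The negative side keeps a small slope a, so the gradient never dies.
    return 1.0 if z > 0 else a

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_grad(z):
    return 1.0 - math.tanh(z) ** 2

# Gradients shrink towards zero in the tails of sigmoid and tanh.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(z, relu_grad(z), leaky_relu_grad(z), sigmoid_grad(z), tanh_grad(z))
```

At `z = 10` the sigmoid and tanh gradients are several orders of magnitude smaller than at `z = 0`, while ReLU keeps a gradient of 1 on the positive side.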


## Batch normalization

Training data uses batch stats

Test data uses training stats

**Covariate shift**: changes in the distribution of one variable affect the distribution of related variables

Batch normalization:
1. reduce internal covariate shift
2. smooth out cost function
3. make neural network easier to train

### Training

In a batch, z -> normalized z by mean and std -> learnable scale factor $\gamma$ + shift factor $\beta$ -> y -> into activation function

$\gamma$ and $\beta$ are learnable parameters that let the network find the optimal distribution.

The result is not necessarily zero mean and unit standard deviation

### Testing

Use the running mean and standard deviation computed over the training set; both are fixed after training
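
The difference between training and testing can be sketched for a single scalar feature. This is an illustrative helper, not a library API; the exponential moving average with `momentum = 0.1` and the small `eps` are assumptions about how the running statistics are tracked.

```python
import math

class BatchNorm1D:
    """Batch norm for one scalar feature (hypothetical helper class)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma = 1.0          # learnable scale (kept fixed in this sketch)
        self.beta = 0.0           # learnable shift (kept fixed in this sketch)
        self.running_mean = 0.0
        self.running_var = 1.0
        self.momentum = momentum  # assumed moving-average factor
        self.eps = eps            # avoids division by zero

    def forward(self, batch, training):
        if training:
            # Training: normalize with the statistics of the current batch.
            mean = sum(batch) / len(batch)
            var = sum((z - mean) ** 2 for z in batch) / len(batch)
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Testing: reuse the statistics accumulated during training.
            mean, var = self.running_mean, self.running_var
        return [self.gamma * (z - mean) / math.sqrt(var + self.eps) + self.beta
                for z in batch]

bn = BatchNorm1D()
train_out = bn.forward([2.0, 4.0, 6.0, 8.0], training=True)
test_out = bn.forward([5.0], training=False)
print(train_out, test_out)
```

A test batch of size one works because no batch statistics are computed at test time.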



## Convolutions

Each filter detects a different feature of the image (eye filter, nose filter, ear filter...)

Stride

Padding: frame on the image, capture useful information at edge

Pooling: lower the dimension of input images

Upsampling: (use predefined methods)
- Nearest Neighbors
- Linear interpolation
- Bi-linear interpolation

**Transposed Convolutions**

input * filter -> up-sampling with learned parameters

issue: centered values are influenced much more heavily -> output has checkerboard problem

solution: use up-sampling followed by conv -> popular technique to avoid checkerboard problem
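
The checkerboard effect can be reproduced in one dimension. The sketch below is a simplified illustration with hypothetical helper names: a transposed convolution with kernel size 3 and stride 2 makes some output positions receive contributions from two inputs, while nearest-neighbor up-sampling followed by a stride-1 convolution gives an even interior.

```python
def transposed_conv1d(x, kernel, stride):
    """Each input value stamps a scaled copy of the kernel into the output."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, v in enumerate(x):
        for k, w in enumerate(kernel):
            out[i * stride + k] += v * w
    return out

def upsample_nearest1d(x, factor):
    """Repeat every value `factor` times (no learned parameters)."""
    return [v for v in x for _ in range(factor)]

def conv1d_same(x, kernel):
    """Stride-1 convolution with zero padding, so the length is preserved."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(padded[i + k] * w for k, w in enumerate(kernel))
            for i in range(len(x))]

x = [1.0, 1.0, 1.0, 1.0]
kernel = [1.0, 1.0, 1.0]

uneven = transposed_conv1d(x, kernel, stride=2)
smooth = conv1d_same(upsample_nearest1d(x, 2), kernel)
print(uneven)  # [1.0, 1.0, 2.0, 1.0, 2.0, 1.0, 2.0, 1.0, 1.0]
print(smooth)  # [2.0, 3.0, 3.0, 3.0, 3.0, 3.0, 3.0, 2.0]
```

The alternating 1, 2, 1, 2 pattern in `uneven` comes from overlapping kernel stamps; it appears even though the input is constant.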

## Generative Models

* Variational Autoencoders (VAE)

Training: img -> Encoder -> Latent Space -> Decoder -> img' (reconstruct image)

Generating: random latent space -> Decoder -> generated img

* Generative Adversarial Networks (GAN)

Training:

random noise -> **Generator** -> generated img

generated img -> **Discriminator** -> real/fake img

Generating:

random noise -> Generator -> generated img



## Procedures

Discriminator Training:

<img src="./images/gan_train.png" width="600"/>

Generator Training: (G wants to fool D as much as possible)

<img src="./images/gan_train2.png" width="600"/>

Both models should improve together and always be at a similar "skill" level.

## Binary Cross Entropy (BCE) Loss

Designed for real/fake classification

<img src="./images/BCE.png" width="600"/>

- m: batch size
- h: prediction made by model
- y: true label
- x: features (img)
- $\theta$: discriminator parameters

when label y = 0 -> left part = 0<br>
when label y = 1 -> left part decides: if h = 0, loss = inf; if h = 1, loss = 0

when label y = 1 -> right part = 0<br>
when label y = 0 -> right part decides: if h = 0, loss = 0; if h = 1, loss = inf

J -> 0 when label and prediction are close<br>
J -> inf when label and prediction are different
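
A minimal sketch of the loss, assuming the standard form $J(\theta) = -\frac{1}{m}\sum_i [y_i \log h_i + (1-y_i)\log(1-h_i)]$. The function name and the clamping constant `eps` are illustrative choices, not part of the original notes.

```python
import math

def bce_loss(predictions, labels, eps=1e-12):
    """Mean binary cross entropy over a batch of m predictions."""
    total = 0.0
    for h, y in zip(predictions, labels):
        h = min(max(h, eps), 1.0 - eps)  # clamp so log(0) never occurs
        # Left part is active when y = 1, right part when y = 0.
        total += y * math.log(h) + (1 - y) * math.log(1 - h)
    return -total / len(predictions)

good = bce_loss([0.9, 0.1], [1, 0])  # predictions close to the labels
bad = bce_loss([0.1, 0.9], [1, 0])   # predictions far from the labels
print(good, bad)

# Targets used in GAN training (the scores are made-up discriminator outputs):
real_scores, fake_scores = [0.8, 0.7], [0.3, 0.4]
d_loss = bce_loss(real_scores, [1, 1]) + bce_loss(fake_scores, [0, 0])
g_loss = bce_loss(fake_scores, [1, 1])  # generator wants fakes labelled real
```

The generator loss reuses the same function with flipped targets, which is how "G wants to fool D" shows up in the loss.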



### Problem with BCE

When discriminator improves too much, the function approximated by BCE loss will contain flat regions.

Flat region on the cost function = vanishing gradients

The discriminator does not output useful gradients (feedback) for the generator when the real/fake distributions are far apart.

### Mode Collapse

Mode: any peak of the density function is a mode

<img src="./images/multi-mode.png" width="600"/>

Real-world datasets have many modes related to each possible class

Mode collapse happens when the generator gets stuck in one mode

## Earth Mover's Distance (EMD)

It measures the distance between two distributions.

The amount of effort needed to make the generated distribution equal to the real distribution.

Example: amount of dirt moved * moving distance needed to transform a pile of dirt in the shape of distribution P into the shape of distribution Q.

A recursion problem: $cost_{i+1} = cost_{i} + P_i - Q_i$ with $cost_0 = 0$; the total distance is $\sum_i |cost_i|$.
<img src="https://lilianweng.github.io/lil-log/assets/images/EM_distance_discrete.png" width="800"/>



No saturation when two distributions are very far apart
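
The recursion for the discrete case can be sketched in a few lines. The example piles `P` and `Q` are illustrative values, and the function name is hypothetical; both piles are assumed to hold the same total amount of dirt.

```python
def earth_movers_distance(p, q):
    """Discrete 1-D EMD: carry surplus dirt bin by bin and sum what is moved."""
    if len(p) != len(q):
        raise ValueError("both distributions need the same number of bins")
    carried = 0.0  # dirt carried over to the next bin
    total = 0.0    # accumulated |dirt| * (one bin of distance)
    for p_i, q_i in zip(p, q):
        carried = carried + p_i - q_i
        total += abs(carried)
    return total

P = [3, 2, 1, 4]
Q = [1, 2, 4, 3]
print(earth_movers_distance(P, Q))  # 5.0
```

Shifting a pile further away increases the result linearly, which is the "no saturation" property noted above.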

## Wasserstein GAN

W-Loss = $\mathbb{E}[c(x)] - \mathbb{E}[c(g(z))]$<br>
W-Loss helps with mode collapse and vanishing gradient problems

### Condition on W-Loss

Critic should be 1-Lipschitz continuous: the norm of its gradient needs to be at most one.<br>
Needed for training stable neural networks with W-Loss.

### Weight Clipping
Force the weights of critic to a fixed interval<br>
Limits the learning ability of critic
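
A minimal sketch of weight clipping on a flat list of critic weights. The function name is hypothetical and the interval bound `c = 0.01` is an assumed default.

```python
def clip_weights(weights, c=0.01):
    """Clamp every critic weight into the fixed interval [-c, c]."""
    return [max(-c, min(c, w)) for w in weights]

print(clip_weights([-0.5, 0.005, 0.2]))  # [-0.01, 0.005, 0.01]
```

Any weight that training pushes outside the interval is pulled back after each update, which is why the critic's capacity is limited.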

### Gradient Penalty
A softer way: add a regularization term, a two-sided penalty on the gradient norm. In practice, a random interpolation between a real and a fake image is used to calculate the gradient.

<img src="./images/WGAN.png" width="700"/>
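
The penalty term can be sketched without an autograd library by assuming a linear critic $c(x) = w \cdot x$, whose gradient with respect to the input is simply $w$. The linear critic, the helper names, and the penalty weight `lam = 10` are all assumptions made for illustration; a real critic would need automatic differentiation.

```python
import math
import random

def critic(x, w):
    """Hypothetical linear critic c(x) = w . x."""
    return sum(wi * xi for wi, xi in zip(w, x))

def critic_grad_wrt_input(x, w):
    """For a linear critic the input gradient is w at every point."""
    return list(w)

def gradient_penalty(real, fake, w, lam=10.0):
    eps = random.random()  # random mixing coefficient in [0, 1)
    x_hat = [eps * r + (1 - eps) * f for r, f in zip(real, fake)]
    grad = critic_grad_wrt_input(x_hat, w)
    grad_norm = math.sqrt(sum(g * g for g in grad))
    # Two-sided: norms above and below 1 are both penalized.
    return lam * (grad_norm - 1.0) ** 2

def critic_loss(real_batch, fake_batch, w, lam=10.0):
    """Loss to minimize: E[c(fake)] - E[c(real)] + mean gradient penalty."""
    m = len(real_batch)
    mean_real = sum(critic(x, w) for x in real_batch) / m
    mean_fake = sum(critic(x, w) for x in fake_batch) / m
    gp = sum(gradient_penalty(r, f, w, lam)
             for r, f in zip(real_batch, fake_batch)) / m
    return mean_fake - mean_real + gp

print(gradient_penalty([1.0, 2.0], [0.0, 0.0], w=[3.0, 4.0]))  # norm 5 -> large penalty
print(critic_loss([[1.0, 0.0]], [[0.0, 0.0]], w=[0.6, 0.8]))   # norm 1 -> no penalty
```

Minimizing this loss is equivalent to the critic maximizing the W-Loss while keeping its gradient norm close to one.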


### Optimizer

The authors recommended **RMSProp optimizer** on the critic, rather than a momentum based optimizer such as Adam which could cause instability in the model training. No theoretical explanation yet.

### Spectral Normalization

Spectral Normalization for Generative Adversarial Networks (Miyato et al. 2018)<br>
https://arxiv.org/abs/1802.05957


## Noise vector $z$

The noise vector $z$ has the important role of making sure the images generated from the same class $y$ don't all look the same—think of it as a random seed. You generate it randomly, usually by sampling random numbers either between 0 and 1 uniformly, or from the normal distribution, which you can denote $z$ ~ $N(0, 1)$. The zero means the normal distribution has a mean of zero, and the 1 means that the normal distribution has a variance of 1. 

In reality, $z$ is usually larger than just 1 value to allow for more combinations of what $z$ could be. There's no special number that determines what works, but 100 is standard. Some researchers might use a power of 2, like 128 or 512, but again, nothing special about the number itself, just that it's large enough to contain a lot of possibilities. As a result, you would sample $z$ from that many different dimensions (constituting multiple normal distributions).

*Fun Fact: this is also called a spherical normal and denoted $z$ ~ $N(0, I)$ where the $I$ represents the identity matrix and means the variance is 1 in all dimensions.*
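
A minimal sketch of sampling a batch of noise vectors from $N(0, I)$ with the standard library. The function name is hypothetical, and `z_dim = 100` follows the convention mentioned above.

```python
import random

def sample_noise(n_samples, z_dim, seed=None):
    """Return n_samples vectors, each with z_dim independent N(0, 1) entries."""
    rng = random.Random(seed)  # seeded generator for reproducibility
    return [[rng.gauss(0.0, 1.0) for _ in range(z_dim)]
            for _ in range(n_samples)]

z = sample_noise(n_samples=4, z_dim=100, seed=0)
print(len(z), len(z[0]))  # 4 100
```

Each of the 100 entries is drawn independently, which is what the identity covariance $I$ expresses.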

## Truncation trick

So now that you're a bit familiar with noise vectors, here's another cool concept that people use to tune their outputs. It's called the truncation trick. I like to think of the truncation trick as a way of trading off fidelity (quality) and diversity in the samples. It works like this: when you randomly sample your noise vector $z$, you can choose to keep that random $z$ or you can sample another one. 

Why would you want to sample another one? 

Well, since I'm sampling $z$ from a normal distribution, my model will see more of those $z$ values within a standard deviation from the mean than those at the tails of the distribution—and this happens during training. This means that while the model is training, it's likely to be familiar with certain noise vectors and as a result model those areas coming from familiar noise vector regions. In these areas, my model will likely have much more realistic results, but nothing too funky, it's not taking as many risks in those regions mapped from those familiar noise vectors. This is the trade-off between fidelity (realistic, high quality images) and diversity (variety in images). 

The truncation trick resamples the noise vector $z$ until it falls within some bounds of the normal distribution. In fact, it samples $z$ from a truncated normal distribution where the tails are cut off at different values (red line in graph is truncated normal, blue is original). You can tune these values and thus tune fidelity/diversity. Recall that having a lot of fidelity is not always the goal: one failure mode is that you get one really real image but nothing else (no diversity), and that's not very interesting or successful from a model that's supposed to model the realm of all possible human faces or that of all possible coconuts, including that of a cat pouncing after a flying coconut (but with extremely low probability).
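
The resampling described above can be sketched as rejection sampling per entry of $z$. The function name and the `truncation` parameter are illustrative assumptions; smaller values trade diversity for fidelity.

```python
import random

def sample_truncated_noise(n_samples, z_dim, truncation, seed=None):
    """Sample N(0, 1) entries, redrawing any value with |value| > truncation."""
    rng = random.Random(seed)

    def draw():
        while True:
            value = rng.gauss(0.0, 1.0)
            if abs(value) <= truncation:
                return value  # keep only values inside the bounds

    return [[draw() for _ in range(z_dim)] for _ in range(n_samples)]

z_safe = sample_truncated_noise(n_samples=3, z_dim=10, truncation=0.5, seed=1)
print(max(abs(v) for row in z_safe for v in row))  # never above 0.5
```

A large `truncation` value (for example 5.0) leaves the distribution almost unchanged, so the same function covers the untruncated case in practice.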

## Conditional GAN

Labelled class data 

Generator:

<img src="./images/cGAN_gen.png" width="600"/>

Discriminator:

<img src="./images/cGAN.png" width="600"/>



## Controllable GAN

Generate features by manipulating the noise vector (i.e. the latent vector)

<img src="./images/z_interpolation.png" width="600"/>

$v_1 + d = v_2$

- $v_1$ shows black hair
- $v_2$ shows blue hair
- vector $d$ shows direction to modify hair color
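
The relation $v_1 + d = v_2$ can be sketched as stepping through z-space. The vectors below are made-up three-dimensional examples (real noise vectors are much larger), and the helper name is hypothetical.

```python
def move_along_direction(z, d, alpha):
    """Shift noise vector z by alpha times the direction d."""
    return [zi + alpha * di for zi, di in zip(z, d)]

v1 = [0.2, -1.0, 0.5]   # assumed to produce black hair
v2 = [0.7, -1.0, -0.3]  # assumed to produce blue hair
d = [b - a for a, b in zip(v1, v2)]  # direction that changes hair color

# Five evenly spaced points from v1 (alpha = 0) to v2 (alpha = 1).
path = [move_along_direction(v1, d, step / 4) for step in range(5)]
print(path[0], path[-1])
```

Feeding each point of `path` to the generator would give a gradual transition between the two images.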

Challenges:
* Feature correlation: beard vs man
* Z-space entanglement: one element of z is related to multiple output features (e.g. when z does not have enough dimensions)

Classifier gradients:
<img src="./images/clas_grad.png" width="600"/>

Classifiers can be used to find directions in z-space

Disentanglement in z-space: supervised methods/ unsupervised methods

E.g. Add regularization in loss function



## GAN Evaluation

GAN — How to measure GAN performance? (Hui, 2018): https://medium.com/@jonathan_hui/gan-how-to-measure-gan-performance-64b988c47732

Fidelity (quality of images) & Diversity (variety of images)

There is no universal gold-standard discriminator.

**Pixel Distance** is not reliable (e.g. shifting one pixel -> large pixel distance)

**Feature Distance** (e.g. 2 eyes, 2 ears, 5 legs, ...)

Extensively pre-trained classifiers are available to use (feature layer)<br>
The last pooling layer is the most commonly used for feature extraction

ImageNet (>14 million images, > 20,000 categories)

### Inception-v3

Architecture: input (299x299x3) -> 8x8x2048 -> pooling (8x8) -> 2048 values

Embedding of x: image -> Inception-v3 -> extracted features 

Comparing Embeddings (i.e. features)


### Frechet Inception Distance (FID)

https://nealjean.com/ml/frechet-inception-distance/

#### Frechet distance

Dog walker analogy: the least amount of leash you can give your dog without ever having to give them more slack during the walk.

Can be applied to two curves or two distributions.

Multivariate Normal Frechet Distance:<br>
$\|\mu_X - \mu_Y\|^2 + \mathrm{Tr}\left(\Sigma_X + \Sigma_Y - 2\sqrt{\Sigma_X \Sigma_Y}\right)$<br>
Tr: trace of a matrix, sum of its diagonal elements<br>
a function of the means and covariance matrices

Real and fake embeddings are two **multivariate** normal distributions

Lower FID = Closer distributions
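
A simplified sketch of the distance, assuming diagonal covariance matrices so that the matrix square root reduces to an element-wise square root. This is a simplification for illustration: the actual FID uses full covariance matrices of Inception-v3 embeddings, and the helper names and toy embeddings here are hypothetical.

```python
import math

def column_stats(embeddings):
    """Per-dimension mean and sample variance of a list of feature vectors."""
    n, dims = len(embeddings), len(embeddings[0])
    means = [sum(row[j] for row in embeddings) / n for j in range(dims)]
    variances = [sum((row[j] - means[j]) ** 2 for row in embeddings) / (n - 1)
                 for j in range(dims)]
    return means, variances

def frechet_distance_diagonal(real, fake):
    mu_x, var_x = column_stats(real)
    mu_y, var_y = column_stats(fake)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_x, mu_y))
    # Diagonal case: Tr(Sx + Sy - 2 sqrt(Sx Sy)) = sum(vx + vy - 2 sqrt(vx vy))
    cov_term = sum(vx + vy - 2.0 * math.sqrt(vx * vy)
                   for vx, vy in zip(var_x, var_y))
    return mean_term + cov_term

real = [[0.0, 0.0], [2.0, 2.0]]
shifted = [[3.0, 3.0], [5.0, 5.0]]  # same spread, means moved by 3
print(frechet_distance_diagonal(real, real))     # 0.0
print(frechet_distance_diagonal(real, shifted))  # 18.0
```

Identical sets give zero, and shifting the mean by 3 in two dimensions gives $3^2 + 3^2 = 18$ because the covariance term vanishes.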

#### FID shortcomings

* Use pre-trained inception model, which may not capture all features
* FID score is biased and needs a large sample size (a larger sample size gives a lower FID, so the GAN seems to perform better)
* Slow to run
* Limited statistics used: only mean and covariance - assumed multivariate normal

### Inception Score (IS)

Used before FID; IS is now being replaced by the more popular FID

IS uses inception model classification

Fidelity (low entropy): $p(y|x)$ is concentrated on a few classes rather than scattered across many

Diversity (high entropy): $p(y)$ is not concentrated on certain classes

**KL divergence**: $D_{KL}(p(y|x)\,\|\,p(y)) = \sum_y p(y|x) \log \left(\frac{p(y|x)}{p(y)}\right)$

from 

* Marginal distribution P(y): Diversity

to

* Conditional distribution P(y|x): Fidelity

Higher score -> better (low-entropy $p(y|x)$ that differs strongly from a high-entropy $p(y)$)
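
A minimal sketch of the score, assuming the usual definition $IS = \exp\left(\mathbb{E}_x\, D_{KL}(p(y|x)\,\|\,p(y))\right)$. The rows of `cond_probs` stand in for classifier outputs $p(y|x)$ on generated images and are made-up values.

```python
import math

def inception_score(cond_probs):
    """cond_probs[i][k] is p(y = k | x_i) for generated sample x_i."""
    n, n_classes = len(cond_probs), len(cond_probs[0])
    # Marginal p(y): average of the conditional distributions.
    marginal = [sum(p[k] for p in cond_probs) / n for k in range(n_classes)]
    kl_sum = 0.0
    for p in cond_probs:
        # Terms with p(y|x) = 0 contribute nothing to the KL divergence.
        kl_sum += sum(pk * math.log(pk / marginal[k])
                      for k, pk in enumerate(p) if pk > 0)
    return math.exp(kl_sum / n)

sharp_and_diverse = [[1.0, 0.0], [0.0, 1.0]]  # confident, covers both classes
blurry = [[0.5, 0.5], [0.5, 0.5]]             # unsure about every image
collapsed = [[1.0, 0.0], [1.0, 0.0]]          # confident, but one class only
print(inception_score(sharp_and_diverse))
print(inception_score(blurry), inception_score(collapsed))
```

With two classes the score ranges from 1 (blurry or collapsed) to 2 (sharp and diverse); in general the upper bound is the number of classes.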

#### IS shortcomings

* Can be exploited or gamed (e.g. if GAN only produces one image for one class)
* IS only looks at generated samples
* Can miss useful features


### Sampling and Truncation

GAN evaluation is sample-dependent

During training, noise vector generated from normal distribution

Truncation chops off the tail ends during testing; for higher fidelity, truncate closer to zero


### Precision and Recall

<img src="./images/precision.png" width="700"/>

State-of-the-art models are often better at recall than at precision, so the generated space ends up being a superset of the real space. This is why the truncation trick can come in handy for downstream applications: it eliminates the space outside the real distribution.
