# Generative Models and Adversaries
* Generative Models Definition
* Generative Models Taxonomy
* Fully Visible BN
* Variational Approach

## Generative Models

From training data to generated samples. Density estimation is a core problem in unsupervised learning.

Example: whichfaceisreal.com

**Maximum Likelihood**:
$\theta^* = \arg\max_{\theta} E_{x\sim p_{data}} \log p_{model}(x|\theta) $
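
As a small illustration of the maximum likelihood criterion, the sketch below fits a one-dimensional Gaussian to samples and checks that the fitted parameters score at least as well as nearby ones. The Gaussian model family, the synthetic data, and the function names are assumptions made for this example only.

```python
import math
import random

def avg_log_likelihood(data, mu, sigma):
    """Average log-density of the data under a Gaussian N(mu, sigma^2)."""
    const = -0.5 * math.log(2 * math.pi * sigma ** 2)
    return sum(const - (x - mu) ** 2 / (2 * sigma ** 2) for x in data) / len(data)

def fit_gaussian_mle(data):
    """Closed-form maximizer of avg_log_likelihood over (mu, sigma)."""
    mu = sum(data) / len(data)
    var = sum((x - mu) ** 2 for x in data) / len(data)
    return mu, math.sqrt(var)

random.seed(0)
data = [random.gauss(2.0, 0.5) for _ in range(1000)]  # synthetic stand-in for p_data samples
mu_hat, sigma_hat = fit_gaussian_mle(data)
best = avg_log_likelihood(data, mu_hat, sigma_hat)
```

Here the empirical average over `data` plays the role of the expectation over $p_{data}$, and `(mu_hat, sigma_hat)` plays the role of $\theta^*$.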

Explicit: define the density $p_{model}$ explicitly and generate samples from it.

Implicit: generate samples from $p_{model}$ without defining the density exactly.

## Taxonomy of Generative Models

The taxonomy below follows Ian Goodfellow, the inventor of GANs.

Explicit Density: 
*   Tractable Density: Fully Visible BN, Pixel RNN/CNN, Nonlinear ICA
*   Approximate Density: Variational, Markov Chain

Implicit Density:
*   Direct: GAN
*   Markov chain





## FVBNs (Fully visible BN)

*   Explicit formula based on the chain rule:<br>
$p_{model} (x) = p_{model} (x_1) \prod_{i=2}^n p_{model}(x_i|x_1,x_2,...,x_{i-1})  $<br>

* O(n) generation cost
* No control through hidden variables
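
The chain-rule factorization and the O(n) generation cost can be sketched with a toy model over binary variables. The conditional used here (a logistic function of the sum of earlier variables, with made-up weights `w` and `b`) is an assumption for illustration, not a model from the lecture.

```python
import itertools
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cond_prob_one(prefix, w=0.8, b=-0.5):
    """p(x_i = 1 | x_1..x_{i-1}) for a toy model: a logistic function of the prefix sum."""
    return sigmoid(b + w * sum(prefix))

def log_prob(x):
    """log p(x) as a sum of n conditional log-probabilities (chain rule)."""
    total = 0.0
    for i, xi in enumerate(x):
        p1 = cond_prob_one(x[:i])
        total += math.log(p1 if xi == 1 else 1.0 - p1)
    return total

def sample(n, rng):
    """Ancestral sampling: n sequential steps, one per variable."""
    x = []
    for _ in range(n):
        x.append(1 if rng.random() < cond_prob_one(x) else 0)
    return x

rng = random.Random(0)
x = sample(5, rng)
# Summing p(x) over all 2^5 configurations should give 1: the density is explicit and normalized.
total_mass = sum(math.exp(log_prob(list(bits))) for bits in itertools.product([0, 1], repeat=5))
```

Each variable needs all earlier ones before it can be drawn, which is why generation takes n sequential steps.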

**Language Model**: a probability distribution over sequences of words. Given a sequence, say of length m, it assigns a probability to the whole sequence.

$$P(W=\text{SpaceX will } \textit{take} \text{ me to Mars}) = p_1$$
$$P(W=\text{SpaceX will } \textit{bake} \text{ me to Mars}) = p_2$$

$p_2 \ll p_1$<br>

The chain rule is used to estimate it:

$P(W) = P(\text{SpaceX}) P(\text{will}| \text{SpaceX}) \cdots P(\text{Mars}| \text{SpaceX will take me to})$
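
A minimal sketch of this product of conditionals, estimating each factor from prefix counts. The three-sentence corpus is made up for this example, and the estimator is unsmoothed, so any unseen continuation gets probability zero; practical language models smooth or parameterize these conditionals instead.

```python
from collections import Counter

corpus = [  # toy corpus, invented for illustration
    "SpaceX will take me to Mars",
    "SpaceX will take me to orbit",
    "SpaceX will bake me a cake",
]

# Count every sentence prefix so that P(word | history) = count(history + word) / count(history).
prefix_counts = Counter()
for sentence in corpus:
    words = sentence.split()
    for i in range(len(words) + 1):
        prefix_counts[tuple(words[:i])] += 1

def sentence_prob(sentence):
    """P(W) as a product of conditionals, one per word, via the chain rule."""
    words = sentence.split()
    prob = 1.0
    for i, _ in enumerate(words):
        history, extended = tuple(words[:i]), tuple(words[:i + 1])
        if prefix_counts[history] == 0:
            return 0.0  # history never observed, so the conditional is undefined here
        prob *= prefix_counts[extended] / prefix_counts[history]
    return prob

p1 = sentence_prob("SpaceX will take me to Mars")
p2 = sentence_prob("SpaceX will bake me to Mars")
```

On this corpus `p2` comes out smaller than `p1`, matching the intuition that the "bake" sentence is far less likely.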

## Pixel RNN

* Pick a starting pixel, typically the top-left corner, and sample it
* Connect to neighbor pixels with RNN
* Generate neighbors
* Continue

## WaveNet

* A generative model for raw audio. Oord et al. (2016)
* Input -> hidden layers (Conv) -> a single time step output
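
The WaveNet paper builds its hidden layers from causal, dilated convolutions, so each output time step depends only on current and past samples. The sketch below shows that property with plain lists; the filter taps are hypothetical, and nonlinearities, gating, and residual connections are left out.

```python
def causal_dilated_conv(signal, weights, dilation):
    """out[t] = sum_k weights[k] * signal[t - k * dilation], treating samples before t=0 as zero."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * signal[idx]
        out.append(acc)
    return out

def stack(signal, layer_weights):
    """Apply layers with dilations 1, 2, 4, ... so the receptive field grows with depth."""
    h = signal
    for layer, weights in enumerate(layer_weights):
        h = causal_dilated_conv(h, weights, dilation=2 ** layer)
    return h

audio = [0.1, 0.4, -0.2, 0.3, 0.0, -0.5, 0.2, 0.6]
layers = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]  # hypothetical filter taps, kernel size 2
out = stack(audio, layers)

# Changing only future samples leaves earlier outputs untouched.
changed = audio[:5] + [9.0, 9.0, 9.0]
out_changed = stack(changed, layers)
```

With three layers of kernel size 2, the last output sees 8 input samples, while `out[:5]` is unaffected by edits to samples 5 onward.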

## Variational Autoencoder

The idea is to replace the chain-rule product over sequential variables with a latent (hidden) variable $z$ that generates the output.

$\displaystyle p_{model} (x) = p_{model} (x_1) \prod_{i=2}^n p_{model}(x_i|x_1,x_2,...,x_{i-1})  $<br>
becomes<br>
$\displaystyle p_{model} (x) = \int p_{model}(z)\, p_{model}(x|z)\,dz  $<br>
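
To make the integral over $z$ concrete, the sketch below estimates it by Monte Carlo for a toy model chosen so that the answer is known in closed form: $z \sim N(0, 1)$ and $x|z \sim N(z, 1)$, which gives $p(x) = N(0, 2)$. Both distributions are assumptions for this example. For a neural network decoder no closed form exists and naive sampling scales poorly, which is what motivates the variational approximation.

```python
import math
import random

def normal_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def marginal_mc(x, num_samples, rng):
    """Estimate p(x) = E_{z ~ p(z)}[p(x|z)] by averaging p(x|z) over prior samples."""
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)          # z ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, 1.0)   # p(x|z) = N(x; z, 1)
    return total / num_samples

rng = random.Random(0)
estimate = marginal_mc(1.0, 20000, rng)
exact = normal_pdf(1.0, 0.0, math.sqrt(2.0))  # closed form for this toy model only
```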


**Structurally**: 
Input x -> encoder -> hidden z -> decoder -> generated output (loss)

* Choose a density family, e.g. Gaussian
* Define a network to generate the conditional

$p_{model}(z|x) = p_{model}(x|z)p_{model}(z)/p_{model}(x) $

$p_{model}(x) $ is intractable, so the posterior has to be approximated. The solution: the encoder maps the input $x$ to $z$; $z$ is then mapped to a Gaussian mean and standard deviation, and $x|z$ is sampled from that Gaussian distribution.

**Putting it all together**

<div class="verticalhorizontal">
    <img src="images/9_1.png" width ="450" height="300" alt="centered image" />
</div>

### Unsupervised Learning with NN

**Data**:
{x}, where x are the inputs; no labels
* Can have much more data
* Challenge: Cost?

Neural Network Goal:
Learn the structure of the data
* Has the potential to learn about the real world
* Challenge: Optimization?

### Kullback-Leibler (KL) Divergence

Discrete case: $D_{KL}(P||Q) = \sum_x P(x) \log\left(\frac{P(x)}{Q(x)}\right) $

Continuous case: $D_{KL}(P||Q) = \int_{-\infty}^{\infty} P(x) \log\left(\frac{P(x)}{Q(x)}\right)\,dx $
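
A minimal sketch of the discrete form, assuming both distributions are given as lists over the same outcomes and that $Q(x) > 0$ wherever $P(x) > 0$. The example values are made up.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as equal-length lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:  # terms with P(x) = 0 contribute 0 by convention
            total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
forward = kl_divergence(p, q)
reverse = kl_divergence(q, p)
```

The two directions differ, which is a reminder that KL divergence is not symmetric and so is not a distance.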

### Log Likelihood Expression

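$\log p^{\theta}_{model}(x^{(i)}) = E_{z\sim q^{\phi}_{model}(z|x^{(i)})}[\log p^{\theta}_{model}(x^{(i)}|z)] - D_{KL}[q^{\phi}_{model}(z|x^{(i)})||p^{\theta}_{model}(z)] + D_{KL}[q^{\phi}_{model}(z|x^{(i)})||p^{\theta}_{model}(z|x^{(i)})] $

The last term contains the intractable posterior $p^{\theta}_{model}(z|x^{(i)})$ but is non-negative, so the first two terms form a lower bound on the log likelihood. The loss is the negative of that bound:

Loss$(\theta, \phi, x^{(i)}) = -E_{z\sim q^{\phi}_{model}(z|x^{(i)})}[\log p^{\theta}_{model}(x^{(i)}|z)] + D_{KL}[q^{\phi}_{model}(z|x^{(i)})||p^{\theta}_{model}(z)] $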
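
The two computable pieces of the loss, the reconstruction term and the KL divergence to the prior, can be sketched for a one-dimensional latent. Everything specific here is an assumption for illustration: the encoder output is given directly as `mu` and `sigma`, the prior is $N(0, 1)$, and `decoder_mean` is a fixed linear map standing in for a neural network with unit output variance.

```python
import math
import random

def kl_to_standard_normal(mu, sigma):
    """Closed form of D_KL[N(mu, sigma^2) || N(0, 1)] for a one-dimensional latent."""
    return 0.5 * (mu ** 2 + sigma ** 2 - 1.0) - math.log(sigma)

def gaussian_log_pdf(x, mean, std):
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)

def decoder_mean(z):
    """Hypothetical decoder: a fixed linear map standing in for a neural network."""
    return 2.0 * z + 1.0

def vae_loss(x, mu, sigma, rng, num_samples=1000):
    """-E_{z ~ q(z|x)}[log p(x|z)] + D_KL[q(z|x) || p(z)], expectation by Monte Carlo."""
    recon = 0.0
    for _ in range(num_samples):
        z = rng.gauss(mu, sigma)                            # z ~ q(z|x) = N(mu, sigma^2)
        recon += gaussian_log_pdf(x, decoder_mean(z), 1.0)  # log p(x|z), unit variance
    recon /= num_samples
    return -recon + kl_to_standard_normal(mu, sigma)

rng = random.Random(0)
good = vae_loss(3.0, mu=1.0, sigma=0.1, rng=rng)   # decoder_mean(1.0) = 3.0 reconstructs x
bad = vae_loss(3.0, mu=-2.0, sigma=0.1, rng=rng)   # latent far from the value that explains x
```

An encoder output that lets the decoder reconstruct $x$ gives a lower loss than one that does not, while the KL term keeps the encoder distribution from drifting arbitrarily far from the prior.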



## GANs
* Use Latent Variables
* Asymptotically Consistent (unlike VAE)
* Potentially Can Reach Global Optimum
* No Markov Chains Needed

* Unprincipled
* Could Take Long Time to Converge

**GANs** - Two player game
* Instead of sampling from high-dimensional, complex and unknown distribution
* Sample from simple distribution, e.g. normal distribution (random noise) and find the transformation to the distribution we want to learn
* Learn the transformation using NNs

* Generator - try to generate samples and present them as real world and fool the discriminator
* Discriminator - try to distinguish between real and generated (fake) images



### Generator Network 

z (simple distribution) -> Neural Network G (CNNs, RNNs, or other) -> Output Sample $x^G$

Training data has distribution $p_{data}$. Sample $x\sim p_{data}.$  Require: the output sample $x^G$ has the same dimensions as $x$ and follows a distribution close to $p_{data}$.

### Discriminator 

Input sample x -> Neural Network D  -> 0/1 

Receives an input of the same dimension as samples from $p_{data}$.

Determines whether the sample is from $p_{data}$ (1) or not (0).


### Competing Cost Functions

$$J^{(D)}= -\frac{1}{2}\mathbb{E}_{x\sim p_{data}}\log D_{\theta_d}(x)-\frac{1}{2}\mathbb{E}_{z\sim p_{model}}\log(1-D_{\theta_d}(G_{\theta_g}(z))) \\ J^{(G)}=-J^{(D)}\\\\
\text{or,} \quad J^{(D)}= -\frac{1}{2}\int p_{data}(x)\log D(x)\,dx-\frac{1}{2}\int p_{model}(x)\log(1-D(x))\,dx$$

Question: 
* Instead of $\theta_d$ assume that we optimize D(x) for every value of x. What would be the optimal strategy for D(x)?
* What assumptions are needed?

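Setting the derivative of the integrand with respect to $D(x)$ to zero at each $x$:

$$\frac{\partial J^{(D)}}{\partial D(x)} = -\frac{1}{2}\left(\frac{p_{data}(x)}{D(x)} + \frac{p_{model}(x)}{D(x)-1}\right)=0
\\
\rightarrow\; D(x) = \frac{p_{data}(x)}{p_{model}(x)+p_{data}(x)} $$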

Assumption: $p_{model},\, p_{data}$ are nonzero everywhere

Equilibrium: $p_{model}=p_{data}$ then, $D(x)=\frac{1}{2} $
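
The optimal discriminator can be checked numerically at a single point $x$: a brute-force search over $D \in (0, 1)$ should land on $p_{data}/(p_{model}+p_{data})$. The density values below are hypothetical numbers picked for the check.

```python
import math

def pointwise_cost(d, p_data, p_model):
    """Integrand of J^(D) at one x: -1/2 [p_data log D + p_model log(1 - D)]."""
    return -0.5 * (p_data * math.log(d) + p_model * math.log(1.0 - d))

def best_d_on_grid(p_data, p_model, steps=9999):
    """Brute-force minimizer of the pointwise cost over D in (0, 1)."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return min(grid, key=lambda d: pointwise_cost(d, p_data, p_model))

p_data, p_model = 0.6, 0.2                   # hypothetical density values at a single point x
numeric = best_d_on_grid(p_data, p_model)
closed_form = p_data / (p_data + p_model)
equilibrium = best_d_on_grid(0.3, 0.3)       # case p_model = p_data
```

The grid search agrees with the closed form up to the grid resolution, and the equal-density case returns a value at one half.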

## Minimax Game Optimization

$$\min_{\theta_g}\max_{\theta_d}\left[E_{x\sim p_{data}}\log D_{\theta_d}(x)+E_{z\sim p_{model}}\log(1-D_{\theta_d}(G_{\theta_g}(z)))\right] $$

Solution:
* Saddle point in the parameter space (Nash Equilibrium)
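* With the optimal discriminator, the generator's objective is equivalent, up to constants, to minimizing the Jensen-Shannon divergence between $p_{data}$ and $p_{model}$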
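
The link to the Jensen-Shannon divergence can be worked out by substituting the optimal discriminator $D(x) = \frac{p_{data}}{p_{model}+p_{data}}$ into the bracketed objective, written here as $V$ (a name introduced for this derivation), with $m = \frac{1}{2}(p_{data}+p_{model})$:

$$\begin{aligned}
V &= E_{x\sim p_{data}}\log\frac{p_{data}(x)}{p_{data}(x)+p_{model}(x)} + E_{x\sim p_{model}}\log\frac{p_{model}(x)}{p_{data}(x)+p_{model}(x)} \\
&= -\log 4 + D_{KL}(p_{data}||m) + D_{KL}(p_{model}||m) \\
&= -\log 4 + 2\,JSD(p_{data}||p_{model})
\end{aligned}$$

Since the Jensen-Shannon divergence is non-negative and zero only when the two distributions match, the generator's minimum is $-\log 4$, reached at $p_{model}=p_{data}$.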


* **Gradient ascent** for the discriminator on $J^{(D)}$, **gradient descent** for the generator on $J^{(G)}$ (here $J^{(D)}$ is written as the quantity the discriminator maximizes)
$$J^{(D)}=\frac{1}{2}\mathbb{E}_{x\sim p_{data}}\log D_{\theta_d}(x)+\frac{1}{2}\mathbb{E}_{z\sim p_{model}}\log(1-D_{\theta_d}(G_{\theta_g}(z)))\\$$
$$J^{(G)}=\frac{1}{2}\mathbb{E}_{z\sim p_{model}}\log(1-D_{\theta_d}(G_{\theta_g}(z))) $$

### Saturation 

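The generator's gradient is small when $D(G(z))$ is near 0, that is, when the discriminator confidently rejects generated samples, so learning is slow. The solution is to use gradient ascent for both players, with the generator maximizing $\log D(G(z))$ instead:
$$J^{(D)}=\frac{1}{2}\mathbb{E}_{x\sim p_{data}}\log D_{\theta_d}(x)+\frac{1}{2}\mathbb{E}_{z\sim p_{model}}\log(1-D_{\theta_d}(G_{\theta_g}(z)))\\$$
$$J^{(G)}=\frac{1}{2}\mathbb{E}_{z\sim p_{model}}\log(D_{\theta_d}(G_{\theta_g}(z))) $$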
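
The saturation effect can be seen by comparing the two generator signals as functions of the discriminator's logit. The sketch assumes $D = \sigma(a)$ with $a$ the logit; the function names and the value $a = -6$ are chosen for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da log(1 - sigmoid(a)): the generator signal under the minimax loss."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da log(sigmoid(a)): the generator signal under the alternative loss."""
    return 1.0 - sigmoid(a)

logit = -6.0  # discriminator is confident the sample is fake, so D(G(z)) is close to 0
weak = grad_saturating(logit)
strong = grad_non_saturating(logit)
```

When the discriminator is confident, the minimax signal nearly vanishes while the alternative stays close to 1 in magnitude, so the generator keeps learning early in training.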

### DCGAN

Fully connected -> convolutional layer -> generated image

<div class="verticalhorizontal">
    <img src="images/9_2.png" width ="450" height="300" alt="centered image" />
</div>

### Similarity in Hidden Space

<div class="verticalhorizontal">
    <img src="images/9_3.png" width ="450" height="300" alt="centered image" />
</div>

### Text -> Image Synthesis
<div class="verticalhorizontal">
    <img src="images/9_4.png" width ="450" height="300" alt="centered image" />
</div>

### Cycle GAN

<div class="verticalhorizontal">
    <img src="images/9_5.png" width ="450" height="300" alt="centered image" />
</div>

### Pix2Pix

<div class="verticalhorizontal">
    <img src="images/9_6.png" width ="450" height="300" alt="centered image" />
</div>

### The GAN Zoo

GitHub list of GAN examples: github.com/hindupuravinash/the-gan-zoo

### Relation with Reinforcement Learning

Pseudo Algorithm:

1. Take samples $z$ from the simple distribution as input and do a forward pass to get $G(z)$, i.e. $x^G$. 

2. Shuffle the generated samples $x^G$ with real images $x$. Record the indices and labels for computing the loss later.

3. Forward pass the combined, shuffled data through the discriminator to get $D(x)$ and $D(G(z))$. Compute the loss. 

4. For the backward pass, first compute the gradient of $J^{(D)}$ with respect to $\theta_d$. Then compute the gradient of $J^{(G)}$ with respect to $\theta_g$.

5. Update $\theta_d$, $\theta_g$  
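
The five steps can be sketched as one training step on a one-dimensional toy problem. All specifics are assumptions for illustration: a linear generator $G(z) = az + b$, a logistic discriminator $D(x) = \sigma(wx + c)$, hand-derived gradients, full-batch averages, and the non-saturating generator objective with gradient ascent for both players.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def d_objective(theta_d, real, fake):
    """J^(D) = 1/2 mean log D(x) + 1/2 mean log(1 - D(x^G)), to be maximized."""
    w, c = theta_d
    real_term = sum(math.log(sigmoid(w * x + c)) for x in real) / len(real)
    fake_term = sum(math.log(1.0 - sigmoid(w * x + c)) for x in fake) / len(fake)
    return 0.5 * real_term + 0.5 * fake_term

def g_objective(theta_g, theta_d, noise):
    """J^(G) = 1/2 mean log D(G(z)), the non-saturating objective, to be maximized."""
    (a, b), (w, c) = theta_g, theta_d
    return 0.5 * sum(math.log(sigmoid(w * (a * z + b) + c)) for z in noise) / len(noise)

def train_step(theta_g, theta_d, real, noise, lr=0.01):
    (a, b), (w, c) = theta_g, theta_d
    fake = [a * z + b for z in noise]                        # step 1: x^G = G(z)
    batch = [(x, 1) for x in real] + [(x, 0) for x in fake]  # step 2: label, then shuffle
    random.shuffle(batch)  # mirrors step 2; it does not change the batch averages below
    grad_w = grad_c = 0.0
    for x, label in batch:                                   # steps 3-4: gradient of J^(D)
        err = label - sigmoid(w * x + c)
        n = len(real) if label == 1 else len(fake)
        grad_w += 0.5 * err * x / n
        grad_c += 0.5 * err / n
    grad_a = grad_b = 0.0
    for z in noise:                                          # step 4: gradient of J^(G)
        up = 0.5 * (1.0 - sigmoid(w * (a * z + b) + c)) * w / len(noise)
        grad_a += up * z
        grad_b += up
    # step 5: gradient ascent on both objectives, using the old parameters for both gradients
    return (a + lr * grad_a, b + lr * grad_b), (w + lr * grad_w, c + lr * grad_c)

random.seed(0)
real = [random.gauss(4.0, 1.0) for _ in range(64)]   # synthetic stand-in for p_data
noise = [random.gauss(0.0, 1.0) for _ in range(64)]  # z from a simple distribution
theta_g, theta_d = (1.0, 0.0), (0.1, 0.0)
new_g, new_d = train_step(theta_g, theta_d, real, noise)
```

In practice `train_step` would be called in a loop with fresh minibatches, and both networks would be deep models trained with automatic differentiation rather than hand-written gradients.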