## Variational autoencoder

Example architecture:
![Alt text](imgs/image-3.png)

#### What: 
- a neural network that learns to reconstruct its input data 

- encoder: maps its input to a lower-dimensional representation (latent) space

    - the representation vector is sampled from probability distributions (e.g. Gaussian, multinomial), one modelled for each dimension of the representation space 


- decoder: maps the sampled representation vector back to the output 

- model learns by minimising difference between input and output

- For example: 
    - input: cat image (28*28)
    - encoder encodes the input into a 4-dimensional representation: 
        - e.g. fur, size, species, ears
        - the 4D vector (representation vector) is sampled from the estimated probability distribution of each dimension
    - output: decoder predicted cat image (28*28) 

#### So what: 
- regularised, non-linear dimensionality reduction: inputs are mapped to a lower-dimensional space whose distribution is constrained by prior distributions

#### How: 
- uses variational inference (approximating an intractable distribution by optimising the parameters of a simpler proxy distribution) to learn the representation space 
    - the true posterior p(z|x) = p(x|z)p(z) / p(x) is intractable because of the evidence p(x), so the encoder learns an approximation q(z|x) by estimating the parameters (e.g. mean and standard deviation) of a distribution for each dimension
    - the encoder produces the final representation vector z by sampling from the distribution of each dimension

- model: 
    - encoder (function)
    - prior distributions
    - decoder (inverse function)
    - output layer (depending on required output format)
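- loss function (reconstruction term + KL divergence term): 
    - reconstruction: MSE, or binary cross entropy when inputs and outputs lie in [0, 1]
    - KL divergence between the encoder's distribution (mu, sigma) and the Gaussian prior p(z)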
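
A VAE is usually trained on the sum of a reconstruction term and a KL divergence term. The sketch below is one way to write that loss in PyTorch. It assumes a diagonal Gaussian q(z|x) = N(mu, sigma^2) and a standard normal prior p(z) = N(0, I); the function name `vae_loss` and the `reconstruction` switch are made up for illustration and are not taken from the references.

```python
import torch
import torch.nn.functional as F


def vae_loss(x, x_reconstructed, mu, sigma, reconstruction="mse"):
    """Negative ELBO (up to constants), summed over the batch.

    Assumes q(z|x) = N(mu, sigma^2) with diagonal covariance and a
    standard normal prior p(z) = N(0, I). sigma must be non-zero.
    """
    if reconstruction == "mse":
        recon = F.mse_loss(x_reconstructed, x, reduction="sum")
    else:
        # binary cross entropy requires x and x_reconstructed in [0, 1]
        recon = F.binary_cross_entropy(x_reconstructed, x, reduction="sum")
    # closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions
    kl = 0.5 * torch.sum(sigma.pow(2) + mu.pow(2) - 1.0 - torch.log(sigma.pow(2)))
    return recon + kl
```

The KL term is zero exactly when mu = 0 and sigma = 1, i.e. when the encoder's distribution matches the prior, which is what makes it act as a regulariser.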


#### Why / Why not : 
- Non-deterministic representation 
- Regularisation / inductive bias: the prior distributions p(z) encode our prior belief about how the representation is distributed, and pulling the encoder's distributions towards them regularises / smooths the learned representation

- Extension: 
    - Introducing Disentanglement in VAE: to further ensure independence between dimensions in representation space

Reference: 
https://www.youtube.com/watch?v=WYrvh50yu6s&t=1324s (example taken from here)

Further reading: 
vae paper: https://arxiv.org/pdf/1312.6114.pdf
variational inference: https://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/variational-inference-i.pdf

### Questions: 

Reparameterisation trick: 
- ref: https://www.youtube.com/watch?v=vy8q-WnHa9A

Why/challenge: 
- backpropagation cannot pass through a random sampling step: if z is drawn directly from a distribution whose parameters (e.g. mu and sigma in a VAE) are what we are trying to learn, the sample has no usable gradient with respect to those parameters

- hence the need to move the stochasticity out of the path between the parameters and z to enable backpropagation

What: To sample z ~ N(mu, sigma^2) where mu and sigma are unknown (being learned), express z as a deterministic function of mu, sigma and an auxiliary noise variable epsilon that follows a known distribution: z = mu + sigma * epsilon, with epsilon ~ N(0, 1)
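
A minimal PyTorch sketch of the trick, assuming hand-picked values for mu and sigma in place of real encoder outputs and an arbitrary downstream loss (z squared, summed), chosen only so the gradients are easy to check by hand:

```python
import torch

# Stand-ins for encoder outputs; in a VAE these come from the network.
mu = torch.tensor([0.5, -1.0], requires_grad=True)
sigma = torch.tensor([1.0, 2.0], requires_grad=True)

# All randomness lives in epsilon, which has no learnable parameters.
epsilon = torch.randn_like(sigma)  # epsilon ~ N(0, 1)
z = mu + sigma * epsilon           # z ~ N(mu, sigma^2), differentiable in mu and sigma

loss = z.pow(2).sum()              # arbitrary downstream function of z
loss.backward()

# d loss / d mu = 2 * z and d loss / d sigma = 2 * z * epsilon
print(mu.grad, sigma.grad)
```

Because epsilon is treated as a constant input during the backward pass, autograd sees an ordinary deterministic expression and both mu and sigma receive gradients.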

How - intuition: 
![Alt text](image.png) 
![Alt text](image-1.png)
 

Derivation (why it works)
- TODO: 
- (Leibniz integral rule)
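
A sketch of the standard argument, not a full derivation. The notation is assumed for illustration: phi stands for the encoder parameters that produce mu and sigma, f is any differentiable function of z (e.g. the reconstruction loss), and L is the number of noise samples.

```latex
% Sampling z directly: the parameters phi sit inside the sampling density,
% so the gradient is not an expectation of \nabla_\phi f.
\nabla_\phi \, \mathbb{E}_{z \sim q_\phi(z \mid x)}\big[f(z)\big]
  = \nabla_\phi \int q_\phi(z \mid x)\, f(z)\, dz

% Reparameterise: z = \mu_\phi + \sigma_\phi \epsilon, \epsilon \sim \mathcal{N}(0, 1)
\mathbb{E}_{z \sim q_\phi(z \mid x)}\big[f(z)\big]
  = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, 1)}\big[f(\mu_\phi + \sigma_\phi \epsilon)\big]

% The density of epsilon does not depend on phi, so the gradient can be
% moved inside the integral (Leibniz integral rule, under regularity conditions).
\nabla_\phi \, \mathbb{E}_{\epsilon}\big[f(\mu_\phi + \sigma_\phi \epsilon)\big]
  = \mathbb{E}_{\epsilon}\big[\nabla_\phi f(\mu_\phi + \sigma_\phi \epsilon)\big]
  \approx \frac{1}{L} \sum_{l=1}^{L} \nabla_\phi f\big(\mu_\phi + \sigma_\phi \epsilon^{(l)}\big)
```

The last line is what backpropagation computes in practice, often with a single noise sample per input (L = 1).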

Notes: 
- reparameterisation: expressing a sample from a distribution with unknown parameters as a deterministic transformation of a sample from another, known distribution


#### Questions: 

- In what sense are unknown probability distribution parameters a problem for backpropagation? (to review derivation): 
    - the problem is that f(x) follows a distribution with unknown parameters, hence we substitute f(x) with g(eps), where eps follows a known distribution: eps ~ N(0, 1)
    - qn: but how does this help in backprop, isn't backprop still seeking to estimate the parameters? / does the introduction of eps allow mu and sigma to vary and hence be optimised?

- how come it's sufficient to only parameterise sigma?

#### Implementation v0: 
- ref: https://www.youtube.com/watch?v=VELQT1-hILo


In [1]:
import torch 
import torch.nn.functional as F
from torch import nn

In [7]:
# model 
# img -> hidden dim -> mean, std -> reparameterisation trick (for sampling) -> decoder -> output img
class VariationalAutoEncoder(nn.Module):
    def __init__(self, input_dim, h_dim=200, z_dim=20): 
        super().__init__()
        # encoder
        self.img_2hid = nn.Linear(input_dim, h_dim)
        self.hid_2mu = nn.Linear(h_dim, z_dim)
        self.hid_2sigma = nn.Linear(h_dim, z_dim)

        # decoder
        self.z_2hid = nn.Linear(z_dim, h_dim)
        self.hid_2img = nn.Linear(h_dim, input_dim)

        self.relu = nn.ReLU() #inplace ? 

    def encode(self, x):
        #hidden layer 
        h = self.relu(self.img_2hid(x))
        # distribution parameters: mu, sigma (sigma is unconstrained here, so it can be negative)
        mu, sigma = self.hid_2mu(h), self.hid_2sigma(h)
        return mu, sigma
    
    def decode(self, z):
        h = self.relu(self.z_2hid(z))
        # ensure output range similar to input range
        pred_x = torch.sigmoid(self.hid_2img(h))
        return pred_x

    def forward(self, x): 
        # encode
        mu, sigma = self.encode(x)
        # sample z using the reparameterisation trick
        # epsilon ~ N(0, 1): standard normal noise with the same shape as sigma
        epsilon = torch.randn_like(sigma)
        z_reparameterised = mu + sigma*epsilon
        # decode
        x_reconstructed = self.decode(z_reparameterised)

        # return values to be optimised for: 
        # mse: x vs x_reconstructed
        # kl divergence: mu, sigma vs gaussian
        return x_reconstructed, mu, sigma


In [9]:
x = torch.randn(4, 28*28)
vae = VariationalAutoEncoder(input_dim=784)
# how does this know which method to run? nn.Module.__call__ dispatches to forward()
print(len(vae(x)))


3


In [None]:
# Training loop to test model 
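
A possible shape for this training loop, as a sketch under stated assumptions rather than the loop from the reference video. `TinyVAE` is a hypothetical, smaller stand-in that returns the same `(x_reconstructed, mu, sigma)` tuple as `VariationalAutoEncoder`; the data is random pixels rather than real images; and binary cross entropy plus the closed-form Gaussian KL term is one common choice of loss. The `1e-8` inside the log is an added guard against sigma being exactly zero.

```python
import torch
import torch.nn.functional as F
from torch import nn


class TinyVAE(nn.Module):
    """Hypothetical stand-in with the same interface as VariationalAutoEncoder."""

    def __init__(self, input_dim, h_dim=32, z_dim=4):
        super().__init__()
        self.enc = nn.Linear(input_dim, h_dim)
        self.mu = nn.Linear(h_dim, z_dim)
        self.sigma = nn.Linear(h_dim, z_dim)
        self.dec = nn.Sequential(
            nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, input_dim), nn.Sigmoid()
        )

    def forward(self, x):
        h = F.relu(self.enc(x))
        mu, sigma = self.mu(h), self.sigma(h)
        z = mu + sigma * torch.randn_like(sigma)  # reparameterisation trick
        return self.dec(z), mu, sigma


def train(model, data, epochs=5, batch_size=16, lr=1e-3):
    """Run a few epochs of Adam and return the mean loss per sample for each epoch."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    history = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for start in range(0, data.shape[0], batch_size):
            x = data[start:start + batch_size]
            x_reconstructed, mu, sigma = model(x)
            # reconstruction term: x vs x_reconstructed
            recon = F.binary_cross_entropy(x_reconstructed, x, reduction="sum")
            # KL( N(mu, sigma^2) || N(0, 1) ), summed over batch and latent dims
            kl = 0.5 * torch.sum(
                sigma.pow(2) + mu.pow(2) - 1.0 - torch.log(sigma.pow(2) + 1e-8)
            )
            loss = recon + kl
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        history.append(epoch_loss / data.shape[0])
    return history


torch.manual_seed(0)
fake_images = torch.rand(64, 28 * 28)  # random pixels in [0, 1], not real data
losses = train(TinyVAE(input_dim=28 * 28), fake_images)
print(losses)
```

Swapping `TinyVAE` for `VariationalAutoEncoder(input_dim=784)` and `fake_images` for flattened image batches scaled to [0, 1] should work without changing `train`, since only the three returned tensors are used.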