# Var-Next - Next frame prediction using variational autoencoders
***
Shantanu Sanyal

University of Virginia


## Introduction
Introduced by Kingma and Welling in 2013, variational autoencoders (VAEs) offer a powerful framework for learning complex data distributions by introducing stochasticity into the encoding process.  The VAE is an example of a latent variable model; the encoder's task is to learn a latent representation of the input images.  Rather than mapping an image to a single point, the encoder outputs the parameters of a well-defined distribution, and an image's latent representation is simply a sample drawn from that distribution.  Most VAEs use a normal distribution because it is easy to sample from.

The latent space itself can be considered a Riemannian manifold which we can traverse.  Prior works [1, 2](#Refrences) have explored this concept with varying levels of success, but only on relatively simple datasets (Celeb, MNIST) whose feature spaces are small because the inputs are thumbnail-sized images.  In the following pages I explore whether the concept generalizes to much larger images with substantially more complexity.  In particular, I want to see whether latent space traversal can interpolate the motion of objects in images.


## The data (part 1)
I opted to construct my own dataset.  I wanted a scenario where the scene was stable, so that most of the pixel change between images was attributable to objects moving.  I ended up finding a YouTube channel, __[Bob's PA wildlife](https://www.youtube.com/@joeb302)__, which contains many hours of footage of animals moving around, recorded by a game camera statically mounted facing a small stream.  While the seasons change (and bears occasionally knock over the camera, slightly changing its orientation), the video footage is pretty similar from one frame to the next, at least from a statistical standpoint.  I downloaded many of the compilations on the channel, cut out undesirable frames and the intro/outro segments in ClipChamp, and dumped the video to 360p images using ffmpeg.  An example image is below:

## Variational Autoencoder architecture
### The big idea
Recall that for a normal random variable $X$, shifting the variable by an amount $k$ simply shifts the mean of $X$ by $k$.  Also recall that if $k$ is instead a (positive) scaling factor, $kX$ has its parameters scaled to $k\sigma_X$ and $k\mu_X$.  Thus, if our VAE learns the parameters of a normal distribution $\mathcal{N}(\mu,\sigma)$, we can sample from it by computing $\mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, 1)$, without drawing from that distribution directly.  This forms the basis of the so-called "re-parameterization trick": since the randomness in our model comes from a fixed $\mathcal{N}(0,1)$ distribution and not directly from the mean and standard deviation parameters we are learning, the sample is a differentiable function of those parameters, so we can backpropagate through it and update the weights which control the encoder's output.
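
As a rough illustration of the sampling step described above, here is a minimal sketch in plain Python.  It is not the project's code: the function name `reparameterize` is made up for this example, plain lists stand in for the encoder's output tensors, and the encoder is assumed to output a log-variance rather than a standard deviation.  In a real model the same arithmetic would run on tensors in an autodiff framework so that gradients can flow back to the mean and log-variance.

```python
import math
import random


def reparameterize(mu, logvar, rng=random):
    """Return z = mu + sigma * eps with eps ~ N(0, 1), element-wise.

    `mu` and `logvar` are plain lists standing in for encoder outputs
    (hypothetical names, not taken from the project's code).
    """
    z = []
    for m, lv in zip(mu, logvar):
        sigma = math.exp(0.5 * lv)  # log-variance -> standard deviation
        eps = rng.gauss(0.0, 1.0)   # all randomness comes from a fixed N(0, 1)
        z.append(m + sigma * eps)   # shift and scale the standard normal draw
    return z


# Sanity check: samples should follow N(mu, sigma) in each dimension.
rng = random.Random(0)
mu = [2.0, -1.0]
logvar = [math.log(0.25), 0.0]  # standard deviations 0.5 and 1.0
samples = [reparameterize(mu, logvar, rng) for _ in range(20000)]

mean0 = sum(s[0] for s in samples) / len(samples)
std0 = math.sqrt(sum((s[0] - mean0) ** 2 for s in samples) / len(samples))
print(mean0, std0)  # should land close to 2.0 and 0.5
```

The point of the sketch is that `eps` is the only random quantity, and it does not depend on anything being learned; `mu` and `logvar` enter only through ordinary arithmetic.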

This is what makes the VAE tick - the rest of the model is mostly the same as a regular autoencoder.  One additional point which should be discussed is the loss function: to keep the learned distribution close to a standard normal, an additional loss term is added to the reconstruction loss.  This is the KL divergence between the normal distribution with the mean and variance the model has learnt and $\mathcal{N}(0,1)$.  Since it is easier and more numerically stable to have the model learn the log-variance instead of the standard deviation, the KL divergence term ends up being
$$D_{KL}(q(z|x)\,||\,p(z)) = -\frac{1}{2}\sum_{j}\left(1 + \log\sigma_j^2 - \mu_j^2 - e^{\log\sigma_j^2}\right)$$
where the sum runs over the latent dimensions $j$.

In my code, I let $\sigma$ denote the log-variance vector rather than the standard deviation.
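
For reference, the closed-form KL divergence between a diagonal Gaussian and $\mathcal{N}(0,1)$ can be sketched in a few lines of plain Python.  The function name `kl_to_standard_normal` and the list-based inputs are assumptions made for this example, not the project's code.  The expression is $-\frac{1}{2}\sum(1 + \text{logvar} - \mu^2 - e^{\text{logvar}})$ with the minus sign folded into the summand; it is never negative and is zero only when the mean is 0 and the variance is 1.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions.

    `logvar` holds log-variances, matching the parameterization in the text.
    """
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(mu, logvar)
    )


# A distribution that already equals N(0, 1) incurs no penalty.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0

# Moving one mean away from 0 by one unit costs 0.5.
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```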

### Design
For this project I opted to build my whole codebase from the ground up, as opposed to using preexisting libraries, for several reasons.  First, the majority of examples on the internet are for the aforementioned toy datasets with hard-coded sizes; copying them does me little good since my data is substantially larger.  Second, the more flexible VAE libraries are unnecessarily complex and difficult to edit.  Finally, since this is, after all, an academic exercise, I figured I would learn more from building the model myself.

My design centers around JSON config files, which I have <s>forced</s> persuaded many of my ML colleagues to work with in the past.  Essentially, the JSON file defines everything about the conv2d/conv2dtranspose nodes in the network (how many layers, the parameters of each node, etc.) along with the size of the linear layers and a few other parameters.  The work of building this paid off down the road, as it made a network architecture search trivial - I can quickly spin up a new config file and launch another version of the network in parallel to see how it does.  The final architecture for part 1 is below.
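
To make the config-driven idea concrete, here is a small sketch of what such a setup could look like.  It is not the project's actual schema or its final architecture: every key name (`input_shape`, `latent_dim`, `encoder`, and so on), every layer value, and the 640x360 frame size are assumptions chosen for illustration.  The sketch parses a JSON config and uses the standard convolution output-size formula to work out the shape after each layer, which is the number needed to size the first linear layer.

```python
import json

# Hypothetical config; key names and values are illustrative only.
CONFIG = """
{
  "input_shape": [3, 360, 640],
  "latent_dim": 256,
  "encoder": [
    {"out_channels": 32,  "kernel_size": 4, "stride": 2, "padding": 1},
    {"out_channels": 64,  "kernel_size": 4, "stride": 2, "padding": 1},
    {"out_channels": 128, "kernel_size": 4, "stride": 2, "padding": 1}
  ]
}
"""


def conv_out(size, kernel_size, stride, padding):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel_size) // stride + 1


def encoder_shapes(config):
    """Return the (channels, height, width) shape after each encoder layer."""
    channels, height, width = config["input_shape"]
    shapes = [(channels, height, width)]
    for layer in config["encoder"]:
        k, s, p = layer["kernel_size"], layer["stride"], layer["padding"]
        height = conv_out(height, k, s, p)
        width = conv_out(width, k, s, p)
        channels = layer["out_channels"]
        shapes.append((channels, height, width))
    return shapes


config = json.loads(CONFIG)
shapes = encoder_shapes(config)
c, h, w = shapes[-1]
flat = c * h * w  # input size of the first linear layer

print(shapes[-1])  # (128, 45, 80)
print(flat)        # 460800
```

Trying a different architecture then amounts to editing the JSON and rerunning, with no changes to the model code.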

## Interpolation
The papers referenced earlier [1, 2] found that the latent space of a VAE is mostly flat, so as a starting point I simply take the linear direction between the latent vectors of two images and add that vector to the second one, in an attempt to predict the latent vector of a "future" image in the video.
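
The linear step described above can be sketched as follows.  The function name `extrapolate` and its `step` parameter are hypothetical, and plain lists stand in for latent vectors; in practice the two inputs would come from the encoder and the result would be passed to the decoder, neither of which is shown here.

```python
def extrapolate(z_a, z_b, step=1.0):
    """Continue along the straight line from z_a to z_b, past z_b.

    Returns z_b + step * (z_b - z_a), i.e. the direction between the two
    latent vectors added to the second one.
    """
    return [b + step * (b - a) for a, b in zip(z_a, z_b)]


# Toy 2-D latent vectors for two consecutive frames.
z_prev = [0.0, 1.0]
z_curr = [1.0, 3.0]

z_next = extrapolate(z_prev, z_curr)
print(z_next)  # [2.0, 5.0]
```

With `step=1.0` the predicted vector lies exactly as far beyond the second frame as the second frame is from the first; smaller or larger values of `step` would correspond to shorter or longer jumps along the same direction.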

<a id="Refrences"></a>
## References
[1] Uniform Interpolation Constrained Geodesic Learning on Data Manifold. https://arxiv.org/pdf/2002.04829.pdf

[2] The Riemannian Geometry of Deep Generative Models. https://arxiv.org/pdf/1711.08014.pdf