# Variational Autoencoders
## Introduction

The variational autoencoder (VAE) is arguably the simplest setup that realizes deep probabilistic modeling. Note that we’re being careful in our choice of language here. The VAE isn’t a model as such; rather, it is a particular setup for doing variational inference for a certain class of models. The class of models is quite broad: basically any (unsupervised) density estimator with latent random variables.

<img src="vae_model.png"  width="180" height="200">

Here we’ve depicted the structure of the kind of model we’re interested in as a graphical model. We have $N$ observed datapoints $\{x_i\}$. Each datapoint is generated by a (local) latent random variable $z_i$. There is also a parameter $\theta$, which is global in the sense that all the datapoints depend on it (which is why it’s drawn outside the rectangle). Note that since $\theta$ is a parameter, it’s not something we’re being Bayesian about. Finally, what’s of particular importance here is that we allow each $x_i$ to depend on $z_i$ in a complex, non-linear way. In practice this dependency will be parameterized by a (deep) neural network with parameters $\theta$. It’s this non-linearity that makes inference for this class of models particularly challenging.
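To make the generative story concrete, here is a minimal sketch of sampling from such a model using only the Python standard library. Everything specific in it is an assumption made for illustration rather than something fixed by the text above: the standard normal prior on each latent, the Bernoulli likelihood, the one-hidden-layer `tanh` network, and the helper names `decoder`, `init_theta`, and `sample_dataset`.

```python
import math
import random


def init_theta(rng, z_dim, hidden_dim, x_dim):
    """Randomly initialize the global parameters theta of a toy decoder network."""
    def matrix(rows, cols):
        return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]
    return (matrix(hidden_dim, z_dim), [0.0] * hidden_dim,
            matrix(x_dim, hidden_dim), [0.0] * x_dim)


def decoder(z, theta):
    """Map a latent vector z to Bernoulli means through a non-linear network."""
    W1, b1, W2, b2 = theta
    hidden = [math.tanh(sum(w * zj for w, zj in zip(row, z)) + b)
              for row, b in zip(W1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return [1.0 / (1.0 + math.exp(-a)) for a in logits]


def sample_dataset(rng, theta, n, z_dim):
    """Draw n (z_i, x_i) pairs; theta is shared, each z_i is local to its x_i."""
    data = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(z_dim)]     # assumed prior: z_i ~ N(0, I)
        probs = decoder(z, theta)                           # non-linear dependence on z_i
        x = [1 if rng.random() < p else 0 for p in probs]   # assumed Bernoulli likelihood
        data.append((z, x))
    return data


rng = random.Random(0)
theta = init_theta(rng, z_dim=2, hidden_dim=4, x_dim=5)
dataset = sample_dataset(rng, theta, n=3, z_dim=2)
```

Note how `theta` is created once and reused for every datapoint, while a fresh `z` is drawn inside the loop. That is the global/local distinction drawn in the graphical model.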

Of course this non-linear structure is also one reason why this class of models offers a very flexible approach to modeling complex data. Indeed it’s worth emphasizing that each of the components of the model can be ‘reconfigured’ in a variety of different ways. For example:

- the neural network in $p_\theta(x|z)$ can be varied in all the usual ways (number of layers, type of non-linearities, number of hidden units, etc.)

- we can choose observation likelihoods that suit the dataset at hand: Gaussian, Bernoulli, categorical, etc.

- we can choose the number of dimensions in the latent space

The graphical model representation is a useful way to think about the structure of the model, but it can also be fruitful to look at an explicit factorization of the joint probability density:

<img src="prob-density.png"  width="300" height="100">
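
For readers who cannot view the figure: assuming it shows the standard factorization for this class of models (one prior term and one likelihood term per datapoint), the joint density can be written as

$$p(x, z) = \prod_{i=1}^{N} p_\theta(x_i \mid z_i)\, p(z_i)$$

where $x = \{x_i\}$ and $z = \{z_i\}$ collect all $N$ datapoints and latents.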

The fact that the joint density $p(x, z)$ breaks up into a product of terms like this makes it clear what we mean when we call $z_i$ a local random variable. For any particular $i$, only the single datapoint $x_i$ depends on $z_i$. As such the $\{z_i\}$ describe local structure, i.e. structure that is private to each datapoint. This factorized structure also means that we can subsample the data during learning, i.e. work with mini-batches rather than the full dataset at every step. As such this sort of model is amenable to the large data setting.
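
The link between factorization and subsampling can be sketched in a few lines. Because the log of the joint density is a sum of $N$ per-datapoint terms, a random mini-batch of $M$ terms rescaled by $N/M$ gives an unbiased estimate of the full sum. The toy scalar model below is purely an assumption for illustration (standard normal prior, unit-variance Gaussian likelihood with mean $\tanh(\theta z_i)$), as are the function names. The latents are also treated as known here, which is not the case during actual inference.

```python
import math
import random


def log_normal(value, mean, std=1.0):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - 0.5 * ((value - mean) / std) ** 2


def log_joint_term(x_i, z_i, theta):
    """log p_theta(x_i | z_i) + log p(z_i) for a single datapoint (toy model)."""
    return log_normal(x_i, math.tanh(theta * z_i)) + log_normal(z_i, 0.0)


def full_log_joint(xs, zs, theta):
    """The product over datapoints becomes a sum of N terms in log space."""
    return sum(log_joint_term(x, z, theta) for x, z in zip(xs, zs))


def minibatch_log_joint(xs, zs, theta, batch_size, rng):
    """Unbiased estimate of the full sum from a random subsample, rescaled by N / M."""
    indices = rng.sample(range(len(xs)), batch_size)
    scale = len(xs) / batch_size
    return scale * sum(log_joint_term(xs[i], zs[i], theta) for i in indices)


rng = random.Random(0)
N, theta = 1000, 1.5
zs = [rng.gauss(0.0, 1.0) for _ in range(N)]
xs = [rng.gauss(math.tanh(theta * z), 1.0) for z in zs]

exact = full_log_joint(xs, zs, theta)
estimates = [minibatch_log_joint(xs, zs, theta, 100, rng) for _ in range(200)]
mean_estimate = sum(estimates) / len(estimates)
```

Each call to `minibatch_log_joint` touches only 100 of the 1000 datapoints, yet averaged over many calls the estimates centre on `exact`. This is the property that lets learning scale to large datasets.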