# (simple) Autoencoders ≡ PCA
> and how Variational Autoencoders (VAEs) perform expectation maximization (EM).

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]


## Background

Shocker: the simplest autoencoder is actually the same as PCA. We will also see how VAEs perform expectation maximization. Autoencoders learn a meaningful representation of a signal. Using an encoder/decoder pair, an autoencoder encodes the original signal into a latent representation and then reconstructs the signal from it. By minimizing the loss between the original signal and its reconstruction, the encoder network is trained to extract the most meaningful features from each instance of a dataset.

Sound familiar? That is precisely the goal of computing the principal components of a data matrix, i.e. principal component analysis (PCA). It turns out that the simplest autoencoders are, by construction, *exactly* PCA.

## The Simplest Autoencoder


Consider a neural network with a single hidden unit that computes $W^T\sigma(f(W\vec{X}))$, where $W$ denotes the weights of the encoder layer and the activation function $\sigma$ is linear (for this example).

The decoder then reuses the same weights, transposed, as $W^T$. Once properly trained, the final outputs of the network should converge to a reconstruction of the original features $\vec{X}$. That is, our network should produce $\hat{\vec{X}}$ through $\vec{X} \to \vec{y} = W\vec{X} \to z = \sigma(f(\vec{y}))$ and finally $\hat{\vec{X}} = W^Tz$.

If we train by minimizing the $L_2$ divergence between $\vec{X}$ and $\hat{\vec{X}}$, we have an autoencoder, but we also learn the principal components of $\vec{X}$:

$$\hat{x} = W^TWx;\quad \mathrm{div}(\hat{x},x) = \|x-\hat{x}\|^2 = \|x - W^TWx\|^2 $$
$$\to \text{via backprop} \to \hat{W} = \arg\min_W E\left[ \| x-W^TWx\|^2 \right]$$
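
To see why this objective picks out a direction of maximum variance, it helps to expand the squared error. The following is a short sketch under two assumptions that are not stated above: $W$ is a single row vector with unit norm ($WW^T = 1$), and $\Sigma = E[xx^T]$ denotes the covariance of the zero-mean input.

$$\|x - W^TWx\|^2 = \|x\|^2 - 2(Wx)^2 + (WW^T)(Wx)^2 = \|x\|^2 - (Wx)^2$$
$$E\left[\|x - W^TWx\|^2\right] = E\left[\|x\|^2\right] - W\Sigma W^T$$

The first term does not depend on $W$, so minimizing the reconstruction error is the same as maximizing $W\Sigma W^T$, the variance of the projection $Wx$. Over unit-norm vectors, that quantity is largest when $W^T$ is the eigenvector of $\Sigma$ with the largest eigenvalue.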

Minimizing this reconstruction error is equivalent to discovering a basis that spans the principal components of the input: the network finds the directions of maximum energy, and that energy equals the variance of $X$ when $X$ is a zero-mean random variable. In other words, we find a linear mapping from the original features onto the principal axis of the data.
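
As a quick numerical check of this claim, here is a minimal NumPy sketch. Everything in it is an illustrative assumption rather than part of the argument above: the synthetic 2-D data, the step size `lr`, the iteration count, and the use of plain full-batch gradient descent in place of a deep learning framework's backprop.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic zero-mean 2-D data with one dominant direction (illustrative only).
n = 2000
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = (rng.normal(size=(n, 2)) * np.array([3.0, 1.0])) @ R.T
X -= X.mean(axis=0)

# Tied-weight linear autoencoder with one hidden unit: x_hat = W^T W x.
W = rng.normal(scale=0.1, size=(1, 2))
lr = 0.01  # assumed step size
for _ in range(1000):
    Z = X @ W.T    # hidden codes z = W x, shape (n, 1)
    E = Z @ W - X  # reconstruction residual x_hat - x, shape (n, 2)
    # Gradient of the mean squared reconstruction error with respect to W.
    grad = (2.0 / n) * (Z.T @ E + W @ E.T @ X)
    W -= lr * grad

# PCA reference: top eigenvector of the sample covariance matrix.
cov = X.T @ X / n
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
v = eigvecs[:, -1]

# Absolute cosine between the learned weights and the first principal axis.
cosine = abs((W @ v).item()) / np.linalg.norm(W)
print("norm of W:", np.linalg.norm(W))
print("|cos(W, v)|:", cosine)
```

Both printed values should come out very close to 1: the learned weight vector has unit length and points along the first principal axis, up to sign.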

Finally, if the learned decoder weight is *not* the same weight as the input weight (i.e., $U^T \neq W^T$), we still learn a hidden representation $z$ that lies along the major axis of the input $X$. The minimum-error direction here is by definition the principal eigenvector of the covariance of $X$.

We could then find a useful component of $X$ (described perhaps by our training process or assumptions) by projecting $X$ onto that eigenvector. Again, if $U^T = W^T$, we arrive at the principal component(s) of $X$.
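
The untied case can be checked the same way. The sketch below is a hedged illustration only: it assumes a separate decoder weight `U` (same shape as `W`, applied as `U.T`), the same kind of synthetic data, and hand-written gradient descent with an arbitrarily chosen step size. The individual scales of `U` and `W` are not pinned down by the loss, so the comparison uses directions and the product `U.T @ W`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic zero-mean 2-D data with one dominant direction (illustrative only).
n = 2000
theta = np.pi / 6
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = (rng.normal(size=(n, 2)) * np.array([3.0, 1.0])) @ R.T
X -= X.mean(axis=0)

# Untied linear autoencoder with one hidden unit: x_hat = U^T W x.
W = rng.normal(scale=0.1, size=(1, 2))  # encoder weights
U = rng.normal(scale=0.1, size=(1, 2))  # decoder weights, applied as U^T
lr = 0.01  # assumed step size
for _ in range(3000):
    Z = X @ W.T    # hidden codes, shape (n, 1)
    E = Z @ U - X  # reconstruction residual, shape (n, 2)
    grad_U = (2.0 / n) * (Z.T @ E)
    grad_W = (2.0 / n) * ((E @ U.T).T @ X)
    U -= lr * grad_U
    W -= lr * grad_W

# PCA reference: top eigenvector of the sample covariance matrix.
eigvals, eigvecs = np.linalg.eigh(X.T @ X / n)
v = eigvecs[:, -1]

cos_W = abs((W @ v).item()) / np.linalg.norm(W)
cos_U = abs((U @ v).item()) / np.linalg.norm(U)
print("|cos(W, v)|:", cos_W)
print("|cos(U, v)|:", cos_U)
print("U^T W close to v v^T:", np.allclose(U.T @ W, np.outer(v, v), atol=1e-3))
```

In this setup both weight vectors line up with the principal eigenvector, and the end-to-end map `U.T @ W` approaches the rank-one projection onto it.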

Of course, this is a roundabout method of obtaining principal components, but I hope it shows the rigorous grounding and versatility of perceptrons in learning representations of data.