# Table of Contents
 <p><div class="lev1 toc-item"><a href="#Introduction" data-toc-modified-id="Introduction-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Introduction</a></div><div class="lev1 toc-item"><a href="#Probabilistic-PCA-and-Factor-Analysis" data-toc-modified-id="Probabilistic-PCA-and-Factor-Analysis-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Probabilistic PCA and Factor Analysis</a></div><div class="lev1 toc-item"><a href="#Independent-Component-Analysis" data-toc-modified-id="Independent-Component-Analysis-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Independent Component Analysis</a></div><div class="lev1 toc-item"><a href="#Slow-Feature-Analysis" data-toc-modified-id="Slow-Feature-Analysis-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Slow Feature Analysis</a></div><div class="lev1 toc-item"><a href="#Sparse-Coding" data-toc-modified-id="Sparse-Coding-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Sparse Coding</a></div><div class="lev1 toc-item"><a href="#Manifold-Interpretation-of-PCA" data-toc-modified-id="Manifold-Interpretation-of-PCA-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Manifold Interpretation of PCA</a></div>

# Introduction

**A linear factor model is defined by the use of a stochastic, linear decoder function that** *generates $x$ by adding noise to a linear transformation of $h$.*

> The model has latent variables $h$, with $p_{model}(x)=E_{h}\,p_{model}(x|h)$

A linear factor model describes the data generation process as follows:

1. Sampling the explanatory factors $h$ from a distribution $h \sim p(h)$, where $p(h)$ is a factorial distribution, with $p(h)=\prod_{i}p(h_{i})$
2. Sampling the real-valued observable variables given the factors: $x=Wh+b+\text{noise}$, where the noise is typically Gaussian and diagonal (independent across dimensions)
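
The two sampling steps above can be sketched in NumPy. This is only an illustration: the unit Gaussian prior over $h$, the sizes, and the names `sample_linear_factor_model` and `noise_std` are assumptions chosen for the example, not part of the notes.

```python
import numpy as np

def sample_linear_factor_model(W, b, noise_std, n_samples, rng):
    """Draw samples x = W h + b + noise (hypothetical helper for illustration)."""
    n_obs, n_latent = W.shape
    # Step 1: sample h from a factorial prior; a unit Gaussian is assumed here.
    h = rng.standard_normal((n_samples, n_latent))
    # Step 2: add diagonal Gaussian noise, independent across observed dimensions.
    noise = rng.standard_normal((n_samples, n_obs)) * noise_std
    x = h @ W.T + b + noise
    return x, h

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))      # 5 observed dimensions, 2 latent factors
b = np.arange(5, dtype=float)
noise_std = np.full(5, 0.1)          # one standard deviation per observed dimension
x, h = sample_linear_factor_model(W, b, noise_std, n_samples=1000, rng=rng)
```

Each row of `x` is one observation; subtracting `h @ W.T + b` from it leaves only the noise term.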

# Probabilistic PCA and Factor Analysis

Probabilistic PCA and factor analysis are special cases of the above equations. They differ only in the choices made for the noise distribution and for the model's prior over the **latent variables $h$** before observing $x$.

The **latent variable prior** is the unit-variance Gaussian $h \sim N(h;0,I)$

The observed variables $x_i$ are assumed to be *conditionally independent* given $h$. Specifically, the noise is assumed to be drawn from a Gaussian distribution with diagonal covariance matrix $\psi = \mathrm{diag}(\sigma^2)$, where $\sigma^2 = [\sigma_1^2, \sigma_2^2, \dots, \sigma_n^2]$ is a vector of per-variable variances.

The role of the latent variables is to *capture the dependencies* between the different observed variables $x_i$. Marginally, $x$ is a multivariate normal random variable:
$$x \sim N(x; b, WW^T+\psi)$$
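
As a rough numerical check of the marginal covariance $WW^T+\psi$, one can sample from the model and compare the empirical covariance with the formula. The sizes, the seed, and the particular variances below are arbitrary assumptions for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_latent, n_samples = 4, 2, 200_000

W = rng.standard_normal((n_obs, n_latent))
b = rng.standard_normal(n_obs)
sigma2 = np.array([0.5, 0.1, 0.3, 0.2])   # per-dimension noise variances (assumed values)

# Sample h ~ N(0, I), then x = W h + b + noise with diagonal Gaussian noise.
h = rng.standard_normal((n_samples, n_latent))
noise = rng.standard_normal((n_samples, n_obs)) * np.sqrt(sigma2)
x = h @ W.T + b + noise

model_cov = W @ W.T + np.diag(sigma2)      # covariance predicted by the model
empirical_cov = np.cov(x, rowvar=False)    # covariance estimated from samples

# Largest entrywise discrepancy, relative to the largest model covariance entry.
rel_gap = np.abs(empirical_cov - model_cov).max() / np.abs(model_cov).max()
```

With this many samples `rel_gap` should be small, which is consistent with the off-diagonal structure of the covariance coming entirely from $WW^T$.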

Modification (probabilistic PCA): make the conditional variances $\sigma_i^2$ equal to each other, so that
$$ x \sim N(x; b, WW^T+\sigma^2 I) $$
or equivalently
$$ x=Wh+b+\sigma z, \quad \text{where } z \sim N(z;0,I) $$

The parameters $W$ and $\sigma^2$ can then be estimated with an iterative EM algorithm.

**Observation**: most variations in the data can be captured by the latent variables $h$, up to some small residual *reconstruction error $\sigma^2$*.

**Relation of probabilistic PCA to PCA**: as $\sigma \rightarrow 0$, the conditional expected value of $h$ given $x$ becomes an orthogonal projection of $x-b$ onto the space spanned by the $d$ columns of $W$, as in PCA.
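
The limit $\sigma \rightarrow 0$ can be illustrated numerically. The sketch below uses the standard probabilistic PCA posterior mean $E[h|x]=(W^TW+\sigma^2 I)^{-1}W^T(x-b)$, which is not written out in these notes, and compares $W\,E[h|x]$ with the orthogonal projection of $x-b$ onto the column space of $W$. The helper name `posterior_mean` and the random test values are assumptions.

```python
import numpy as np

def posterior_mean(W, b, sigma, x):
    """E[h | x] under probabilistic PCA with isotropic noise variance sigma**2."""
    d = W.shape[1]
    M = W.T @ W + sigma**2 * np.eye(d)
    return np.linalg.solve(M, W.T @ (x - b))

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 2))
b = rng.standard_normal(5)
x = rng.standard_normal(5)

# Orthogonal projection of x - b onto the column space of W, via a reduced QR.
Q, _ = np.linalg.qr(W)
projection = Q @ Q.T @ (x - b)

# Distance between W E[h|x] and the projection for decreasing sigma.
sigmas = [1.0, 0.1, 1e-4]
gaps = [np.linalg.norm(W @ posterior_mean(W, b, s, x) - projection) for s in sigmas]
```

The entries of `gaps` shrink as `sigma` decreases: for large noise the posterior mean is pulled toward the prior mean 0, and in the limit it reproduces the projection.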

# Independent Component Analysis

**ICA** is an approach to modeling linear factors that seeks to separate an observed signal into many underlying signals that are scaled and added together to form the observed data.
**These signals are intended to be fully independent, rather than merely decorrelated from each other.**
> Being uncorrelated is weaker than being independent: uncorrelated means the covariances are 0, whereas independence means the joint distribution factorizes into the product of the marginals, $p(a,b)=p(a)p(b)$.
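
A small example of two variables that are uncorrelated but clearly dependent: take $a$ uniform on $[-1,1]$ and $b=a^2$. The choice of distribution and the variable names are assumptions made for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(-1.0, 1.0, size=200_000)
b = a**2                      # b is a deterministic function of a, so they are dependent

# Covariance of a and b: E[a b] - E[a] E[b] = E[a^3], which is 0 by symmetry.
cov_ab = np.mean(a * b) - np.mean(a) * np.mean(b)

# Independence would require E[f(a) g(b)] = E[f(a)] E[g(b)] for all f, g.
# With f(a) = a^2 and g(b) = b the two sides differ (1/5 versus 1/9 in expectation).
dependence_gap = np.mean(a**2 * b) - np.mean(a**2) * np.mean(b)
```

Here `cov_ab` is close to 0 while `dependence_gap` is clearly positive, so decorrelation alone does not give independence.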

One variant of ICA trains a fully parametric generative model. The prior distribution over the underlying factors, $p(h)$, must be fixed ahead of time by the user. The model then **deterministically** generates $x=Wh$.

> We can perform a nonlinear change of variables to determine $p(x)$, using $$p_x(x)=p_y(g(x))\left|\det\left(\frac{\partial g(x)}{\partial x}\right)\right|$$
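
For the linear case $x=Wh$ with square, invertible $W$, the change of variables gives $\log p_x(x)=\sum_i \log p(h_i) - \log|\det W|$ with $h=W^{-1}x$. A minimal sketch follows, assuming a unit-scale Laplace prior on each factor; the prior and the function names are illustrative choices, not something the notes prescribe.

```python
import numpy as np

def log_laplace(h):
    """Elementwise log density of a unit-scale Laplace: log(0.5 * exp(-|h|))."""
    return -np.abs(h) - np.log(2.0)

def ica_log_likelihood(W, x):
    """log p_x(x) for x = W h with independent Laplace factors (assumed prior)."""
    h = np.linalg.solve(W, x)                 # h = W^{-1} x
    _, logabsdet = np.linalg.slogdet(W)       # log |det W|
    # |det(dh/dx)| = |det W^{-1}| = 1 / |det W|, hence the minus sign.
    return log_laplace(h).sum() - logabsdet

# One-dimensional sanity check: x = 2h, so p_x(1) = 0.5 * exp(-0.5) / 2.
value = ica_log_likelihood(np.array([[2.0]]), np.array([1.0]))
```

Maximizing this quantity over $W$ on a dataset is the maximum likelihood form of ICA mentioned above.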

By choosing $p(h)$ to be independent, we can recover underlying factors that are as close as possible to independent.
> The aim is not to capture high-level abstract causal factors, but to recover low-level signals that have been mixed together

- Variants of ICA
    > - Add some noise in the generation of $x$
    > - Do not use the maximum likelihood criterion, but instead aim to make the elements of $h=W^{-1}x$ independent from each other

**Notes:** all variants of ICA require that $p(h)$ be non-Gaussian
> Why: if $p(h)$ is an independent Gaussian prior, then $W$ is not identifiable, because many values of $W$ produce the same distribution $p(x)$.
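
The non-uniqueness under a Gaussian prior can be seen directly: with $h \sim N(0,I)$ and $x=Wh$, the distribution of $x$ depends on $W$ only through $WW^T$, and $WR$ gives the same product for any orthogonal $R$. The matrices below are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 3))

# A random orthogonal matrix R, taken from a QR decomposition.
R, _ = np.linalg.qr(rng.standard_normal((3, 3)))
W_rotated = W @ R

# With h ~ N(0, I), x = W h is Gaussian with covariance W W^T.
cov_original = W @ W.T
cov_rotated = W_rotated @ W_rotated.T

same_distribution = np.allclose(cov_original, cov_rotated)
different_mixing = not np.allclose(W, W_rotated)
```

Both flags are `True`: the two mixing matrices differ, yet they define the same Gaussian $p(x)$, so no amount of data can distinguish them.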

**Notes:** Many variants of ICA are not generative models in the sense that we use the phrase. **A generative model either represents $p(x)$ or can draw samples from it.**

- *Extensions of ICA: nonlinear independent components estimation (NICE) (2014)*
- *Extensions of ICA: learn groups of features, with statistical dependence allowed within a group but discouraged between groups*
> - *independent subspace analysis*, the groups of related units are chosen to be non-overlapping
> - *topographic ICA*, assigns spatial coordinates to each hidden unit and forms overlapping groups of spatially neighboring units

# Slow Feature Analysis

**SFA is a linear factor model that uses information from time signals to learn invariant features (2002)**

Motivation: slowness principle. *Compared with the individual measurements that make up a description of a scene*, the important characteristics of scenes change very slowly

The slowness principle may be introduced by adding a term to the cost function of the form $$\lambda \sum_t L(f(x^{(t+1)}), f(x^{(t)}))$$ where $\lambda$ is a hyperparameter determining the strength of the slowness regularization term, $f$ is the feature extractor to be regularized, and $L$ is a loss function measuring the distance between $f(x^{(t)})$ and $f(x^{(t+1)})$. A common choice for $L$ is the mean squared difference.
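
A minimal sketch of the slowness term with $L$ taken to be the mean squared difference. The array `features` stands for the outputs $f(x^{(t)})$ stacked over time, and the two sine signals are made-up inputs used only to compare a slow and a fast feature.

```python
import numpy as np

def slowness_penalty(features, lam):
    """lam * sum_t L(f(x^(t+1)), f(x^(t))) with L the mean squared difference.

    features: array of shape (T, k), one feature vector per time step.
    """
    diffs = features[1:] - features[:-1]
    return lam * np.sum(np.mean(diffs**2, axis=1))

t = np.linspace(0.0, 2.0 * np.pi, 200)
slow_feature = np.sin(t)[:, None]          # changes slowly over time
fast_feature = np.sin(20.0 * t)[:, None]   # changes quickly over time

penalty_slow = slowness_penalty(slow_feature, lam=1.0)
penalty_fast = slowness_penalty(fast_feature, lam=1.0)
```

The fast feature incurs a much larger penalty, so adding this term to a cost function pushes $f$ toward slowly varying outputs.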

It is possible to theoretically predict which features SFA will learn, even in the deep, nonlinear setting. To make such theoretical predictions, one must know the dynamics of the environment in terms of its configuration space. Given knowledge of how the underlying factors actually change, it is possible to analytically solve for the optimal functions expressing these factors.

With other learning algorithms, the cost function depends heavily on specific pixel values, making it much more difficult to determine what features the model will learn.

> P. Berkes and L. Wiskott. Slow feature analysis yields a rich repertoire of complex cell properties. Journal of Vision, 2005

# Sparse Coding

**Sparse coding** uses a linear decoder plus noise to obtain reconstructions of $x$: $$p(x|h)=N(x;Wh+b,\frac{1}{\beta}I)$$ That is, the model assumes Gaussian noise with isotropic precision $\beta$.

The distribution $p(h)$ is chosen to be one with sharp peaks near 0. Common choices include factorized Laplace, Cauchy, or factorized Student-t distributions.

*Examples:*

- Laplace prior with the sparsity penalty coefficient $\lambda$ is given by $$p(h_i)=Laplace(h_i;0, \frac{2}{\lambda})=\frac {\lambda}{4} e^{-\frac{1}{2} \lambda |h_i|}$$
- Student-t prior by $$p(h_i) \propto \frac{1}{(1+\frac{h_i^2}{\nu})^{\frac{\nu+1}{2}}}$$

**Encoder:** $$\begin{aligned} h^*=f(x)&=\arg\max_h p(h|x)\\ &=\arg\max_h \log p(h|x)\\ &=\arg\min_h \lambda ||h||_1 + \beta ||x-Wh||_2^2 \end{aligned}$$

To train the model, we alternate between minimization with respect to $h$ and minimization with respect to $W$. We treat $\beta$ as a hyperparameter; it is typically set to 1.
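
The minimization over $h$ has no closed form in general, so the encoder is an iterative procedure. One common option is iterative soft-thresholding (ISTA); the notes do not name a specific solver, so treat this as one possible sketch. The function names, the iteration count, and the step size $1/L$ with $L=2\beta\|W\|_2^2$ are assumptions of the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry toward 0 by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def objective(x, W, h, lam, beta=1.0):
    return lam * np.abs(h).sum() + beta * np.sum((x - W @ h) ** 2)

def sparse_code_ista(x, W, lam, beta=1.0, n_iter=500):
    """Approximately minimize lam*||h||_1 + beta*||x - W h||_2^2 over h."""
    # Lipschitz constant of the gradient of the smooth (quadratic) term.
    L = 2.0 * beta * np.linalg.norm(W, 2) ** 2
    h = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * beta * W.T @ (W @ h - x)      # gradient of the quadratic term
        h = soft_threshold(h - grad / L, lam / L)  # gradient step, then shrinkage
    return h

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))                   # overcomplete dictionary (assumed sizes)
h_true = np.zeros(16)
h_true[[2, 7]] = [1.5, -2.0]
x = W @ h_true

h_star = sparse_code_ista(x, W, lam=0.1)
```

Learning $W$ would alternate this inner loop with an update of the dictionary while holding the codes fixed.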

*The generative model itself is not especially sparse, only the feature extractor is.*

The sparse coding approach combined with the use of the **non-parametric encoder** can in principle minimize the combination of reconstruction error and log-prior better than any specific **parametric encoder**. Besides, there is no generalization error to the **encoder**.

**Disadvantage**: the non-parametric encoder requires running an iterative algorithm, whereas the parametric autoencoder approach uses only a fixed number of layers. It is also not straightforward to back-propagate through the non-parametric encoder, which makes it difficult to pretrain a sparse coding model with an unsupervised criterion and then fine-tune it using a supervised criterion.

Even when the model is able to reconstruct the data well and provide useful features for a classifier, the samples produced by sparse coding may still be poor.

- Each individual feature may be learned well, but the factorial prior on the hidden code results in the model including random subsets of all of the features in each generated sample.

# Manifold Interpretation of PCA

Linear factor models including PCA and factor analysis can be interpreted as learning a manifold (Hinton97)

*Notes: a flat Gaussian captures probability concentration near a low-dimensional manifold*

Let the encoder be $h=f(x)=W^T(x-\mu)$

With the autoencoder view, the decoder computes the reconstruction $\widehat{x}=g(h)=b+Vh$

The task of PCA is to learn matrices $W$ and $V$ with the goal of making the reconstruction of $x$ lie as close to $x$ as possible.

* The optimal choices of linear encoder and decoder are those that minimize the reconstruction error $E[||x-\widehat{x}||^2]$
* These choices correspond to $V=W$, $\mu=b=E[x]$, with the columns of $W$ forming an orthonormal basis that spans the same subspace as the principal eigenvectors of the covariance matrix $C=E[(x-\mu)(x-\mu)^T]$
* The eigenvalue $\lambda_i$ of $C$ corresponds to the variance of $x$ in the direction of eigenvector $v^{(i)}$. If $x \in R^D$ and $h \in R^d$ with $d < D$, then the optimal reconstruction error is $\min E[||x-\widehat{x}||^2]=\sum_{i=d+1}^D \lambda_i$
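
The reconstruction error formula can be checked numerically on synthetic data. The sketch below assumes the expectation is replaced by an average over the sample and that $C$ is the empirical covariance normalized by the number of samples; the data and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 5, 2, 2000

# Synthetic correlated data with a nonzero mean.
A = rng.standard_normal((D, D))
x = rng.standard_normal((n, D)) @ A.T + 3.0

mu = x.mean(axis=0)
C = (x - mu).T @ (x - mu) / n                 # empirical covariance (normalized by n)

eigvals, eigvecs = np.linalg.eigh(C)          # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]             # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

W = eigvecs[:, :d]                            # top-d principal directions, orthonormal
h = (x - mu) @ W                              # encoder: h = W^T (x - mu)
x_hat = mu + h @ W.T                          # decoder with V = W and b = mu

recon_error = np.mean(np.sum((x - x_hat) ** 2, axis=1))
tail_sum = eigvals[d:].sum()                  # sum of the D - d smallest eigenvalues
```

`recon_error` and `tail_sum` agree up to floating-point error, matching $\sum_{i=d+1}^D \lambda_i$.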

**Linear factor models are some of the simplest generative models and some of the simplest models that learn a representation of data**