<img src="images/ublogo.png"/>

### CSE610 - Bayesian Non-parametric Machine Learning

  - Lecture Notes
  - Instructor - Varun Chandola
  - Term - Fall 2020

### Objective
The objective of this notebook is to discuss Bayesian non-parametric dimensionality reduction methods. We will primarily focus on **Gaussian Process Latent Variable Models** or GPLVM.

<div class="alert alert-info">

**Note:** This material is from several papers published in this area, including the NeurIPS 2004 paper by Neil Lawrence - [Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data](http://papers.nips.cc/paper/2540-gaussian-process-latent-variable-models-for-visualisation-of-high-dimensional-data.pdf).

</div>

### Dimensionality reduction

### Problem setting
Let an observed data instance be denoted by ${\bf y} \in \mathbb{R}^D$, where $D$ is typically large. We are interested in mapping ${\bf y}$ to a representation ${\bf x} \in \mathbb{R}^d$, where $d \ll D$.

Of course, we want ${\bf x}$ to preserve some properties of ${\bf y}$. The dimensionality reduction problem is usually stated for a data set $Y$ containing $N$ data instances in $\mathbb{R}^D$: we want to transform $Y$ into $X$, containing the corresponding $N$ data instances in $\mathbb{R}^d$, such that certain properties of $Y$, or of the instances in $Y$, are preserved.

### Probabilistic Principal Components Analysis (PPCA)
The *probabilistic PCA* model is a latent variable model in which the likelihood of an observed data instance, ${\bf y}_n$, is given as:
$$
p({\bf y}_n\vert W,\beta) = \int p({\bf y}_n\vert {\bf x}_n,W,\beta)p({\bf x}_n)d{\bf x}_n
$$
where the latent vector ${\bf x}_n$ is assumed to follow a Gaussian distribution with zero mean and identity covariance, i.e., $p({\bf x}_n) = \mathcal{N}(0,I)$, and the observed data is connected to the latent data as:
$$
p({\bf y}_n\vert {\bf x}_n,W,\beta) = \mathcal{N}(W{\bf x}_n, \beta^{-1}I)
$$
where $W$ is a $(D \times d)$ "projection" matrix and $\beta$ is the precision (inverse variance) of the isotropic Gaussian noise.
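
The generative process described above can be sketched by sampling from it. This is a minimal illustration, not part of the original notes: the sizes `N`, `D`, `d`, the value of `beta`, and the randomly drawn `W` are all assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, d = 500, 10, 2   # illustrative sizes (assumptions)
beta = 25.0            # noise precision, so the noise variance is 1 / beta

W = rng.normal(size=(D, d))   # a hypothetical (D x d) projection matrix
X = rng.normal(size=(N, d))   # latent points, x_n ~ N(0, I)

# Isotropic Gaussian noise with standard deviation beta^{-1/2}
noise = rng.normal(scale=beta ** -0.5, size=(N, D))

# Row n of Y is one draw y_n ~ N(W x_n, beta^{-1} I)
Y = X @ W.T + noise

print(Y.shape)
```

Each row of `Y` is a $D$-dimensional observation generated from a $d$-dimensional latent point, which is the mapping that dimensionality reduction tries to invert.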

One can learn $W$ from a given data set by maximizing the likelihood of the data $Y$,
$$
p(Y \vert W, \beta) = \prod_{n=1}^N p({\bf y}_n\vert W, \beta)
$$
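
Because both the prior and the conditional are Gaussian, integrating out ${\bf x}_n$ gives a Gaussian marginal, $p({\bf y}_n\vert W,\beta) = \mathcal{N}(0, WW^\top + \beta^{-1}I)$, so the log of the likelihood above can be evaluated directly. The following is a minimal sketch, assuming zero-mean data stored row-wise in a NumPy array; the function name `ppca_log_likelihood` and the toy inputs are illustrative and not taken from any library.

```python
import numpy as np

def ppca_log_likelihood(Y, W, beta):
    """Log of prod_n N(y_n | 0, W W^T + I / beta) for an (N, D) data matrix Y."""
    N, D = Y.shape
    C = W @ W.T + np.eye(D) / beta                  # marginal covariance of each y_n
    _, logdet = np.linalg.slogdet(C)                # log|C|; C is positive definite
    quad = np.sum(Y * np.linalg.solve(C, Y.T).T)    # sum_n y_n^T C^{-1} y_n
    return -0.5 * (N * D * np.log(2 * np.pi) + N * logdet + quad)

# Toy usage with hypothetical sizes
rng = np.random.default_rng(0)
W_toy = rng.normal(size=(5, 2))
Y_toy = rng.normal(size=(20, 5))
print(ppca_log_likelihood(Y_toy, W_toy, beta=4.0))
```

Maximizing this quantity over $W$ (and $\beta$) is what "learning $W$" refers to here; a generic optimizer could be applied to the function as written.
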
One can use Bayes' rule to obtain the posterior probability distribution of the latent vector ${\bf x}_n$, i.e., $p({\bf x}_n\vert {\bf y}_n, W, \beta)$.
> Hint: It will also be a Gaussian!
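
To make the hint concrete: completing the square in ${\bf x}_n$ gives $p({\bf x}_n\vert {\bf y}_n, W, \beta) = \mathcal{N}(\mu_n, \Sigma)$ with $\Sigma = (I + \beta W^\top W)^{-1}$ and $\mu_n = \beta\,\Sigma W^\top {\bf y}_n$. The sketch below computes these two quantities; the function name `ppca_posterior` and the toy inputs are assumptions made for illustration.

```python
import numpy as np

def ppca_posterior(y, W, beta):
    """Mean and covariance of p(x | y, W, beta) under the PPCA model."""
    d = W.shape[1]
    Sigma = np.linalg.inv(np.eye(d) + beta * W.T @ W)   # posterior covariance
    mu = beta * Sigma @ W.T @ y                         # posterior mean
    return mu, Sigma

# Toy usage with hypothetical sizes (D = 6, d = 2)
rng = np.random.default_rng(1)
W_toy = rng.normal(size=(6, 2))
y_toy = rng.normal(size=6)
mu, Sigma = ppca_posterior(y_toy, W_toy, beta=10.0)
print(mu.shape, Sigma.shape)
```

Note that the posterior covariance does not depend on ${\bf y}_n$, and the posterior mean gives a natural low-dimensional embedding of ${\bf y}_n$.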

## Gaussian Process Latent Variable Models

In [7]:
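import numpy as np
# Assumption: the PCA helper used below (with plot_fracs and plot_2d) is GPy's;
# scikit-learn's PCA class does not provide these methods.
from GPy.util import pca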
digits = np.load('notebook/lab_classes/gprs/digits.npy')

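In [14]:
which = [0,3,6,7,8,9] # which digits to work on
# Run this cell only once after loading: it overwrites `digits`, so re-running it
# indexes the already-reduced 6-class array and raises an IndexError.
digits = digits[which,:,:,:]
num_classes, num_samples, height, width = digits.shape
labels = np.array([[str(l)]*num_samples for l in which])
# Flatten each image into a row vector of length height*width (256 for 16x16 images)
Y = digits.reshape((num_classes*num_samples, height*width))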

In [13]:
digits.shape

(6, 55, 16, 16)

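In [8]:
import matplotlib.pyplot as plt
# Run after the cell that builds the data matrix Y (one flattened image per row)
colors = ['red', 'blue', 'green', 'orange', 'purple', 'brown'] # one color per digit class (illustrative choice)
p = pca.PCA(Y) # create PCA class with digits dataset
p.plot_fracs(20) # plot first 20 eigenvalue fractions
p.plot_2d(Y,labels=labels.flatten(), colors=colors)
plt.legend()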