# Dimensionality reduction

###### COMP4670/8600 - Statistical Machine Learning - Tutorial

In this lab, we will use dimensionality reduction techniques to explore a dataset of pictures.

### Assumed knowledge
- PCA (lectures)

### After this lab, you should be comfortable with:
- Implementing PCA
- Visualising features derived from dimensionality reduction

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize as opt
%matplotlib inline

## Load the data

For this lab, we will use a dataset of images of Pokemon sprites.

Load the dataset from the file ``04-dataset.csv`` using ``np.loadtxt``. The data file represents a 2D array where each row is a 64 by 64 pixel greyscale picture. The entries are floats between 0 and 1, where 0 is white and 1 is black.

Note that while the images are 64 by 64 entries, the dataset you load has rows of size 4096 (which is $64\times 64$) to allow the data to be saved as a 2D array.
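
To make the flattening concrete, here is a minimal sketch of going from a row of 4096 entries back to a 64 by 64 picture. The array ``fake_images`` is a hypothetical random stand-in for the real dataset, and we assume the rows were flattened in NumPy's default row-major order (the same convention ``reshape`` uses).

```python
import numpy as np

# Hypothetical stand-in for the real dataset: 3 random "images" of 64x64 pixels.
rng = np.random.default_rng(0)
fake_images = rng.random((3, 64 * 64))

# Each row holds one flattened image; reshape recovers the 2D picture.
first_image = fake_images[0].reshape(64, 64)

# Flattening again gives back the original row (row-major order).
roundtrip = first_image.reshape(-1)
```

The same ``reshape`` call works on a row of the loaded ``images`` array, provided the assumption about the storage order holds.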

In [None]:
images = np.loadtxt('04-dataset.csv')

## Toy dataset for debugging

For debugging, it is useful to also have a simple dataset that we know is one-dimensional with some noise. You can use this to test that your functions produce sensible output. Below is a function that generates data from two Gaussians in $\mathbb{R}^n$ with unit variance, centered at $\mathbf{1}$ and $-\mathbf{1}$ respectively. (Note: $\mathbf{1}$ is the vector $(1, 1, 1, \ldots, 1)$ in $\mathbb{R}^n$.)

In [None]:
def gen_data(n_samples=100, n_feat=5):
    """Generate data from two Gaussians.
    n_samples = number of samples from each Gaussian
    n_feat = dimension of the features
    Returns an array of shape (n_feat, 2 * n_samples), one sample per column.
    """
    X1 = np.ones((n_feat, n_samples)) + np.random.randn(n_feat, n_samples)
    X2 = -np.ones((n_feat, n_samples)) + np.random.randn(n_feat, n_samples)
    X = np.hstack([X1,X2])
    return X

toy_data = gen_data()

## Implementing PCA

### Recap on PCA

Remember from lectures that the goal of PCA is to linearly project data points onto a lower dimensional subspace such that the variance of the projected data is maximised. 

Let the data be the set of data points $\{\mathbf{x}_n\}_{n=1}^N$, $\mathbf{x}_n\in\mathbb{R}^d$, with mean $\bar{\mathbf{x}}=\frac{1}{N}\sum_{n=1}^N\mathbf{x}_n$ and covariance matrix $\mathbf{S}=\frac{1}{N}\sum_{n=1}^N(\mathbf{x}_n-\bar{\mathbf{x}})(\mathbf{x}_n-\bar{\mathbf{x}})^T.$

From lectures, we derived that if we are linearly projecting onto a subspace of dimension $m<d$, then the $m$ directions to project onto are given by the $m$ eigenvectors of $\mathbf{S}$ with the $m$ largest eigenvalues, and the variance along each direction is equal to the corresponding eigenvalue.

### Using the SVD to implement PCA

Let us assume that $\bar{\mathbf{x}}=\mathbf{0}$. Then $\mathbf{S}=\frac{1}{N}\sum_{n=1}^N\mathbf{x}_n\mathbf{x}_n^T$. 
However, it turns out that
$$\sum_{n=1}^N\mathbf{x}_n\mathbf{x}_n^T=X^TX$$
where $X\in\mathbb{R}^{N\times d}$ is the data matrix whose $n$-th row is $\mathbf{x}_n^T$.
Thus, to find the eigenvalues and eigenvectors of the covariance matrix, we need to find the eigenvalues and eigenvectors of $\frac{1}{N}X^TX$.
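
Before working through the algebra, it can help to check the identity numerically. The sketch below uses a small random matrix (the sizes ``N`` and ``d`` are arbitrary choices for illustration) and compares the explicit sum of outer products with a single matrix product. This is a sanity check only, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 4
X = rng.standard_normal((N, d))  # rows are the data points x_n

# Left-hand side: explicit sum of the outer products x_n x_n^T.
outer_sum = sum(np.outer(x, x) for x in X)

# Right-hand side: a single matrix product.
gram = X.T @ X

identity_holds = np.allclose(outer_sum, gram)
```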

It turns out that if the SVD of $X$ is $X=U\Sigma V^T$, then the eigenvectors of $\mathbf{S}$ that correspond to its $m$ largest eigenvalues are the column vectors of $V$ that correspond to the $m$ largest singular values of $X$.
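
This claim can also be checked numerically. The following sketch, on synthetic centred data with arbitrarily chosen feature scales, compares an eigendecomposition of $\frac{1}{N}X^TX$ with the SVD of $X$. Note that ``np.linalg.svd`` returns $V^T$ rather than $V$, and that eigenvectors are only determined up to sign, so we compare absolute inner products. The variable names are our own.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 200, 5
# Different scales per feature so that the eigenvalues are well separated.
X = rng.standard_normal((N, d)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)  # centre so that S = X^T X / N

S_cov = X.T @ X / N
eigvals, eigvecs = np.linalg.eigh(S_cov)             # ascending order
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]   # reorder to descending

U, sing, Vt = np.linalg.svd(X, full_matrices=False)  # sing is descending

# Eigenvalues of S should equal the squared singular values divided by N.
eigvals_match = np.allclose(eigvals, sing**2 / N)

# Columns of V should match the eigenvectors up to sign.
alignment = np.abs(np.sum(Vt.T * eigvecs, axis=0))
vectors_match = np.allclose(alignment, 1.0, atol=1e-6)
```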

### Question
Show the steps that were left out (the two places where it says "it turns out").

### <span style="color:blue">Answer</span>
<i>--- replace this with your solution, add and remove code and markdown cells as appropriate ---</i>

### Implement PCA
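Implement principal component analysis. Your function should take the data matrix and the number of components you wish to calculate, and return two matrices:
1. The projection of the data onto the principal components.
2. The actual components (eigenvectors) themselves.

Note that the verification cell further down unpacks five values, ``Z, P, U, S, V = pca(...)``, so your function should also return the three SVD factors after these two matrices.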

Hint: Do not forget to center the data by removing the mean so that you can use the above method. You may find ``np.linalg.svd`` useful.
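
If you get stuck, the outline below is one possible way to follow the hint. It is a minimal sketch, not the lab's reference solution: ``pca_sketch`` is a hypothetical name, it assumes one data point per row, and it returns only the two matrices listed above (the SVD factors are left out).

```python
import numpy as np

def pca_sketch(X, n_pc=2):
    """Hypothetical helper (not the lab's reference solution).

    X has one data point per row. Returns the projection Z (N x n_pc)
    and the components P (d x n_pc, one component per column).
    """
    X_centred = X - X.mean(axis=0)  # remove the mean of each feature
    # np.linalg.svd returns V transposed; singular values are sorted descending.
    U, sing, Vt = np.linalg.svd(X_centred, full_matrices=False)
    P = Vt[:n_pc].T                 # top n_pc right singular vectors
    Z = X_centred @ P               # coordinates of the data in the new basis
    return Z, P

rng = np.random.default_rng(0)
demo = rng.standard_normal((100, 5))
Z_demo, P_demo = pca_sketch(demo, n_pc=2)
```

The components come out orthonormal, and the projected coordinates are uncorrelated with each other, which are two quick properties to test your own implementation against.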

In [None]:
def pca(X, n_pc=2):
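    """Returns Z, P, U, S, V: the projection onto the first n_pc principal
    components (default=2), the components themselves, and the SVD factors,
    in the order unpacked by the verification cell."""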
    
    # TODO
    raise NotImplementedError

def svd2pca(U,S,V, n_pc=2):
    """Returns the projection onto the principal components (default=2). Useful
    when you want to change n_pc and have already computed the SVD once."""
    
    # TODO
    raise NotImplementedError

### Verifying the calculation with the toy data

Below we calculate the projection of the toy data onto the first two principal components.

1. Does PCA pick up the two Gaussians?
2. What are the eigenvalues associated with these principal components? What do they tell you about how much variance these components explain?
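
As a hedged illustration of how singular values relate to explained variance, the sketch below rebuilds a toy dataset like the one above (already arranged with one sample per row, and with a fixed seed, both our own choices) and converts the singular values into eigenvalues of the covariance matrix via $\lambda_i=\sigma_i^2/N$. The names ``explained_ratio`` and ``cosine`` are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_feat = 100, 5
# Same construction as gen_data, but laid out with one sample per row.
X = np.vstack([
    1.0 + rng.standard_normal((n_samples, n_feat)),
    -1.0 + rng.standard_normal((n_samples, n_feat)),
])
X_centred = X - X.mean(axis=0)

_, sing, Vt = np.linalg.svd(X_centred, full_matrices=False)
eigenvalues = sing**2 / X.shape[0]            # variance along each component
explained_ratio = eigenvalues / eigenvalues.sum()

# Cosine between the first component and the direction of the all-ones vector.
ones_dir = np.ones(n_feat) / np.sqrt(n_feat)
cosine = abs(Vt[0] @ ones_dir)
```

Printing ``explained_ratio`` and ``cosine`` gives a quick way to compare what you observe with what you would expect from how the data was generated.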

In [None]:
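# toy_data has one sample per column; pca expects one sample per row, hence the transpose
print(toy_data.shape)
Z, P, U, S, V = pca(toy_data.T)
fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(111)
# The first 100 rows come from the Gaussian centred at 1, the rest from the one centred at -1
ax.plot(Z[:100,0], Z[:100,1], 'ro')
ax.plot(Z[100:,0], Z[100:,1], 'bx')
ax.set_xlabel('First principal component')
ax.set_ylabel('Second principal component')
print(S)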

## Eigen-pokemon

If we perform PCA on a dataset, the principal components are vectors in the same space as our data points. In particular, if we do this on a dataset of images, each principal component has one entry per pixel, so we can interpret the principal components as images.

The following function plots a gallery of images.

In [None]:
# Visualising images
def plot_gallery(images, titles, h, w, n_row=2, n_col=6):
    """Helper function to plot a gallery of portraits.
    Arguments: images: a matrix where each row is an image.
    titles: an array of labels for each image.
    h: the height in pixels of each image.
    w: the width in pixels of each image.
    n_row: the number of rows of images to print.
    n_col: the number of columns of images to print."""
    assert len(images) >= n_row * n_col
    assert len(titles) >= n_row * n_col
    plt.figure(figsize=(1.8 * n_col, 2.4 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
    plt.show()

We can use ``plot_gallery`` to plot the first 30 Pokemon images.

In [None]:
plot_gallery(images, np.arange(30), 64, 64, 5, 6)

Perform PCA on the Pokemon dataset to find the first 200 principal components. Visualise the first 100 using ``plot_gallery``.

### Question

What do you notice about the first few principal components? What are they detecting?
Plot the associated eigenvalues. How can you interpret these?

In [None]:
# replace this with your solution, add and remove code and markdown cells as appropriate

### <span style="color:blue">Answer</span>
<i>--- replace this with your solution, add and remove code and markdown cells as appropriate ---</i>

In [None]:
# replace this with your solution, add and remove code and markdown cells as appropriate

## Reconstructing images using PCA

Plot the reconstructions of the first 30 images using 200 principal components, and using the first 15 principal components. Don't forget to add the mean back in. How good are these reconstructions?
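
As a rough guide to what "reconstruction" means here, the sketch below projects centred data onto the first few components, maps it back to the original space, and adds the mean back in. It runs on a small random stand-in rather than the Pokemon data, and ``reconstruct`` is a hypothetical helper name.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((40, 16))  # hypothetical stand-in: 40 "images" of 16 pixels
mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)

def reconstruct(X, mean, Vt, n_pc):
    """Project onto the first n_pc components, map back, and restore the mean."""
    P = Vt[:n_pc].T          # d x n_pc matrix of components
    Z = (X - mean) @ P       # low-dimensional coordinates
    return Z @ P.T + mean    # back to pixel space, mean added back in

# Reconstruction error for a few, many, and all components.
err_few = np.linalg.norm(X - reconstruct(X, mean, Vt, 3))
err_many = np.linalg.norm(X - reconstruct(X, mean, Vt, 12))
err_all = np.linalg.norm(X - reconstruct(X, mean, Vt, 16))
```

The error shrinks as more components are kept and vanishes (up to rounding) when all of them are used, which is a useful check before plotting the reconstructed images with ``plot_gallery``.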

In [None]:
# replace this with your solution, add and remove code and markdown cells as appropriate

In [None]:
# replace this with your solution, add and remove code and markdown cells as appropriate