# Clustering and Gaussian Mixture Models (GMMs)

**COMP9418-17s2, W04 Tutorial**

- Instructor: Edwin V. Bonilla
- School of Computer Science and Engineering, UNSW Sydney
- Questions by Daniel Mackinlay and Edwin V. Bonilla
$$
% macros
\newcommand{\indep}{\perp \!\!\!\perp}
$$

In this tutorial we study the behaviour of Gaussian mixture models on a supplied data set, and see how the models are estimated by maximum likelihood via the Expectation Maximisation (EM) algorithm.


## Technical prerequisites

You will need certain packages installed to run this notebook.

If you are using ``conda``'s default
[full installation](https://conda.io/docs/install/full.html),
these requirements should all be satisfied already.

If you are using ``virtualenv`` or other native package management,
you may need to run these commands:

```bash
pip install scikit-learn seaborn
```
You will also need to download the preprocessed `usps_gmm_3d.mat` data set
(see data file for this tutorial in WebCMS3)
and put it in the same folder as this notebook.



Once we have done all that, we
import some useful modules for later use.

In [13]:
# Make division default to floating-point, saving confusion
from __future__ import division
from __future__ import print_function

# Necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from random import random
from scipy.io import loadmat
import seaborn as sns
from collections import OrderedDict as odict

# Put the graphs where we can see them
%matplotlib inline
sns.set(style="ticks")

# easier debugging display
np.set_printoptions(edgeitems=5, precision=3, suppress=False)
from pprint import pprint

# Data loading


The `usps_gmm_3d.mat` file contains the following variables:

- `xtrain3d` : Three-dimensional PCA representation of training examples of the digits 2, 3 and 5. Each row is a 3-vector corresponding to one digit.
- `xtest3d` : Three-dimensional PCA representation of testing examples of the digits 2, 3 and 5. Each row is a 3-vector corresponding to one digit.
- `ytrain` : Corresponding labels of `xtrain3d`, containing 2, 3 or 5 accordingly.
- `ytest` : Corresponding labels of `xtest3d`, containing 2, 3 or 5 accordingly.
- `mu` : Mean vector (in original data space) used in PCA decomposition.
- `E`: Matrix of eigenvectors used in PCA decomposition.

The details of the *PCA* representation of the data aren't important for this tutorial; it is a low-dimensional representation of the full images. In this case, 256-pixel images are squished down into 3 dimensions. If you are curious about this compression trick, you can use the variables `mu` and `E` to explore it (and even to reconstruct the original digits, which we will do in the final exercise). But for now, we are interested in using the 3-dimensional representations without worrying too much about the details.

In [60]:
data = loadmat('./usps_gmm_3d.mat')
xtrain3d = data['xtrain3d']
ytrain = data['ytrain']
xtest3d = data['xtest3d']
ytest = data['ytest']
pca_mu = data['mu']
pca_e = data['E']
data.keys()
del(data)

## Exercise
Create an index array for each digit in each 3d *x* dataset.
For example, `x2test` should be an array such that `xtest3d[x2test, :]` is an array containing all the examples of digit 2 and nothing else.

If you have problems with the shapes of the array not matching up, remember that `numpy` arrays have, for example,  `ravel` and `flatten` methods to produce flat arrays suitable for indexing.
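
As a minimal sketch of the shape issue mentioned above, the following uses small made-up arrays (`y_demo` and `x_demo` are hypothetical stand-ins, not the USPS data) and assumes the labels arrive as an `(n, 1)` column, which is how `loadmat` typically returns vectors.

```python
import numpy as np

# Hypothetical stand-ins for the tutorial arrays (not the real USPS data):
# labels stored as an (n, 1) column, features as an (n, 3) matrix.
y_demo = np.array([[2], [3], [5], [2], [3]])
x_demo = np.arange(15, dtype=float).reshape(5, 3)

# ravel() flattens the (n, 1) label column to shape (n,), so the boolean
# mask can be used to select rows of x_demo directly.
is_two = (y_demo.ravel() == 2)
rows_of_twos = x_demo[is_two, :]

print(is_two.shape)        # (5,)
print(rows_of_twos.shape)  # (2, 3)
```

A boolean mask is only one option; `np.flatnonzero(is_two)` would give the equivalent integer indices.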

In [15]:
# Your code here

## Exercise
Plot each of the digits (assigning a different colour to each).

You can choose whether to do this in 2d or 3d.
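
One possible way to colour points by class in 2d is sketched below on synthetic data. The arrays `points_demo` and `labels_demo` are invented for illustration; with the real data you would substitute your own index arrays.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic 3-d points with hypothetical labels 2, 3 and 5 (not the USPS data).
rng = np.random.RandomState(0)
labels_demo = np.repeat([2, 3, 5], 50)
points_demo = rng.randn(150, 3) + labels_demo[:, None]

fig, ax = plt.subplots()
for digit, colour in zip((2, 3, 5), ("tab:blue", "tab:orange", "tab:green")):
    mask = labels_demo == digit
    # Plot the first two coordinates of this class in its own colour.
    ax.scatter(points_demo[mask, 0], points_demo[mask, 1],
               s=10, color=colour, label=str(digit))
ax.set_xlabel("component 1")
ax.set_ylabel("component 2")
ax.legend(title="digit")
```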

In [None]:
# Your code here

## Exercise

For the remainder of this assignment we will be concentrating on the digit 2. Plot its distribution separately, in 2 or 3 dimensions.


In [None]:
# Your code here

# Gaussian mixtures
Now we will fit a Gaussian mixture model to the data.

We don't need to write our own implementation; instead we can use the convenient
[scikit-learn Gaussian Mixture](http://scikit-learn.org/stable/modules/mixture.html) models.


In [1]:
from sklearn import mixture

## Exercise

Fit a Gaussian mixture model (using `mixture.GaussianMixture`) to the digit-2 training examples, `xtrain3d[x2train, :]`.

Repeat with number of components `n_components`=1, 2, 4, 6, 8, 10. For each, use `covariance_type='full'`.
Examine and understand the result.
How do we extract the clusters and their shapes? What does the `predict_proba` method do?
How about `score_samples`?
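
To show where the fitted quantities live, here is a small sketch on synthetic data (two made-up blobs, not the digits). It only illustrates the attributes and methods named above; interpreting them for the digit data is still left to you.

```python
import numpy as np
from sklearn import mixture

# Synthetic data: two well-separated 3-d blobs (not the USPS data).
rng = np.random.RandomState(0)
x_demo = np.vstack([rng.randn(200, 3) - 4.0, rng.randn(200, 3) + 4.0])

gmm_demo = mixture.GaussianMixture(
    n_components=2, covariance_type='full', random_state=0)
gmm_demo.fit(x_demo)

print(gmm_demo.weights_.shape)      # (2,)      mixing proportions
print(gmm_demo.means_.shape)        # (2, 3)    one centre per component
print(gmm_demo.covariances_.shape)  # (2, 3, 3) one full covariance each

# predict_proba: posterior probability of each component for each sample.
resp = gmm_demo.predict_proba(x_demo)
# score_samples: log of the mixture density evaluated at each sample.
log_density = gmm_demo.score_samples(x_demo)
```

Each row of `resp` sums to one, and the mean of `log_density` equals `gmm_demo.score(x_demo)`.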

In [None]:
models = odict()
for n_components in (1, 2, 4, 6, 8, 10):
    # Your code here

## Exercise

Train the mixture model for several different numbers of components and identify the "best" based on test-set performance.
To this end, record the training and test log-likelihoods for each number of components, and for each replicate, in lists called `train_loglikelihoods` and `test_loglikelihoods`, with the matching number of components in `n_components`.

Note that the mixture model is sensitive to its initialization. By default (`random_state=None`) scikit-learn draws from numpy's global random number generator, so you should expect different results each time.
If you wish to get a reproducible result, you can use a deterministically provided random seed, using the `random_state` parameter.

You will need to repeat the model fit procedure several times for each number of components to get a fair evaluation of its performance.
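
The replicate loop described above might look roughly like the sketch below. It runs on synthetic blobs rather than the digits, and the `demo_*` names and the helper `two_blobs` are hypothetical; here each replicate differs only in its `random_state`.

```python
import numpy as np
from sklearn import mixture

rng = np.random.RandomState(0)

def two_blobs(n):
    """Draw n points from each of two separated 3-d Gaussians (synthetic)."""
    return np.vstack([rng.randn(n, 3) - 4.0, rng.randn(n, 3) + 4.0])

x_train_demo, x_test_demo = two_blobs(150), two_blobs(150)

demo_n_components, demo_train_ll, demo_test_ll = [], [], []
for k in (1, 2, 3):
    for seed in range(3):  # replicates differ only in their seed
        gmm_demo = mixture.GaussianMixture(
            n_components=k, covariance_type='full', random_state=seed)
        gmm_demo.fit(x_train_demo)
        demo_n_components.append(k)
        # score() returns the average log-likelihood per sample.
        demo_train_ll.append(gmm_demo.score(x_train_demo))
        demo_test_ll.append(gmm_demo.score(x_test_demo))
```

Because `score` is a per-sample average, training and test values are comparable even when the two sets differ in size.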


In [25]:
max_components = 10
n_replicates = 10
test_loglikelihoods = []
train_loglikelihoods = []
n_components = []

# Your code here

## Exercise

Create a scatterplot comparing the loglikelihood across the test and training data fits for each different number of components.
Are there differences between training and test log-likelihood? Why? What would a “good” number of mixture components be?
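
A comparison plot of this kind could be laid out as in the following sketch. The data are synthetic and deliberately small, all variable names are hypothetical, and only one fit per number of components is shown to keep it short.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn import mixture

rng = np.random.RandomState(1)

def two_blobs(n):
    """Draw n points from each of two separated 3-d Gaussians (synthetic)."""
    return np.vstack([rng.randn(n, 3) - 3.0, rng.randn(n, 3) + 3.0])

x_tr, x_te = two_blobs(40), two_blobs(200)

ks = list(range(1, 7))
train_ll, test_ll = [], []
for k in ks:
    g = mixture.GaussianMixture(
        n_components=k, covariance_type='full', random_state=0).fit(x_tr)
    train_ll.append(g.score(x_tr))  # mean log-likelihood on training data
    test_ll.append(g.score(x_te))   # mean log-likelihood on held-out data

fig, ax = plt.subplots()
ax.scatter(ks, train_ll, label="train")
ax.scatter(ks, test_ll, marker="x", label="test")
ax.set_xlabel("number of components")
ax.set_ylabel("mean log-likelihood per sample")
ax.legend()
```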

In [7]:
# Your code here

Next, we define a useful function:

In [8]:
def reconstruct_from_pca(mean):
    """
    Given a length 3 vector `mean`,
    reconstruct it as an image in the original space.
    """
    reconstruction = np.dot(pca_e, mean.reshape((-1, 1))) + pca_mu.T
    return reconstruction.reshape((16, 16))
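
To illustrate what this reconstruction computes, the sketch below repeats the same arithmetic with a random orthonormal basis. The shapes are assumptions (a `(256, 3)` eigenvector matrix and a `(1, 256)` mean row), and `basis_demo` and `mean_demo` are made-up stand-ins for `pca_e` and `pca_mu`.

```python
import numpy as np

rng = np.random.RandomState(0)
# Assumed shapes: 256 pixels, 3 components, mean stored as a (1, 256) row.
basis_demo, _ = np.linalg.qr(rng.randn(256, 3))  # orthonormal columns
mean_demo = rng.rand(1, 256)

code = np.array([0.5, -1.0, 2.0])  # a point in the 3-d space
# Map back to pixel space: basis @ code + mean, then view as a 16x16 image.
pixels = np.dot(basis_demo, code.reshape((-1, 1))) + mean_demo.T
image = pixels.reshape((16, 16))

# Projecting the reconstruction back onto the basis recovers the code.
recovered = np.dot(basis_demo.T, pixels - mean_demo.T).ravel()
print(np.allclose(recovered, code))  # True
```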

## Exercise

In [None]:
gmm = mixture.GaussianMixture(
    n_components=3,
    covariance_type='spherical')
gmm.fit(xtrain3d[x2train, :])
for mean in gmm.means_:
    plt.figure()
    plt.imshow(reconstruct_from_pca(mean))
               

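Fit a mixture model to `xtrain3d` with K=3 components.
Pass `covariance_type='diag'` to the `mixture.GaussianMixture` constructor (it is a constructor argument, not an argument of the `fit` method). The cell above shows the same steps for the digit-2 subset with spherical covariances.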
Project the centres of the learned mixture to the original ambient space using
`reconstruct_from_pca`.

Make an image plot of these centres. Are they significantly different?
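
For reference, the choice of `covariance_type` changes the shape of the fitted `covariances_` attribute. The following sketch checks this on synthetic data (`x_demo` is random noise, not the digits).

```python
import numpy as np
from sklearn import mixture

rng = np.random.RandomState(0)
x_demo = rng.randn(300, 3)  # synthetic 3-d data, not the USPS digits

shapes = {}
for cov_type in ('full', 'diag', 'spherical'):
    gmm_demo = mixture.GaussianMixture(
        n_components=3, covariance_type=cov_type, random_state=0)
    gmm_demo.fit(x_demo)
    # full: one 3x3 matrix per component; diag: one variance per axis;
    # spherical: a single variance per component.
    shapes[cov_type] = gmm_demo.covariances_.shape

print(shapes)
```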