# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # Please enter your GT login, e.g., "rvuduc3" or "gtg911x"
COLLABORATORS = [] # list of strings of your collaborators' IDs

In [None]:
import re

RE_CHECK_ID = re.compile (r'''[a-zA-Z]+\d+|[gG][tT][gG]\d+[a-zA-Z]''')
assert RE_CHECK_ID.match (YOUR_ID) is not None

collab_check = [RE_CHECK_ID.match (i) is not None for i in COLLABORATORS]
assert all (collab_check)

del collab_check
del RE_CHECK_ID
del re

**Jupyter / IPython version check.** The following code cell verifies that you are using a sufficiently recent version of Jupyter/IPython (major version 3 or later).

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

# MNIST Handwritten Digits

One of the most famous datasets in the statistical machine learning literature is the [MNIST dataset of handwritten digits](http://yann.lecun.com/exdb/mnist/). This optional notebook is an "open-ended" one in which we ask you to apply principal components analysis (PCA) and $k$-means clustering to the MNIST data.

## Setup

The following cells are just setup code, largely copied from [Lab 12, Part 2](./part2.ipynb).

In [None]:
import os
import sys
import re

import numpy as np
import pandas as pd

from IPython.display import display, HTML
import matplotlib.pyplot as plt

%matplotlib inline

import seaborn as sns

from PIL import Image
import base64
from io import BytesIO

def to_base64 (png):
    return "data:image/png;base64," + base64.b64encode (png).decode("utf-8")

def im2gnp (image):
    """Converts a PIL image into an image stored as a 2-D Numpy array in grayscale."""
    return np.array (image.convert ('L'))

def gnp2im (image_np):
    """Converts an image stored as a 2-D grayscale Numpy array into a PIL image."""
    return Image.fromarray (image_np.astype (np.uint8), mode='L')

def gnp2thumbnail (image_np):
    im = gnp2im (image_np)
    memout = BytesIO ()
    im.save (memout, format='png')
    return to_base64 (memout.getvalue ())

def imshow_gray (im, ax=None):
    if ax is None:
        f = plt.figure ()
        ax = plt.axes ()
    ax.imshow (im,
               interpolation='nearest',
               cmap=plt.get_cmap ('gray'))
    
def peek_Sigma (Sigma, ret_df=False):
    k = len (Sigma)
    df_Sigma = pd.DataFrame ()
    df_Sigma['i'] = np.arange (k)
    df_Sigma['sigma_i'] = Sigma
    Sigma_sq = np.power (Sigma, 2)
    Err_sq = np.sum (Sigma_sq) - np.cumsum (Sigma_sq)
    Err_sq[Err_sq < 0] = 0
    Err = np.sqrt (Err_sq)
    Relerr = Err / np.sqrt (np.sum (Sigma_sq)) # error relative to the Frobenius norm of the full matrix
    df_Sigma['sigma_i^2'] = Sigma_sq
    df_Sigma['err_i^2'] = Err_sq
    df_Sigma['err_i'] = Err
    df_Sigma['relerr_i'] = Relerr
    print ("Singular values:")
    display (df_Sigma.head ())
    print ("  ...")
    display (df_Sigma.tail ())
    
    f, ax = plt.subplots (figsize=(7, 7))
    #ax.set (yscale="log")
    sns.regplot (x="i", y="sigma_i", data=df_Sigma, ax=ax, fit_reg=False) # keyword arguments; recent seaborn versions reject positional ones
    if ret_df:
        return df_Sigma
    
import bokeh
from bokeh.io import output_notebook
output_notebook ()
print ("Bokeh version:", bokeh.__version__)
#!conda upgrade bokeh

from bokeh.palettes import brewer

def make_color_map (values):
    """Given a collection of discrete values, generate a color map."""
    unique_values = np.unique (values) # values must be discrete
    num_unique_values = len (unique_values)
    min_palette_size = min (brewer['Set1'].keys ())
    max_palette_size = max (brewer['Set1'].keys ())
    assert num_unique_values <= max_palette_size
    palette = brewer['Set1'][max (min_palette_size, num_unique_values)]
    color_map = dict (zip (unique_values, palette))
    return color_map

# http://bokeh.pydata.org/en/latest/docs/user_guide/tools.html#userguide-tools-inspectors
from bokeh.io import show
from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import PanTool, BoxZoomTool, ResizeTool, HoverTool, CrosshairTool, ResetTool

def make_scatter2d_images (x, y, names=None, image_files=None, clustering=None):
    source_data = dict (x=x, y=y)
    if names is not None:
        source_data["desc"] = names
        tooltips_desc = """<span style="font-size: 17px; font-weight: bold;">@desc</span>"""
    else:
        tooltips_desc = ""
        
    if image_files is not None:
        source_data["imgs"] = image_files
        tooltips_images = """
            <div>
                <img
                    src="@imgs" height="42" alt="@imgs" width="42"
                    style="float: left; margin: 0px 15px 15px 0px;"
                    border="2"
                ></img>
            </div>
        """
    else:
        tooltips_images = ""
        
    source = ColumnDataSource (data=source_data)
    hover = HoverTool (tooltips="""
        <div>
            {}
            <div>
                {}
                <span style="font-size: 15px; color: #966;">[$index]</span>
            </div>
            <div>
                <span style="font-size: 15px;">Location</span>
                <span style="font-size: 10px; color: #696;">($x, $y)</span>
            </div>
        </div>
        """.format (tooltips_images, tooltips_desc))

    TOOLS = [PanTool (), BoxZoomTool (), ResizeTool (), hover, CrosshairTool (), ResetTool ()]
    p = figure (width=600, height=300, tools=TOOLS)
    
    if clustering is not None:
        color_map = make_color_map (clustering)
        cluster_colors = [color_map[c] for c in clustering]
        p.circle (x='x', y='y',
                  fill_color=cluster_colors,
                  line_color=cluster_colors,
                  size=5, source=source)
    else:
        p.circle (x='x', y='y', size=5, source=source)
    return p

from scipy.cluster.vq import kmeans, vq

## Downloading the MNIST data

We've provided an external module, `mnist.py`, to help you download and unpack the handwritten digits. The following cells do that for you.

> The code below downloads the training part of the MNIST data. There is also a separate testing set, used for evaluating machine learning methods.

In [None]:
# Download and unpack MNIST digits database
%reload_ext autoreload
%autoreload 2
import mnist

(mnist_images_gz, mnist_labels_gz) = mnist.download_mnist ('training')

print ("Images:", mnist_images_gz)
print ("Labels:", mnist_labels_gz)

For this demo, let's extract all the examples of "ones" and "eights" drawn by real people!

In [None]:
images, labels, inds = mnist.load_mnist (mnist_images_gz, mnist_labels_gz,
                                         digits=[1, 8], # the digits to load
                                         return_indices=True)

images *= 255 # Rescales the pixels to an 8-bit color scale.

Start by inspecting a few key properties of the data structures that hold the images and labels.

In [None]:
print (images.shape, type (images), images.dtype, np.min (images), np.max (images))
print (labels.shape, type (labels), labels.dtype, np.unique (labels))

Let's take a look at the `z`-th digit of the dataset.

In [None]:
z = 10
imshow_gray (images[z, :, :])
print ("Label ==", labels[z])

## Apply PCA

**Step 1.** Compute the mean image.

In [None]:
mean_image = np.mean (images, axis=0)
imshow_gray (mean_image)

**Step 2.** Subtract the mean away from each image.

In [None]:
images_adj = images - mean_image
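
Steps 1 and 2 rely on NumPy broadcasting: the mean image has shape `(height, width)` and is subtracted from every slice of the `(num_images, height, width)` array. Here is a minimal sketch on a small synthetic stack; `toy_images` is a made-up stand-in for `images`, not MNIST data.

```python
import numpy as np

# Hypothetical stand-in for `images`: 5 tiny 3x4 "images" of random pixels.
rng = np.random.default_rng(0)
toy_images = rng.uniform(0, 255, size=(5, 3, 4))

# Step 1: average over the image axis (axis 0), leaving one 3x4 mean image.
toy_mean = np.mean(toy_images, axis=0)

# Step 2: broadcasting subtracts the same mean image from every image.
toy_adj = toy_images - toy_mean

print(toy_mean.shape)                              # (3, 4)
print(np.allclose(np.mean(toy_adj, axis=0), 0.0))  # True
```

After centering, every pixel position averages to zero across the stack, which is what PCA assumes about the data matrix.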

**Step 3.** Form a data matrix.

In [None]:
# Create a data matrix
num_images, height, width = images.shape
X = np.reshape (images_adj, (num_images, height*width))
print (X.shape)
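
`np.reshape` flattens each image in row-major order, so row `i` of `X` holds the pixels of image `i`, and the flattening can be undone with another reshape. Here is a small sketch on a made-up array; `toy` and `X_toy` are hypothetical names used only for illustration.

```python
import numpy as np

# Hypothetical stack of 2 "images", each 3x4, filled with distinct values.
toy = np.arange(2 * 3 * 4, dtype=float).reshape(2, 3, 4)
n, h, w = toy.shape

# One row per image, one column per pixel.
X_toy = np.reshape(toy, (n, h * w))
print(X_toy.shape)                                           # (2, 12)

# Row i of the matrix is image i read row by row.
print(np.array_equal(X_toy[1], toy[1].ravel()))              # True

# Reshaping a row back recovers the original image.
print(np.array_equal(np.reshape(X_toy[1], (h, w)), toy[1]))  # True
```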

**Step 4.** Compute the SVD.

In [None]:
(U, Sigma, VT) = np.linalg.svd (X, full_matrices=False)

In [None]:
# Plot the spectrum
df_Sigma = peek_Sigma (Sigma, ret_df=True)

plt.plot (df_Sigma['i'],
          df_Sigma['sigma_i'] * np.power (df_Sigma['i']+1, -.01),
          color="red",
          linewidth=1)
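
The `err_i` column that `peek_Sigma` reports is the Frobenius-norm error of the best rank-$(i+1)$ approximation, which equals the square root of the sum of the discarded squared singular values. The sketch below checks that identity on a random matrix and computes the cumulative fraction of variance captured by the leading components. The matrix `A`, the rank `r`, and the 90% target are illustrative assumptions, not values taken from the MNIST data.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 20))
A -= A.mean(axis=0)  # center the columns, as in Steps 1 and 2

U_a, s_a, VT_a = np.linalg.svd(A, full_matrices=False)

# Rank-r truncation and its actual Frobenius-norm error.
r = 5
A_r = (U_a[:, :r] * s_a[:r]) @ VT_a[:r, :]
actual_err = np.linalg.norm(A - A_r, 'fro')

# Error predicted from the discarded singular values alone.
predicted_err = np.sqrt(np.sum(s_a[r:] ** 2))
print(np.isclose(actual_err, predicted_err))  # True

# Cumulative fraction of the total variance explained by the leading components.
explained = np.cumsum(s_a ** 2) / np.sum(s_a ** 2)

# Smallest number of components that reaches the (hypothetical) 90% target.
k_90 = int(np.searchsorted(explained, 0.90) + 1)
print(k_90, explained[k_90 - 1])
```

The same `explained` curve, computed from `Sigma`, is one way to pick the number of components to keep.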

In [None]:
# Visualize the first few principal components
k_viz = 4
fig, axs = plt.subplots (1, k_viz, figsize=(4*k_viz, 4)) # figsize is (width, height): one row of k_viz panels
for k in range (k_viz):
    imshow_gray (np.reshape (VT[k, :], (height, width)), ax=axs[k])

**Step 5.** Project the data onto the first $k$ principal axes.

In [None]:
k = 2
Y = X.dot (VT[:k, :].T)
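
Because $X = U \Sigma V^T$ and the rows of `VT` are orthonormal, projecting onto the first $k$ principal axes gives $Y = X V_k = U_k \Sigma_k$, so the same coordinates can also be read directly from `U` and `Sigma`. The sketch below checks this on random data and maps the projection back to the original space. All names ending in `_t`, `_toy`, `_proj`, `_svd`, or `_approx` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
X_toy = rng.standard_normal((30, 8))
X_toy -= X_toy.mean(axis=0)  # centered data matrix, one row per sample

U_t, s_t, VT_t = np.linalg.svd(X_toy, full_matrices=False)

k_toy = 2
Y_proj = X_toy.dot(VT_t[:k_toy, :].T)  # same formula as in Step 5
Y_svd = U_t[:, :k_toy] * s_t[:k_toy]   # equivalent form: U_k scaled by Sigma_k
print(np.allclose(Y_proj, Y_svd))      # True

# Map the k-dimensional coordinates back to the original space.
X_approx = Y_proj.dot(VT_t[:k_toy, :])
print(X_approx.shape)                  # (30, 8)

# The residual has no component along the retained principal axes.
residual = X_toy - X_approx
print(np.allclose(residual.dot(VT_t[:k_toy, :].T), 0.0))  # True
```

For the digit data, adding `mean_image` back to a reshaped row of the approximation would give an approximate image that can be displayed.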

Let's make an interactive plot of the projection.

In [None]:
thumbnails = [gnp2thumbnail (gnp) for gnp in images]

In [None]:
show (make_scatter2d_images (Y[:, 0], Y[:, 1],
                             names=labels,
                             image_files=thumbnails,
                             clustering=labels))

In [None]:
imshow_gray (images[2705, :, :])

In [None]:
imshow_gray (images[1407, :, :])

## Cluster the low-dimensional representation using $k$-means

In [None]:
num_components = 10 # Number of principal components
Y = X.dot (VT[:num_components, :].T)

num_clusters = 4
centers, distortion = kmeans (Y, num_clusters)
clustering, _ = vq (Y, centers)
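
Since the true digit labels are available, one way to judge the clustering is to tabulate how the labels are distributed within each cluster. The sketch below does this for small made-up arrays (`toy_labels` and `toy_clustering` are hypothetical stand-ins for `labels` and `clustering`) and computes a simple purity score: the fraction of points whose label matches the majority label of their cluster. Purity is just one possible measure, not something this notebook prescribes.

```python
import numpy as np

# Hypothetical labels (digits 1 and 8) and cluster assignments (ids 0, 1, 2).
toy_labels = np.array([1, 1, 1, 8, 8, 8, 8, 1])
toy_clustering = np.array([0, 0, 1, 1, 1, 2, 2, 0])

label_values = np.unique(toy_labels)
cluster_ids = np.unique(toy_clustering)

# counts[c, j] = number of points in cluster c that carry label label_values[j].
counts = np.array([[np.sum((toy_clustering == c) & (toy_labels == v))
                    for v in label_values]
                   for c in cluster_ids])
print(counts)

# Purity: each cluster votes for its majority label; count the matching points.
purity = counts.max(axis=1).sum() / len(toy_labels)
print(purity)  # 0.875
```

Running the same tabulation on the real `labels` and `clustering` arrays would show which clusters mix ones and eights.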

In [None]:
show (make_scatter2d_images (Y[:, 0], Y[:, 1],
                             names=labels,
                             image_files=thumbnails,
                             clustering=clustering))

In [None]:
imshow_gray (images[8801, :, :])

In [None]:
imshow_gray (images[93, :, :])

## References

Today's notebook uses a bunch of library modules and coding tricks; if you want to learn more, see these references.

_Image manipulation_
* Working with TIFFs: http://stackoverflow.com/questions/7569553/working-with-tiffs-import-export-in-python-using-numpy
* Displaying PIL images inline: http://stackoverflow.com/questions/26649716/how-to-show-pil-image-in-ipython-notebook
* Convert to grayscale: http://stackoverflow.com/questions/12201577/how-can-i-convert-an-rgb-image-into-grayscale-in-python
* MNIST digit recognition database: http://yann.lecun.com/exdb/mnist/

_PCA in Python_
* scikit-learn `PCA` class: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html