# Important note!

Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
YOUR_ID = "" # Please enter your GT login, e.g., "rvuduc3" or "gtg911x"
COLLABORATORS = [] # list of strings of your collaborators' IDs

In [None]:
import re

RE_CHECK_ID = re.compile (r'''[a-zA-Z]+\d+|[gG][tT][gG]\d+[a-zA-Z]''')
assert RE_CHECK_ID.match (YOUR_ID) is not None

collab_check = [RE_CHECK_ID.match (i) is not None for i in COLLABORATORS]
assert all (collab_check)

del collab_check
del RE_CHECK_ID
del re

**Jupyter / IPython version check.** The following code cell verifies that you are using a sufficiently recent version of Jupyter/IPython.

In [None]:
import IPython
assert IPython.version_info[0] >= 3, "Your version of IPython is too old, please update it."

# Implementing PCA

For our course's final class, let's take the conceptual ideas of a principal components analysis (PCA) and $k$-means clustering, and put them into practice analyzing a dataset that should be personally meaningful to you!

For this notebook, you will need the following dataset:

* https://t-square.gatech.edu/access/content/group/gtc-3bd6-e221-5b9f-b047-31c7564358b7/peeps3.zip

You'll also need a bunch of modules; might as well preload those now:

In [None]:
import os
import sys
import re

In [None]:
import numpy as np
import pandas as pd

In [None]:
from IPython.display import display, HTML
import matplotlib.pyplot as plt

%matplotlib inline

import seaborn as sns

In [None]:
from PIL import Image
import base64
from io import BytesIO

def to_base64 (png):
    return "data:image/png;base64," + base64.b64encode (png).decode("utf-8")

In [None]:
import bokeh
from bokeh.io import output_notebook
output_notebook ()
print ("Bokeh version:", bokeh.__version__)
#!conda upgrade bokeh

In [None]:
from bokeh.palettes import brewer

def make_color_map (values):
    """Given a collection of discrete values, generate a color map."""
    unique_values = np.unique (values) # values must be discrete
    num_unique_values = len (unique_values)
    min_palette_size = min (brewer['Set1'].keys ())
    max_palette_size = max (brewer['Set1'].keys ())
    assert num_unique_values <= max_palette_size
    palette = brewer['Set1'][max (min_palette_size, num_unique_values)]
    color_map = dict (zip (unique_values, palette))
    return color_map

In [None]:
# http://bokeh.pydata.org/en/latest/docs/user_guide/tools.html#userguide-tools-inspectors
from bokeh.io import show
from bokeh.plotting import figure, ColumnDataSource
from bokeh.models import PanTool, BoxZoomTool, ResizeTool, HoverTool, CrosshairTool, ResetTool

def make_scatter2d_images (x, y, names=None, image_files=None, clustering=None):
    source_data = dict (x=x, y=y)
    if names is not None:
        source_data["desc"] = names
        tooltips_desc = """<span style="font-size: 17px; font-weight: bold;">@desc</span>"""
    else:
        tooltips_desc = ""
        
    if image_files is not None:
        source_data["imgs"] = image_files
        tooltips_images = """
            <div>
                <img
                    src="@imgs" height="42" alt="@imgs" width="42"
                    style="float: left; margin: 0px 15px 15px 0px;"
                    border="2"
                ></img>
            </div>
        """
    else:
        tooltips_images = ""
        
    source = ColumnDataSource (data=source_data)
    hover = HoverTool (tooltips="""
        <div>
            {}
            <div>
                {}
                <span style="font-size: 15px; color: #966;">[$index]</span>
            </div>
            <div>
                <span style="font-size: 15px;">Location</span>
                <span style="font-size: 10px; color: #696;">($x, $y)</span>
            </div>
        </div>
        """.format (tooltips_images, tooltips_desc))

    TOOLS = [PanTool (), BoxZoomTool (), ResizeTool (), hover, CrosshairTool (), ResetTool ()]
    p = figure (width=600, height=300, tools=TOOLS)
    
    if clustering is not None:
        color_map = make_color_map (clustering)
        cluster_colors = [color_map[c] for c in clustering]
        p.circle (x='x', y='y',
                  fill_color=cluster_colors,
                  line_color=cluster_colors,
                  size=5, source=source)
    else:
        p.circle (x='x', y='y', size=5, source=source)
    return p

In [None]:
from scipy.cluster.vq import kmeans, vq

## Recap: Solving the PCA problem

Recall the basic algorithm to compute a PCA. The theory is explained in [this lab's accompanying notes](./pca-svd-notes.ipynb), and an interactive visual demo appears at http://setosa.io/ev/principal-component-analysis/.

You are given a set of $m$ data points or observations, $X \equiv (\hat{x}_0, \hat{x}_1, \cdots, \hat{x}_{m-1})^T$. Each observation consists of $d$ measured predictors, which we represent by the $d$-dimensional vector $\hat{x}_i \in \mathbb{R}^d$. You wish to find a $k$-dimensional representation of these points, where $k \leq d$. To do so, you run the PCA procedure, which identifies a $k$-dimensional subspace in terms of $k$ orthogonal vectors ("axes"); these vectors are the _principal components_.

1. If the data are not centered, transform them accordingly. In particular, ensure that their mean is 0, i.e., $\displaystyle \frac{1}{m} \sum_{i=0}^{m-1} \hat{x}_i = 0$.
2. Compute the $k$-truncated SVD, $X \approx U_k \Sigma_k V_k^T$. The truncated SVD keeps only the $k$ largest singular values together with their corresponding left and right singular vectors.
3. Choose $v_0, v_1, \ldots, v_{k-1}$ as the principal components.
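
The three steps above can be sketched on synthetic data as follows. This is only an illustration, not the solution to any exercise: the sizes `m`, `d`, `k` and the names `X_raw`, `X_demo`, `U_demo`, `s_demo`, `VT_demo`, and `components` are made up for the example, and it computes a full (thin) SVD and then truncates it rather than calling a dedicated truncated-SVD routine.

```python
import numpy as np

rng = np.random.default_rng(0)        # synthetic data, for illustration only
m, d, k = 50, 5, 2                    # hypothetical sizes
X_raw = rng.normal(size=(m, d))

# Step 1: center the data so that every column (predictor) has zero mean.
X_demo = X_raw - X_raw.mean(axis=0)

# Step 2: thin SVD; NumPy returns the singular values in descending order.
U_demo, s_demo, VT_demo = np.linalg.svd(X_demo, full_matrices=False)

# Step 3: the first k rows of V^T (columns of V) are the principal components.
components = VT_demo[:k, :]

print(components.shape)               # (2, 5)
```

Each row of `components` is one principal component, and the rows are mutually orthonormal.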

## The dataset

If you haven't done so already, download and unpack the dataset into this notebook's working directory.

The dataset is a bunch of goofy images. Let's look at one, selected at random. I swear I picked it randomly.

In [None]:
goofy = Image.open ('peeps/scollins46--spencer.tiff', 'r') # Load an image
goofy

Let's convert this image into a Numpy array, and then also to grayscale.

In [None]:
def im2gnp (image):
    """Converts a PIL image into an image stored as a 2-D Numpy array in grayscale."""
    return np.array (image.convert ('L'))

def gnp2im (image_np):
    """Converts an image stored as a 2-D grayscale Numpy array into a PIL image."""
    return Image.fromarray (image_np.astype (np.uint8), mode='L')

def imshow_gray (im, ax=None):
    if ax is None:
        f = plt.figure ()
        ax = plt.axes ()
    ax.imshow (im,
               interpolation='nearest',
               cmap=plt.get_cmap ('gray'))                   
    
goofy_np_gray = im2gnp (goofy) # Reuse the grayscale conversion helper defined above
imshow_gray (goofy_np_gray)
print ("What a ham!")

Next, let's load all the images as grayscale into a list of Numpy arrays, `original_images`, along with an array `image_names` to hold a name for each image. (The names are extracted from the image filename.)

> You may need to adjust the filepath below if this code does not work for you.

In [None]:
original_images = []
image_names = []
for base, dirs, files in os.walk ('peeps'):
    for filename in files:
        name_tiff = re.match (r'^(.*)\.tiff$', filename)
        if name_tiff:
            filepath = os.path.join (base, filename)
            im = im2gnp (Image.open (filepath, 'r'))
            key = name_tiff.group (1) # Filename without the `.tiff` extension
            
            original_images.append (im)
            image_names.append (key)
            
print ("Found", len (original_images), "goofy images.\n")

It will sometimes be helpful to have a friendly name for each image; the following code collects those names, extracting them from the `image_names[:]` list computed above.

In [None]:
names = []
for key in image_names:
    key_fields = re.match (r'^(.*)--(.*)$', key)
    assert key_fields is not None
    names.append (key_fields.groups ()[1])

Lastly, the latter part of this notebook creates an interactive visualization, for which we will need thumbnail versions of these images. The following code creates those thumbnails. It stores them as a list, `thumbnails[:]`, of Base64-encoded binary `PNG` data, which can be embedded directly into HTML.

In [None]:
thumbnails = []
for gnp in original_images:
    im = gnp2im (gnp)
    memout = BytesIO ()
    im.save (memout, format='png')
    thumbnails.append (to_base64 (memout.getvalue ()))

## Preprocessing the images

To apply PCA, we'll want to preprocess the images in various ways.

To begin with, observe that the images come in all shapes and sizes.

In [None]:
min_rows, min_cols = sys.maxsize, sys.maxsize
max_rows, max_cols = 0, 0
for (i, image) in enumerate (original_images):
    r, c = image.shape[0], image.shape[1]
    print ('%d:' % i, image_names[i], "--", r, "x", c, "pixels")
    
    min_rows = min (min_rows, r)
    max_rows = max (max_rows, r)
    min_cols = min (min_cols, c)
    max_cols = max (max_cols, c)
    
print ("\n==> Least common image size:", min_rows, "x", min_cols, "pixels")

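**Exercise 1** (2 points). Suppose the "least common image size" reported above, i.e., the minimum number of rows and the minimum number of columns over all images, is $r_0 \times c_0$ pixels, and let $s_0 = \min (r_0, c_0)$ be the smallest dimension. Crop each $r \times c$ image so that it is $s_0 \times s_0$ in size. If $r > s_0$, then crop out any extra rows on the bottom of the image; and if $c > s_0$, then crop the columns symmetrically so that the middle $s_0$ columns remain. Complete the function `recenter` accordingly. The cell after the quick test uses it to store the cropped images in a 3-D Numpy array called `images_recentered`, where `images_recentered[k, :, :]` is the `k`-th image.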
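
To make the cropping rule concrete, here is a small sketch on a toy array. The helper name `crop_top_center` and the array `toy` are hypothetical, and this is just one possible reading of the rule (keep the top $s_0$ rows and the middle $s_0$ columns); when the number of extra columns is odd, it drops the additional column on the right.

```python
import numpy as np

def crop_top_center(image, s0):
    """Hypothetical helper: keep the top s0 rows and the middle s0 columns."""
    r, c = image.shape
    assert r >= s0 and c >= s0
    left = (c - s0) // 2               # columns to drop on the left side
    return image[:s0, left:left + s0]

toy = np.arange(4 * 7).reshape(4, 7)   # a 4 x 7 "image" holding 0, 1, ..., 27
cropped = crop_top_center(toy, 3)
print(cropped)
# [[ 2  3  4]
#  [ 9 10 11]
#  [16 17 18]]
```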

In [None]:
def recenter (image, min_dim):
    # min_dim == $s_0$ in the instructions
    r, c = image.shape
    
    # Compute four variables, `top`, `left`, `bot`,
    # and `right` so that the `return` statement
    # returns the recentered image.
    
    # YOUR CODE HERE
    raise NotImplementedError()
    
    return image[top:bot, left:right]

# Quick test
min_dim = min (min_rows, min_cols)
image0 = original_images[0]

print ("Recentering: Before = {} x {} pixels; after = {} x {} pixels.".format (image0.shape[0],
                                                                               image0.shape[1],
                                                                               min_dim,
                                                                               min_dim))
image0_recentered = recenter (image0, min_dim)

fig, axs = plt.subplots (1, 2, figsize=(10, 5))
imshow_gray (image0, ax=axs[0])
imshow_gray (image0_recentered, ax=axs[1])

In [None]:
# Re-center images to a common size
min_dim = min (min_rows, min_cols)
images_recentered = np.zeros ((len (original_images), min_dim, min_dim))
for (k, image) in enumerate (original_images):
    images_recentered[k, :, :] = recenter (image, min_dim)

**Exercise 2** (2 points). Compute an "average" image, taken as the elementwise (pixelwise) mean over all images. Store the result in a `min_dim` $\times$ `min_dim` Numpy array called `mean_image`.

In [None]:
# Store your result in a variable called `mean_image`
# YOUR CODE HERE
raise NotImplementedError()
    
# How would you describe this "average" person?
imshow_gray (mean_image)
gnp2im (mean_image)

**Exercise 3** (2 points). Recall that PCA requires centered points. Let's do that by subtracting the mean image from every image. Use the recentered images computed above (`images_recentered`) and store the result in a new array, `images`.
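
If Numpy broadcasting is unfamiliar, the following toy sketch shows how a pixelwise mean and the subsequent centering behave on a stack of two $2 \times 2$ "images". The names `stack`, `mean_demo`, and `centered` are made up for this illustration and are not the variables the exercises ask for.

```python
import numpy as np

# A toy stack of two 2 x 2 "images", indexed as stack[k, row, col].
stack = np.array([[[1., 2.], [3., 4.]],
                  [[3., 6.], [5., 8.]]])

mean_demo = stack.mean(axis=0)   # pixelwise mean over the image axis, shape (2, 2)
centered = stack - mean_demo     # the (2, 2) mean broadcasts over the leading axis

print(mean_demo)
# [[2. 4.]
#  [4. 6.]]
```

After centering, the pixelwise sum over the stack is zero (up to rounding), which is the property the assertion cell for this exercise checks.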

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
f, axs = plt.subplots (1, 4, figsize=(10, 40))
imshow_gray (images[0, :, :] + mean_image, ax=axs[0])
imshow_gray (images[0, :, :], ax=axs[1]) # Compare this to the original.
imshow_gray (images[-1, :, :] + mean_image, ax=axs[2])
imshow_gray (images[-1, :, :], ax=axs[3]) # Compare this to the original.
print ("Why, if it isn't the lovely/handsome pair, {} and {}!".format (names[0], names[-1]))

In [None]:
max_abs_sum = np.max (np.abs ((np.sum (images, axis=0))))
max_abs_sum_bound = np.finfo (float).eps * (len (images) ** 2) * np.max (images)
print (max_abs_sum, "<=", max_abs_sum_bound, "?")
assert max_abs_sum <= max_abs_sum_bound

## From image set to a data matrix and back again

For PCA, you need a data matrix. Here is some code to convert our 3-D array of images into a 2-D data matrix, where we "flatten" each image into a 1-D vector by a simple `reshape` operation.

In [None]:
# Create m x d data matrix
m = len (images)
d = min_dim * min_dim
X = np.reshape (images, (m, d))

In [None]:
# To get back to an image, just reshape it again
imshow_gray (np.reshape (X[len (X) // 2, :], (min_dim, min_dim)))
print (names[len (X) // 2])

## Applying PCA

**Exercise 4** (2 points). Compute the SVD of `X`. Store the result in three arrays, `U`, `Sigma`, and `VT`, where `U` holds $U$, `Sigma` holds just the diagonal entries of $\Sigma$, and `VT` holds $V^T$.
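
As a reminder of how Numpy reports an SVD, here is a sketch on a small random matrix with fewer rows than columns, which is also the situation for the image data (fewer images than pixels). The matrix `A` and the names `U_a`, `s_a`, and `VT_a` are hypothetical. The example assumes the "thin" form requested by `full_matrices=False`.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 8))            # toy matrix with 3 rows and 8 columns

# Thin SVD: the singular values come back as a 1-D array, not a diagonal matrix.
U_a, s_a, VT_a = np.linalg.svd(A, full_matrices=False)
print(U_a.shape, s_a.shape, VT_a.shape)    # (3, 3) (3,) (3, 8)

# Multiplying the factors back together recovers A up to rounding error.
A_rebuilt = U_a @ np.diag(s_a) @ VT_a
print(np.allclose(A, A_rebuilt))           # True
```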

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
# Sanity check on dimensions
print ("X:", X.shape)
print ("U:", U.shape)
print ("Sigma:", Sigma.shape)
print ("V^T:", VT.shape)

assert X.shape == (len (images), min_dim * min_dim)
assert U.shape == (len (images), len (images))
assert Sigma.shape == (len (images),)
assert VT.shape == (len (images), min_dim * min_dim)

The following code tabulates and plots the singular values stored in `Sigma`. The collection of $\sigma_i$ values is also referred to as the _spectrum_ of the matrix.
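
One reason the spectrum matters: the Frobenius-norm error of a rank-$k$ truncated SVD equals the square root of the sum of the squared singular values that were dropped, $\|X - X_k\|_F = \sqrt{\sum_{i \geq k} \sigma_i^2}$. This is the quantity that the `err_i` column of `peek_Sigma` tracks. The sketch below checks the identity numerically on a random matrix; `B`, `B_k`, and the other names are made up for the illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(6, 10))
U_b, s_b, VT_b = np.linalg.svd(B, full_matrices=False)

k = 2
# Rank-k approximation built from the k largest singular triplets.
B_k = U_b[:, :k] @ np.diag(s_b[:k]) @ VT_b[:k, :]

err_direct = np.linalg.norm(B - B_k)            # Frobenius norm for 2-D input
err_spectrum = np.sqrt(np.sum(s_b[k:] ** 2))    # tail of the spectrum

print(np.isclose(err_direct, err_spectrum))     # True
```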

In [None]:
def peek_Sigma (Sigma, ret_df=False):
    k = len (Sigma)
    df_Sigma = pd.DataFrame ()
    df_Sigma['i'] = np.arange (k)
    df_Sigma['sigma_i'] = Sigma
    Sigma_sq = np.power (Sigma, 2)
    Err_sq = np.sum (Sigma_sq) - np.cumsum (Sigma_sq)
    Err_sq[Err_sq < 0] = 0
    Err = np.sqrt (Err_sq)
    Relerr = Err / (Sigma[0] + Err[0])
    df_Sigma['sigma_i^2'] = Sigma_sq
    df_Sigma['err_i^2'] = Err_sq
    df_Sigma['err_i'] = Err
    df_Sigma['relerr_i'] = Relerr
    print ("Singular values:")
    display (df_Sigma.head ())
    print ("  ...")
    display (df_Sigma.tail ())
    
    f, ax = plt.subplots (figsize=(7, 7))
    #ax.set (yscale="log")
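    # Keyword arguments work across seaborn versions; newer releases reject positional data arguments.
    sns.regplot (x="i", y="sigma_i", data=df_Sigma, ax=ax, fit_reg=False)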
    if ret_df:
        return df_Sigma

In [None]:
df_Sigma = peek_Sigma (Sigma, ret_df=True)

# Adds a red line to the plot: y ~ sigma_0 / sqrt(i+1)
plt.plot (df_Sigma['i'], df_Sigma['sigma_i'][0]*np.power (df_Sigma['i']+1, -0.5),
          color="red", linewidth=1)

**Exercise 5** (2 points). Does the spectrum of these data decay quickly or slowly? How should that affect your choice of $k$, if you are considering a $k$-truncated SVD?

YOUR ANSWER HERE

**Exercise 6** (2 points). Plot the first few principal components as images. What do they appear to capture?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

**Exercise 7** (2 points). Write some code to compute a new matrix `Y`, which is the original data matrix projected onto the first `num_components` principal components.

> You can use the code cell below, which calls `make_scatter2d_images`, to create an interactive plot of your projection. Does it reveal any interesting groupings?
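
As a hint about what "projecting" means here, the sketch below projects a small centered random matrix onto its first two principal components. Multiplying by the leading right singular vectors gives the same coordinates as scaling the leading left singular vectors by their singular values. The names `C`, `U_c`, `s_c`, `VT_c`, and `Y_demo` are hypothetical and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(3)
C = rng.normal(size=(5, 7))
C -= C.mean(axis=0)                    # center the columns first

U_c, s_c, VT_c = np.linalg.svd(C, full_matrices=False)

k = 2
Y_demo = C @ VT_c[:k, :].T             # coordinates along the first k components
print(Y_demo.shape)                    # (5, 2)

# Equivalent form: the first k columns of U, each scaled by its singular value.
print(np.allclose(Y_demo, U_c[:, :k] * s_c[:k]))    # True
```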

In [None]:
num_components = 2 # Number of principal components

# Define `Y`:
# YOUR CODE HERE
raise NotImplementedError()

assert Y.shape == (len (X), num_components)

p = make_scatter2d_images (Y[:, 0], Y[:, 1],
                           names=names,
                           image_files=thumbnails)
show (p)

**Exercise 8** (2 points). Run $k$-means on the projected data, `Y[:m, :num_components]`, to try to identify up to `num_clusters` clusters. Store the cluster centers in an array `centers[:num_clusters, :2]` and the cluster labels in an array `clustering[:m]`.

> You may use Scipy's `kmeans()` routine.
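
For reference, here is a sketch of how Scipy's `kmeans` and `vq` fit together, run on two synthetic, well-separated blobs rather than on the projected image data. The names `pts`, `centers_demo`, and `labels_demo` are made up for the illustration. Scipy's documentation suggests rescaling features (for example with `whiten`) before clustering; this toy example skips that step.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq

rng = np.random.default_rng(4)
# Two synthetic blobs of 20 points each, centered at (-5, 0) and (5, 0).
pts = np.vstack([rng.normal(loc=(-5.0, 0.0), scale=0.5, size=(20, 2)),
                 rng.normal(loc=(5.0, 0.0), scale=0.5, size=(20, 2))])

# kmeans returns the cluster centers (the "codebook") and the mean distortion.
centers_demo, distortion = kmeans(pts, 2)

# vq assigns each point to its nearest center and reports the distances.
labels_demo, dists = vq(pts, centers_demo)

print(centers_demo.shape, labels_demo.shape)    # (2, 2) (40,)
```

Note that `kmeans` starts from randomly chosen centers, so the order of the rows of `centers_demo` (and therefore the label values) can differ from run to run.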

In [None]:
num_clusters = 4

# YOUR CODE HERE
raise NotImplementedError()

print (centers)

p = make_scatter2d_images (Y[:, 0], Y[:, 1],
                           names=names,
                           image_files=thumbnails,
                           clustering=clustering)
show (p)

In [None]:
df_kcurve = pd.DataFrame (columns=['k', 'distortion']) 
for i in range(1,10):
    _, distortion = kmeans (Y, i)
    df_kcurve.loc[i] = [i, distortion]
df_kcurve.plot(x="k", y="distortion")

## References

Today's notebook uses a bunch of library modules and coding tricks; if you want to learn more, see these references.

_Image manipulation_
* Working with TIFFs: http://stackoverflow.com/questions/7569553/working-with-tiffs-import-export-in-python-using-numpy
* Displaying PIL images inline: http://stackoverflow.com/questions/26649716/how-to-show-pil-image-in-ipython-notebook
* Convert to grayscale: http://stackoverflow.com/questions/12201577/how-can-i-convert-an-rgb-image-into-grayscale-in-python

_PCA in Python_
* http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html