## PCA in a more complex setting

The goal of Principal Components Analysis is to exploit "suspicious coincidences" in one's data. Here, I will take you through PCA on cat faces to show how a complex item (a cat's face) may be recognized by a very small number of components.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

Run the cell below to load in the cat face data. This is a set of ~9000 64x64 pixel cat faces. 

In [None]:
allIms = np.load('/home/data/eigencatData/cats.npy')
print(allIms.shape)

Consider: Each image is 64 by 64 pixels in size, and each pixel can take on one of 256 shades of gray. How many dimensions are in these data?

Let's first get warmed up by showing an animation of the first 100 cat faces. It gives you a sense of the variability, and they are super cute!

In [None]:
# show the first 100 faces
%matplotlib notebook

# start plot
fig = plt.figure()
ax = fig.add_subplot(111)
plt.ion()
fig.show()
fig.canvas.draw()

# show the first 100 cat faces, redrawing the same axes each time
for i in range(100):
    ax.clear()
    thisFace = allIms[i, :, :]
    ax.imshow(thisFace, cmap="gray")
    fig.canvas.draw()
    plt.pause(.25)

As you saw with the simplePCA notebook, the first step in processing is to subtract the average from the data. What does the average cat face look like? Run the cell below to create a running average of the first 250 faces.

In [None]:
# pre-allocate memory for the mean face
thisMean = np.zeros((64, 64))

# start plot
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
plt.ion()
fig.show()
fig.canvas.draw()
pauseTime = 1

for i in range(250):
    pauseTime = pauseTime/2
    thisFace = allIms[i, :, :]
    # incremental update, so thisMean is the true mean of faces 0..i
    thisMean += (thisFace - thisMean) / (i + 1)
    ax1.clear()
    ax2.clear()
    ax1.imshow(thisFace, cmap="gray")
    ax1.set_title("Image Number: {}".format(i)) 
    ax2.imshow(thisMean, cmap="gray")
    fig.canvas.draw()
    plt.pause(pauseTime)
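
If you want to convince yourself that a running average can be built one face at a time, here is a minimal sketch on random stand-in data. The array fake_ims is hypothetical (it is not the cat set), and the incremental update shown is just one common way to do this.

```python
import numpy as np

# Synthetic stand-in for the image stack (hypothetical data, not the cat set)
rng = np.random.default_rng(0)
fake_ims = rng.integers(0, 256, size=(250, 64, 64)).astype(float)

running_mean = np.zeros((64, 64))
for i, face in enumerate(fake_ims):
    # after this update, running_mean is the mean of faces 0..i
    running_mean += (face - running_mean) / (i + 1)

# Compare against the mean computed in one shot across the image axis
batch_mean = fake_ims.mean(axis=0)
print(np.allclose(running_mean, batch_mean))
```

The update nudges the current estimate toward each new face by a step of 1/(i + 1), so no separate running sum is needed.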

Let's now compute the average of all 8942 faces (hint: np.mean() is your friend; remember to use an axis argument!).

In [None]:
# compute the average cat face
meanCat = # your code here

plt.figure()
plt.imshow(meanCat, cmap="gray")

This is a very interesting result in and of itself - even though no effort was taken to align the cat faces, the faces are all so self-similar that the average itself looks like a cat face! This suggests a considerable amount of redundancy that can be exploited by PCA.
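
To see what the axis argument does when averaging a stack of images, here is a small sketch on made-up data. The array toy_ims is a hypothetical stand-in for the image stack, and the centering step at the end is the "subtract the average" step mentioned earlier.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stack of 5 tiny 4x4 "images" standing in for the real data
toy_ims = rng.random((5, 4, 4))

# axis=0 averages across images, leaving one value per pixel
toy_mean = np.mean(toy_ims, axis=0)
print(toy_mean.shape)  # (4, 4)

# Broadcasting subtracts the mean image from every image in the stack
toy_centered = toy_ims - toy_mean
print(np.allclose(toy_centered.mean(axis=0), 0))
```

Averaging over a different axis would instead collapse rows or columns within each image, which is not what we want here.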

To compute PCA on these images, we need to reshape our array into an 8942-image by 4096-pixel matrix, so that each row is one flattened cat face.

In [None]:
allIms = np.reshape(allIms, [8942, 4096])
print(allIms.shape)
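
If the reshape feels like magic, this toy sketch shows what happens to a single image. The names toy_ims and toy_flat are hypothetical, and the -1 shortcut is simply an alternative to typing the pixel count by hand.

```python
import numpy as np

# Hypothetical stack of 3 images, each 4x4 pixels
toy_ims = np.arange(3 * 4 * 4).reshape(3, 4, 4)

# -1 lets NumPy infer the number of pixel columns (4 * 4 = 16)
toy_flat = toy_ims.reshape(toy_ims.shape[0], -1)
print(toy_flat.shape)  # (3, 16)

# Row k of the flat matrix is image k read row by row,
# so reshaping that row back recovers the original image
toy_back = toy_flat[1].reshape(4, 4)
print(np.array_equal(toy_back, toy_ims[1]))
```

This is why the later cells can call reshape(64, 64) on a row or a component to display it as an image again.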

We will be using the PCA implementation from the sklearn library. Put your data matrix into the parentheses in the first line of the cell below. This line will compute the principal components. The plot will show the cumulative variance explained as a function of the number of components.

In [None]:
pca = PCA().fit(#data matrix)

plt.figure()
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

Really interesting, right? Even though our data have over 4000 dimensions, we can explain nearly all of the variation with fewer than 25% of those dimensions!
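
Rather than reading the number off the plot by eye, you can also ask for it directly. Below is a minimal sketch on synthetic low-rank data; toy_data and the 0.95 cutoff are assumptions chosen for illustration, not values from the cat data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Hypothetical low-rank data: 200 samples, 50 features driven by 5 latent factors
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
toy_data = latent @ mixing + 0.01 * rng.normal(size=(200, 50))

toy_pca = PCA().fit(toy_data)
cum_var = np.cumsum(toy_pca.explained_variance_ratio_)

# First index where cumulative variance reaches the threshold, plus one
threshold = 0.95  # an arbitrary illustrative cutoff
n_needed = int(np.searchsorted(cum_var, threshold) + 1)
print(n_needed, "of", toy_data.shape[1], "components reach", threshold)
```

The same two lines (cumsum, then searchsorted) should work on the fitted cat-face model if you swap in its explained_variance_ratio_.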

But what do these components themselves look like? Fill in the code below to compute PCA with the first 200 components. The plot will reveal the first 30 principal components or "eigencats". You can think about these as building blocks - every individual cat can be reconstructed as some linear combination of these eigencats.


In [None]:
# Fit the PCA model with 200 components
pca = PCA(n_components = XXX) # fill in 200 as the number of components
pca.fit(#fill me in)

# plot the eigencats - no need to change anything else
fig, axes = plt.subplots(3, 10, figsize=(9, 4),
                         subplot_kw={'xticks':[], 'yticks':[]},
                         gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i, ax in enumerate(axes.flat):
    ax.imshow(pca.components_[i].reshape(64, 64), cmap='gray')
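
The "linear combination" idea can be checked numerically. This is a small sketch on random stand-in data (toy_data is hypothetical) showing that sklearn's reconstruction is the mean plus a weighted sum of the components, assuming the default PCA settings (no whitening).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
# Hypothetical stand-in for the flattened faces: 100 samples, 20 "pixels"
toy_data = rng.normal(size=(100, 20))

toy_pca = PCA(n_components=5).fit(toy_data)
scores = toy_pca.transform(toy_data)        # (100, 5): one weight per component
recon = toy_pca.inverse_transform(scores)   # (100, 20): back in pixel space

# Same reconstruction written out by hand: mean + weighted sum of components
manual = toy_pca.mean_ + scores @ toy_pca.components_
print(np.allclose(recon, manual))
```

Each row of scores holds the weights for one sample, and each row of components_ is one building block, which is exactly the role the eigencats play for the cat faces.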

How well can we reconstruct the cat faces with just 200 components? Fill in your data matrix wherever the cell below asks for it, then run the cell to find out. The first row shows the first 10 cat faces, and the row underneath shows their 200-component reconstructions.

In [None]:
pca = PCA(n_components = 200)
pca.fit(#datamat)
components = pca.transform(#datamat)
projected = pca.inverse_transform(components)

# Plot the results
fig, ax = plt.subplots(2, 10, figsize=(10, 2.5),
                       subplot_kw={'xticks':[], 'yticks':[]},
                       gridspec_kw=dict(hspace=0.1, wspace=0.1))
for i in range(10):
    ax[0, i].imshow(XXX[i].reshape(64, 64), cmap='gray') # fill in XXX with data matrix
    ax[1, i].imshow(projected[i].reshape(64, 64), cmap='gray')
    
ax[0, 0].set_ylabel('full-dim\ninput')
ax[1, 0].set_ylabel('200-dim\nreconstruction');

How few components can you get away with? Try the reconstruction again with different numbers of components. Do you and your teammates agree on the minimum number of required components?
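
If you would like a number to go with your visual impression, one option is to track the mean squared reconstruction error as the number of components grows. This is a hedged sketch on synthetic data: toy_data, the list of component counts, and the choice of mean squared error are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
# Hypothetical data: 300 samples, 40 features, 8 underlying factors plus noise
latent = rng.normal(size=(300, 8))
toy_data = latent @ rng.normal(size=(8, 40)) + 0.1 * rng.normal(size=(300, 40))

errors = {}
for k in (1, 2, 4, 8, 16):
    toy_pca = PCA(n_components=k).fit(toy_data)
    recon = toy_pca.inverse_transform(toy_pca.transform(toy_data))
    # mean squared difference between the original and its k-component reconstruction
    errors[k] = float(np.mean((toy_data - recon) ** 2))
    print(k, round(errors[k], 4))
```

The error can only shrink as components are added, so the interesting question is where it stops shrinking by an amount you can actually see in the images.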