# Dimensionality Reduction with PCA

## 1. MNIST Example

Let's project MNIST digits onto two dimensions using PCA and visualize the results.

In [None]:
# setup:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

# urlopen lives in different modules in Python 2 and 3; import whichever exists
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

def load(url):
    """Read a CSV from the web; return features X and labels y (the last column)."""
    response = urlopen(url)
    Xy = np.loadtxt(response, delimiter=',')
    y = Xy[:, -1]
    X = Xy[:, :-1]
    return X, y

trainX, trainy = load('http://cs.wellesley.edu/~sravana/ml/ps1/data/mnist1100/training.txt')
print('Loaded training data {}'.format(trainX.shape))

# center the data (subtract the per-pixel training mean)
trainXmean = np.mean(trainX, axis=0)
trainX -= trainXmean

In [None]:
# run PCA
from sklearn.decomposition import PCA
dimreduce = PCA(n_components=2)
reducedTrainX = dimreduce.fit_transform(trainX)  # produces n by 2 matrix, where n = num of data points
print('Projected data onto 2 dimensions')

# plot every 50th digit
reducedTrainX_every50 = reducedTrainX[0::50, :]
plt.figure(figsize=(15, 15))
plt.scatter(reducedTrainX_every50[:, 0], reducedTrainX_every50[:, 1])
for i in range(0, reducedTrainX.shape[0], 50):
    # label each plotted point with its digit class
    plt.annotate(str(int(trainy[i])), (reducedTrainX[i, 0], reducedTrainX[i, 1]), size=20)
plt.show()
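
For intuition about what `fit_transform` computes, here is a minimal NumPy sketch of PCA via the singular value decomposition. It is not scikit-learn's implementation: `pca_project` and the synthetic `X_demo` are made up for illustration, and the projection should only be expected to agree with `PCA` up to the sign of each component.

```python
import numpy as np

def pca_project(X, n_components):
    """Project the rows of X onto the top principal directions (illustrative helper)."""
    mean = X.mean(axis=0)
    Xc = X - mean  # center each feature
    # Rows of Vt are unit-length principal directions, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]  # shape (n_components, n_features)
    return Xc.dot(components.T), components, mean

# Synthetic stand-in for the digit matrix (an assumption): 200 points, 10 features
rng = np.random.RandomState(0)
X_demo = rng.randn(200, 10) * np.linspace(5.0, 0.5, 10)  # features with decreasing spread
Z, components, mean = pca_project(X_demo, 2)
print(Z.shape)  # (200, 2)
```

The first column of `Z` carries the most variance and the second the next most, which is why a scatter plot of the two columns is a reasonable 2-d summary.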

### Classification on the reduced space

How does a kNN classifier perform on this reduced representation?

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knnmodel = KNeighborsClassifier(n_neighbors=3)

testX, testy = load('http://cs.wellesley.edu/~sravana/ml/ps1/data/mnist1100/testing.txt')
print('Loaded test data {}'.format(testX.shape))

# center the data using the training mean
testX -= trainXmean

# baseline: kNN with no dimensionality reduction
knnmodel.fit(trainX, trainy)
print('Baseline accuracy: {}'.format(knnmodel.score(testX, testy)))
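
As a rough picture of what a k-nearest-neighbors classifier does, the sketch below predicts each query point's label by majority vote among its `k` closest training points and reports accuracy as the fraction of correct predictions. The helper `knn_predict` and the toy data are hypothetical, and `KNeighborsClassifier` may break ties and search for neighbors differently.

```python
import numpy as np

def knn_predict(train_X, train_y, query_X, k=3):
    """Label each query row by majority vote among its k nearest training rows."""
    predictions = []
    for q in query_X:
        dists = np.sqrt(((train_X - q) ** 2).sum(axis=1))  # Euclidean distance to every training row
        nearest = np.argsort(dists)[:k]                     # indices of the k closest rows
        labels, counts = np.unique(train_y[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])       # ties go to the smallest label
    return np.array(predictions)

# Toy data (an assumption): two well-separated clusters labeled 0 and 1
toy_X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
toy_y = np.array([0, 0, 0, 1, 1, 1])
queries = np.array([[0.05, 0.1], [5.0, 5.1]])
true_labels = np.array([0, 1])

preds = knn_predict(toy_X, toy_y, queries, k=3)
accuracy = np.mean(preds == true_labels)  # fraction of correct predictions
print(preds)     # [0 1]
print(accuracy)  # 1.0
```

Because every prediction needs a distance to every training point, the cost grows with the number of features, which is one practical reason to try a lower-dimensional representation.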

In [None]:
# project test data on new space
reducedTestX = dimreduce.transform(testX)

# kNN with 2-d dimensionality reduction
knnmodel.fit(reducedTrainX, trainy)
print('Accuracy for 2-d PCA: {}'.format(knnmodel.score(reducedTestX, testy)))

Okay, so 2-d wasn't great: two components likely throw away too much of the information in the original pixels. (2-d is great for visualizations, though.)

Let's try 100 dimensions. This is more than 2, but way less than the original 784.
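
One way to judge how much information a given number of components keeps is the cumulative fraction of variance explained. The sketch below computes it from the singular values of centered data; `cumulative_explained_variance`, the synthetic `X_demo`, and the 90% threshold are illustrative assumptions, not part of the original notebook.

```python
import numpy as np

def cumulative_explained_variance(X):
    """Fraction of total variance captured by the first k principal components, for each k."""
    Xc = X - X.mean(axis=0)
    S = np.linalg.svd(Xc, compute_uv=False)  # singular values, largest first
    var = S ** 2                             # proportional to the variance along each component
    return np.cumsum(var) / var.sum()

# Synthetic data (an assumption): 300 points, 20 features with decreasing spread
rng = np.random.RandomState(0)
X_demo = rng.randn(300, 20) * np.linspace(4.0, 0.2, 20)

cumvar = cumulative_explained_variance(X_demo)
# Smallest number of components whose cumulative fraction reaches 0.90
k90 = int(np.searchsorted(cumvar, 0.90)) + 1
print(k90, cumvar[k90 - 1])
```

For a fitted scikit-learn `PCA` object, the `explained_variance_ratio_` attribute should give the per-component fractions, so `np.cumsum` over it yields a comparable curve for the digit data.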

In [None]:
dimreduce100 = PCA(n_components=100)
reduced100TrainX = dimreduce100.fit_transform(trainX)  # produces n by 100 matrix, where n = num of data points
reduced100TestX = dimreduce100.transform(testX)
print('Projected data onto 100 dimensions')

# kNN with 100-d dimensionality reduction
knnmodel.fit(reduced100TrainX, trainy)
print('Accuracy for 100-d PCA: {}'.format(knnmodel.score(reduced100TestX, testy)))

The 100-d projection performs better than the baseline while using roughly an eighth of the original 784 dimensions. Success!
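
To see how accuracy depends on the number of components, the same pipeline can be run for several values. The sketch below does this with NumPy only on synthetic data, so its numbers say nothing about MNIST; `fit_pca`, `knn_accuracy`, `make_split`, and the candidate component counts are all hypothetical. Note that the mean and principal directions are fit on the training split and reused for the test split, as in the cells above.

```python
import numpy as np

def fit_pca(X, k):
    """Return the training mean and the top-k principal directions of X."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:k]

def knn_accuracy(train_X, train_y, test_X, test_y, k=3):
    """Accuracy of a majority-vote kNN classifier with integer labels."""
    # Squared distances between every test row and every training row
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = train_y[nearest]  # shape (n_test, k)
    preds = np.array([np.bincount(row).argmax() for row in votes])
    return np.mean(preds == test_y)

# Synthetic 3-class data in 30 dimensions (an assumption, not MNIST)
rng = np.random.RandomState(0)
centers = rng.randn(3, 30) * 2.0  # one center per class

def make_split(n):
    y = rng.randint(0, 3, size=n)
    X = centers[y] + rng.randn(n, 30)
    return X, y

train_X, train_y = make_split(300)
test_X, test_y = make_split(100)

results = {}
for n_components in (1, 2, 5, 30):
    mean, components = fit_pca(train_X, n_components)  # fit on training data only
    Z_train = (train_X - mean).dot(components.T)
    Z_test = (test_X - mean).dot(components.T)         # reuse training mean and directions
    results[n_components] = knn_accuracy(Z_train, train_y, Z_test, test_y)

for n_components in sorted(results):
    print(n_components, results[n_components])
```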