# PCA demo

We are going to use scikit-learn to download MNIST data, apply PCA to compress it, and visualize the original and compressed images.

This demo was adapted from the tutorial here: https://github.com/mGalarnyk/Python_Tutorials/blob/master/Sklearn/PCA/PCA_Image_Reconstruction_and_such.ipynb

## MNIST dataset

MNIST consists of 70,000 grayscale images of the handwritten digits 0-9. Each image is 28x28 pixels, and each pixel value ranges from 0 (black) to 255 (white).

## Declare imports

In [41]:
# Enable plotting inline in the Jupyter notebook
%matplotlib inline 
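# Used to download MNIST data. Note: fetch_mldata was removed in scikit-learn 0.22;
# on newer versions, fetch_openml is the replacement.
from sklearn.datasets import fetch_mldata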
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import numpy as np
import matplotlib.pyplot as plt

## Download MNIST data

In [42]:
mnist = fetch_mldata('MNIST original')

In [43]:
mnist # Print string representation of the object

{'COL_NAMES': ['label', 'data'],
 'DESCR': 'mldata.org dataset: mnist-original',
 'data': array([[0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        ...,
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0],
        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
 'target': array([0., 0., 0., ..., 9., 9., 9.])}
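On scikit-learn versions where `fetch_mldata` is no longer available, the same dataset can usually be obtained from OpenML. The sketch below is one possible way to do that, not the method used in this notebook: the helper name `load_mnist_openml` is hypothetical, the call needs network access, and it assumes scikit-learn 0.22 or later (for the `as_frame` parameter). The returned object has different keys than the output above, and the sample order may differ.

```python
import numpy as np


def load_mnist_openml():
    """Hypothetical helper: fetch MNIST from OpenML via scikit-learn."""
    # Imported inside the function because the call needs scikit-learn
    # and network access; nothing is downloaded until the helper is called.
    from sklearn.datasets import fetch_openml

    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    data = mnist.data.astype(np.uint8)       # pixel values 0-255, one row per image
    target = mnist.target.astype(np.int64)   # OpenML delivers the labels as strings
    return data, target
```

Calling `data, target = load_mnist_openml()` would then give arrays that can stand in for `mnist.data` and `mnist.target` in the cells that follow.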

In [44]:
mnist.data.shape # Shape should be (70000, 784): each 28x28 image is flattened into a 784-element vector

(70000, 784)
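To make the flattening concrete, here is a minimal sketch using a made-up 28x28 array (not an actual MNIST image). It assumes row-by-row flattening, which is what the `reshape` call used for plotting later in this notebook relies on.

```python
import numpy as np

# Hypothetical 28x28 "image": entry (r, c) holds r * 28 + c, so positions are easy to track.
image = np.arange(28 * 28).reshape(28, 28)

# Flatten row by row into a single 784-element vector, like one row of mnist.data.
flat = image.reshape(-1)

# Reshape back to 28x28 for display; the round trip loses nothing.
restored = flat.reshape(28, 28)

print(flat.shape)                        # (784,)
print(np.array_equal(image, restored))   # True
```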

In [45]:
original_num_components = mnist.data.shape[1] 

## Normalize data

In [46]:
scaler = StandardScaler()
normalized_data = scaler.fit_transform(mnist.data)
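As a rough illustration of what `StandardScaler` computes, the sketch below standardizes a small made-up array by hand with NumPy: subtract each column's mean and divide by its population standard deviation. Columns with zero variance are divided by 1 instead, which mirrors how scikit-learn avoids division by zero; this matters for image data, where some pixels can be constant across all samples. The variable names are illustrative only.

```python
import numpy as np

# Toy data: 4 samples, 3 features; the last feature is constant.
X = np.array([[0., 10., 5.],
              [2., 20., 5.],
              [4., 30., 5.],
              [6., 40., 5.]])

mean = X.mean(axis=0)
std = X.std(axis=0)                      # population standard deviation (ddof=0)
std_safe = np.where(std == 0, 1.0, std)  # avoid dividing by zero for constant columns

Z = (X - mean) / std_safe                # forward step: zero mean, unit variance
X_back = Z * std_safe + mean             # inverse step: recovers the original values

print(Z.mean(axis=0))
print(np.allclose(X_back, X))
```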

## Instantiate PCA

In [47]:
variance = 0.95
pca = PCA(n_components=variance, svd_solver='full') # A float in (0, 1) keeps the fewest components whose cumulative explained variance exceeds that fraction
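The selection rule behind a fractional `n_components` can be sketched with plain NumPy. This is an illustration on synthetic data using an SVD directly, not scikit-learn's own code path: it computes the explained variance ratio of every component, accumulates the ratios, and picks the smallest number of components whose cumulative ratio exceeds the target. The data and names here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples, 10 features with decreasing spread.
scales = np.array([10, 8, 6, 4, 2, 1, 0.5, 0.2, 0.1, 0.05])
X = rng.normal(size=(200, 10)) * scales

# Center the data; the squared singular values are proportional to component variances.
Xc = X - X.mean(axis=0)
singular_values = np.linalg.svd(Xc, compute_uv=False)
explained_variance = singular_values ** 2 / (X.shape[0] - 1)
ratio = explained_variance / explained_variance.sum()
cumulative = np.cumsum(ratio)

# Smallest k whose cumulative explained variance exceeds the target fraction.
target = 0.95
k = int(np.searchsorted(cumulative, target, side="right")) + 1

print(k, cumulative[k - 1])
```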

## Apply PCA on MNIST data

In [48]:
# The input is normalized_data, which StandardScaler already converted to zero mean and unit variance

In [49]:
lower_dimensional_data = pca.fit_transform(normalized_data) 

In [50]:
lower_dimensional_data.shape # Print shape of the compressed data

(70000, 332)

In [51]:
compressed_num_components = lower_dimensional_data.shape[1]
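For a sense of scale, the short calculation below compares the per-image size before and after compression, using the two shapes printed in this notebook. It counts only the stored values per image; the principal component matrix and the scaler's statistics also have to be kept to reconstruct images, which this sketch ignores.

```python
original_num_components = 784     # from mnist.data.shape[1]
compressed_num_components = 332   # from lower_dimensional_data.shape[1]

# Fraction of the original per-image values that remain after PCA.
ratio = compressed_num_components / original_num_components
print(f"{ratio:.1%} of the original values per image")   # 42.3% of the original values per image
```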

## Reconstruct the data using the principal components

In [52]:
reconstructed_data = scaler.inverse_transform(pca.inverse_transform(lower_dimensional_data))

In [53]:
reconstructed_data.shape

(70000, 784)
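The reconstruction undoes the two transforms in reverse order: first map the PCA scores back to the standardized feature space, then undo the standardization. The sketch below shows that round trip on synthetic data with NumPy's SVD standing in for scikit-learn's PCA. The data, the `reconstruct` helper, and the choice of mean squared error as the quality measure are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 100 samples, 6 correlated features driven by 2 latent factors plus noise.
latent = rng.normal(size=(100, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing * 40.0 + 128.0 + rng.normal(scale=2.0, size=(100, 6))

# Step 1: standardize each column (the role StandardScaler plays above).
mean, std = X.mean(axis=0), X.std(axis=0)
Z = (X - mean) / std

# Step 2: principal directions of the standardized data (rows of Vt).
_, _, Vt = np.linalg.svd(Z, full_matrices=False)


def reconstruct(k):
    """Project onto the top-k directions, then invert both transforms."""
    components = Vt[:k]
    scores = Z @ components.T        # compressed representation, shape (100, k)
    Z_hat = scores @ components      # inverse of the PCA projection
    return Z_hat * std + mean        # inverse of the standardization


# Mean squared reconstruction error for a few choices of k.
errors = {k: float(np.mean((X - reconstruct(k)) ** 2)) for k in (2, 4, 6)}
print(errors)
```

Keeping all 6 directions reproduces the input up to floating-point error, while keeping fewer gives an approximation of the same shape.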

## Compare original image with reconstructed image

In [None]:
data_index = 1
IMAGE_SHAPE = (28, 28)
GRAYSCALE_RANGE = (0, 255)
plt.figure(figsize=(8,4));

# Original Image
plt.subplot(1, 2, 1);
plt.imshow(mnist.data[data_index].reshape(IMAGE_SHAPE),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=GRAYSCALE_RANGE);
plt.xlabel(str(original_num_components) + ' components', fontsize = 14)
plt.title('Original Image', fontsize = 20);

# Image reconstructed from the retained principal components (332 in this run)
plt.subplot(1, 2, 2);
plt.imshow(reconstructed_data[data_index].reshape(IMAGE_SHAPE),
              cmap = plt.cm.gray, interpolation='nearest',
              clim=GRAYSCALE_RANGE);
plt.xlabel(str(compressed_num_components) + ' components', fontsize = 14)
plt.title('95% of Explained Variance', fontsize = 20);