# PCA overview

We will first run a PCA on the 4096 extracted features, which are represented as a 64x64 image. For each ward, the images are compiled into a sample matrix, the mean is subtracted, and a PCA is performed on the resulting matrix.
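The per-ward procedure described above could look roughly like the sketch below. The array `ward_images` is a hypothetical stand-in filled with random numbers (our real feature set is not available yet), and the choice of 10 components is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-in for one ward: 50 feature images of 64x64 each.
ward_images = rng.normal(size=(50, 64, 64))

# Flatten each 64x64 image into a row of 4096 features to form the sample matrix.
X = ward_images.reshape(len(ward_images), -1)

# Subtract the per-feature mean. scikit-learn's PCA also centres the data
# internally, so this step only mirrors the description above.
X_centered = X - X.mean(axis=0)

# Arbitrary number of components, chosen for illustration only.
ward_pca = PCA(n_components=10)
scores = ward_pca.fit_transform(X_centered)

print(X.shape, scores.shape)
```

Each row of `scores` is one image of the ward expressed in the reduced coordinate system.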

Tutorial URL: https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60

## Some libraries

In [17]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
%matplotlib inline

The following cell will be removed once we have our own feature set.

In [4]:
# Download the MNIST database of handwritten digits (784 features, 60,000 training images, 10,000 test images)
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
# mnist.data has shape 70000 x 784: 70,000 images with 784 features each
# Split the data into train and test sets (we won't do this eventually)
from sklearn.model_selection import train_test_split
# test_size: proportion of the original data used for the test set (1/7 of 70,000 = 10,000)
train_img, test_img, train_lbl, test_lbl = train_test_split(mnist.data, mnist.target, test_size=1/7.0, random_state=0)

## Standardize the data

In [8]:
# Scale the features so that none of them dominates the PCA
# Standardize each feature to mean 0 and variance 1
# The tutorial fits on the training set and transforms both the train and test sets (not sure what we will do eventually...)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Fit on training set only.
scaler.fit(train_img)
# Apply transform to both the training set and the test set.
train_img = scaler.transform(train_img)
test_img = scaler.transform(test_img)
# test_img
# train_img

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
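As a quick sanity check of the standardization step, the sketch below uses a small synthetic array (`demo`, not our data) to confirm that the fitted scaler yields mean 0 and unit variance on the training part. It also shows that a constant column stays at exactly 0 after scaling, which is one plausible reason for the zeros in the output above (MNIST border pixels are often constant).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in data: 200 samples, 4 features.
demo = rng.normal(loc=5.0, scale=3.0, size=(200, 4))
demo[:, 0] = 0.0  # a constant (zero-variance) column

demo_train, demo_test = demo[:150], demo[150:]

# Fit on the training part only, then transform both parts.
demo_scaler = StandardScaler().fit(demo_train)
demo_train_std = demo_scaler.transform(demo_train)
demo_test_std = demo_scaler.transform(demo_test)

# Per-feature mean and standard deviation of the scaled training data.
print(demo_train_std.mean(axis=0).round(6))
print(demo_train_std.std(axis=0).round(6))
```

The test part is scaled with the training statistics, so its mean and variance will only be approximately 0 and 1.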

## Import and apply PCA

In [11]:
from sklearn.decomposition import PCA
# Make an instance of the Model
pca = PCA(.95)  # keep enough components to explain 95% of the variance

# Again, the tutorial fits on the training data only
pca.fit(train_img)

PCA(n_components=0.95)

Find how many components the PCA chose; in this case it kept 327 components.

In [12]:
pca.n_components_

327
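To see where a number like this comes from, the sketch below reproduces the selection by hand on synthetic data (`demo` is made up for illustration): fit a full PCA, accumulate `explained_variance_ratio_`, and take the smallest number of components whose cumulative ratio exceeds 0.95.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data whose 20 features have steadily decreasing variance.
demo = rng.normal(size=(500, 20)) * np.linspace(5.0, 0.1, 20)

# Fit with all components and accumulate the explained variance ratios.
full_pca = PCA().fit(demo)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.95.
n_needed = int(np.searchsorted(cumulative, 0.95, side="right") + 1)

# Let scikit-learn pick the number of components from the same threshold.
auto_pca = PCA(n_components=0.95).fit(demo)

print(n_needed, auto_pca.n_components_)
```

Plotting `cumulative` against the component index is a common way to judge whether 95% is a sensible threshold for a given feature set.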

In [13]:
#apply mapping transform to both the training and test set

In [14]:
# Note: these are not the PCs themselves, but the projections (dot products) of the mean-centred feature values onto the PCs
train_img = pca.transform(train_img)
test_img = pca.transform(test_img)
# train_img

In [16]:
# Dimensions are now 60000 x 327 for the training images
train_img.shape

(60000, 327)
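The note about `transform` returning projections rather than the components themselves can be checked directly. The sketch below uses a small synthetic array (`demo`, an assumption for illustration) and the default `whiten=False`: the transformed values equal the mean-centred data multiplied by the transposed `components_` matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in data: 100 samples, 8 features.
demo = rng.normal(size=(100, 8))

demo_pca = PCA(n_components=3).fit(demo)
scores = demo_pca.transform(demo)

# Rows of components_ are the principal axes; mean_ is the per-feature mean
# learned during fit. Projecting the centred data reproduces transform().
manual = (demo - demo_pca.mean_) @ demo_pca.components_.T

print(np.allclose(scores, manual))
```

The principal components themselves are available as the rows of `components_`, with shape `(n_components, n_features)`.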