# Topics

## 1. PCA Applied to Face Images and "Whitening" 
## 2. PCA and SVM on Face Recognition
## 3. The Confusion Matrix

In [1]:
%matplotlib inline

''' Initial Imports'''

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

from sklearn.decomposition import PCA
from sklearn.svm import SVC


# Note: place the data set on Google Drive.
# For some reason, it takes a long time to download the data.

## To Download the Faces (233MB)

## (This takes a while and you should start now):

http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz 


In [3]:
"""
sklearn face data base

"""

from sklearn import datasets

###############################################################################
# Download the data, if not already on disk and load it as numpy arrays

lfw_people = datasets.fetch_lfw_people(min_faces_per_person=70, \
                                       resize=0.4)

# introspect the images arrays to find the shapes (for plotting)
n_samples, h, w = lfw_people.images.shape
print('height and width of images:', h, w)

# The images in X have been collapsed into a 1D array
# just like for the handwritten digits
X = lfw_people.data

# X.shape[0] tells you the number of images (faces);
# this is the same as n_samples above
# X.shape[1] gives the number of pixels for each image
# or, "features"

print('X.shape', X.shape)
n_features = X.shape[1]


# the label/target to predict is the id of the person -- y is an integer
y = lfw_people.target
# target_names are actually names
target_names = lfw_people.target_names
print('target_names.shape', target_names.shape)
print('target_names', target_names)

# n_classes gives the number of people 
# Different from the number of faces (n_samples)!!
n_classes = target_names.shape[0]

print("Total dataset size:")
print("n_samples (number of faces): {0}".format(n_samples))
# n_features = 1850, which is 50x37, the dimension of the images.
print("n_features (number of pixels): {0}".format(n_features))
print("n_classes (number of people): {0}".format(n_classes))


RuntimeError: Failed to read the image file /Users/Tao/scikit_learn_data/lfw_home/lfw_funneled/George_W_Bush/George_W_Bush_0358.jpg, Please make sure that libjpeg is installed

## "Whitening"

## Breakout Exercise:

- ## Do PCA on the first 500 images.  Use only 4 components
- ## Print out the PCA components (eigenvectors)
- ## Print out the PCA variances (eigenvalues)
- ## Calculate the variance along the 0th and 1st PCA axes yourself

## Turn on whitening in your PCA instantiation: 

          whiten = True
          
## Run the cell again -- can you tell what changed?

## Discussing the solution

- ### The importance of whiten = True
    
- ### To understand why this helps: 

    ### Imagine doing this with just 2 PCA components.  With whitening, in that 2D PCA space, the variance in each PCA direction is 1.  You then expect to see clusters that are similar in extent in either direction -- meaning they would appear more circular.  This makes it easier to draw boundaries between clusters.

- ### This is a helpful resource 

    ### To see what whitening does:

    ### From http://ufldl.stanford.edu/tutorial/unsupervised/PCAWhitening/

    ### Look under the Section "Whitening", subsection "2D example"
    
    ### You can plot the data in the 2D space of the first two PCA components, with and without whitening, to show the effect -- either for the digit data (not much difference) or the face data.
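The effect of `whiten = True` on the variances can be checked numerically. The block below is a minimal sketch, not the exercise solution for the face data: `X_demo` is a synthetic stand-in for `X[:500]` (an assumption, so the example runs without downloading anything), and the `results` dictionary is a name introduced only for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X[:500] (an assumption): 500 samples, 20 features
# whose spreads differ a lot, so the PCA variances are clearly unequal.
rng = np.random.default_rng(0)
scales = np.linspace(10.0, 0.5, 20)
X_demo = rng.normal(size=(500, 20)) * scales

results = {}
for whiten in (False, True):
    pca = PCA(n_components=4, whiten=whiten)
    X_proj = pca.fit_transform(X_demo)
    # Variance of the projected data along each PCA axis, computed by hand.
    # ddof=1 matches the normalization used for explained_variance_.
    var_by_hand = X_proj.var(axis=0, ddof=1)
    results[whiten] = (pca.explained_variance_, var_by_hand)
    print("whiten =", whiten)
    print("  explained_variance_:", np.round(pca.explained_variance_, 3))
    print("  variance by hand:   ", np.round(var_by_hand, 3))
```

Without whitening, the variance computed by hand should agree with `explained_variance_`; with whitening, the projected data are rescaled so the variance along every PCA axis is 1, while `explained_variance_` still reports the original eigenvalues.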


## Breakout Exercise:

### Write a function, plot_faces() [similar to plot_dig()] to plot and label the faces with their correct names. (You may consider expanding the functionality of plot_faces() a bit so that it can label each image with both the correct names and predicted names.)
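One possible shape for such a function is sketched below. The signature, the grid layout and the `true:` / `pred:` title format are assumptions chosen for illustration; they are not taken from plot_dig().

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_faces(images, true_names, pred_names=None, n_rows=2, n_cols=5):
    """Plot a grid of 2D grayscale images, titled with their names.

    images:     array of shape (n_images, h, w)
    true_names: sequence of strings, one per image
    pred_names: optional sequence of predicted names, one per image
    """
    # squeeze=False keeps axes a 2D array even for a single row or column.
    fig, axes = plt.subplots(n_rows, n_cols, squeeze=False,
                             figsize=(1.8 * n_cols, 2.4 * n_rows))
    for i, ax in enumerate(axes.ravel()):
        ax.set_xticks([])
        ax.set_yticks([])
        if i >= len(images):
            # More grid cells than images: leave the extra cells blank.
            ax.axis("off")
            continue
        ax.imshow(images[i], cmap="gray")
        title = "true: {0}".format(true_names[i])
        if pred_names is not None:
            title += "\npred: {0}".format(pred_names[i])
        ax.set_title(title, fontsize=8)
    fig.tight_layout()
    return fig, axes
```

With the variables from the loading cell, a call might look like `plot_faces(lfw_people.images, target_names[y])`, since indexing `target_names` with the integer labels `y` gives one name per image.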


## Breakout Exercise: Using this data set (7 people, each with at least 70 faces) to do face recognition

- ### First do training-testing split :

    split the sample (X) into two parts -- X_train and X_test.  For now let X_train be the first 1000 images, and X_test be the rest of the images (288).
    
    Perform PCA and SVM on X_train.  Then project X_test onto the PCA axes, and then use the trained SVM to predict for X_test.
    
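    This is referred to as a "hold-out" (train-test) split; "k-fold" cross-validation repeats such a split k times, with a different part of the data held out each time.  Along with "leave-one-out", this is another way to test how good your algorithm is before using it in the "real world".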

    (To connect with what we did for classifying digit images: in classify_dig_svm() from Week15-1, we did an (n-1, 1) split; here we do a train-test split of roughly 3:1; otherwise it's the same.  There the return value is y_pred[0], because there is only one element.  Here it should just be y_pred, since it's an array.)


- ### Find the success rate by comparing the prediction for X_test and the correct labels/targets for X_test.
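The steps above can be sketched as follows. This is a minimal illustration under stated assumptions: `X_demo` and `y_demo` are synthetic stand-ins for the face data (7 well-separated classes, 1288 samples), and the choices `n_components=10` and an RBF kernel are placeholders rather than recommended settings.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Synthetic stand-in for the face data (an assumption, so this runs offline):
# 1288 samples, 50 features, 7 classes with well-separated means.
rng = np.random.default_rng(0)
n_samples, n_features, n_classes = 1288, 50, 7
y_demo = rng.integers(0, n_classes, size=n_samples)
class_means = rng.normal(scale=3.0, size=(n_classes, n_features))
X_demo = class_means[y_demo] + rng.normal(size=(n_samples, n_features))

# Split: the first 1000 samples for training, the remaining 288 for testing.
n_train = 1000
X_train, X_test = X_demo[:n_train], X_demo[n_train:]
y_train, y_test = y_demo[:n_train], y_demo[n_train:]

# Fit PCA on the training set only, then project both sets onto the same axes.
pca = PCA(n_components=10, whiten=True).fit(X_train)
X_train_pca = pca.transform(X_train)
X_test_pca = pca.transform(X_test)

# Train the SVM in PCA space and predict for the whole test set at once.
clf = SVC(kernel="rbf").fit(X_train_pca, y_train)
y_pred = clf.predict(X_test_pca)

# Success rate: fraction of test samples whose prediction matches the label.
success_rate = np.mean(y_pred == y_test)
print("success rate: {0:.3f}".format(success_rate))
```

The key design point is that the PCA axes come from `X_train` alone; `X_test` is only projected with `transform`, so no information from the test set leaks into training.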


## Precision vs. Recall

### "The precision-recall curve shows the tradeoff between precision and recall for different threshold. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate.

### Precision (P) is defined as the number of true positives ($T_p$) over the number of true positives plus the number of false positives ($F_p$).

### $P = \frac{T_p}{T_p+F_p}$

### Recall (R) is defined as the number of true positives ($T_p$) over the number of true positives plus the number of false negatives ($F_n$).

### $R = \frac{T_p}{T_p + F_n}$

### These quantities are also related to the ($F_1$) score, which is defined as the harmonic mean of precision and recall.

### $F_1 = 2\frac{P \times R}{P+R}$"


### Finally,

### "The support is the number of occurrences of each class in y_true."

(For more details, see

http://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

and

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html#sklearn.metrics.precision_recall_fscore_support)
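To connect these definitions with the confusion matrix, here is a small worked sketch using NumPy only. The labels in `y_true` and `y_pred` are made up for illustration and have nothing to do with the face data.

```python
import numpy as np

# Made-up labels for 3 classes (illustrative only).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 2, 2, 2])
y_pred = np.array([0, 0, 1, 2, 1, 1, 0, 2, 2, 2])
n_classes = 3

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

tp = np.diag(cm)            # true positives per class
fp = cm.sum(axis=0) - tp    # predicted as this class, but actually another
fn = cm.sum(axis=1) - tp    # belong to this class, but predicted as another
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
support = cm.sum(axis=1)    # occurrences of each class in y_true

print(cm)
print("precision:", np.round(precision, 3))
print("recall:   ", np.round(recall, 3))
print("F1:       ", np.round(f1, 3))
print("support:  ", support)
```

For class 0, for example, $T_p = 2$, $F_p = 1$ and $F_n = 2$, so $P = 2/3$ and $R = 1/2$. In practice, `sklearn.metrics.confusion_matrix` and `sklearn.metrics.classification_report` compute the same quantities directly from `y_true` and `y_pred`.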


## End of Week15-3