# Week 8 Module 4 Assignment 2
## Francis Yang 11/25/2022
### Face Recognition (from Python Data Science Handbook by Jake VanderPlas)

As an example of support vector machines in action, let's take a look at the facial recognition problem. We will use the Labeled Faces in the Wild dataset, which consists of several thousand collated photos of various public figures. A fetcher for the dataset is built into Scikit-Learn:

In [1]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.datasets import fetch_lfw_people

faces = fetch_lfw_people(min_faces_per_person=60)

print(faces.target_names)
print(faces.images.shape)

['Ariel Sharon' 'Colin Powell' 'Donald Rumsfeld' 'George W Bush'
 'Gerhard Schroeder' 'Hugo Chavez' 'Junichiro Koizumi' 'Tony Blair']
(1348, 62, 47)


Each image contains 62×47 = 2,914 pixels, or nearly 3,000. We could proceed by simply using each pixel value as a feature, but it is often more effective to use a preprocessor to extract more meaningful features; here we will use principal component analysis (PCA, which we will learn about later) to extract 150 fundamental components to feed into our support vector machine classifier. The most straightforward way to do this is to package the preprocessor and the classifier into a single pipeline:

In [2]:
from sklearn.svm import SVC
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline

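# Project each image onto 150 whitened principal components
pca = PCA(n_components=150, whiten=True, random_state=42)
# SVC uses the RBF kernel by default; 'balanced' reweights classes by inverse frequency
svc = SVC(class_weight='balanced')

# The step names ('pca', 'estimator') become parameter prefixes in the grid search
pipe = Pipeline([('pca', pca), ('estimator', svc)])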
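
To illustrate how a pipeline chains the two steps, here is a minimal sketch on scikit-learn's small bundled digits dataset, used only as a stand-in so the example runs without downloading the faces. The name `demo_pipe` and the choice of 20 components are assumptions for illustration, not part of the assignment.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Stand-in data: 8x8 digit images flattened to 64 features (not the faces dataset)
X, y = load_digits(return_X_y=True)

# Hypothetical demo pipeline with the same step names as above
demo_pipe = Pipeline([
    ("pca", PCA(n_components=20, whiten=True, random_state=42)),
    ("estimator", SVC(class_weight="balanced")),
])

# Fitting the pipeline fits PCA first, then trains the SVC on the PCA output
demo_pipe.fit(X, y)

# The fitted PCA step can be pulled out by name to inspect the reduced features
reduced = demo_pipe.named_steps["pca"].transform(X)
print(X.shape, "->", reduced.shape)

# Parameters of a step are addressed as <step name>__<parameter name>
demo_pipe.set_params(estimator__C=4, estimator__gamma=0.003)
print(demo_pipe.get_params()["estimator__C"])
```

The `<step name>__<parameter name>` convention shown in the last lines is what lets a grid search tune parameters of an individual step inside the pipeline.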

Finally, we can use grid search cross-validation to explore combinations of parameters. Here we will adjust C (which controls the margin hardness) and gamma (which controls the width of the radial basis function kernel), and determine the best model. We first check which fields the dataset object exposes, then split the data and run the search:

In [3]:
faces.keys()

dict_keys(['data', 'images', 'target', 'target_names', 'DESCR'])

In [4]:
from sklearn.model_selection import GridSearchCV, train_test_split

# Hold out a test set; the default split keeps 25% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    faces.data, faces.target, random_state=1)

# The 'estimator__' prefix routes each parameter to the SVC step of the pipeline
param_grid = {'estimator__C': np.arange(1, 15, 3),
              'estimator__gamma': np.linspace(0.0001, 0.01, num=15)}

# 5-fold cross-validation over every (C, gamma) combination
grid = GridSearchCV(pipe, param_grid, cv=5)
grid.fit(X_train, y_train)

grid.best_params_

{'estimator__C': 4, 'estimator__gamma': 0.002928571428571429}
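
To get a feel for the size of this search, the following sketch enumerates the same grid in plain Python. The list comprehensions mirror `np.arange(1, 15, 3)` and `np.linspace(0.0001, 0.01, num=15)`; the variable names are illustrative, and the extra fit assumes `GridSearchCV`'s default `refit=True`.

```python
from itertools import product

# Same values as np.arange(1, 15, 3)
c_values = list(range(1, 15, 3))

# Same values as np.linspace(0.0001, 0.01, num=15): 15 evenly spaced points
n_gamma = 15
lo, hi = 0.0001, 0.01
gamma_values = [lo + i * (hi - lo) / (n_gamma - 1) for i in range(n_gamma)]

# Every (C, gamma) pair is one candidate model
candidates = list(product(c_values, gamma_values))

# Each candidate is fit once per fold, plus one final refit on all training data
n_folds = 5
total_fits = len(candidates) * n_folds + 1

print(c_values)                      # [1, 4, 7, 10, 13]
print(len(candidates), total_fits)   # 75 376
print(gamma_values[4])               # the gamma that was selected above, about 0.00293
```

The selected values (C = 4 and the fifth gamma) lie inside the grid rather than on its boundary. If the best values had landed on an edge, it would be reasonable to widen the grid and search again.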

In [5]:
grid.best_score_

0.8368287567673024

We can get a better sense of our estimator's performance on the held-out test set using the classification report, which lists precision, recall, and F1-score label by label:

In [7]:
from sklearn.metrics import classification_report

# Best pipeline found by the grid search, already refit on the full training set
model = grid.best_estimator_
yfit = model.predict(X_test)

print(classification_report(y_test, yfit,
                            target_names=faces.target_names))

                   precision    recall  f1-score   support

     Ariel Sharon       0.76      0.93      0.84        14
     Colin Powell       0.78      0.91      0.84        54
  Donald Rumsfeld       0.95      0.67      0.78        30
    George W Bush       0.90      0.93      0.92       134
Gerhard Schroeder       0.83      0.81      0.82        31
      Hugo Chavez       0.94      0.75      0.83        20
Junichiro Koizumi       1.00      0.83      0.91        12
       Tony Blair       0.90      0.88      0.89        42

         accuracy                           0.87       337
        macro avg       0.88      0.84      0.85       337
     weighted avg       0.88      0.87      0.87       337
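
As a check on how the summary rows relate to the per-label rows, the sketch below recomputes them from the numbers printed in the report. The values are copied from the table, so they are already rounded to two decimals and the results agree only up to that rounding; the helper `f1_from` and the dictionary name are illustrative.

```python
# Rows copied from the report above: (precision, recall, f1, support)
report_rows = {
    "Ariel Sharon":      (0.76, 0.93, 0.84, 14),
    "Colin Powell":      (0.78, 0.91, 0.84, 54),
    "Donald Rumsfeld":   (0.95, 0.67, 0.78, 30),
    "George W Bush":     (0.90, 0.93, 0.92, 134),
    "Gerhard Schroeder": (0.83, 0.81, 0.82, 31),
    "Hugo Chavez":       (0.94, 0.75, 0.83, 20),
    "Junichiro Koizumi": (1.00, 0.83, 0.91, 12),
    "Tony Blair":        (0.90, 0.88, 0.89, 42),
}


def f1_from(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


rows = list(report_rows.values())
total_support = sum(support for _, _, _, support in rows)

# Macro average: plain mean over labels, so every label counts equally
macro_f1 = sum(f1 for _, _, f1, _ in rows) / len(rows)

# Weighted average: mean over labels weighted by each label's support
weighted_f1 = sum(f1 * support for _, _, f1, support in rows) / total_support

print(round(f1_from(0.76, 0.93), 2))   # 0.84, the Ariel Sharon f1-score
print(total_support)                   # 337
print(round(macro_f1, 2))              # 0.85
print(round(weighted_f1, 2))           # 0.87
```

The weighted average is pulled toward the George W Bush row because that label accounts for 134 of the 337 test images, while the macro average treats all eight labels equally.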

