
### Exercise 3: Build an SVM Model for a Face Recognition Problem
##### (25 points) --> your total will be divided by 5 to get 5 points for this exercise.
---

We will use a well-known dataset called Labeled Faces in the Wild (LFW), a collection of
face images of famous people. It is available at http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz.

However, note that it can be easily imported via scikit-learn from the sklearn.datasets module.
With the setting used below (min_faces_per_person=60), the loaded subset contains 1348 faces, and each image consists of 2914 features: we could proceed by simply using each of them in the model.



Fitting an SVM to non-linear data using the kernel trick produces non-linear decision boundaries.
In particular, we seek to:
* Build an SVM model with a radial basis function (RBF) kernel
* Use grid search cross-validation to explore combinations of parameters.

### Steps to do:

1. Load the data from sklearn.datasets:

In [42]:
import seaborn as sns 
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import fetch_lfw_people
faces = fetch_lfw_people(min_faces_per_person=60)

2. Since the data is accessed through the sklearn.datasets module, start by exploring the dataset.
    - (referring to the first 6 steps in Lab_1 could help you)

a- Print the field names (that is, the keys to the dictionary) (1 point)

In [43]:
# What fields are in the dictionary?
print(faces.keys())

dict_keys(['data', 'images', 'target', 'target_names', 'DESCR'])


b- Print the dataset description it contains (2 points)

In [44]:
# write your code here
print(faces['DESCR'])

.. _labeled_faces_in_the_wild_dataset:

The Labeled Faces in the Wild face recognition dataset
------------------------------------------------------

This dataset is a collection of JPEG pictures of famous people collected
over the internet, all details are available on the official website:

http://vis-www.cs.umass.edu/lfw/

Each picture is centered on a single face. The typical task is called
Face Verification: given a pair of two pictures, a binary classifier
must predict whether the two images are from the same person.

An alternative task, Face Recognition or Face Identification is:
given the picture of the face of an unknown person, identify the name
of the person by referring to a gallery of previously seen pictures of
identified persons.

Both Face Verification and Face Recognition are tasks that are typically
performed on the output of a model trained to perform Face Detection. The
most popular model for Face Detection is called Viola-Jones and is
implemented in the OpenCV li

3. Print the data, its shape, and the target names. (3 points)

In [45]:
# What does the data look like?
print(faces['data'])

[[0.53333336 0.52418303 0.49673203 ... 0.00653595 0.00653595 0.00130719]
 [0.28627452 0.20784314 0.2535948  ... 0.96993464 0.95032686 0.9346406 ]
 [0.31633988 0.3895425  0.275817   ... 0.4261438  0.7895425  0.9555555 ]
 ...
 [0.11633987 0.11111111 0.10196079 ... 0.5660131  0.579085   0.5542484 ]
 [0.19346406 0.21045752 0.29150328 ... 0.6875817  0.6575164  0.5908497 ]
 [0.12418301 0.09673203 0.10849673 ... 0.12941177 0.16209151 0.29150328]]


In [46]:
# what is the shape of the data?
faces['data'].shape

(1348, 2914)
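
The second dimension, 2914, is the number of pixels in each flattened image. As a minimal sketch, assuming the default resize setting of fetch_lfw_people (which yields 62 x 47 pixel images, and 62 * 47 = 2914), a flat row can be reshaped back into a 2-D image. The array below is a stand-in, not real face data:

```python
import numpy as np

# Assumption: default fetch_lfw_people settings give 62 x 47 pixel images.
height, width = 62, 47

# Stand-in for one row of faces['data'] (a flattened grayscale image).
flat_row = np.arange(height * width, dtype=np.float32)

# Reshape the flat vector back into image form (rows x columns).
image = flat_row.reshape(height, width)

print(flat_row.shape)  # (2914,)
print(image.shape)     # (62, 47)
```

With the real dataset, faces['images'] should already hold the pictures in this 2-D form, which is convenient for plotting.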

In [47]:
# What are the target names?
faces['target_names']


array(['Ariel Sharon', 'Colin Powell', 'Donald Rumsfeld', 'George W Bush',
       'Gerhard Schroeder', 'Hugo Chavez', 'Junichiro Koizumi',
       'Tony Blair'], dtype='<U17')

4. Divide the data into features (X) using the faces.data and target (y) using faces.target (2 points)

In [48]:
# Write your code here
X = faces['data']
y = faces['target']

5. Split the data into training and testing sets. (2 points)

We train the model with 70% of the samples and test with the remaining 30%.

In [49]:
# Write your code here
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Print the sizes of the training and test sets to verify that the split worked as expected.
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)



(943, 2914)
(405, 2914)
(943,)
(405,)
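
The sizes printed above can be checked by hand. This is a small sketch, assuming that train_test_split rounds the test count up and gives the remainder to the training set when only test_size is passed, which is consistent with the output above:

```python
import math

n_samples = 1348   # rows in faces['data']
test_size = 0.3    # fraction requested for the test set

# Assumption: the test count is rounded up, the rest goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 943 405
```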


6. Declare SVM model with kernel='rbf', class_weight='balanced' (2 points)

In [None]:
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA

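# write your code here
# PCA first reduces the 2914 pixel features to 100 whitened components,
# then an RBF-kernel SVM with balanced class weights is fit on them.
pipe = Pipeline([
    ("pca", PCA(n_components=100, whiten=True, random_state=42)),
    ("svm", SVC(kernel="rbf", class_weight="balanced"))
])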

7. Use grid search cross-validation with 10 folds (cv=10) to explore combinations of parameters. (3 points)
    - we will adjust C, which controls the trade-off between a wide margin and misclassified training points
    - and gamma (γ), which controls the width of the radial basis function kernel, and determine the best model.
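
To make the role of gamma concrete, here is a minimal sketch of the RBF kernel value k(x, z) = exp(-gamma * ||x - z||^2) in plain Python. The helper name rbf_similarity and the two sample points are hypothetical and chosen only for illustration; this is not the code scikit-learn runs internally.

```python
import math

def rbf_similarity(x, z, gamma):
    """RBF kernel value between two points (hypothetical helper)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * squared_distance)

x = [0.0, 0.0]
z = [3.0, 4.0]  # squared distance to x is 25

# A larger gamma makes the similarity fall off faster with distance,
# so each training point influences a smaller neighbourhood.
for gamma in (0.001, 0.01, 0.1):
    print(gamma, round(rbf_similarity(x, z, gamma), 4))
```

A small gamma gives a smooth, nearly linear decision boundary, while a large gamma lets the boundary bend around individual points, which increases the risk of overfitting.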

In [None]:
from sklearn.model_selection import GridSearchCV
param_grid = {"svm__C": [1, 5, 10, 50], "svm__gamma": [0.001, 0.0005, 0.01, 0.1]}

# write your code here for GridSearchCV:

grid = GridSearchCV(estimator=pipe, param_grid=param_grid, cv=10, n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)
best_model = grid.best_estimator_
print(grid.best_params_, grid.best_score_)


Fitting 10 folds for each of 16 candidates, totalling 160 fits
{'svm__C': 1, 'svm__gamma': 0.01} 0.8292273236282195
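
The "16 candidates" and "160 fits" in the log follow directly from the grid: every value of C is paired with every value of gamma, and each pair is fitted once per fold. A short sketch that enumerates the same combinations with the standard library (not how GridSearchCV is implemented, only the counting):

```python
from itertools import product

param_grid = {"svm__C": [1, 5, 10, 50], "svm__gamma": [0.001, 0.0005, 0.01, 0.1]}
n_folds = 10

# Cartesian product of all parameter values: one dict per candidate.
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)  # 16 160
```

Because the search is exhaustive, adding one more value to either list adds a full row of candidates, so the cost grows multiplicatively with each parameter.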


8. Predict on the test set, using the best model from the above step (best_estimator_) (5 points)

In [52]:
# write your code here

y_pred = best_model.predict(X_test)



9. Model performance:
Run the following code to print the model evaluation metrics

In [53]:
from sklearn.metrics import classification_report
labels = list(faces.target_names)
print(classification_report(y_test,y_pred,target_names=labels))

                   precision    recall  f1-score   support

     Ariel Sharon       0.87      0.76      0.81        17
     Colin Powell       0.72      0.90      0.80        84
  Donald Rumsfeld       0.81      0.81      0.81        36
    George W Bush       0.88      0.88      0.88       146
Gerhard Schroeder       0.75      0.75      0.75        28
      Hugo Chavez       1.00      0.67      0.80        27
Junichiro Koizumi       1.00      0.81      0.90        16
       Tony Blair       0.86      0.75      0.80        51

         accuracy                           0.83       405
        macro avg       0.86      0.79      0.82       405
     weighted avg       0.84      0.83      0.83       405
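
The last two rows of the report summarize the per-class scores in two different ways. As a sketch, the recall averages can be approximately recomputed from the rounded values printed above (small differences from the exact report are possible because the inputs are already rounded):

```python
# Per-class recall and support, copied from the report above.
recall = [0.76, 0.90, 0.81, 0.88, 0.75, 0.67, 0.81, 0.75]
support = [17, 84, 36, 146, 28, 27, 16, 51]

# Macro average: every class counts equally.
macro_recall = sum(recall) / len(recall)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(r * s for r, s in zip(recall, support)) / sum(support)

print(f"{macro_recall:.2f} {weighted_recall:.2f}")  # 0.79 0.83
```

The macro average is lower here because the small classes with weaker recall pull it down as much as the large classes lift it.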



10. What do you observe about the model performance? (5 points)

#### Write your answer here

Precision measures how many of the samples the model assigned to a class actually belong to that class. Recall measures how many of the samples that truly belong to a class the model was able to find. The F1-score is the harmonic mean of precision and recall.
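
These definitions can be written out as a small helper. The function name is hypothetical, and the counts used for Hugo Chavez are inferred from the rounded report values (27 test images, recall 0.67, precision 1.00), not read from a confusion matrix:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from raw counts (hypothetical helper)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Inferred counts: 18 of 27 images found, no other person labelled as Chavez.
p, r, f1 = precision_recall_f1(tp=18, fp=0, fn=9)
print(f"{p:.2f} {r:.2f} {f1:.2f}")  # 1.00 0.67 0.80
```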

The model's weighted-average F1-score, which balances precision and recall, was 83%, which is good overall performance. Weighted-average precision was 84% and recall was 83%.

People with more samples, like George W. Bush and Colin Powell, have higher recall (0.88 and 0.90), while those with fewer samples, such as Hugo Chavez (0.67) and Junichiro Koizumi (0.81), show lower recall. Their precision is 1.00, so the model rarely predicts these people, but it is correct when it does. This suggests that the model performs better when it has more examples to learn from.

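Overall, the model shows strong performance but could be improved for less-represented classes, for example by further balancing the dataset (class_weight='balanced' is already applied) or by fine-tuning parameters such as C, gamma, and the number of PCA components.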
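
The pipeline already passes class_weight='balanced' to the SVM. As a rough sketch of what that option does, assuming the weighting formula n_samples / (n_classes * class_count) described in the scikit-learn documentation, the snippet below computes such weights. The test-set support from the report is used only as a proxy for class frequencies; the real weights are derived from y_train, whose per-class counts are not shown here.

```python
# Test-set support per class, copied from the classification report.
# Assumption: used as a stand-in for the (unshown) training class counts.
support = {
    "Ariel Sharon": 17, "Colin Powell": 84, "Donald Rumsfeld": 36,
    "George W Bush": 146, "Gerhard Schroeder": 28, "Hugo Chavez": 27,
    "Junichiro Koizumi": 16, "Tony Blair": 51,
}

n_samples = sum(support.values())
n_classes = len(support)

# Rare classes get weights above 1, frequent classes below 1.
weights = {name: n_samples / (n_classes * count) for name, count in support.items()}

for name, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{name:18s} {weight:.2f}")
```

Under this proxy, an error on a Junichiro Koizumi image is penalized roughly nine times as heavily as an error on a George W Bush image, which partly compensates for the imbalance but cannot replace additional training examples.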


