## ChatGPT Face Detector
Below are two face-recognition programs obtained from ChatGPT, each using a different machine learning technique. 

### Background Context
##### K-Nearest Neighbors (KNN)
A popular supervised machine learning algorithm used for regression and classification. The model predicts the label of an unseen sample from the test set based on its similarity to its k nearest neighbors in the training set.<br>
- **Advantage:** simple and easy to implement <br>
- **Disadvantage:** lazy learner (it stores the training set and defers all computation to prediction time), so predictions are slower and more costly in memory
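
The "lazy learner" behavior can be illustrated with a minimal from-scratch sketch. This is not the scikit-learn implementation used later; the function name `knn_predict` and the toy 2-D points are assumptions made up for illustration.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is only storing the data; all distance work happens here,
    # at prediction time, which is why KNN is called a lazy learner.
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy 2-D data (hypothetical): two small, well-separated clusters.
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_labels = ["happy", "happy", "happy", "sad", "sad", "sad"]

prediction_near_first = knn_predict(train_points, train_labels, (1.1, 1.0))
prediction_near_second = knn_predict(train_points, train_labels, (5.1, 5.0))
print(prediction_near_first, prediction_near_second)
```

Every prediction scans the whole training set, so the cost of a query grows with the amount of stored data.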

##### Convolutional Neural Network (CNN)
A type of deep learning neural network architecture used for image and speech processing. By stacking multiple interconnected layers, a CNN can extract useful features from the input data and use them to make predictions.<br>
- **Advantage:** multiple layers let the network capture and recognize variations in the data<br>
- **Disadvantage:** high complexity (expensive to train and use)<br>
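
To make "extracting features" concrete, here is a minimal sketch of what a single convolutional filter followed by ReLU computes. The helper names `conv2d_valid` and `relu`, the toy image, and the hand-picked kernel are assumptions for illustration; in a real CNN the kernel values are learned during training rather than chosen by hand.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the feature map."""
    kernel_h, kernel_w = len(kernel), len(kernel[0])
    out_h = len(image) - kernel_h + 1
    out_w = len(image[0]) - kernel_w + 1
    feature_map = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            # Weighted sum of the image patch under the kernel.
            total = 0
            for i in range(kernel_h):
                for j in range(kernel_w):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        feature_map.append(out_row)
    return feature_map


def relu(feature_map):
    """Keep positive responses, set negative ones to zero."""
    return [[max(0, value) for value in row] for row in feature_map]


# 4x4 toy image: dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to a dark-to-bright vertical edge.
vertical_edge_kernel = [
    [-1, 1],
    [-1, 1],
]

edges = relu(conv2d_valid(image, vertical_edge_kernel))
print(edges)
```

The feature map is non-zero only in the column where the brightness changes, which is the kind of local pattern that later layers combine into larger features.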

### About the Dataset
The dataset contains two sets of data, training and testing, in the `train` and `val` folders respectively. Each contains randomly chosen images displaying two different types of facial expression: happy and sad. 

Data size:

|  | train | test (`val`) |
| --- | --- | --- |
| happy men | 25 | 5 |
| happy women | 25 | 5 |
| sad men | 25 | 5 |
| sad women | 25 | 5 |

[Image Source](https://stock.adobe.com/)

### Your tasks: 
KNN Model:
- Explain what `accuracy` tells you.
- Compute `precision` and `recall` and explain what they mean.
<br>

CNN Model:
- Train the model with the given dataset. What could potentially improve the `accuracy` of the model?
- Explain what `loss` and `accuracy` tell you.
- Compute TP, TN, FP, FN
- Compute Precision and Recall
- How many of the True Positives were male? Female?
- How many of the True Negatives were male? Female?
- Create a bar chart to show these proportions as percentages.

### Format:
- For questions that require justification, include all your answers in a single Markdown cell after each program.
- For questions that require programming output, make sure it's clear what each output is. 

In [1]:
#use KNN
# Import necessary libraries
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Load the LFW dataset
# downloads and returns the images of people's face and labels
# only include people with at least 70 images in the dataset
lfw = datasets.fetch_lfw_people(min_faces_per_person=70)

print(lfw)

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(lfw.data, lfw.target, test_size=0.2, random_state=42)

# Train a K-Nearest Neighbors classifier
# n_neighbors: use the 5 nearest neighbors to make the predictions
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Make predictions on the testing set
y_pred = knn.predict(X_test)

# Calculate the accuracy of the classifier
# comparing the predicted labels (y_pred) with the true labels (y_test)
accuracy = accuracy_score(y_test, y_pred)

# Print the accuracy of the classifier
print("\nAccuracy:", accuracy)

# Accuracy is the proportion of correctly predicted samples out of the total number of samples;
# the closer to 1, the better

{'data': array([[0.9973857 , 0.9973857 , 0.99607843, ..., 0.38431373, 0.3869281 ,
        0.3803922 ],
       [0.14509805, 0.1633987 , 0.21437909, ..., 0.44575164, 0.4509804 ,
        0.58300656],
       [0.34379086, 0.3503268 , 0.4366013 , ..., 0.7163399 , 0.7202614 ,
        0.7176471 ],
       ...,
       [0.35947713, 0.34901962, 0.32026145, ..., 0.21699347, 0.21568628,
        0.17777778],
       [0.19346406, 0.21176471, 0.2901961 , ..., 0.6862745 , 0.654902  ,
        0.5908497 ],
       [0.12287582, 0.09803922, 0.10980392, ..., 0.12941177, 0.1633987 ,
        0.29150328]], dtype=float32), 'images': array([[[0.9973857 , 0.9973857 , 0.99607843, ..., 0.26928106,
         0.23267974, 0.20261438],
        [0.9973857 , 0.99607843, 0.99477124, ..., 0.275817  ,
         0.24052288, 0.20915033],
        [0.9882353 , 0.97647065, 0.96732026, ..., 0.26928106,
         0.24052288, 0.21830066],
        ...,
        [0.3372549 , 0.2784314 , 0.20522876, ..., 0.4117647 ,
         0.39869282, 0.37

### What does accuracy tell us 
The accuracy tells us how well the KNN classifier performed on the testing set of the LFW data. It is the fraction of test instances for which the KNN's prediction was correct, out of all instances tested. 

### What does Precision tell us
Precision measures how reliable the model's positive predictions are. It divides the number of true positives by the total number of predicted positives (true positives plus false positives). The higher the precision, the fewer false positive predictions.

### What does the recall tell us
Recall divides the number of true positives by the total number of actual positives (true positives plus false negatives). The higher the recall, the better the model is at finding all positive instances, that is, the fewer positives it misses as false negatives. 
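
The three definitions above can be checked on a small binary example. This is a sketch with made-up labels (not the LFW results), and the helper name `binary_metrics` is an assumption for illustration.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision and recall for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Of everything predicted positive, how much really was positive?
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of everything that really was positive, how much did we find?
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true_example = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_example = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

metrics = binary_metrics(y_true_example, y_pred_example)
print(metrics)
```

Here one positive is missed (a false negative, which lowers recall) and one negative is flagged as positive (a false positive, which lowers precision).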

In [2]:
# Precision and recall, averaged over the classes and weighted by class support
p_score = precision_score(y_test,y_pred,average="weighted")
r_score = recall_score(y_test, y_pred,average="weighted")

print("\nPrecision:", p_score)
print("\nRecall:", r_score)


Precision: 0.5821667345464655

Recall: 0.5852713178294574


In [3]:
#use CNN
import tensorflow as tf
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Define the directory paths for the training and validation datasets
train_dir = 'train'
val_dir = 'val'

# Define the number of classes (happy and sad)
num_classes = 2

# Define the input shape of the images
input_shape = (160, 160, 1)

# Define the batch size for the data generators
batch_size = 5

# Define the data generators for the training and validation datasets
train_datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)

val_datagen = ImageDataGenerator(rescale=1./255)

train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=input_shape[:2],
    color_mode='grayscale',
    batch_size=batch_size,
    class_mode='categorical',
    shuffle=True  # shuffle training data; with shuffle=False each batch holds a single class
)

val_generator = val_datagen.flow_from_directory(
    val_dir,
    target_size=input_shape[:2],
    color_mode='grayscale',
    batch_size=batch_size,
    class_mode='categorical',
    shuffle=False
)

# Define the model architecture
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(32, (3,3), activation='relu', input_shape=input_shape),
    tf.keras.layers.MaxPooling2D((2,2)),
    tf.keras.layers.Conv2D(64, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2,2)),
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2,2)),
    tf.keras.layers.Conv2D(128, (3,3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2,2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(num_classes, activation='softmax')
])

# Compile the model
model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# Train the model
history = model.fit(
    train_generator,
    steps_per_epoch=train_generator.samples//batch_size,
    epochs=10,
    validation_data=val_generator,
    validation_steps=val_generator.samples//batch_size
)

# Save the model
model.save('face_classification_model.h5')

Found 100 images belonging to 2 classes.
Found 20 images belonging to 2 classes.
Epoch 1/10


Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


**Accuracy:** Proportion of correct predictions out of the total number of predictions on the training/testing data.
- The higher the better
<br>

**Loss:** Measures the difference between the predicted and true outputs on the training/testing data. 
- The lower the better

### How can we improve accuracy?
One way to improve the accuracy of the model is to start from a partially pre-trained model (transfer learning). The model then has a better starting point and has less to learn from this small dataset, which makes the learning curve less steep.

### What do loss and accuracy tell us
The loss measures how far the model's predicted outputs are from the correct outputs for the given inputs, while accuracy is the proportion of correctly classified samples out of the total number of samples. 
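
The difference between the two quantities shows up when predictions are equally correct but not equally confident. Below is a minimal sketch of per-sample categorical cross-entropy (the loss named in `model.compile`), written by hand rather than with Keras; the helper names and the probability values are assumptions for illustration.

```python
import math


def categorical_cross_entropy(true_one_hot, predicted_probs):
    """Loss for one sample: -sum(y_true * log(y_pred))."""
    return -sum(
        t * math.log(p) for t, p in zip(true_one_hot, predicted_probs) if t > 0
    )


def is_correct(true_one_hot, predicted_probs):
    """Accuracy only asks whether the most probable class is the true class."""
    return predicted_probs.index(max(predicted_probs)) == true_one_hot.index(1)


true_label = [0, 1]              # one-hot: the true class is index 1
confident_correct = [0.1, 0.9]   # hypothetical softmax outputs
barely_correct = [0.49, 0.51]
wrong = [0.8, 0.2]

loss_confident = categorical_cross_entropy(true_label, confident_correct)
loss_barely = categorical_cross_entropy(true_label, barely_correct)
loss_wrong = categorical_cross_entropy(true_label, wrong)

for name, probs, loss in [
    ("confident", confident_correct, loss_confident),
    ("barely", barely_correct, loss_barely),
    ("wrong", wrong, loss_wrong),
]:
    print(name, "correct:", is_correct(true_label, probs), "loss:", round(loss, 3))
```

The first two predictions count the same toward accuracy, but the loss is much lower for the confident one, so loss can keep improving even while accuracy stays flat.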

In [14]:
import numpy as np
import matplotlib.pyplot as plt 

model_eval_results = model.evaluate(val_generator)

y_pred = model.predict(val_generator)
y_pred_c = np.argmax(y_pred,axis=1)
y_test = val_generator.classes

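# Confusion-matrix counts, treating class index 1 as "positive".
# flow_from_directory assigns indices in alphabetical order of the sub-folder names
# (check val_generator.class_indices to see which expression is class 1).
TN = np.sum(np.logical_and(y_pred_c == 0, y_test == 0))
FN = np.sum(np.logical_and(y_pred_c == 0, y_test == 1))
TP = np.sum(np.logical_and(y_pred_c == 1, y_test == 1))
FP = np.sum(np.logical_and(y_pred_c == 1, y_test == 0))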

# Precision and recall, averaged over both classes and weighted by class support
p = precision_score(y_test,y_pred_c,average="weighted")
r = recall_score(y_test, y_pred_c,average="weighted")

print("TP -> ", TP ,"\nTN ->",TN,"\nFP ->",FP,"\nFN ->",FN)
print("Precision ->",p)
print("Recall ->",r)


TP ->  2 
TN -> 5 
FP -> 5 
FN -> 8
Precision -> 0.33516483516483514
Recall -> 0.35
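
The precision and recall printed above are weighted averages over both classes, not the scores of the positive class alone. The following sketch redoes the arithmetic by hand from the reported counts; it does not call scikit-learn, and the variable names are chosen here for illustration.

```python
# Confusion-matrix counts reported above (class 1 treated as "positive").
TP, TN, FP, FN = 2, 5, 5, 8

# Per-class scores.
precision_pos = TP / (TP + FP)  # of all class-1 predictions, the share that was right
recall_pos = TP / (TP + FN)     # of all actual class-1 images, the share that was found
precision_neg = TN / (TN + FN)  # same idea with class 0 as the class of interest
recall_neg = TN / (TN + FP)

# Support = number of actual samples in each class.
support_pos = TP + FN
support_neg = TN + FP
total = support_pos + support_neg

# Weighted average: each class score counts in proportion to its support.
weighted_precision = (support_pos * precision_pos + support_neg * precision_neg) / total
weighted_recall = (support_pos * recall_pos + support_neg * recall_neg) / total
accuracy = (TP + TN) / total

print("class-1 precision:", precision_pos, "class-1 recall:", recall_pos)
print("weighted precision:", weighted_precision)
print("weighted recall:", weighted_recall, "accuracy:", accuracy)
```

The weighted values reproduce the reported 0.335 and 0.35, while the positive class on its own has precision 2/7 and recall 2/10. Weighted recall always equals accuracy, because each class's support cancels against the denominator of its recall.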
