# Abstract

##### Facial recognition has long been one of the more difficult areas of data science. Whereas other types of data have easy-to-define features and are relatively simple, facial data contains a great deal of hidden or noisy information. As a result, facial recognition remains a daunting field in which no single approach is guaranteed to achieve the desired result. Although the human brain excels at instinctively deriving hard-to-define features at a glance, programs still struggle to extract something as basic as gender. Yet a model that performed as well as humans under all conditions would increase efficiency in many fields. Basic examples include medical diagnoses based on facial features, removing the need for identification documents, and easier sign-in to one's favorite sites. In light of this, our task was to perform an exploratory analysis of several preprocessing techniques, identify the best-performing models, and find the best-performing hyperparameters for those models. Out of four preprocessing techniques (Label Balancing, SIFT, PCA, RFS), we determined that Label Balancing with oversampling was the best for generalization, while the other techniques lowered training time in exchange for a far greater error rate. Out of four models explored (GBC, CNN, RFS, SVM), GBC and CNN were chosen because they reached similarly high accuracy while using different training methods. We then determined the best hyperparameters for each model and visualized how each model functioned at peak performance.

# Introduction

# Background

# Data

##### The dataset was taken from Kaggle and can be downloaded at this link. The Faces dataset contains 20000+ cropped and aligned facial images with age, race, and gender labels. The age label was modified into 9 stages based on The Stages of Human Life Cycle. The project uses 10500 training, 3500 validation, and 6000 testing examples. After balancing the training examples, we took 5000 samples proportionally. Pixel values are integers between 0 and 255; to normalize the data, we divide by 255. Each image has a shape of (200,200,3). Features were extracted by PCA.
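
##### As a minimal sketch of the scaling step described above, the snippet below divides 8-bit pixel values by 255 so that they fall in the range 0 to 1. The function name `scale_pixels` and the random stand-in batch are illustrative assumptions, not the project's own code.

```python
import numpy as np

def scale_pixels(images: np.ndarray) -> np.ndarray:
    """Map integer pixel values in [0, 255] to floats in [0.0, 1.0]."""
    return images.astype(np.float32) / 255.0

# Stand-in batch of 4 random images (hypothetical data, not the Kaggle set).
rng = np.random.default_rng(0)
batch = rng.integers(0, 256, size=(4, 200, 200, 3), dtype=np.uint8)
scaled = scale_pixels(batch)
```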

# Methods

## Preprocessing

### Balance via Oversampling

### SIFT - (Scale Invariant Feature Transform)

##### SIFT is a technique for reducing the complexity of an image by transforming it into a histogram of commonly found features. Within the SIFT algorithm, the features of an image are called keypoints. A keypoint is a local extremum in the image, found by comparing a pixel with its neighbors for drastic shifts in pixel values. Next, a descriptor is taken of the local area around each keypoint, which consists of a 128-bin feature vector. This vector describes the local area and a direction, allowing the keypoint to be matched despite rotation. The 128-bin descriptors are collected from the training images and clustered via K-means to produce common descriptors. Each image can then be transformed into a histogram in which each bin counts the number of times a common feature was detected within the image.
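
##### The clustering and histogram steps can be sketched as follows. This is an illustration under stated assumptions: the 128-dimensional descriptors are random stand-ins rather than real SIFT output, the vocabulary size of 8 is arbitrary, and the helper names (`kmeans`, `nearest_word`, `bow_histogram`) are hypothetical. A plain Lloyd's K-means is written out in NumPy to keep the example self-contained.

```python
import numpy as np

def nearest_word(descriptors, centers):
    """Index of the closest center (Euclidean distance) for each descriptor."""
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def kmeans(descriptors, k, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns a (k, d) array of common descriptors."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=k, replace=False)
    centers = descriptors[start].astype(float)
    for _ in range(n_iter):
        labels = nearest_word(descriptors, centers)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers

def bow_histogram(descriptors, centers):
    """Count how many of one image's descriptors fall into each common descriptor."""
    if len(descriptors) == 0:
        return np.zeros(len(centers), dtype=int)
    return np.bincount(nearest_word(descriptors, centers), minlength=len(centers))

# Stand-in descriptors: 10 "images", each with 5 to 14 random 128-d vectors.
rng = np.random.default_rng(0)
per_image = [rng.random((int(rng.integers(5, 15)), 128)) for _ in range(10)]
vocab = kmeans(np.vstack(per_image), k=8)
hist = bow_histogram(per_image[0], vocab)
```

##### Every image ends up as a vector of length `k`, regardless of how many keypoints it contains, which is what allows a fixed-size model input.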

### PCA

### RFS

## Models

### Neural Network

##### Each neural network was run for a maximum of 10 epochs with the Adam optimizer and sparse categorical cross-entropy loss. A callback with a patience of 5 monitored the validation accuracy, so that the model would return the weights for the best validation accuracy should it run for 5 epochs without improvement. Testing is done on the age variable because it has the most unbalanced and varied classes.
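
##### The patience rule can be illustrated without any deep learning library. The sketch below walks over a list of per-epoch validation accuracies and reports which epoch's weights would be kept and where training would stop. The function name and the accuracy values are made up for illustration.

```python
def early_stopping_trace(val_accuracies, patience=5):
    """Return (best_epoch, best_accuracy, stopped_epoch) for a run.

    Training stops once `patience` consecutive epochs fail to improve on the
    best validation accuracy seen so far; the best epoch's weights are kept.
    """
    best_acc = float("-inf")
    best_epoch = -1
    wait = 0
    stopped_epoch = len(val_accuracies) - 1  # default: ran every epoch
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch, wait = acc, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                stopped_epoch = epoch
                break
    return best_epoch, best_acc, stopped_epoch

# Hypothetical 10-epoch history: validation accuracy peaks at epoch 1.
history = [0.30, 0.35, 0.34, 0.33, 0.33, 0.32, 0.31, 0.30, 0.29, 0.28]
best_epoch, best_acc, stopped_epoch = early_stopping_trace(history, patience=5)
```

##### In Keras, comparable behaviour is available through `tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=5, restore_best_weights=True)`; that the project used exactly this callback is an assumption.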
##### Metrics used are overall accuracy, macro f1-score, macro recall, and macro precision. These let us compare how the model is doing on the entire validation dataset as well as whether it performs equally well across all classes.
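
##### To make the difference between overall accuracy and the macro metrics concrete, here is a small sketch in plain Python. The labels are invented: a two-class problem in which a model always predicts the majority class. The function name `macro_metrics` is hypothetical.

```python
def macro_metrics(y_true, y_pred, labels):
    """Overall accuracy plus unweighted (macro) precision, recall, and f1."""
    precisions, recalls, f1s = [], [], []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }

# Invented example: 8 samples of class 0, 2 of class 1, model predicts 0 always.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
scores = macro_metrics(y_true, y_pred, labels=[0, 1])
```

##### Here the accuracy is 0.8 while the macro recall is only 0.5, because the ignored minority class counts as much as the majority class in the macro average.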

#### Unbalanced Dataset

##### Although training on an unbalanced dataset is not recommended, its effect on a model should still be investigated. The effects of the raw dataset on various metrics are illustrated in the figures below.

<table>
<tr width = "2000">
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/UnbalancedModel/CLValReport.PNG"/>
            <figcaption>Fig.1 Classification report for the validation set on age</figcaption>
        </figure>
    </td>
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/UnbalancedModel/ConMatValAge.PNG" width="400" />
            <figcaption>Fig.2 Classification Matrix on the validation set</figcaption>
        </figure>
    </td>
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/UnbalancedModel/ModelAcc.PNG" width="400" />
            <figcaption>Fig.3 Model Accuracy per Epoch</figcaption>
        </figure>
    </td>
</tr>
</table>

##### In figure 1, the accuracy of 0.56 is better than that of all other models. However, compare the f1 scores for classes 1, 2, and 3 with the f1 scores for classes 0, 5, 6, and 7. Classes 1, 2, and 3 were the least frequent labels, while classes 0, 5, 6, and 7 were far more common. Since the least common classes receive no predictions whatsoever, we can conclude that the model did not see enough training examples for those classes. Another important metric to consider is the macro avg, which is the unweighted mean of a per-class metric, so every class counts equally regardless of its size. This will be compared to the next model, which has balanced classes.
##### Figure 2 further illustrates the issues presented in the classification report. The matrix clearly shows how most predictions by percentage were clustered around the most common labels.
##### Figure 3 can be deceptive if one does not check the classification report. Although it displays high accuracy compared to later dummy models, it shows total accuracy, which favors the larger classes by default.

#### Balanced Dataset

##### This dataset was created by oversampling the under-represented class labels until each reached the size of the most frequent class. A balanced sample of 5000 was then taken from the result. The results of the model training are shown below.
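
##### The two steps above can be sketched with NumPy index arrays. This is an illustration only: the helper names, the toy label counts, and the choice to draw the same number of samples per class are assumptions, since the exact sampling code is not shown in this report.

```python
import numpy as np

def oversample_to_majority(labels, seed=0):
    """Indices in which every class appears as often as the most frequent one."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    target = counts.max()
    picked = []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(labels == c)
        # Duplicate randomly chosen members until the class reaches `target`.
        extra = rng.choice(idx, size=target - n, replace=True)
        picked.append(np.concatenate([idx, extra]))
    return np.concatenate(picked)

def balanced_subsample(labels, indices, per_class, seed=0):
    """Draw `per_class` indices of each class from an oversampled index set."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    chosen = []
    for c in np.unique(labels[indices]):
        pool = indices[labels[indices] == c]
        chosen.append(rng.choice(pool, size=per_class, replace=False))
    return rng.permutation(np.concatenate(chosen))

# Toy labels (hypothetical counts): class 0 dominates, class 1 is rare.
labels = np.array([0] * 50 + [1] * 5 + [2] * 20)
over = oversample_to_majority(labels)
sub = balanced_subsample(labels, over, per_class=10)
```

##### Working with indices rather than copying images keeps the duplicated samples cheap until the final subset is materialized.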

<table>
<tr width = "2000">
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/BalancedModel/CLValReport.PNG"/>
            <figcaption>Fig.4 Classification report for the validation balanced set</figcaption>
        </figure>
    </td>
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/BalancedModel/CLMatrixVal.PNG" />
            <figcaption>Fig.5 Classification Matrix on the validation balanced set</figcaption>
        </figure>
    </td>
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/BalancedModel/AccPlot.PNG" />
            <figcaption>Fig.6 Model Accuracy per Epoch</figcaption>
        </figure>
    </td>
</tr>
</table>

##### In figure 4, the accuracy is far lower than on the unbalanced dataset. However, the macro average is far higher than for the unbalanced model. In addition, the least common labels are being predicted to a far greater extent than by the unbalanced model. While the most common labels have a lower accuracy, this is only because the model is no longer blindly predicting the most common labels.
##### As for figure 5, the prediction distribution is far more balanced compared to the unbalanced dataset. This is important as the model is meant to predict all labels well, not just one.
##### Figure 6 displays the main issue that preprocessing and hyperparameter tuning will attempt to address. Due to limited memory, at most 5000 samples can be used for training. Naturally, this has caused a great deal of overfitting in the attempt to classify (200,200,3) images. The figure also shows the model overfitting within the first half of training and plateauing afterwards.
##### Overall, using a balanced dataset is necessary for generalization and later runs will attempt to address overfitting.

#### Normalized Dataset

##### Each image was normalized according to its own highest and lowest pixel values.
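
##### A minimal sketch of per-image min-max normalization is given below. It assumes the rescaling maps each image's lowest pixel to 0 and its highest to 1; the function name and the tiny example array are illustrative.

```python
import numpy as np

def minmax_normalize(image: np.ndarray) -> np.ndarray:
    """Rescale one image so its lowest pixel becomes 0 and its highest 1."""
    image = image.astype(np.float32)
    lo, hi = image.min(), image.max()
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros_like(image)
    return (image - lo) / (hi - lo)

# Tiny stand-in "image" with pixel values between 50 and 250.
example = np.array([[50, 100], [150, 250]], dtype=np.uint8)
normalized = minmax_normalize(example)
```

##### Unlike dividing by a fixed 255, this stretches every image to the full range, so two images with different brightness ranges are no longer on a common scale.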

<table>
<tr width = "2000">
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/NormalizedModel/CLValReport.PNG"/>
            <figcaption>Fig.7 Classification report for the validation normalized set</figcaption>
        </figure>
    </td>
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/NormalizedModel/AccPlot.PNG" width="400" />
            <figcaption>Fig.8 Model Accuracy per Epoch on the normalized set</figcaption>
        </figure>
    </td>
</tr>
</table>

##### In figure 7, the normalized dataset is 0.03 less accurate than the control balanced dataset. In addition, its macro f1-score is 0.05 lower than that of the non-normalized dataset. With this, it can be concluded that a non-normalized dataset should be used for the rest of the techniques. It should be noted that in figure 8 the model reaches its maximum accuracy in later epochs, meaning this dataset takes longer to overfit.
##### Despite this, the normalized dataset's overall performance indicates that it is a poor preprocessing technique for the model.

#### PCA

##### For the PCA dataset, the image arrays were first flattened into the shape of (1,120000). When running the training images through the PCA dimension reduction technique, PCA was initialized to only keep enough components to reach a minimum of 0.90 explained variance.
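
##### The component-selection rule can be sketched with a plain SVD. This is an illustration under assumptions: the data is a small synthetic matrix rather than flattened (200,200,3) images, and the function names `fit_pca` and `transform` are hypothetical.

```python
import numpy as np

def fit_pca(X, min_explained_variance=0.90):
    """Keep the fewest principal components reaching the variance target.

    Returns (mean, components, explained), where `components` has shape (k, d).
    """
    mean = X.mean(axis=0)
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    ratio = s**2 / np.sum(s**2)          # variance share of each component
    cumulative = np.cumsum(ratio)
    # First position where the running total reaches the target.
    k = int(np.searchsorted(cumulative, min_explained_variance)) + 1
    k = min(k, len(s))
    return mean, vt[:k], float(cumulative[k - 1])

def transform(X, mean, components):
    """Project rows of X onto the kept components."""
    return (X - mean) @ components.T

# Synthetic stand-in: 50 samples, 40 features, driven by 3 hidden factors.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))
X = X + 0.01 * rng.normal(size=X.shape)
mean, components, explained = fit_pca(X, 0.90)
Z = transform(X, mean, components)
```

##### With scikit-learn, passing a float between 0 and 1 as `n_components` to `PCA` requests the same kind of variance-based selection. The components must be fitted on the training images only and then reused for validation and test data.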

<table>
<tr width = "2000">
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/PCAModel/CLValReport.PNG"/>
            <figcaption>Fig.9 Classification report for the validation PCA set</figcaption>
        </figure>
    </td>
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/PCAModel/ValMatrix.PNG" />
            <figcaption>Fig.10 Classification Matrix on the validation PCA set</figcaption>
        </figure>
    </td>
</tr>
</table>

##### Based on the accuracy and macro averages in figure 9, it is clear that the PCA dataset performs poorly compared to the balanced dataset. This is despite the fact that 0.90 of the explained variance was retained in the dataset.
##### The reason becomes clearer when looking at figure 10. Regardless of how exactly PCA affected the dataset, it has resulted in the model predicting mostly label 5, the most common one. Thus, PCA should not be included in the neural network's preprocessing.

#### RFS

##### Random Forest Selection (RFS) is an ensemble method used for feature selection. It trains a number of decision trees and calculates how much each feature decreases the impurity. From all the trees, it can determine the importance of each feature. Lastly, RFS returns the features whose importance is greater than the mean importance. Out of 120,000 pixel features, RFS chose 36,407. This is roughly 30% of the original dataset's feature count.
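
##### The final thresholding step can be sketched as follows. The importance values are invented for illustration and the helper names are hypothetical; in practice the importances would come from a fitted random forest.

```python
import numpy as np

def select_above_mean(importances):
    """Indices of features whose importance exceeds the mean importance."""
    importances = np.asarray(importances, dtype=float)
    return np.flatnonzero(importances > importances.mean())

def apply_selection(X, selected):
    """Keep only the selected feature columns of X."""
    return X[:, selected]

# Invented importances for 5 features; their mean is 0.2.
importances = [0.05, 0.40, 0.10, 0.30, 0.15]
selected = select_above_mean(importances)

X = np.arange(15).reshape(3, 5)        # 3 samples, 5 features
X_reduced = apply_selection(X, selected)

# Share of pixel features kept in the report: 36,407 of 120,000 (about 0.30).
kept_fraction = 36407 / 120000
```

##### In scikit-learn, `SelectFromModel` with `threshold="mean"` wrapped around a `RandomForestClassifier` applies the same rule; whether the project used that class is an assumption.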

<table>
<tr width = "2000">
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/RFSModel/CLValReport.PNG"/>
            <figcaption>Fig.11 Classification report for the validation RFS set</figcaption>
        </figure>
    </td>
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/RFSModel/LossPlot.PNG" />
            <figcaption>Fig.12 Loss plot for validation RFS set</figcaption>
        </figure>
    </td>
</tr>
</table>

##### Compared to the balanced model, RFS achieves slightly worse results. Its accuracy is 0.03 lower and its macro f1-score is 0.02 lower. While RFS won't be used for hyperparameter tuning, for the sake of higher accuracy, it can serve as a reliable and smaller dataset.
##### According to figure 12, the model learns quickly, as evidenced by the training line. While the line does plateau by epoch 10, the model achieves near-perfect accuracy by the end. However, the validation line suggests that the model overfits early, like the other models. Thus, it is unlikely that RFS solves the overfitting issues present in the control balanced dataset.

#### SIFT 

##### SIFT is a form of feature extraction which simplifies images into histograms of commonly found image features. It does so by finding common image descriptors in the training set and using them to transform all images into histograms.

<table>
<tr width = "2000">
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/SIFTModel/CLValReport.PNG"/>
            <figcaption>Fig.13 Classification report for the validation SIFT set</figcaption>
        </figure>
    </td>
    <td width="500">
        <figure>
            <img src="../Reports/figures/NeuralNetwork/SIFTModel/ValMatrix.PNG" />
            <figcaption>Fig.14 Classification Matrix on the validation SIFT set</figcaption>
        </figure>
    </td>
</tr>
</table>

##### The accuracy and macro f1-score were far lower than those of the balanced dataset. This suggests that while SIFT did transform the data, it did not keep enough relevant information for the model to generalize successfully. This is further evidenced by the fact that its accuracy is spread evenly across classes but is low overall. Thus, we will not be using SIFT despite its fast training speed.
##### One thing to note from figure 14 is its similarity to the classification matrix on the PCA dataset. From this, it seems that drastically reducing an image's dimensions can cause the model to misclassify most images in favor of the most common classes.

#### Preprocessing Results

<table>
<tr width = "2000">
    <td>
        <figure>
            <img src="../Reports/figures/NeuralNetwork/PreProcessingResults.PNG"/>
            <figcaption>Fig.15 Results matrix for preprocessing on validation</figcaption>
        </figure>
    </td>
</tr>
</table>

#### Balanced versus Unbalanced
##### While the unbalanced dataset had greater accuracy than the balanced dataset, the balanced dataset had superior macro precision, recall, and f1-score. Balanced datasets were used from then on.
#### Normalized versus Raw data
##### The normalized dataset had worse results and also required additional memory, since it stores float64 values instead of 8-bit integers. Raw data was used from then on.
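
##### The memory cost can be estimated with a short calculation. The sketch assumes 5000 images of shape (200,200,3), unsigned 8-bit storage for raw pixels, and float64 storage for normalized pixels; the helper name `dataset_bytes` is hypothetical.

```python
import numpy as np

IMAGE_SHAPE = (200, 200, 3)
N_SAMPLES = 5000

def dataset_bytes(dtype, n_samples=N_SAMPLES, shape=IMAGE_SHAPE):
    """Bytes needed to hold n_samples images of the given shape and dtype."""
    values_per_image = int(np.prod(shape))
    return n_samples * values_per_image * np.dtype(dtype).itemsize

raw_bytes = dataset_bytes(np.uint8)      # 1 byte per pixel value
float_bytes = dataset_bytes(np.float64)  # 8 bytes per pixel value
```

##### Under these assumptions the raw images need about 0.6 GB and the float64 copy about 4.8 GB, an eightfold increase.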
#### Balanced versus RFS, PCA, SIFT
##### All metrics resulting from the preprocessing techniques were worse than the corresponding metrics for the non-preprocessed balanced dataset. Thus, no preprocessing techniques were used.
#### Verdict:
##### The non-normalized, balanced, non-preprocessed dataset has the best performance out of all iterations.

# Evaluation 

# Conclusion

# Attribution

# Bibliography

# Appendix