# Human Age Ethnicity Detection
![image](https://www.internationalairportreview.com//wp-content/uploads/facial-recognition-3.jpg)

# Business Understanding
Image classification is one of the most common applications of deep learning. In this project, I detect human age from face images. I downloaded the dataset from https://susanqq.github.io/UTKFace/. It contains about 24,000 face images of people aged 0 to 116.

Detecting ages from face images can be used in several cases:
- Screening minors so that there won’t be any underage people buying alcohol/tobacco/etc.
- Medical purposes: a large gap between a person's actual age and how old they look may be related to a health problem
- Checking whether a user submitted a recent picture in apps by comparing their stated age with the age detected from the image
- Targeted advertising: checking which age groups buy certain products more often

I binned the ages into 5 categories: 0-20, 21-26, 27-40, 21-35, 36-50, 51+. I chose these bins so that the classes are roughly balanced.

Accuracy is the metric used to evaluate the models: I want each model to predict the correct age class for as many images as possible.
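As a small illustration of the metric, here is a minimal sketch of how accuracy can be computed from true and predicted class indices. The function name and the example labels are made up for illustration and do not come from the notebooks.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical labels for six images (class indices 0-4, one per age bin).
y_true = [0, 1, 2, 3, 4, 4]
y_pred = [0, 1, 2, 0, 4, 3]

print(f"{accuracy(y_true, y_pred):.2%}")  # 4 of the 6 predictions are correct
```

For reference, with 5 perfectly balanced classes, guessing a class uniformly at random gives about 20% accuracy.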


# Ethical Issue on Face Recognition
Face recognition is one of the most popular applications of deep learning, but it is also one of the most controversial. I discuss the ethical issues briefly here to make clear that this project will not be used for any such problematic purpose.

One of the major ethical issues is that many of these images are used without consent. This is a problem because people do not want their faces used without their knowledge. Even when asked for consent, many people feel uncomfortable having their photos used for research. The website I downloaded the dataset from states that consent was not obtained. Thus, I will not use the images for anything other than this project.

Another major ethical issue is using ethnicity in face recognition modeling. Many have wrongly used deep learning to discriminate against certain races. In this project, I have not used ethnicity as a prediction target and focused solely on age.
I used https://www.nature.com/articles/d41586-020-03187-3 as a resource. Many other articles discuss the ethical issues of face recognition with deep learning. Please search for more if interested.


# Data Process
To see the code for data processing, look at the [Data_Process.ipynb](Data_Process.ipynb) notebook.

From the website https://susanqq.github.io/UTKFace/, I downloaded the three zip folders named part1, part2, and part3. After unzipping them into a folder called ‘Human_Face_Regonition_Images’, I created another folder called images and combined all the images there.
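The age label has to be recovered from each file name. Here is a minimal sketch, assuming the files keep the naming pattern described on the UTKFace page, `[age]_[gender]_[race]_[date&time].jpg`. The helper name and the example file name are hypothetical, and the notebook may extract the age differently.

```python
from pathlib import Path


def age_from_filename(path):
    """Return the age encoded as the first underscore-separated field of the file name."""
    return int(Path(path).name.split("_")[0])


# Hypothetical file name following the UTKFace pattern.
print(age_from_filename("images/26_1_2_20170116184024662.jpg"))  # 26
```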

Before splitting the images into train, validation, and test sets, I cropped the faces from the images. I noticed that most of the images contain parts that are not a face, such as the body and background. Using the MTCNN library, I cropped the faces in all the images and saved them in a new folder called cropped images.

![2021-10-24 (2)](https://user-images.githubusercontent.com/87672665/138618238-a9286764-776c-4a68-b661-486c63bae6c3.png)
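The cropping step can be sketched roughly as follows. This is a hedged outline, not the notebook's code: it assumes the `mtcnn` package, whose `detect_faces` returns a list of dictionaries with a `box` entry in `[x, y, width, height]` form, plus `numpy` and `Pillow` for loading and saving. The function names are hypothetical. Box coordinates can fall outside the image, so they are clamped first.

```python
def clamp_box(box, image_width, image_height):
    """Convert an [x, y, width, height] box to (left, top, right, bottom) inside the image."""
    x, y, w, h = box
    left, top = max(0, x), max(0, y)
    right, bottom = min(image_width, x + w), min(image_height, y + h)
    return left, top, right, bottom


def crop_first_face(detector, image_path, output_path):
    """Save the first detected face as a cropped image.

    `detector` is assumed to be an `mtcnn.MTCNN()` instance, created once and reused.
    Returns True if a face was saved and False if no face was detected.
    """
    # Third-party imports are kept local so clamp_box works without them.
    import numpy as np
    from PIL import Image

    image = Image.open(image_path).convert("RGB")
    faces = detector.detect_faces(np.asarray(image))
    if not faces:
        return False
    box = clamp_box(faces[0]["box"], image.width, image.height)
    image.crop(box).save(output_path)
    return True


# The clamping on its own: a box that starts left of the image and runs past its right edge.
print(clamp_box([-5, 10, 50, 60], image_width=40, image_height=100))  # (0, 10, 40, 70)
```

Taking only the first detection is a simplification. Images with several faces or no detected face need their own handling.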

From the cropped images folder, I created another folder called split to randomly split the images into three different folders: train, validation, test. Now, we have processed our data for modeling.
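A random split like this can be sketched with the standard library. The 80/10/10 proportions, the seed, and the function name are assumptions for illustration; the actual proportions are in the notebook.

```python
import random


def split_files(filenames, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle file names reproducibly and split them into train/validation/test lists."""
    files = sorted(filenames)            # sort first so the result does not depend on input order
    random.Random(seed).shuffle(files)   # seeded shuffle for a repeatable split
    n_train = int(len(files) * train_frac)
    n_val = int(len(files) * val_frac)
    return {
        "train": files[:n_train],
        "validation": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],  # whatever remains goes to the test set
    }


# Hypothetical file names standing in for the cropped images.
names = [f"{i}.jpg" for i in range(100)]
parts = split_files(names)
print({name: len(files) for name, files in parts.items()})
```

Each list can then be copied into its own folder, for example with `shutil.copy`.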

![output5](https://user-images.githubusercontent.com/87672665/137989104-a2f31c28-a0d1-4a04-a967-72bc5232d937.png)

The ages are not normally distributed. From the graph above, you can see that the data contains the most images for ages 1 and 26.

![output7](https://user-images.githubusercontent.com/87672665/137989054-04ed89a4-19af-41a0-9f9c-4ff5eac4f5fe.png)

Since the ages are unevenly distributed, I binned them into 5 classes so that each class has a similar number of images.
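One way to find roughly balanced bins is to cut the ages at their quantiles. The sketch below illustrates that idea with the standard library only. It is not necessarily how the bins in this project were chosen, and the function names and the toy ages are hypothetical.

```python
import bisect
import statistics
from collections import Counter


def balanced_bin_edges(ages, n_bins=5):
    """Return n_bins - 1 integer cut points that split ages into similar-sized groups."""
    return [round(q) for q in statistics.quantiles(ages, n=n_bins)]


def age_to_class(age, edges):
    """Map an age to a class index; each cut point belongs to the lower class."""
    return bisect.bisect_left(edges, age)


# Toy data: one person of each age from 0 to 99.
ages = list(range(100))
edges = balanced_bin_edges(ages)
counts = Counter(age_to_class(age, edges) for age in ages)

print(edges)
print(sorted(counts.items()))
```

Because many people share the same age, real data will only be approximately balanced, so it is worth checking the class counts after binning.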

# Results

To see the code for the baseline model and CNN models, look at the [baseline_model_cnn_models.ipynb](baseline_model_cnn_models.ipynb) notebook.

To see the code for the pretrained models, look at the [pretrained_models.ipynb](pretrained_models.ipynb) notebook.

## Baseline Model - Dense Neural Networks

For our baseline model, we use a dense neural network with 5 layers, including the input and output layers.

Let's look at the summary of the baseline model.

    from keras import models

    baseline_model = models.load_model('models/baseline_model.h5')
    baseline_model.summary()

    Model: "sequential_1"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    dense_5 (Dense)              (None, 128)               25165952
    _________________________________________________________________
    dense_6 (Dense)              (None, 64)                8256
    _________________________________________________________________
    dense_7 (Dense)              (None, 32)                2080
    _________________________________________________________________
    dense_8 (Dense)              (None, 16)                528
    _________________________________________________________________
    dense_9 (Dense)              (None, 5)                 85
    Total params: 25,176,901
    Trainable params: 25,176,901
    Non-trainable params: 0
    _________________________________________________________________


The baseline model reached ~33% accuracy on the training data and ~21% on the validation data. (You can check out the process in the [baseline_model_cnn_models.ipynb](baseline_model_cnn_models.ipynb) notebook.)
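The parameter counts in the baseline summary can be checked by hand: a Dense layer has one weight per input-unit pair plus one bias per unit. The short sketch below redoes that arithmetic. The flattened input size of 256 x 256 x 3 is an inference from the first layer's parameter count, not something stated in the summary.

```python
def dense_params(inputs, units):
    """Weights (inputs * units) plus one bias per unit."""
    return (inputs + 1) * units


# Assumption: the images are flattened from 256 x 256 RGB, which matches 25,165,952.
input_size = 256 * 256 * 3
layer_units = [128, 64, 32, 16, 5]

counts = []
previous = input_size
for units in layer_units:
    counts.append(dense_params(previous, units))
    previous = units

print(counts)       # [25165952, 8256, 2080, 528, 85]
print(sum(counts))  # 25176901
```

Almost all of the parameters sit in the first layer, because every pixel value is connected to every one of its 128 units.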

## CNN Models

To increase the accuracy of the model, we are going to add convolutional layers.

In the first CNN model, we added 4 convolutional layers and 4 dense layers (including input and output layers).

    cnn_model1 = models.load_model('models/cnn_model1.h5')
    cnn_model1.summary()

    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    conv2d (Conv2D)              (None, 254, 254, 128)     3584
    _________________________________________________________________
    max_pooling2d (MaxPooling2D) (None, 127, 127, 128)     0
    _________________________________________________________________
    conv2d_1 (Conv2D)            (None, 125, 125, 64)      73792
    _________________________________________________________________
    max_pooling2d_1 (MaxPooling2 (None, 62, 62, 64)        0
    _________________________________________________________________
    conv2d_2 (Conv2D)            (None, 60, 60, 512)       295424
    _________________________________________________________________
    max_pooling2d_2 (MaxPooling2 (None, 30, 30, 512)       0
    _________________________________________________________________
    conv2d_3 (Conv2D)            (None, 28, 28, 128)       5

The first CNN model's accuracy was ~55% on training and ~52% on validation, a 31 percentage point increase in validation accuracy over our baseline model. (You can check out the process in the [baseline_model_cnn_models.ipynb](baseline_model_cnn_models.ipynb) notebook.)
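The shapes and parameter counts in the first CNN summary follow from simple formulas. The sketch below reproduces the first three convolution and pooling pairs, assuming 3x3 kernels with no padding, stride 1, and 2x2 max pooling on a 256 x 256 RGB input. These settings are inferred from the summary, not read from the notebook.

```python
def conv2d_params(kernel, in_channels, filters):
    """One kernel x kernel x in_channels weight block plus one bias per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def conv2d_output(size, kernel):
    """Output width/height with no padding and stride 1."""
    return size - kernel + 1


def max_pool_output(size, pool=2):
    """Output width/height after non-overlapping pooling (rounded down)."""
    return size // pool


size, channels = 256, 3  # assumed input: 256 x 256 RGB
for filters in [128, 64, 512]:
    params = conv2d_params(3, channels, filters)
    size = conv2d_output(size, 3)
    print(f"conv: {size}x{size}x{filters}, params={params}")
    size = max_pool_output(size)
    print(f"pool: {size}x{size}x{filters}")
    channels = filters
```

The printed shapes and counts match the rows of the summary, for example 254 x 254 x 128 with 3,584 parameters for the first convolution.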

Let's try adding more layers, adding dropout layers, and changing the number of neurons in some layers.

    cnn_model2 = models.load_model('models/cnn_model2.h5')
    cnn_model2.summary()

    Model: "sequential_1"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    conv2d_4 (Conv2D)            (None, 254, 254, 32)      896
    _________________________________________________________________
    max_pooling2d_4 (MaxPooling2 (None, 127, 127, 32)      0
    _________________________________________________________________
    conv2d_5 (Conv2D)            (None, 125, 125, 64)      18496
    _________________________________________________________________
    max_pooling2d_5 (MaxPooling2 (None, 62, 62, 64)        0
    _________________________________________________________________
    conv2d_6 (Conv2D)            (None, 60, 60, 256)       147712
    _________________________________________________________________
    max_pooling2d_6 (MaxPooling2 (None, 30, 30, 256)       0
    _________________________________________________________________
    dropout (Dropout)            (None, 30, 30, 256)

The second CNN model's accuracy was ~54% on training and ~53% on validation, a 1 percentage point increase in validation accuracy over the previous model. (You can check out the process in the [baseline_model_cnn_models.ipynb](baseline_model_cnn_models.ipynb) notebook.)

Let's see if adding BatchNormalization() and regularization will increase the accuracy score.

    cnn_model3 = models.load_model('models/cnn_model3.h5')
    cnn_model3.summary()

    Model: "sequential_5"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    conv2d_14 (Conv2D)           (None, 254, 254, 32)      896
    _________________________________________________________________
    max_pooling2d_14 (MaxPooling (None, 127, 127, 32)      0
    _________________________________________________________________
    conv2d_15 (Conv2D)           (None, 125, 125, 64)      18496
    _________________________________________________________________
    batch_normalization_4 (Batch (None, 125, 125, 64)      256
    _________________________________________________________________
    max_pooling2d_15 (MaxPooling (None, 62, 62, 64)        0
    _________________________________________________________________
    conv2d_16 (Conv2D)           (None, 60, 60, 256)       147712
    _________________________________________________________________
    max_pooling2d_16 (MaxPooling (None, 30, 30, 256)

The third CNN model's accuracy was ~48% on training and ~47% on validation, worse than the previous two CNN models, though still well above the baseline model. (You can check out the process in the [baseline_model_cnn_models.ipynb](baseline_model_cnn_models.ipynb) notebook.)
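As a side note on the summary for this model, the 256 parameters of the batch normalization layer come from four values per channel: a learned scale and shift, plus a running mean and variance that are tracked but not trained. A tiny sketch of that count, with a hypothetical helper name:

```python
def batch_norm_params(channels):
    """Return (trainable, non_trainable) parameter counts for a BatchNormalization layer."""
    trainable = 2 * channels      # gamma (scale) and beta (shift)
    non_trainable = 2 * channels  # moving mean and moving variance
    return trainable, non_trainable


trainable, non_trainable = batch_norm_params(64)
print(trainable, non_trainable, trainable + non_trainable)  # 128 128 256
```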

## Pretrained Models

Now, we are going to use the pretrained VGG16 model to increase our accuracy.

For the first model, we use a pretrained VGG16 model followed by just one more Dense layer with 512 neurons.

    pretrained_model1 = models.load_model('models/pretrained_model1.h5')
    pretrained_model1.summary()

    Model: "sequential_1"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    vgg16 (Functional)           (None, 8, 8, 512)         14714688
    _________________________________________________________________
    flatten_1 (Flatten)          (None, 32768)             0
    _________________________________________________________________
    dense_2 (Dense)              (None, 512)               16777728
    _________________________________________________________________
    dense_3 (Dense)              (None, 5)                 2565
    Total params: 31,494,981
    Trainable params: 16,780,293
    Non-trainable params: 14,714,688
    _________________________________________________________________


The first pretrained model's accuracy was ~62% on training and ~56% on validation, an increase of 35 percentage points in validation accuracy over the baseline model. (You can check out the process in the [pretrained_models.ipynb](pretrained_models.ipynb) notebook.)
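A model with this shape can be sketched in Keras roughly as follows. This is an approximation based on the summary, not the notebook's code: the activations, the ImageNet weights, and the 256 x 256 x 3 input are assumptions, and the function names are hypothetical. The second helper only redoes the arithmetic behind the trainable parameter count.

```python
def build_vgg16_classifier(num_classes=5, input_shape=(256, 256, 3), weights="imagenet"):
    """Frozen VGG16 base followed by Dense(512) and a softmax output layer."""
    # Imported here so the arithmetic helper below works without Keras installed.
    from keras import layers, models
    from keras.applications import VGG16

    base = VGG16(weights=weights, include_top=False, input_shape=input_shape)
    base.trainable = False  # keep the pretrained convolutional weights fixed

    return models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(512, activation="relu"),             # activation is an assumption
        layers.Dense(num_classes, activation="softmax"),  # one output per age class
    ])


def head_params(feature_shape=(8, 8, 512), hidden=512, num_classes=5):
    """Trainable parameters in the two Dense layers on top of the frozen base."""
    flat = feature_shape[0] * feature_shape[1] * feature_shape[2]
    return (flat + 1) * hidden + (hidden + 1) * num_classes


print(head_params())  # 16780293
```

Freezing the base is what makes the 14,714,688 VGG16 parameters show up as non-trainable in the summary, while only the two Dense layers are trained.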

Let's see if adding more dense layers will increase the accuracy.

    pretrained_model2 = models.load_model('models/pretrained_model2.h5')
    pretrained_model2.summary()

    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    vgg16 (Functional)           (None, 8, 8, 512)         14714688
    _________________________________________________________________
    flatten (Flatten)            (None, 32768)             0
    _________________________________________________________________
    dense (Dense)                (None, 512)               16777728
    _________________________________________________________________
    dense_1 (Dense)              (None, 64)                32832
    _________________________________________________________________
    dense_2 (Dense)              (None, 128)               8320
    _________________________________________________________________
    dense_3 (Dense)              (None, 256)               33024
    _________________________________________________________________
    dense_4 (Dense)              (None, 128)               3

The second pretrained model's accuracy was ~62% on training and ~59% on validation, an increase of 3 percentage points in validation accuracy over the previous model. (You can check out the process in the [pretrained_models.ipynb](pretrained_models.ipynb) notebook.)

Let's see if adding convolutional layers on top of VGG16 will increase the accuracy.

    pretrained_model3 = models.load_model('models/pretrained_model3.h5')
    pretrained_model3.summary()

    Model: "sequential"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    vgg16 (Functional)           (None, 8, 8, 512)         14714688
    _________________________________________________________________
    conv2d (Conv2D)              (None, 8, 8, 64)          294976
    _________________________________________________________________
    max_pooling2d (MaxPooling2D) (None, 4, 4, 64)          0
    _________________________________________________________________
    conv2d_1 (Conv2D)            (None, 4, 4, 64)          36928
    _________________________________________________________________
    max_pooling2d_1 (MaxPooling2 (None, 2, 2, 64)          0
    _________________________________________________________________
    flatten (Flatten)            (None, 256)               0
    _________________________________________________________________
    dense (Dense)                (None, 512)               1

The third pretrained model's accuracy was ~61% on training and ~58% on validation, similar to the validation accuracy of the previous model. (You can check out the process in the [pretrained_models.ipynb](pretrained_models.ipynb) notebook.)

Let's see if adding BatchNormalization and regularization will increase the accuracy.

    pretrained_model4 = models.load_model('models/pretrained_model4.h5')
    pretrained_model4.summary()

    Model: "sequential_4"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    vgg16 (Functional)           (None, 8, 8, 512)         14714688
    _________________________________________________________________
    conv2d_2 (Conv2D)            (None, 8, 8, 64)          294976
    _________________________________________________________________
    max_pooling2d_2 (MaxPooling2 (None, 4, 4, 64)          0
    _________________________________________________________________
    conv2d_3 (Conv2D)            (None, 4, 4, 64)          36928
    _________________________________________________________________
    max_pooling2d_3 (MaxPooling2 (None, 2, 2, 64)          0
    _________________________________________________________________
    batch_normalization (BatchNo (None, 2, 2, 64)          256
    _________________________________________________________________
    activation (Activation)      (None, 2, 2, 64)

The fourth pretrained model's accuracy was ~52% on training and ~50% on validation. It seems that adding BatchNormalization and regularization decreased the accuracy. (You can check out the process in the [pretrained_models.ipynb](pretrained_models.ipynb) notebook.)

# Final Model