# Machine Learning Engineer Nanodegree
## Capstone Proposal
Sergio Noviello
September 7th, 2018

## Definition

Cervical cancer is a cause of death among women worldwide; in females it is the 14th most common cancer. Mortality rates dropped drastically with the introduction of smear tests, but in some countries it is still crucial to determine the appropriate method of treatment, which depends on the position of the cervix. The wrong treatment is ineffective and also very expensive. This project is based on a competition that was launched on Kaggle a year ago.


## Problem statement

In this project, I developed a deep learning model that classifies cervical images and predicts the probability that each image is of type 1, 2 or 3.
Such a model could help healthcare providers identify patients with a cervix of type 2 or type 3, who require further testing.


The steps required are as follows: 
* Download the train and test set from Kaggle.com
* Split the training set into train and validation set
* Convert images into numpy arrays
* Augment the data by flipping, rotating and shifting the images in the train set
* Build a deep learning model on top of a pre-trained model (ResNet50)
* Evaluate the model using the validation set
* Classify the images in the test set

## Metrics

The evaluation metric is the multi-class logarithmic loss. For each image, a set of predicted probabilities (one for every category) needs to be calculated. The formula is shown in Figure 1:

![Fig.1](notes/multi_log_loss.png)
               
where N is the number of images in the test set, M is the number of categories, log is the natural logarithm, y_ij is 1 if observation i belongs to class j and 0 otherwise, and p_ij is the predicted probability that observation i belongs to class j.
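The definition above can be sketched in NumPy as follows. This is only an illustration, not the scoring code used by Kaggle: the function name, the clipping constant `eps` and the row renormalisation are assumptions added to keep the logarithm finite.

```python
import numpy as np


def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    """Mean multi-class logarithmic loss.

    y_true: integer class indices, shape (N,)
    y_pred: predicted probabilities, shape (N, M)
    eps:    clipping constant (an assumption, chosen to avoid log(0))
    """
    y_true = np.asarray(y_true, dtype=np.int64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    # Clip so that log(0) never occurs, then renormalise each row to sum to 1.
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    y_pred = y_pred / y_pred.sum(axis=1, keepdims=True)
    n = y_pred.shape[0]
    # y_ij is 1 only for the true class, so the double sum reduces to
    # the log-probability assigned to the true class of each image.
    return -np.log(y_pred[np.arange(n), y_true]).mean()
```

As a point of reference, a model that assigns probability 1/3 to each of the three classes for every image scores ln(3), roughly 1.0986.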
 
## Exploratory Data Analysis

#### Number of images in the whole dataset

The training set consists of 8218 images, divided into 3 classes as shown in the breakdown below.
53% of the images are in type_2, 29.5% in type_3 and 17.5% in type_1.

* type_1: 251
* additional type_1: 1191
* **total type_1: 1442**


* type_2: 782
* additional type_2: 3567
* **total type_2: 4349**


* type_3: 451
* additional type_3: 1976
* **total type_3: 2427**

![Fig.2](notes/eda1.png)
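The class shares quoted above can be re-derived from the per-class totals with a few lines of Python. The dictionary and variable names are only illustrative.

```python
# Per-class totals taken from the breakdown above.
counts = {"type_1": 1442, "type_2": 4349, "type_3": 2427}

total = sum(counts.values())
# Percentage of the dataset that falls in each class, rounded to one decimal.
shares = {name: round(100.0 * n / total, 1) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {share}%")
```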

#### Image sizes in the train set
There are 8 different image sizes; the majority of images are 4128x3096 (Figure 3).

![Fig.3](notes/eda2.png)

#### Image sizes in the test set
The test set has a similar composition of image sizes to the training set (Figure 4).

![Fig.4](notes/eda3.png)


#### Offsets and angles
The relevant part of the image is not always centered, and the angle can also vary, as shown in Figure 5.

In this project I have used data augmentation to make minor alterations to the existing dataset, such as flips, translations and rotations.

![Fig.5](notes/eda4.png)


## CNN Architecture

### Benchmark model

I started with a simple model: one convolutional layer with 3x3 filters followed by a ReLU activation layer. The output is then flattened into a one-dimensional vector and passed to a fully connected layer, which assigns a probability to each label using the softmax function.

![Fig.6](notes/benchmark_model.png)
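A benchmark of this shape might look roughly like the sketch below, written against the Keras API bundled with TensorFlow. The input size, the number of filters, the optimizer and the function name are assumptions; the report does not state them.

```python
from tensorflow.keras import layers, models


def build_benchmark_model(input_shape=(224, 224, 3), num_classes=3, filters=16):
    """One 3x3 convolution + ReLU, flattened into a softmax classifier.

    input_shape and filters are illustrative values, not taken from the report.
    """
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(filters, kernel_size=(3, 3)),
        layers.Activation("relu"),
        layers.Flatten(),                                 # to a 1-D vector
        layers.Dense(num_classes, activation="softmax"),  # class probabilities
    ])
    # Categorical cross-entropy is the Keras counterpart of the log loss metric.
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```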

### Evaluate benchmark model

I then evaluated the simple initial model by making predictions against the validation set. The total loss was 11.512925148010254.

### Tuning the model 

I then gradually added enhancements to the benchmark model, calculating the total loss each time to see whether they improved the score.

The first improvement was to use a technique called 'transfer learning'. Instead of building the model from scratch I have used the weights of a pre-trained model called ResNet50 that is included in the Keras library. The model is trained on the ImageNet dataset, a database of about 1.5 million images, each containing multiple labelled objects. This technique has proved effective because the lower layers of these pre-trained models already capture many generic features, such as edge and color detectors, that are common across datasets.

The architecture of the model is quite complex and is illustrated in the image below.

![Fig.7](notes/final_model.png)
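One common way to set up transfer learning with ResNet50 in Keras is sketched below. It is not necessarily the exact architecture in the figure: freezing the base, the global average pooling, the size of the fully connected layer, the dropout rate and the function name are all assumptions.

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50


def build_transfer_model(input_shape=(224, 224, 3), num_classes=3,
                         weights="imagenet", dense_units=256, dropout_rate=0.5):
    """ResNet50 feature extractor with a small classification head.

    dense_units and dropout_rate are illustrative values (assumptions).
    """
    # include_top=False drops the original 1000-class ImageNet classifier.
    base = ResNet50(include_top=False, weights=weights,
                    input_shape=input_shape, pooling="avg")
    base.trainable = False  # assumption: keep the pre-trained weights frozen

    inputs = layers.Input(shape=input_shape)
    x = base(inputs, training=False)  # run the base in inference mode
    x = layers.Dense(dense_units, activation="relu")(x)
    x = layers.Dropout(dropout_rate)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return models.Model(inputs, outputs)
```

With `weights="imagenet"` Keras downloads the pre-trained weights on first use; passing `weights=None` builds the same architecture with random initialisation, which is handy for quick shape checks.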


#### Pre-processing and data augmentation

In this step I split the dataset into an 80% training set and a 20% validation set, because it is not good practice to evaluate a model on the data used for training.
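A random 80/20 split can be sketched in NumPy as below. The function name, the fixed seed and the use of a plain (non-stratified) shuffle are assumptions; the report does not say how the split was implemented.

```python
import numpy as np


def train_validation_split(images, labels, validation_fraction=0.2, seed=0):
    """Shuffle once, then hold out a fraction of the samples for validation."""
    images = np.asarray(images)
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)  # fixed seed for a reproducible split
    order = rng.permutation(len(images))
    n_val = int(round(len(images) * validation_fraction))
    val_idx, train_idx = order[:n_val], order[n_val:]
    return (images[train_idx], labels[train_idx]), (images[val_idx], labels[val_idx])
```

Because the classes are imbalanced, a stratified split that preserves the class proportions in both parts would be a reasonable alternative.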

It's also important to standardize the data to have zero mean and unit variance.
During training of a neural network the inputs are multiplied by the weights and added to the biases.
Because the same weights are shared across the whole image, inputs that are not on the same scale make some weights have a disproportionately large effect and others a very small one, which makes training harder.
For this reason I have used the ImageDataGenerator class from Keras with the parameters featurewise_center and featurewise_std_normalization set to True (both are False by default).

The same class also allows you to augment the data by flipping, rotating, or shifting the images horizontally or vertically.
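A possible configuration of `ImageDataGenerator` for standardization and augmentation is sketched below, using the copy of Keras bundled with TensorFlow. The specific ranges (20 degrees, 10% shifts) and the helper name `make_generators` are assumptions. Note that the two featurewise flags are off by default and that the generator has to be fitted on data before it can standardize anything.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator


def make_generators(x_train):
    """Return (train_gen, val_gen), both fitted on the training images."""
    train_gen = ImageDataGenerator(
        featurewise_center=True,             # subtract the training-set mean
        featurewise_std_normalization=True,  # divide by the training-set std
        rotation_range=20,       # random rotations of up to 20 degrees
        width_shift_range=0.1,   # horizontal shifts of up to 10% of the width
        height_shift_range=0.1,  # vertical shifts of up to 10% of the height
        horizontal_flip=True,
        vertical_flip=True,
    )
    # Validation images are standardized but never augmented.
    val_gen = ImageDataGenerator(
        featurewise_center=True,
        featurewise_std_normalization=True,
    )
    # The featurewise statistics are computed here; both generators use the
    # training set so that no validation data leaks into the statistics.
    train_gen.fit(x_train)
    val_gen.fit(x_train)
    return train_gen, val_gen
```

Batches would then come from something like `train_gen.flow(x_train, y_train, batch_size=32)`. The random transformations require SciPy to be installed, and this class is marked as deprecated in recent TensorFlow releases, so the sketch is closest to the Keras versions available when the project was written.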


#### Reduce overfitting

During the training of this model I noticed a gap between the training and validation loss, which is usually an indication of overfitting.


![Fig.8](notes/overfitting.png)

I decided to add a fully connected layer followed by a dropout layer to my model. Dropout is a regularization technique for reducing overfitting: during training, each node is kept with probability p and otherwise (with probability 1-p) temporarily left out of the network.
It prevents overfitting because no node can rely on the presence of any particular other node, so the network does not learn interdependent sets of feature weights.
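The mechanism can be illustrated with a small NumPy sketch of "inverted" dropout, the variant in which the surviving activations are rescaled during training so that nothing changes at inference time. Here `rate` is the probability of dropping a node (1-p in the notation above); the function name is illustrative, and this is a sketch of the idea rather than the Keras implementation.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training."""
    if not training or rate == 0.0:
        return activations  # dropout is a no-op at inference time
    keep_prob = 1.0 - rate
    # Each node is kept independently with probability keep_prob.
    mask = rng.random(activations.shape) < keep_prob
    # Rescale the survivors so the expected activation stays the same.
    return activations * mask / keep_prob
```

With `rate=0.5`, roughly half of the nodes are zeroed on each training step and the remaining ones are doubled.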


#### Optimizers and learning rate

When training a neural network the learning rate can be fixed or it can be reduced as the training progresses.
For this I have chosen an adaptive learning rate method called Adam, whose name is derived from adaptive moment estimation.

Adam computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradients.

RMSProp adapts the per-parameter learning rates using a moving average of the squared gradients (the second moment); Adam additionally uses a moving average of the gradients themselves (the first moment).


For the initial learning rate I started with a lower value of 1.0e-4, but I noticed that the model was converging slowly, so I raised it to 1.0e-2. For the other parameters I chose the default values:

* initial learning rate=1.0e-2
* beta_1=0.9 (the exponential decay rate for the first moment estimates)
* beta_2=0.999 (the exponential decay rate for the second-moment estimates)
* epsilon=1e-08 (a small number to prevent any division by zero)
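To make the role of these parameters concrete, a single Adam update can be written in NumPy as below. This is a sketch of the standard update rule with the values listed above as defaults, not the Keras implementation; the function name and argument layout are illustrative.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=1e-2, beta_1=0.9, beta_2=0.999,
              epsilon=1e-8):
    """Apply one Adam update; t is the 1-based step counter."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta_1 * m + (1.0 - beta_1) * grad
    v = beta_2 * v + (1.0 - beta_2) * grad ** 2
    # Bias correction: both averages start at zero and are too small early on.
    m_hat = m / (1.0 - beta_1 ** t)
    v_hat = v / (1.0 - beta_2 ** t)
    # Per-parameter step; epsilon prevents division by zero.
    param = param - lr * m_hat / (np.sqrt(v_hat) + epsilon)
    return param, m, v
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so each parameter moves by approximately the learning rate regardless of the gradient's magnitude.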


#### Batch size

Batches are used because, when training a neural network, it is not efficient to pass the entire dataset through the network at once.
Dividing the number of images in the training set by the batch size gives the number of iterations needed to complete one epoch. One epoch is complete when the entire dataset has been passed forward and backward through the network.

For my model I have chosen a batch size of 32. I have tried a higher batch size but I noticed that it was slightly overfitting the data. 
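As a worked example of the iterations-per-epoch arithmetic, the sketch below assumes that the 80% training share is taken from the 8218 images reported earlier and that the last, partially filled batch still counts as one iteration.

```python
import math


def steps_per_epoch(num_images, batch_size):
    """Number of batches needed to see every image once (last one may be partial)."""
    return math.ceil(num_images / batch_size)


num_train = int(8218 * 0.8)                        # 6574 training images
steps = steps_per_epoch(num_train, batch_size=32)  # 206 iterations per epoch
print(num_train, steps)
```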

For the number of epochs I started with 30 and used the ModelCheckpoint class from Keras to check at the end of each epoch whether the validation loss was improving. I noticed that after 5 epochs the validation loss did not improve, so I set the number of epochs to five.
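A checkpoint callback of this kind could be configured roughly as follows. The file name and the helper name are hypothetical, and `save_best_only=True` is an assumption about how the callback was used.

```python
from tensorflow.keras.callbacks import ModelCheckpoint


def make_checkpoint(path="best_model.h5"):
    """Save the model only when the validation loss improves."""
    return ModelCheckpoint(
        filepath=path,        # hypothetical output file
        monitor="val_loss",   # quantity checked at the end of each epoch
        save_best_only=True,  # keep only the best model seen so far
        verbose=1,            # log whether val_loss improved
    )


# Typical use (model and data generators are defined elsewhere):
# model.fit(train_batches, validation_data=val_batches, epochs=30,
#           callbacks=[make_checkpoint()])
```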


### Training the model 

I trained the model in the cloud on an Amazon EC2 p2.xlarge instance, which provides a GPU.
It took nearly 5 hours to complete 5 epochs.

Figure 9 shows a visualization of the train and validation loss during the process of training the model.


![Fig.9](notes/training.png)

### Conclusions

The total loss of the final model against the validation set was 1.32852661209.

I also used the model to make predictions against the test set and submitted the predictions to Kaggle.com.
Figure 10 shows a score of 1.00197, which would have placed me at position 170 on the private leaderboard.


![Fig.10](notes/lboard2.png)


### References

* [kaggle competition](https://www.kaggle.com/c/intel-mobileodt-cervical-cancer-screening)
* [cervix types classification](https://kaggle2.blob.core.windows.net/competitions/kaggle/6243/media/Cervix%20types%20clasification.pdf)
* [keras documentation](https://keras.io/)
* [log loss](https://github.com/ottogroup/kaggle/blob/master/benchmark.py)
* [split data](https://github.com/keras-team/keras/issues/5862)