# Machine Learning Engineer Nanodegree
## Capstone Project
Sergio Noviello
September 27th, 2018

## I. Definition


### Project Overview
Cervical cancer is a cause of death in women worldwide, and it is the 14th most common cancer in females. Mortality rates fell drastically with the introduction of the smear test. However, doctors still need to review cervical images visually in order to classify the cervix type correctly. An incorrect classification leads to a treatment that is both ineffective and very expensive.

This project is based on a competition that was launched on Kaggle a year ago. 


### Problem Statement
The goal of this project is to develop a deep learning model capable of classifying cervical images by predicting, for each image, the probability that it is of type 1, 2 or 3.
Such a model would help healthcare providers identify patients with a cervix of type 2 or type 3, who require further testing.

### Metrics

I have used the multi-class logarithmic loss as the evaluation metric. The reason behind this choice is that log loss increases as the predicted probability diverges from the actual label. Accuracy only counts how many predictions are equal to the actual values, whereas log loss also takes into account how far each predicted probability is from the actual value.

For each image, a set of predicted probabilities (one for every class) needs to be calculated. The formula is shown in Figure 1:

**Fig.1**
![Fig.1](notes/multi_log_loss.png)
               
$N$ is the number of images in the test set, $M$ is the number of categories, $\log$ is the natural logarithm, $y_{ij}$ is 1 if observation $i$ belongs to class $j$ and 0 otherwise, and $p_{ij}$ is the predicted probability that observation $i$ belongs to class $j$.
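
As a minimal sketch of this metric, assuming the standard definition of multi-class log loss, $-\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\log(p_{ij})$, the calculation can be written in plain Python. The function name `multiclass_log_loss` and the clipping constant `eps` are illustrative choices and are not taken from the project code.

```python
import math

def multiclass_log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood over N samples and M classes.

    y_true: list of one-hot rows (y_ij); y_pred: list of probability rows (p_ij).
    eps is an assumed clipping constant that avoids log(0).
    """
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        for y_ij, p_ij in zip(true_row, pred_row):
            p_ij = min(max(p_ij, eps), 1.0 - eps)
            total += y_ij * math.log(p_ij)
    return -total / len(y_true)

# A confident correct prediction is penalised less than a confident wrong one.
good = multiclass_log_loss([[0, 1, 0]], [[0.1, 0.8, 0.1]])
bad = multiclass_log_loss([[0, 1, 0]], [[0.8, 0.1, 0.1]])
```

Here `bad` is larger than `good` even though both rows would be scored the same way by a metric that ignores the predicted probabilities.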

## II. Analysis

### Data Exploration
#### number of images in the whole dataset

The training set consists of 8218 images, divided into 3 classes as shown below.
53% of the images are in type_2, 29.5% in type_3 and 17.5% in type_1 (Figure 2).

|Type|Base|Additional|Total|
|----|----|----------|-----|
|type 1|251|1191|**1442**|
|type 2|782|3567|**4349**|
|type 3|451|1976|**2427**|
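
The class shares can be checked with a few lines of Python. This is only a sketch of the arithmetic: the counts are the base and additional figures from the table, and the variable names are illustrative.

```python
# Image counts per class, taken from the Base and Additional columns of the table.
base = {"type 1": 251, "type 2": 782, "type 3": 451}
additional = {"type 1": 1191, "type 2": 3567, "type 3": 1976}

# Total per class and over the whole training set.
totals = {name: base[name] + additional[name] for name in base}
n_images = sum(totals.values())

# Share of each class as a percentage, rounded to one decimal place.
shares = {name: round(100.0 * count / n_images, 1) for name, count in totals.items()}
```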

**Fig.2**
![Fig.2](notes/eda1.png)

### Exploratory Visualization

#### image sizes in the train set
There are 8 different image sizes; the majority of the images are 4128x3096 (Figure 3).

**Fig.3**
![Fig.3](notes/eda2.png)

#### image sizes in the test set
The test dataset has a similar composition in terms of image sizes compared to the training set (Figure 4)

**Fig.4**
![Fig.4](notes/eda3.png)


#### Offsets and angles
The relevant part of the image is not always centered, and the angle can also vary, as shown in Figure 5.

In this project I have used data augmentation to make minor alterations to the existing dataset, such as flips, translations and rotations.

**Fig.5**
![Fig.5](notes/eda4.png)

### Algorithms and Techniques

For this project I have built a neural network using a library called Keras. 

Instead of building the model from scratch, in this project I have decided to use a technique called _transfer learning_, which consists of reusing the weights of a pre-trained model as the starting point for another model.

The pre-trained model I have chosen is **ResNet50**.

ResNet50 has proved to perform very well in many classification competitions. The problem it is trying to solve is that very deep networks can suffer from degrading performance. This problem is known as the vanishing gradient problem.

During back propagation the gradients of the loss with respect to the weights are calculated, but the gradients tend to get smaller and smaller as we keep moving backward through the network. This means that the neurons in the earlier layers learn very slowly compared to the neurons in the later layers. The early layers are important because they represent the building blocks and they can really affect the accuracy of the next layers.

ResNet is based on the idea of residual learning. In other networks the output of a layer is simply passed to the next layer; ResNet additionally adds the input of a block of layers to the output of that block through a shortcut connection.

This is denoted by H(x) = F(x) + x, where F(x) represents the stacked non-linear layers and x represents the identity function (input = output).
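
The relation H(x) = F(x) + x can be illustrated with a small sketch in plain Python. It is deliberately simplified: real ResNet blocks use convolutions and batch normalization, whereas here F is assumed to be two small fully connected layers, and the helper names `relu`, `dense` and `residual_block` are hypothetical.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]

def dense(values, weights):
    """Multiply a vector by a square weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]

def residual_block(x, w1, w2):
    """Return H(x) = F(x) + x, where F is two stacked layers."""
    f_x = dense(relu(dense(x, w1)), w2)       # F(x): the stacked non-linear layers
    return [f + xi for f, xi in zip(f_x, x)]  # shortcut connection adds the input back

# With all-zero weights F(x) = 0, so the block reduces to the identity mapping.
zeros = [[0.0, 0.0], [0.0, 0.0]]
out = residual_block([1.5, -2.0], zeros, zeros)
```

Because the input is added back, a block whose layers contribute nothing still passes its input through unchanged, which is what lets very deep stacks of such blocks remain trainable.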

The ResNet50 weights used here were trained on the ImageNet dataset. ImageNet is a database of about 1.5 million images, each containing multiple labelled objects. Transfer learning has proved to be effective because the lower layers of these pre-trained models already capture many generic features, such as edge and color detectors, that are common to different datasets.

A visualization of this model is shown in the Appendix section below. 


### Benchmark

A simple model was created as a benchmark. I have used one convolutional layer with 3x3 filters followed by a ReLU activation layer. Finally, I have flattened the output into a one-dimensional vector and added a fully connected layer, which assigns a probability to each class using the softmax function.

```python
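from keras.models import Sequential
from keras.layers import Activation, Conv2D, Dense, Flatten

def build_model():
    model = Sequential()
    # Single convolutional layer: 32 filters of size 3x3
    model.add(Conv2D(32, (3, 3), input_shape=(IMAGE_SIZE, IMAGE_SIZE, CHANNELS)))
    model.add(Activation('relu'))
    # Flatten the feature maps into a one-dimensional vector
    model.add(Flatten())
    # One output per cervix type, converted to probabilities by softmax
    model.add(Dense(3, activation='softmax'))
    model.summary()
    return model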
```

**Fig.6**
![Fig.6](notes/benchmark_model.png)

## III. Methodology

### Data Preprocessing


In this step I have split the dataset into a training set (80%) and a validation set (20%). The reason for this is that it is considered good practice to evaluate a model on data that is different from the data used in the training process.

The images were resized to 200x200 using the library **opencv** and converted to numpy arrays. Given the size of these multi-dimensional arrays, it is not efficient to keep them all in memory, therefore I have used Python generators.
Generators are functions that behave like iterators, that is, objects that can be iterated upon. Their benefit is lazy evaluation: they do not compute the value of each item when instantiated, but only when you ask for it.
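
The lazy behaviour of generators can be seen in a short, self-contained sketch. It is not the project's image generator: `batch_generator` is a hypothetical helper that groups any sequence into batches, with plain integers standing in for images.

```python
def batch_generator(items, batch_size):
    """Yield successive batches lazily instead of building them all up front."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch

batches = batch_generator(range(10), batch_size=4)  # nothing is computed yet
first = next(batches)                               # work happens only on request
rest = list(batches)                                # consume the remaining batches
```

Only one batch is held in memory at a time, which is the property that matters when each item is a large image array.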

Another important aspect when training a neural network is to standardize the data to have zero mean and unit variance.
During the training of a neural network the initial inputs are multiplied by the weights and added to the biases.
Some parameters are shared across the image in a neural network, so if the inputs are not scaled to the same range, a weight that suits one part of the image will be too big or too small for another part.

For this reason I have used the ImageDataGenerator class from Keras, which provides the parameters featurewise_center and featurewise_std_normalization for this purpose (both are False by default and have to be enabled explicitly), as well as a rescale parameter for scaling the pixel values.
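
To make "zero mean and unit variance" concrete, here is a minimal sketch in plain Python. It only illustrates the arithmetic and is not how ImageDataGenerator is implemented; the function name `standardize` and the pixel values are made up for the example.

```python
import math

def standardize(values):
    """Shift values to zero mean and scale them to unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]

pixels = [0.0, 64.0, 128.0, 255.0]  # made-up intensities for illustration
scaled = standardize(pixels)

# Mean and variance of the standardized values.
scaled_mean = sum(scaled) / len(scaled)
scaled_variance = sum(v * v for v in scaled) / len(scaled)
```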

The same class also allows you to augment the data by flipping, rotating or shifting the images horizontally or vertically.

The parameters I have used are as follows:

|Parameter|Value|
|---------|-----|
|rotation_range|25|
|width_shift_range|0.1|
|height_shift_range|0.1|
|shear_range|0.2|


### Implementation

The implementation of this project can be divided into the following tasks:

1. Download the train and test set from Kaggle.com
2. Split the training set into train and validation set
3. Convert images into multi-dimensional arrays
4. Augment data by flipping, rotating and changing the offset of the images in the training set
5. Build a deep learning model using a pre-trained model (ResNet50) 
6. Train the model
7. Evaluate the model using the validation set
8. Classify the images in the test set

**1. Download the train and test set from Kaggle.com**

For this task I have downloaded the datasets from the website Kaggle.com (links are in the README file).
The training dataset consisted of two parts: base and additional. The base dataset contains only 1484 images, which is not ideal for an image classification problem. I have decided to add the additional data using a simple Python function:

```python
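import glob
import os
import shutil

def move_files(src_dir, dest_dir):
    # Copy every file in src_dir that is not already present in dest_dir
    for file in glob.glob(os.path.join(src_dir, '*')):
        target = os.path.join(dest_dir, os.path.basename(file))
        if not os.path.exists(target):
            shutil.copy(file, target)


move_files('data/additional_Type_1_v2', 'data/train/Type_1')
move_files('data/additional_Type_2_v2', 'data/train/Type_2')
move_files('data/additional_Type_3_v2', 'data/train/Type_3')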
```

**2. Split the training set into train and validation set**

For this task I have decided to split the training and validation sets into separate folders because the Keras library relies on a specific directory structure. I have used a Python function that I found on GitHub (link in the reference section below).
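
The function actually used is the one linked in the references. Purely as an illustration of the idea, a reproducible 80/20 split of file names could be sketched as follows; `split_file_names`, its default seed and the file names are all hypothetical, and copying the files into per-class folders is left out.

```python
import random

def split_file_names(file_names, validation_fraction=0.2, seed=42):
    """Shuffle file names reproducibly and split them into train/validation lists."""
    shuffled = sorted(file_names)          # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]

names = ["img_{}.jpg".format(i) for i in range(10)]  # hypothetical file names
train_names, validation_names = split_file_names(names)
```

Applying such a split separately to each class folder keeps the class proportions similar in both sets.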

**3. Convert images into multi-dimensional arrays**

As shown in the exploratory data analysis, the images had different dimensions. I have decided to resize them to 200x200.
Resizing an image means changing its width and height in terms of pixels.
I have used opencv, a library with Python bindings that allows you to read images from file, resize them and convert them into multi-dimensional arrays.

```python
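# Excerpt from the body of the batch generator; X, Y and cnt are
# initialised earlier in the generator (not shown).
image = cv2.imread(file_path)                        # load the image as a NumPy array
image = cv2.resize(image, (IMAGE_SIZE, IMAGE_SIZE))  # resize to a fixed square size
image = image.astype(np.float32) / 255.0             # scale pixel values to [0, 1]
X[cnt, :, :, :] = image
class_index = labels.index(image_class)
Y[cnt, class_index] = 1                              # one-hot encode the label
cnt += 1
if cnt == batch_size:
    yield (X, Y)                                     # emit a full batch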
```

**4. Augment data by flipping, rotating and changing the offset of the images in the training set**

For this step I used the ImageDataGenerator class from Keras. It allows you to specify parameters such as the rotation angle, the horizontal and vertical shift, the zoom level and many others. The complete list of parameters is available in the Keras documentation (link in the reference section below).


```python
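from keras.preprocessing.image import ImageDataGenerator

def create_data_generators(validation_set = False):
    # Rescale pixel values to [0, 1] and apply random augmentations
    data_generator = ImageDataGenerator(rescale=1./255.,
                                        rotation_range=25,
                                        width_shift_range=0.1,
                                        height_shift_range=0.1,
                                        shear_range=0.2,
                                        horizontal_flip=True,
                                        fill_mode="nearest")

    # Read images from one sub-folder per class and yield one-hot encoded batches
    return data_generator.flow_from_directory(TR_DIR,
                                              target_size=(IMAGE_SIZE, IMAGE_SIZE),
                                              shuffle=True,
                                              seed=SEED,
                                              class_mode='categorical',
                                              batch_size=BATCH_SIZE)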
```

**5. Build a deep learning model using a pre-trained model (ResNet50)**

In this step I have created a method that builds the ResNet model using the class provided by Keras. The output of this model is then passed to a second set of layers that performs the classification into the three cervix types.

```python
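from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, Dropout, Flatten
from keras.models import Model

def build_model():
    # ResNet50 without its ImageNet classification head, using pre-trained weights
    base_model = ResNet50(include_top=False,
                          weights='imagenet',
                          input_tensor=None,
                          input_shape=(IMAGE_SIZE, IMAGE_SIZE, CHANNELS))

    # New classification head for the three cervix types
    x = Flatten()(base_model.output)
    x = Dense(32, activation='relu')(x)
    x = Dropout(0.5)(x)
    output = Dense(3, activation='softmax')(x)
    model = Model(inputs=base_model.input, outputs=output)

    model.summary()
    return model
```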

**6. Train the model**

This step consisted of training the model using the method **fit_generator** from Keras. In order to monitor the performance during training I have used the class **ModelCheckpoint**, which checks the monitored metric (here the validation loss) at the end of each epoch and saves the model if it has improved.

```python
from keras.callbacks import ModelCheckpoint

def train_model(model, train_generator, validation_generator, step_train, step_valid):
    # Save the model whenever the validation loss improves
    checkpoint = ModelCheckpoint(SAVED_MODEL,
                                 monitor='val_loss',
                                 verbose=1,
                                 save_best_only=True,
                                 mode='auto')

    history = model.fit_generator(
            train_generator,
            steps_per_epoch=step_train,
            epochs=NUM_EPOCH,
            callbacks=[checkpoint],
            validation_data=validation_generator,
            validation_steps=step_valid)

    # Saves the final-epoch model to the same path as the checkpoint
    model.save(SAVED_MODEL)
    return history
```
    
**7. Evaluate the model using the validation set**

In this step I have iterated through the images in the validation set and calculated the log loss for each batch.

```python
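# Predicted class probabilities for one batch of validation images
Y_pred = model.predict(X)
# Multi-class log loss of the batch (np.int was removed from recent NumPy, so use int)
loss = logloss_mc(Y_true.astype(int), Y_pred)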
```

**8. Classify the images in the test set**

The last task was to use the trained model to make predictions on the test set and store the results in a pandas dataframe. The dataframe contains the name of each image and the probability for each class (type1, type2, type3).

```python
y_pred = model.predict(X_test)
```
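
As a sketch of how such a table of predictions could be written out, the example below uses the standard `csv` module rather than pandas so that it has no dependencies. The column names, the file names and the probabilities are assumptions made for illustration, and `write_predictions` is a hypothetical helper, not part of the project code.

```python
import csv
import io

def write_predictions(image_names, probabilities, out_file):
    """Write one row per image: its name followed by the three class probabilities."""
    writer = csv.writer(out_file)
    writer.writerow(["image_name", "Type_1", "Type_2", "Type_3"])  # assumed header
    for name, probs in zip(image_names, probabilities):
        writer.writerow([name] + ["{:.6f}".format(p) for p in probs])

buffer = io.StringIO()  # stands in for a file opened for writing
write_predictions(["0.jpg", "1.jpg"],                  # made-up image names
                  [[0.2, 0.5, 0.3], [0.1, 0.7, 0.2]],  # made-up probabilities
                  buffer)
rows = buffer.getvalue().splitlines()
```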


### Refinement

During the training of this model I have noticed a gap between train and validation loss which is usually an indication of overfitting (Figure 8). 

**Fig.8**
![Fig.8](notes/overfitting.png)

I have decided to add to my model a fully connected layer followed by a dropout layer. Dropout is a regularization technique for reducing overfitting.

During training each node is kept with probability p and left out of the network with probability 1-p.
This helps prevent overfitting because the network cannot rely on any single node being present, which discourages it from learning interdependent sets of feature weights.
Keras makes it easy to add dropout layers; the argument of the Dropout layer is the fraction of nodes to drop:

```python
x = Dropout(0.5)(x)
```
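
What a dropout layer does during training can be sketched in plain Python. This assumes the usual "inverted dropout" formulation, in which the surviving activations are rescaled so that their expected value is unchanged; the function name `dropout` and the seeded random generator are illustrative.

```python
import random

def dropout(activations, rate, rng):
    """Zero each value with probability `rate` and rescale the values that survive."""
    keep_prob = 1.0 - rate
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

rng = random.Random(0)  # fixed seed so the sketch is reproducible
dropped = dropout([1.0] * 1000, rate=0.5, rng=rng)

# Fraction of nodes that were kept; close to 1 - rate for a large layer.
kept_fraction = sum(1 for d in dropped if d != 0.0) / len(dropped)
```

At prediction time no nodes are dropped, so the full network is used.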

#### Optimizers and learning rate

When training a neural network the learning rate can be fixed or it can be reduced as the training progresses.
For this I have chosen an adaptive learning rate method called **Adam**. The name is derived from _adaptive moment estimation_.

Adam computes individual adaptive learning rates for different parameters from estimates of the first and second moments of the gradient.

Whereas RMSProp adapts the parameter learning rates using only a running average of the squared gradients (the second moment), Adam also makes use of a running average of the gradients themselves (the first moment).

For the learning rate I initially selected a value of **1.0e-4**, but I noticed that the model was converging slowly, so I decided to increase it to **1.0e-2**.

For the other parameters I have used the default values:

|Parameter|Value|Description|
|---------|-----|-----------|
|beta_1|0.9|the exponential decay rate for the first moment estimates|
|beta_2|0.999|the exponential decay rate for the second-moment estimates|
|epsilon|1e-08|a small number to prevent any division by zero|


```python
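from keras import optimizers

def compile_model(model):
    opt4 = optimizers.Adam(lr=LEARN_RATE,
                           beta_1=0.9,
                           beta_2=0.999,
                           epsilon=1e-08,
                           decay=0.0)

    # Categorical cross-entropy corresponds to the multi-class log loss metric
    model.compile(optimizer=opt4, loss='categorical_crossentropy', metrics=['accuracy'])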
```
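
To show how the first and second moment estimates enter the update, here is a minimal sketch of a single Adam step for one scalar parameter, following the standard published formulation. It is not Keras code: `adam_step` is a hypothetical helper, and the toy objective f(w) = w^2 is chosen only because its gradient, 2w, is easy to write down.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-2, beta_1=0.9, beta_2=0.999, epsilon=1e-8):
    """One Adam update for a single scalar parameter; t is the 1-based step count."""
    m = beta_1 * m + (1.0 - beta_1) * grad         # first moment: running mean of gradients
    v = beta_2 * v + (1.0 - beta_2) * grad * grad  # second moment: running mean of squares
    m_hat = m / (1.0 - beta_1 ** t)                # bias correction for the zero start
    v_hat = v / (1.0 - beta_2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + epsilon)
    return param, m, v

# Minimise f(w) = w^2, whose gradient is 2w, starting from w = 1.0.
w, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, 2.0 * w, m, v, t)
```

Dividing by the square root of the second moment means the size of each step depends on the learning rate rather than on the raw gradient magnitude, which is why the choice between 1.0e-4 and 1.0e-2 has such a visible effect on convergence speed.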


#### Batch size

Batches are used because, during the training of a neural network, it is not efficient to pass the entire dataset at once.
Dividing the number of images in the training set by the batch size gives the number of iterations needed to complete one epoch. One epoch is complete when the entire dataset has been passed forward and backward through the network.
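
The number of iterations per epoch can be worked out as below. This is a small sketch of the arithmetic only: `steps_per_epoch` is a hypothetical helper, the image count is an assumed value of roughly 80% of the 8218 training images, and the result is rounded up so that the last, smaller batch is included.

```python
import math

def steps_per_epoch(n_images, batch_size):
    """Number of batches needed to pass over every image once."""
    return math.ceil(n_images / batch_size)

# Hypothetical count: roughly 80% of the 8218 training images.
n_train = 6574
steps = steps_per_epoch(n_train, batch_size=32)
```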

For my model I have chosen a batch size of 32. I tried a higher batch size but I noticed that the model was slightly overfitting the data.

For the number of epochs I started with 30 and I used the class ModelCheckpoint from Keras to check at the end of each epoch whether the validation loss was improving. I noticed that after 5 epochs the validation loss did not improve, therefore I have decided to set the number of epochs to five.


## IV. Results


### Model Evaluation and Validation

In order to validate the robustness of the model I have used it to make predictions against the test set and submitted the predictions to Kaggle.com. Predictions were also calculated on another test set, published during the second stage of the competition, to make sure the model generalizes well to unseen data.

Figure 10 shows a score of **1.00197**, which would have resulted in position 170 on the private leaderboard.

**Fig.10**
![Fig.10](notes/lboard2.png)


### Justification

The comparison between the benchmark and the final model in terms of log loss shows a significant improvement.
The benchmark model was also less robust. I noticed different results with different splits of the training and validation set, which shows that the benchmark model was overfitting the training data rather than generalizing.

|Model|log_loss|
|-----|--------|
|benchmark|11.51|
|final|1.33|


## V. Conclusion


### Free-Form Visualization


Figure 11 shows an actual type 3 cervix and figure 12 shows an image classified as type 3 by the model

**Figure. 11 (actual type 3 cervix)** 
![Fig.11](notes/1259.jpg)


**Figure. 12 (predicted type 3 cervix)**
![Fig.12](notes/10006.jpg)


### Reflection

The creation of a benchmark model was pretty straightforward. The most challenging part was to optimize the model and make it more robust. In deep learning there isn't a universal recipe that can be applied to image classification problems. The final results really depend on the type of data you have, but there are certainly guidelines to follow while training the model. Observing the training metric, in my case the batch loss, is important to understand whether the learning rate is set correctly. I experienced both cases: a rate that was too high, resulting in a quick drop in the total loss followed by no improvements, and a rate that was too low, requiring the model to train for a long time before any improvement could be seen.

The distribution of the data was also very important for this project. During a few experiments I noticed a big gap between train and validation loss, and I was also getting unexpected results on the test data.

The assumption is that if the model doesn't have enough data to understand the features of each individual class, the final results will be inaccurate. The model also won't produce consistent results on unseen data.

### Improvement

One improvement that I can think of is to train the model for a longer time. It's probably better to use an instance on Amazon with high computational power and GPU support.

Another aspect that can be improved is image manipulation. Many images in the dataset show areas that don't contain relevant information. Good work on this, in my opinion, was done by one of the participants on Kaggle: https://www.kaggle.com/zahaviguy/identifying-the-cervix-region-of-interest


### References

* [kaggle competition](https://www.kaggle.com/c/intel-mobileodt-cervical-cancer-screening)
* [cervix types classification](https://kaggle2.blob.core.windows.net/competitions/kaggle/6243/media/Cervix%20types%20clasification.pdf)
* [keras documentation](https://keras.io/)
* [log loss](https://github.com/ottogroup/kaggle/blob/master/benchmark.py)
* [split data](https://github.com/keras-team/keras/issues/5862)
* [data augmentation](https://machinelearningmastery.com)

### Appendix

ResNet50 Architecture

![Fig.7](notes/resnet50.png)