# Malaria parasite detection using ensemble learning in Keras

## Task 1: Loading the cell image data

Ensemble learning combines the predictions of multiple models to improve the prediction accuracy.

There are several ways to perform ensemble learning, and a reasonable summary is available on https://en.wikipedia.org/wiki/Ensemble_learning. Simply speaking, there are two major classes of ensemble learning:
- Bagging: fit independent models and then average their predictions
- Boosting: fit several models sequentially, each one focusing on the errors of the previous ones, and then combine their predictions

In this project we will be using a simplified form of the bagging approach:

- fit a collection of independent models
- gather the predictions from each
- apply a voting procedure

Given the predictions from several models, there are two voting procedures:
- Hard voting: get the most common class predicted
- Soft voting: get the argmax of the sum of predicted probabilities. It can be either weighted or unweighted.
  - Weighted: each predicted probability is multiplied by a preset weight
  - Unweighted: we don't multiply the predicted probabilities, we just add them straight away
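To make the two procedures concrete, here is a minimal sketch on made-up numbers. The probabilities and the weights below are hypothetical and are not outputs of the models in this project. For a binary problem with a single sigmoid output, taking the argmax of the summed class probabilities is equivalent to thresholding the (weighted) average probability of class 1 at 0.5, which is what the sketch does.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 from three models
# for four samples (rows: models, columns: samples).
probs = np.array([
    [0.9, 0.4, 0.95, 0.2],
    [0.8, 0.6, 0.45, 0.1],
    [0.3, 0.7, 0.45, 0.4],
])

# Hard voting: turn each probability into a class, then take the majority class.
votes = (probs > 0.5).astype(int)
hard = (votes.sum(axis=0) > probs.shape[0] / 2).astype(int)

# Unweighted soft voting: average the probabilities, then threshold.
soft = (probs.mean(axis=0) > 0.5).astype(int)

# Weighted soft voting with illustrative preset weights that sum to 1.
weights = np.array([0.5, 0.3, 0.2])
soft_weighted = (weights @ probs > 0.5).astype(int)

print('hard:         ', hard)
print('soft:         ', soft)
print('soft weighted:', soft_weighted)
```

Note the third sample: two of the three models vote for class 0, so hard voting returns 0, while the single very confident model pulls the soft vote to class 1.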

In the interest of time, we will focus on hard voting.

First we import the required libraries: tensorflow, keras, sklearn, cv2, matplotlib, statistics and a few other utilities.

Dataset: https://www.tensorflow.org/datasets/catalog/malaria

In [None]:
!pip3 install keras tensorflow scikit-learn matplotlib opencv-python pandas

import statistics
import os
import glob
import numpy as np
import pandas as pd
from concurrent import futures
import threading

import tensorflow as tf
from keras.optimizers import SGD, Adam
from keras.layers import Conv2D, Activation, Dense, MaxPooling2D, Flatten, Dropout
from keras.models import Sequential


from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, confusion_matrix, accuracy_score, precision_score, recall_score, classification_report

import cv2
import matplotlib.pyplot as plt

Next, we load the data:
- Obtain the file names using the glob module.
- Create a data frame of the infected and healthy cell images.
- Randomize the order of the rows.
- Pick the first 2000 images.
- Check how many images of each class are in the sample; the split should be close to 50/50.

In [None]:
infected = 'cell_images/Parasitized'
healthy = 'cell_images/Uninfected'

infected_files = glob.glob(infected+'/*.png')
healthy_files = glob.glob(healthy+'/*.png')

files_df = pd.DataFrame({
    'img': infected_files + healthy_files,
    'malaria': [1] * len(infected_files) + [0] * len(healthy_files)
})

files_df = files_df.sample(frac=1, random_state=42).reset_index(drop=True)

# Just to reduce complexity
files_df = files_df.iloc[0:2000, :]
files_df['malaria'].value_counts()

## Task 2: Transform the image files into arrays and create the datasets

The image files, as they are, are binary. We need to turn them into numbers so we can pass them into the machine learning pipeline.

To do so, we will use the cv2 library to read and resize the images. These operations will be performed by the `get_data()` function.

Next, we place the input arrays into `X` and the target values into `y`. We normalize the image data by dividing all `X` values by 255, so that the values range from 0 to 1.

Now that our `X` and `y` are ready, we split the dataset 80:20 into train and test sets.

Finally, let's see what an image looks like: use the `imshow()` function in matplotlib, which plots images from 3-d arrays.

In [None]:
img_length, img_width = 50, 50


def get_data(data_files):
    data = []
    for img in data_files:
        print(img)
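        img = cv2.imread(img)
        # OpenCV loads images as BGR; convert to RGB so imshow() displays true colors
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)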
        img = cv2.resize(img, dsize=(img_length, img_width),
                         interpolation=cv2.INTER_CUBIC)
        img = np.array(img)
        data += [img]
    return np.array(data)

X = files_df['img'].values
y = files_df['malaria'].values

X_converted = get_data(X)/255.0

train_data, val_data, train_labels, val_labels = train_test_split(
    X_converted, y, test_size=0.2, random_state=42)

# Check images

plt.figure(figsize=(8, 8))
plt.imshow(train_data[0])
plt.title('{}'.format(train_labels[0]))
plt.xticks([])
plt.yticks([])
plt.savefig('sample')

## Task 3: Create a deep CNN

Time to start doing deep learning to predict the presence or absence of malaria in cell images.

We will be experimenting with a deep convolutional neural network which has the following architecture:

- two convolutional layers with 16 filters each, each followed by max pooling
- a convolutional layer with 32 filters, followed by max pooling
- a flattening layer
- a dense hidden layer with 32 nodes
- dropout of 50% of the previous hidden layer's outputs
- an output layer with 1 node

The Adam optimizer will be used with a learning rate of 0.001.

In [None]:

model = Sequential()
model.add(Conv2D(16, (3, 3), activation='relu',
                 input_shape=(img_length, img_width, 3)))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(16, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3), activation='relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(1, activation='sigmoid'))
model.summary()

# `lr` is a deprecated alias; current Keras versions expect `learning_rate`
adam = Adam(learning_rate=0.001)

model.compile(optimizer=adam,
              loss='binary_crossentropy',
              metrics=['accuracy'])
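
The output shapes reported by `model.summary()` can be checked by hand. The sketch below assumes the Keras defaults used in the cell above: 3x3 convolutions with `'valid'` padding and stride 1, and 2x2 max pooling with stride 2. The helper names `conv_out` and `pool_out` are made up for this illustration.

```python
def conv_out(size, kernel=3):
    # 'valid' padding with stride 1 shrinks each side by kernel - 1
    return size - kernel + 1


def pool_out(size, pool=2):
    # 2x2 max pooling with stride 2 halves each side, rounding down
    return size // pool


size = 50  # img_length and img_width in this project
for _ in range(3):  # three conv + max-pooling stages
    size = pool_out(conv_out(size))

flattened = size * size * 32         # the last conv layer has 32 filters
dense_params = flattened * 32 + 32   # weights + biases of the Dense(32) layer

print(size, flattened, dense_params)
```

The side length goes 50, 48, 24, 22, 11, 9, 4, so the flattened vector has 4 x 4 x 32 = 512 values and the hidden dense layer has 16,416 parameters.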

## Task 4: Train and test the CNN

Next, we call the `fit` method to train the model for 20 epochs, then predict a class for each image in the validation set.

We measure the quality of our predictions using the `classification_report` function, which gives us the key classification metrics. I will also display those metrics individually so you can see their formulas.

- Precision: the ability of the classifier not to label as positive a sample that is negative
- Recall: the ability of the classifier to find all the positive samples
- f1: the harmonic mean of precision and recall

In addition:
- Accuracy: the fraction of predictions that match the actual values
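
As a quick worked example of these four metrics, the sketch below computes them from confusion-matrix counts. The counts are hypothetical, chosen only to make the arithmetic easy to follow; they are not results from this project.

```python
# Hypothetical confusion-matrix counts for a 400-image validation set
tp, fp, fn, tn = 180, 20, 10, 190

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print('Accuracy:  {:.3f}'.format(accuracy))
print('Precision: {:.3f}'.format(precision))
print('Recall:    {:.3f}'.format(recall))
print('f1:        {:.3f}'.format(f1))
```

Because f1 is a harmonic mean, it sits closer to the lower of precision and recall than a plain average would.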

Using the history object, we plot the accuracy and loss on the training and validation data across the epochs to see how our model converged.

In [None]:
history = model.fit(x=train_data, y=train_labels, batch_size=64, epochs=20,
                    verbose=1, shuffle=True, validation_data=(val_data, val_labels))

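# predict_classes() was removed in TensorFlow 2.6; threshold the sigmoid output instead
y_predicted = (model.predict(val_data) > 0.5).astype('int32')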

# accuracy = (true positives + true negatives) / (positives + negatives)
print('Accuracy: ', accuracy_score(val_labels, y_predicted))
# precision = true positives / (true positives + false positives)
print('Precision: ', precision_score(val_labels, y_predicted))
# recall = true positives / (true positives + false negatives)
print('Recall: ', recall_score(val_labels, y_predicted))
# f1 = 2 * (precision * recall) / (precision + recall)
print('f1: ', f1_score(val_labels, y_predicted))

print(classification_report(val_labels, y_predicted))

plt.subplot(211)
plt.title('Loss')
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='test')
plt.legend()
# plot accuracy during training
plt.subplot(212)
plt.title('Accuracy')
plt.plot(history.history['accuracy'], label='train')
plt.plot(history.history['val_accuracy'], label='test')
plt.legend()
plt.savefig('accuracy_loss')


## Task 5: Create the CNN models ensemble

Now that we know we can achieve good accuracy with one CNN model, let's try an ensemble of CNN models. We generate 2 more models in a loop, scaling the number of filters and dense nodes of the first model by a factor `j`.

Here we create the models and place them in a dictionary of models, `models`.

In [None]:
models = {}

for j in range(2, 4):
    newmodel = Sequential()
    newmodel.add(Conv2D(j*16, (3, 3), activation='relu',
                        input_shape=(img_length, img_width, 3)))
    newmodel.add(MaxPooling2D(pool_size=(2, 2)))
    newmodel.add(Conv2D(j*16, (3, 3), activation='relu'))
    newmodel.add(MaxPooling2D(pool_size=(2, 2)))
    newmodel.add(Conv2D(j*32, (3, 3), activation='relu'))
    newmodel.add(MaxPooling2D(pool_size=(2, 2)))
    newmodel.add(Flatten())
    newmodel.add(Dense(j*32, activation='relu'))
    newmodel.add(Dropout(0.5))
    newmodel.add(Dense(1, activation='sigmoid'))

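    # Give each model its own optimizer instance; sharing one across models can fail
    newmodel.compile(optimizer=Adam(learning_rate=0.001),
                     loss='binary_crossentropy',
                     metrics=['accuracy'])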
    newmodel.summary()
    models[j] = newmodel

## Task 6: Fit the models in the ensemble and perform the prediction

Next, we fit each of the models separately using the same datasets.

Once done, we generate the predictions of every model, including the first one from Task 3, and collect them in a list, `predictions_hard`.

In [None]:
for j in models:
    models[j].fit(x=train_data, y=train_labels, batch_size=64, epochs=20,
                  verbose=1, shuffle=True, validation_data=(val_data, val_labels))


models[1] = model

predictions_hard = []
for j in models:
    # predict() returns sigmoid probabilities; threshold at 0.5 to get class labels
    predictions_hard += [(models[j].predict(val_data) > 0.5).astype('int32')]

## Task 7: Apply hard voting to the ensemble

Here we will apply the hard voting procedure, which decides the class according to the majority vote. We will use the `mode()` statistical function to get the class that was predicted more frequently than the other.

Remember that the mode is the value that occurs most often in a dataset. Our predicted classes contain only 1's and 0's, so with three models it picks the class that at least two of them predicted.

In [None]:
voting_hard = []
for i in range(0, len(val_data)):
    voting_hard += [statistics.mode(
        [predictions_hard[0][i][0], predictions_hard[1][i][0], predictions_hard[2][i][0]])]

print(classification_report(val_labels, voting_hard))
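
The same majority vote can also be written without an explicit loop. This is a minimal sketch that uses a small hypothetical stand-in for `predictions_hard` (three arrays of shape `(n, 1)`, as produced above), so it runs on its own; the variable names `stacked` and `majority` are introduced only for this illustration.

```python
import numpy as np

# Hypothetical stand-in for `predictions_hard`: one (n, 1) array per model
predictions = [
    np.array([[1], [0], [1], [0]]),
    np.array([[1], [1], [0], [0]]),
    np.array([[0], [1], [1], [0]]),
]

# Put the models side by side: shape (n_samples, n_models)
stacked = np.hstack(predictions)

# A sample is class 1 when more than half of the models voted 1
majority = (stacked.sum(axis=1) > stacked.shape[1] / 2).astype(int)
print(majority)
```

With an odd number of models and two classes there are no ties, so this gives the same result as `statistics.mode` applied sample by sample.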
