
# Predict Malaria Outcomes Using a Neural Network

The data used for this tutorial was obtained from this study: 
Morang’a, C.M., Amenga–Etego, L., Bah, S.Y. et al. Machine learning approaches classify clinical malaria outcomes based on haematological parameters. BMC Med 18, 375 (2020). https://doi.org/10.1186/s12916-020-01823-3

## Required Libraries
 - numpy
 - matplotlib
 - seaborn
 - pandas
 - tensorflow
 - keras
 - scikit-learn

## Import Python libraries

In [None]:
#data handling
import pandas as pd
import numpy as np

#data visualization
import matplotlib.pyplot as plt
import seaborn as sns

#preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder,label_binarize
from sklearn.preprocessing import MinMaxScaler

#classification
import tensorflow as tf
from tensorflow import keras
#import the model and layer classes from the keras bundled with tensorflow
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

## Read data

In [None]:
#set path to the data file.
data_file='https://raw.githubusercontent.com/vappiah/Machine-Learning-Tutorials/main/datasets/malaria_clin_data.csv'

#read the data with pandas
dataframe=pd.read_csv(data_file)


## Data Exploration & Cleaning




In [None]:
#find the number of rows and columns in the dataframe
dataframe.shape

In [None]:
#get the first n rows in the dataframe
dataframe.head(n=5)

In [None]:
# list the column names
dataframe.columns

In [None]:
#we are interested in the columns from 'Clinical_Diagnosis' up to 'RBC_dist_width_Percent',
#so we subset the data from column index 16 (zero-based) to the last column.
#.copy() gives an independent dataframe that can be modified safely
subset=dataframe.iloc[:,16:].copy()

In [None]:
# handling missing values
# drop / remove all rows with at least one missing value
subset=subset.dropna()
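Before dropping rows, it can help to check how many values are missing in each column, because a row is removed even if only one of its values is absent. The sketch below uses a small made-up dataframe (`toy`, with the hypothetical columns `wbc_count` and `rbc_count`), not the malaria data.

```python
import numpy as np
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'wbc_count': [5.1, np.nan, 7.3, 6.0],
    'rbc_count': [4.2, 4.8, np.nan, 5.0],
})

# number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# drop the rows with at least one missing value
toy_clean = toy.dropna()
print(toy.shape, '->', toy_clean.shape)
```

The same two calls, `isna().sum()` and `dropna()`, should work unchanged on `subset`.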

In [None]:
#Let us get the different malaria outcomes. 
#The outcomes will be our labels/classes in the data

subset['Clinical_Diagnosis'].unique()

 
## **Data preprocessing**
This is done to put the data in an appropriate format before modelling.


In [None]:
# separate the labels/classes from the features/measurement
X=subset.iloc[:,1:]
y=subset.iloc[:,0]

\
**Encode labels**

The labels for this data are categorical, so we have to convert them to numeric form. This is referred to as encoding. Machine learning models usually require their inputs and targets to be numeric, which is why we encode the labels.

In [None]:
#let's encode the target labels (y) in two steps:
#first map each class name to an integer, then one-hot encode the integers.

label_encoder=LabelEncoder()
label_encoder.fit(y)
y=label_encoder.transform(y)
labels=label_encoder.classes_
classes=np.unique(y)
y=label_binarize(y,classes=classes)
nclasses=y.shape[1]
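To make the two encoding steps concrete, here is a minimal sketch on made-up labels. The class names `class_a`, `class_b` and `class_c` and the `toy_*` variables are hypothetical and are not taken from the malaria data.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, label_binarize

# hypothetical class names, for illustration only
toy_labels = np.array(['class_b', 'class_a', 'class_c', 'class_a'])

# step 1: map each class name to an integer (classes are sorted alphabetically)
toy_encoder = LabelEncoder()
toy_int = toy_encoder.fit_transform(toy_labels)
print(toy_int)  # [1 0 2 0]

# step 2: turn each integer into a one-hot row
toy_onehot = label_binarize(toy_int, classes=np.unique(toy_int))
print(toy_onehot)

# recover the original class names from the one-hot rows
recovered = toy_encoder.inverse_transform(toy_onehot.argmax(axis=1))
print(recovered)
```

Note that `label_binarize` returns a single column when there are only two classes, so taking the number of classes from `y.shape[1]` assumes three or more classes.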

\
**Data Normalization**\
Data normalization rescales every feature to the same range, so that features with large values do not dominate features with small values. This usually helps the network train faster and more stably.

In [None]:
#scale the data between 0 and 1
min_max_scaler=MinMaxScaler()
X=min_max_scaler.fit_transform(X)
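With its default settings, `MinMaxScaler` rescales each column as $x' = (x - x_{min}) / (x_{max} - x_{min})$, where the minimum and maximum are taken per feature. The sketch below checks this by hand on a small made-up array (`toy_X` is hypothetical).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# hypothetical measurements: 3 samples, 2 features
toy_X = np.array([[10.0, 200.0],
                  [20.0, 400.0],
                  [30.0, 800.0]])

# min-max scaling by hand, column by column
col_min = toy_X.min(axis=0)
col_max = toy_X.max(axis=0)
manual = (toy_X - col_min) / (col_max - col_min)

# the same scaling with scikit-learn
scaled = MinMaxScaler().fit_transform(toy_X)

print(manual)
print(np.allclose(manual, scaled))
```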

\
**Data Splitting**\
Data is split into three sets: training, validation and test.
 - The training set is used to fit the model.
 - The validation set is used to evaluate the model during training.
 - The test set is used to test the model after training and tuning have been completed.

In [None]:
#split data into training, validation and test sets

#split the data into training (80%) and test (20%) sets
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2)

#split the training set again: 80% stays for training, 20% becomes the validation set
#(overall roughly 64% training, 16% validation, 20% test)
X_train, X_val, y_train, y_val = train_test_split(X_train,y_train,test_size=0.2)
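The sketch below illustrates the sizes produced by two consecutive 80/20 splits, using made-up data (`toy_X`, `toy_y`). It also shows three optional choices that are assumptions on my part rather than part of the tutorial: `random_state` for a reproducible split, `stratify` to keep class proportions similar across the sets, and fitting the scaler on the training set only, which is one common way to keep information about the validation and test sets out of preprocessing.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# hypothetical data: 100 samples, 4 features, 2 balanced classes
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 4))
toy_y = np.repeat([0, 1], 50)

# first split: 80% training + validation, 20% test
X_tmp, X_te, y_tmp, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y)

# second split: 20% of the remaining 80% becomes the validation set
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.2, random_state=42, stratify=y_tmp)

print(len(X_tr), len(X_va), len(X_te))  # 64 16 20

# fit the scaler on the training set only, then reuse it for the other sets
scaler = MinMaxScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_va_s = scaler.transform(X_va)
X_te_s = scaler.transform(X_te)
```

With this approach, validation and test values can fall slightly outside the 0-1 range, because the minimum and maximum come from the training set alone.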

## Build the Neural Network Model

In [None]:
#define model
model = Sequential()

#hidden layer 1
model.add(Dense(40, input_dim=X_train.shape[1], activation='relu'))

#hidden layer 2
model.add(Dense(20, activation='relu'))

#output layer
model.add(Dense(nclasses, activation='softmax'))

#define optimizer and learning rate. We will use the Adam optimizer
opt_adam = keras.optimizers.Adam(learning_rate=0.001)

#compile the model with a loss and a metric suited to one-hot encoded labels
model.compile(loss=keras.losses.CategoricalCrossentropy(), optimizer=opt_adam, metrics=[keras.metrics.CategoricalAccuracy()])
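To get a feel for the size of this network: each `Dense` layer has one weight per input-output pair plus one bias per unit, so it holds `inputs * units + units` parameters. The sketch below applies that rule to the three layers, assuming a hypothetical 15 input features and 3 classes. The real values come from `X_train.shape[1]` and `nclasses`.

```python
def dense_params(n_inputs, n_units):
    """Number of trainable parameters in a fully connected layer."""
    return n_inputs * n_units + n_units

# hypothetical sizes, for illustration only
n_features = 15
n_classes = 3

# input -> hidden layer 1 (40) -> hidden layer 2 (20) -> output
layer_sizes = [n_features, 40, 20, n_classes]
counts = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]

print(counts)       # [640, 820, 63]
print(sum(counts))  # 1523
```

Calling `model.summary()` on the compiled model reports the per-layer counts for the actual input and output sizes.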


In [None]:
#fit the model to the training data
history = model.fit(X_train, y_train, validation_data=(X_val, y_val), batch_size=32,epochs=200, verbose=1)


In [None]:
# summarize history for accuracy
plt.plot(history.history['categorical_accuracy'])
plt.plot(history.history['val_categorical_accuracy'])
plt.title('model performance')
plt.ylabel('accuracy')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='lower right')
plt.show()

In [None]:
# summarize history for loss
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('model loss')
plt.ylabel('loss')
plt.xlabel('epoch')
plt.legend(['train', 'val'], loc='upper right')
plt.show()

## Predict on data
Let's use our trained model to classify samples that were not included in the training or validation sets. These samples make up the test set.
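Because the output layer uses softmax, `model.predict` returns one probability per class for every sample, and the predicted class is the one with the highest probability. The sketch below shows that decoding step with numpy only. The probabilities, class names and true labels are made up for illustration; with the real model they would come from `model.predict(X_test)`, `labels` and `y_test`.

```python
import numpy as np

# hypothetical class names and softmax outputs, for illustration only
toy_class_names = np.array(['class_a', 'class_b', 'class_c'])
toy_probs = np.array([[0.7, 0.2, 0.1],
                      [0.1, 0.3, 0.6],
                      [0.2, 0.5, 0.3]])

# the predicted class is the column with the highest probability
pred_idx = toy_probs.argmax(axis=1)
pred_names = toy_class_names[pred_idx]
print(pred_names)  # ['class_a' 'class_c' 'class_b']

# compare with hypothetical one-hot true labels to get the accuracy
toy_true = np.array([[1, 0, 0],
                     [0, 0, 1],
                     [1, 0, 0]])
accuracy = (pred_idx == toy_true.argmax(axis=1)).mean()
print(accuracy)
```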