# Disease Prediction using Machine Learning

- Disease:       Malaria
- Data:          Haematological data
- Classifier:    Random Forest
- Python Libraries
    - numpy
    - scipy
    - pandas
    - matplotlib
    - seaborn
    - scikit-learn

This tutorial is based on this paper: 
Morang’a, C.M., Amenga–Etego, L., Bah, S.Y. et al. Machine learning approaches classify clinical malaria outcomes based on haematological parameters. BMC Med 18, 375 (2020). https://doi.org/10.1186/s12916-020-01823-3


<a href="https://youtu.be/2zZYtboqZdY">Explanation of the code can be found here</a>

<a href="https://static-content.springer.com/esm/art%3A10.1186%2Fs12916-020-01823-3/MediaObjects/12916_2020_1823_MOESM2_ESM.xlsx">Download Dataset</a>

### Import libraries for the analysis 

In [2]:
import pandas as pd        #read,explore & clean data
import numpy as np         # data manipulation
import matplotlib.pyplot as plt   # data visualization
import seaborn as sns             # data visualization

### Read the data

In [3]:
#read the data with pandas
dataframe=pd.read_excel("https://static-content.springer.com/esm/art%3A10.1186%2Fs12916-020-01823-3/MediaObjects/12916_2020_1823_MOESM2_ESM.xlsx",sheet_name="Table S2 Hematological Raw Data")

### Explore, clean and preprocess the data 

In [None]:
#find the number of rows and columns in the dataframe
dataframe.shape

In [None]:
#get the first n rows in the dataframe
dataframe.head(n=5)

In [None]:
# list the column names
dataframe.columns

In [None]:
#obtain some information about the data 
#i.e. columns,datatypes,missing values,etc
dataframe.info()

In [None]:
#we are interested in the columns 'Clinical_Diagnosis' up to 'RBC_dist_width_Percent'
#so we subset the data from the column at index 16 to the last column
subset=dataframe.iloc[:,16:]
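A hard-coded position such as 16 silently selects the wrong columns if the spreadsheet layout changes. As a minimal sketch, assuming a made-up frame whose column names (other than `Clinical_Diagnosis`) and values are purely illustrative, the start position can be looked up by name instead:

```python
import pandas as pd

# Hypothetical toy frame; only the column name 'Clinical_Diagnosis' mirrors the tutorial
toy = pd.DataFrame({
    "id": [1, 2],
    "site": ["x", "y"],
    "Clinical_Diagnosis": ["a", "b"],
    "wbc": [5.1, 6.2],
})

# Find the integer position of the first column of interest
start = toy.columns.get_loc("Clinical_Diagnosis")

# Slice from that column to the end; .copy() gives an independent frame
toy_subset = toy.iloc[:, start:].copy()
print(start, list(toy_subset.columns))
```

Taking a `.copy()` of the slice also makes later modifications unambiguous, because the result no longer refers back to the original frame.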

In [None]:
subset.shape

In [None]:
subset.info()

In [None]:
#Check for missing data. We are interested in how many missing values are present in each column
subset.isnull().sum()

In [None]:
# handling missing values
# drop / remove all rows with missing values
# reassign instead of using inplace=True, which can raise a SettingWithCopyWarning on a slice
subset=subset.dropna()

In [None]:
subset.shape
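Before dropping rows from the real data, it can help to see the effect on a small example. The sketch below uses an invented frame (column names and values are illustrative, not taken from the dataset) and reports how many rows are lost:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; values are invented for illustration only
toy = pd.DataFrame({
    "label": ["a", "b", "a", "b"],
    "feature_1": [1.0, np.nan, 3.0, 4.0],
    "feature_2": [0.5, 0.6, np.nan, 0.8],
})

# Missing values per column, as with subset.isnull().sum()
missing_per_column = toy.isnull().sum()

# Drop every row that contains at least one missing value
toy_clean = toy.dropna()

rows_removed = len(toy) - len(toy_clean)
print(missing_per_column)
print("rows removed:", rows_removed)
```

Comparing the shape before and after in this way shows how much data the cleaning step costs, which matters when a class is already rare.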

In [None]:
subset.columns

In [None]:
#Let us get the different malaria outcomes. 
#The outcomes will be our labels/classes in the data

In [None]:
subset['Clinical_Diagnosis'].unique()

In [None]:
labels=pd.Categorical(subset['Clinical_Diagnosis'])
labels

In [None]:
subset.head()

In [None]:
#class distribution
subset['Clinical_Diagnosis'].value_counts()

In [None]:
# plot a bar chart to display the class distribution
subset['Clinical_Diagnosis'].value_counts().plot.bar()
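Raw counts are easier to compare as proportions. The following sketch, on invented labels (`A`, `B`, `C` are placeholders, not the dataset's class names), shows one way to quantify class imbalance:

```python
import pandas as pd

# Hypothetical labels; the real class names come from the dataset
toy_labels = pd.Series(["A", "A", "A", "B", "B", "C"])

# Absolute counts per class
counts = toy_labels.value_counts()

# Share of each class, rounded to two decimals
proportions = toy_labels.value_counts(normalize=True).round(2)

# Ratio between the largest and the smallest class
imbalance_ratio = counts.max() / counts.min()

print(proportions)
print("imbalance ratio:", imbalance_ratio)
```

A ratio well above 1 is a hint that plain accuracy may be misleading and that a metric such as balanced accuracy is worth reporting.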

In [None]:
#descriptive statistics on the data
subset.iloc[:,1:].describe().transpose()

In [None]:
#check the correlation for the features
#the label column (position 0) is not numeric, so it is excluded
subset.iloc[:,1:].corr()

In [None]:
#let's visualize the correlation matrix using seaborn
#the label column (position 0) is not numeric, so it is excluded
sns.heatmap(subset.iloc[:,1:].corr(),cmap='coolwarm')
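A heatmap gives a visual impression; a short list of strongly correlated pairs is sometimes easier to act on. This is a minimal sketch on synthetic data, where the feature names `f1` to `f3` and the 0.9 threshold are assumptions chosen for illustration:

```python
import numpy as np
import pandas as pd

# Synthetic features: f2 is almost a scaled copy of f1, f3 is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "f1": base,
    "f2": 2 * base + rng.normal(scale=0.01, size=200),
    "f3": rng.normal(size=200),
})

# Absolute correlations, so strong negative relationships are found too
corr = toy.corr().abs()

# Keep only the upper triangle to skip the diagonal and duplicate pairs
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack()

# Pairs above the (arbitrary) threshold
high_pairs = pairs[pairs > 0.9]
print(high_pairs)
```

Random forests tolerate correlated features reasonably well, but such pairs can split feature importance between them, so they are worth knowing about when interpreting the model.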

## Data Preprocessing

In [None]:
# separate the labels/classes from the features/measurement
X=subset.iloc[:,1:]
y=subset.iloc[:,0]

In [None]:
X.shape

In [None]:
y.shape

In [None]:
#encode the labels as integers. 
#scikit-learn classifiers also accept string labels, but integer codes are a common convention

In [None]:
from sklearn.preprocessing import LabelEncoder
label_encoder=LabelEncoder()
label_encoder.fit(y)
y_encoded=label_encoder.transform(y)

In [None]:
y_encoded[0:5]

In [None]:
y[0:5]

In [None]:
classes=label_encoder.classes_
classes
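To see how the integer codes relate to the class names, the sketch below encodes a few made-up labels (`severe`, `mild`, `healthy` are placeholders, not the dataset's classes) and decodes them again:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical class names used only for illustration
toy_y = ["severe", "mild", "healthy", "mild"]

encoder = LabelEncoder()
toy_encoded = encoder.fit_transform(toy_y)

# classes_ is sorted; the position of a name is its integer code
mapping = {name: code for code, name in enumerate(encoder.classes_)}

# inverse_transform turns integer codes back into the original names
decoded = encoder.inverse_transform(toy_encoded)

print(mapping)
print(list(decoded))
```

The same `inverse_transform` call can be applied to model predictions to report them as readable class names.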

## Split data into train and test sets

In [None]:
# train test ratio 80:20

In [None]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y_encoded,test_size=0.2)

In [None]:
X_train.shape

In [None]:
y_train.shape

In [None]:
X_test.shape

In [None]:
y_test.shape
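A plain random split changes on every run and may leave a rare class under-represented in the test set. One possible variation, sketched here on synthetic data, passes `stratify` and `random_state` to `train_test_split`; the 80/20 class ratio, the seed 42, and the variable names are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 80 samples of class 0 and 20 of class 1
toy_X = np.arange(100).reshape(-1, 1)
toy_y = np.array([0] * 80 + [1] * 20)

# stratify keeps the class proportions equal in both parts;
# random_state makes the split reproducible
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

print("train class counts:", np.bincount(y_tr))
print("test class counts:", np.bincount(y_te))
```

With stratification, both parts keep the original 4:1 class ratio, so the test metrics are computed on a class mix similar to the training data.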

### Normalization of the data

In [None]:
# scale each feature to the range 0 to 1 (min-max scaling)

In [None]:
from sklearn.preprocessing import MinMaxScaler
min_max_scaler=MinMaxScaler()
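X_train_scaled=min_max_scaler.fit_transform(X_train)
# reuse the minimum and maximum learned from the training set;
# refitting on the test set would leak information and scale it differently
X_test_scaled=min_max_scaler.transform(X_test)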

In [None]:
X_train_scaled[0,0]

In [None]:
X_train.iloc[0,0]
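A scaler is usually fitted on the training data only and then applied unchanged to the test data. The tiny example below, with invented numbers, shows what that means in practice:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()

# fit_transform learns min=10 and max=30 from the training data
train_scaled = scaler.fit_transform(train)

# transform reuses the training min and max without refitting
test_scaled = scaler.transform(test)

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test value 40 maps to 1.5, outside the range 0 to 1. That is expected: the test set is treated as unseen data, so it is measured against the training range rather than its own.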

### Training Phase

In [None]:
#create random forest classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
clf=RandomForestClassifier()
clf.fit(X_train_scaled,y_train)
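A random forest trained with default settings gives slightly different results on each run. As a sketch on synthetic data (the dataset from `make_classification`, the seed 0, and `n_estimators=100` are assumptions, not the tutorial's settings), fixing `random_state` makes the model reproducible, and `feature_importances_` shows which features the trees relied on:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic three-class problem standing in for the real features
toy_X, toy_y = make_classification(
    n_samples=200, n_features=6, n_informative=3, n_classes=3, random_state=0
)

# random_state fixes the bootstrap samples and feature choices
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(toy_X, toy_y)

# One importance value per feature; the values sum to 1
importances = forest.feature_importances_

# Feature indices ordered from most to least important
ranking = importances.argsort()[::-1]
print(ranking)
```

On the real data, the indices in `ranking` can be mapped back to column names to see which haematological parameters contribute most.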

### Testing phase

In [None]:
# model prediction on the test set

In [None]:
y_pred=clf.predict(X_test_scaled)

In [None]:
y_pred[0:3]

In [None]:
y_test[0:3]

In [None]:
classes

### Evaluating the model



The following metrics will be used:
- balanced accuracy
- f1 score
- recall
- precision
- confusion matrix
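Balanced accuracy is the mean of the per-class recalls, which makes it less flattering than plain accuracy when classes are imbalanced. The sketch below uses invented labels to show the difference:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, recall_score

# Hypothetical labels: class 0 is common, class 1 is rare
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_hat = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

# Plain accuracy: 9 of 10 predictions are correct
acc = accuracy_score(y_true, y_hat)

# Recall for each class separately: 8/8 for class 0, 1/2 for class 1
per_class_recall = recall_score(y_true, y_hat, average=None)

# Balanced accuracy: the mean of the per-class recalls
bal_acc = balanced_accuracy_score(y_true, y_hat)

print(acc, per_class_recall, bal_acc)
```

Here plain accuracy is 0.9 while balanced accuracy is 0.75, because half of the rare class was missed.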

In [5]:
# import the metrics
from sklearn.metrics import balanced_accuracy_score,f1_score,precision_score, recall_score
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
#balanced accuracy
balanced_accuracy=balanced_accuracy_score(y_test,y_pred)
balanced_accuracy=round(balanced_accuracy,2)
print('balanced accuracy:',balanced_accuracy)

In [None]:
#f1score
f1score=f1_score(y_test,y_pred,average='weighted')
f1score=round(f1score,2)
print('f1score:',f1score)

In [None]:
#precision 
precision=precision_score(y_test,y_pred,average='weighted')
precision=round(precision,2)
print('precision:',precision)

In [None]:
#recall
recall=recall_score(y_test,y_pred,average='weighted')
recall=round(recall,2)
print('recall:',recall)

In [None]:
#confusion matrix
disp=ConfusionMatrixDisplay.from_estimator(clf,X_test_scaled,y_test,xticks_rotation='vertical',
                     cmap='Blues',display_labels=classes)
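The plot is built from a plain matrix of counts, which can also be inspected directly. A minimal sketch with invented labels for three classes:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for three classes
y_true = [0, 0, 1, 1, 2, 2]
y_hat = [0, 1, 1, 1, 2, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true, y_hat)

# Diagonal entries are correct predictions; dividing by the row sums
# gives the recall of each class
per_class_recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(per_class_recall)
```

Reading along a row shows where samples of one true class ended up, which makes it easy to see which pairs of outcomes the model confuses.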