**Load Basic Modules**

In [62]:
%matplotlib notebook

import pandas as pd
import numpy as np

**Load data into pandas dataframe**

In [63]:
hepatitis_data = pd.read_csv("dataset_55_hepatitis.csv")

**Prediction algorithm**  

One of the main decisions in a machine learning project is choosing an algorithm that fits the problem at hand.
**Supervised learning** refers to the task of inferring a function from a labeled training dataset. We fit the model to the labeled training set to find the parameters that best predict the unknown labels of new examples, such as those in the test dataset. There are two main types of supervised learning: regression, in which we predict a label that is a real number, and classification, in which we predict a categorical label.
In our case, we have a labeled dataset and want a classification algorithm that predicts one of two categorical values: 0 or 1.

There are many supervised classification algorithms: some are simple but efficient, such as linear classifiers or logistic regression, and others are more complex but powerful, such as decision trees and support vector machines.  
In this case, we will choose the **Random Forest** algorithm. Random forest is one of the most widely used machine learning algorithms because it is simple, flexible and easy to use, yet produces reliable results.
We will load the scikit-learn packages needed to train a random forest and to evaluate the model afterwards:

In [64]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score

from sklearn.metrics import roc_curve, auc, accuracy_score

from sklearn.preprocessing import Imputer  # removed in scikit-learn 0.22; newer versions provide sklearn.impute.SimpleImputer

In [65]:
replacements = {'no': 0,
               'yes': 1,
               'DIE': 0,
               'LIVE': 1,
               '?': np.nan,
               'female': 0,
               'male': 1}

hepatitis_data.replace(replacements, inplace = True)
hepatitis_data = hepatitis_data.astype(float)
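
As a small illustration of what this cell does, the sketch below applies the same mapping to a made-up three-row frame. The `toy` data is hypothetical and not taken from the hepatitis file, and it assumes the raw columns arrive as strings, which is why `dtype=object` is set explicitly.

```python
import numpy as np
import pandas as pd

# Hypothetical rows that mimic the raw string encoding of the dataset
toy = pd.DataFrame({
    "SEX": ["male", "female", "male"],
    "STEROID": ["yes", "?", "no"],
    "Class": ["LIVE", "DIE", "LIVE"],
}, dtype=object)

replacements = {'no': 0, 'yes': 1,
                'DIE': 0, 'LIVE': 1,
                '?': np.nan,
                'female': 0, 'male': 1}

# Map every label to a number, then cast so that numbers and NaN share one dtype
toy_numeric = toy.replace(replacements).astype(float)

print(toy_numeric)
print(toy_numeric.isnull().sum())
```

Casting to `float` is what allows `NaN` to sit next to the 0/1 codes in the same column.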

In the EDA, we dropped all `NaN` values. Here, we need to decide on the best method to handle them.  
There are several ways to deal with missing data, but none of them is perfect. The first step is to understand why the data went missing. In our case, we can guess that a missing value in a categorical variable means either that the feature was absent and was left blank instead of being recorded as 'no', or that it was not tested. Missing values in continuous variables could be explained by biochemical studies not being performed on that particular patient, or by parameters that were within the normal range and were not written down.  

In both cases, the values could be **Missing at Random** (the fact that the value is missing has nothing to do with the hypothetical value) or **Missing not at Random** (the missingness depends on the hypothetical value). In the first case, we could drop the `NaN` values safely; in the second, dropping them would not be safe because the missing value tells us something about the hypothetical value. We will therefore impute the missing values just before training our model.

In [66]:
hepatitis_data.isnull().sum()

AGE                 0
SEX                 0
STEROID             1
ANTIVIRALS          0
FATIGUE             1
MALAISE             1
ANOREXIA            1
LIVER_BIG          10
LIVER_FIRM         11
SPLEEN_PALPABLE     5
SPIDERS             5
ASCITES             5
VARICES             5
BILIRUBIN           6
ALK_PHOSPHATE      29
SGOT                4
ALBUMIN            16
PROTIME            67
HISTOLOGY           0
Class               0
dtype: int64
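
Counts like these are often easier to compare as proportions. A minimal sketch, using a hypothetical four-row frame rather than the real data, of how one might compute the share of missing values per column and how many rows would survive if every row containing a `NaN` were dropped:

```python
import numpy as np
import pandas as pd

# Hypothetical values, for illustration only
toy = pd.DataFrame({
    "AGE": [30.0, 50.0, 78.0, 31.0],
    "ALBUMIN": [4.0, np.nan, 3.5, 4.4],
    "PROTIME": [np.nan, 85.0, np.nan, 80.0],
})

# Fraction of missing entries in each column (mean of a boolean mask)
missing_share = toy.isnull().mean()

# Number of rows left if any row containing a NaN is dropped
rows_kept = len(toy.dropna())

print(missing_share)
print(rows_kept, "of", len(toy), "rows have no missing value")
```

When missing values are spread across different rows, dropping them can discard far more data than any single column count suggests.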

**Training and Test datasets**

In order to train and test our model, we need to split our dataset into two subsets: the training dataset and the test dataset. The model learns from the training dataset so that it can generalize to other data; the test dataset is used to check what the model learnt during the training and fitting step. 
It is common to split the original dataset 80%-20%. It is important to use a reliable method to split the dataset to avoid data leakage, that is, the presence in the test set of examples that were also in the training set. 
First, we will assign all the columns except our dependent variable ("Class") to the variable `x` and the column "Class" to the variable `y`.
Then we will use `train_test_split` from the scikit-learn library to split them into `X_train`, `X_test`, `Y_train` and `Y_test`. Setting `random_state` is important because it gives us the same split every time we run the code. 

In [67]:
x = hepatitis_data.iloc[:, hepatitis_data.columns != 'Class']
y = hepatitis_data.iloc[:, hepatitis_data.columns == 'Class']

In [68]:
X_train, X_test, Y_train, Y_test = train_test_split(x, y, test_size = 0.2, 
                                                    random_state = 42)
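
The effect of `test_size` and `random_state` can be checked on a small example. The sketch below uses made-up arrays (`X_toy`, `y_toy` are hypothetical names) and confirms that two calls with the same seed return identical subsets and that no row ends up in both.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each; first column is unique per row
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

first = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
second = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)

# Same seed, same split
same_split = all(np.array_equal(a, b) for a, b in zip(first, second))

# No row appears in both the training and the test subset
overlap = set(X_tr[:, 0]) & set(X_te[:, 0])
print(same_split, len(overlap))
```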

**Training Random Forest**

It is now easy to impute missing values (using `Imputer`) and to create and train a basic random forest model with scikit-learn.
We will start by applying `.ravel()` to `Y_train` and `Y_test` to flatten them into one-dimensional arrays, as not doing so will raise warnings from our model. 

In [69]:
Y_train = Y_train.values.ravel()
Y_test = Y_test.values.ravel()
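
To see why flattening is needed, note that a single-column DataFrame yields a two-dimensional array. A short sketch with a hypothetical three-row target:

```python
import numpy as np
import pandas as pd

# Hypothetical target column, kept as a DataFrame like `y` in this notebook
y_frame = pd.DataFrame({"Class": [1.0, 0.0, 1.0]})

column_vector = y_frame.values   # 2-D array with one column
flat = column_vector.ravel()     # 1-D array, the shape scikit-learn expects for labels

print(column_vector.shape, flat.shape)
```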

Then, we will impute our missing values using the `Imputer` class with the `most_frequent` strategy, which replaces each missing value with the most frequent value in its column (`axis = 0`). The imputer is fitted on the training set only, so no information from the test set is used. Note that doing so can introduce errors and bias but, as we stated before, there is no perfect way to handle missing data.

In [70]:
imp = Imputer(missing_values = 'NaN', strategy = "most_frequent", axis = 0)
imp = imp.fit(X_train)

X_train_imp = imp.transform(X_train)

fit_random_forest = RandomForestClassifier(random_state = 42)

fit_random_forest.fit(X_train_imp, Y_train);
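
The cell above relies on `Imputer`, which is not available in scikit-learn 0.22 and later. On those versions, `SimpleImputer` from `sklearn.impute` fills the same role; it always works column-wise, so there is no `axis` argument, and the missing marker is passed as `np.nan`. The sketch below is one possible equivalent on made-up arrays (all `*_toy` names are hypothetical), not the notebook's original code.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and test features with missing entries
X_train_toy = np.array([[1.0, 0.9],
                        [1.0, np.nan],
                        [0.0, 0.9],
                        [np.nan, 1.2]])
X_test_toy = np.array([[np.nan, np.nan]])

# Learn the most frequent value of each column from the training data only
imp_toy = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
imp_toy.fit(X_train_toy)

# Reuse the learnt values for both subsets
X_train_toy_imp = imp_toy.transform(X_train_toy)
X_test_toy_imp = imp_toy.transform(X_test_toy)

print(imp_toy.statistics_)
print(X_test_toy_imp)
```

Fitting on the training subset and only transforming the test subset mirrors what the notebook does with `imp`.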


Our basic model has now been trained and has learnt the relationship between our independent variables and the target variable. Now, we can check how good our model is by making predictions on the test set. We can then compare the prediction with our known labels.

We will impute the missing values in our test set with the imputer fitted on the training data, then use the `predict` method and the `accuracy_score` metric to evaluate the performance of our model.

In [73]:
X_test_imp = imp.transform(X_test)

y_predicted = fit_random_forest.predict(X_test_imp)

In [74]:
accuracy_score(Y_test, y_predicted)

0.74193548387096775

As we can observe above, our basic model has an accuracy of 74.19%, which tells us that it needs further improvement.
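
To make the metric concrete: accuracy is the fraction of predictions that match the true labels. A minimal sketch on hypothetical labels (not the notebook's predictions), comparing a manual computation with `accuracy_score`:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical true and predicted labels
y_true_toy = np.array([1, 1, 0, 1])
y_pred_toy = np.array([1, 0, 0, 1])

# Fraction of positions where prediction and truth agree
manual = np.mean(y_true_toy == y_pred_toy)
library = accuracy_score(y_true_toy, y_pred_toy)

print(manual, library)
```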