**Load Basic Modules**

In [34]:
%matplotlib notebook

import pandas as pd
import numpy as np

**Load data into pandas dataframe**

In [35]:
hepatitis_data = pd.read_csv("dataset_55_hepatitis.csv")

**Prediction algorithm**  

One of the main decisions in machine learning is choosing an algorithm that fits the problem at hand.
**Supervised learning** refers to the task of inferring a function from a labeled training dataset. We fit the model to the labeled training set with the goal of finding the parameters that best predict the unknown labels of new examples, such as those in the test dataset. There are two main types of supervised learning: regression, in which we want to predict a label that is a real number, and classification, in which we want to predict a categorical label.
In our case, we have a labeled dataset and we want a classification algorithm that predicts a categorical label with two possible values: 0 and 1.

There are many supervised classification algorithms: some are simple but efficient, such as linear classifiers or logistic regression, and others are more complex but powerful, such as decision trees and k-nearest neighbors.  
In this case, we will choose the **Random Forest** algorithm. Random forest is one of the most widely used machine learning algorithms because it is simple, flexible and easy to use, yet produces reliable results.
So we will load the scikit-learn packages that we need to fit a Random Forest and to evaluate the model afterwards:

In [36]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, KFold, cross_val_score

from sklearn.metrics import roc_curve, auc

In [37]:
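# Map the string categories to numbers and the '?' placeholder to NaN
replacements = {'no': 0,
               'yes': 1,
               'DIE': 0,
               'LIVE': 1,
               '?': np.nan,
               'female': 0,
               'male': 1}

hepatitis_data.replace(replacements, inplace = True)
# Cast every column to float so that NaN can be represented
hepatitis_data = hepatitis_data.astype(float)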

In the EDA, we dropped all `NaN` values. Here, we need to evaluate the best method to handle them.  
There are several ways to deal with missing data, but none of them is perfect. The first step is to understand why the data went missing. In our case, we can guess that the missing values in the categorical variables could be due to the absence of the feature, which was left blank instead of being recorded as 'no', or to the feature not being tested. In contrast, missing values in continuous variables could be explained by biochemical studies not being performed on that particular patient.  

In the case of the categorical variables, the values could be **Missing at Random** (the fact that the value is missing has nothing to do with the hypothetical value) or **Missing not at Random** (the missingness depends on the hypothetical value). In the first case, we could drop the `NaN` values safely, while in the second it would not be safe to drop them, because the missingness itself tells us something about the hypothetical value. Accepting that risk, we will treat the missing values as missing at random.  
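
As a rough illustration of how one might probe this assumption, the sketch below compares the fraction of missing values in one feature across the two outcome classes. The toy values are hypothetical and are not taken from the hepatitis dataset. A large gap between classes would only hint that missingness is related to the outcome; it cannot confirm or rule out that the data are missing not at random, since that depends on the unobserved values.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, NOT the hepatitis dataset).
toy = pd.DataFrame({
    "LIVER_BIG": [1.0, np.nan, 0.0, np.nan, 1.0, 1.0, np.nan, 0.0],
    "Class":     [1.0, 0.0,    1.0, 0.0,    1.0, 1.0, 1.0,    0.0],
})

# Flag the missing entries, then average the flags within each class:
# the mean of a boolean series is the fraction of True values.
is_missing = toy["LIVER_BIG"].isnull()
missing_rate_by_class = is_missing.groupby(toy["Class"]).mean()

print(missing_rate_by_class)
```

The same comparison could be repeated for each column that contains `NaN` values.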

On the other hand, the missing data in the continuous variables could be explained as **Missing at Random** and could safely be dropped. Because, as we can see below, the number of `NaN` values is high in some of the variables (e.g. PROTIME), we could use pairwise deletion to make sure we don't end up with too few cases.

In [38]:
hepatitis_data.isnull().sum()

AGE                 0
SEX                 0
STEROID             1
ANTIVIRALS          0
FATIGUE             1
MALAISE             1
ANOREXIA            1
LIVER_BIG          10
LIVER_FIRM         11
SPLEEN_PALPABLE     5
SPIDERS             5
ASCITES             5
VARICES             5
BILIRUBIN           6
ALK_PHOSPHATE      29
SGOT                4
ALBUMIN            16
PROTIME            67
HISTOLOGY           0
Class               0
dtype: int64
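
To make the difference between listwise and pairwise deletion concrete, here is a minimal sketch on a toy dataframe. The values are hypothetical; only the column names are borrowed from the dataset. Listwise deletion drops every row that has any missing value, while pairwise deletion drops rows only when they are missing a value in the columns used by a given analysis.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values); PROTIME is mostly missing, as in the counts above.
toy = pd.DataFrame({
    "BILIRUBIN": [1.0, 0.9, np.nan, 0.7, 1.2],
    "ALBUMIN":   [4.0, 3.5, 4.2, np.nan, 2.9],
    "PROTIME":   [np.nan, 62.0, np.nan, 80.0, np.nan],
})

# Listwise deletion: keep only rows that are complete in every column.
listwise = toy.dropna()

# Pairwise deletion: for an analysis of BILIRUBIN vs ALBUMIN, drop only
# the rows that are missing one of those two columns.
pair = toy[["BILIRUBIN", "ALBUMIN"]].dropna()

print(len(listwise), "row(s) survive listwise deletion")
print(len(pair), "row(s) survive pairwise deletion for this pair")

# Series.corr excludes missing values, so it uses the same rows as `pair`.
print(toy["BILIRUBIN"].corr(toy["ALBUMIN"]))
```

The trade-off is that each pairwise analysis may be based on a different subset of patients, so results from different pairs are not always directly comparable.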

**Training and Test datasets**

In order to train and test our model, we need to split our dataset into two subsets: the training dataset and the test dataset. 
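
A split like this can be sketched with `train_test_split`, which is already imported above. The example uses synthetic data rather than the hepatitis dataframe, and the `test_size`, `random_state` and `stratify` settings are assumptions chosen for illustration, not necessarily the ones used in this analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and imbalanced binary labels (hypothetical data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([0] * 20 + [1] * 80)

# Hold out 20% of the rows for testing. stratify=y keeps the class
# proportions the same in both subsets, and random_state makes the
# split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)
```

Stratifying can be helpful when one class is much rarer than the other, since a purely random split could leave very few examples of the rare class in the test set.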