First, run the `%matplotlib notebook` magic command so that figures are rendered as interactive plots.

In [1]:
%matplotlib notebook

### Loading modules

We load the modules we need into our Python environment with the `import` statement.

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns 
sns.set(style="whitegrid", color_codes=True)

### Loading the data

Because the dataset was downloaded as a CSV file, we will use the **Pandas** function `read_csv`, which reads the file directly into a `DataFrame`.

In [19]:
hepatitis_data = pd.read_csv("dataset_55_hepatitis.csv")

We can check that the shape of our DataFrame matches the specifications provided for the dataset: 155 patients (rows) and 19 features + 1 class (columns).

In [20]:
hepatitis_data.shape

(155, 20)

As we can see above, the dataset has 155 rows, one per patient included in the study, and 20 columns: the 19 features collected for each patient plus the class label.

### Exploratory Analysis

An important part of making predictions with machine learning techniques is Exploratory Data Analysis (EDA). EDA helps us get to know the data: we look at it from different perspectives and describe and summarize it, without making any assumptions, in order to detect potential problems.

First, we inspect our data to see whether it needs cleaning. We start with the `head` method, which shows the first 5 rows of our DataFrame.

In [21]:
hepatitis_data.head()

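| | AGE | SEX | STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER_BIG | LIVER_FIRM | SPLEEN_PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK_PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY | Class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 30 | male | no | no | no | no | no | no | no | no | no | no | no | 1.0 | 85 | 18 | 4.0 | ? | no | LIVE |
| 1 | 50 | female | no | no | yes | no | no | no | no | no | no | no | no | 0.9 | 135 | 42 | 3.5 | ? | no | LIVE |
| 2 | 78 | female | yes | no | yes | no | no | yes | no | no | no | no | no | 0.7 | 96 | 32 | 4.0 | ? | no | LIVE |
| 3 | 31 | female | ? | yes | no | no | no | yes | no | no | no | no | no | 0.7 | 46 | 52 | 4.0 | 80 | no | LIVE |
| 4 | 34 | female | yes | no | no | no | no | yes | no | no | no | no | no | 1.0 | ? | 200 | 4.0 | ? | no | LIVE |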


As we can see above, missing values are marked with the '?' symbol. Knowing the data types of the variables in our dataset is also crucial. We can check them with the `dtypes` attribute.

In [22]:
hepatitis_data.dtypes

AGE                 int64
SEX                object
STEROID            object
ANTIVIRALS         object
FATIGUE            object
MALAISE            object
ANOREXIA           object
LIVER_BIG          object
LIVER_FIRM         object
SPLEEN_PALPABLE    object
SPIDERS            object
ASCITES            object
VARICES            object
BILIRUBIN          object
ALK_PHOSPHATE      object
SGOT               object
ALBUMIN            object
PROTIME            object
HISTOLOGY          object
Class              object
dtype: object

As we can see above, 19 of our 20 variables have the `object` data type. Some of these variables are categorical and some are numerical; the numerical ones are read as `object` because they contain the '?' placeholder.
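One alternative, which this notebook does not use, is to tell `read_csv` about the placeholder through its `na_values` parameter, so that numeric columns are parsed as numbers from the start. The sketch below illustrates this on a hypothetical three-row excerpt held in memory (`csv_text` is made up for the example and is not the real file).

```python
import io

import pandas as pd

# Hypothetical excerpt in the same layout as the dataset (not the real file)
csv_text = """AGE,STEROID,ALK_PHOSPHATE,PROTIME
30,no,85,?
31,?,46,80
34,yes,?,?
"""

# Default parsing: '?' is kept as text, so PROTIME is not a numeric column
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values="?", the placeholder becomes NaN and PROTIME is parsed as float
parsed = pd.read_csv(io.StringIO(csv_text), na_values="?")

# Number of missing values per column after parsing
missing_per_column = parsed.isna().sum()

print(raw.dtypes)
print(parsed.dtypes)
print(missing_per_column)
```

Categorical columns such as `STEROID` would still hold 'yes'/'no' strings after this step, so they need to be encoded separately.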

Because machine learning algorithms require numerical data, we will convert the categorical values 'no' and 'yes' to 0 and 1, respectively. We also convert the binary survival variable (`Class`), currently encoded as 'DIE' and 'LIVE', to numerical categories (0 and 1, respectively). The same mapping encodes `SEX` ('female' as 0, 'male' as 1) and turns the '?' placeholder into `NaN`. We will use the `replace` method for this task.

In [23]:
replacements = {'no': 0,
               'yes': 1,
               'DIE': 0,
               'LIVE': 1,
               '?': np.nan,
               'female': 0,
               'male': 1}

hepatitis_data.replace(replacements, inplace = True)

Lastly, we will convert all of our columns in the dataset to **float** type.

In [24]:
hepatitis_data = hepatitis_data.astype(float)
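To see what these two steps do together, here is a minimal sketch on a hypothetical three-row frame (`toy` and its values are invented for illustration). It assumes the columns start out as `object`, as in the `dtypes` output above, and checks that every column ends up as `float64` with '?' turned into `NaN`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw data; every column starts as object
toy = pd.DataFrame(
    {
        "SEX": ["male", "female", "female"],
        "STEROID": ["no", "?", "yes"],
        "PROTIME": ["?", "80", "?"],
        "Class": ["LIVE", "DIE", "LIVE"],
    },
    dtype=object,
)

# Same mapping as in the notebook
replacements = {"no": 0, "yes": 1, "DIE": 0, "LIVE": 1,
                "?": np.nan, "female": 0, "male": 1}

# Map the labels to numbers, then cast everything (including "80") to float
converted = toy.replace(replacements).astype(float)

print(converted.dtypes)
print(converted)
```

Numeric strings such as "80" are left untouched by `replace` and are only turned into numbers by `astype(float)`, which is why both steps are needed.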

**Class Imbalance**  
Class imbalance occurs when the number of observations in one class is significantly lower than the number in the other class. Machine learning algorithms tend to perform well when the classes contain similar numbers of observations, but when the imbalance is high (for example, a 90%-10% split), problems arise that lead to misclassification.

In order to check for class imbalance in our dataset, we compute the percentage of patients in each class.

In [9]:
total_of_patients = hepatitis_data.shape[0]
total_of_live_patients = (np.sum(hepatitis_data['Class'] == 1)/total_of_patients)*100
total_of_dead_patients = (np.sum(hepatitis_data['Class'] == 0)/total_of_patients)*100
print("Living patients:", round(total_of_live_patients,2),"%")
print("Dead patients:", round(total_of_dead_patients,2),"%")

Living patients: 79.35 %
Dead patients: 20.65 %
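As a side note, the same kind of percentages can be obtained more compactly with `value_counts(normalize=True)`. The sketch below uses a hypothetical label Series (`labels`, five made-up values) rather than the real `Class` column.

```python
import pandas as pd

# Hypothetical labels (1 = LIVE, 0 = DIE); not the real dataset
labels = pd.Series([1.0, 1.0, 1.0, 1.0, 0.0], name="Class")

# normalize=True returns the fraction of rows per class instead of raw counts
class_percentages = labels.value_counts(normalize=True) * 100

print("Living patients:", round(class_percentages.loc[1.0], 2), "%")
print("Dead patients:", round(class_percentages.loc[0.0], 2), "%")
```

By default `value_counts` skips `NaN` entries, so the percentages are relative to the rows that actually have a label.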


In [10]:
hepatitis_data.describe()

| | AGE | SEX | STEROID | ANTIVIRALS | FATIGUE | MALAISE | ANOREXIA | LIVER_BIG | LIVER_FIRM | SPLEEN_PALPABLE | SPIDERS | ASCITES | VARICES | BILIRUBIN | ALK_PHOSPHATE | SGOT | ALBUMIN | PROTIME | HISTOLOGY | Class |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 155.0 | 155.0 | 154.0 | 155.0 | 154.0 | 154.0 | 154.0 | 145.0 | 144.0 | 150.0 | 150.0 | 150.0 | 150.0 | 149.0 | 126.0 | 151.0 | 139.0 | 88.0 | 155.0 | 155.0 |
| mean | 41.2 | 0.103226 | 0.506494 | 0.154839 | 0.649351 | 0.396104 | 0.207792 | 0.827586 | 0.416667 | 0.2 | 0.34 | 0.133333 | 0.12 | 1.427517 | 105.325397 | 85.89404 | 3.817266 | 61.852273 | 0.451613 | 0.793548 |
| std | 12.565878 | 0.30524 | 0.501589 | 0.362923 | 0.47873 | 0.490682 | 0.407051 | 0.379049 | 0.494727 | 0.40134 | 0.475296 | 0.341073 | 0.32605 | 1.212149 | 51.508109 | 89.65089 | 0.651523 | 22.875244 | 0.499266 | 0.40607 |
| min | 7.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.3 | 26.0 | 14.0 | 2.1 | 0.0 | 0.0 | 0.0 |
| 25% | 32.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.7 | 74.25 | 31.5 | 3.4 | 46.0 | 0.0 | 1.0 |
| 50% | 39.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 85.0 | 58.0 | 4.0 | 61.0 | 0.0 | 1.0 |
| 75% | 50.0 | 0.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.5 | 132.25 | 100.5 | 4.2 | 76.25 | 1.0 | 1.0 |
| max | 78.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 8.0 | 295.0 | 648.0 | 6.4 | 100.0 | 1.0 | 1.0 |


### Graphical Exploratory Analysis

In [11]:
hepatitis_analysis = hepatitis_data.dropna()
interesting_values_x = ['AGE', 'BILIRUBIN', 'PROTIME', 'ALBUMIN', 'ASCITES', 'ALK_PHOSPHATE', 'SGOT']