# Heart Disease Model
---
13516062 - Yusuf Rahmat Pratama<br>
13516<br>
13516<br>
13516<br>
13516



In [1]:
import pandas as pd
import numpy as np

## Data Preparation & Preprocessing
---
Training data for heart disease are read using pandas' `read_csv()` and `read_excel()` functions and preprocessed so that they can be fitted to the learning models.

##### Data information
The description file is read here. It describes each attribute of the data and its domain. According to the description, Column 2, Column 3, Column 6, Column 7, Column 9, Column 11, Column 12, and Column 13 take values from a fixed discrete range, while the other attributes take discrete values without a fixed range. Column 14 contains the label for the data, so supervised learning is the most suitable learning method.

In [3]:
# data_info = pd.read_excel('../data/description.xlsx', header=1)
# data_info.fillna('-', inplace=True)
# data_info

##### Load training data
The training data are read and split into attributes and label: 13 columns serve as attributes and 1 column as the label. A total of 779 rows are read.

In [4]:
train = pd.read_csv('../data/tubes2_HeartDisease_train.csv')
train_data = train.iloc[:, :13]
train_target = train.iloc[:, 13:]
train_data.head()

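| | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | Column7 | Column8 | Column9 | Column10 | Column11 | Column12 | Column13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 54 | 1 | 4 | 125 | 216 | 0 | 0 | 140 | 0 | 0.0 | ? | ? | ? |
| 1 | 55 | 1 | 4 | 158 | 217 | 0 | 0 | 110 | 1 | 2.5 | 2 | ? | ? |
| 2 | 54 | 0 | 3 | 135 | 304 | 1 | 0 | 170 | 0 | 0.0 | 1 | 0 | 3 |
| 3 | 48 | 0 | 3 | 120 | 195 | 0 | 0 | 125 | 0 | 0.0 | ? | ? | ? |
| 4 | 50 | 1 | 4 | 120 | 0 | 0 | 1 | 156 | 1 | 0.0 | 1 | ? | 6 |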


##### Preprocessing data
The data are preprocessed here. Some rows contain unknown values in some of their attributes, so these values need to be handled before training.

The string `'?'`, which represents an unknown value, is replaced with NaN so that the data are uniformly numeric, and all columns are cast to float so that NaN can be stored (NumPy represents NaN as a float).

In [5]:
train_data = train_data.replace('?', np.nan).astype(float)
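
As a small illustration of this step, the sketch below applies the same `replace` and `astype` calls to a hypothetical three-row frame and counts the unknown values per column. The frame `toy` and its values are made up for illustration and are not taken from the training file.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are illustrative only.
toy = pd.DataFrame({
    "Column10": ["0.0", "2.5", "0.0"],
    "Column11": ["?", "2", "1"],
    "Column12": ["?", "?", "0"],
})

# Same transformation as in the notebook: '?' becomes NaN, then everything is cast to float.
toy_clean = toy.replace("?", np.nan).astype(float)

# Number of unknown values per column after the conversion.
missing_per_column = toy_clean.isna().sum()
print(missing_per_column)
```

Counting the NaN entries this way is a quick check of how much of each attribute will later be filled in by imputation.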

For now, unknown values are imputed with the mean of each attribute for the free-discrete attributes, and with the mode of each attribute for the ranged discrete attributes.

In [6]:
mode_attributes = ["Column2", "Column3", "Column6", "Column7", "Column9", "Column11", "Column12","Column13"]
mean_attributes = ["Column1", "Column4", "Column5", "Column8", "Column10"]
train_data[mode_attributes] = train_data[mode_attributes].fillna(train_data.mode().iloc[0])
train_data[mean_attributes] = train_data[mean_attributes].fillna(train_data.mean())
train_data.head()

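| | Column1 | Column2 | Column3 | Column4 | Column5 | Column6 | Column7 | Column8 | Column9 | Column10 | Column11 | Column12 | Column13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 54.0 | 1.0 | 4.0 | 125.0 | 216.0 | 0.0 | 0.0 | 140.0 | 0.0 | 0.0 | 2.0 | 0.0 | 3.0 |
| 1 | 55.0 | 1.0 | 4.0 | 158.0 | 217.0 | 0.0 | 0.0 | 110.0 | 1.0 | 2.5 | 2.0 | 0.0 | 3.0 |
| 2 | 54.0 | 0.0 | 3.0 | 135.0 | 304.0 | 1.0 | 0.0 | 170.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 |
| 3 | 48.0 | 0.0 | 3.0 | 120.0 | 195.0 | 0.0 | 0.0 | 125.0 | 0.0 | 0.0 | 2.0 | 0.0 | 3.0 |
| 4 | 50.0 | 1.0 | 4.0 | 120.0 | 0.0 | 0.0 | 1.0 | 156.0 | 1.0 | 0.0 | 1.0 | 0.0 | 6.0 |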
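
To make the effect of the two fill strategies concrete, here is a minimal worked example on a hypothetical two-column frame. The names `toy`, `mode_cols`, and `mean_cols` and all values are assumptions for illustration; the statistics are computed on the selected columns only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one unknown value per column.
toy = pd.DataFrame({
    "Column1": [54.0, np.nan, 48.0],   # filled with the column mean
    "Column11": [2.0, 2.0, np.nan],    # filled with the column mode
})

mode_cols = ["Column11"]
mean_cols = ["Column1"]

# mode() returns a frame of modes; the first row holds the most frequent value per column.
toy[mode_cols] = toy[mode_cols].fillna(toy[mode_cols].mode().iloc[0])
# mean() skips NaN by default, so the mean is taken over the known values only.
toy[mean_cols] = toy[mean_cols].fillna(toy[mean_cols].mean())

print(toy)
```

Here the missing `Column1` entry becomes the mean of 54 and 48, and the missing `Column11` entry becomes the most frequent value, 2.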


## Training Model
---

Here the training data are fitted to a model, which represents the hypothesis of the learning method used. Because the data are labelled with discrete classes, classification models are suitable. For this test, we use Naive Bayes, kNN (k-Nearest Neighbor), DTL (Decision Tree Learning), and MLP (Multi-layer Perceptron).

In [7]:
from sklearn.naive_bayes import GaussianNB
from sklearn import tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score
from sklearn.base import clone
from sklearn.model_selection import KFold
import warnings

warnings.filterwarnings('ignore')

##### Training Procedure
To train the models and measure their prediction performance, we use a **k-Fold Cross Validation** scheme: the data are split into k folds, and in each iteration one fold serves as testing data while the remaining folds serve as training data. The model is fitted on the training data and its predictions are checked against the testing data. The function below defaults to 100 folds; the runs in this notebook pass 5.
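
The splitting itself can be sketched on a toy array. The example below is an illustration only: the array `X_toy` is hypothetical, and it uses `KFold` without shuffling, which yields contiguous test folds.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical toy data: 10 samples with 2 features each.
X_toy = np.arange(20).reshape(10, 2)

kf = KFold(n_splits=5)

fold_sizes = []
test_folds = []
for train_idx, test_idx in kf.split(X_toy):
    # Each iteration holds out one fold for testing and trains on the rest.
    fold_sizes.append((len(train_idx), len(test_idx)))
    test_folds.append(test_idx.tolist())

print(fold_sizes)
print(test_folds)
```

Because the folds are contiguous when shuffling is off, the class balance of each fold depends on the row order of the input file.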

In [8]:
def prepare_and_execute_train_data(model, X, y, n_split=100):
    kf = KFold(n_splits = n_split)
    
    curr_model = clone(model)
    
    curr_fold = 1
    accuracy_scores = []
    precision_scores = []
    recall_scores = []
    
    for train_index, test_index in kf.split(X, y):
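        # Positional indexing with .iloc (.ix has been removed from pandas)
        X_train = np.array(X.iloc[train_index])
        X_test = np.array(X.iloc[test_index])
        # Flatten the labels to 1-D, the shape scikit-learn expects for y
        y_train = np.array(y.iloc[train_index]).ravel()
        y_test = np.array(y.iloc[test_index]).ravel()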
        
        curr_model.fit(X_train, y_train)
        
        curr_prediction = curr_model.predict(X_test)
        
        curr_accuracy = accuracy_score(y_test, curr_prediction)
        curr_precision = precision_score(y_test, curr_prediction, average='macro')
        curr_recall = recall_score(y_test, curr_prediction, average='macro')
        
#         print("Fold ", curr_fold)
#         print('Prediction Performance: ')
#         print('Accuracy:     ', curr_accuracy)
#         print('Precision:    ', curr_precision)
#         print('Recall:       ', curr_recall)
#         print()
        
        accuracy_scores.append(curr_accuracy)
        precision_scores.append(curr_precision)
        recall_scores.append(curr_recall)
        
        curr_fold += 1
    
    print('\nMean Prediction Performance: ')
    print('Mean Accuracy:     ', np.mean(accuracy_scores))
    print('Mean Precision:    ', np.mean(precision_scores))
    print('Mean Recall:       ', np.mean(recall_scores))
    
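    # Return the estimator fitted on the last fold; rebinding the local
    # name `model` would have no effect outside this function.
    return curr_model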
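
For comparison, scikit-learn's `cross_validate` can compute the same three mean scores in a few lines. The sketch below is not the notebook's procedure: it runs on synthetic two-class data generated for illustration, and it shuffles the folds (`shuffle=True`) because the toy labels are sorted by class.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_validate
from sklearn.naive_bayes import GaussianNB

# Synthetic two-class data, for illustration only.
rng = np.random.default_rng(0)
X_toy = np.vstack([
    rng.normal(0.0, 1.0, size=(30, 3)),
    rng.normal(3.0, 1.0, size=(30, 3)),
])
y_toy = np.array([0] * 30 + [1] * 30)

metrics = ["accuracy", "precision_macro", "recall_macro"]
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# cross_validate clones the estimator for every fold and returns per-fold scores.
scores = cross_validate(GaussianNB(), X_toy, y_toy, cv=cv, scoring=metrics)

mean_scores = {name: scores[f"test_{name}"].mean() for name in metrics}
print(mean_scores)
```

Unlike the hand-written loop, this fits a fresh clone of the estimator on each fold, so no state carries over between iterations.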

##### Naive Bayes
Here the Gaussian Naive Bayes classifier is used to fit the learning model.

In [12]:
nb = GaussianNB()
prepare_and_execute_train_data(nb, train_data, train_target, 5)


Mean Prediction Performance:
Mean Accuracy:      0.5468982630272953
Mean Precision:     0.34799892240935704
Mean Recall:        0.3457177102350929


##### Decision Tree Learning
The Decision Tree Classifier model is used to fit the learning model.

In [14]:
dtc = tree.DecisionTreeClassifier()
prepare_and_execute_train_data(dtc, train_data, train_target, 5)


Mean Prediction Performance:
Mean Accuracy:      0.46210090984284535
Mean Precision:     0.30875412303973127
Mean Recall:        0.29596686239769987


##### k-Nearest Neighbor
The kNN classifier is used to fit the learning model.

In [15]:
knn = KNeighborsClassifier()
prepare_and_execute_train_data(knn, train_data, train_target, 5)


Mean Prediction Performance:
Mean Accuracy:      0.48778329197684034
Mean Precision:     0.31210306473609395
Mean Recall:        0.2759087830257295


##### Multi-layer Perceptron
Here the MLP Classifier is used to fit the learning model.

In [21]:
mlp = MLPClassifier(max_iter=1000)
prepare_and_execute_train_data(mlp, train_data, train_target, 5)


Mean Prediction Performance:
Mean Accuracy:      0.49678246484698096
Mean Precision:     0.32610126589910227
Mean Recall:        0.3130386501599783


## Model Finalization and Export
---
The model with the best prediction performance is chosen and exported as a Sklearn model for use in predicting (classifying) test data.

In [62]:
# sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
import joblib

##### Choose the best-scored model
The model with the best prediction performance is selected here, ready to be exported. In this version of the notebook, `chosen_model` is still an empty placeholder list.

In [65]:
chosen_model = []
chosen_model

[]
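
One way to make the selection explicit is to tabulate the mean scores reported in the training section and take the row with the highest accuracy. The sketch below hard-codes those printed values, rounded to four decimals. Ranking by accuracy alone is an assumption; the notebook does not state which metric decides.

```python
import pandas as pd

# Mean 5-fold scores copied from the outputs in the training section (rounded).
results = pd.DataFrame(
    {
        "accuracy": [0.5469, 0.4621, 0.4878, 0.4968],
        "precision": [0.3480, 0.3088, 0.3121, 0.3261],
        "recall": [0.3457, 0.2960, 0.2759, 0.3130],
    },
    index=[
        "GaussianNB",
        "DecisionTreeClassifier",
        "KNeighborsClassifier",
        "MLPClassifier",
    ],
)

# Name of the classifier with the highest mean accuracy.
best_name = results["accuracy"].idxmax()
print(best_name)
```

With these numbers the Gaussian Naive Bayes classifier ranks first on all three metrics, so the choice does not depend on which of them is used.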

##### Export model to external file
Here the finalized model is dumped to an external file using `joblib.dump()`. The exported file can later be loaded to predict the test data.

In [69]:
joblib.dump(chosen_model, '../models/heart_disease.joblib', compress=1)

['../models/heart_disease.joblib']
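
Loading the file back is the counterpart of the export. The following sketch shows a dump and load round trip on a small classifier trained on hypothetical data; it writes to a temporary directory rather than `../models/`, and the file name `toy_model.joblib` is made up for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical toy data, for illustration only.
X_toy = np.array([[0.0, 1.0], [0.2, 0.9], [3.0, 4.0], [3.1, 4.2]])
y_toy = np.array([0, 0, 1, 1])

model = GaussianNB().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")
    # dump() writes the fitted estimator; load() restores it as a Python object.
    joblib.dump(model, path, compress=1)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

A round trip like this only works reliably when the loading environment uses a scikit-learn version compatible with the one that wrote the file.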