# Introduction
This notebook performs predictive modeling on the Heart Disease UCI dataset (https://www.kaggle.com/ronitf/heart-disease-uci) to identify relationships between heart disease and various other features.

## Step 1:  Select and load Python libraries
The Python libraries loaded for this phase of the project include pandas, NumPy, scikit-learn, Matplotlib, and Seaborn.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

## Step 2:  Load Conditioned Dataset from Exploratory Data Analysis
The dataset was conditioned as part of the Exploratory Data Analysis portion of this project and saved as a pickle file for use in this phase of the project.

In [2]:
locData=r'C:\Users\Lisa\Documents\Training\CoderGirl\Project\Data'
df=pd.read_pickle(locData+'\\heart.pkl')
print('Dataframe Shape:  {}'.format(df.shape))
df.head()

Dataframe Shape:  (302, 14)


Unnamed: 0,age,gender,chest_pain_type,resting_blood_pressure,cholesterol,fasting_blood_sugar,rest_ecg,max_heart_rate_achieved,exercise_induced_angina,st_depression,st_slope,num_major_vessels,thalassemia,target
0,63,1,3,145,233.0,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250.0,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204.0,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236.0,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354.0,0,1,163,1,0.6,2,0,2,1


# Step 3:  Reorder data set
The dataset appears to be organized so that the patients who have heart disease are listed first, followed by the patients who do not.  The dataframe was shuffled so that the heart disease target column is randomly ordered, as shown below.

In [3]:
dfReOrdered=df.sample(frac=1).reset_index(drop=True)
dfReOrdered.head()

Unnamed: 0,age,gender,chest_pain_type,resting_blood_pressure,cholesterol,fasting_blood_sugar,rest_ecg,max_heart_rate_achieved,exercise_induced_angina,st_depression,st_slope,num_major_vessels,thalassemia,target
0,42,1,2,130,180.0,0,1,150,0,0.0,2,0,2,1
1,58,0,1,136,319.0,1,0,152,0,0.0,2,2,2,0
2,68,1,2,180,274.0,1,0,150,1,1.6,1,0,3,0
3,62,0,0,138,294.0,1,1,106,0,1.9,1,3,2,0
4,47,1,2,108,243.0,0,1,152,0,0.0,2,0,2,0
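The shuffle above produces a different order on every run. A minimal sketch of a reproducible variant follows, assuming a toy dataframe (`df_toy`) in place of the real data and an arbitrary seed of 42; neither is part of the original notebook.

```python
import pandas as pd

# Toy stand-in for the heart dataframe (hypothetical values),
# sorted so that all positive targets come first.
df_toy = pd.DataFrame({"age": [63, 37, 41, 56, 57, 58],
                       "target": [1, 1, 1, 0, 0, 0]})

# random_state pins the permutation so repeated runs give the same order.
# The seed value 42 is arbitrary.
shuffled = df_toy.sample(frac=1, random_state=42).reset_index(drop=True)
shuffled_again = df_toy.sample(frac=1, random_state=42).reset_index(drop=True)

# Same seed, same order; the rows themselves are unchanged.
print(shuffled.equals(shuffled_again))
print(sorted(shuffled["age"]) == sorted(df_toy["age"]))
```

Fixing the seed makes downstream accuracy numbers repeatable, which helps when the narrative text quotes specific scores.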


# Step 4:  Separate the predictor and response variables.
The response variable (i.e. what we are trying to predict) is the target column of the dataframe, and the predictor variables (i.e. variables used to predict) are the other columns of the dataframe. 

In [4]:
y=dfReOrdered.target
print(y[:5])
x=dfReOrdered.drop('target',axis=1)
x.head()

0    1
1    0
2    0
3    0
4    0
Name: target, dtype: int64


Unnamed: 0,age,gender,chest_pain_type,resting_blood_pressure,cholesterol,fasting_blood_sugar,rest_ecg,max_heart_rate_achieved,exercise_induced_angina,st_depression,st_slope,num_major_vessels,thalassemia
0,42,1,2,130,180.0,0,1,150,0,0.0,2,0,2
1,58,0,1,136,319.0,1,0,152,0,0.0,2,2,2
2,68,1,2,180,274.0,1,0,150,1,1.6,1,0,3
3,62,0,0,138,294.0,1,1,106,0,1.9,1,3,2
4,47,1,2,108,243.0,0,1,152,0,0.0,2,0,2
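Because the later steps compare models on accuracy, it can help to confirm that the two classes in the response variable are roughly balanced. The sketch below uses a small hypothetical series (`y_demo`) rather than the real target column.

```python
import pandas as pd

# Hypothetical response variable; the real notebook would use y from above.
y_demo = pd.Series([1, 0, 0, 0, 0, 1, 1, 1, 0, 1], name="target")

counts = y_demo.value_counts()                 # absolute count per class
shares = y_demo.value_counts(normalize=True)   # fraction per class

print(counts)
print(shares)
```

If one class dominated, accuracy alone would be a misleading metric and the sensitivity and specificity computed in Step 8 would carry more weight.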


# Step 5:  Separate the dataset into training and test groups, and standardize the x-dataframe

The training and test groups were selected using the train_test_split function in the sklearn model_selection tool box.  Note that a separate validation group was not held out for this analysis: the amount of data is limited, and GridSearchCV in Step 7 applies cross-validation for hyperparameter tuning.

The min-max scaler from the sklearn pre-processing toolbox was selected for feature standardization.  This was selected over the standard normal standardization as not all features appeared to be normally distributed when visually inspected.
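As a worked illustration of min-max scaling, the sketch below applies the formula (x - min) / (max - min) by hand, learning the ranges from the training rows only. The helper names (`fit_min_max`, `apply_min_max`) and the sample values are hypothetical, and the handling of constant columns is a simplification rather than a copy of the sklearn behavior.

```python
def fit_min_max(rows):
    """Return the per-column (min, max) learned from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, ranges):
    """Scale each value to (x - min) / (max - min) using the training ranges."""
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column is mapped to 0.0 here (simplifying assumption).
            (value - lo) / (hi - lo) if hi != lo else 0.0
            for value, (lo, hi) in zip(row, ranges)
        ])
    return scaled


# Hypothetical (age, resting blood pressure) rows.
train = [[40, 120], [50, 140], [60, 160]]
test = [[45, 170]]

ranges = fit_min_max(train)            # learned from training data only
train_scaled = apply_min_max(train, ranges)
test_scaled = apply_min_max(test, ranges)

print(train_scaled)
print(test_scaled)
```

Training values land in [0, 1] by construction, while a test value outside the training range (170 here) scales to more than 1. That is expected when the scaler is fitted on the training group alone.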

In [5]:
xtrain,xtest,ytrain,ytest=train_test_split(x,y)

scaler=MinMaxScaler().fit(xtrain)
xtrainScaled=scaler.transform(xtrain) 
xtestScaled=scaler.transform(xtest) 
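The split above is unseeded, so the test group changes on every run. A sketch of a seeded, stratified split is shown below on synthetic data; the array names, the 60/40 class balance, the test_size of 0.25, and the seed are all assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 3 features, 60/40 class balance.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = np.array([1] * 60 + [0] * 40)

# stratify keeps the class proportions similar in both groups;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

print(X_tr.shape, X_te.shape)
print(y_tr.mean(), y_te.mean())
```

With a dataset of about 300 rows, stratifying guards against a test group that happens to over-represent one class.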



# Step 6:  Evaluate commonly used binary classifiers
Several commonly used binary classifiers were evaluated with default parameters to determine which predictive model may require the least amount of tuning to achieve the most accurate heart disease predictive model.

The random forest classification algorithm was selected for use as a predictive model for this project, as it is commonly used and had a top accuracy score (i.e. accuracy = 0.83).  Note, the Extra Tree Classification model had the highest accuracy score (i.e. accuracy = 0.84); however, it is less widely used than the random forest algorithm and accordingly was not selected.  These scores do not match the output printed below, most likely because the train/test split and the models are not seeded, so the accuracies change each time the notebook is executed.

In [6]:
classifiers=[['Logistic Regression :',LogisticRegression()],
       ['Decision Tree Classification :',DecisionTreeClassifier()],
       ['Random Forest Classification :',RandomForestClassifier()],
       ['Gradient Boosting Classification :', GradientBoostingClassifier()],
       ['Ada Boosting Classification :',AdaBoostClassifier()],
       ['Extra Tree Classification :', ExtraTreesClassifier()],
       ['K-Neighbors Classification :',KNeighborsClassifier()],
       ['Support Vector Classification :',SVC()],
       ['Gaussian Naive Bayes :',GaussianNB()]]

cla_pred=[]

for name,model in classifiers:
    # Fit each classifier with default parameters and score it on the test group
    model.fit(xtrainScaled,ytrain)
    predictions = model.predict(xtestScaled)
    accuracy = accuracy_score(ytest,predictions)
    cla_pred.append(accuracy)
    print(name,accuracy)

Logistic Regression : 0.7763157894736842
Decision Tree Classification : 0.7763157894736842
Random Forest Classification : 0.7631578947368421
Gradient Boosting Classification : 0.7236842105263158
Ada Boosting Classification : 0.75
Extra Tree Classification : 0.8289473684210527
K-Neighbors Classification : 0.8026315789473685
Support Vector Classification : 0.7368421052631579
Gaussian Naive Bayes : 0.7894736842105263
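With only 76 test rows, a single train/test split gives noisy accuracy estimates, so the ranking of classifiers can shift between runs. One way to steady the comparison is k-fold cross-validation. The sketch below assumes synthetic data from make_classification, a reduced set of three classifiers, and arbitrary seeds; it is not the comparison the notebook actually ran.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in with the same number of predictors as the heart data.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Extra Trees": ExtraTreesClassifier(n_estimators=50, random_state=0),
}

cv_summary = {}
for name, clf in candidates.items():
    # Five accuracy scores per model, one per held-out fold.
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_summary[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Reporting the spread alongside the mean shows whether a gap of one or two points between models is meaningful.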




# Step 7:  Tune the random forest algorithm for optimal performance
To identify the most influential hyper-parameters for tuning a random forest classifier, several blog posts and articles were reviewed.  The most influential hyper-parameters identified were max_features, n_estimators, max_depth, min_samples_leaf, and criterion.  Accordingly, a grid search was developed to identify the optimal value for each of these hyper-parameters, except max_features, which is not included in the grid below and is left at its default.  The optimal values are shown below.

References:
https://stackoverflow.com/questions/36107820/how-to-tune-parameters-in-random-forest-using-scikit-learn
https://www.analyticsvidhya.com/blog/2015/06/tuning-random-forest-model/
https://towardsdatascience.com/hyperparameter-tuning-the-random-forest-in-python-using-scikit-learn-28d2aa77dd74
https://www.kaggle.com/hadend/tuning-random-forest-parameters

In [7]:
model=RandomForestClassifier()
paramGrid=[{'n_estimators':[50,100,200],'max_depth':[3,5,7,9],
           'min_samples_leaf':[1,2,4],'criterion':['gini','entropy']}]
gridSearch=GridSearchCV(model,paramGrid,cv=5)
gridSearch.fit(xtrainScaled,ytrain)
gridSearch.best_params_



{'criterion': 'entropy',
 'max_depth': 9,
 'min_samples_leaf': 4,
 'n_estimators': 100}
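The sketch below shows one possible variation on the grid search: the scaler is placed inside a Pipeline so that it is refitted on each cross-validation fold, max_features is added to the grid, and the refitted best_estimator_ is reused directly instead of retyping the selected values. The synthetic data, the reduced grid, the step names (`scale`, `rf`), and the seeds are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the heart data.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

# Scaling inside the pipeline means each fold is scaled with its own training rows.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("rf", RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
grid = {
    "rf__n_estimators": [25, 50],
    "rf__max_depth": [3, 5],
    "rf__max_features": ["sqrt", None],
}

search = GridSearchCV(pipe, grid, cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_)

# best_estimator_ is already refitted on the full training group.
best_model = search.best_estimator_
test_accuracy = best_model.score(X_te, y_te)
print(test_accuracy)
```

Reusing best_estimator_ avoids any mismatch between the values the search reports and the values typed into the final model.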

# Step 8:  Execute the model
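Once the optimal parameter values were identified by the GridSearchCV function, a model was developed using the selected values, as shown below.  Accuracy, sensitivity, and specificity were also computed as model metrics.  Note that the values passed to the model below (criterion='gini', max_depth=5, min_samples_leaf=2, n_estimators=100) differ from the best parameters printed in Step 7; because neither the data split nor the random forest is seeded, the selected values can change from run to run.

In general, the model is over-trained, as indicated by the difference of roughly 0.15 between the training accuracy (0.94) and the test accuracy (0.79).

Sensitivity is the fraction of actual positives that are correctly identified (the recall for class 1).  For this case, approximately 95 percent were identified correctly.  Specificity is the fraction of actual negatives that are correctly identified (the recall for class 0); here only about 62 percent were correctly identified.  Both values can be read from the recall column of the classification report below.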

In [8]:
model=RandomForestClassifier(criterion='gini',max_depth=5,min_samples_leaf=2,n_estimators=100)
model.fit(xtrainScaled,ytrain)
yhat=model.predict(xtestScaled)
yhat_quant=model.predict_proba(xtestScaled)[:, 1]
yhat_bin=model.predict(xtestScaled)
print("Training Accuracy :", model.score(xtrainScaled,ytrain))
print("Testing Accuracy :", model.score(xtestScaled,ytest))

cr=classification_report(ytest,yhat)
print(cr)

Training Accuracy : 0.9380530973451328
Testing Accuracy : 0.7894736842105263
              precision    recall  f1-score   support

           0       0.92      0.62      0.74        37
           1       0.73      0.95      0.82        39

   micro avg       0.79      0.79      0.79        76
   macro avg       0.82      0.79      0.78        76
weighted avg       0.82      0.79      0.78        76



In [9]:
from sklearn.metrics import confusion_matrix
confusion_matrix=confusion_matrix(ytest,yhat_bin)
confusion_matrix

array([[23, 14],
       [ 2, 37]], dtype=int64)

In [10]:
total=sum(sum(confusion_matrix))

# sklearn layout: rows are true labels, columns are predicted labels,
# so the matrix is [[TN, FP], [FN, TP]]
sensitivity = confusion_matrix[1,1]/(confusion_matrix[1,1]+confusion_matrix[1,0])
print('Sensitivity :  {:.2f}'.format(sensitivity))

specificity = confusion_matrix[0,0]/(confusion_matrix[0,0]+confusion_matrix[0,1])
print('Specificity :  {:.2f}'.format(specificity))

Sensitivity :  0.95
Specificity :  0.62
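For reference, scikit-learn's confusion_matrix puts true labels on the rows and predicted labels on the columns, so a binary matrix reads [[TN, FP], [FN, TP]]. The sketch below derives the common rates from the matrix printed above; the helper name `binary_rates` is hypothetical.

```python
def binary_rates(matrix):
    """Derive rates from a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    return {
        "sensitivity": tp / (tp + fn),   # recall for class 1
        "specificity": tn / (tn + fp),   # recall for class 0
        "ppv": tp / (tp + fp),           # precision for class 1
        "npv": tn / (tn + fn),           # precision for class 0
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Matrix copied from the confusion_matrix output above.
rates = binary_rates([[23, 14], [2, 37]])
for metric, value in rates.items():
    print(f"{metric}: {value:.3f}")
```

Sensitivity and specificity divide along the rows (actual classes), while the predictive values divide down the columns (predicted classes). Mixing the two axes is an easy mistake to make with this layout.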


# Step 9:  Save model for next steps

In [11]:
import pickle
filename='finalized_model.sav'

xtrainScaled=pd.DataFrame(xtrainScaled,columns=xtrain.columns)
xtestScaled=pd.DataFrame(xtestScaled,columns=xtrain.columns)

# Use a context manager so the file is closed once the dump completes
with open(locData+'\\'+filename, 'wb') as f:
    pickle.dump([model,confusion_matrix,xtrainScaled,ytrain,xtestScaled,ytest,yhat_quant],f)
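A later notebook would need to read the saved objects back. A minimal round-trip sketch follows, assuming a temporary directory and a placeholder dictionary (`artifacts`) in place of the real model and data; only the file name is taken from the cell above.

```python
import pickle
import tempfile
from pathlib import Path

# Placeholder objects; the real notebook saves the model, matrices, and data.
artifacts = {"model": "placeholder for the fitted estimator", "y_test": [1, 0, 1]}

with tempfile.TemporaryDirectory() as tmp_dir:
    # pathlib builds the path without hard-coded separators.
    model_path = Path(tmp_dir) / "finalized_model.sav"

    with open(model_path, "wb") as handle:
        pickle.dump(artifacts, handle)

    with open(model_path, "rb") as handle:
        restored = pickle.load(handle)

print(restored == artifacts)
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same library version that saved it.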