#Heart Disease Prediction Project
Data set from http://archive.ics.uci.edu/ml/datasets/Heart+Disease?spm=a2c63.p38356.879954.4.742a1bbfIXlwNf

dataset name cleveland.data

Data Set Information:

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date. The "goal" field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

import os
print(os.listdir())

import warnings
warnings.filterwarnings('ignore')

In [4]:
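# Note: error_bad_lines was removed in pandas 2.0; on newer versions pass on_bad_lines='skip' instead
dataset = pd.read_csv('/dbfs/FileStore/tables/heart.csv', header=0, error_bad_lines=False)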

In [5]:
type(dataset)

In [6]:
dataset.shape

In [7]:
dataset.isnull().any()

In [8]:
dataset.head(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


####Column descriptions
1. age: age in years
2. sex: sex (1 = male; 0 = female)
3. cp: chest pain type
  -- Value 1: typical angina
  -- Value 2: atypical angina
  -- Value 3: non-anginal pain
  -- Value 4: asymptomatic
4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
5. chol: serum cholesterol in mg/dl
6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
7. restecg: resting electrocardiographic results
  -- Value 0: normal
  -- Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
  -- Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
8. thalach: maximum heart rate achieved
9. exang: exercise induced angina (1 = yes; 0 = no)
10. oldpeak: ST depression induced by exercise relative to rest
11. slope: the slope of the peak exercise ST segment
  -- Value 1: upsloping
  -- Value 2: flat
  -- Value 3: downsloping
12. ca: number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
14. target: 0 or 1 (in this notebook, 1 is counted as heart disease and 0 as no heart disease)

Note: the value codes above follow the original UCI documentation. In this heart.csv file the describe() output below shows different ranges: cp and thal run from 0 to 3, slope from 0 to 2, and ca from 0 to 4.

In [10]:
dataset.sample(5)

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
256,58,1,0,128,259,0,0,130,1,3.0,1,2,3,0
106,69,1,3,160,234,1,0,131,0,0.1,1,1,2,1
248,54,1,1,192,283,0,0,195,0,0.0,2,1,3,0
265,66,1,0,112,212,0,0,132,1,0.1,2,1,2,0
7,44,1,1,120,263,0,1,173,0,0.0,2,0,3,1


In [11]:
dataset.describe()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
count,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0,303.0
mean,54.366337,0.683168,0.966997,131.623762,246.264026,0.148515,0.528053,149.646865,0.326733,1.039604,1.39934,0.729373,2.313531,0.544554
std,9.082101,0.466011,1.032052,17.538143,51.830751,0.356198,0.52586,22.905161,0.469794,1.161075,0.616226,1.022606,0.612277,0.498835
min,29.0,0.0,0.0,94.0,126.0,0.0,0.0,71.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,47.5,0.0,0.0,120.0,211.0,0.0,0.0,133.5,0.0,0.0,1.0,0.0,2.0,0.0
50%,55.0,1.0,1.0,130.0,240.0,0.0,1.0,153.0,0.0,0.8,1.0,0.0,2.0,1.0
75%,61.0,1.0,2.0,140.0,274.5,0.0,1.0,166.0,1.0,1.6,2.0,1.0,3.0,1.0
max,77.0,1.0,3.0,200.0,564.0,1.0,2.0,202.0,1.0,6.2,2.0,4.0,3.0,1.0


##We need to scale (standardize or normalize) the dataset features.
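
A minimal sketch of what standardizing could look like, assuming z-score scaling of the continuous columns only (`age`, `trestbps`, `chol`, `thalach`, `oldpeak`). The five rows are copied from the `dataset.head(5)` output above; the helper name `standardize` is illustrative and not part of the notebook. scikit-learn's `StandardScaler` applies the same formula, using the population standard deviation.

```python
import numpy as np

# Continuous columns of the first five rows shown by dataset.head(5):
# age, trestbps, chol, thalach, oldpeak
sample = np.array([
    [63, 145, 233, 150, 2.3],
    [37, 130, 250, 187, 3.5],
    [41, 130, 204, 172, 1.4],
    [56, 120, 236, 178, 0.8],
    [57, 120, 354, 163, 0.6],
])

def standardize(values):
    """Rescale each column to zero mean and unit variance (z-scores)."""
    mean = values.mean(axis=0)
    std = values.std(axis=0)  # population standard deviation (ddof=0)
    return (values - mean) / std

scaled = standardize(sample)
print(scaled.round(2))
```

In a real pipeline the mean and standard deviation would normally be computed on the training split only and then reused unchanged on the test split.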

In [13]:
dataset.info()

In [14]:
dataset["target"].describe()

In [15]:
dataset["target"].unique()

This is a binary classification problem: the target variable takes the values 0 and 1.

Checking correlation between columns

In [17]:
plt.figure(figsize=(12, 9))
corr = dataset.corr(method = "pearson")

ax = sns.heatmap(
    corr, 
    vmin=-1, vmax=1, center=0,
    cmap=sns.diverging_palette(20, 220, n=200),
    square=True
)
ax.set_xticklabels(
    ax.get_xticklabels(),
    rotation=45,
    horizontalalignment='right'
);

###Explanation
The Pearson correlation heatmap shows that the coefficients for most pairs of features are close to 0. The correlation between slope and oldpeak is noticeably stronger, because both are related to ST depression on the ECG.
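
For reference, the Pearson score in each heatmap cell is the covariance of two columns divided by the product of their standard deviations. The sketch below recomputes it by hand for the `oldpeak` and `slope` values of the five rows shown by `dataset.head(5)`. These rows are used purely as an illustration (five rows are far too few to draw conclusions), and `pearson` is a hypothetical helper name.

```python
import numpy as np

# oldpeak and slope values from the five rows of dataset.head(5)
oldpeak = np.array([2.3, 3.5, 1.4, 0.8, 0.6])
slope = np.array([0, 0, 2, 2, 2])

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return numerator / denominator

r = pearson(oldpeak, slope)
print(round(r, 3))
```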

##Exploratory Data Analysis (EDA)

In [20]:
dataset.target.count()

In [21]:
target = dataset["target"]
print("target count value:")
target_cnt = dataset.target.value_counts()
print(target_cnt)
# Pass the data by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x=target)

In [22]:
dataset.groupby("target").mean()

target,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal
0,56.601449,0.826087,0.478261,134.398551,251.086957,0.15942,0.449275,139.101449,0.550725,1.585507,1.166667,1.166667,2.543478
1,52.49697,0.563636,1.375758,129.30303,242.230303,0.139394,0.593939,158.466667,0.139394,0.58303,1.593939,0.363636,2.121212


###Create dummy variables

In [24]:
cat_vars = ['sex','cp','fbs', 'restecg','exang','slope','ca','thal']
dummy_dataset = dataset
for var in cat_vars:
    # One-hot encode the column and append the indicator columns (e.g. cp_0 ... cp_3)
    cat_list = pd.get_dummies(dummy_dataset[var], prefix=var)
    dummy_dataset = dummy_dataset.join(cat_list)

In [25]:
dummy_dataset.columns.tolist()

In [26]:
cat_vars = ['sex','cp','fbs', 'restecg','exang','slope','ca','thal']
data_vars = dummy_dataset.columns.values.tolist()
to_keep = [i for i in data_vars if i not in cat_vars]

In [27]:
data_final = dummy_dataset[to_keep]
data_final.columns.values

###Over-sampling using SMOTE

In [29]:
%sh
#need to run ***ONCE*** to install SMOTE package
/home/ubuntu/databricks/python/bin/pip install 'imbalanced-learn<0.2.1'
pip freeze | grep imbalanced-learn

In [30]:
from imblearn import under_sampling, over_sampling
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

In [31]:
data_final.head()

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,target,sex_0,sex_1,cp_0,cp_1,cp_2,cp_3,fbs_0,fbs_1,restecg_0,restecg_1,restecg_2,exang_0,exang_1,slope_0,slope_1,slope_2,ca_0,ca_1,ca_2,ca_3,ca_4,thal_0,thal_1,thal_2,thal_3
0,63,145,233,150,2.3,1,0,1,0,0,0,1,0,1,1,0,0,1,0,1,0,0,1,0,0,0,0,0,1,0,0
1,37,130,250,187,3.5,1,0,1,0,0,1,0,1,0,0,1,0,1,0,1,0,0,1,0,0,0,0,0,0,1,0
2,41,130,204,172,1.4,1,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,1,1,0,0,0,0,0,0,1,0
3,56,120,236,178,0.8,1,0,1,0,1,0,0,1,0,0,1,0,1,0,0,0,1,1,0,0,0,0,0,0,1,0
4,57,120,354,163,0.6,1,1,0,1,0,0,0,1,0,0,1,0,0,1,0,0,1,1,0,0,0,0,0,0,1,0


In [32]:
X = data_final.loc[:, data_final.columns != 'target']
y = data_final.loc[:, data_final.columns == 'target']

# Named "smote" rather than "os" so the os module imported earlier is not shadowed
smote = SMOTE(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
columns = X_train.columns
# Oversample the training split only (newer imbalanced-learn releases name this method fit_resample)
os_data_X, os_data_y = smote.fit_sample(X_train, y_train)
os_data_X = pd.DataFrame(data=os_data_X, columns=columns)
os_data_y = pd.DataFrame(data=os_data_y, columns=['target'])
# Check the class counts and proportions after oversampling
print("Length of oversampled data is", len(os_data_X))
print("Number of no heart disease in oversampled data", len(os_data_y[os_data_y['target']==0]))
print("Number of heart disease in oversampled data", len(os_data_y[os_data_y['target']==1]))
print("Proportion of no heart disease data in oversampled data is", len(os_data_y[os_data_y['target']==0])/len(os_data_X))
print("Proportion of heart disease data in oversampled data is", len(os_data_y[os_data_y['target']==1])/len(os_data_X))

In [33]:
X_train.head(5)

Unnamed: 0,age,trestbps,chol,thalach,oldpeak,sex_0,sex_1,cp_0,cp_1,cp_2,cp_3,fbs_0,fbs_1,restecg_0,restecg_1,restecg_2,exang_0,exang_1,slope_0,slope_1,slope_2,ca_0,ca_1,ca_2,ca_3,ca_4,thal_0,thal_1,thal_2,thal_3
137,62,128,208,140,0.0,0,1,0,1,0,0,0,1,1,0,0,1,0,0,0,1,1,0,0,0,0,0,0,1,0
106,69,160,234,131,0.1,0,1,0,0,0,1,0,1,1,0,0,1,0,0,1,0,0,1,0,0,0,0,0,1,0
284,61,140,207,138,1.9,0,1,1,0,0,0,1,0,1,0,0,0,1,0,0,1,0,1,0,0,0,0,0,0,1
44,39,140,321,182,0.0,0,1,0,0,1,0,1,0,1,0,0,1,0,0,0,1,1,0,0,0,0,0,0,1,0
139,64,128,263,105,0.2,0,1,1,0,0,0,1,0,0,1,0,0,1,0,1,0,0,1,0,0,0,0,0,0,1


The training data are now perfectly balanced. Only the training data were oversampled: because the synthetic observations are generated from training rows alone, no information from the test data leaks into model training.
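
To make the idea of a synthetic observation concrete, here is a minimal NumPy sketch of the interpolation step behind SMOTE: each new point lies on the line segment between a minority-class sample and one of its nearest minority-class neighbours. This is an illustration only, not imbalanced-learn's implementation; `smote_like_samples`, its parameters, and the toy data are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=2, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))
        neighbour = neighbours[base, rng.integers(k)]
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neighbour] - minority[base])
    return synthetic

# Hypothetical two-feature minority class
minority = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [3.0, 3.0]])
new_rows = smote_like_samples(minority, n_new=3)
print(new_rows)
```

Because every synthetic row is built from existing rows, running this on the training split alone keeps the test rows out of the generated data.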

###Recursive Feature Elimination

Recursive Feature Elimination (RFE) is based on the idea of repeatedly constructing a model, choosing either the best or worst performing feature, setting that feature aside, and then repeating the process with the remaining features. This process is applied until all features in the dataset are exhausted. The goal of RFE is to select features by recursively considering smaller and smaller sets of features.
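
The loop described above can be sketched without scikit-learn. The version below rests on simplifying assumptions: it uses ordinary least squares as the model and the absolute coefficient as the importance score, whereas the notebook uses `RFE` with `LogisticRegression`. The name `rfe_sketch` and the synthetic data are hypothetical.

```python
import numpy as np

def rfe_sketch(X, y, n_keep):
    """Repeatedly refit and drop the feature with the smallest |coefficient|."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        # Standardize so that coefficient magnitudes are comparable
        Xs = X[:, remaining]
        Xs = (Xs - Xs.mean(axis=0)) / Xs.std(axis=0)
        coefs, *_ = np.linalg.lstsq(Xs, y - y.mean(), rcond=None)
        weakest = int(np.argmin(np.abs(coefs)))
        remaining.pop(weakest)  # set the weakest feature aside and refit
    return remaining

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Only features 0 and 2 influence the synthetic response
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

selected = rfe_sketch(X, y, n_keep=2)
print(selected)
```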

In [37]:
data_final_vars=data_final.columns.values.tolist()
y=['target']
X=[i for i in data_final_vars if i not in y]
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
rfe = RFE(logreg, n_features_to_select=20)
rfe = rfe.fit(os_data_X, os_data_y.values.ravel())
print(rfe.support_)
print(rfe.ranking_)

In [38]:
feature_index = rfe.get_support(True)
print(feature_index)

In [39]:
cols = ['thalach','oldpeak', 'sex_0', 'cp_0', 'cp_1', 'cp_2', 'fbs_1', 'restecg_0', 'restecg_1', 'restecg_2', 'exang_1', 'slope_0', 'slope_1','ca_0', 'ca_1', 'ca_2','ca_4', 'thal_0', 'thal_1']
X = os_data_X[cols]
y = os_data_y['target']

###Implementing the model

In [41]:
import statsmodels.api as sm

logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())

###Logistic Regression Model Fitting

In [43]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
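# Train on the oversampled training data and evaluate on the untouched test split from earlier,
# so that no synthetic observations end up in the test set
X_train, y_train = X, y
X_test, y_test = X_test[cols], np.ravel(y_test)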
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

In [44]:
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

In [45]:
y_pred

###Confusion Matrix

In [47]:
from sklearn.metrics import confusion_matrix
# Use a distinct name so the confusion_matrix function is not overwritten
confusion_matrix_logreg = confusion_matrix(y_test, y_pred)
print(confusion_matrix_logreg)

###Compute precision, recall, F-measure and support

In [49]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))

###Interpretation
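
As a reading aid, the columns of the classification report can be recomputed from a confusion matrix. The counts below are made up for illustration and are not results from this notebook. scikit-learn's `confusion_matrix` puts true labels on rows and predicted labels on columns, so for labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`.

```python
# Hypothetical confusion matrix: rows = true class, columns = predicted class
cm = [[40, 10],
      [5, 45]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)  # of the patients flagged as diseased, the share that truly are
recall = tp / (tp + fn)     # of the truly diseased patients, the share that were flagged
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
```

For a screening task such as this one, recall on class 1 is arguably the number to watch, since a false negative corresponds to a missed case of heart disease.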

###ROC Curve

In [52]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
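# AUC is computed from predicted probabilities (not hard class labels) so that it matches the plotted curve
logit_roc_auc = roc_auc_score(y_test, logreg.predict_proba(X_test)[:,1])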
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()
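
A note on the area reported in the legend: the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one, so it should be computed from continuous scores (such as `predict_proba`) rather than from hard 0/1 predictions. The pure-Python sketch below uses made-up labels and scores to show that the two inputs give different values; `auc_from_scores` is a hypothetical helper.

```python
def auc_from_scores(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, t in zip(scores, y_true) if t == 1]
    negatives = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up labels and predicted probabilities for class 1
y_true = [0, 0, 1, 1, 0, 1]
proba = [0.10, 0.40, 0.35, 0.80, 0.55, 0.60]
hard = [1 if p >= 0.5 else 0 for p in proba]  # thresholded predictions

auc_proba = auc_from_scores(y_true, proba)
auc_hard = auc_from_scores(y_true, hard)
print(auc_proba, auc_hard)
```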

###SVM Model Fitting

In [54]:
from sklearn import svm

sv = svm.SVC(kernel='linear', probability = True)
sv.fit(X_train, y_train)
y_pred_svm = sv.predict(X_test)
print('Accuracy of SVM classifier on test set: {:.2f}'.format(sv.score(X_test, y_test)))

In [55]:
from sklearn.metrics import confusion_matrix
confusion_matrix_svm = confusion_matrix(y_test, y_pred_svm)
print(confusion_matrix_svm)

In [56]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
# AUC is computed from predicted probabilities (not hard class labels) so that it matches the plotted curve
svm_roc_auc = roc_auc_score(y_test, sv.predict_proba(X_test)[:,1])
fpr, tpr, thresholds = roc_curve(y_test, sv.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='SVM (area = %0.2f)' % svm_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
# Separate file name so the logistic regression plot is not overwritten
plt.savefig('SVM_ROC')
plt.show()

###Decision Tree

In [58]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
max_accuracy = 0
# Try several seeds and keep the best one.
# Note: choosing the seed by test-set accuracy makes the reported score optimistic.
for x in range(200):
    dt = DecisionTreeClassifier(random_state=x)
    dt.fit(X_train,y_train)
    y_pred_dt = dt.predict(X_test)
    current_accuracy = round(accuracy_score(y_test,y_pred_dt)*100,2)
    if current_accuracy > max_accuracy:
        max_accuracy = current_accuracy
        best_x = x
#print(max_accuracy)
#print(best_x)


dt = DecisionTreeClassifier(random_state=best_x)
dt.fit(X_train,y_train)
y_pred_dt = dt.predict(X_test)

print('Accuracy of DT classifier on test set: {:.2f}'.format(dt.score(X_test, y_test)))

In [59]:
from sklearn.metrics import confusion_matrix
confusion_matrix_dt = confusion_matrix(y_test, y_pred_dt)
print(confusion_matrix_dt)

###Random Forest

In [61]:
from sklearn.ensemble import RandomForestClassifier

max_accuracy = 0

# Try several seeds and keep the best one.
# Note: choosing the seed by test-set accuracy makes the reported score optimistic.
for x in range(2000):
    rf = RandomForestClassifier(random_state=x)
    rf.fit(X_train,y_train)
    y_pred_rf = rf.predict(X_test)
    current_accuracy = round(accuracy_score(y_test,y_pred_rf)*100,2)
    if current_accuracy > max_accuracy:
        max_accuracy = current_accuracy
        best_x = x
#print(max_accuracy)
#print(best_x)

rf = RandomForestClassifier(random_state=best_x)
rf.fit(X_train,y_train)
y_pred_rf = rf.predict(X_test)

In [62]:
print('Accuracy of RF classifier on test set: {:.2f}'.format(rf.score(X_test, y_test)))

In [63]:
confusion_matrix_rf = confusion_matrix(y_test, y_pred_rf)
print(confusion_matrix_rf)

In [64]:
from sklearn.metrics import recall_score
print('Recall of RF classifier on test set: {:.2f}'.format(recall_score(y_test, y_pred_rf)))