# What causes Heart Disease?

# Contents

1. [Introduction](#section1)
2. [Load Data](#section2)
3. [Data Exploration](#section3)
4. [The Explanation](#section4)
5. [Conclusion](#section5)

# Introduction

Machine learning is good at finding latent patterns in data. This dataset has many features and one target, so many machine learning tools are well suited to the prediction task.
First, however, we need to explore the data and prepare it for those tools.

> ### Column introduction
* age:    age in years
* sex:  (1 = male; 0 = female)
* cp:   chest pain type
* trestbps:   resting blood pressure (in mm Hg on admission to the hospital)
* chol:   serum cholesterol in mg/dl
* fbs:    (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
* restecg:     resting electrocardiographic results
* thalach:   maximum heart rate achieved
* exang:   exercise induced angina (1 = yes; 0 = no)
* oldpeak:    ST depression induced by exercise relative to rest
* slope:  the slope of the peak exercise ST segment
* ca:       number of major vessels (0-3) colored by fluoroscopy
* thal:     3 = normal; 6 = fixed defect; 7 = reversible defect
* target: 1 = heart disease; 0 = no heart disease

# Load Data
* Load the data from the csv file

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
df = pd.read_csv('../input/heart.csv')

In [None]:
df.head()

In [None]:
df.describe()

* The data has 14 columns: 13 features and 1 target.

> # Data Exploration

In [None]:
df.isna().any()

In [None]:
df.info()

In [None]:
# Count the patients with and without heart disease
fig, ax = plt.subplots(figsize=(6.,5.))
sns.countplot(x= 'target', data=df, palette='Accent')

In [None]:
# Use a pair plot to look for relationships between variables
sns.pairplot(data = df)

The pair plot shows no obvious relationship between any two variables, so we need a more specific analysis.
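
One way to make the analysis more specific is to rank the features by how strongly they correlate with `target`. The sketch below is only an illustration: the helper name `correlation_with_target` and the toy values are assumptions, not part of the original notebook. With the real data you would pass `df` instead of `toy`.

```python
import pandas as pd

def correlation_with_target(frame, target_col="target"):
    """Pearson correlation of every numeric column with the target,
    sorted by absolute strength (strongest first)."""
    corr = frame.select_dtypes("number").corr()[target_col].drop(target_col)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Toy data, invented for illustration only (not taken from heart.csv)
toy = pd.DataFrame({
    "age":     [40, 50, 60, 45, 55, 65],
    "thalach": [180, 160, 120, 175, 150, 110],
    "target":  [1, 1, 0, 1, 0, 0],
})
ranking = correlation_with_target(toy)
print(ranking)
```

Correlation only captures linear association, so a low value here does not rule out a useful feature.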

In [None]:
# Heart disease counts by sex
plt.figure(figsize=(12, 8))
sns.countplot(x='sex', hue='target', data=df, palette='bwr')
plt.xlabel("Sex (0 = female, 1 = male)")
plt.title('Heart Disease Frequency by Sex')
plt.legend(['No Disease', 'Disease'])
plt.ylabel('Frequency')

In this dataset, women appear more likely than men to have heart disease.
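
A count plot shows raw counts, which can mislead when the groups differ in size. A rate per group is easier to compare. The sketch below assumes a hypothetical helper `disease_rate_by` and uses made-up rows; with the notebook's data the call would be `disease_rate_by(df, "sex")`.

```python
import pandas as pd

def disease_rate_by(frame, column, target_col="target"):
    """Share of rows with target == 1 within each level of `column`."""
    return frame.groupby(column)[target_col].mean()

# Toy data, invented for illustration only: 4 women (sex = 0), 6 men (sex = 1)
toy = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 1, 0, 0, 0],
})
rates = disease_rate_by(toy, "sex")
print(rates)
```

In the toy rows both groups have three positive cases, yet the rates differ because the group sizes differ.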

In [None]:
# Heart disease counts by age
plt.figure(figsize=(25,8), dpi=100)
sns.countplot(x = 'age', hue='target', data=df)
plt.title('Heart Disease Frequency for Ages')
plt.xticks(rotation=0)
plt.xlabel('Age')
plt.ylabel('Frequency')

Most heart disease cases in this dataset occur between the ages of 40 and 60.
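
Counts per age also depend on how many patients of each age are in the data. A rough check is to group ages into bands and look at both the band size and the disease rate. This is a sketch under assumptions: the helper name, the band edges and the toy rows are all illustrative choices, not values from the original analysis.

```python
import pandas as pd

def disease_rate_by_age_band(frame, bins=(20, 40, 60, 80)):
    """Group ages into left-closed bands and report, per band,
    the number of rows and the share with target == 1."""
    bands = pd.cut(frame["age"], bins=list(bins), right=False)
    return frame.groupby(bands, observed=False)["target"].agg(["count", "mean"])

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age":    [35, 45, 50, 58, 62, 70],
    "target": [0, 1, 1, 0, 1, 0],
})
summary = disease_rate_by_age_band(toy)
print(summary)
```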

### Chest pain type influence

In [None]:
# chest pain type
print("There are {} types of chest pain".format(len(df["cp"].unique())))

In [None]:
plt.figure(figsize=(12, 8))
sns.countplot(x ="cp", hue= "target", data=df)
plt.title("Disease count by chest pain type")
plt.legend(['No disease', 'Disease'])
plt.xlabel("Chest pain type")

### Resting blood pressure

In [None]:
df.groupby('target')['trestbps'].mean()

In [None]:
print("With disease, the average resting blood pressure is {}".format(df.groupby('target')['trestbps'].mean()[1]))
print("Without disease, the average resting blood pressure is {}".format(df.groupby('target')['trestbps'].mean()[0]))

In [None]:
# The blood pressure distribution
plt.figure(figsize=(8, 8))
sns.violinplot(x = 'target', y ='trestbps' ,data = df)
plt.title("Blood pressure difference")
plt.ylabel("Resting blood pressure")
plt.xlabel("Target (0 = No disease, 1= Disease)")

Resting blood pressure is similar in both groups, so it does not look like a strong indicator of heart disease.
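
To put a number on how small the gap between the groups is, one option (not used in the original analysis) is a standardized mean difference such as Cohen's d: the difference in means divided by the pooled standard deviation. The function below is a minimal sketch, and the sample values are made up.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference between two samples (pooled std, ddof=1)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (
        len(a) + len(b) - 2
    )
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Made-up resting blood pressure readings for two groups
d = cohens_d([130, 140, 150], [128, 138, 148])
print(round(d, 3))
```

With the notebook's data, the two inputs would be `df.loc[df.target == 0, 'trestbps']` and `df.loc[df.target == 1, 'trestbps']`. Values of d near zero mean the group distributions overlap heavily.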

### Maximum heart rate

In [None]:
print("With disease, the average maximum heart rate is {}".format(df.groupby('target')['thalach'].mean()[1]))
print("Without disease, the average maximum heart rate is {}".format(df.groupby('target')['thalach'].mean()[0]))

In [None]:
# The maximum heart rate distribution
plt.figure(figsize=(6,7))
sns.violinplot(x = 'target', y ='thalach' ,data = df)
plt.title("Maximum heart rate difference")
plt.ylabel("Maximum heart rate")
plt.xlabel("Target (0 = No disease, 1= Disease)")

In [None]:
sns.scatterplot(x = 'age', y = 'thalach',hue='target',data = df, palette='bwr')
plt.xlabel("Age")
plt.ylabel("Maximum Heart Rate")
plt.show()

> # Handle the dummy variables and standardize the data
We need to standardize the features before training the models.
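
Standardizing rescales each column to zero mean and unit variance, that is z = (x - mean) / std per column. The sketch below checks that a manual computation matches scikit-learn's `StandardScaler` on a few made-up values; the array contents are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: two columns on different scales
X = np.array([[50.0, 120.0],
              [60.0, 140.0],
              [70.0, 160.0]])

# Manual standardization; np.std defaults to the population std (ddof=0),
# which is also what StandardScaler uses
manual = (X - X.mean(axis=0)) / X.std(axis=0)

scaled = StandardScaler().fit_transform(X)
print(np.allclose(manual, scaled))
```

Distance-based and margin-based models such as KNN and SVM are sensitive to feature scale, which is the main reason for this step.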

In [None]:
# One-hot encode the three categorical columns: cp, slope, thal
a = pd.get_dummies(df['cp'], prefix='cp')
b = pd.get_dummies(df['slope'], prefix='slope')
c = pd.get_dummies(df['thal'], prefix='thal')

# new frame
frames = [df, a, b, c]
df_dummyed = pd.concat(frames, axis=1)
df_dummyed.drop(['cp', 'slope', 'thal'], axis=1, inplace= True)

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# Columns produced by get_dummies; these are left unscaled
dummy_cols = ['cp_0', 'cp_1', 'cp_2', 'cp_3', 'thal_0', 'thal_1', 'thal_2', 'thal_3',
              'slope_0', 'slope_1', 'slope_2']
numeric = df_dummyed.drop(['target'] + dummy_cols, axis=1)

scaler = StandardScaler()
scaler.fit(numeric)

In [None]:
scaled_features = scaler.transform(numeric)

In [None]:
# Rebuild a frame from the scaled columns, then re-attach the dummy columns
df_feat = pd.DataFrame(scaled_features, columns=numeric.columns)
df_feat = df_feat.join(df_dummyed[dummy_cols])
df_feat.head()

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df_feat, df['target'], test_size= 0.20, random_state=0)

> # Machine Learning

### Logistic Regression
Note: logistic regression is appropriate only when the dependent variable, and hence its error term, follows a binomial distribution.
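
For a binary outcome, logistic regression passes a linear score through the sigmoid function to get the probability of class 1. The sketch below checks this relationship on synthetic data from `make_classification`; the data and variable names are assumptions for illustration and are unrelated to heart.csv.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustration only)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression().fit(X, y)

# Linear score w.x + b for each row
z = model.decision_function(X)

# The sigmoid of the score should match the predicted probability of class 1
p_manual = 1.0 / (1.0 + np.exp(-z))
p_model = model.predict_proba(X)[:, 1]
print(np.allclose(p_manual, p_model))
```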

In [None]:
from sklearn.metrics import confusion_matrix, classification_report

In [None]:
precisions = []  # test accuracy (%) of each model, in the order they are trained
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train,y_train)
print("Test Accuracy {:.2f}%".format(lr.score(X_test,y_test)*100))
precisions.append(lr.score(X_test,y_test)*100)

In [None]:
pred_y = lr.predict(X_test)
print("Classification report:\n")
print(classification_report(y_test, pred_y))

print("Confusion matrix:\n")
print(confusion_matrix(y_test, pred_y))

### K-Nearest Neighbour (KNN) Classification

In [None]:
# KNN Model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 2)  # n_neighbors means k
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)

print("{} NN Score: {:.2f}%".format(2, knn.score(X_test, y_test)*100))

In [None]:
# Try to find the best k value
scoreList = []
for i in range(1,20):
    knn2 = KNeighborsClassifier(n_neighbors = i)  # n_neighbors means k
    knn2.fit(X_train, y_train)
    scoreList.append(knn2.score(X_test, y_test))
    
plt.plot(range(1,20), scoreList)
plt.xticks(np.arange(1,20,1))
plt.xlabel("K value")
plt.ylabel("Score")
plt.show()


print("Maximum KNN Score is {:.2f}%".format((max(scoreList))*100))
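
The loop above picks k by its score on the test set, which tends to make the reported accuracy optimistic. A common alternative is to choose k by cross-validation on the training data only and keep the test set for a single final check. The sketch below uses `GridSearchCV` on synthetic data; the data, the variable names and the 5-fold setting are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data standing in for the scaled features (illustration only)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Try k = 1..19 with 5-fold cross-validation on the training split only
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 20))},
    cv=5,
)
search.fit(X_tr, y_tr)

best_k = search.best_params_["n_neighbors"]
test_acc = search.score(X_te, y_te)  # accuracy of the refit best model
print("Best k: {}, test accuracy: {:.2f}%".format(best_k, test_acc * 100))
```

With the notebook's data, `X_tr` and `y_tr` would be replaced by `X_train` and `y_train`.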

In [None]:
# KNN Model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 10)  # n_neighbors means k
knn.fit(X_train, y_train)
prediction = knn.predict(X_test)

print("{} NN Score: {:.2f}%".format(knn.n_neighbors, knn.score(X_test, y_test)*100))
precisions.append(knn.score(X_test,y_test)*100)

In [None]:
pred_y = knn.predict(X_test)
print("Classification report:\n")
print(classification_report(y_test, pred_y))

print("Confusion matrix:\n")
print(confusion_matrix(y_test, pred_y))

### Support Vector Machine (SVM) Algorithm
* SVMs only work on numeric data
* A basic SVM only handles binary classification

In [None]:
from sklearn.svm import SVC
svm = SVC(random_state = 1)
svm.fit(X_train, y_train)
print("Test Accuracy of SVM Algorithm: {:.2f}%".format(svm.score(X_test,y_test)*100))
precisions.append(svm.score(X_test,y_test)*100)

In [None]:
pred_y = svm.predict(X_test)
print("Classification report:\n")
print(classification_report(y_test, pred_y))

print("Confusion matrix:\n")
print(confusion_matrix(y_test, pred_y))

### Naive Bayes Algorithm

In [None]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(X_train, y_train)
print("Accuracy of Naive Bayes: {:.2f}%".format(nb.score(X_test,y_test)*100))
precisions.append(nb.score(X_test,y_test)*100)

In [None]:
pred_y = nb.predict(X_test)
print("Classification report:\n")
print(classification_report(y_test, pred_y))

print("Confusion matrix:\n")
print(confusion_matrix(y_test, pred_y))

### Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
print("Decision Tree Test Accuracy {:.2f}%".format(dtc.score(X_test, y_test)*100))
precisions.append(dtc.score(X_test, y_test)*100)

In [None]:
pred_y = dtc.predict(X_test)
print("Classification report:\n")
print(classification_report(y_test, pred_y))

print("Confusion matrix:\n")
print(confusion_matrix(y_test, pred_y))

### Random forest

In [None]:
# Random Forest Classification
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 1000, random_state = 1)
rf.fit(X_train, y_train)
print("Random Forest Algorithm Accuracy Score : {:.2f}%".format(rf.score(X_test,y_test)*100))
precisions.append(rf.score(X_test,y_test)*100)

In [None]:
pred_y = rf.predict(X_test)
print("Classification report:\n")
print(classification_report(y_test, pred_y))

print("Confusion matrix:\n")
print(confusion_matrix(y_test, pred_y))

### BOOM! Why not try a boosting method?

In [None]:
from sklearn.ensemble import AdaBoostClassifier
abc = AdaBoostClassifier(n_estimators=100)
abc.fit(X_train, y_train)
print("AdaBoost Accuracy Score : {:.2f}%".format(abc.score(X_test,y_test)*100))
precisions.append(abc.score(X_test,y_test)*100)

In [None]:
pred_y = abc.predict(X_test)
print("Classification report:\n")
print(classification_report(y_test, pred_y))

print("Confusion matrix:\n")
print(confusion_matrix(y_test, pred_y))

> # Conclusion

In [None]:
methods = ["Logistic Regression", "KNN", "SVM", "Naive Bayes", "Decision Tree", "Random Forest", "AdaBoost"]
sns.set_style("whitegrid")
plt.figure(figsize=(16,5))
plt.yticks(np.arange(0,100,10))
plt.ylabel("Accuracy %")
plt.xlabel("Algorithms")
sns.barplot(x=methods, y=precisions, palette="gnuplot")
plt.show()