# Which features are related to heart disease?
<font color = 'blue'>
Content:

1. [LOAD AND CHECK DATA](#1)
1. [VARIABLE DESCRIPTION](#2)
    * [Categorical Variable](#3)
    * [Numerical Variable](#4)
1. [BASIC DATA ANALYSIS](#5)
1. [OUTLIER DETECTION](#6)
1. [MISSING VALUE](#7)
    * [Find Missing Value](#7)
    * [Fill Missing Value](#7)
1. [VISUALIZATION](#8)
    * [Correlation Between Features vs Heart Disease](#8)
    * [thal -- target](#9)
    * [ca -- target](#10)
    * [slope -- target](#11)
    * [exang -- target](#12)
    * [cp -- target](#13)
    * [oldpeak -- target](#14)
    * [thalach -- target](#15)
    * [slope -- oldpeak -- target](#16)
    * [slope -- thalach -- target](#17)
    * [exang -- cp -- target](#18)    
    * [exang -- thalach -- target](#19)
    * [cp -- thalach -- target](#20)
    * [oldpeak -- thalach -- target](#21)
    * [thalach -- age -- target](#22)
1. [IMPLEMENTING ML ALGORITHMS](#23)
    * [K-Nearest Neighbors (KNN)](#23)
    * [Regression](#24)
    * [Regularized Regression](#25)
    * [Accuracy](#26)
    * [ROC Curve with Logistic Regression](#27)
    * [Hyperparameter Tuning](#28)
1. [CONCLUSION](#29)

        

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt

from plotly.offline import init_notebook_mode, iplot, plot
import plotly as py
init_notebook_mode(connected=True)
import plotly.graph_objs as go

import seaborn as sns

from collections import Counter

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import warnings
warnings.filterwarnings("ignore")

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

<a id = "1"></a><br>
# LOAD AND CHECK DATA

In [None]:
data = pd.read_csv('../input/heart-disease-uci/heart.csv')
print(plt.style.available)
plt.style.use('ggplot')

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

<a id = "2"></a><br>
# VARIABLE DESCRIPTION

1. age: Age of patient
1. sex: Gender of patient (1:Male, 0:Female)
1. cp: chest pain type (4 values)
1. trestbps: resting blood pressure
1. chol: serum cholesterol in mg/dl
1. fbs: fasting blood sugar > 120 mg/dl
1. restecg: resting electrocardiographic results (values 0,1,2)
1. thalach: maximum heart rate achieved
1. exang: exercise induced angina (1: yes, 0: no)
1. oldpeak: ST depression induced by exercise relative to rest
1. slope: the slope of the peak exercise ST segment (values 0,1,2)
1. ca: number of major vessels (0-3) colored by fluoroscopy
1. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
1. target: Presence of heart disease (1: yes, 0: No)

<a id = "3"></a><br>
## Categorical Variables
* sex
* cp
* restecg
* exang
* slope
* ca
* thal
* target

In [None]:
def bar_plot(variable):
    """
        input: variable ex: "Sex"
        output: bar plot & value count    
    """
    # get feature
    var = data[variable]
    # count the number of samples in each category
    varValue = var.value_counts()
    
    #visualize
    plt.figure(figsize = (9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable,varValue))

In [None]:
category = ["sex", "cp", "restecg", "exang", "slope", "ca", "thal", "target"]
for c in category:
    bar_plot(c)


<a id = "4"></a><br>
## Numerical Variables
* age
* trestbps
* chol
* fbs
* thalach
* oldpeak

In [None]:
def plot_hist(variable):
    plt.figure(figsize = (9,3))
    plt.hist(data[variable], bins = 50)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [None]:
numericVar = ["age", "trestbps", "chol", "fbs", "thalach", "oldpeak"]
for n in numericVar:
    plot_hist(n)

<a id = "5"></a><br>
# BASIC DATA ANALYSIS
* sex - target
* cp - target
* restecg - target
* exang - target
* slope - target
* ca - target
* thal - target

In [None]:
# sex - target
data[["sex", "target"]].groupby(["sex"], as_index = False).mean().sort_values(by = "target", ascending =False)

In [None]:
# cp - target
data[["cp", "target"]].groupby(["cp"], as_index = False).mean().sort_values(by = "target", ascending =False)

As you can see, there is a correlation between chest pain type and heart disease.

In [None]:
# restecg - target
data[["restecg", "target"]].groupby(["restecg"], as_index = False).mean().sort_values(by = "target", ascending =False)

In [None]:
# exang - target
data[["exang", "target"]].groupby(["exang"], as_index = False).mean().sort_values(by = "target", ascending =False)

In [None]:
# slope - target
data[["slope", "target"]].groupby(["slope"], as_index = False).mean().sort_values(by = "target", ascending =False)

In [None]:
# ca - target
data[["ca", "target"]].groupby(["ca"], as_index = False).mean().sort_values(by = "target", ascending =False)

In [None]:
# thal - target
data[["thal", "target"]].groupby(["thal"], as_index = False).mean().sort_values(by = "target", ascending =False)

<a id = "6"></a><br>
# OUTLIER DETECTION

In [None]:
def detect_outliers(df,features):
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c],25)
        # 3rd quartile
        Q3 = np.percentile(df[c],75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        #Detect outlier and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        #store indices
        outlier_indices.extend(outlier_list_col)
        
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(i for i, v in outlier_indices.items() if v > 2)
    
    return multiple_outliers

In [None]:
data.loc[detect_outliers(data,["age", "trestbps", "chol", "fbs", "thalach", "oldpeak"])]

No rows are flagged: no observation is an outlier in more than two of these features.
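
To make the rule inside `detect_outliers` concrete, here is a minimal sketch on a made-up list of values. The helper name `iqr_outliers` and the numbers are illustrative assumptions, not part of the dataset. It flags values outside the range from Q1 - 1.5 * IQR to Q3 + 1.5 * IQR, using only the standard library.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # method="inclusive" interpolates linearly between the sorted data points
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    step = k * (q3 - q1)
    lower, upper = q1 - step, q3 + step
    return [v for v in values if v < lower or v > upper]

# Made-up readings: nine typical values and one extreme value
readings = [10, 12, 11, 13, 12, 11, 14, 13, 12, 40]
print(iqr_outliers(readings))
```

Note that the notebook's function is stricter than this sketch: it only returns rows that are flagged in more than two features at once.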

<a id = "7"></a><br>
# MISSING VALUE
* Find Missing Value
* Fill Missing Value

In [None]:
data.columns[data.isnull().any()]

There are no missing values, so nothing needs to be filled.

<a id = "8"></a><br>
# VISUALIZATION
* Correlation Between Features vs Heart Disease

In [None]:
fig, ax = plt.subplots(figsize=(10,10)) 
sns.heatmap(data[["age", "trestbps", "chol", "fbs", "thalach", "oldpeak",
                      "sex", "cp", "restecg", "exang", "slope", "ca", "thal", "target"]].corr(), annot = True)
plt.show()

It seems that the probability of heart disease (target in this instance) is correlated with:
* thal (-)
* ca (-)
* slope (+)
* exang (-)
* cp (+)
* oldpeak (-)
* thalach (+)

It is also seen that:
* slope has correlation with:
    * oldpeak (-)
    * thalach (+)
* exang has correlation with:
    * cp (-)
    * thalach (-)
* cp has correlation with:
    * thalach (+)
* oldpeak has correlation with:
    * thalach (+)
* thalach has correlation with:
    * age (-)
Now we will visualize these relations

 <a id = "9"></a>
 * thal -- target


In [None]:
# factorplot was renamed to catplot and `size` to `height` in newer seaborn versions
g = sns.catplot(x = "thal", y = "target", data = data, kind = "bar", height = 6)
g.set_ylabels("Disease Probability")
plt.show()

* Patients with thal = 2 have a very high heart disease probability. 
* Also, thal = 0 patients have a higher risk than thal = 1 or 3 patients.

 <a id = "10"></a><br>
 * ca -- target


In [None]:
# factorplot was renamed to catplot and `size` to `height` in newer seaborn versions
g = sns.catplot(x = "ca", y = "target", data = data, kind = "bar", height = 6)
g.set_ylabels("Disease Probability")
plt.show()

* ca = 0 or 4 patients have a higher risk than ca = 1, 2 or 3 patients

<a id = "11"></a><br>
* slope -- target



In [None]:
# factorplot was renamed to catplot and `size` to `height` in newer seaborn versions
g = sns.catplot(x = "slope", y = "target", data = data, kind = "bar", height = 6)
g.set_ylabels("Disease Probability")
plt.show()

* slope = 2 patients have a higher risk than slope = 0 or 1 patients

<a id = "12"></a><br> 
* exang -- target

In [None]:
# factorplot was renamed to catplot and `size` to `height` in newer seaborn versions
g = sns.catplot(x = "exang", y = "target", data = data, kind = "bar", height = 6)
g.set_ylabels("Disease Probability")
plt.show()

* exang = 0 patients have a higher risk than exang = 1 patients

<a id = "13"></a><br>
* cp -- target

In [None]:
# factorplot was renamed to catplot and `size` to `height` in newer seaborn versions
g = sns.catplot(x = "cp", y = "target", data = data, kind = "bar", height = 6)
g.set_ylabels("Disease Probability")
plt.show()

> Patients who have chest pain have a very high probability of heart disease

<a id = "14"></a><br>    
* oldpeak -- target

In [None]:
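# distplot is deprecated; histplot with kde=True draws a similar histogram plus density curve
g = sns.FacetGrid(data, col = "target", height = 6)
g.map(sns.histplot, "oldpeak", bins = 25, kde = True, stat = "density")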
plt.show()

* For 0 < oldpeak < 2, there is a higher risk of disease

<a id = "15"></a><br>  
* thalach -- target

In [None]:
g = sns.FacetGrid(data, col = "target")
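# distplot is deprecated; histplot with kde=True draws a similar histogram plus density curve
g.map(sns.histplot, "thalach", bins = 25, kde = True, stat = "density")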
plt.show()

* As thalach rises over 150, the risk increases

 <a id = "16"></a><br>  
 * slope -- oldpeak -- target

In [None]:
g = sns.FacetGrid(data, col = "target", row = "slope", height = 3)
g.map(plt.hist, "oldpeak", bins = 25)
g.add_legend()
plt.show()

 <a id = "17"></a><br>  
 * slope -- thalach -- target

In [None]:
g = sns.FacetGrid(data, col = "target", row = "slope", height = 3)
g.map(plt.hist, "thalach", bins = 25)
g.add_legend()
plt.show()

* The risk is higher for slope=2 and thalach>150 patients
* The risk is lower for slope=1 and thalach>150 patients

<a id = "18"></a><br>  
* exang -- cp -- target

In [None]:
g = sns.FacetGrid(data, col = "target", row = "exang", height = 4)
g.map(plt.hist, "cp", bins = 25)
g.add_legend()
plt.show()

 <a id = "19"></a><br>  
 * exang -- thalach -- target

In [None]:
g = sns.FacetGrid(data, col = "target", row = "exang", height = 4)
g.map(plt.hist, "thalach", bins = 25)
g.add_legend()
plt.show()

 <a id = "20"></a><br>  
 * cp -- thalach -- target

In [None]:
g = sns.FacetGrid(data, col = "target", row = "cp", height = 2)
g.map(plt.hist, "thalach", bins = 25)
g.add_legend()
plt.show()

 <a id = "21"></a><br>  
 * oldpeak -- thalach -- target

In [None]:
g = sns.FacetGrid(data, col="target", height = 8)
g.map(plt.scatter, "oldpeak", "thalach", edgecolor="w")
g.add_legend()
plt.show()

* Heart disease risk increases especially when oldpeak < 2 and thalach > 150

 <a id = "22"></a><br>  
 * thalach -- age -- target

In [None]:
g = sns.FacetGrid(data, col="target", height = 8)
g.map(sns.kdeplot, "age", "thalach", edgecolor="w")
g.add_legend()
plt.show()

* The disease risk increases for ages between 40 and 60 and thalach between 150 and 185

 <a id = "23"></a><br>  
 # IMPLEMENTING ML ALGORITHMS
 <a id = "23"></a><br>  
 ## K-Nearest Neighbors (KNN)
 * KNN is a classification method.
 * It predicts the class of a data point from the K closest data points in the training set.
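
The idea behind KNN can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions (Euclidean distance, majority vote, made-up 2-D points); the helper `knn_predict` is hypothetical and is not how scikit-learn implements `KNeighborsClassifier` internally.

```python
from collections import Counter
from math import dist

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training indices by Euclidean distance to the query point
    order = sorted(range(len(train_points)),
                   key=lambda i: dist(train_points[i], query))
    nearest_labels = [train_labels[i] for i in order[:k]]
    # The most common label among the k neighbours wins
    return Counter(nearest_labels).most_common(1)[0][0]

# Made-up 2-D points: class 0 near the origin, class 1 near (5, 5)
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

An odd K is a common choice for two classes because it avoids tied votes.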

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
x,y = data.loc[:,data.columns != 'target'], data.loc[:,'target']
knn.fit(x,y)
prediction = knn.predict(x)
print('Prediction: {}'. format(prediction))

What I have done is train my model on the data and predict possible heart disease. But I do not know whether my predictions are correct. In other words, I do not know the accuracy of my model.

To overcome this, a general rule is to divide the data into train and test parts. Then we first train (fit) the model on the "train" part of the data and test the results on the "test" part. Let's do it!

In [None]:
from sklearn.model_selection import train_test_split
# Separate features and target before splitting
x,y = data.loc[:,data.columns != 'target'], data.loc[:,'target']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(x_train,y_train)
# Predict only on the held-out test set
prediction = knn.predict(x_test)
print('With KNN (K=3) accuracy is: ',knn.score(x_test,y_test))

So the accuracy of my model is 63%. Is this a good score? Is it possible to reach a higher score by changing the K value? Let's find out!

In [None]:
neig = np.arange(1, 25)
train_accuracy = []
test_accuracy = []
# Loop over different values of k
for k in neig:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit with knn
    knn.fit(x_train,y_train)
    #train accuracy
    train_accuracy.append(knn.score(x_train, y_train))
    # test accuracy
    test_accuracy.append(knn.score(x_test, y_test))

# Plot
plt.figure(figsize=[13,8])
plt.plot(neig, test_accuracy, label = 'Testing Accuracy')
plt.plot(neig, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.title('K-value vs Accuracy')
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.xticks(neig)
plt.savefig('graph.png')
plt.show()
print("Best accuracy is {} with K = {}".format(np.max(test_accuracy),1+test_accuracy.index(np.max(test_accuracy))))

 <a id = "24"></a><br>  
 ## Regression
 * It is a supervised learning model
 * I will show linear and logistic regression

In [None]:
data1 = data[data['target'] == 1]
x = np.array(data1.loc[:,'oldpeak']).reshape(-1,1)
y = np.array(data1.loc[:,'thalach']).reshape(-1,1)
# Scatter
plt.figure(figsize=[10,10])
plt.scatter(x=x,y=y)
plt.xlabel('oldpeak')
plt.ylabel('thalach')
plt.show()

* Linear regression

y = ax + b where y = target, x = feature, a = slope (model parameter) and b = intercept
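
For a single feature, the least-squares values of a and b have a simple closed form. The sketch below uses made-up points that lie exactly on y = 2x + 1; the helper `fit_line` is a hypothetical name used for illustration and is not taken from scikit-learn.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b minimising the squared error of y = a*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = sum of co-deviations of x and y / sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    # The fitted line always passes through the point of means
    b = mean_y - a * mean_x
    return a, b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # made-up points on y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)
```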

In [None]:
# LinearRegression
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
# Predict space
predict_space = np.linspace(min(x), max(x)).reshape(-1,1)
# Fit
reg.fit(x,y)
# Predict
predicted = reg.predict(predict_space)
# R^2 
print('R^2 score: ',reg.score(x, y))
# Plot regression line and scatter
plt.plot(predict_space, predicted, color='black', linewidth=3)
plt.scatter(x=x,y=y)
plt.xlabel('oldpeak')
plt.ylabel('thalach')
plt.show()

In [None]:
data.head()

<a id = "25"></a><br>
## Regularized Regression
* Linear regression can overfit because it may assign very large coefficients to some features. 
* To counter this, we use regularized regression, which penalizes large coefficients. Let's check out "Ridge" and "Lasso" regression. 
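
To see how a penalty shrinks a coefficient, consider ridge regression with one feature. Assuming the objective sum((y - a*x - b)^2) + alpha * a^2 with an unpenalized intercept, the slope becomes Sxy / (Sxx + alpha). The sketch below uses made-up data and a hypothetical helper `ridge_slope`; it is an illustration, not the scikit-learn implementation.

```python
def ridge_slope(xs, ys, alpha):
    """Slope minimising sum((y - a*x - b)^2) + alpha * a^2 for one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    # alpha in the denominator pulls the slope towards zero
    return sxy / (sxx + alpha)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # made-up points on y = 2x + 1
for alpha in (0.0, 5.0, 50.0):
    print(alpha, ridge_slope(xs, ys, alpha))
```

With alpha = 0 this is ordinary least squares; larger alpha values give smaller slopes. Lasso penalizes the absolute value of the coefficient instead of its square, which can set coefficients exactly to zero.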

In [None]:
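# Ridge
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 2, test_size = 0.3)
# The `normalize` argument was removed in scikit-learn 1.2; scaling the features
# in a pipeline is the suggested replacement (results are not numerically identical)
ridge = make_pipeline(StandardScaler(), Ridge(alpha = 0.1))
ridge.fit(x_train,y_train)
ridge_predict = ridge.predict(x_test)
print('Ridge score: ',ridge.score(x_test,y_test))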

In [None]:
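# Lasso
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# y is still thalach (set in the Regression section), so thalach is left out of the features
x = np.array(data1.loc[:,['oldpeak','trestbps','chol']])
x_train,x_test,y_train,y_test = train_test_split(x,y,random_state = 3, test_size = 0.3)
# `normalize` was removed in scikit-learn 1.2; scale the features in a pipeline instead
lasso = make_pipeline(StandardScaler(), Lasso(alpha = 0.1))
lasso.fit(x_train,y_train)
lasso_predict = lasso.predict(x_test)
print('Lasso score: ',lasso.score(x_test,y_test))
print('Lasso coefficients: ',lasso.named_steps['lasso'].coef_)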

<a id = "26"></a><br>
## Accuracy
Let's discuss accuracy. The accuracy of our model is the percentage of correct predictions. But does this percentage alone really tell us how good the model is? Think about a model with 70% accuracy. Let's say that 70% of the patients have heart disease. If a model predicts that all patients have heart disease, then that model also has 70% accuracy, even though it has learned nothing.   

To get rid of this confusion, we calculate the confusion matrix. With heart disease (target = 1) as the positive class, we calculate:
* tp = Prediction is positive (disease) and actual is positive (disease).
* fp = Prediction is positive (disease) and actual is negative (no disease).
* fn = Prediction is negative (no disease) and actual is positive (disease).
* tn = Prediction is negative (no disease) and actual is negative (no disease).
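
The four counts can be computed by hand. The sketch below assumes label 1 (disease) is the positive class and uses made-up labels that reproduce the "always predict disease" scenario with 70% prevalence; the helper `confusion_counts` is a hypothetical name used only for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count tp, fp, fn, tn for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Made-up labels: 7 of 10 patients have the disease (1)
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1] * 10          # a model that always predicts "disease"

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn)          # share of sick patients that are found
specificity = tn / (tn + fp)     # share of healthy patients that are cleared
print(tp, fp, fn, tn)
print(accuracy, recall, specificity)
```

Accuracy looks acceptable here, but the specificity of zero shows that the model never recognizes a healthy patient.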

In [None]:
# Confusion matrix with random forest
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
x,y = data.loc[:,data.columns != 'target'], data.loc[:,'target']
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.3,random_state = 1)
rf = RandomForestClassifier(random_state = 4)
rf.fit(x_train,y_train)
y_pred = rf.predict(x_test)
cm = confusion_matrix(y_test,y_pred)
print('Confusion matrix: \n',cm)
print('Classification report: \n',classification_report(y_test,y_pred))

In [None]:
# visualize with seaborn library
sns.heatmap(cm,annot=True,fmt="d") 
plt.show()

<a id = "27"></a><br>
## ROC Curve with Logistic Regression
* Logistic regression outputs probabilities.
* If the probability is higher than 0.5, the sample is labeled 1 (disease), otherwise 0 (no disease).
* By default, the logistic regression threshold is 0.5.
* ROC stands for receiver operating characteristic. In this curve the x axis is the false positive rate and the y axis is the true positive rate.
* The closer the curve is to the top-left corner, the more accurate the test.
* The ROC curve score is the AUC, the area under the curve computed from the prediction scores.
* We want the AUC to be close to 1.
* fpr = False Positive Rate
* tpr = True Positive Rate
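
The points of a ROC curve and its AUC can be computed by hand for a tiny example. The sketch below uses four made-up samples and two hypothetical helpers, `roc_points` and `auc`; it sweeps the threshold over the distinct scores and integrates with the trapezoidal rule. It is meant as an illustration of the definitions, not as a replacement for `roc_curve`.

```python
def roc_points(y_true, scores):
    """Return (fpr, tpr) pairs obtained by lowering the threshold step by step."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    # Each distinct score, from highest to lowest, is used as a threshold
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= thr and t == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve using the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Made-up true labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
pts = roc_points(y_true, scores)
print(pts)
print(auc(pts))
```

A classifier that ranks every positive sample above every negative one would reach an AUC of 1, while random scores give about 0.5.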

In [None]:
# ROC Curve with logistic regression
from sklearn.metrics import roc_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
# have disease = 1 and no disease = 0
x,y = data.loc[:,(data.columns != 'target')], data.loc[:,'target']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state=42)
logreg = LogisticRegression()
logreg.fit(x_train,y_train)
y_pred_prob = logreg.predict_proba(x_test)[:,1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()

<a id = "28"></a><br>
## Hyperparameter Tuning
Hyperparameter tuning:
* try all of combinations of different parameters
* fit all of them
* measure prediction performance
* see how well each performs
* finally choose best hyperparameters
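
The steps above can be sketched without any library. The example below is a simplified illustration on made-up 1-D data: `knn_predict` and `cv_accuracy` are hypothetical helpers, and the fold assignment (every third sample) is an assumption chosen for brevity rather than the splitting strategy used by `GridSearchCV`.

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training pairs (x, label) closest to query."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cv_accuracy(data, k, n_folds=3):
    """Mean accuracy of k-NN over n_folds train/validation splits."""
    scores = []
    for fold in range(n_folds):
        # Every n_folds-th sample goes to the validation fold
        valid = data[fold::n_folds]
        train = [pair for i, pair in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == label for x, label in valid)
        scores.append(hits / len(valid))
    return sum(scores) / n_folds

# Made-up 1-D data: (feature value, label)
data = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 0), (3.0, 1), (3.5, 0),
        (4.0, 1), (4.5, 1), (5.0, 1), (5.5, 1), (6.0, 1), (2.8, 1)]

# Try every candidate, score each with cross-validation, keep the best
grid = [1, 3, 5]
results = {k: cv_accuracy(data, k) for k in grid}
best_k = max(results, key=results.get)
print(results)
print("best k:", best_k)
```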

In [None]:
# grid search cross validation with 1 hyperparameter
from sklearn.model_selection import GridSearchCV
grid = {'n_neighbors': np.arange(1,50)}
knn = KNeighborsClassifier()
knn_cv = GridSearchCV(knn, grid, cv=3) # GridSearchCV
knn_cv.fit(x,y)# Fit

# Print hyperparameter
print("Tuned hyperparameter k: {}".format(knn_cv.best_params_)) 
print("Best score: {}".format(knn_cv.best_score_))

Another grid search example, with 2 hyperparameters:

* The first hyperparameter is C, the inverse of the logistic regression regularization strength.
* If C is high: overfit
* If C is low: underfit
* The second hyperparameter is the penalty (added to the loss function): l1 (Lasso) or l2 (Ridge), as we learned in the regularized regression part.

In [None]:
# grid search cross validation with 2 hyperparameter
# 1. hyperparameter is C:logistic regression regularization parameter
# 2. penalty l1 or l2
# Hyperparameter grid
param_grid = {'C': np.logspace(-3, 3, 7), 'penalty': ['l1', 'l2']}
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.3,random_state = 12)
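# liblinear supports both l1 and l2 penalties (the default lbfgs solver does not support l1)
logreg = LogisticRegression(solver='liblinear')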
logreg_cv = GridSearchCV(logreg,param_grid,cv=3)
logreg_cv.fit(x_train,y_train)

# Print the optimal parameters and best score
print("Tuned hyperparameters : {}".format(logreg_cv.best_params_))
print("Best Accuracy: {}".format(logreg_cv.best_score_))

<a id = "29"></a><br>
# CONCLUSION
In this tutorial I tried to show you:
* How to visualize and understand the data
* How to implement ML models