# Introduction
This database contains 76 attributes, but all published experiments use a subset of 14 of them. In particular, the Cleveland database is the only one that ML researchers have used to date.
The "target" field indicates the presence of heart disease in the patient. It is integer-valued: 0 = no/less chance of heart attack, 1 = more chance of heart attack.

<font color = "lightblue">
    Content :

1. [Load and Check Data](#1)
1. [Variable Description](#2)
      * [Univariate Variable Analysis](#3)
          * [Categorical Variable](#4)
          * [Numerical Variable](#5)
1. [Basic Data Analysis](#6)
1. [Outlier Detection](#7)
1. [Missing Value](#8)
    * [Find Missing Value](#9)
1. [Visualization](#10)
    * [Correlation between all features](#11)
    * [thalach -- target](#12)
    * [cp -- target](#13)
    * [exang -- target](#14)
    * [ca -- target](#15)
    * [slope -- target](#16)
    * [thal -- target](#17)
    * [cp -- exang -- target](#18)
    * [cp -- exang -- target -- sex](#19)
1. [Feature Engineering](#20)
1. [Modelling](#21)
    * [Train - Test Split](#22)
    * [Simple Logistic Regression](#23)
    * [Hyper Parameter -- Grid Search -- Cross Validation](#24)
    * [Ensemble Modelling](#25)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)


import matplotlib.pyplot as plt
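# "seaborn-whitegrid" was renamed to "seaborn-v0_8-whitegrid" in matplotlib 3.6; use whichever name this version provides
style = "seaborn-whitegrid" if "seaborn-whitegrid" in plt.style.available else "seaborn-v0_8-whitegrid"
plt.style.use(style)  # plt.style.available lists the other styles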


import seaborn as sns

from collections import Counter

import warnings
warnings.filterwarnings("ignore")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id="1"></a><br>
# Load and Check Data

In [None]:
data = pd.read_csv("/kaggle/input/health-care-data-set-on-heart-attack-possibility/heart.csv")

In [None]:
len(data)

* The data has 303 rows (samples).
* I set aside 70 rows before analyzing the data. I will use them as test data.
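
For comparison, here is a minimal sketch of a stratified hold-out split. This is not what this notebook does (the cell below slices rows by position); it assumes a synthetic stand-in frame called `demo` instead of `heart.csv`, and the names `train_part` and `holdout` are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for heart.csv: 303 rows with a binary "target" column.
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "age": rng.integers(29, 78, size=303),
    "target": rng.integers(0, 2, size=303),
})

# Hold out 70 rows; stratify keeps the class ratio similar in both parts.
train_part, holdout = train_test_split(
    demo, test_size=70, stratify=demo["target"], random_state=42
)

print(len(train_part), len(holdout))
print(holdout["target"].value_counts(normalize=True).round(2))
```

A positional slice depends on how the file happens to be ordered, while a stratified random split keeps the class balance of the hold-out set close to that of the full data.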

In [None]:
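# Hold out 70 rows (positions 140-209) as the final test set
test = data.iloc[140:210].copy()
print(test["target"].value_counts())
# Drop exactly the same 70 rows from the training data (range(140, 211) would also drop row 210)
data.drop(index=range(140, 210), inplace=True)
len_train = len(data)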

In [None]:
len_train

In [None]:
data.columns

In [None]:
data.head()

In [None]:
data.describe()

<a id="2"></a><br>
## Variable Description

1. age
1. sex
1. cp: chest pain type (4 values)
1. trestbps : resting blood pressure
1. chol : serum cholesterol in mg/dl
1. fbs : fasting blood sugar > 120 mg/dl
1. restecg : resting electrocardiographic results (values 0,1,2)
1. thalach : maximum heart rate achieved
1. exang : exercise induced angina
1. oldpeak : ST depression induced by exercise relative to rest
1. slope : the slope of the peak exercise ST segment
1. ca : number of major vessels (0-3) colored by fluoroscopy
1. thal : 0 = normal; 1 = fixed defect; 2 = reversible defect
1. target : 0 = less chance of heart attack; 1 = more chance of heart attack

In [None]:
data.info()

data = data.astype({"sex":"int64",
             "cp":"category",
            "fbs":"category",
            "restecg":"category",
            "exang":"category",
            "slope":"category",
            "ca":"category",
            "thal":"category",
            "target":"int64"})
data.info()

* int64(6): age, sex, trestbps, chol, thalach, target
* float64(1): oldpeak
* category(7): cp, fbs, restecg, exang, slope, ca, thal

<a id="3"></a><br>
## Univariate Variable Analysis
* Categorical Variable
* Numerical Variable

<a id="4"></a><br>
## Categorical Variable

In [None]:
def bar_plot(variable):
    """
    input: variable ex: sex
    output: barplot & value count
    """
    # get features
    var = data[variable]
    # count number of categorical variable (value/sample)
    varValue = var.value_counts()
    # visualize
    plt.figure(figsize=(9,3))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    
    print("{}\n{}".format(variable, varValue))

In [None]:
categorical = ["sex", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "target"]
for c in categorical:
    bar_plot(c)

<a id="5"></a><br>
## Numerical Variable

In [None]:
def plot_hist(variable):
    plt.figure(figsize=(9,3))
    plt.hist(data[variable], bins=40)
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with histogram".format(variable))
    plt.show()

In [None]:
numericVar = ["age", "trestbps", "chol", "thalach", "oldpeak"]
for n in numericVar:
    plot_hist(n)

<a id="6"></a><br>
## Basic Data Analysis
* sex vs target
* cp vs target
* fbs vs target
* restecg vs target
* exang vs target
* slope vs target
* ca vs target
* thal vs target
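
The comparisons listed above all follow the same pattern: group by one categorical feature and take the mean of `target`. A minimal sketch of that pattern as a reusable helper, assuming a tiny made-up frame `demo` and a hypothetical function name `target_rate`:

```python
import pandas as pd

# Tiny hypothetical frame with two categorical features and a binary target.
demo = pd.DataFrame({
    "sex":    [0, 0, 1, 1, 1, 0],
    "exang":  [0, 1, 0, 1, 1, 0],
    "target": [1, 0, 1, 0, 0, 1],
})

def target_rate(df, feature):
    """Mean of `target` per category of `feature`, highest first."""
    return (df.groupby(feature)["target"]
              .mean()
              .sort_values(ascending=False))

# One table per feature instead of one hand-written cell per feature.
rates = {f: target_rate(demo, f) for f in ["sex", "exang"]}
for name, table in rates.items():
    print(name)
    print(table, "\n")
```

Since `target` is 0/1, the mean per group is simply the share of rows with `target == 1` in that group.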

In [None]:
# sex vs target
data[["sex","target"]].groupby(["sex"], as_index=False).mean().sort_values(by="target", ascending=False)

In [None]:
# cp vs target
data[["cp","target"]].groupby(["cp"], as_index=False).mean().sort_values(by="target", ascending=False)

In [None]:
# fbs vs target
data[["fbs","target"]].groupby(["fbs"], as_index=False).mean().sort_values(by="target", ascending=False)

In [None]:
# restecg vs target
data[["restecg","target"]].groupby(["restecg"], as_index=False).mean().sort_values(by="target", ascending=False)

In [None]:
# exang vs target
data[["exang","target"]].groupby(["exang"], as_index=False).mean().sort_values(by="target", ascending=False)

In [None]:
# slope vs target
data[["slope","target"]].groupby(["slope"], as_index=False).mean().sort_values(by="target", ascending=False)

In [None]:
# ca vs target
data[["ca","target"]].groupby(["ca"], as_index=False).mean().sort_values(by="target", ascending=False)

In [None]:
# thal vs target
data[["thal","target"]].groupby(["thal"], as_index=False).mean().sort_values(by="target", ascending=False)

In [None]:
data.head()

<a id="7"></a><br>
## Outlier Detection
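
The function in this section uses the 1.5 × IQR rule on each numeric column. As a small worked example on a single made-up array (`values` is hypothetical, not taken from the dataset), the rule looks like this:

```python
import numpy as np

# Hypothetical resting blood pressure readings; 200 is far from the rest.
values = np.array([120, 125, 130, 132, 135, 138, 140, 200])

# Quartiles and interquartile range.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Anything more than 1.5 * IQR outside the quartiles is flagged.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print("bounds:", lower, upper)
print("outliers:", outliers)
```

Note that the notebook's function is stricter than this single-column check: it only returns rows that are flagged in at least two features.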

In [None]:
def outlier_detection(df,features):
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c],25)
        # 3rd quartile
        Q3 = np.percentile(df[c],75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # Detect outlier and their indices
        outlier_indices_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        
        # store indices
        outlier_indices.extend(outlier_indices_col)
    
    outlier_indices = Counter(outlier_indices)
    multiple_outlier = list(i for i, v in outlier_indices.items() if v >= 2)
    
    return multiple_outlier

In [None]:
data.loc[outlier_detection(data, ["age", "trestbps", "chol", "thalach", "oldpeak"])]

In [None]:
# drop outliers
data = data.drop(outlier_detection(data, ["age", "trestbps", "chol", "thalach", "oldpeak"]), axis=0).reset_index(drop=True)

<a id="8"></a><br>
## Missing Value
* Find Missing Value
* Fill Missing Value

In [None]:
data = pd.concat([data,test], axis=0).reset_index(drop=True)

In [None]:
data.head()

<a id="9"></a><br>
## Find Missing Value

In [None]:
data.columns[data.isnull().any()]

In [None]:
data.isnull().sum()

* There are no missing values, so I skip this step.

<a id="10"></a><br>
## Visualization

<a id="11"></a><br>
### Correlation between all features

In [None]:
# Columns: age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, thal, target
list1 = ["age","sex","trestbps","chol","thalach","oldpeak", "cp", "fbs", "restecg", "exang", "slope", "ca", "thal", "target"]
plt.figure(figsize=(15,10))
sns.heatmap(data[list1].corr(), annot=True, fmt=".2f")
plt.show()

thalach, oldpeak, cp (chest pain), exang (exercise-induced angina), ca, slope, and thal are correlated with target.
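
Reading this off a heatmap can be hard with 14 columns. A minimal sketch of ranking features by the absolute value of their correlation with `target`, assuming a synthetic frame `demo` whose columns (`strong`, `inverse`, `noise`) are invented for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
target = rng.integers(0, 2, size=n)

demo = pd.DataFrame({
    "strong": target * 2.0 + rng.normal(0, 0.5, size=n),    # rises with the target
    "inverse": -target * 1.0 + rng.normal(0, 0.5, size=n),  # falls with the target
    "noise": rng.normal(0, 1, size=n),                       # unrelated to the target
    "target": target,
})

# Correlation of every feature with the target, strongest (by magnitude) first.
corr_with_target = (demo.corr()["target"]
                        .drop("target")
                        .sort_values(key=np.abs, ascending=False))
print(corr_with_target.round(2))
```

Sorting by absolute value matters because a strongly negative correlation is as informative as a strongly positive one.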

<a id="12"></a><br>
## thalach -- target

In [None]:
g = sns.FacetGrid(data, col="target")
g.map(sns.distplot, "thalach", bins=40)
plt.show()

In [None]:
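# factorplot was renamed to catplot and size to height in newer seaborn versions
g = sns.catplot(x="thalach", y="target", data=data, kind="bar", height=6, orient="h")
# with orient="h" the bars show the mean thalach for each target class
g.set_axis_labels("Mean thalach", "target")
plt.show()

Patients with thalach above about 140 show a higher heart attack risk.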
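
A threshold statement like this can be checked numerically by comparing the share of `target == 1` on each side of the cut. A minimal sketch, assuming a small made-up frame `demo` (the numbers are not from the dataset) and the cut at 140 mentioned above:

```python
import pandas as pd

# Hypothetical values, chosen only to illustrate the comparison.
demo = pd.DataFrame({
    "thalach": [110, 125, 138, 142, 150, 160, 171, 180],
    "target":  [0,   0,   1,   0,   1,   1,   1,   1],
})

threshold = 140
above = demo["thalach"] > threshold

# Share of positive targets on each side of the threshold.
rate_above = demo.loc[above, "target"].mean()
rate_below = demo.loc[~above, "target"].mean()

print("target rate above threshold:", round(rate_above, 2))
print("target rate at or below threshold:", round(rate_below, 2))
```

On the real data, the same two lines with `data` in place of `demo` would show how large the gap actually is.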

<a id="12"></a><br>
## oldpeak -- target

In [None]:
g = sns.FacetGrid(data, col="target")
g.map(sns.distplot, "oldpeak", bins=40)
plt.show()

In [None]:
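# factorplot was renamed to catplot and size to height in newer seaborn versions
g = sns.catplot(x="oldpeak", y="target", data=data, kind="bar", height=6, orient="h")
# with orient="h" the bars show the mean oldpeak for each target class
g.set_axis_labels("Mean oldpeak", "target")
plt.show()

Patients with oldpeak below 0.50 show a higher heart attack risk.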

<a id="13"></a><br>
## cp -- target

In [None]:
g = sns.catplot(x="cp", y="target", data=data, kind="bar", height=6)
g.set_ylabels("Heart Attack Probability")
plt.show()

Patients with cp equal to 1, 2, or 3 show a higher heart attack risk.

<a id="14"></a><br>
## exang -- target

In [None]:
g = sns.catplot(x="exang", y="target", data=data, kind="bar", height=6)
g.set_ylabels("Heart Attack Probability")
plt.show()

Patients with exang equal to 0 show a higher heart attack risk.

<a id="15"></a><br>
## ca -- target

In [None]:
g = sns.catplot(x="ca", y="target", data=data, kind="bar", height=6)
g.set_ylabels("Heart Attack Probability")
plt.show()

Patients with ca equal to 0 or 4 show a higher heart attack risk.

<a id="16"></a><br>
## slope -- target

In [None]:
g = sns.catplot(x="slope", y="target", data=data, kind="bar", height=6)
g.set_ylabels("Heart Attack Probability")
plt.show()

Patients with slope equal to 2 show a higher heart attack risk.

<a id="17"></a><br>
## thal -- target

In [None]:
g = sns.catplot(x="thal", y="target", data=data, kind="bar", height=6)
g.set_ylabels("Heart Attack Probability")
plt.show()

Patients with thal equal to 2 show a higher heart attack risk.

<a id="18"></a><br>
## cp -- exang -- target

In [None]:
#cp -- exang -- target
g = sns.FacetGrid(data, col="target", row="exang")
g.map(plt.hist, "cp")
plt.show()

<a id="19"></a><br>
## cp -- exang -- target -- sex

In [None]:
g = sns.FacetGrid(data, col="cp", height=3)
g.map(sns.pointplot, "exang", "target", "sex")
g.add_legend()
plt.show()

* black=male
* blue=female

<a id="20"></a><br>
## Feature Engineering

Separate the categorical features and encode them as 0/1 columns.
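
This step is only a note in the notebook, so the following is a minimal sketch of what it could look like, assuming one-hot encoding with `pd.get_dummies` on a small made-up frame `demo`. The choice of columns in `categorical_cols` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical rows with two categorical features (cp, thal).
demo = pd.DataFrame({
    "age": [63, 37, 41],
    "cp": [3, 2, 1],
    "thal": [1, 2, 2],
    "target": [1, 1, 0],
})

# Each category becomes its own 0/1 column; other columns are left unchanged.
categorical_cols = ["cp", "thal"]
encoded = pd.get_dummies(demo, columns=categorical_cols, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

One-hot columns stop a model from treating category codes such as cp = 3 as "three times" cp = 1, which matters for linear models like logistic regression.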

<a id="21"></a><br>
## Modelling

In [None]:
from sklearn.model_selection import train_test_split, StratifiedKFold, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

<a id="22"></a><br>
## Train - Test Split

In [None]:
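# The hold-out rows were appended last (see the concat above), so everything before them is training data.
# This also accounts for the outlier row that was dropped from the training part.
len_train = len(data) - len(test)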
train = data[:len_train]
test = data[len_train:]
test_x = test.drop(["target"], axis=1)
test_y = test["target"]
print("train len : ",len(train))
print("test len : ", len(test))

In [None]:
X_train = train.drop(["target"] , axis=1)
Y_train = train["target"]
x_train, x_test, y_train, y_test = train_test_split(X_train,Y_train, test_size=0.2, random_state=42)
print("x_train len:", len(x_train))
print("x_test len:", len(x_test))
print("y_train len:", len(y_train))
print("y_test len:", len(y_test))
print("test len:", len(test))

<a id="23"></a><br>
## Simple Logistic Regression

In [None]:
lr = LogisticRegression()
lr.fit(x_train,y_train)
y_head = lr.predict(x_test)
acc_log_train = round(lr.score(x_train, y_train)*100,2)
acc_log_test = round(lr.score(x_test,y_test)*100,2)
print("Logistic Regression Train Accuracy: %", acc_log_train)
print("Logistic Regression Test Accuracy: %", acc_log_test)

<a id="24"></a><br>
## Hyper Parameter -- Grid Search -- Cross Validation
We will compare five ML classifiers and evaluate the mean accuracy of each one using stratified cross-validation.

* DecisionTree
* SVM
* RandomForest
* KNN
* LogisticRegression
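
To show what stratified cross-validation does for a single model, here is a minimal sketch without the grid search. It assumes synthetic data from `make_classification` in place of the heart data, and `max_iter=1000` is an arbitrary choice to help the solver converge.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data with 13 features, like the heart data.
X, y = make_classification(n_samples=200, n_features=13, random_state=42)

# 10 folds, each keeping roughly the same class ratio as the full data.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=cv, scoring="accuracy")

print("fold accuracies:", scores.round(2))
print("mean accuracy:", scores.mean().round(3))
```

`GridSearchCV` repeats this procedure for every parameter combination and reports the best mean score as `best_score_`.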

In [None]:
random_state = 42
classifier = [DecisionTreeClassifier(random_state=random_state),
             SVC(random_state=random_state),
             RandomForestClassifier(random_state=random_state),
             LogisticRegression(random_state=random_state, solver="liblinear"),  # liblinear supports both l1 and l2 penalties used in the grid
             KNeighborsClassifier()]

dt_param_grid = {"min_samples_split":range(10,500,20),
                "max_depth":range(1,20,2)}

svc_param_grid = {"kernel":["rbf"],
                 "gamma":[0.001,0.01,0.1,1],
                 "C":[1,10,50,100,200,300,1000]}

rf_param_grid = {"max_features":[1,3,10],
                "min_samples_split":[2,3,10],
                "min_samples_leaf":[1,3,10],
                "bootstrap":[False],
                "n_estimators":[100,300],
                "criterion":["gini"]}

lr_param_grid = {"C":np.logspace(-3,3,7),
                "penalty":["l1","l2"]}

knn_param_grid = {"n_neighbors":np.linspace(1,19,10, dtype=int).tolist(),
                 "weights":["uniform","distance"],
                 "metric":["euclidean","manhattan"]}

classifier_param = [dt_param_grid,
                   svc_param_grid,
                   rf_param_grid,
                   lr_param_grid,
                   knn_param_grid]

In [None]:
model_names=["DecisionTree :", "SVC : ", "RandomForest : ", "LogisticRegression : ", "KNN : "]
cv_result = []
best_estimators = []
for i in range(len(classifier)):
    clf = GridSearchCV(classifier[i], param_grid=classifier_param[i], cv=StratifiedKFold(n_splits=10), scoring="accuracy", n_jobs=-1, verbose=1)
    clf.fit(x_train,y_train)
    cv_result.append(clf.best_score_)
    best_estimators.append(clf.best_estimator_)
    print(model_names[i], cv_result[i])

In [None]:
print(cv_result)
cv_results = pd.DataFrame({"Cross Validation Means":cv_result, "ML Models":["DecisionTreeClassifier", "SVC", "RandomForestClassifier", "LogisticRegression", "KNeighborsClassifier"]})

g = sns.barplot(x="Cross Validation Means", y = "ML Models", data=cv_results)
g.set_xlabel("Mean Accuracy")
g.set_title("Cross Validation Scores")
plt.show()

<a id="25"></a><br>
## Ensemble Modelling
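
Soft voting averages the class probabilities predicted by each member model and picks the class with the highest average. A minimal sketch that reproduces this by hand, assuming synthetic data and default-ish models rather than the tuned estimators from the grid search:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the models are not the tuned ones from this notebook.
X, y = make_classification(n_samples=200, n_features=13, random_state=42)

voter = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft",
).fit(X, y)

# Average the fitted members' class probabilities, then take the most likely class.
avg_proba = np.mean([est.predict_proba(X) for est in voter.estimators_], axis=0)
manual_pred = voter.classes_[np.argmax(avg_proba, axis=1)]

print("matches VotingClassifier.predict:", np.array_equal(manual_pred, voter.predict(X)))
```

Soft voting needs every member to provide `predict_proba`, which is why the ensemble in this section combines the random forest and logistic regression.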

In [None]:
votingC = VotingClassifier(estimators=[("rf", best_estimators[2]),
                                      ("lr", best_estimators[3])],
                          voting="soft", n_jobs=-1)
votingC = votingC.fit(x_train, y_train)
print("Accuracy :", accuracy_score(votingC.predict(x_test), y_test))

Finally, I use the hold-out test set from the start of the notebook for prediction.

In [None]:
print("Accuracy :",accuracy_score(votingC.predict(test_x), test_y))