# Introduction

This database contains 76 attributes, but all published experiments use a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The "goal" field refers to the presence of heart disease in the patient. It is integer-valued from 0 (no presence) to 4.

<font color="blue">
What will we do?

1. [Load and Check of Dataset](#1)
1. [Normalization](#2)
1. [Determine X and Y (Train-test split)](#3) 
1. [Machine Learning (Classification)](#4)
    * [Logistic Regression](#5)
    * [KNN (K Neighbors)](#6)
    * [Decision Tree](#7)
    * [Random Forest](#8)
    * [SVM](#9)
    * [GBM](#10)
    * [Adaboost](#11)
    * [Bagging](#12)
1. [Compare Algorithms](#13)


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import r2_score, mean_squared_error, confusion_matrix, roc_curve, classification_report
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier,BaggingClassifier,GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from collections import Counter

import warnings
warnings.filterwarnings("ignore")

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

<a id = "1"></a><br>
# Load and Check of Dataset

In [None]:
df = pd.read_csv("/kaggle/input/heart-disease-uci/heart.csv")
df.head()

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
for i in df.columns:
    # One box plot per column to spot outliers
    sns.boxplot(x = df[i])
    plt.title(i)
    plt.show()

We checked the dataset and found no missing values. Some features have outliers, but we ignore them here. The lack of missing values is a property of this curated dataset rather than of medical data in general: hospital records often do contain gaps, so it is worth checking every time.
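
The missing-value check described above can be made explicit with a per-column count. The snippet below is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration, with one gap planted on purpose); on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the heart-disease data;
# the NaN is planted on purpose to show what a gap looks like.
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233.0, np.nan, 204.0, 236.0],
    "target": [1, 1, 0, 0],
})

# Count missing entries per column; all zeros would mean no blanks.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print("Any missing values:", has_missing)
```

If every count is zero, as `df.info()` suggests for this dataset, no imputation step is needed.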

<a id = "2"></a><br>
# Normalization

We normalize because the features are on different scales: some take large values, while others lie between 0 and 1. Without rescaling, features with large values can dominate distance-based models such as KNN and SVM. Min-max scaling maps each such feature to the range [0, 1].
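
To make the rescaling concrete, here is a small worked example of min-max scaling on a hypothetical array (the numbers are invented, not taken from the dataset). It applies x' = (x - min) / (max - min).

```python
import numpy as np

# Hypothetical values on a large scale (not from the dataset).
values = np.array([100.0, 150.0, 200.0, 300.0])

# Min-max scaling: x' = (x - min) / (max - min), which maps the data to [0, 1].
scaled = (values - values.min()) / (values.max() - values.min())
print(scaled)  # the smallest value becomes 0.0 and the largest becomes 1.0
```

Here 150 becomes (150 - 100) / (300 - 100) = 0.25 and 200 becomes 0.5.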

In [None]:
for i in df.columns:
    # Rescale only the columns that have values outside [-1, 1]
    if ((df[i] > 1) | (df[i] < -1)).any():
        # Min-max scaling maps the column to the range [0, 1]
        df[i] = (df[i] - np.min(df[i]))/(np.max(df[i]) - np.min(df[i]))

An easier route is a ready-made scaler such as scikit-learn's `MinMaxScaler`. Here I wanted to show how to do it with a for loop.
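
As a possible alternative to a manual loop, the sketch below uses scikit-learn's `MinMaxScaler` and fits it on the training rows only, so the test rows do not influence the learned minimum and maximum. The data is synthetic and the variable names (`X`, `X_train_scaled`, and so on) are assumptions for illustration; this is not the workflow used in this notebook, which scales before splitting.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Hypothetical feature matrix: two columns on very different scales.
rng = np.random.default_rng(42)
X = np.column_stack([rng.uniform(100, 400, size=50), rng.uniform(0, 1, size=50)])
y = rng.integers(0, 2, size=50)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Learn min and max from the training rows only, then reuse them on the test rows.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Because the scaler only sees training data, scaled test values can fall slightly outside [0, 1]; that is expected.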

<a id = "3"></a><br>
# Determine X and Y (Train-test split)

In [None]:
y = df["target"].values
x = df.drop(["target"],axis = 1)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y,test_size = 0.20,random_state = 42)
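
One optional refinement of the split, not used in this notebook, is the `stratify` argument of `train_test_split`, which keeps the class ratio the same in both parts. A minimal sketch on hypothetical labels (`labels` and `features` are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 zeros and 30 ones.
labels = np.array([0] * 70 + [1] * 30)
features = np.arange(100).reshape(-1, 1)

# stratify=labels keeps the 70/30 class ratio in both the train and test parts.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.20, random_state=42, stratify=labels
)

print("share of positives in train:", l_train.mean())
print("share of positives in test:", l_test.mean())
```

This matters most for small test sets, where a purely random split can shift the class balance by chance.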

<a id = "4"></a><br>
# Machine Learning (Classification)

1. Logistic Regression
1. KNN(K Neighbors)
1. Decision Tree
1. Random Forests
1. SVM
1. GBM (Gradient Boosting Machine)
1. AdaBoost
1. Bagging

<a id = "5"></a><br>
## Logistic Regression

In [None]:
lr = LogisticRegression(random_state = 42).fit(x_train,y_train)
lr.score(x_test,y_test)

In [None]:
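# GridSearchCV
params = {'C': np.logspace(-3, 3, 7), 'penalty': ['l1', 'l2']}
# The default lbfgs solver does not support the l1 penalty; liblinear supports both
lr_model = LogisticRegression(solver = 'liblinear', random_state = 42)
lr_cv = GridSearchCV(lr_model,params,cv = 3).fit(x_train,y_train)
lr_cv.best_params_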

I preferred GridSearchCV because it is easy for me to use. The choice is yours.
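
Instead of copying the best parameters by hand into a new model, the fitted search object can be used directly. The sketch below runs on synthetic data from `make_classification` (an assumption, standing in for the heart-disease file) and uses the `liblinear` solver so that both penalties in the grid are valid.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption, not the heart-disease file).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# liblinear is chosen here because it supports both the l1 and l2 penalties.
grid = GridSearchCV(
    LogisticRegression(solver="liblinear", random_state=42),
    {"C": np.logspace(-3, 3, 7), "penalty": ["l1", "l2"]},
    cv=3,
)
grid.fit(X_train, y_train)

# With the default refit=True, best_estimator_ is already refitted on all of X_train.
best_model = grid.best_estimator_
print(grid.best_params_)
print(best_model.score(X_test, y_test))
```

Using `best_estimator_` avoids a mismatch between the parameters the search found and the ones typed into the tuned model.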

In [None]:
lr_tuned = LogisticRegression(C = 1,random_state = 42).fit(x_train,y_train)
lr_tuned.score(x_test,y_test)

In [None]:
y_pred_log = lr_tuned.predict(x_test)
sns.heatmap(confusion_matrix(y_test,y_pred_log),annot = True)
plt.xlabel("Y_pred")
plt.ylabel("Y_test")
plt.show()
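
To read a heatmap like this one, it helps to know that `confusion_matrix` puts true classes in rows and predicted classes in columns. The sketch below shows this on hypothetical labels (`true_labels` and `predicted` are made up) and adds `classification_report`, which is imported at the top of the notebook but not otherwise used.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical true labels and predictions for a binary problem.
true_labels = np.array([0, 0, 0, 1, 1, 1, 1, 0])
predicted = np.array([0, 1, 0, 1, 1, 0, 1, 0])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(true_labels, predicted)
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN, FP, FN, TP:", tn, fp, fn, tp)

# Precision, recall and F1 per class in one table.
print(classification_report(true_labels, predicted))
```

For a medical screening task, the false negatives (bottom-left cell) are usually the errors to watch most closely.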

<a id = "6"></a><br>
## KNN (K Neighbors)

In [None]:
knn = KNeighborsClassifier(n_neighbors = 2).fit(x_train,y_train)
knn.score(x_test,y_test)

In [None]:
params = {"n_neighbors": range(1,50)}
knn_model = KNeighborsClassifier()
knn_cv = GridSearchCV(knn_model, params, cv = 10, n_jobs = -1).fit(x_train,y_train)
knn_cv.best_params_

In [None]:
knn_tuned = KNeighborsClassifier(n_neighbors = 7).fit(x_train,y_train)
knn_tuned.score(x_test,y_test)

In [None]:
y_pred_knn = knn_tuned.predict(x_test)
sns.heatmap(confusion_matrix(y_test,y_pred_knn),annot = True)
plt.xlabel("Y_pred")
plt.ylabel("Y_test")
plt.show()

<a id = "7"></a><br>
## Decision Tree

In [None]:
tree = DecisionTreeClassifier(random_state = 42).fit(x_train,y_train)
tree.score(x_test,y_test)

In [None]:
params = {"max_depth": range(1,10),
            "min_samples_split" : list(range(2,50))}
tree_model = DecisionTreeClassifier(random_state = 42)
tree_cv = GridSearchCV(tree_model, params, cv = 10, n_jobs = -1).fit(x_train,y_train)
tree_cv.best_params_

In [None]:
tree_tuned = DecisionTreeClassifier(max_depth = 4, min_samples_leaf = 7, min_samples_split = 2,random_state = 42).fit(x_train,y_train)
tree_tuned.score(x_test,y_test)

In [None]:
y_pred_tree = tree_tuned.predict(x_test)
sns.heatmap(confusion_matrix(y_test,y_pred_tree),annot = True)
plt.xlabel("Y_pred")
plt.ylabel("Y_test")
plt.show()

<a id = "8"></a><br>
## Random Forest

In [None]:
rf = RandomForestClassifier(random_state = 42).fit(x_train,y_train)
rf.score(x_test,y_test)

In [None]:
params = {"max_depth": range(1,15),
         "min_samples_split" : [2,5,6,7,10],
         "min_samples_leaf": [2,5,7,8],
         "max_features" : [2,3,5,7,9,12],
         }
rf_model = RandomForestClassifier(random_state = 42)  # the search must tune a random forest, not a single tree
rf_cv = GridSearchCV(rf_model, params, cv = 10, n_jobs = -1).fit(x_train,y_train)
rf_cv.best_params_

In [None]:
rf_tuned = RandomForestClassifier(max_depth = 5, max_features  = 9, min_samples_leaf = 8, min_samples_split = 2,random_state = 42).fit(x_train,y_train)
rf_tuned.score(x_test,y_test)

In [None]:
y_pred_rf = rf_tuned.predict(x_test)
sns.heatmap(confusion_matrix(y_test,y_pred_rf),annot = True)
plt.xlabel("Y_pred")
plt.ylabel("Y_test")
plt.show()

<a id = "9"></a><br>
## SVM

In [None]:
svm = SVC(random_state = 42).fit(x_train,y_train)
svm.score(x_test,y_test)

In [None]:
params = {"C": [0.0001, 0.001, 0.1, 1, 5, 10 ,50 ,100],
             "gamma": [0.0001, 0.001, 0.1, 1, 5, 10 ,50 ,100]}
svm_model = SVC(random_state = 42)
svm_cv = GridSearchCV(svm_model,params,cv = 10,n_jobs = -1).fit(x_train,y_train)
svm_cv.best_params_

In [None]:
svm_tuned = SVC(C = 10, degree = 1, gamma = 0.1, random_state = 42).fit(x_train,y_train)
svm_tuned.score(x_test,y_test)

In [None]:
y_pred_svm = svm_tuned.predict(x_test)
sns.heatmap(confusion_matrix(y_test,y_pred_svm),annot = True)
plt.xlabel("Y_pred")
plt.ylabel("Y_test")
plt.show()

<a id = "10"></a><br>
## GBM

In [None]:
gbm = GradientBoostingClassifier(random_state = 42).fit(x_train,y_train)
gbm.score(x_test,y_test)

In [None]:
gbm_params = {"learning_rate" : [0.001, 0.01, 0.1, 0.05],
             "n_estimators": [100,500,1000],
             "max_depth": [3,5,10],
             "min_samples_split": [2,5,10]}
gbm_model = GradientBoostingClassifier(random_state = 42)
gbm_cv = GridSearchCV(gbm_model, gbm_params, cv = 10, n_jobs = -1).fit(x_train,y_train)
gbm_cv.best_params_

In [None]:
gbm_tuned = GradientBoostingClassifier(max_depth = 2, learning_rate = 0.01, min_samples_split = 45, n_estimators = 100, random_state = 42).fit(x_train,y_train)
gbm_tuned.score(x_test,y_test)

In [None]:
y_pred_gbm = gbm_tuned.predict(x_test)
sns.heatmap(confusion_matrix(y_test,y_pred_gbm),annot = True)
plt.xlabel("Y_pred")
plt.ylabel("Y_test")
plt.show()

<a id = "11"></a><br>
## Adaboost

In [None]:
ada = AdaBoostClassifier(random_state = 42).fit(x_train,y_train)
ada.score(x_test,y_test)

In [None]:
params = {"n_estimators":[10,100,200,300,500,1000],"learning_rate":[0.0001,0.001,0.01,0.1,0.2,0.3,0.7]}
ada_model = AdaBoostClassifier(random_state = 42)
ada_cv = GridSearchCV(ada_model,params,cv = 10,n_jobs = -1).fit(x_train,y_train)
ada_cv.best_params_

In [None]:
ada_tuned = AdaBoostClassifier(learning_rate = 0.01,n_estimators = 1000,random_state = 42).fit(x_train,y_train)
ada_tuned.score(x_test,y_test)

In [None]:
y_pred_ada = ada_tuned.predict(x_test)
sns.heatmap(confusion_matrix(y_test,y_pred_ada),annot = True)
plt.xlabel("Y_pred")
plt.ylabel("Y_test")
plt.show()

<a id = "12"></a><br>
## Bagging

In [None]:
bag = BaggingClassifier(random_state = 42).fit(x_train,y_train)
bag.score(x_test,y_test)

In [None]:
params = {"n_estimators": range(1,50)}
bag_model = BaggingClassifier(random_state = 42)
bag_cv = GridSearchCV(bag_model, params, cv = 10, n_jobs = -1).fit(x_train,y_train)
bag_cv.best_params_

In [None]:
bag_tuned = BaggingClassifier(n_estimators = 45,random_state = 42).fit(x_train,y_train)
bag_tuned.score(x_test,y_test)

In [None]:
y_pred_bag = bag_tuned.predict(x_test)
sns.heatmap(confusion_matrix(y_test,y_pred_bag),annot = True)
plt.xlabel("Y_pred")
plt.ylabel("Y_test")
plt.show()

<a id = "13"></a><br>
# Compare Algorithms

In [None]:
pred_list = [lr_tuned,knn_tuned,tree_tuned,rf_tuned,gbm_tuned,svm_tuned,ada_tuned,bag_tuned]

for i in pred_list:
    # Accuracy of each tuned model on the test set
    print(type(i).__name__, "score : ", i.score(x_test,y_test))
    y_pred = i.predict(x_test)
    sns.heatmap(confusion_matrix(y_test,y_pred),annot = True)
    plt.xlabel("Y_pred")
    plt.ylabel("Y_test")
    plt.title(type(i).__name__)
    plt.show()

In short, we selected GBM and SVM. Both score about 88.5% accuracy on the test set.
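
For a side-by-side comparison, the scores can also be collected into one sorted table. This is a minimal sketch on synthetic data with two default models (the data, the `models` dictionary and the `results` frame are assumptions for illustration); the eight tuned models from this notebook could be placed in the dictionary instead.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data (an assumption, not the heart-disease file).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Two example models; the notebook's tuned models would go here instead.
models = {
    "SVM": SVC(random_state=42),
    "GBM": GradientBoostingClassifier(random_state=42),
}

# Fit each model and record its accuracy on the test set.
rows = [
    {"model": name, "accuracy": model.fit(X_train, y_train).score(X_test, y_test)}
    for name, model in models.items()
]
results = pd.DataFrame(rows).sort_values("accuracy", ascending=False).reset_index(drop=True)
print(results)
```

With a test set this small, differences of one or two percentage points between models correspond to only one or two patients, so close scores should be read with caution.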