# INTRODUCTION



1 - [Load and Check Data](#1)

2 - [Variable Description](#2)

3 - [Univariate Variable Analysis](#3)

   * [Categorical Variables](#4)
   * [Numerical Variables](#5)

4 - [Outlier Detection](#6)

5 - [Missing Values](#7)

   * [Find Missing Values](#7)
   * [Fill Missing Values](#8)

6 - [X and Y Coordinates](#9)

7 - [Normalization Operation](#10)

8 - [Train - Test Split](#11)

9 - [Logistic Regression Classification](#12)

10 - [K-Nearest Neighbour (KNN) Classification](#13)

11 - [Support Vector Machines](#14)

12 - [Naive Bayes Classification](#15)

13 - [Decision Tree Classification](#16)

14 - [Random Forest Classification](#17)

15 - [Confusion Matrix](#18)

   * [Visualization of the Confusion Matrix](#19)


In [None]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
plt.style.use('seaborn-whitegrid')  # on matplotlib >= 3.6 this style is named 'seaborn-v0_8-whitegrid'

from collections import Counter

import warnings
warnings.filterwarnings('ignore')


import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

<a id = '1' ></a><br>
## Load and Check Data

In [None]:
data = pd.read_csv("/kaggle/input/health-care-data-set-on-heart-attack-possibility/heart.csv")

In [None]:
# Let's look at the first 5 rows of the data
data.head()

In [None]:
# last 5 rows.
data.tail()

In [None]:
# Column names in the data
data.columns

<a id = '2' ></a><br>
## Variable Description

1) age = age of the patient in years

2) sex = sex of the patient (1 = male, 0 = female)

3) cp = chest pain type (4 values)

4) trestbps = resting blood pressure in mm Hg

5) chol = serum cholesterol in mg/dl

6) fbs = fasting blood sugar > 120 mg/dl (1 = true, 0 = false)

7) restecg = resting electrocardiographic results (values 0, 1, 2)

8) thalach = maximum heart rate achieved

9) exang = exercise-induced angina (1 = yes, 0 = no)

10) oldpeak = ST depression induced by exercise relative to rest

11) slope = the slope of the peak exercise ST segment

12) ca = number of major vessels (0-3) colored by fluoroscopy

13) thal = 0: normal; 1: fixed defect; 2: reversible defect

14) target = 0: less chance of heart attack; 1: more chance of heart attack

In [None]:
data.info()

* int64(13) = age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, slope, ca, thal, target
* float64(1) = oldpeak

<a id = '3' ></a><br>
## Univariate Variable Analysis

* Categorical Variables = sex, cp, fbs, restecg, exang, slope, ca, thal, target

* Numerical Variables = age, trestbps, chol, thalach, oldpeak

<a id = '4' ></a><br>
## Categorical Variables

In [None]:
def bar_plot(variable):
    """Plot the frequency of each category of `variable` and print the counts."""
    # get feature
    var = data[variable]

    # count the number of samples in each category
    varValue = var.value_counts()

    # visualize
    plt.figure(figsize = (10,10))
    plt.bar(varValue.index,varValue)
    plt.xticks(varValue.index,varValue.index.values)
    plt.ylabel("Frequency")
    plt.title(variable)
    plt.show()
    # .format() is needed here; passing the values as extra print arguments leaves the {} unfilled
    print("{} : \n{}".format(variable,varValue))

In [None]:
categorical_variables = ["sex" ,"cp" , "fbs" , "restecg" , "exang" , "slope" , "ca" , "thal" ,"target"]
for c in categorical_variables:
    bar_plot(c)

<a id = '5' ></a><br>
## Numerical Variables

In [None]:
def plot_hist(variable):
    plt.figure(figsize = (10,10))
    plt.hist(data[variable],bins = 75,color = "green")
    plt.xlabel(variable)
    plt.ylabel("Frequency")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

In [None]:
numerical_variables = ["age" , "trestbps" , "chol" , "thalach" , "oldpeak"]

for x in numerical_variables:
    plot_hist(x)

<a id = '6' ></a><br>
## Outlier Detection

In [None]:
def detect_outliers(df,features):
    """Return the indices of rows that are IQR outliers in more than two of `features`."""
    outlier_indices = []
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c],25)

        # 3rd quartile
        Q3 = np.percentile(df[c],75)

        # IQR (interquartile range)
        IQR = Q3 - Q1

        # outlier step
        outlier_step = IQR * 1.5

        # detect outliers and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index

        # store indices
        outlier_indices.extend(outlier_list_col)

    # count in how many features each row was flagged
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = [i for i, v in outlier_indices.items() if v > 2]

    return multiple_outliers

In [None]:
data.loc[detect_outliers(data,["age","trestbps","chol","thalach","oldpeak"])]
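
As a worked illustration of the IQR rule used in detect_outliers, the sketch below applies it to a single made-up column (the readings are hypothetical, not taken from the heart dataset). Note that detect_outliers only returns rows flagged in more than two features (v > 2), whereas this sketch flags every value outside the fences of one column.

```python
import numpy as np

# Hypothetical serum cholesterol readings (mg/dl), invented for this example.
values = np.array([180, 190, 200, 210, 220, 230, 240, 250, 260, 560])

q1 = np.percentile(values, 25)   # first quartile
q3 = np.percentile(values, 75)   # third quartile
iqr = q3 - q1                    # interquartile range
step = 1.5 * iqr                 # same outlier step as in detect_outliers

# Values below the lower fence or above the upper fence count as outliers.
lower, upper = q1 - step, q3 + step
outliers = values[(values < lower) | (values > upper)]

print("Q1 =", q1, "Q3 =", q3, "IQR =", iqr)
print("fences:", lower, upper)
print("outliers:", outliers.tolist())
```

Here only the reading of 560 falls outside the fences, so it is the only value reported.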

<a id = '7' ></a><br>
## Find Missing Values

In [None]:
data.columns[data.isnull().any()]

In [None]:
data.isnull().sum() # number of missing values in each column

<a id = '8' ></a><br>
## Fill Missing Values
* The dataset has no missing values, so nothing needs to be filled.
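
If a column did contain missing values, one common option is to fill it with the column median. The sketch below shows this on a made-up frame; the values and the choice of the median are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing cholesterol value (not the real dataset).
toy = pd.DataFrame({"chol": [200.0, np.nan, 240.0, 260.0]})

print(toy.isnull().sum())   # one missing value in "chol"

# median() skips NaN by default, so it is computed over 200, 240 and 260.
toy["chol"] = toy["chol"].fillna(toy["chol"].median())

print(toy)
```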

<a id = '9' ></a><br>
## X and Y Coordinates

In [None]:
data.head()

In [None]:
y = data.target.values

In [None]:
# axis = 1 drops a column; axis = 0 would drop rows
# x_data holds every feature except the target
x_data = data.drop(["target"],axis = 1)

<a id = '10' ></a><br>
## Normalization Operation
* Min-max scaling maps every feature to the range from 0 to 1.

In [None]:
# column-wise min-max scaling; DataFrame.min()/max() do not depend on how np.min reduces a DataFrame
x = (x_data - x_data.min())/(x_data.max() - x_data.min())
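
Concretely, each value x is replaced by (x - min) / (max - min), computed per column, so the smallest value in a column becomes 0 and the largest becomes 1. A small sketch on made-up numbers (the frame below is hypothetical, not the heart dataset):

```python
import pandas as pd

# Hypothetical ages and cholesterol values, invented for this example.
toy = pd.DataFrame({"age": [29, 50, 77], "chol": [126, 345, 564]})

# Column-wise min-max scaling.
scaled = (toy - toy.min()) / (toy.max() - toy.min())

print(scaled)
# The middle age becomes (50 - 29) / (77 - 29) = 0.4375,
# and the middle cholesterol value becomes (345 - 126) / (564 - 126) = 0.5.
```

A column whose values are all equal would give 0 / 0 here, so such columns need separate handling.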

<a id = '11' ></a><br>
## Train - Test Split

In [None]:
from sklearn.model_selection import train_test_split
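# train_test_split returns the train parts before the test parts; test_size = 0.2 keeps 20% for testing
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size = 0.2,random_state = 42)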
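
train_test_split returns its outputs in the order train features, test features, train labels, test labels. The sketch below checks the resulting sizes on ten made-up samples (the arrays are placeholders, not the heart data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten hypothetical samples with two features each, and alternating labels.
x_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42
)

print(x_tr.shape, x_te.shape)   # 8 training rows and 2 test rows
print(y_tr.shape, y_te.shape)
```

Unpacking the four results in a different order silently swaps the training and test sets, so it is worth checking the shapes once after splitting.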

<a id = '12' ></a><br>
## Logistic Regression Classification

In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(x_train,y_train)
print("Test accuracy: {}".format(lr.score(x_test,y_test)))

<a id = '13' ></a><br>
## K-Nearest Neighbour (KNN) Classification

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3) # n_neighbors = k value
knn.fit(x_train,y_train)
prediction = knn.predict(x_test)
print("KNN score: {}".format(knn.score(x_test,y_test)))

As you can see, accuracy decreased compared with logistic regression. We can try to increase it by changing the k value (n_neighbors).

In [None]:
# visualize to find best K value
score_list = []

for each in range(1,61):
    knn2 = KNeighborsClassifier(n_neighbors=each)
    knn2.fit(x_train,y_train)
    score_list.append(knn2.score(x_test,y_test))


plt.plot(range(1,61),score_list)
plt.title("K-Value & Accuracy")
plt.xlabel("K-Value")
plt.ylabel("Accuracy")
plt.show()

In this graph, the best k value is 6 (the exact value can change if the split changes).
Let's fit the model again with k = 6.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=6) # n_neighbors = k value
knn.fit(x_train,y_train)
prediction = knn.predict(x_test)
print("KNN score: {}".format(knn.score(x_test,y_test)))
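
The best k can also be read from score_list directly instead of from the plot. A minimal sketch, using invented accuracy values in place of the real scores (position 0 of the list corresponds to k = 1):

```python
import numpy as np

# Hypothetical accuracies for k = 1..8; the real values come from score_list.
demo_scores = [0.80, 0.82, 0.85, 0.84, 0.86, 0.90, 0.88, 0.87]

best_index = int(np.argmax(demo_scores))   # index of the highest accuracy
best_k = best_index + 1                    # the loop started at k = 1

print("best k:", best_k, "accuracy:", demo_scores[best_index])
```

If several k values tie for the highest accuracy, np.argmax returns the first of them, which is the smallest k.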

<a id = '14' ></a><br>
## Support Vector Machines

In [None]:
from sklearn.svm import SVC
svm = SVC(random_state = 1) # random_state seeds the random number generator for reproducibility (SVC only uses it when probability = True)
svm.fit(x_train,y_train)

print("Accuracy of the Support Vector Machines : ",svm.score(x_test,y_test))

<a id = '15' ></a><br>
## Naive Bayes Classification

In [None]:
from sklearn.naive_bayes import GaussianNB
nb = GaussianNB()
nb.fit(x_train,y_train)
print("Accuracy of the Naive Bayes Classification: ",nb.score(x_test,y_test))

<a id = '16' ></a><br>
## Decision Tree Classification

In [None]:
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)
print("Accuracy of the Decision Tree Classification: ",dt.score(x_test,y_test))


<a id = '17' ></a><br>
## Random Forest Classification

In [None]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 100,random_state = 1) # n_estimators = number of trees
rf.fit(x_train,y_train)

print("Accuracy of the Random Forest Classification: ",rf.score(x_test,y_test))

<a id = '18' ></a><br>
## Confusion Matrix

In [None]:
y_pred = rf.predict(x_test)
y_true = y_test

In [None]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_true,y_pred)
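
Each cell cm[i, j] counts the samples whose true class is i and whose predicted class is j, so the diagonal holds the correct predictions. The sketch below rebuilds a 2x2 matrix by hand from made-up labels (hypothetical values, not the model's output) to show this layout:

```python
import numpy as np

# Hypothetical true and predicted labels for six patients.
true_labels = [0, 0, 1, 1, 1, 0]
pred_labels = [0, 1, 1, 1, 0, 0]

demo_cm = np.zeros((2, 2), dtype=int)
for t, p in zip(true_labels, pred_labels):
    demo_cm[t, p] += 1          # row = true class, column = predicted class

# Accuracy is the share of predictions that lie on the diagonal.
accuracy = np.trace(demo_cm) / demo_cm.sum()

print(demo_cm)
print("accuracy:", accuracy)
```

In this toy case four of the six predictions lie on the diagonal, so the accuracy is 4 / 6.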

<a id = '19' ></a><br>
## Visualization of the Confusion Matrix

In [None]:
import seaborn as sns

f,ax = plt.subplots(figsize=(15,15))
sns.heatmap(cm,annot = True,linewidths = 0.5,linecolor = "green",fmt = ".0f",ax = ax)
plt.xlabel("y_pred")
plt.ylabel("y_true")
plt.show()