In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# **GENERAL DESCRIPTION OF THE DATASET**

A heart attack is a fairly common ailment, occurring at an average age of 65 for men and 72 for women. Recent research, however, puts the spotlight on an alarming trend - "**a rising incidence of heart attacks in younger adults**."

This dataset contains information about people and their chances of succumbing to a **HEART ATTACK**.

The contents of the dataset range from personal records like age, sex, etc. to more cardiovascular-related information like blood pressure levels, types of chest pain, cholesterol levels, etc.


#### Given below is a brief explanation of the terminology used in the given dataset :-

1. age : Age of the patient
2. sex : Sex of the patient
3. exng : exercise induced angina (1 = yes; 0 = no)
4. caa : number of major vessels (0-3)
5. cp : Chest Pain type <br>
   • Value 0: typical angina <br>
   • Value 1: atypical angina <br>
   • Value 2: non-anginal pain <br>
   • Value 3: asymptomatic <br>
6. trtbps : resting blood pressure (in mm Hg)
7. chol : cholesterol in mg/dl fetched via BMI sensor
8. fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
9. restecg : resting electrocardiographic results <br>
   • Value 0: normal <br>
   • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) <br>
   • Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria <br>
10. thalachh : maximum heart rate achieved
11. output : <br>
    • 0 = less chance of heart attack <br>
    • 1 = more chance of heart attack

# **OBJECTIVES OF THE NOTEBOOK**

1. Perform the EDA (Exploratory Data Analysis) followed by the VDA (Visual Data Analysis) to better understand the given dataset. 

2. Correctly predict (max accuracy) the probability of a person succumbing to a heart attack with the help of the various features present in the dataset.

3. Apply and compare the various ML classification algorithms we have at our disposal and select the one with the highest accuracy.

4. Boost the ML classification models using processing techniques and check for any substantial increase in the accuracy scores.

# **IMPORTING THE LIBRARIES AND DATASET**

### **LIBRARIES**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import plotly.express as px

### **DATASET**

In [None]:
dataset = pd.read_csv ("../input/heart-attack-analysis-prediction-dataset/heart.csv")

# **EXPLORATORY DATA ANALYSIS (EDA)**

In this section, we perform some basic operations on the dataset to understand it at a rudimentary level: for example, checking for missing values, inspecting the data types of the features, and dropping features of no value.

In [None]:
dataset.shape

In [None]:
dataset.columns

In [None]:
dataset.info ()

In [None]:
dataset.head ()

At first glance, all of the features present in our dataset appear to contribute to the final heart attack prediction in some form or other. Hence, in the initial stages of EDA we won't be dropping any feature/column.

### **CHECKING FOR MISSING DATA**

In [None]:
dataset.isnull ()

The output of **isnull ()** (an alias of **isna ()**) shows only the first 5 and the last 5 rows, so we don't get any actual idea whether a data point is missing in the middle of the table. We will rectify that in the next code cell.

In [None]:
dataset.isna (). sum ()

**isna (). sum ()** counts the missing (NaN/None) entries in each column of the dataset; it does not flag values that are present but incorrectly entered. Since the count is zero for every column, it is safe to state that the dataset doesn't contain any missing values.
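
As a small illustration of what `isna().sum()` does and does not report, here is a minimal sketch on a made-up frame (the values below are hypothetical and are not taken from the heart dataset). The range check at the end is only one possible way to look for implausible entries.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; "chol" has two gaps.
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233.0, np.nan, 204.0, np.nan],
})

# Number of NaN/None entries per column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# isna() says nothing about values that are present but implausible,
# so those need an explicit check (here: non-positive ages).
implausible_age = int((toy["age"] <= 0).sum())
print(implausible_age)
```

A column with a non-zero count would need imputation or row removal before modelling.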

### **CHECKING THE NUMBER OF UNIQUE VALUES**


In [None]:
# Avoid naming the variable "dict", which would shadow Python's built-in type
unique_counts = {}
for col in dataset.columns:
    unique_counts[col] = dataset[col].nunique()

pd.DataFrame(unique_counts, index=["unique count"]).transpose()

The **DataFrame.describe()** method generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding NaN values. It tells us a lot about a dataset. One important point is that, by default, describe() summarizes only numeric columns. Categorical (non-numeric) columns are ignored unless the parameter include="all" is passed.

Now, let's understand the statistics that are generated by the describe() method:

• count tells us the number of non-empty rows in a feature. <br>
• mean tells us the mean value of that feature. <br>
• std tells us the standard deviation of that feature. <br>
• min tells us the minimum value of that feature. <br>
• 25%, 50%, and 75% are the percentiles/quartiles of each feature. This quartile information helps us to detect outliers. <br>
• max tells us the maximum value of that feature.
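
To make the link between quartiles and outliers concrete, the sketch below applies the common 1.5 × IQR rule to a small made-up series. The values and the name `chol_demo` are hypothetical, and the 1.5 multiplier is a convention rather than something fixed by the dataset.

```python
import pandas as pd

# Made-up cholesterol-like values; one of them sits far from the rest.
values = pd.Series([204, 211, 233, 236, 250, 254, 263, 564], name="chol_demo")

summary = values.describe()
q1, q3 = summary["25%"], summary["75%"]
iqr = q3 - q1                      # interquartile range

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

The same pattern can be applied to a single numeric column such as `dataset["chol"]`.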

In [None]:
dataset.describe ()

### **CHECKING FOR DUPLICATED DATA**

In [None]:
dataset.duplicated (). sum ()

One duplicate row has been found; we remove it so that it isn't counted twice during analysis and training.

In [None]:
dataset.drop_duplicates (inplace = True)
dataset.duplicated (). sum ()

# **VISUAL DATA ANALYSIS (VDA)**

After the tabular analysis in the previous section, we move on to a visual analysis of the dataset. This includes a number of scatter plots, bar plots, etc. showing how the different features relate to one another and to the response variable. We will also get a general idea of which features are likely to play a more active role in determining the accuracy of the model.

For this section we will be using a new library **plotly.express** as well.

### **CORRELATION MATRIX AND HEATMAP**

In [None]:
df = pd.DataFrame (dataset, columns = ['age', 'sex', 'cp', 'trtbps', 'chol', 'fbs', 'restecg', 'thalachh', 'exng', 'oldpeak', 'slp', 'caa', 'thall', 'output'])
df.corr ()

In [None]:
corr_Matrix = df.corr ()
f,ax = plt.subplots(figsize=(18, 18))
sns.heatmap (corr_Matrix, linewidths = 0.5, annot = True, fmt= '.1f',ax=ax)
plt.show ()

The heatmap shows no strong correlation either between any two features or between a feature and the response variable (output column). Therefore, choosing features on the basis of such low correlation scores would not be advantageous.
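
One way to read a correlation matrix more quickly is to rank the features by the absolute value of their correlation with the response column. The sketch below does this on synthetic data; the column names `feat_strong` and `feat_noise` are hypothetical. Note that Pearson correlation only captures linear relationships, so a low score does not prove a feature is useless.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Synthetic data: one feature drives the label, the other is pure noise.
demo = pd.DataFrame({
    "feat_strong": rng.normal(size=n),
    "feat_noise": rng.normal(size=n),
})
demo["output"] = (demo["feat_strong"] + 0.5 * rng.normal(size=n) > 0).astype(int)

# Correlation of every feature with the response, ranked by magnitude.
corr_with_output = demo.corr()["output"].drop("output")
ranked = corr_with_output.abs().sort_values(ascending=False)
print(ranked)
```

On the notebook's data the equivalent call would be `df.corr()["output"]`.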

Instead, we will rely on available medical knowledge about heart attacks and the main factors that influence them.

For example, based on our features and a review of online medical sources, features like age, cp, trtbps, chol, etc. should be taken into account.

The list of relevant features is as follows :- <br>
• age <br>
• cp <br>
• trtbps <br>
• chol <br>
• fbs <br>
• restecg <br>
• thalachh

### **HEART ATTACK vs AGE** 

In [None]:
fig = px.histogram (dataset, x = "age", nbins = 6, facet_row = "output", title = "Heart Attacks Per Age Group", template = 'plotly_dark')
fig.show ()

**output = 1** corresponds to people who had a heart attack. <br>
**output = 0** corresponds to people who didn't have a heart attack.

**INFERENCES** <br>
• people in the age bracket of 50-60 (65 cases) have the highest tendency to have a heart attack, closely followed by the age group of 40-50 (50 cases). <br>
• 10 cases in the age range of 30-40 is an alarming figure because people in this age group weren't known to suffer from heart attacks. <br>
• the age group of 50-60 (60 cases) also leads the chart in the number of people not succumbing to heart attacks, followed by people in the range of 60-70 (48 cases).
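
The per-bracket counts read off the histogram can also be tabulated directly. The sketch below bins a small made-up sample into ten-year age groups with `pd.cut` and counts the cases per output class. The bin edges are an assumption chosen to match the brackets quoted above, since plotly picks its own bin edges from `nbins`.

```python
import pandas as pd

# Made-up sample, not the heart dataset.
demo = pd.DataFrame({
    "age":    [34, 45, 52, 58, 61, 67, 44, 55],
    "output": [1, 1, 1, 0, 0, 0, 1, 1],
})

# Left-closed ten-year brackets: [30, 40), [40, 50), [50, 60), [60, 70).
bins = [30, 40, 50, 60, 70]
demo["age_group"] = pd.cut(demo["age"], bins=bins, right=False)

# Rows are age groups, columns are the output classes (0 and 1).
counts = pd.crosstab(demo["age_group"], demo["output"])
print(counts)
```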

### **HEART ATTACK vs CHEST PAIN**

In [None]:
fig =  px.pie (dataset, names = "cp", hole = 0.4, template = "gridon", title = " Types Of Chest Pain")
fig.show ()
sns.countplot (x = 'cp', data = dataset)

Most patients report type 0/typical angina chest pain (47.4%), followed by type 2/non-anginal pain (28.5%), type 1/atypical angina (16.6%) and lastly type 3/asymptomatic (7.62%). <br>

In [None]:
fig = px.sunburst(dataset, names = "cp", path = ["output","cp"], template = "gridon", title = "Heart Attack Chances Based On Chest Pain ")
fig.show()

**SUNBURST CHART TERMINOLOGY** <br>

**lighter color corresponds to chest pain types** <br>
type 0 - typical angina <br>
type 1 - atypical angina <br>
type 2 - non-anginal pain <br>
type 3 - asymptomatic <br>
**darker color corresponds to heart attack response** <br>
1 - heart attack <br>
0 - non heart attack <br>

**INFERENCES** <br>
• patients with a higher chance of a heart attack, or who have suffered one, most often experience non-anginal/type 2 pain (41.4%) and least often asymptomatic/type 3 pain (9.7%). <br>
• 75.3% of the patients with a lesser chance of suffering from a heart attack experience typical angina (type 0) pain.
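
The percentages shown in a sunburst chart are shares within each output class, and they can be reproduced as a table with a row-normalised crosstab. The sketch below uses made-up rows to show the idea. The same pattern should apply to the other categorical columns (`fbs`, `restecg`).

```python
import pandas as pd

# Made-up sample: four patients per output class.
demo = pd.DataFrame({
    "output": [1, 1, 1, 1, 0, 0, 0, 0],
    "cp":     [2, 2, 1, 0, 0, 0, 0, 2],
})

# normalize="index" turns counts into within-row shares; multiply by 100 for percent.
share = pd.crosstab(demo["output"], demo["cp"], normalize="index") * 100
print(share.round(1))
```

Each row sums to 100, so the values can be compared directly with the sunburst labels.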

### **HEART ATTACK vs RESTING BLOOD PRESSURE**

In [None]:
fig = px.scatter (dataset, x = "age", y = "trtbps", color = "output", template = "plotly", size = "trtbps", title = "Age vs Resting Blood Pressure and Impact On Heart Attack")
fig.show ()
fig = px.histogram (dataset, x = "trtbps", nbins = 12, facet_row = "output", title = "Heart Attacks per Resting Blood Pressure Level", template = 'plotly_dark')
fig.show ()

**output = 1** corresponds to people who had a heart attack. <br>
**output = 0** corresponds to people who didn't have a heart attack.

**INFERENCES** <br>
• patients prone to a heart attack mostly have a resting blood pressure in the range of 120-140 mm Hg. <br>
• patients not prone to a heart attack have a resting blood pressure distributed more or less evenly in the range of 110-150 mm Hg.

### **HEART ATTACK vs CHOLESTEROL**

In [None]:
fig = px.scatter (dataset, x = "age", y = "chol", color = "output", template = "plotly", size = "chol", title = "Age vs Cholesterol Level and Impact On Heart Attack")
fig.show ()
fig = px.histogram (dataset, x = "chol", nbins = 10, facet_row = "output", title = "Heart Attacks per Cholesterol Level", template = 'plotly_dark')
fig.show ()

**output = 1** corresponds to people who had a heart attack. <br>
**output = 0** corresponds to people who didn't have a heart attack.

**INFERENCES** <br>
• most patients prone to a heart attack have a cholesterol level in the range of 200-250 mg/dL. <br>
• the maximum cholesterol level recorded for a heart attack prone patient was 564 mg/dL. <br>
• patients not prone to a heart attack have a cholesterol level distributed evenly in the range of 200-300 mg/dL.

### **HEART ATTACK vs FASTING BLOOD SUGAR**

In [None]:
fig =  px.pie (dataset, names = "fbs", hole = 0.4, template = "gridon", title = "Fasting Blood Sugar Levels")
fig.show ()
sns.countplot (x = 'fbs', data = dataset)

**1 - fbs > 120mg/dL** <br>
**0 - fbs ≤ 120mg/dL** <br>

Most patients have a fasting blood sugar of 120mg/dL or lower.

In [None]:
fig = px.sunburst(dataset, names = "fbs", path = ["output","fbs"], template = "gridon", title = "Heart Attack Chances Based On Fasting Blood Sugar Levels")
fig.show ()

**SUNBURST CHART TERMINOLOGY** <br>

**lighter color corresponds to fasting blood sugar level** <br>
type 0 - fbs≤120mg/dL <br>
type 1 - fbs>120mg/dL <br>
**darker color corresponds to heart attack response** <br>
1 - heart attack <br>
0 - non heart attack <br>

**INFERENCES** <br>
• 85.9% of the patients prone to a heart attack have a fasting blood sugar level of 120 mg/dL or lower. <br>
• 84% of the patients not prone to a heart attack also have a fasting blood sugar level of 120 mg/dL or lower, so fasting blood sugar on its own does little to separate the two groups.

### **HEART ATTACK vs RESTING ELECTROCARDIOGRAPHIC RESULTS**

In [None]:
fig =  px.pie (dataset, names = "restecg", hole = 0.4, template = "gridon", title = "Resting Electrocardiographic Results")
fig.show ()
sns.countplot (x = 'restecg', data = dataset)

The percentage of patients with a restecg of type 2 (1.32%) is more or less negligible in comparison to types 0 and 1 (48.7% and 50% respectively). 

In [None]:
fig = px.sunburst(dataset, names = "restecg", path = ["output","restecg"], template = "gridon", title = "Heart Attack Chances Based On Resting Electrocardiographic Results")
fig.show ()

**SUNBURST CHART TERMINOLOGY** <br>

**lighter color corresponds to resting electrocardiographic results** <br>
type 0: normal <br>
type 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV) <br>
type 2: showing probable or definite left ventricular hypertrophy by Estes' criteria <br>
**darker color corresponds to heart attack response** <br>
1 - heart attack <br>
0 - non heart attack <br>

**INFERENCES** <br>
• heart attack prone patients predominantly have ST-T wave abnormality/type 1 restecg result (57.9%) followed by normal/type 0 restecg result (41.4%). <br>
• patients not prone to a heart attack have the opposite scenario, with normal/type 0 restecg result being the major case (57.2%) and ST-T wave abnormality/type 1 restecg result being secondary (40.5%). <br>
• both categories have a marginal number of patients with a type 2 restecg result (the remaining 0.7% and 2.3% respectively).

### **HEART ATTACK vs MAXIMUM HEART RATE**

In [None]:
fig = px.scatter (dataset, x = "age", y = "thalachh", color = "output", size = "thalachh", template = "plotly", title = "Age vs Maximum Heart Rate and Impact on Heart Attack")
fig.show ()
fig = px.histogram (dataset, x = "thalachh", nbins = 7, facet_row = "output", title = "Heart Attacks per Maximum Heart Rate", template = 'plotly_dark')
fig.show ()

**output = 1** corresponds to people who had a heart attack. <br>
**output = 0** corresponds to people who didn't have a heart attack.

**INFERENCES** <br>
• patients with the maximum heart rate in the range of 160-180 are more prone to a heart attack. <br>
• at the same time patients with the maximum heart rate in the range of 140-160 are less prone to suffer a heart attack.

### **SOME 3D PLOTS AND PAIRPLOTS**

In [None]:
fig = px.scatter_3d (dataset, x = "age", y = "trtbps", z = "chol", color = "output", template = "plotly_dark", title = "Age vs Resting Blood Pressure vs Cholesterol")
fig.show ()

In [None]:
fig = px.scatter_3d (dataset, x = "age", y = "trtbps", z = "fbs", color = "output", template = "plotly_dark", title = "Age vs Resting Blood Pressure vs Fasting Blood Sugar")
fig.show ()

In [None]:
fig = px.scatter_3d (dataset, x = "age", y = "thalachh", z = "cp", color = "output", template = "plotly_dark", title = "Age vs Maximum Heart Rate vs Chest Pain")
fig.show ()

In [None]:
sns.pairplot (dataset, hue = "output")
plt.show ()

# **DATA PREPROCESSING**

1. Splitting dataset into matrix of features and response variables. <br>
2. Splitting dataset into training and test sets. <br>
3. Feature scaling.

### **SPLITTING DATASET INTO MATRIX OF FEATURES AND RESPONSE VARIABLES**

In [None]:
X = dataset.iloc [ : , : -1].values
Y = dataset.iloc [ :, -1].values
X.shape

### **SPLITTING DATASET INTO TRAINING AND TEST SETS**

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split (X, Y, test_size = 0.25, random_state = 1)

In [None]:
X_train.shape 

In [None]:
Y_train.shape 

In [None]:
X_test.shape

In [None]:
Y_test.shape

### **FEATURE SCALING**

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler ()
X_train = sc.fit_transform (X_train)
X_test = sc.transform (X_test)

In [None]:
print (X_train [ : 5, : ])

In [None]:
print (X_test [ : 5, : ])

### **PRINCIPAL COMPONENT ANALYSIS (PCA)**

PCA projects the data onto a smaller set of uncorrelated components that capture most of its variance, which reduces the dimensions of the data. It does not select or drop individual features; each component is a combination of all of them. It is usually helpful in datasets which have strong correlations between features (as a rough threshold, an absolute correlation above 30%).

After analysing the correlation heatmap of our dataset, most feature pairs have an absolute correlation of less than 30%. With so little shared variance to compress, applying PCA would bring little benefit for this dataset, so we skip it.
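
Whether PCA can compress a dataset can also be checked empirically from the cumulative explained variance. The sketch below does this on synthetic data with two strongly correlated columns and one independent column; it is an illustration under those assumptions, not a result for the heart dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
base = rng.normal(size=n)

# Synthetic data: columns 0 and 1 are almost identical, column 2 is independent.
X_demo = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=n),
    rng.normal(size=n),
])

# PCA is scale-sensitive, so standardise first.
X_scaled = StandardScaler().fit_transform(X_demo)
pca = PCA().fit(X_scaled)

# Share of total variance kept by the first k components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(cumulative)
```

Here two components retain nearly all of the variance because one column is redundant. When features are only weakly correlated, the cumulative curve rises slowly and few components can be dropped.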

# **MODEL IMPLEMENTATIONS WITH HYPER PARAMETER TUNING**

In [None]:
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.model_selection import GridSearchCV

### **LOGISTIC REGRESSION**

In [None]:
from sklearn.linear_model import LogisticRegression 
classifier_log = LogisticRegression ()
classifier_log.fit (X_train, Y_train)
Y_pred_log = classifier_log.predict (X_test)
acc_log = accuracy_score (Y_test, Y_pred_log)
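# The l1 penalty is only supported by the liblinear and saga solvers, so the grid is split into valid combinations
parameters = [{'penalty': ['l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                'solver': ['newton-cg', 'lbfgs', 'sag']},
              {'penalty': ['l1', 'l2'], 'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000],
                'solver': ['liblinear', 'saga']}]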
grid_search = GridSearchCV(estimator = classifier_log,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, Y_train)
best_accuracy_log = grid_search.best_score_
best_parameters = grid_search.best_params_
print(best_parameters)

### **K-NN MODEL**

In [None]:
from sklearn.neighbors import KNeighborsClassifier
classifier_knn = KNeighborsClassifier ()
classifier_knn.fit (X_train, Y_train)
Y_pred_knn = classifier_knn.predict (X_test)
acc_knn = accuracy_score (Y_test, Y_pred_knn)
parameters = [{'n_neighbors': [3,5,7,10,13,15], 'weights': ['uniform', 'distance'],
                'p': [1,2]}] 
grid_search = GridSearchCV(estimator = classifier_knn,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, Y_train)
best_accuracy_knn = grid_search.best_score_
best_parameters = grid_search.best_params_
print(best_parameters)

### **NAIVE BAYES MODEL**

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier_nb = GaussianNB ()
classifier_nb.fit (X_train, Y_train)
Y_pred_nb = classifier_nb.predict (X_test)
acc_nb = accuracy_score (Y_test, Y_pred_nb)

### **SVM MODEL**

In [None]:
from sklearn.svm import SVC
classifier_svm = SVC (kernel = 'rbf', random_state = 0)
classifier_svm.fit (X_train, Y_train)
Y_pred_svm = classifier_svm.predict (X_test)
acc_svm = accuracy_score (Y_test, Y_pred_svm)
parameters = [{'C': [0.001, 0.01, 0.1, 1, 10, 100, 1000], 'kernel': ['linear', 'rbf'],
                'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]}]
grid_search = GridSearchCV(estimator = classifier_svm,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, Y_train)
best_accuracy_svm = grid_search.best_score_
best_parameters = grid_search.best_params_
print(best_parameters)

### **DECISION TREE MODEL**

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier_dtc = DecisionTreeClassifier (criterion = 'entropy', random_state = 0)
classifier_dtc.fit (X_train, Y_train)
Y_pred_dtc = classifier_dtc.predict (X_test)
acc_dtc = accuracy_score (Y_test, Y_pred_dtc)
parameters = [{'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,70,90,120,150], 
                'max_leaf_nodes': [2,4,6,10,15,30,40,50,100], 'min_samples_split': [2, 3, 4]}]
grid_search = GridSearchCV(estimator = classifier_dtc,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, Y_train)
best_accuracy_dtc = grid_search.best_score_
best_parameters = grid_search.best_params_
print(best_parameters)

### **RANDOM FOREST MODEL**

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier_rfc = RandomForestClassifier (n_estimators = 100, criterion = 'entropy', random_state = 1)
classifier_rfc.fit (X_train, Y_train)
Y_pred_rfc = classifier_rfc.predict (X_test)
acc_rfc = accuracy_score (Y_test, Y_pred_rfc)
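parameters = [{'n_estimators': [100,200,300],
               'max_features': ['sqrt', 'log2'],  # 'auto' (same as 'sqrt' for classifiers) is rejected by newer scikit-learn
               'max_depth': [10,25,50,None],      # None, not the string 'none', means unlimited depth
               'min_samples_leaf': [1, 2], 
               'min_samples_split': [2, 5]}]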
grid_search = GridSearchCV(estimator = classifier_rfc,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, Y_train)
best_accuracy_rfc = grid_search.best_score_
best_parameters = grid_search.best_params_
print(best_parameters)

### **XGBOOST MODEL**

In [None]:
from xgboost import XGBClassifier
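# eval_metric is set on the constructor; recent XGBoost releases no longer accept it in fit ()
# and no longer use the use_label_encoder argument
classifier_xgb = XGBClassifier (eval_metric = "logloss")
classifier_xgb.fit (X_train, Y_train)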
Y_pred_xgb = classifier_xgb.predict (X_test)
acc_xgb = accuracy_score (Y_test, Y_pred_xgb)
parameters = {
        'min_child_weight': [1, 5, 10],
        'gamma': [0.5, 1, 1.5, 2],
        'max_depth': [3, 4, 5]}
grid_search = GridSearchCV(estimator = classifier_xgb,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,
                           n_jobs = -1)
grid_search.fit(X_train, Y_train)
best_accuracy_xgb = grid_search.best_score_
best_parameters = grid_search.best_params_
print(best_parameters)

### **ACCURACY COMPARISON**

In [None]:
# "ACCURACY SCORE" is the test-set accuracy with default hyper-parameters;
# "BEST ACCURACY" is the mean 10-fold cross-validation accuracy on the training set
df_pred = {"NAME OF MODEL" : ["LOGISTIC REGRESSION", "K-NN", "NAIVE BAYES", "SVM", "DECISION TREE", "RANDOM FOREST", "XGBOOST"],
           "ACCURACY SCORE" : [acc_log, acc_knn, acc_nb, acc_svm, acc_dtc, acc_rfc, acc_xgb],
           "BEST ACCURACY (AFTER HYPER-PARAMETER TUNING)" : [best_accuracy_log, best_accuracy_knn, np.nan, best_accuracy_svm, best_accuracy_dtc, best_accuracy_rfc, best_accuracy_xgb]}  # Naive Bayes was not tuned
df_predictions = pd.DataFrame (df_pred)
df_predictions

# **CONCLUSION**

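After applying EDA, VDA and several classification algorithms, the **K-NN Model** came out on top with the highest accuracy score of **0.850395 (85%)** after hyper-parameter tuning. Note that this figure is the mean 10-fold cross-validation accuracy on the training set (`best_score_`), not the accuracy on the held-out test set.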
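
Because `best_score_` is a cross-validation average computed on the training data, a tuned model can additionally be checked against the test split. The sketch below shows one way to do that on synthetic data generated with `make_classification` (a stand-in, not the heart dataset). It wraps scaling and K-NN in a pipeline, and the small parameter grid and 5 folds are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 samples, 13 features, binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

# With the scaler inside the pipeline, each CV fold is scaled using its own training part only.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(
    pipe,
    param_grid={"kneighborsclassifier__n_neighbors": [3, 5, 7]},
    scoring="accuracy",
    cv=5,
)
grid.fit(X_tr, y_tr)

cv_accuracy = grid.best_score_                 # mean accuracy over the CV folds
y_pred = grid.predict(X_te)                    # predictions from the refitted best estimator
test_accuracy = accuracy_score(y_te, y_pred)   # accuracy on data unseen during tuning
cm = confusion_matrix(y_te, y_pred)            # rows: true class, columns: predicted class
print(cv_accuracy, test_accuracy)
print(cm)
```

Reporting both numbers side by side makes it clearer how much of the tuned score carries over to unseen data.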