<h2>Heart Attack Data Analysis</h2>

1. Load the dataset
2. Check missing values and data type for each feature
3. Separate the categorical and continuous features
4. Analyze discrete features with respect to the target variable
5. Data distribution analysis of continuous features
6. Check the outliers in the continuous data
7. Perform feature scaling
8. Train and test split
9. Apply the following algorithms:<br>
    a. Support Vector Machine<br>
    b. Logistic Regression<br>
    c. Naive Bayes Algorithm<br>
    d. KNN Algorithm<br>
    e. Decision Tree<br>
    f. Random Forest<br>
    g. Gradient Boosting Classifier<br>
10. Apply Neural Network and check the accuracy 

**Data Dictionary**

age - Age of the patient

sex - Sex of the patient

cp - Chest pain type ~ 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-anginal Pain, 3 = Asymptomatic

trtbps - Resting blood pressure (in mm Hg)

chol - Cholesterol in mg/dl fetched via BMI sensor

fbs - (fasting blood sugar > 120 mg/dl) ~ 1 = True, 0 = False

restecg - Resting electrocardiographic results ~ 0 = Normal, 1 = ST-T wave abnormality, 2 = Left ventricular hypertrophy

thalachh - Maximum heart rate achieved

oldpeak - Previous peak (ST depression induced by exercise relative to rest)

slp - Slope of the peak exercise ST segment

caa - Number of major vessels

thall - Thallium Stress Test result ~ (0,3)

exng - Exercise induced angina ~ 1 = Yes, 0 = No

output - Target variable
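
The coded categories above can be kept in a small lookup so that tables and plots show labels rather than integers. This is a minimal sketch, assuming the code-to-label mappings listed in the dictionary; `CODE_LABELS`, `label_codes` and the toy frame are illustrative names and values, not part of the dataset or of the cells below.

```python
import pandas as pd

# Labels copied from the data dictionary; the names used here are illustrative.
CODE_LABELS = {
    "cp": {0: "Typical angina", 1: "Atypical angina", 2: "Non-anginal pain", 3: "Asymptomatic"},
    "fbs": {0: "False", 1: "True"},
    "exng": {0: "No", 1: "Yes"},
}

def label_codes(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which the coded columns are replaced by readable labels."""
    labelled = frame.copy()
    for column, mapping in CODE_LABELS.items():
        if column in labelled.columns:
            labelled[column] = labelled[column].map(mapping)
    return labelled

# Tiny made-up frame, only to show the call; in the notebook, pass `df` instead.
toy = pd.DataFrame({"cp": [0, 2, 3], "exng": [1, 0, 0], "age": [63, 37, 41]})
print(label_codes(toy))
```

The helper returns a copy, so the integer codes stay available for modelling.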


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df=pd.read_csv("/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv")

In [None]:
df.head()

In [None]:
df.shape

In [None]:
#Missing Values
df.isnull().sum()

In [None]:
#Data type
for i in df.columns:
    print(i," : ",df[i].dtype)

In [None]:
#Number of unique values in each category
df.nunique()

In [None]:
#Discrete features
discrete_features=[feature for feature in df.columns if df[feature].nunique()<=10 and feature!='output' ]
discrete_features

In [None]:
#Continuous features
continuous_features=[feature for feature in df.columns if df[feature].nunique()>10]
continuous_features

In [None]:
#Target feature
target='output'
target

In [None]:
df[continuous_features].describe().transpose()

**Analysis of discrete features with respect to the target variable**

In [None]:

for i,feature in enumerate(discrete_features):
    plt.figure(i)
    sns.countplot(x=target,hue=feature,data=df,palette='Paired')
    

Observations:
1. sex = 1 has the higher count of heart attack cases
2. cp (chest pain) type 2, that is non-anginal pain, has the highest count of heart attack cases
3. fbs = 0 (fasting blood sugar > 120 mg/dl is false) has the higher count of heart attack cases
4. restecg = 1 (ST-T wave abnormality: T wave inversions and/or ST elevation or depression of > 0.05 mV) has the highest count of heart attack cases
5. exng = 0 (no exercise induced angina) has the higher count of heart attack cases
6. caa = 0 (number of major vessels) has the highest count of heart attack cases
7. thall = 2 (Thallium Stress Test result ~ (0,3)) has the highest count of heart attack cases
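
Raw counts depend on how many records each category has, so a per-category rate can be a useful companion to the count plots. This is a minimal sketch, assuming the column names from the data dictionary; the `positive_rate` helper and the toy rows are illustrative and are not the notebook's data.

```python
import pandas as pd

def positive_rate(frame: pd.DataFrame, feature: str, target: str = "output") -> pd.DataFrame:
    """For each category of `feature`, count the records and the share with target == 1."""
    grouped = frame.groupby(feature)[target].agg(n="count", rate="mean")
    return grouped.sort_values("rate", ascending=False)

# Made-up rows purely for illustration; in the notebook, pass `df` instead.
toy = pd.DataFrame({
    "sex":    [1, 1, 1, 1, 0, 0],
    "output": [1, 0, 0, 1, 1, 1],
})
print(positive_rate(toy, "sex"))
```

In the toy rows the larger group has more positive records but the lower rate, which is the kind of difference the count plots alone cannot show.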

**Data Distribution of continuous features**

In [None]:
for i,feature in enumerate(continuous_features):
    plt.figure(i)
    #sns.set_color_codes()
    # histplot with a KDE overlay replaces the deprecated distplot
    sns.histplot(df[feature],kde=True,color='g')

**Check the outliers**

In [None]:
for i,feature in enumerate(continuous_features):
    plt.figure(i)
    sns.boxplot(x=feature,data=df,palette="Set3")
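
The box plots mark points beyond 1.5 times the interquartile range from the quartiles. The same rule can be applied numerically to count how many such points each feature has. This is a minimal sketch; `iqr_outlier_counts` and the toy values are illustrative, not taken from the dataset.

```python
import pandas as pd

def iqr_outlier_counts(frame: pd.DataFrame, columns, k: float = 1.5) -> pd.Series:
    """Count the values outside [Q1 - k*IQR, Q3 + k*IQR] for each column."""
    counts = {}
    for column in columns:
        q1, q3 = frame[column].quantile([0.25, 0.75])
        iqr = q3 - q1
        outside = (frame[column] < q1 - k * iqr) | (frame[column] > q3 + k * iqr)
        counts[column] = int(outside.sum())
    return pd.Series(counts, name="n_outliers")

# Made-up values for illustration; in the notebook, call
# iqr_outlier_counts(df, continuous_features) instead.
toy = pd.DataFrame({"chol": [200, 210, 220, 230, 240, 600]})
print(iqr_outlier_counts(toy, ["chol"]))
```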
    

In [None]:
#Check the correlation between the features
sns.heatmap(df.corr(),cmap="YlGnBu")

In [None]:
for i,feature in enumerate(continuous_features):
    plt.figure(i)
    #sns.distplot(df[feature],hue='target',data=df)
    sns.kdeplot(data=df, x=feature, hue='output', fill=True,palette=["#8000ff","#da8829"])
    

**Observations**: From the data distribution of the continuous features we can see that

1. Heart attack risk does not appear to be strongly affected by age
2. The greater the thalachh value, the higher the risk of heart attack
3. The lower the oldpeak value, the higher the risk of heart attack
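
A numeric summary per class can back up what the density plots suggest. This is a minimal sketch, assuming the column names from the data dictionary; the values in the toy frame are made up for illustration and say nothing about the real dataset.

```python
import pandas as pd

# Made-up rows purely for illustration; in the notebook, use
# df.groupby("output")[continuous_features].median() instead.
toy = pd.DataFrame({
    "thalachh": [150, 172, 178, 120, 130, 141],
    "oldpeak":  [0.0, 0.4, 0.8, 2.6, 1.4, 3.1],
    "output":   [1, 1, 1, 0, 0, 0],
})

# Median of each continuous feature within each class of the target
summary = toy.groupby("output")[["thalachh", "oldpeak"]].median()
print(summary)
```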

In [None]:
for i,feature in enumerate(discrete_features):
    plt.figure(i)
    sns.countplot(data=df,x=feature,hue='output', palette="Set2")

**Observations**: Relationship of the discrete variables with respect to the target variable

1. There are more records with sex=1 and fbs=0 than with their counterpart values, which causes an imbalance in the dataset
2. There are fewer records with thall=0/1, caa=3/4, slp=0 and restecg=2 than with the other values of their respective fields

In [None]:
sns.lineplot(x="age",y="chol",hue="output",data=df)

In [None]:
df.describe().transpose()

In [None]:
df.head()

**Feature Engineering**

In [None]:
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler


In [None]:
# creating a copy of df (plain assignment would only create a second name for the same object)
df1 = df.copy()

# define the columns to be encoded and scaled
cat_cols = ['sex','exng','caa','cp','fbs','restecg','slp','thall']
con_cols = ["age","trtbps","chol","thalachh","oldpeak"]

# encoding the categorical columns
df1 = pd.get_dummies(df1, columns = cat_cols, drop_first = True)

# defining the features and target
X = df1.drop(['output'],axis=1)
y = df1['output']

# instantiating the scaler
scaler = RobustScaler()

# scaling the continuous features
X[con_cols] = scaler.fit_transform(X[con_cols])
print("The first 5 rows of X are")
X.head()

In [None]:
X.head()
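
With its default settings, `RobustScaler` subtracts the median of each column and divides by its interquartile range (25th to 75th percentile), which makes it less sensitive to outliers than mean and standard deviation scaling. The sketch below checks that against a manual computation on a made-up column with one extreme value; the array is illustrative, not dataset data.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One column with a large outlier (made-up values)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(values)

# Manual version of the same transformation: (x - median) / IQR
median = np.median(values)
q1, q3 = np.percentile(values, [25, 75])
manual = (values - median) / (q3 - q1)

print(np.allclose(scaled, manual))
```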

**Train and Test Split**

In [None]:
from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=42)
print("The shape of X_train is      ", X_train.shape)
print("The shape of X_test is       ",X_test.shape)
print("The shape of y_train is      ",y_train.shape)
print("The shape of y_test is       ",y_test.shape)
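
The cells above scale the full feature matrix before splitting it. A common variation, shown here only as a sketch and not as what this notebook does, is to split first (optionally with `stratify` to keep the class ratio) and fit the scaler on the training rows alone, so that no test-set statistics reach the model. The data below is synthetic and the `_demo` names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))       # synthetic stand-in for the feature matrix
y_demo = np.array([0] * 40 + [1] * 60)   # synthetic labels, 60% positive

# stratify keeps the 40/60 class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

scaler = RobustScaler().fit(X_tr)        # statistics come from the training rows only
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)     # test rows reuse the training statistics

print(X_tr_scaled.shape, X_te_scaled.shape, y_te.mean())
```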

**Model Selection**

*Packages import for the models*

In [None]:
# Models
import torch
import torch.nn as nn
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

#Metrics
from sklearn.metrics import accuracy_score,roc_auc_score,classification_report

# Cross Validation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

#results dataset
results=pd.DataFrame(columns=['Algorithm','Accuracy Score'])

***Support Vector Machines***

In [None]:
#SVM Algorithm
svc=SVC(kernel='linear',random_state=42, C=1).fit(X_train,y_train)

#Predicting Values
y_predict=svc.predict(X_test)

score=accuracy_score(y_test,y_predict)
print('SVC: ',score)


*Hyperparameter Tuning*

In [None]:
svm=SVC()

parameters={"C":np.arange(1,5,1),"gamma":[0.00001,0.00005, 0.0001,0.0005,0.001,0.005,0.01,0.05,0.1,0.2,0.4,0.5,1,5]}

#Instantiating GridSearchCV
searcher=GridSearchCV(svm,parameters)

searcher.fit(X_train,y_train)

# the scores
print("The best params are :", searcher.best_params_)
print("The best score is   :", searcher.best_score_)

#Predict the values
y_pred=searcher.predict(X_test)

score=accuracy_score(y_test,y_pred)
print('SVC: ',score)
results.loc[len(results)]=['Support Vector Machine',score]

**Logistic Regression**

In [None]:
log_reg=LogisticRegression()

log_reg.fit(X_train,y_train)

y_pred_proba=log_reg.predict_proba(X_test)

y_pred=np.argmax(y_pred_proba,axis=1)

score=accuracy_score(y_test,y_pred)

print("Logistic Regression: ",score)
results.loc[len(results)]=['Logistic Regression',score]

**Naive Bayes Algorithm**

In [None]:
nb=GaussianNB()
nb.fit(X_train,y_train)
y_pred=nb.predict(X_test)

score=accuracy_score(y_test,y_pred)
print("Naive Bayes Algorithm: ",accuracy_score(y_test,y_pred))
results.loc[len(results)]=['Naive Bayes Algorithm',score]

**K Nearest Neighbor Algorithm**

Hyperparameter tuning for KNN

In [None]:
error_rate=[]

for i in range(1,40):
    knn=KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    predict_values=knn.predict(X_test)
    # Error rate: the share of test-set predictions that do not match the true labels
    error_rate.append(np.mean(predict_values!=y_test))

In [None]:
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,marker='o',linestyle='dashed',markerfacecolor='red')
plt.grid(color='g', linestyle='-', linewidth=0.5)
plt.xlabel('K')
plt.ylabel('Error Rate')
plt.title('Error Rate vs K Value')
plt.show()

In [None]:
knn=KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train,y_train)
y_pred=knn.predict(X_test)

score=accuracy_score(y_test,y_pred)
print("KNN Algorithm: ",accuracy_score(y_test,y_pred))
results.loc[len(results)]=['KNN Algorithm',score]
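
The loop above chooses K from the test-set error, so the test set also takes part in tuning. One alternative, sketched here on synthetic data from `make_classification` rather than on the heart dataset, is to choose K by cross-validation on the training rows and to use the test set only once at the end. The `_demo` names and the 5-fold setting are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data, not the heart dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

k_values = list(range(1, 40))
cv_error = []
for k in k_values:
    # Mean 5-fold accuracy on the training rows only
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5)
    cv_error.append(1 - scores.mean())

best_k = k_values[int(np.argmin(cv_error))]

# Refit with the selected K and evaluate once on the held-out test rows
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_accuracy = final_knn.score(X_te, y_te)
print("best k:", best_k, "test accuracy:", test_accuracy)
```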

**Tree Models**

**Decision Tree Model**

In [None]:
dec_tree=DecisionTreeClassifier(random_state=42)
dec_tree.fit(X_train,y_train)
y_pred=dec_tree.predict(X_test)
score=accuracy_score(y_test,y_pred)
print('Decision Tree: ',accuracy_score(y_test,y_pred))

results.loc[len(results)]=['Decision Tree',score]

**Random Forest Classifier**

In [None]:
rf=RandomForestClassifier(random_state=42)

# fitting the model
rf.fit(X_train, y_train)

# calculating the predictions
y_pred = rf.predict(X_test)

score=accuracy_score(y_test,y_pred)
# printing the test accuracy
print("The test accuracy score of Random Forest is ", accuracy_score(y_test, y_pred))

results.loc[len(results)]=['Random Forest',score]

**Gradient Boosting Classifier**

In [None]:
grad_boost=GradientBoostingClassifier(n_estimators = 300,max_depth=1,subsample=0.8,max_features=0.2,random_state=42)

grad_boost.fit(X_train,y_train)
y_pred=grad_boost.predict(X_test)

score=accuracy_score(y_test,y_pred)
print('Gradient Boosting Classifier: ',accuracy_score(y_test,y_pred))

results.loc[len(results)]=['Gradient Boosting Classifier',score]

In [None]:
results
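
The table above compares models on accuracy only, although `classification_report` and `roc_auc_score` are imported alongside `accuracy_score`. The sketch below shows how those two could be used for one model. It runs on synthetic data from `make_classification`, so the numbers it prints are not results for the heart dataset, and the `_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the heart dataset
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
y_hat = model.predict(X_te)                  # hard class labels
y_score = model.predict_proba(X_te)[:, 1]    # probability of the positive class

# Precision, recall and F1 per class
print(classification_report(y_te, y_hat))

# ROC AUC is computed from the scores, not from the hard labels
auc = roc_auc_score(y_te, y_score)
print("ROC AUC:", auc)
```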

**Neural Network**

*Packages*

In [None]:
#Import the Sequential which helps to build model layer by layer
from tensorflow.keras.models import Sequential
#To create hidden layers
from tensorflow.keras.layers import Dense
#Import the activation function
from tensorflow.keras.layers import LeakyReLU,ReLU,ELU
#To avoid overfitting
from tensorflow.keras.layers import Dropout
from tensorflow.keras.optimizers import Adam
from keras_tuner.tuners import RandomSearch

*Hyperparameter Tuning*

In [None]:
def build_model(hp):
    model=Sequential()
    #the for loop will generate the number of hidden layers
    for i in range(hp.Int('num_layer',2,32)):
        model.add(Dense(units=hp.Int('units_'+str(i),min_value=32,max_value=512,step=32),
                       activation='relu'))
    #Output layer
    model.add(Dense(1,activation='sigmoid'))
    #Compile layer
    model.compile(optimizer=Adam(learning_rate=hp.Choice('learning_rate',[1e-2,1e-3,1e-4])),
                 loss='binary_crossentropy',metrics=['accuracy'])
    return model

In [None]:
#Instantiating random search
tuner=RandomSearch(build_model,objective='val_accuracy',max_trials=5,executions_per_trial=3)

In [None]:
tuner.search(X_train,y_train,
            epochs=5,
            validation_data=(X_test,y_test))

In [None]:
tuner.results_summary()

In [None]:
#Fetch the best model from the hyperparameter tuning
best_model = tuner.get_best_models(num_models=1)[0]
best_hyperparameters = tuner.get_best_hyperparameters(1)[0]

In [None]:
# Fit and Evaluate the best model.
best_model.fit(X_train,y_train, epochs=10, validation_data=(X_test,y_test))
metric_values = best_model.evaluate(X_test, y_test)

In [None]:
print('Neural Network accuracy: ', metric_values[1])