<h1> <span style="font-family: Arial;font-size:1.2em;color:#3366ff"> Heart attack dataset

<h4> <span style="font-family: Arial;font-size:1.2em;color:#333333"> Objective: To predict the chance of Heart attack

# Data dictionary
* Age : Age of the patient

* Sex : Sex of the patient

* exang: exercise induced angina (1 = yes; 0 = no)

* ca: number of major vessels (0-3)

* cp : chest pain type

* trtbps : resting blood pressure (in mm Hg)

* chol : cholesterol in mg/dl fetched via BMI sensor

* fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

* rest_ecg : resting electrocardiographic results

* target : 0 = less chance of heart attack; 1 = more chance of heart attack

<div style="color:white;
           display:fill;
           border-radius:40px;
           background-color:#88d8b0;
           font-size:300%;
           font-family:Arial;
           letter-spacing:0.20px">
<p style="padding: 10px;
          color:black;
          text-align:center;">Index 
</p>
</div>

<h4> Index

* <a href="#Packages">1.1.Packages</a>
* <a href="#EDA">1.2.EDA</a>
* <a href="#Tableau-dashboard">1.3.Tableau Dashboard</a>
* <a href="#Preprocessing">1.4.Preprocessing</a>
* <a href="#Modeling">1.5.Modeling</a>
* <a href="#Feature-importance">1.6.Feature importance</a>
* <a href="#Permutation-Importance">1.7.Permutation Importance</a>
* <a href="#Model-comparison">1.8.Model comparison</a>
* <a href="#Reference">1.9.Reference</a>

# Packages

In [None]:
## packages

# data processing
import numpy as np # linear algebra
import pandas as pd # data processing

# visuals
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

# Tree
from sklearn import tree
import graphviz

# preprocessing
from sklearn.preprocessing import RobustScaler,StandardScaler
from sklearn.model_selection import train_test_split,GridSearchCV

# SMOTE / ADASYN to treat class imbalance
from imblearn.over_sampling import (SMOTE,ADASYN)

# model evaluation
from sklearn import metrics

# model explainability
import eli5
from eli5.sklearn import PermutationImportance

# html
from IPython.display import HTML

# plotly offline
from plotly.offline import download_plotlyjs,init_notebook_mode
init_notebook_mode(connected=True)

# MISC
import warnings
warnings.filterwarnings("ignore")

In [None]:
df = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')

In [None]:
df.head()

In [None]:
# rename the variables
df.columns = ['Age','Sex','chest_pain_type','resting_blood_pressure','serum_cholestoral_mg',
              'fasting_blood_sugar','resting_electrocardiographic','maximum_heart_rate',
             'exercise_induced_angina','ST_depression','slope','no._major_vessels','Thalassemia','Target']

In [None]:
df.head() 

In [None]:
num_cols =['Age','serum_cholestoral_mg','maximum_heart_rate','resting_blood_pressure','ST_depression','Target']


cat_cols =['slope','Sex','chest_pain_type','fasting_blood_sugar',
           'Thalassemia','no._major_vessels',
           'resting_electrocardiographic','exercise_induced_angina']

# EDA

<div style="color:white;
           display:fill;
           border-radius:40px;
           background-color:#88d8b0;
           font-size:300%;
           font-family:Arial;
           letter-spacing:0.10px">
<p style="padding: 10px;
          color:black;
          text-align:center;">EDA
</p>
</div>

# stats

In [None]:
# stats
fi_df =df[num_cols]
print('stats of people who had higher chance of heart attack')
fi_df[df['Target']==1].describe().T

In [None]:
fi_df =df[num_cols]
print('stats of people who had low chance of heart attack')
fi_df[df['Target']==0].describe().T

* <span style="font-family: Arial;font-size:1.2em;color:#3366ff"> The mean maximum heart rate is 158.46 for those with a higher chance of heart attack vs 139.10 for those with a lower chance
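
The comparison above can be read off in one step by grouping on the target. A minimal sketch on a small made-up frame; the values are illustrative only and are not taken from heart.csv, and only the column names follow the renamed columns of this notebook:

```python
import pandas as pd

# toy data, illustrative only (not the heart.csv values)
toy = pd.DataFrame({
    "maximum_heart_rate": [170, 160, 150, 140, 130, 120],
    "Target":             [1,   1,   1,   0,   0,   0],
})

# mean of the feature within each target class
group_means = toy.groupby("Target")["maximum_heart_rate"].mean()
print(group_means)
```

On the real data the same call would be `df.groupby('Target')['maximum_heart_rate'].mean()`, provided it runs before `Target` is popped from `df`.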


In [None]:
# pairplot

sns.set_context(context='notebook',font_scale=1)
sns.pairplot(df.drop(cat_cols,axis=1),hue='Target');
plt.tight_layout()

* <span style="font-family:Arial;font-size:1.2em;color:#3366ff;"> some distributions are not well separated between the two classes, notably serum_cholestoral_mg and resting_blood_pressure, which suggests that these variables are the least helpful for the model
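
One way to put a number on "well separated" is the standardized mean difference between the two classes. This is a sketch under assumptions: the helper name and the toy values are hypothetical, and a small value only hints at weak univariate separation, it does not prove that a feature is useless to a model.

```python
import numpy as np

def standardized_mean_difference(a, b):
    """Absolute difference of the means divided by the pooled standard deviation."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    pooled_std = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return abs(a.mean() - b.mean()) / pooled_std

# toy values, illustrative only
well_separated = standardized_mean_difference([1, 2, 3], [7, 8, 9])
overlapping = standardized_mean_difference([1, 2, 3], [2, 3, 4])
print(well_separated, overlapping)
```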

In [None]:
# plot
plt.figure(figsize=(12,6))
sns.set_context(context='notebook',font_scale=1.2)
sns.heatmap(df[['Age','serum_cholestoral_mg','maximum_heart_rate',
                'resting_blood_pressure','ST_depression']].corr('pearson'),annot=True,cmap='Blues');
plt.title('Pearson correlation');
plt.tight_layout();

* <span style="font-family: Arial;font-size:1.2em;color:#3366ff">there is no high correlation among the variables
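
The heatmap reading can be double-checked programmatically by listing every pair whose absolute Pearson correlation crosses a cut-off. A minimal sketch; the helper name and the 0.8 threshold are assumptions, and the toy frame is made up:

```python
import pandas as pd

def high_corr_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, r) for every column pair with |Pearson r| >= threshold."""
    corr = frame.corr(method="pearson")
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs

# toy frame: b is exactly 2 * a, c is uncorrelated with both
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [1.0, -1.0, 0.0, -1.0, 1.0],
})
pairs = high_corr_pairs(toy, threshold=0.8)
print(pairs)
```

An empty list on the numeric columns of the real data would back up the statement above.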

In [None]:
# Target
sns.set_context(context='notebook',font_scale=1)
sns.countplot(x=df['Target']);  # pass the column as x; recent seaborn versions expect keyword arguments
plt.tight_layout();
print(df['Target'].value_counts(1)*100)# target class percentage

* <span style="font-family: Arial;font-size:1.2em;color:#3366ff"> the target variable is split roughly 54% (higher chance of heart attack) to 45% (lower chance)
    

In [None]:
# outliers are extreme values in the variables
df.plot(kind='box',figsize=(14,6));
plt.xticks(rotation=75);
plt.title('Outliers');
plt.tight_layout()

* <span style="font-family: Arial;font-size:1.2em;color:#3366ff">most of the variables have outliers
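
Box plots conventionally draw their whiskers at 1.5 times the interquartile range, so the points shown beyond them can also be counted directly. A minimal sketch assuming that convention; the helper name and the toy data are hypothetical:

```python
import pandas as pd

def iqr_outlier_counts(frame, k=1.5):
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

# toy data: x has one extreme value, y has none
toy = pd.DataFrame({
    "x": [10, 11, 12, 13, 14, 15, 100],
    "y": [1, 2, 3, 4, 5, 6, 7],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

For integer-coded categorical columns the rule can flag a rare category as an "outlier", so the counts are most meaningful for the continuous variables.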

<div style="color:white;
           display:fill;
           border-radius:40px;
           background-color:#88d8b0;
           font-size:300%;
           font-family:Arial;
           letter-spacing:0.10px">
<p style="padding: 10px;
          color:black;
          text-align:center;">Tableau dashboard
</p>
</div>

# Tableau dashboard

In [None]:
%%HTML
<div class='tableauPlaceholder' id='viz1624963845064' style='position: relative'><noscript><a href='#'><img alt='All in one ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;he&#47;heart_attack_dataset&#47;Allinone&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='heart_attack_dataset&#47;Allinone' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;he&#47;heart_attack_dataset&#47;Allinone&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='language' value='en-US' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1624963845064');                    var vizElement = divElement.getElementsByTagName('object')[0];                    if ( divElement.offsetWidth > 800 ) { vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else if ( divElement.offsetWidth > 500 ) { vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';} else { vizElement.style.width='100%';vizElement.style.height='2377px';}                     var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>

<div style="color:white;
           display:fill;
           border-radius:40px;
           background-color:#88d8b0;
           font-size:300%;
           font-family:Arial;
           letter-spacing:0.10px">
<p style="padding: 10px;
          color:black;
          text-align:center;">Preprocessing
</p>
</div>

# Preprocessing

* <span style="font-family: Arial;font-size:1.2em;color:#3366ff">i. Missing Values

In [None]:
# missing values
missing_value = 100* df.isnull().sum()/len(df)
missing_value = missing_value.reset_index()
missing_value.columns = ['variables','missing values in percentage']
missing_value = missing_value.sort_values('missing values in percentage',ascending=False)

# barplot
fig = px.bar(missing_value, y='missing values in percentage',x='variables',title='Missing values % in each column',
             template='ggplot2',text='missing values in percentage');
fig.update_traces(texttemplate='%{text:.2s}', textposition='inside')
fig.update_layout(uniformtext_minsize=8, uniformtext_mode='hide')

fig.show()

*  <span style="font-family: Arial;font-size:1.2em;color:#3366ff"> no missing values are found in the data; however, we still need to check for special characters and empty strings
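
`isnull()` only catches real NaN values, so placeholders such as empty strings or sentinel numbers slip through. A minimal sketch of a programmatic check; the placeholder markers, the helper names and the toy frame are all assumptions for illustration:

```python
import pandas as pd

# assumed placeholder markers, adjust to the data at hand
STRING_PLACEHOLDERS = ("", "?", "NA")
NUMERIC_PLACEHOLDERS = (-999,)

def is_placeholder(value):
    """True if a cell holds one of the assumed placeholder markers."""
    if isinstance(value, str):
        return value.strip() in STRING_PLACEHOLDERS
    return value in NUMERIC_PLACEHOLDERS

def count_placeholders(frame):
    """Count placeholder cells in every column of the frame."""
    return pd.Series(
        {col: int(frame[col].map(is_placeholder).sum()) for col in frame.columns}
    )

# toy frame, illustrative only
toy = pd.DataFrame({
    "Age": [63, -999, 41],
    "note": ["ok", "  ", "?"],
})
placeholder_counts = count_placeholders(toy)
print(placeholder_counts)
```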

### <span style="font-family: Arial;font-size:1.2em;color:#3366ff">ii. Data cleaning

In [None]:
# visual check for any empty string, -999, 0

def visual_check(df):
    """
    Print the unique values and summary statistics of each
    variable and return the list of columns whose minimum is zero
    """

    cols_with_zeros = []

    for counter, feature in enumerate(df.columns, start=1):

        print('\n')
        print('***********************************************************')
        print(f'Col. NO.{counter} Column name: {feature}')
        print('***********************************************************')
        print(' ')
        print('1. Unique values:',df[feature].unique())
        print(' ')

        try:

            print('2. Min values:',df[feature].min())
            print(' ')
            print('3. Max values:',df[feature].max())
            print(' ')
            print('4. no. unique:',df[feature].nunique())
            print(' ')
            print('5. value counts:')
            print(df[feature].value_counts(1)*100)
            print(' ')
            print('**************************************************')
            print('--------------------------------------------------')
            print('\n ')

            # a minimum of zero flags the column for a closer look
            if df[feature].min()==0:
                cols_with_zeros.append(feature)

        except TypeError:
            # min and max are undefined for columns with mixed types
            print('min and max unsupported')

    return cols_with_zeros

In [None]:
#  visual check for special characters
visual_check(df)

### <span style="font-family: Arial;font-size:1.2em;color:#3366ff">iii.Outliers 

In [None]:
# outliers
df.plot(kind='box',figsize=(12,6));
plt.xticks(rotation=80);

* <span style="font-family: Arial;font-size:1.2em;color:#3366ff">Outliers are not treated: this is medical data, so any outlier treatment needs to proceed with caution
* <span style="font-family: Arial;font-size:1.2em;color:#3366ff">some of the outlying data points may be associated with heart attacks, so replacing them may affect the predictive power of the model; we can still check the metrics and then decide



In [None]:
# model metrics function

def model_metrics(X_train, X_test, y_train, y_test, model, name):
    """
    This function will print accuracy, F1 score
    and confusion matrix
    
    """

    # model predict
    predict_train = model.predict(X_train)
    predict_test = model.predict(X_test)

    # accuracy score
    train_score = model.score(X_train,y_train)
    test_score = model.score(X_test,y_test)

    # f1-score
    f1_score = metrics.f1_score(y_test, predict_test)

    print(f'{name} Accuracy on Train set',train_score)
    print(f'{name} Accuracy on Test set',test_score)
    print(f'{name} F1-score on Test set:',f1_score)
    print('\n')
    print(metrics.classification_report(y_test, predict_test))
    print('\n')

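    # confusion matrix
    # (plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator requires >= 1.0)
    metrics.ConfusionMatrixDisplay.from_estimator(model,X_test,y_test,cmap='Blues');
    plt.grid(False)
    plt.title(f'{name} Confusion Matrix on test set');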
    

### <span style="font-family: Arial;font-size:1.2em;color:#3366ff">iv. data split

In [None]:
X= df.drop('Target',axis=1)
# pop also removes 'Target' from df, so df.columns lists only the features from here on
y= df.pop('Target')

In [None]:
# Data split

X_train, X_test,y_train,y_test = train_test_split(X,y,test_size=0.30,random_state =101)

In [None]:
print(y_train.value_counts(1))
print(y_test.value_counts(1))
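
The two printouts above compare the class proportions of the train and test sets. With a purely random split they can drift apart. A minimal sketch, assuming the `stratify` argument of `train_test_split` is acceptable here (the split in this notebook does not use it), shows on made-up labels how the proportions are preserved:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# toy labels, illustrative only: 60 positives and 40 negatives
y_toy = np.array([1] * 60 + [0] * 40)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 60/40 ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=101, stratify=y_toy
)
print(y_tr.mean(), y_te.mean())
```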

### <span style="font-family: Arial;font-size:1.2em;color:#3366ff">v. Scaling

In [None]:
## Scaling data

rc = RobustScaler()

X_train = rc.fit_transform(X_train)
X_test = rc.transform(X_test)

### <span style="font-family: Arial;font-size:1.2em;color:#3366ff">vi. Oversampling

In [None]:
## ADASYN oversampling

# use a lowercase name so the imported ADASYN class is not shadowed
adasyn = ADASYN(random_state=101)
X_train,y_train = adasyn.fit_resample(X_train, y_train.to_numpy())

<div style="color:white;
           display:fill;
           border-radius:40px;
           background-color:#88d8b0;
           font-size:300%;
           font-family:Arial;
           letter-spacing:0.10px">
<p style="padding: 10px;
          color:black;
          text-align:center;">Modeling
</p>
</div>

# Modeling

##### for modeling we can use bagging and boosting techniques and a voting classifier
##### since I haven't treated the outliers, some of the boosting models may be affected

*  <span style="font-family: Arial;font-size:1.2em;color:#3366ff">i. Decision Tree
*  <span style="font-family: Arial;font-size:1.2em;color:#3366ff">ii. Random Forest
*  <span style="font-family: Arial;font-size:1.2em;color:#3366ff">iii. LDA
*  <span style="font-family: Arial;font-size:1.2em;color:#3366ff">iv. Naive Bayes
* <span style="font-family: Arial;font-size:1.2em;color:#3366ff">v. XGboost
* <span style="font-family:Arial;font-size:1.2em;color:#3366ff">vi. Adaboostclassifier
* <span style="font-family:Arial;font-size:1.2em;color:#3366ff">vii. GradientBoostingClassifier
* <span style="font-family:Arial;font-size:1.2em;color:#3366ff">viii. LGBMClassifier
* <span style="font-family:Arial;font-size:1.2em;color:#3366ff">ix. SVM
* <span style="font-family:Arial;font-size:1.2em;color:#3366ff">x. KNN
  

### <span style="font-family: Arial;font-size:1.2em;color:#3366ff">1.Decision Tree

In [None]:
from sklearn.tree import DecisionTreeClassifier

DT_model= DecisionTreeClassifier(max_features= 5,max_depth= 7,min_samples_split= 90,min_samples_leaf= 30,random_state=42)

# fit the model
DT_model.fit(X_train,y_train)

# model score
model_metrics(X_train, X_test,y_train,y_test,DT_model,'Decision_Tree')

### <span style="font-family: Arial;font-size:1.2em;color:#3366ff">2. Random Forest

In [None]:
# Random forest
from sklearn.ensemble import RandomForestClassifier

RF_model = RandomForestClassifier(n_estimators= 600,min_samples_split= 90,min_samples_leaf= 20,max_features= 5,max_depth= 10)

# fit the model
RF_model.fit(X_train,y_train)

# model score function
model_metrics(X_train, X_test,y_train,y_test,RF_model,'Random_forest')

#### <span style="font-family: Arial;font-size:1.2em;color:#3366ff">3. LDA

In [None]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis


LDA = LinearDiscriminantAnalysis(solver='svd')

# fit the model
LDA.fit(X_train, y_train)

# model score function
model_metrics(X_train, X_test,y_train,y_test,LDA,'LDA')

#### <span style="font-family: Arial;font-size:1.2em;color:#3366ff"> 4. Naive bayes

In [None]:
from sklearn.naive_bayes import GaussianNB

NB_model = GaussianNB()

# fit the model
NB_model.fit(X_train, y_train)

# model score function
model_metrics(X_train, X_test,y_train,y_test,NB_model,'GaussianNB')

#### <span style="font-family: Arial;font-size:1.2em;color:#3366ff"> 5. Xgboost

In [None]:
# Xgboost
import xgboost as xgb

# create object model
Xgboost_model = xgb.XGBClassifier(learning_rate=0.01,verbosity=0)

# fit the model
Xgboost_model.fit(X_train,y_train)

# model score function
model_metrics(X_train, X_test,y_train,y_test,Xgboost_model,'Xgboost_model')

#### <span style="font-family: Arial;font-size:1.2em;color:#3366ff">6. Adaboostclassifier

In [None]:
from sklearn.ensemble import AdaBoostClassifier

# create object model
Adaboost_model = AdaBoostClassifier(n_estimators=500,learning_rate=0.01)

# fit the model
Adaboost_model.fit(X_train,y_train)

# model score function
model_metrics(X_train, X_test,y_train,y_test,Adaboost_model,'Adaboost_model')

#### <span style="font-family: Arial;font-size:1.2em;color:#3366ff">7. GradientBoostingClassifier

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

# create object model
gradientboost_model = GradientBoostingClassifier(learning_rate=0.01,n_estimators=600,subsample=1.0)

# fit the model
gradientboost_model.fit(X_train,y_train)

# model score function
model_metrics(X_train, X_test,y_train,y_test,gradientboost_model,'gradientboost_model')


#### <span style="font-family: Arial;font-size:1.2em;color:#3366ff">8. LGBMClassifier

In [None]:
from lightgbm import LGBMClassifier

# create object model
LightBGM  = LGBMClassifier(learning_rate=0.01,n_estimators=60,subsample=1.0,max_depth=5)

# fit the model
LightBGM.fit(X_train,y_train)

# model score function
model_metrics(X_train, X_test,y_train,y_test,LightBGM ,'LightBGM')


#### <span style="font-family: Arial;font-size:1.2em;color:#3366ff">9. SVM

In [None]:
from sklearn.svm import SVC

# create object model
SVM = SVC(probability=True)

# fit the model
SVM.fit(X_train,y_train)

# model score function
model_metrics(X_train, X_test,y_train,y_test,SVM ,'SVM')

#### <span style="font-family: Arial;font-size:1.2em;color:#3366ff">10. KNeighborsClassifier

In [None]:
from sklearn.neighbors import KNeighborsClassifier

# create object model
KNN = KNeighborsClassifier()

# fit the model
KNN.fit(X_train,y_train)

# model score function
model_metrics(X_train, X_test,y_train,y_test,KNN ,'KNN')

<div style="color:white;
           display:fill;
           border-radius:40px;
           background-color:#88d8b0;
           font-size:300%;
           font-family:Arial;
           letter-spacing:0.10px">
<p style="padding: 10px;
          color:black;
          text-align:center;">voting classifier
</p>
</div>

# VotingClassifier

In [None]:
from sklearn.ensemble import VotingClassifier

voting_classifier = VotingClassifier(estimators=[
    ('LightBGM', LightBGM),
    ('RF_model', RF_model),
    ('NB_model',NB_model),
    ('Xgboost_model',Xgboost_model),
    ('SVM',SVM),
    ('Adaboost_model',Adaboost_model),
    

],voting='soft')

# voting classifier
voting_classifier = voting_classifier.fit(X_train,y_train.ravel())

# model score function
model_metrics(X_train, X_test,y_train,y_test,voting_classifier ,'voting_classifier')

<div style="color:white;
           display:fill;
           border-radius:40px;
           background-color:#88d8b0;
           font-size:300%;
           font-family:Arial;
           letter-spacing:0.20px">
<p style="padding: 10px;
          color:black;
          text-align:center;">Model comparison 
</p>
</div>

In [None]:
models_list = [KNN,SVM,LightBGM,gradientboost_model,Adaboost_model,Xgboost_model,NB_model,LDA,RF_model,DT_model,voting_classifier]
recall =[]
precision =[]
test_acc = []
train_acc = []
f1_score = []

for model in models_list:
    predict_test = model.predict(X_test)
    predict_train = model.predict(X_train)
    f1s = metrics.f1_score(y_test, predict_test)
    pre = metrics.precision_score(y_test, predict_test)
    rec = metrics.recall_score(y_test, predict_test)
    acc_test = model.score(X_test,y_test)
    acc_train = model.score(X_train,y_train)
    
    recall.append(rec)
    precision.append(pre)
    test_acc.append(acc_test)
    train_acc.append(acc_train)
    f1_score.append(f1s)
    
model_compare = pd.DataFrame({
'Models':['KNN','SVM','LightBGM','gradientboost_model','Adaboost_model',
          'Xgboost_model','NB_model','LDA','RF_model','DT_model','voting_classifier'],
'recall':recall,
'Precision':precision,
'f1_score':f1_score,
'Accuracy on Test':test_acc,
'Accuracy on Train':train_acc
})


# Model comparison

<h5> <span style="font-family: Arial;font-size:1.2em;color:#3366ff"> the objective here is to choose the model with the higher f1_score and the higher precision
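
For reference, precision, recall and F1 all follow from the confusion-matrix counts. A minimal sketch; the function name is hypothetical and the counts are made up, not results from this notebook:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# hypothetical counts, not taken from this notebook's results
p, r, f1 = precision_recall_f1(tp=40, fp=10, fn=10)
print(p, r, f1)
```

Sorting by F1 first and precision second favours models that balance both error types and, among ties, those that raise fewer false alarms.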


In [None]:
model_compare.sort_values(['f1_score','Precision'],ascending=[False,False],inplace=True)
model_compare.style.background_gradient(cmap='coolwarm_r')

* <span style="font-family: Arial;font-size:1.2em;color:#3366ff"> we can use SVM, Adaboost and Xgboost_model, which balance precision and recall, for better overall prediction
* <span style="font-family: Arial;font-size:1.2em;color:#3366ff"> we can use the voting classifier, which combines the outputs of the models and uses their average probability for prediction
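
The "average probability" behaviour of soft voting can be verified by hand. A minimal sketch on synthetic data, using two of the model types from this notebook as stand-ins; the data, the hyperparameters and the variable names are assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# synthetic data, illustrative only
X_toy, y_toy = make_classification(n_samples=200, n_features=6, random_state=0)

soft_vote = VotingClassifier(
    estimators=[
        ("dt", DecisionTreeClassifier(max_depth=3, random_state=0)),
        ("nb", GaussianNB()),
    ],
    voting="soft",
).fit(X_toy, y_toy)

# average the class probabilities of the fitted base models by hand
manual_proba = np.mean(
    [est.predict_proba(X_toy) for est in soft_vote.estimators_], axis=0
)
# the predicted class is the one with the highest averaged probability
manual_pred = soft_vote.classes_[np.argmax(manual_proba, axis=1)]
```

Soft voting needs every base model to expose `predict_proba`, which is why the SVM in this notebook is created with `probability=True`.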


<div style="color:white;
           display:fill;
           border-radius:40px;
           background-color:#88d8b0;
           font-size:300%;
           font-family:Arial;
           letter-spacing:0.20px">
<p style="padding: 10px;
          color:black;
          text-align:center;">Feature importance 
</p>
</div>

# Feature importance

In [None]:
# Decision tree
Tree_plot = tree.export_graphviz(DT_model,feature_names=df.columns)
graphviz.Source(Tree_plot)

* <span style="font-family:Arial;font-size:1.2em;color:#3366ff">The decision tree's Gini score shows that importance is given to no._major_vessels (ca); this seems to refer to fluoroscopy, an imaging technique that analyzes the blood flow in the heart vessels
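
The Gini value printed in each tree node is the impurity of the labels that reach it, and a split is preferred when it lowers the sample-weighted impurity of the children. A minimal sketch with made-up labels; the helper names are hypothetical:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity 1 - sum(p_k^2) of a collection of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(left, right):
    """Sample-weighted Gini impurity of a two-way split."""
    n = len(left) + len(right)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

# hypothetical labels
parent = gini_impurity([1, 1, 1, 0, 0, 0])
pure_split = split_impurity([1, 1, 1], [0, 0, 0])
mixed_split = split_impurity([1, 0, 1], [0, 1, 0])
print(parent, pure_split, mixed_split)
```

A feature that yields splits like `pure_split` near the root is the kind of feature the tree ranks as important.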

In [None]:
# random forest feature importance 

feature_importances = pd.Series(RF_model.feature_importances_, index=df.columns);
feature_importances.nlargest(15).plot(kind='barh',figsize=(10,6));
plt.title('Random forest feature_importances');

<div style="color:white;
           display:fill;
           border-radius:40px;
           background-color:#88d8b0;
           font-size:300%;
           font-family:Arial;
           letter-spacing:0.20px">
<p style="padding: 10px;
          color:black;
          text-align:center;">Permutation importance
</p>
</div>

# Permutation Importance

* <span style="font-family: Arial;font-size:1.2em;color:#3366ff"> Permutation importance is defined as the decrease in a model's accuracy, or any user-defined metric, when a single independent variable is randomly shuffled; observing the difference in the output shows which variables the model relies on to predict
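
The idea can be illustrated on synthetic data where the answer is known in advance. This sketch swaps in scikit-learn's own `permutation_importance` rather than the eli5 wrapper used in this notebook; the data and variable names are assumptions:

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

# synthetic data: the label depends only on the first column
rng = np.random.default_rng(0)
n = 300
informative = rng.normal(size=n)
noise = rng.normal(size=n)
X_toy = np.column_stack([informative, noise])
y_toy = (informative > 0).astype(int)

model = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_toy, y_toy)

# shuffle each column 10 times and record the mean drop in F1
result = permutation_importance(
    model, X_toy, y_toy, scoring="f1", n_repeats=10, random_state=1
)
importances = result.importances_mean
print(importances)
```

Shuffling the informative column should hurt the score, while shuffling the noise column the tree never uses should leave it unchanged.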

In [None]:
# permutation importance for Voting classifier
# (eli5 and PermutationImportance are already imported in the Packages cell)

permutation_for_voting = PermutationImportance(voting_classifier,random_state=1,scoring='f1').fit(X_test,y_test)
eli5.show_weights(permutation_for_voting, feature_names = df.columns.tolist())

In [None]:
# XGBoost permutation importance

permutation = PermutationImportance(Xgboost_model,random_state=1,scoring='f1').fit(X_test,y_test)
eli5.show_weights(permutation, feature_names = df.columns.tolist())

<span style="font-family:Arial;font-size:1.2em;color:#3366ff">we can see that the model has given importance to features like number of major vessels, chest pain type, Thalassemia and age

<div style="color:white;
           display:fill;
           border-radius:40px;
           background-color:#88d8b0;
           font-size:300%;
           font-family:Arial;
           letter-spacing:0.20px">
<p style="padding: 10px;
          color:black;
          text-align:center;">End
</p>
</div>

# Reference

* Data dictionary: https://archive.ics.uci.edu/ml/datasets/heart+disease
* Permutation importance, referred from Dan Becker's notebook: https://www.kaggle.com/dansbecker/permutation-importance
* Voting classifier: https://www.kaggle.com/amiiiney/titanic-top-20-with-ensemble-votingclassifier#4--Encode-categorical-feature

<h5> <span style="font-family: Arial;font-size:1.2em;color:#3366ff"> If you like this work, please upvote, and feel free to post any suggestions

# Thank you!