# Heart Attack Analysis

# Fact
**Heart disease** is the leading cause of death for men, women, and people of most racial and ethnic groups in the United States.

One person dies every **36** seconds in the United States from cardiovascular disease.

About 655,000 Americans die from **heart disease** each year—that's 1 in every 4 deaths.

References:
- https://www.cdc.gov/heartdisease/facts.htm

![](https://www.mainstreetfamilycare.com/wp-content/uploads/2018/10/iStock-heart-health.jpg)

# The Purpose of this notebook

In this notebook, I will analyze a dataset of people who have been tested for heart disease.

Link to the [original dataset](https://archive.ics.uci.edu/ml/datasets/Heart+Disease)


# About this dataset
- `age` : Age of the patient

- `sex` : Sex of the patient (0 = female; 1 = male)

- `exng`: exercise induced angina (1 = yes; 0 = no)

- `caa`: number of major vessels (0-3)

- `cp` : Chest Pain type

    - Value 0: typical angina
    - Value 1: atypical angina
    - Value 2: non-anginal pain
    - Value 3: asymptomatic
    
    
- `trtbps` : resting blood pressure (in mm Hg)

- `chol` : cholesterol in mg/dl fetched via BMI sensor

- `fbs` : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

- `restecg` : resting electrocardiographic results

    - Value 0: normal
    - Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    - Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
    
    
- `thalachh` : maximum heart rate achieved

- `slp`: the slope of the peak exercise ST segment
    - Value 0: upsloping
    - Value 1: flat
    - Value 2: downsloping

- `oldpeak`:  ST depression induced by exercise relative to rest

- `thall`: Thallium Stress Test result ~ (0,3)

- `output` : the target label (0 = less chance of heart attack, 1 = more chance of heart attack)
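
The coded columns above are easier to read in tables and charts once the codes are replaced with their labels. The following is a minimal sketch of that idea on a few made-up rows; the `toy` frame and the `labels` dictionary are illustrative names, not part of the dataset or of this notebook's code.

```python
import pandas as pd

# Toy rows only; the real notebook loads heart.csv instead.
toy = pd.DataFrame({"sex": [1, 0, 1], "cp": [0, 2, 3]})

# Code-to-label tables copied from the column description above.
labels = {
    "sex": {0: "female", 1: "male"},
    "cp": {
        0: "typical angina",
        1: "atypical angina",
        2: "non-anginal pain",
        3: "asymptomatic",
    },
}

# Series.map swaps each code for its label; codes missing from the
# mapping would become NaN, which makes unexpected values easy to spot.
decoded = toy.copy()
for col, mapping in labels.items():
    decoded[col] = toy[col].map(mapping)

print(decoded)
```

Keeping the original numeric columns for modelling and using the decoded copy only for display avoids feeding strings into the models.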



## Importing libraries

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Charts
import matplotlib.pyplot as plt
import seaborn as sns
from scikitplot.estimators import plot_learning_curve
from sklearn.metrics import plot_confusion_matrix
from keras.utils.vis_utils import plot_model

#  Models
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Dropout, Flatten
from tensorflow.keras.layers import Conv2D, MaxPool2D, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
import xgboost as xgb
import lightgbm as lgbm
import catboost as ctb


# Preprocessing
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import MinMaxScaler

# Scoring
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split, cross_val_score, KFold, cross_validate, StratifiedKFold
from sklearn.metrics import make_scorer, f1_score, precision_score, recall_score, accuracy_score
from sklearn.metrics import classification_report, roc_curve, roc_auc_score, confusion_matrix

# Hyperparameters and features importance
from sklearn.model_selection import GridSearchCV
import eli5
from eli5.sklearn import PermutationImportance

# suppress version-related warnings
import warnings
warnings.filterwarnings(action='ignore', category=UserWarning)

## Loading the dataset

In [None]:
path = '/kaggle/input/heart-attack-analysis-prediction-dataset/'

df_heart = pd.read_csv(path + 'heart.csv')

## The size of the dataset

In [None]:
df_heart.shape

# Exploratory data analysis

## Basic info about data

In [None]:
df_heart.info()

## Sample data

In [None]:
df_heart.sample(15)

## Checking missing values

In [None]:
df_heart.isnull().sum().sum()

#### There are no NaN values in the dataset.
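
A single total does not show where gaps would be if a different version of the file contained any. As a hedged sketch on made-up data (the `toy` frame below is hypothetical and deliberately contains gaps), a per-column count can be read next to the total:

```python
import numpy as np
import pandas as pd

# Made-up frame with two deliberate gaps, for illustration only.
toy = pd.DataFrame({
    "age": [63, np.nan, 41],
    "chol": [233, 250, np.nan],
    "sex": [1, 0, 1],
})

# Count of missing values in each column.
missing_per_column = toy.isnull().sum()

# Grand total, equivalent to toy.isnull().sum().sum().
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("total:", total_missing)
```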

## Checking duplicates

In [None]:
df_heart.duplicated().sum()

## Removing duplicate

In [None]:
df_heart.drop_duplicates(inplace=True)

#### I divide the features into categorical, continuous and label columns

In [None]:
categorical_cols = ['sex', 'cp', 'fbs', 'restecg', 'exng','slp', 'caa','thall'] # 8
continous_cols = ['age', 'trtbps', 'chol','thalachh', 'oldpeak'] # 5
label_col = ['output']

name_change = {
    'sex': {'0': 'female', '1': 'male'}, 
    'fbs': {'0': 'false', '1': 'true'}, 
    'exng': {'0': 'no', '1': 'yes'},
    'cp': {'0': 'typical angina', '1': 'atypical angina', '2': 'non-anginal pain', '3': 'asymptomatic'},
    'restecg': {'0': 'normal', '1': 'having ST-T wave abnormality', '2': 'showing probable or definite left ventricular hypertrophy'},
    'caa': {'0': '0 vessels', '1': '1 vessels', '2': '2 vessels', '3': '3 vessels', '4': '4 vessels'}, 
    'slp': {'0': '0', '1': '1', '2': '2'},
    'thall': {'0': '0', '1': '1', '2': '2', '3': '3'}, 
}

## Statistics of continuous columns

In [None]:
df_heart[continous_cols].describe().T

### Distribution of continuous features

#### Colors for charts

In [None]:
mycolors = ['red', 'blue', 'brown', 'orange']

In [None]:
cnt = 0
max_in_row = 1
for x in continous_cols:
    data = df_heart[x]
    plt.figure(cnt//max_in_row, figsize=(25,8))
    plt.subplot(1, max_in_row, (cnt)%max_in_row + 1)
    plt.title(f'Distribution of {x} variable', fontsize=20)
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.xlabel(x, fontsize=16)
    plt.ylabel('Count', fontsize=16)
    sns.histplot(data, bins=50, kde=True);  # kde is a boolean flag
    cnt += 1

In [None]:
cnt = 0
max_in_row = 1
for x in continous_cols:
    plt.figure(cnt//max_in_row, figsize=(25,8))
    plt.subplot(1, max_in_row, (cnt)%max_in_row + 1)
    plt.title(x, fontsize=20)
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.xlabel(x, fontsize=16)
    plt.ylabel('Density', fontsize=16)
    sns.kdeplot(data=df_heart, x=x, hue="output", fill=True, common_norm=False, alpha=.5, linewidth=0);
    cnt += 1

## Conclusion
1. `age`:
   * In this dataset, people with a higher chance of heart attack (`output` = 1) are most often around 50 years old.
   
2. `chol`:
   * In this dataset, people with higher cholesterol appear less likely to get a heart attack.
   
3. `thalachh`:
   * People with a higher maximum heart rate are more likely to have a heart attack.
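
Readings taken from density plots can be cross-checked with a small per-class summary table. The sketch below uses a handful of made-up rows (the `toy` frame is hypothetical) and compares the median of each continuous column between the two `output` classes; on the real data the same call would be made on `df_heart`.

```python
import pandas as pd

# Made-up rows for illustration; not taken from heart.csv.
toy = pd.DataFrame({
    "age": [63, 37, 41, 56, 57, 67],
    "thalachh": [150, 187, 172, 178, 163, 108],
    "output": [1, 1, 1, 1, 0, 0],
})

# Median of each continuous column within each target class.
summary = toy.groupby("output")[["age", "thalachh"]].median()

print(summary)
```

The median is used here because it is less sensitive to outliers than the mean.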


## Boxplot of continuous features

In [None]:
cnt = 0
max_in_row = 1
for x in continous_cols:
    data = df_heart[x]
    plt.figure(cnt//max_in_row, figsize=(25,8))
    plt.subplot(1, max_in_row, (cnt)%max_in_row + 1)
    plt.title(x, fontsize=20)
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.xlabel(x, fontsize=16)
    sns.boxplot(data = data);
    sns.despine(offset=10, trim=True);
    cnt += 1

## Barplot of the categorical features

In [None]:
cnt = 0
max_in_row = 1
for x in categorical_cols:
    val1 = df_heart[x].value_counts().index
    val1 = [name_change[x][str(val)] for val in val1]
    cnt1 = df_heart[x].value_counts().values
    plt.figure(cnt//max_in_row, figsize=(25,8))
    plt.subplot(1, max_in_row, (cnt)%max_in_row + 1)
    plt.title(x, fontsize=20)
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.xlabel(x, fontsize=16)
    plt.bar(val1, cnt1, color=mycolors);
    cnt += 1

#### There are more men than women in the dataset.

In [None]:
cnt = 0
max_in_row = 1
for x in categorical_cols:
    plt.figure(cnt//max_in_row, figsize=(25,8))
    plt.subplot(1, max_in_row, (cnt)%max_in_row + 1)
    plt.title(x, fontsize=20)
    plt.xticks(fontsize=16)
    plt.yticks(fontsize=16)
    plt.xlabel(x, fontsize=16)
    sns.kdeplot(data=df_heart, x=x, hue="output", fill=True, common_norm=False, alpha=.5, linewidth=0,);
    cnt += 1

## Conclusion
1. `sex`:
   * Males (`sex` = 1) have a higher chance of heart attack.
   
2. `cp`:
   * People with non-anginal pain (`cp` = 2) have a higher chance of heart attack.
   
3. `restecg`:
   * People with an ST-T wave abnormality (`restecg` = 1) have a higher chance of heart attack.

4. `exng`:
   * People with no exercise-induced angina (`exng` = 0) have a higher chance of heart attack.

5. `slp`:
   * People with a downsloping peak exercise ST segment (`slp` = 2) have a higher chance of heart attack.

6. `caa`:
   * People with 0 major vessels have a much higher chance of heart attack.

7. `thall`:
   * People with `thall` = 2 have a higher chance of heart attack.
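
Density plots treat category codes as if they were continuous, so a table of rates per category may be easier to verify. Here is a minimal sketch on made-up rows (the `toy` frame is hypothetical): `pd.crosstab` with `normalize="index"` gives, for each category value, the share of rows in each `output` class.

```python
import pandas as pd

# Made-up rows for illustration; not taken from heart.csv.
toy = pd.DataFrame({
    "exng":   [0, 0, 0, 1, 1, 1, 0, 1],
    "output": [1, 1, 0, 0, 0, 1, 1, 0],
})

# Each row of the result sums to 1: the share of output = 0 and
# output = 1 within one exng category.
rates = pd.crosstab(toy["exng"], toy["output"], normalize="index")

print(rates)
```

Comparing rates rather than raw counts matters when the categories have very different sizes, as with `sex` in this dataset.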

## Count of the target 

In [None]:
f1 = df_heart['output'].map(lambda x:  '1 = more chance of heart attack' if x == 1 else '0 = less chance of heart attack')

plt.figure(figsize=(18,10))
val = f1.value_counts().index
cnt = f1.value_counts().values

plt.title('Count of the target', size=20)
plt.tick_params(labelsize=16)
plt.ylabel('Count', size=16)
plt.xlabel('output', size=16)
plt.bar(val, cnt, color = mycolors);
plt.show()

#### The dataset contains more cases with `output` = 1.

## Correlation Matrix

In [None]:
plt.figure(figsize = (24, 24))
sns.heatmap(df_heart.corr(), cmap = "coolwarm", annot=True, fmt='.1f', linewidths=0.1);
plt.yticks(rotation=0, size=16)
plt.xticks(size=16)
plt.title('Correlation Matrix', size=26)
plt.show()

In [None]:
plt.figure(figsize = (24, 24))
sns.heatmap(df_heart.corr()>=0.4, cmap = "coolwarm", annot=True, fmt='.1f', linewidths=0.1);
plt.yticks(rotation=0, size=16)
plt.xticks(size=16)
plt.title('Correlation Matrix', size=26)
plt.show()

#### As we can see, the variables are only weakly correlated with each other.
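
A large heatmap can be hard to scan, so one option is to list only the pairs whose absolute correlation passes a threshold. The sketch below does this on a small made-up frame (`toy`, with columns `a`, `b`, `c`, is hypothetical); the 0.4 cut-off mirrors the threshold used in the second heatmap.

```python
import numpy as np
import pandas as pd

# Made-up numeric columns for illustration only.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 11],
    "c": [5, 1, 4, 2, 3],
})

corr = toy.corr()

# Keep the strict upper triangle so each pair appears once and the
# diagonal (always 1.0) is excluded.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack()

# Filter on the absolute value so strong negative correlations count too.
strong = pairs[pairs.abs() >= 0.4].sort_values(
    key=lambda s: s.abs(), ascending=False
)

print(strong)
```

Filtering on the absolute value differs from `corr() >= 0.4`, which would hide strong negative relationships.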

## Pairplot according to target variable

In [None]:
sns.pairplot(df_heart, hue='output');

# Training model

#### Split the data and scale the continuous columns (the one-hot encoding of the categorical columns is left commented out)

In [None]:
# df_heart = pd.get_dummies(df_heart, columns = categorical_cols)

X = df_heart.drop(['output'],axis=1)
y = df_heart['output']

# split the data into training and testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 2021)

X_train_raw = X_train.copy()
X_test_raw = X_test.copy()

X_train_norm = X_train.copy()
X_test_norm = X_test.copy()

X_train_stand = X_train.copy()
X_test_stand = X_test.copy()

X_train_own = X_train.copy()
X_test_own = X_test.copy()

scaler = StandardScaler()
X_train_stand[continous_cols] = scaler.fit_transform(X_train_stand[continous_cols])
X_test_stand[continous_cols] = scaler.transform(X_test_stand[continous_cols])

norm = MinMaxScaler()
X_train_norm[continous_cols] = norm.fit_transform(X_train_norm[continous_cols])
X_test_norm[continous_cols] = norm.transform(X_test_norm[continous_cols])

X_train_own["age"]= np.log(X_train.age)
X_train_own["trtbps"]= np.log(X_train.trtbps)
X_train_own["chol"]= np.log(X_train.chol)
X_train_own["thalachh"]= np.log(X_train.thalachh)

X_test_own["age"]= np.log(X_test.age)
X_test_own["trtbps"]= np.log(X_test.trtbps)
X_test_own["chol"]= np.log(X_test.chol)
X_test_own["thalachh"]= np.log(X_test.thalachh)
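
If the one-hot encoding were switched on, one way to keep scaling and encoding together is scikit-learn's `ColumnTransformer`. This is only a sketch of that alternative on made-up rows, not what the cell above does; `toy`, `toy_continuous` and `toy_categorical` are hypothetical names.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up rows for illustration; not taken from heart.csv.
toy = pd.DataFrame({
    "age": [63, 37, 41, 56],
    "chol": [233, 250, 204, 236],
    "cp": [3, 2, 1, 1],
})
toy_continuous = ["age", "chol"]
toy_categorical = ["cp"]

preprocess = ColumnTransformer(
    [
        # Standardize the continuous columns.
        ("scale", StandardScaler(), toy_continuous),
        # One indicator column per category; unseen categories at
        # transform time are encoded as all zeros instead of raising.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), toy_categorical),
    ],
    sparse_threshold=0,  # always return a dense array
)

# In a real workflow: fit on the training split, then only transform
# the test split, as the scalers above already do.
encoded = preprocess.fit_transform(toy)

print(encoded.shape)
```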

### Helper functions that make life easier

In [None]:
def train_model(model, X, y):
    model.fit(X, y)
    return model


def predict_model(model, X, proba=False):
    if not proba:  # `~proba` is bitwise NOT and is truthy for both True and False
        y_pred = model.predict(X)
    else:
        y_pred_proba = model.predict_proba(X)
        y_pred = np.argmax(y_pred_proba, axis=1)

    return y_pred


list_scores = []

def run_model(name, model, X_train, X_test, y_train, y_test, fc, proba=False):
    print(name)
    print(fc)
    
    model1 = train_model(model, X_train, y_train)
    y_pred = predict_model(model1, X_test, proba)
    
    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    print(y_pred,'\n')
    print('accuracy: ', accuracy)
    print('recall: ',recall)
    print('precision: ', precision)
    print('f1: ', f1)
    print(classification_report(y_test, y_pred))
    
    
    plot_confusion_matrix(model, X_test, y_test, cmap='Blues');    
    plt.show()
    plot_learning_curve(model, X_train, y_train, cv=3);    
    plt.show()
    
    list_scores.append({'Model Name': name, 'Feature Scaling':fc, 'Accuracy': accuracy, 'Recall': recall, 'Precision': precision, 'F1':f1})

In [None]:
feature_scaling = {
    'Raw':(X_train_raw, X_test_raw, y_train, y_test),
    'Normalization':(X_train_norm, X_test_norm, y_train, y_test),
    'Standardization':(X_train_stand, X_test_stand, y_train, y_test),
    'Own':(X_train_own, X_test_own, y_train, y_test),
}

## Running some models on this data

In [None]:
model_svc = SVC(kernel='linear', C=1, random_state=2021)

for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    run_model('SVC', model_svc, X_train, X_test, y_train, y_test, fc_name)

In [None]:
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    svm = SVC()
    parameters = { 'C':np.arange(1,5,1),'gamma':[0.001, 0.005, 0.01, 0.05, 0.09, 0.1, 0.2, 0.5,1]}
    searcher = GridSearchCV(svm, parameters)
    
    run_model('Tuning SVC', searcher, X_train, X_test, y_train, y_test, fc_name )

In [None]:
logreg = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=2021)

for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    run_model('Logistic Regression', logreg, X_train, X_test, y_train, y_test, fc_name, proba=True)

In [None]:
for fc_name, value in feature_scaling.items():
    scores_1 = []
    X_train, X_test, y_train, y_test = value
    
    for i in range(1,50):
        knn = KNeighborsClassifier(n_neighbors = i)
        knn.fit(X_train, y_train)
        
        scores_1.append(accuracy_score(y_test, knn.predict(X_test)))
    
    max_val = max(scores_1)
    max_index = np.argmax(scores_1) + 1
    
    knn = KNeighborsClassifier(n_neighbors = max_index)
    knn.fit(X_train, y_train)

    run_model(f'KNeighbors Classifier n_neighbors = {max_index}', knn, X_train, X_test, y_train, y_test, fc_name)
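
The loop above picks `n_neighbors` by scoring on the test split, which can make the reported test score optimistic. A hedged alternative, shown here on synthetic data from `make_classification` rather than on the heart dataset, is to choose `k` by cross-validation on the training split only. The names `X_demo`, `y_demo` and `candidate_ks` are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; the notebook would use its own splits.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, random_state=2021
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2021
)

# Mean 5-fold accuracy on the training split for each candidate k.
candidate_ks = range(1, 21)
cv_means = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5).mean()
    for k in candidate_ks
]
best_k = candidate_ks[int(np.argmax(cv_means))]

# The test split is touched once, after k has been fixed.
final_knn = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
test_accuracy = final_knn.score(X_te, y_te)

print(best_k, test_accuracy)
```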

In [None]:
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    
    dt = DecisionTreeClassifier()
    
    parameters = { 'max_depth':np.arange(1,5,1),'random_state':[2021]}
    searcher = GridSearchCV(dt, parameters)
    
    run_model('DecisionTree Classifier', searcher, X_train, X_test, y_train, y_test, fc_name )

In [None]:
rf = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=2021)

for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    run_model('RandomForest Classifier', rf, X_train, X_test, y_train, y_test, fc_name)

In [None]:
gbt = GradientBoostingClassifier(n_estimators = 50, max_depth=2, subsample=0.8, max_features=0.2, random_state=2021)
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    run_model('GradientBoosting Classifier', gbt, X_train, X_test, y_train, y_test, fc_name)

In [None]:
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    xgb_model = xgb.XGBClassifier(n_estimators = 50, max_depth=3, random_state=2021, use_label_encoder=False, eval_metric='mlogloss')
        
    run_model('XGBoost Classifier', xgb_model, X_train, X_test, y_train, y_test, fc_name)

In [None]:
lgbm_model = lgbm.LGBMClassifier(max_depth = 3, n_estimators=50, subsample=0.8, random_state=2021)
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    run_model('Lightgbm Classifier', lgbm_model, X_train, X_test, y_train, y_test, fc_name)

In [None]:
cat_model = ctb.CatBoostClassifier(n_estimators = 80, depth=3, subsample=0.8, random_state=2021, verbose=0)
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value

    run_model('CatBoost Classifier',cat_model, X_train, X_test, y_train, y_test, fc_name)

## Additionally, we check a neural network

#### `draw_learning_curve` - function that draws the learning curves from the neural network's training history
#### `callbacks` - function that generates a fresh list of callbacks for the neural network

In [None]:
def callbacks(name): 
    return [ 
        EarlyStopping(monitor = 'loss', patience = 7), 
        ReduceLROnPlateau(monitor = 'loss', patience = 4), 
        ModelCheckpoint(f'../working/{name}.hdf5', save_best_only=True) # saving the best model
    ]

def draw_learning_curve(history, keys=('accuracy', 'loss')):
    plt.figure(figsize=(20,8))
    for i, key in enumerate(keys):
        plt.subplot(1, 2, i + 1)
        sns.lineplot(x = history.epoch, y = history.history[key])
        sns.lineplot(x = history.epoch, y = history.history['val_' + key])
        plt.title('Learning Curve')
        plt.ylabel(key.title())
        plt.xlabel('Epoch')
        plt.legend(['train', 'test'], loc='best')
    plt.show()

#### Creating and training NN

In [None]:
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    X_train_nn, y_train_nn = X_train, y_train
    
    X_test_nn, y_test_nn = X_test, y_test
    y_train_nn = to_categorical(y_train_nn)
    y_test_nn = to_categorical(y_test_nn)
    num_feats = X_train_nn.shape[1]
    num_classes = 2

    model = Sequential([

            Dense(700, input_dim = num_feats, activation='relu'),
            Dropout(0.7),
            BatchNormalization(),

            Dense(num_classes, activation='softmax')
        ])
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
#     model.summary()

    learning_history = model.fit(X_train_nn, y_train_nn,
              batch_size = 32, epochs = 100, verbose = 0,
              callbacks = callbacks('mlp'),
              validation_data = (X_test_nn, y_test_nn)
            );
    
    model = load_model('../working/mlp.hdf5')
    y_pred = model.predict(X_test_nn, verbose = 0)
    y_pred = np.argmax(y_pred, axis = 1)

    accuracy = accuracy_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    print(y_pred,'\n')
    print('accuracy: ', accuracy)
    print('recall: ',recall)
    print('precision: ', precision)
    print('f1: ', f1)
    print(classification_report(y_test, y_pred))

    draw_learning_curve(learning_history)

    list_scores.append({'Model Name': 'Neural Network', 'Accuracy': accuracy, 'Recall': recall, 'Precision': precision, 'F1':f1, 'Feature Scaling':fc_name})

## Summary scores

In [None]:
df_scores = pd.DataFrame(list_scores)
df_scores.style.highlight_max(color = 'lightgreen', axis = 0)
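
A long list of scores can be easier to compare once it is reshaped into one row per model and one column per scaling method. The sketch below uses made-up placeholder scores (the values in `toy_scores` are not results from this notebook) with the same keys that `list_scores` uses.

```python
import pandas as pd

# Placeholder numbers for illustration only; not real results.
toy_scores = [
    {"Model Name": "SVC", "Feature Scaling": "Raw", "F1": 0.80},
    {"Model Name": "SVC", "Feature Scaling": "Standardization", "F1": 0.85},
    {"Model Name": "KNN", "Feature Scaling": "Raw", "F1": 0.60},
    {"Model Name": "KNN", "Feature Scaling": "Standardization", "F1": 0.82},
]

# pivot_table tolerates repeated (model, scaling) pairs, which occur
# when the same model is run more than once; the best score is kept.
table = pd.DataFrame(toy_scores).pivot_table(
    index="Model Name", columns="Feature Scaling", values="F1", aggfunc="max"
)

# Scaling method with the highest F1 for each model.
best_scaling = table.idxmax(axis=1)

print(table)
print(best_scaling)
```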

### The best model for this dataset - `SVC`

In [None]:
model_svc = SVC(kernel='linear', C=1, random_state=2021)
for fc_name, value in feature_scaling.items():
    X_train, X_test, y_train, y_test = value
    run_model('SVC', model_svc, X_train, X_test, y_train, y_test, fc_name)

#### Most important features for model

In [None]:
model_svc = SVC(kernel='linear', C=1, random_state=2021)
model_svc.fit(X_train, y_train)

imp = PermutationImportance(model_svc, random_state = 2021).fit(X_train, y_train)
eli5.show_weights(imp, feature_names = X_train.columns.values, top = 15)

### We can see that our previous conclusions agree with the feature importances of the model.
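
Permutation importance can also be computed with scikit-learn's own `sklearn.inspection.permutation_importance`, and measuring it on held-out data shows which features matter for generalization rather than for fitting. The following is a sketch on synthetic data from `make_classification`, not a rerun of the cell above; `X_demo` and `y_demo` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in data with 3 informative features out of 6.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, n_redundant=0,
    random_state=2021,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=2021
)

svc = SVC(kernel="linear", C=1, random_state=2021).fit(X_tr, y_tr)

# Shuffle each feature 10 times on the held-out split and record the
# average drop in accuracy.
result = permutation_importance(
    svc, X_te, y_te, n_repeats=10, random_state=2021
)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]

print(ranking)
print(result.importances_mean[ranking])
```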

## Summary

#### We learned a lot of interesting things about heart disease.

#### I would love to hear your comments and notes on this.

#### If you liked it, make sure to vote :)

#### I'm going to make the next notebook soon.

<font size="6">
    <div style="text-align: center"> <b> Author </b> </div>
</font>

<font size="5">
    <div style="text-align: center"> Jędrzej </div>
    <div style="text-align: center"> Dudzicz </div>
</font>