# Project Description
HeartBeats 💓🔍: Unraveling the Rhythm of Heart Attack Analysis & Prediction

## Overview
Welcome to HeartBeats 💓🔍! In this data science project, we embark on a captivating journey to explore the world of heart attack analysis and prediction. Our aim is to gain valuable insights into the factors that contribute to heart attacks and develop an accurate predictive model to assess the risk of heart attack occurrence. Through a comprehensive workflow of data exploration, visualization, feature engineering, and machine learning, we seek to uncover the rhythm behind this critical cardiovascular event.

## Dataset
Our dataset (`heart.csv`, from the Kaggle heart-attack-analysis-prediction-dataset) records patient attributes and medical indicators alongside a target column, `output`. This notebook works with five numerical features (`age`, `trtbps`, `chol`, `thalachh`, `oldpeak`) and eight categorical features (`sex`, `cp`, `fbs`, `restecg`, `exng`, `slp`, `caa`, `thall`), and studies how they relate to the target. By harnessing the power of data, we strive to bring clarity to this intricate medical challenge.

## Key Objectives

#### **Data Exploration**
We will start by diving deep into the dataset, unearthing meaningful patterns and relationships between the variables. Through insightful visualizations, we aim to gain a comprehensive understanding of the data landscape.

#### **Feature Engineering**
Armed with domain knowledge and creative thinking, we will engineer new features to enhance the predictive power of our model. Crafting relevant features is like composing the melody that guides our heart attack prediction.

#### **Machine Learning Model**
We will train and compare a range of classifiers, from logistic regression and a single decision tree to boosted ensembles such as XGBoost, LightGBM, and CatBoost, to predict heart attack risk. Our model will learn from the past to pave the way for a healthier future.

#### **Model Evaluation**
We will evaluate each model on a held-out test set using accuracy and a per-class classification report (precision, recall, F1-score), then fine-tune hyperparameters with a grid search to strike the right chord in predicting heart attack risks.

#### **Insights & Interpretations**
Throughout the project, we will uncover fascinating insights into the factors that influence heart attack occurrences. By interpreting our model's outcomes, we hope to unlock novel perspectives on heart health.

#### **Impact & Implications**
HeartBeats 💓🔍 aspires to make a significant impact on the field of cardiovascular health. Our findings can empower healthcare professionals with better risk assessment tools, allowing for proactive interventions and potentially saving lives. Moreover, the knowledge gained from this project may foster public awareness about heart health and inspire lifestyle improvements.

### **Join the Rhythm**
We invite all data enthusiasts, medical professionals, and passionate learners to join us in the HeartBeats 💓🔍 project. Together, we will unravel the mysteries surrounding heart attacks and endeavor to create a harmonious symphony of data-driven insights. Let's take this journey to the rhythm of the heart! ❤️🚀

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
np.random.seed(0)
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('/kaggle/input/heart-attack-analysis-prediction-dataset/heart.csv')

data.shape

In [None]:
data.head()

In [None]:
data.hist( figsize=(20, 6) )
plt.tight_layout()

In [None]:
data.info()

In [None]:
data.describe()

# Separating Features

In [None]:
num_vars = [ 'age', 'trtbps', 'chol', 'thalachh', 'oldpeak' ]
cat_vars = [ 'sex', 'cp', 'fbs', 'restecg', 'exng', 'slp', 'caa', 'thall' ]
target = 'output'
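
A quick sanity check can confirm that the three groups above cover every column exactly once. The sketch below is only an illustration: it uses a tiny hand-made frame (`toy`, with made-up values and a subset of the columns) in place of `heart.csv`, and the `toy_*` names are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are made up for illustration only.
toy = pd.DataFrame({
    'age': [63, 37, 41, 56],
    'chol': [233, 250, 204, 236],
    'sex': [1, 1, 0, 1],
    'cp': [3, 2, 1, 1],
    'output': [1, 1, 1, 0],
})

toy_num_vars = ['age', 'chol']
toy_cat_vars = ['sex', 'cp']
toy_target = 'output'

listed = toy_num_vars + toy_cat_vars + [toy_target]

# Columns present in the frame but not assigned to a group, and the reverse.
missing = set(toy.columns) - set(listed)
unknown = set(listed) - set(toy.columns)
# True if any column name appears in more than one group.
duplicated = len(listed) != len(set(listed))

print('unassigned columns:', missing)
print('listed but absent:', unknown)
print('duplicates in lists:', duplicated)
```

The same three checks could be run against `data`, `num_vars`, `cat_vars`, and `target` directly.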

# Univariate

## Target

In [None]:
sns.countplot(data, x=target)

In [None]:
fig, axs = plt.subplots(2,5, figsize=(20,8))
axs = axs.ravel()

for i, feature in enumerate(num_vars):
    sns.histplot(data, x=feature, ax=axs[i])
    sns.violinplot(data, y=feature, ax=axs[i+5])

plt.tight_layout()

## Categorical Features

In [None]:
fig, axs = plt.subplots(ncols=8, figsize=(20,4))
axs = axs.ravel()

for i, feature in enumerate(cat_vars):
    sns.countplot(data, x=feature, ax=axs[i])

plt.tight_layout()

# Bivariate

In [None]:
fig, axs = plt.subplots(2,5, figsize=(20,8))
axs = axs.ravel()

for i, feature in enumerate(num_vars):
    sns.histplot(data, x=feature, hue=target, ax=axs[i])
    sns.violinplot(data, y=feature, x=target, ax=axs[i+5])

plt.tight_layout()

In [None]:
fig, axs = plt.subplots(ncols=8, figsize=(20,4))
axs = axs.ravel()

for i, feature in enumerate(cat_vars):
    crosstab = pd.crosstab( data[feature], data.output, normalize=True )
    sns.heatmap(crosstab, annot=True, cmap='Blues', cbar=False, ax=axs[i])

plt.tight_layout()
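
With `normalize=True`, `pd.crosstab` divides every cell by the grand total, so each heatmap above shows joint proportions. If the share of each outcome within a category is of more interest, `normalize='index'` makes every row sum to 1 instead. A minimal sketch on made-up data (the `toy` frame is hypothetical, not taken from the dataset):

```python
import pandas as pd

# Hypothetical data: 3 rows with sex=0, 5 rows with sex=1.
toy = pd.DataFrame({
    'sex':    [0, 0, 0, 1, 1, 1, 1, 1],
    'output': [1, 1, 0, 1, 0, 0, 0, 0],
})

# Joint proportions: all cells together sum to 1.
joint = pd.crosstab(toy['sex'], toy['output'], normalize=True)

# Row-wise proportions: each category of `sex` sums to 1.
by_row = pd.crosstab(toy['sex'], toy['output'], normalize='index')

print(joint)
print(by_row)
```

Which normalization to plot depends on the question: joint proportions show how common each combination is, while row-wise proportions make categories of different sizes comparable.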

# Correlation

In [None]:
from sklearn.preprocessing import OneHotEncoder

one_hot = OneHotEncoder(sparse_output=False)

cat_encoded = one_hot.fit_transform( data[ cat_vars ] )
vars_names = one_hot.get_feature_names_out()

data_cat = pd.DataFrame( cat_encoded, columns=vars_names )
data_corr = pd.concat( [ data.drop(cat_vars, axis=1).reset_index(drop=True), data_cat ], axis=1 )

# ---------------------------------------------------------------------------------------------

vars_corr = data_corr.drop( target, axis=1 ).corr()
target_corr = data_corr.corr()[ target ].drop(target, axis=0).sort_values(ascending=False)

In [None]:
fig, axs = plt.subplots(ncols=3, figsize=(20,6))

sns.heatmap( vars_corr, cmap='Blues', ax=axs[0] )
axs[0].set_title( 'Correlation Matrix' )

sns.heatmap( vars_corr, cmap='Blues', mask=vars_corr < 0.8, annot=True, ax=axs[1], vmin=0 )
axs[1].set_title( 'Correlation Matrix With Mask < 0.8' )

sns.heatmap( pd.DataFrame(target_corr), cmap='Blues', annot=True, vmin=0, ax=axs[2] )
axs[2].set_title( 'Target Correlation' )

plt.tight_layout()
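
Note that the mask `vars_corr < 0.8` in the middle panel also hides strongly negative correlations, since a value such as -0.9 is below 0.8. One possible variant, not used in this notebook, masks on the absolute value instead. The sketch uses synthetic columns (`a`, `b`, `c` are hypothetical):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)

# Synthetic frame: b is almost exactly -a, c is unrelated noise.
toy = pd.DataFrame({
    'a': a,
    'b': -a + rng.normal(scale=0.1, size=200),
    'c': rng.normal(size=200),
})

corr = toy.corr()

# In sns.heatmap, cells where the mask is True are hidden.
mask_signed = corr < 0.8        # hides strong negative correlations too
mask_abs = corr.abs() < 0.8     # keeps strong correlations of either sign

print(corr.round(2))
print('a-b hidden by signed mask:', mask_signed.loc['a', 'b'])
print('a-b hidden by abs mask:   ', mask_abs.loc['a', 'b'])
```

If the absolute-value mask is used for plotting, a diverging colormap and a symmetric `vmin`/`vmax` would be needed to tell the signs apart.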

# Modeling

In [None]:
from sklearn.model_selection import train_test_split

X = data.drop(target, axis=1)
y = data[target]

X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.25 )
print('X size: ', X.shape[0])
print('Train size: ', X_train.shape[0])
print('Test size: ', X_test.shape[0])
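
When the test set is small, the class balance in it can drift from one split to the next. A possible refinement, not used in this notebook, is to pass `stratify=y` together with a fixed `random_state`. The sketch below runs on synthetic labels (`X_toy` and `y_toy` are made up):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, 30% positives (illustration only).
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([1] * 30 + [0] * 70)

# stratify keeps the positive rate roughly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

print('positive rate, full :', y_toy.mean())
print('positive rate, train:', y_tr.mean())
print('positive rate, test :', y_te.mean())
```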

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.metrics import classification_report

In [None]:
ct = ColumnTransformer(
    [("cat", OneHotEncoder(), cat_vars),
    ("num", StandardScaler(), num_vars)])

model = make_pipeline( ct, DecisionTreeClassifier() )

model

In [None]:
model.fit( X_train, y_train )
y_pred = model.predict( X_test )

print( classification_report( y_test, y_pred ) )
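
The report above summarises precision and recall per class; the counts behind those numbers can be read from a confusion matrix. A small worked example on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical) shows how the two relate:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 4 positives and 6 negatives.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# Rows are true classes, columns are predicted classes (labels sorted: 0, 1).
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()

# Precision and recall for the positive class, computed from the counts.
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(cm)
print('precision:', precision, 'recall:', recall)
```

In a screening setting, false negatives (`fn`) are usually the costlier error, so recall on the positive class may deserve more weight than overall accuracy.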

In [None]:
tree_classifier = model.named_steps['decisiontreeclassifier']

feature_names = ct.get_feature_names_out()

plt.figure(figsize=(35, 8))
plot_tree(tree_classifier, feature_names=feature_names, filled=True, rounded=True, fontsize=6)

plt.tight_layout()

In [None]:
# DecisionTreeClassifier was already imported above
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

In [None]:
algorithms = {
    'Logistic Regression': LogisticRegression(),
    'Decision Tree': DecisionTreeClassifier(),
    'Naive Bayes': GaussianNB(),
    'KNN': KNeighborsClassifier(),
    'Random Forest': RandomForestClassifier(),
    'GBC' : GradientBoostingClassifier(),
    'XGB': XGBClassifier(),
    'LGBM': LGBMClassifier(),
    'Cat': CatBoostClassifier(verbose=False)
}

In [None]:

results = []
accuracys = []

for key, algorithm in algorithms.items():
    
    model = make_pipeline( ct, algorithm )
    model.fit( X_train, y_train )
    
    y_pred = model.predict( X_test )
    report = classification_report( y_test, y_pred, output_dict=True )
    
    # Reporting
    bar = '___' * 20
    formated_report = round(pd.DataFrame( report ).drop('accuracy', axis=1).transpose(), 4)
    accuracy = round(report['accuracy'], 4)
    score = f'{bar}\n\n{key}\nAccuracy: {accuracy}\n{bar}\n{formated_report}\n{bar}\n'
    results.append( score )
    accuracys.append( accuracy )
    print( score )

In [None]:
accuracy_results = pd.DataFrame( zip(algorithms.keys(), accuracys), columns=[ 'Algorithm', 'Accuracy' ] ).sort_values(ascending=False, by='Accuracy').set_index('Algorithm')

accuracy_results

In [None]:
accuracy_results.plot( figsize=(16,4), marker='o' )
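
A single train/test split yields one accuracy per algorithm, so small gaps in the ranking above may be noise. One alternative, sketched here under assumptions, is k-fold cross-validation with `cross_val_score`. The example uses synthetic data from `make_classification`, a plain `StandardScaler` in place of `ct`, and only two of the models:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data (illustration only).
X_toy, y_toy = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Decision Tree': DecisionTreeClassifier(random_state=0),
}

cv_scores = {}
for name, estimator in candidates.items():
    # The scaler is refitted inside every fold, so no fold sees the others' statistics.
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring='accuracy')
    cv_scores[name] = scores
    print(f'{name}: {scores.mean():.4f} +/- {scores.std():.4f}')
```

Reporting the mean together with the standard deviation across folds gives a sense of whether two algorithms really differ.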

# Hyperparameter Tuning

In [None]:
from sklearn.model_selection import GridSearchCV, RepeatedKFold

def hiperparameter( algorithm, param_grid ):
    # Grid-search `algorithm` on the preprocessed training data and report test accuracy
    rkf = RepeatedKFold(n_splits=2, n_repeats=3)
    grid_search = GridSearchCV(algorithm, param_grid, cv=rkf, scoring='accuracy')
    model = make_pipeline( ct, grid_search )

    model.fit(X_train, y_train)

    print("Best Params")
    print(grid_search.best_params_)

    # The pipeline applies `ct`, then scores with the refitted best estimator
    accuracy = model.score(X_test, y_test)
    print("Best Accuracy {:.4f}".format(accuracy))
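
In `hiperparameter`, the `ColumnTransformer` is fitted on all of `X_train` before the grid search splits it into folds, so the scaling statistics also see each validation fold. A possible alternative is to search over the whole pipeline and address parameters as `<step>__<param>`. The sketch below is an assumption-laden illustration: it uses synthetic data, a plain `StandardScaler` instead of `ct`, and a reduced grid; the step name follows the lowercase class name that `make_pipeline` assigns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RepeatedKFold, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart data (illustration only).
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Preprocessing and model live in one pipeline, which is refitted per fold.
pipe = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=0))

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'.
param_grid_toy = {
    'randomforestclassifier__n_estimators': [10, 50],
    'randomforestclassifier__max_depth': [None, 5],
}

rkf = RepeatedKFold(n_splits=2, n_repeats=3, random_state=0)
search = GridSearchCV(pipe, param_grid_toy, cv=rkf, scoring='accuracy')
search.fit(X_tr, y_tr)

print('Best params:', search.best_params_)
print('Test accuracy: {:.4f}'.format(search.score(X_te, y_te)))
```

With this layout, `search.best_estimator_` is a complete pipeline that can be applied to raw rows directly.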

In [None]:
param_grid = {
    'n_estimators': [50, 100, 200],
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20], 
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

hiperparameter( RandomForestClassifier(), param_grid )

In [None]:
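# Only valid penalty/solver pairs: 'lbfgs' supports l2 only, and 'elasticnet' is a penalty, not a solver
param_grid = [
    { 'penalty': ['l1', 'l2'], 'C': [0.1, 1.0, 10.0], 'solver': ['liblinear', 'saga'] },
    { 'penalty': ['l2'], 'C': [0.1, 1.0, 10.0], 'solver': ['lbfgs'] }
]
hiperparameter( LogisticRegression(max_iter=10000), param_grid )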

# Best Choice: Logistic Regression
Best test accuracy: 0.8553