# About this dataset [Heart Disease UCI](https://www.kaggle.com/ronitf/heart-disease-uci)
* Age: Age of the patient
* Sex:
    * 0: Female
    * 1: Male
* Chest Pain Type:
    * 0: Typical Angina
    * 1: Atypical Angina
    * 2: Non-Anginal Pain
    * 3: Asymptomatic
* Resting Blood Pressure: Person's resting blood pressure.
* Cholesterol: Serum Cholesterol in mg/dl
* Fasting Blood Sugar:
    * 0: Less than or equal to 120 mg/dl
    * 1: Greater than 120 mg/dl
* Resting Electrocardiographic Measurement:
    * 0: Normal
    * 1: ST-T Wave Abnormality
    * 2: Left Ventricular Hypertrophy
* Max Heart Rate Achieved: Maximum Heart Rate Achieved
* Exercise Induced Angina:
    * 1: Yes
    * 0: No
* ST Depression: ST depression induced by exercise relative to rest.
* Slope: Slope of the peak exercise ST segment:
    * 0: Upsloping
    * 1: Flat
    * 2: Downsloping
* Thalassemia: A blood disorder called 'Thalassemia':
    * 0: Normal
    * 1: Fixed Defect
    * 2: Reversible Defect
* Number of Major Vessels: Number of major vessels colored by fluoroscopy.
* target :
    * 0 = less chance of heart attack
    * 1 = more chance of heart attack
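
The integer codes listed above can be turned back into readable labels before plotting. The snippet below is a minimal sketch, not part of the original notebook: the helper `decode_categories`, the `LABELS` dictionary, and the toy frame are illustrative, and the column names are assumed to match the renamed columns used later on.

```python
import pandas as pd

# Label maps copied from the description above. The column names are an
# assumption: they follow the renamed columns used later in this notebook.
LABELS = {
    "sex": {0: "Female", 1: "Male"},
    "chest_pain_type": {
        0: "Typical Angina",
        1: "Atypical Angina",
        2: "Non-Anginal Pain",
        3: "Asymptomatic",
    },
    "st_slope": {0: "Upsloping", 1: "Flat", 2: "Downsloping"},
}


def decode_categories(frame, labels=LABELS):
    """Return a copy of `frame` with integer codes replaced by readable labels."""
    decoded = frame.copy()
    for col, mapping in labels.items():
        if col in decoded.columns:
            decoded[col] = decoded[col].map(mapping)
    return decoded


# Tiny made-up frame, only to show the call
toy = pd.DataFrame({"sex": [0, 1], "chest_pain_type": [3, 0], "st_slope": [1, 2]})
decoded_toy = decode_categories(toy)
print(decoded_toy)
```

Working on a decoded copy keeps the original integer columns available for the models, which expect numeric input.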

## 1. Objective:
* To predict, from a set of independent features, whether an individual is prone to heart attack.
* To study which features have the most impact on the prediction.
* To select the best model for predicting heart attack.

## 2. Questions to be answered:
1. Does the age of a person contribute towards heart attack?
2. Are different types of chest pain related to each other or the possibility of getting a heart attack?
3. Does high blood pressure increase the risk of heart attack?
4. Does the cholesterol level eventually contribute as a risk factor towards heart attack?

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
import seaborn as sns
import os

%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

# for preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold, cross_validate

# classifiers
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
import xgboost as xgb 

# Evaluation
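from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
try:
    from sklearn.metrics import plot_confusion_matrix  # removed in scikit-learn 1.2
except ImportError:
    plot_confusion_matrix = None  # newer versions provide ConfusionMatrixDisplay instead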

In [None]:
df = pd.read_csv('../input/heart-disease-uci/heart.csv')

## EDA

In [None]:
df.head()

In [None]:
df.columns

In [None]:
# rename columns for easy understanding
df.columns = ['age', 'sex', 'chest_pain_type', 'resting_blood_pressure', 'cholesterol', 'fasting_blood_sugar', 'rest_ecg', 'max_heart_rate_achieved',
        'exr_induced_angina', 'st_depression', 'st_slope', 'num_major_vessels', 'thalassemia', 'target']

In [None]:
cat_cols = ['sex', 'chest_pain_type', 'fasting_blood_sugar', 'rest_ecg','exr_induced_angina', 'st_slope', 'num_major_vessels', 'thalassemia']
num_cols = ['age', 'resting_blood_pressure', 'cholesterol', 'max_heart_rate_achieved','st_depression']

In [None]:
df.shape

In [None]:
df.info()

In [None]:
df.isnull().any().sum()

In [None]:
df.describe().T

In [None]:
df['target'].value_counts()

In [None]:
df.duplicated().sum()

In [None]:
print(f"shape before removing duplicates: {df.shape}")
df.drop_duplicates(inplace = True)
print(f"shape after removing duplicates: {df.shape}")

### Univariate analysis

In [None]:
df['target'].value_counts().plot(kind = 'bar', color=['red', 'blue'])

### Numerical Feature Analysis

In [None]:
plt.figure(figsize=(20,10))
for i, col in enumerate(num_cols):
    plt.subplot(2,3, i+1)
    sns.kdeplot(data = df, x= col,hue = 'target', palette = 'Set1', fill = True)
    plt.xticks(rotation = 90)

In [None]:
sns.set_palette("pastel")
plt.figure(figsize=(20,10))
for i, col in enumerate(num_cols):
    plt.subplot(2,3, i+1)
    sns.histplot(data = df, x = col, hue = 'target', palette = 'Set1')
    plt.xticks(rotation = 90)
    plt.title(f"{col}", fontsize = 14)

In [None]:
# sns.set_palette("pastel")
plt.figure(figsize=(20,10))
for i, col in enumerate(num_cols):
    plt.subplot(2,3, i+1)
    sns.boxplot(data = df, x = 'target', y = col, palette = 'Pastel1' )
    sns.swarmplot(data = df, x = 'target', y = col, palette = 'Set1')
    plt.xticks(rotation = 90)
    plt.title(f"{col}", fontsize = 14)

### Categorical Feature Analysis

In [None]:
plt.figure(figsize=(20,10))
for i, col in enumerate(cat_cols):
    plt.subplot(2,4, i+1)
    sns.countplot(data = df, x = col, palette = 'Set1')
    plt.xticks(rotation = 90)
    plt.title(f"{col}", fontsize = 14)

In [None]:
plt.figure(figsize=(20,10))
for i, col in enumerate(cat_cols):
    plt.subplot(2,4, i+1)
    sns.countplot(data = df, x = col, hue = 'target', palette = 'Set1')
    plt.xticks(rotation = 90)
    plt.title(f"{col}", fontsize = 14)

### Q1: Does the age of a person contribute towards heart attack?

In [None]:
# df['age'].value_counts().plot(kind = 'hist')
sns.histplot(data = df, x = 'age', hue = 'target')

### Ans1
Younger patients appear more prone to heart attack than those above 55.

### Q2: Are different types of chest pain related to each other or the possibility of getting a heart attack?
Ans: Patients with chest pain type 0 (typical angina) are the least likely to be in the heart attack group, while chest pain type 2 (non-anginal pain) is the type most associated with heart attack.
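
The count plots show absolute numbers, so it can help to look at the share of each target class within every chest pain type. The following is a rough sketch under the assumption that the frame has the renamed `chest_pain_type` and `target` columns; the helper `target_share_by_category` and the toy data are made up for illustration.

```python
import pandas as pd


def target_share_by_category(frame, col):
    """Row-normalised crosstab: share of each target class within each category of `col`."""
    return pd.crosstab(frame[col], frame["target"], normalize="index")


# Tiny made-up frame, only to show the call
toy_cp = pd.DataFrame({"chest_pain_type": [0, 0, 2, 2], "target": [0, 1, 1, 1]})
shares = target_share_by_category(toy_cp, "chest_pain_type")
print(shares)
```

On the real data the call would be `target_share_by_category(df, 'chest_pain_type')`; each row sums to 1, so categories with very different sizes stay comparable.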

### Q3: Does high blood pressure increase the risk of heart attack?
Ans: Resting blood pressure does not appear to affect the risk of heart attack in this dataset; patients with high blood pressure even seem slightly less prone to it. This is counterintuitive, but it is what the data shows.
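
A visual impression like this can be backed with a per-class summary and a correlation coefficient. This is only a sketch: `compare_by_target` is a hypothetical helper, the toy values are invented, and the real call is assumed to be `compare_by_target(df, 'resting_blood_pressure')`.

```python
import pandas as pd


def compare_by_target(frame, col):
    """Summarise a numeric column per target class and report its correlation with the target."""
    summary = frame.groupby("target")[col].agg(["mean", "median", "count"])
    # Pearson correlation with a 0/1 target equals the point-biserial correlation
    corr = frame[col].corr(frame["target"])
    return summary, corr


# Tiny made-up frame, only to show the call
toy_bp = pd.DataFrame(
    {"resting_blood_pressure": [120, 130, 140, 150], "target": [1, 1, 0, 0]}
)
bp_summary, bp_corr = compare_by_target(toy_bp, "resting_blood_pressure")
print(bp_summary)
print(f"correlation with target: {bp_corr:.2f}")
```

A correlation close to zero would support the reading that blood pressure has little effect here, while a clearly negative value would match the "less prone" observation.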

### Q4: Does the cholesterol level eventually contribute as a risk factor towards heart attack?
Ans: Patients with a cholesterol level in the 200-300 mg/dl range account for the most heart attack cases.
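
A histogram shows where most cases fall, which is not the same as where the risk is highest, because the 200-300 range also contains most patients. The sketch below separates the two by reporting counts and rates per bin. The helper `count_and_rate_by_bin`, the bin edges, and the toy values are assumptions made for illustration.

```python
import pandas as pd


def count_and_rate_by_bin(frame, col, edges):
    """Bin a numeric column and report patients, positives and positive rate per bin."""
    bins = pd.cut(frame[col], bins=edges)
    grouped = frame.groupby(bins, observed=True)["target"]
    return pd.DataFrame(
        {
            "patients": grouped.size(),
            "positives": grouped.sum(),
            "positive_rate": grouped.mean(),
        }
    )


# Tiny made-up frame, only to show the call
toy_chol = pd.DataFrame(
    {"cholesterol": [150, 220, 250, 280, 350], "target": [1, 1, 0, 1, 1]}
)
chol_table = count_and_rate_by_bin(toy_chol, "cholesterol", [100, 200, 300, 400])
print(chol_table)
```

In the toy frame the middle bin has the most positives but the lowest positive rate, which is exactly the distinction worth checking on the real data.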

In [None]:
plt.figure(figsize=(10,20))
for i, col in enumerate(cat_cols):
    plt.subplot(4,2, i+1)
    sns.swarmplot(data = df, x = col, y = 'age', hue = 'target', palette = 'Set1')
    plt.xticks(rotation = 90)
    plt.title(f"{col}", fontsize = 14)

### Multivariate analysis

In [None]:
sns.pairplot(df[['age','resting_blood_pressure','cholesterol','max_heart_rate_achieved','st_depression','target']],hue = 'target',palette = 'Set1', diag_kind='kde')

### Correlation

In [None]:
plt.figure(figsize = (15,12))
sns.heatmap(df.corr(), annot = True, fmt = '.2f', cmap = 'viridis', cbar = True)

In [None]:
df.corr()['target'].sort_values(ascending = False)[1:].plot(kind = 'bar', lw = .4, color = 'blue')

## Data splitting and scaling

In [None]:
df[num_cols].head()

In [None]:
X = df.drop('target', axis = 1)
y = df['target']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=21)
X_train.shape, X_test.shape

In [None]:
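# standardize only numerical columns
# fit the scaler on the training split and reuse it on the test split to avoid leakage
scaler = StandardScaler()
X_train, X_test = X_train.copy(), X_test.copy()  # work on copies, not on slices of df
X_train[num_cols] = scaler.fit_transform(X_train[num_cols])
X_test[num_cols] = scaler.transform(X_test[num_cols])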

## Model Training
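
The scaler above is fitted on the whole training split before cross-validation, so every validation fold has already influenced the scaling. One possible alternative, sketched here and not used in the rest of this notebook, is to put the scaler inside a pipeline so that it is refitted on each training fold. The helper `make_scaled_model` and the random placeholder data are assumptions; on the real data the pipeline would be given the unscaled features and `num_cols`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_scaled_model(numeric_cols, estimator):
    """Scale only `numeric_cols`, pass the other columns through, then fit `estimator`."""
    scale_numeric = ColumnTransformer(
        [("num", StandardScaler(), list(numeric_cols))],
        remainder="passthrough",
    )
    return make_pipeline(scale_numeric, estimator)


# Random placeholder data, only to show that the pipeline runs end to end
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    {
        "age": rng.integers(30, 80, size=60),
        "cholesterol": rng.integers(150, 400, size=60),
        "sex": rng.integers(0, 2, size=60),
    }
)
toy_y = pd.Series(rng.integers(0, 2, size=60))

pipe = make_scaled_model(["age", "cholesterol"], LogisticRegression(max_iter=1000))
scores = cross_val_score(pipe, toy_X, toy_y, cv=KFold(5, shuffle=True, random_state=21))
print(scores.mean())
```

With a dataset this small the difference is usually minor, but the pipeline keeps the cross-validated scores free of information from the validation folds.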

In [None]:
key = ['LogisticRegression','KNeighborsClassifier','SVC','DecisionTreeClassifier','RandomForestClassifier','GradientBoostingClassifier','AdaBoostClassifier','XGBClassifier']
value = [LogisticRegression(random_state=9), KNeighborsClassifier(), SVC(), DecisionTreeClassifier(), RandomForestClassifier(), GradientBoostingClassifier(), AdaBoostClassifier(), xgb.XGBClassifier()]
models = dict(zip(key,value))

In [None]:
cv=KFold(5, shuffle=True, random_state=21)

In [None]:
def model_check(X, y, classifiers, cv):
    
    ''' A function for testing multiple classifiers and return several metrics. '''
    
    model_table = pd.DataFrame()

    row_index = 0
    for cls in classifiers:

        MLA_name = cls.__class__.__name__
        model_table.loc[row_index, 'Model Name'] = MLA_name
        
        cv_results = cross_validate(
            cls,
            X,
            y,
            cv=cv,
            scoring=('accuracy','f1','roc_auc'),
            return_train_score=True,
            n_jobs=-1
        )
        model_table.loc[row_index, 'Train Roc/AUC Mean'] = cv_results['train_roc_auc'].mean()
        model_table.loc[row_index, 'Test Roc/AUC Mean'] = cv_results['test_roc_auc'].mean()
        model_table.loc[row_index, 'Test Roc/AUC Std'] = cv_results['test_roc_auc'].std()
        model_table.loc[row_index, 'Train Accuracy Mean'] = cv_results['train_accuracy'].mean()
        model_table.loc[row_index, 'Test Accuracy Mean'] = cv_results['test_accuracy'].mean()
        model_table.loc[row_index, 'Test Acc Std'] = cv_results['test_accuracy'].std()
        model_table.loc[row_index, 'Train F1 Mean'] = cv_results['train_f1'].mean()
        model_table.loc[row_index, 'Test F1 Mean'] = cv_results['test_f1'].mean()
        model_table.loc[row_index, 'Test F1 Std'] = cv_results['test_f1'].std()
        model_table.loc[row_index, 'Time'] = cv_results['fit_time'].mean()

        row_index += 1        

    model_table.sort_values(by=['Test F1 Mean'],
                            ascending=False,
                            inplace=True)

    return model_table

In [None]:
raw_models = model_check(X_train, y_train, models.values(), cv)

In [None]:
raw_models

In [None]:
from matplotlib.ticker import MaxNLocator  # needed for the x-axis tick locator below

def f_imp(classifiers, X, y, bins):
    
    ''' A function for displaying feature importances of tree-based classifiers. '''
    
    fig, axes = plt.subplots(1, 2, figsize=(20, 8))
    axes = axes.flatten()

    for ax, classifier in zip(axes, classifiers):

        classifier.fit(X, y)
        # skip estimators that do not expose feature_importances_
        if not hasattr(classifier, 'feature_importances_'):
            continue
        feature_imp = pd.DataFrame({'Value': classifier.feature_importances_,
                                    'Feature': X.columns})

        sns.barplot(x="Value",
                    y="Feature",
                    data=feature_imp.sort_values(by="Value",
                                                 ascending=False),
                    ax=ax,
                    palette='plasma')
        ax.set(title=f'{classifier.__class__.__name__} Feature Importances')
        ax.xaxis.set_major_locator(MaxNLocator(nbins=bins))
    plt.tight_layout()
    plt.show()

In [None]:
f_imp([RandomForestClassifier(), DecisionTreeClassifier()], X_train, y_train, 6)

In [None]:
raw_models.columns

In [None]:
plt.figure(figsize = (8,5))
sns.barplot(data=raw_models, x = 'Train Accuracy Mean', y = 'Model Name', palette = 'Set1')

In [None]:
plt.figure(figsize = (8,5))
sns.barplot(data=raw_models, x = 'Test Accuracy Mean', y = 'Model Name', palette = 'Set1')

In [None]:
raw_models.set_index('Model Name', inplace = True)

In [None]:
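# DataFrame.plot creates its own figure, so the size is passed directly;
# the colormap is given by name because cm.get_cmap is no longer available in recent matplotlib
raw_models[['Train Accuracy Mean','Test Accuracy Mean' ]].plot(kind = 'barh', colormap = 'Spectral', legend = False, figsize = (18,8))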

**Observations** 
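* Logistic Regression, SVC and KNN have low variance (their train and test scores are close), while the other models have low bias but high variance (they fit the training folds much better than the test folds).
* So I am considering Logistic Regression as the best model for this problem.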
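
The variance argument above can be made explicit by computing the gap between train and test scores for each model. This is a minimal sketch: `add_generalization_gap` is a hypothetical helper, and the two rows below are placeholder numbers, not results from this notebook. On the real table the call would be `add_generalization_gap(raw_models)`.

```python
import pandas as pd


def add_generalization_gap(table, metric="Accuracy"):
    """Add a train-minus-test gap column for `metric` and sort by it (smallest gap first)."""
    out = table.copy()
    out[f"{metric} Gap"] = out[f"Train {metric} Mean"] - out[f"Test {metric} Mean"]
    return out.sort_values(f"{metric} Gap")


# Placeholder scores for two made-up models, only to show the call
toy_scores = pd.DataFrame(
    {
        "Train Accuracy Mean": [1.0, 0.875],
        "Test Accuracy Mean": [0.75, 0.75],
    },
    index=["model_a", "model_b"],
)
gap_table = add_generalization_gap(toy_scores)
print(gap_table)
```

A small gap suggests low variance, while a large gap together with a near-perfect training score points to overfitting.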

In [None]:
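from sklearn.metrics import ConfusionMatrixDisplay

lr = LogisticRegression()
lr.fit(X_train, y_train)
pred = lr.predict(X_test)

print(f'Accuracy score: {round(accuracy_score(y_test, pred) * 100, 2)} %')
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement
ConfusionMatrixDisplay.from_estimator(lr, X_test, y_test, cmap=plt.cm.Blues)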