The first step in building a machine learning model for a given task and dataset is to understand the data and preprocess it properly, so that the model can reach the best possible performance.
The dataset is defined by the following columns (features):

* age: Age of the patient

* sex: Sex of the patient

* cp: Chest pain type, 0 = Typical Angina, 1 = Atypical Angina, 2 = Non-anginal Pain, 3 = Asymptomatic

* trtbps: Resting blood pressure (in mm Hg)

* chol: Cholesterol in mg/dl fetched via BMI sensor

* fbs: (fasting blood sugar > 120 mg/dl), 1 = True, 0 = False

* restecg: Resting electrocardiographic results, 0 = Normal, 1 = ST-T wave abnormality, 2 = Left ventricular hypertrophy

* thalachh: Maximum heart rate achieved

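* oldpeak: ST depression induced by exercise relative to rest

* slp: Slope of the peak exercise ST segment

* caa: Number of major vessels colored by fluoroscopy

* thall: Thallium stress test result, values from 0 to 3

* exng: Exercise-induced angina, 1 = Yes, 0 = No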

* output: **Target variable**
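
The integer codes listed above can be turned into readable labels, which makes tables and plots easier to interpret. The following is a minimal sketch on a small hand-made frame, not on the real dataset; the names `CP_LABELS`, `YES_NO`, `toy`, and `decoded` are hypothetical and exist only for illustration.

```python
import pandas as pd

# Code-to-label mappings copied from the feature list above
CP_LABELS = {
    0: "Typical Angina",
    1: "Atypical Angina",
    2: "Non-anginal Pain",
    3: "Asymptomatic",
}
YES_NO = {1: "Yes", 0: "No"}

# Hand-made rows standing in for a few patients (not real data)
toy = pd.DataFrame({"cp": [0, 2, 3], "exng": [0, 1, 0]})

# Series.map replaces each code with its label and leaves `toy` unchanged
decoded = toy.assign(
    cp=toy["cp"].map(CP_LABELS),
    exng=toy["exng"].map(YES_NO),
)
print(decoded)
```

Decoding is best kept to exploration; the models later in the notebook still expect numeric inputs.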

# Introduction to the dataset, with some exploration

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import lightgbm as lgb
import xgboost as xgb
from sklearn.metrics import accuracy_score, classification_report, roc_curve

In [None]:
# Read the dataset
data = pd.read_csv("../input/heart-attack-analysis-prediction-dataset/heart.csv")
data.head()

In [None]:
# It's clear that there are some categorical features and continuous ones
categorical_features = ['sex', 'cp', 'fbs', 'restecg', 'exng', 'slp', 'caa', 'thall']
continuous_features = ['age', 'trtbps', 'chol', 'thalachh', 'oldpeak']

In [None]:
# Info about the dataset attributes
data.info()

In [None]:
# Check nan values 
data.isnull().sum()

In [None]:
# Describe continuous features
data[continuous_features].describe()

In [None]:
# Check the correlation between features of the dataset
plt.figure(figsize=(18, 9))
heatmap = sns.heatmap(data.corr(), vmin=-1, vmax=1, annot=True)
heatmap.set_title('Features correlation Heatmap', fontdict={'fontsize':12}, pad=12);

In [None]:
# Another visualization of pairwise relationships in our dataset
sns.pairplot(data,hue='output')
plt.show()

From this basic exploration we can conclude that:
* The dataset doesn't contain NaN values
* The continuous features (age, chol, ...) aren't on the same scale
* The target variable is most strongly correlated with 'cp', 'thalachh' and 'slp'
* There is no clear linear relationship between categorical features
* There are some outliers when visualizing the pairwise relationship with the categorical features
* ...
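
The remark about outliers can be checked numerically rather than only by eye. One common rule of thumb flags values lying more than 1.5 interquartile ranges outside the quartiles. The sketch below applies it to made-up cholesterol-like values; the helper `iqr_outlier_mask` and the frame `toy` are hypothetical, and the 1.5 factor is a convention, not something taken from this notebook.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask that is True where a value lies outside
    [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values with one obviously extreme entry
toy = pd.DataFrame({"chol": [200, 210, 220, 230, 240, 250, 564]})

mask = iqr_outlier_mask(toy["chol"])
print(toy[mask])  # rows flagged as outliers
```

On the real data the same helper could be applied to each column in `continuous_features`. The presence of such values is one reason a median-based scaler such as `RobustScaler` is a reasonable choice.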

# Feature engineering

In [None]:
# Work on a copy so the original `data` stays unbinned for the baseline model
data_f = data.copy()
# Binning continuous features
# age
data_f['age'] = pd.cut(data_f['age'], bins=5, labels=range(5))
# trtbps
data_f['trtbps'] = pd.cut(data_f['trtbps'], bins=5, labels=range(5))
# chol
data_f['chol'] = pd.cut(data_f['chol'], bins=5, labels=range(5))
# thalachh
data_f['thalachh'] = pd.cut(data_f['thalachh'], bins=5, labels=range(5))

# Encoding categorical features
data_f = pd.get_dummies(data_f, columns = categorical_features, drop_first = True)
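
To make the two transformations above concrete, here is a small sketch on made-up values. It assumes `labels=False` instead of `labels=range(5)`, so that `pd.cut` returns plain integer bin codes rather than a categorical column; the frame `toy` and the column `age_bin` are hypothetical.

```python
import pandas as pd

# Made-up ages and chest pain codes (not real data)
toy = pd.DataFrame({"age": [29, 40, 51, 62, 77], "cp": [0, 1, 2, 1, 0]})

# Equal-width binning: the range 29..77 is cut into 5 intervals of width 9.6
toy["age_bin"] = pd.cut(toy["age"], bins=5, labels=False)

# One-hot encoding; drop_first=True drops the first level (cp == 0),
# which then acts as the reference category
encoded = pd.get_dummies(toy, columns=["cp"], drop_first=True)

print(toy["age_bin"].tolist())  # [0, 1, 2, 3, 4]
print(list(encoded.columns))    # ['age', 'age_bin', 'cp_1', 'cp_2']
```

Note that `pd.cut` produces equal-width bins, so bins can end up with very different row counts; `pd.qcut` is the quantile-based alternative when roughly equal-sized bins are wanted.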

# Modeling

## Without feature engineering 

In [None]:
# Define the features and target
X = data.drop(['output'], axis=1)
y = data['output']  # a 1-D Series avoids scikit-learn's column-vector warning

In [None]:
# Splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
# Logistic regression classifier (simplest one)
# Without feature engineering
clf = LogisticRegression()
# train the classifier
clf.fit(X_train, y_train)
# calculating the class probabilities
y_pred_proba = clf.predict_proba(X_test)
# finding the predicted values (index of the most probable class)
y_pred = np.argmax(y_pred_proba, axis=1)
# printing the test accuracy
print("The test accuracy score of Logistic Regression Classifier is ", accuracy_score(y_test, y_pred))

In [None]:
y_true = y_test
y_pred = clf.predict(X_test)
print(classification_report(y_true, y_pred))
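
The two cells above obtain predictions in two ways: `np.argmax` over `predict_proba`, and `predict`. For a binary classifier whose classes are 0 and 1 these should give the same labels. The sketch below checks this on synthetic data from `make_classification`; all `toy`-prefixed names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem standing in for the heart dataset
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=0)
toy_clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)

# argmax returns a column index, so map it back through classes_
proba = toy_clf.predict_proba(X_toy)
labels_from_argmax = toy_clf.classes_[np.argmax(proba, axis=1)]
labels_from_predict = toy_clf.predict(X_toy)

same = np.array_equal(labels_from_argmax, labels_from_predict)
print("argmax and predict agree:", same)
```

Going through `classes_` matters when the labels are not `0..n-1`; with labels 0 and 1, as in this notebook, the column index and the label coincide.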

## With some basic feature engineering

In [None]:
# Define the features and target
X = data_f.drop(['output'], axis=1)
y = data_f['output']  # a 1-D Series avoids scikit-learn's column-vector warning

# Scaling continuous features
scaler = RobustScaler()
X[continuous_features] = scaler.fit_transform(X[continuous_features])

In [None]:
# Splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
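
In the cells above the scaler is fit on all rows before the split, so statistics from the test rows influence the training features. A possible alternative, sketched here on synthetic data, is to split first and put the scaler inside a `Pipeline` so that it is fit on the training rows only. The column names mimic the dataset, but the values, the target rule, and all variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data (not the heart dataset)
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "chol": rng.normal(245, 50, n),
    "oldpeak": rng.exponential(1.0, n),
    "sex": rng.integers(0, 2, n),
})
toy_target = (toy["oldpeak"] + rng.normal(0, 0.5, n) < 1.0).astype(int)

# Split first, then let the pipeline fit the scaler on the training part
X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_target, test_size=0.33, random_state=42, stratify=toy_target
)

scaled_cols = ["chol", "oldpeak"]
preprocess = ColumnTransformer(
    [("scale", RobustScaler(), scaled_cols)],
    remainder="passthrough",  # leave the other columns untouched
)
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_tr, y_tr)

test_accuracy = model.score(X_te, y_te)
fitted_scaler = model.named_steps["preprocess"].named_transformers_["scale"]
print("Test accuracy:", test_accuracy)
print("Medians learned from the training rows:", fitted_scaler.center_)
```

A pipeline like this can also be passed directly to `GridSearchCV`, in which case the scaler is refit inside every cross-validation fold.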

In [None]:
# Logistic regression classifier (simplest one)
# With feature engineering
clf = LogisticRegression()
# train the classifier
clf.fit(X_train, y_train)
# calculating the class probabilities
y_pred_proba = clf.predict_proba(X_test)
# finding the predicted values (index of the most probable class)
y_pred = np.argmax(y_pred_proba, axis=1)
# printing the test accuracy
print("The test accuracy score of Logistic Regression Classifier is ", accuracy_score(y_test, y_pred))

In [None]:
y_true = y_test
y_pred = clf.predict(X_test)
print(classification_report(y_true, y_pred))

**With some basic feature engineering, the test accuracy of the logistic regression model rises from 0.8 to 0.87**
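
A comparison based on a single train/test split can be noisy, so it may be worth confirming a gain like this with cross-validation. The sketch below compares a plain logistic regression against a scaled one over the same stratified folds. It uses synthetic data, so its numbers say nothing about the heart dataset, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic binary problem standing in for the real features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=42
)

# The same folds are reused for both candidates so the comparison is paired
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

baseline = LogisticRegression(max_iter=1000)
scaled = make_pipeline(RobustScaler(), LogisticRegression(max_iter=1000))

baseline_scores = cross_val_score(baseline, X_toy, y_toy, cv=cv, scoring="accuracy")
scaled_scores = cross_val_score(scaled, X_toy, y_toy, cv=cv, scoring="accuracy")

print("baseline: %.3f +/- %.3f" % (baseline_scores.mean(), baseline_scores.std()))
print("scaled:   %.3f +/- %.3f" % (scaled_scores.mean(), scaled_scores.std()))
```

If the difference between the two means is small compared with the spread across folds, the improvement should be read with caution.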

# Let's use other models

# RandomForest model

In [None]:
clf_model = RandomForestClassifier()

param_grid = {
    'n_estimators': [400, 700, 1000],
    'max_depth': [15,20,25],
    'max_leaf_nodes': [50, 100, 200]
}

gs = GridSearchCV(
        estimator=clf_model,
        param_grid=param_grid, 
        cv=10, 
        n_jobs=-1, 
        scoring='roc_auc',
        verbose=2
    )

fitted_clf_model = gs.fit(X_train, y_train)

print(fitted_clf_model.best_score_)
print(fitted_clf_model.best_params_)

In [None]:
# Show the classification report
y_true = y_test
y_pred = fitted_clf_model.predict(X_test)
print(classification_report(y_true, y_pred))

# LightGBM

In [None]:
# Define the features and target
X = data_f.drop(['output'], axis=1)
y = data_f['output']  # a 1-D Series avoids scikit-learn's column-vector warning

In [None]:
# Splitting the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
# n_estimators is the scikit-learn API name for the number of boosting rounds
lgb_model = lgb.LGBMClassifier(boosting_type='gbdt', objective='binary', metric='auc', learning_rate=0.001, n_estimators=1000)
lgb_model.fit(X_train, y_train)
# calculating the class probabilities
y_pred_proba = lgb_model.predict_proba(X_test)
# finding the predicted values (index of the most probable class)
y_pred = np.argmax(y_pred_proba, axis=1)
# printing the test accuracy
print("The test accuracy score of LightGBM Classifier is ", accuracy_score(y_test, y_pred))

In [None]:
# Show the classification report
y_true = y_test
y_pred = lgb_model.predict(X_test)
print(classification_report(y_true, y_pred))

In [None]:
# Plot the ROC curve
y_pred_prob = lgb_model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Diagonal reference line: performance of a random classifier
plt.plot([0, 1], [0, 1], "k--", label='Chance level')
plt.plot(fpr, tpr, label='LightGBM classifier')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("LightGBM classifier ROC Curve")
plt.legend()
plt.show()
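
The ROC curve is often summarized by the area under it (AUC). As a small illustration, the sketch below computes the AUC in two equivalent ways on four hand-made scores: by integrating the `roc_curve` output with `auc`, and directly with `roc_auc_score`. The arrays `toy_true` and `toy_scores` are made up.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Four hand-made examples: true labels and predicted positive-class scores
toy_true = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])

toy_fpr, toy_tpr, toy_thresholds = roc_curve(toy_true, toy_scores)

auc_from_curve = auc(toy_fpr, toy_tpr)               # trapezoidal area under the curve
auc_direct = roc_auc_score(toy_true, toy_scores)     # same quantity in one call

print(auc_from_curve, auc_direct)  # 0.75 0.75
```

The value 0.75 reflects that 3 of the 4 (positive, negative) pairs are ranked correctly. In the notebook, the same call would take `y_test` and `y_pred_prob`.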

In [None]:
# Let's see which features are important to the classification
# Pass figsize directly: changing rcParams after plotting would not resize this figure
lgb.plot_importance(lgb_model, figsize=(20, 9))
plt.show()

Visualizing feature importance shows which features the model relies on most for its classification, and this can guide further preprocessing to help the model reach a higher level of performance.
We can also keep fine-tuning the model's hyperparameters to gain a few percentage points of accuracy.

In [None]:
gkf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

lgb_model = lgb.LGBMClassifier(
    boosting_type="gbdt",
    objective='binary',
    metric='auc'
)
param_grid = {
    'n_estimators': [200, 300, 400],
    'colsample_bytree': [0.5, 0.6, 0.7],
    'max_depth': [5, 10, 15],
    'num_leaves': [20, 30, 40, 50, 60],
    'reg_alpha': [1, 1.1, 1.2, 1.3],
    'reg_lambda': [1, 1.1, 1.2, 1.3],
    'min_split_gain': [0.3, 0.4],
    'subsample': [0.8, 0.9, ],
    'subsample_freq': [15, 20, 25],
    'learning_rate': [0.01, 0.001, 0.0001]
}

gs = GridSearchCV(
        estimator=lgb_model,
        param_grid=param_grid, 
        cv=gkf, 
        n_jobs=-1, 
        scoring='roc_auc',
        verbose=2
    )

fitted_lgb_model = gs.fit(X_train, y_train)

print(fitted_lgb_model.best_score_)
print(fitted_lgb_model.best_params_)
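
An exhaustive search over the grid above is expensive, because the number of candidates is the product of the list lengths. The sketch below first counts the candidates with `ParameterGrid`, then shows a randomized search as one possible cheaper alternative. To stay self-contained it runs on synthetic data and uses scikit-learn's `GradientBoostingClassifier` as a stand-in for `LGBMClassifier`; the small `toy_distributions` search space is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import ParameterGrid, RandomizedSearchCV, StratifiedKFold

# Same grid as in the cell above, copied here only to count combinations
lgb_param_grid = {
    'n_estimators': [200, 300, 400],
    'colsample_bytree': [0.5, 0.6, 0.7],
    'max_depth': [5, 10, 15],
    'num_leaves': [20, 30, 40, 50, 60],
    'reg_alpha': [1, 1.1, 1.2, 1.3],
    'reg_lambda': [1, 1.1, 1.2, 1.3],
    'min_split_gain': [0.3, 0.4],
    'subsample': [0.8, 0.9],
    'subsample_freq': [15, 20, 25],
    'learning_rate': [0.01, 0.001, 0.0001],
}
n_combinations = len(ParameterGrid(lgb_param_grid))
n_fits = n_combinations * 5  # 5-fold cross-validation
print(n_combinations, n_fits)  # 77760 388800

# Randomized search: evaluate only n_iter sampled candidates.
# Stand-in model and a deliberately small, hypothetical search space.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=42)
toy_distributions = {
    'n_estimators': [20, 40, 60],
    'max_depth': [2, 3, 4],
    'learning_rate': [0.01, 0.1],
    'subsample': [0.8, 0.9],
}
search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_distributions=toy_distributions,
    n_iter=5,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring='roc_auc',
    random_state=42,
)
search.fit(X_toy, y_toy)
print(search.best_score_)
print(search.best_params_)
```

With `n_iter=5` and 3 folds this runs 15 cross-validation fits plus one final refit, which shows how the budget is controlled independently of the size of the search space.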

In [None]:
# Show the classification report
y_true = y_test
y_pred = fitted_lgb_model.predict(X_test)
print(classification_report(y_true, y_pred))