# Objective

The objective for this dataset is **to build a predictive model** that best classifies whether a patient has heart disease, based on the associated health variables.

Without further ado, let's explore how we could accomplish the objective.

# Import Dependencies/Libraries & Datasets

First, we import our dependencies and libraries. I like to put all my libraries in one cell at the start so that I can spot any that I'm missing. Of course, I can always go back and run the cell again after adding a missing library.

Then, we load our dataset into a Pandas DataFrame and save it as a variable.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import StandardScaler, PolynomialFeatures, MinMaxScaler
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.metrics import classification_report
from sklearn.neural_network import MLPClassifier
import warnings
warnings.filterwarnings("ignore")  # Suppress warnings to keep the cell output clean

In [1]:
## Loading csv data to Pandas DataFrame
df = pd.read_csv('../input/heart-disease-uci/heart.csv')

# Exploratory Data Analysis

Now that we have imported our dataset, let's take a look at a few characteristics of the dataframe, as well as the insights we can find on the features and target variable.

In [1]:
df.head(5)

In [1]:
df.info()

We have **13 features** and **one target variable (0 or 1)**, with a total of **303 rows**.

There are also **no null values** in any of the columns, and all of the datapoints are integer or float datatypes. It's safe to say that we have a clean and pretty straightforward dataset.

## Finding early insights 

In [1]:
df['age'].describe()

In [1]:
df['sex'].value_counts()

In [1]:
df['target'].value_counts()

## 1 --> Defective heart
## 0 --> Healthy heart

1. Half of the patients are at least 55 years old
2. The sexes are quite imbalanced, with about twice as many male patients as female patients
3. The two classes of our target variable occur in roughly the same proportion, so we don't need to rebalance them (for example, by resampling) before building our models
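The class-balance check in point 3 can be made explicit with normalized counts. Below is a minimal sketch on a small made-up target column (`toy_target` and its values are illustrative assumptions, not the real data); on the real dataframe, the same calls would be applied to `df['target']`.

```python
import pandas as pd

# Toy stand-in for df['target']; the values are illustrative, not the real data.
toy_target = pd.Series([1, 1, 1, 0, 0, 1, 0, 1, 0, 1], name="target")

# Share of each class (the shares sum to 1.0).
class_share = toy_target.value_counts(normalize=True)

# Ratio of the rarer class to the more common one; 1.0 means perfectly balanced.
balance_ratio = class_share.min() / class_share.max()

print(class_share)
print(f"minority/majority ratio: {balance_ratio:.2f}")
```

A ratio close to 1 suggests that no rebalancing is needed; how low is "too low" is a judgment call that depends on the problem.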

## Exploring The Variables

In [1]:
# Assign the target column to the variable tar
tar = df["target"]

# Collect the feature columns (every column in df except the target)
feature = [col for col in df.columns if col != "target"]

In [1]:
plt.figure(figsize=(25,25))
for i in range(0,len(feature)):
    plt.subplot(4, 4, i+1)
    sns.boxplot(y = df[feature[i]], x = df['target'], data=df)

Given that the boxes for patients with no heart disease overlap with those for patients with heart disease, we could say that **no single feature cleanly separates** the two classes on its own.
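Overlapping boxplots are a visual cue, not a statistical test. One possible way to put a number on the per-feature difference between the two classes is a Mann-Whitney U test. The sketch below runs on synthetic data; the helper `class_difference_pvalues` and the columns `shifted` and `noise` are hypothetical names introduced here, and the choice of test is an assumption rather than part of the original analysis.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

# Synthetic data: 'shifted' differs between the classes, 'noise' does not.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "target": np.repeat([0, 1], 100),
    "shifted": np.concatenate([rng.normal(0.0, 1.0, 100),
                               rng.normal(1.0, 1.0, 100)]),
    "noise": rng.normal(0.0, 1.0, 200),
})


def class_difference_pvalues(frame, target_col="target"):
    """Two-sided Mann-Whitney U p-value per feature, class 1 vs class 0."""
    is_positive = frame[target_col] == 1
    pvalues = {}
    for col in frame.columns.drop(target_col):
        _, p = mannwhitneyu(frame.loc[is_positive, col],
                            frame.loc[~is_positive, col],
                            alternative="two-sided")
        pvalues[col] = p
    # Smallest p-values (strongest evidence of a difference) first.
    return pd.Series(pvalues).sort_values()


pvals = class_difference_pvalues(toy)
print(pvals)
```

On the real data the call would be `class_difference_pvalues(df)`. With 13 features tested at once, the p-values are best read as a rough ranking rather than as individual verdicts.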

In [1]:
# A heatmap is a neat and easy way to visualize the correlations between the variables

corr_matrix = df.corr()
plt.figure(figsize=(20,20))
sns.heatmap(corr_matrix, cmap="PiYG", annot=True, square=False, fmt=".2g")
plt.title("Data: Correlations between Variables")

Note that since none of our features are highly correlated with each other, **we don't need to perform PCA** as a pre-processing step in our model building to deal with highly correlated variables.
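Reading a 14x14 heatmap by eye is error-prone, so it can help to list the strongly correlated pairs programmatically. The sketch below is one way to do that on synthetic data; the helper name `high_corr_pairs` and the 0.8 threshold are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def high_corr_pairs(frame, threshold=0.8):
    """Return (col_a, col_b, |corr|) for column pairs whose absolute
    Pearson correlation exceeds the threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    # and the diagonal (always 1.0) is excluded.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # NaN entries, if any remain, fail this comparison and are dropped.
    pairs = pairs[pairs > threshold]
    return [(a, b, float(v)) for (a, b), v in pairs.items()]


# Synthetic example: 'b' is almost a multiple of 'a', 'c' is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

found = high_corr_pairs(toy, threshold=0.8)
print(found)
```

On the real data, `high_corr_pairs(df.drop(columns='target'))` returning an empty list would back up the statement above.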

# Splitting the Features & Target Variables, Scaling X

Now, we split our features and target into X and y for our model building process. To make it more interesting, I also apply StandardScaler() to X_train and X_test, which removes each variable's mean and scales it to unit variance.

Because why not, right? Let's see if we get a better model with the scaled sets.

In [1]:
X = df.drop(columns='target')
y = df['target']

In [1]:
# Stratify: keep the class proportions of the target the same in the train & test sets, so that neither set ends up dominated by one class

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, stratify = y, random_state = 42)

In [1]:
# Let's check the row number of our split sets compared to the original one

(X.shape, X_train.shape, X_test.shape)
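Besides the row counts, it can be worth confirming that `stratify` really preserved the class proportions. A minimal sketch on toy data follows; the `toy_*` names and the 60/40 class mix are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 60% positives.
toy_X = pd.DataFrame({"x": np.arange(100)})
toy_y = pd.Series([1] * 60 + [0] * 40, name="target")

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# Class shares side by side; with stratification the columns should match closely.
proportions = pd.DataFrame({
    "full": toy_y.value_counts(normalize=True),
    "train": toy_y_train.value_counts(normalize=True),
    "test": toy_y_test.value_counts(normalize=True),
})
print(proportions)
```

For the notebook's own split, the same table can be built from `y`, `y_train` and `y_test`.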

## Scale the X_train & X_test

In [1]:
## Fit the StandardScaler on X_train only, then use it to transform both X_train & X_test

std = StandardScaler()

X_train_sc = std.fit(X_train).transform(X_train)
X_test_sc = std.transform(X_test)
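To make "remove the mean and scale to unit variance" concrete, here is a small worked example on made-up numbers (the arrays are assumptions for illustration). It also shows why the scaler is fitted on the training set only: the test rows are transformed with the training mean and standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0, 10.0],
                      [2.0, 20.0],
                      [3.0, 30.0],
                      [4.0, 40.0]])
toy_test = np.array([[2.5, 25.0]])

scaler = StandardScaler().fit(toy_train)      # learns the per-column mean and std
train_scaled = scaler.transform(toy_train)
test_scaled = scaler.transform(toy_test)      # reuses the *training* statistics

# Manual equivalent: z = (x - mean) / std, using the population std (ddof=0).
manual = (toy_train - toy_train.mean(axis=0)) / toy_train.std(axis=0)

print(np.allclose(train_scaled, manual))
print(test_scaled)
```

The test row sits exactly at the training mean of both columns, so it is mapped to zeros.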

# Model Building: Without Scaler

We'll use 4 models: kNN, Logistic Regression, Random Forest Classifier, and MultiLayer Perceptron Classifier.

We'll also use the area under the Receiver Operating Characteristic curve (ROC AUC) as our metric, as it measures **how well predictions are ranked** rather than their absolute values.
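The ranking interpretation can be checked by hand: ROC AUC equals the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below uses made-up labels and scores to compare that pairwise count with `roc_auc_score`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted scores, for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])

auc = roc_auc_score(y_true, scores)

pos = scores[y_true == 1]
neg = scores[y_true == 0]

# Compare every positive score with every negative score.
diff = pos[:, None] - neg[None, :]
pairwise = ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

print(auc, pairwise)
```

Here 8 of the 9 pairs are ranked correctly, so both values are 8/9. Note that the metric needs continuous scores (for example from `predict_proba`), not the hard 0/1 labels returned by `predict`.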

To simplify the search for the best parameters for each model, we'll use the **GridSearchCV** class to loop through predefined hyperparameters and fit the estimator on the training set.

For each model, we'll also print out the **Classification Report and Confusion Matrix** to get a bird's-eye view of the model's Precision and Recall scores.
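Since the same search-then-report steps are repeated for every model, they could be wrapped in one helper. The sketch below is one possible shape for it, run on synthetic data; `tune_and_report` is a hypothetical name, and the small `C` grid is only there to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split


def tune_and_report(estimator, param_grid, X_tr, y_tr, X_te, y_te, cv=5):
    """Grid-search on the training set, then report on the held-out test set."""
    grid = GridSearchCV(estimator, param_grid, scoring="roc_auc", cv=cv)
    # refit=True is the default, so best_estimator_ is already trained
    # on the whole training set once fit() returns.
    grid.fit(X_tr, y_tr)
    y_pred = grid.best_estimator_.predict(X_te)
    print("Best parameters:", grid.best_params_)
    print("Best CV ROC AUC:", grid.best_score_)
    print(classification_report(y_te, y_pred))
    return grid.best_estimator_, confusion_matrix(y_te, y_pred)


# Synthetic stand-in for the heart-disease data.
toy_X, toy_y = make_classification(n_samples=200, n_features=8, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

toy_model, toy_cm = tune_and_report(
    LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]},
    toy_X_train, toy_y_train, toy_X_test, toy_y_test,
)
print(toy_cm)
```

The "Best Score" printed by a grid search is the mean cross-validated ROC AUC on the training folds, while the classification report and confusion matrix describe hard predictions on the test set, so the two are not directly comparable.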

### k-nearest neighbours Classifier (kNN)

In [1]:
param_grid_knn = {'n_neighbors': np.arange(1, 20),
              'p': [1,2],
              'weights': ['uniform','distance']}

grid_knn = GridSearchCV(KNeighborsClassifier(), 
                    param_grid_knn, scoring='roc_auc',
                    cv=5)

grid_knn.fit(X_train, y_train)

print("KNN Best Parameters: ", grid_knn.best_params_)

model_knn = grid_knn.best_estimator_
print("KNN Best Score: ", grid_knn.best_score_)

model_knn.fit(X_train, y_train)
y_pred_knn = model_knn.predict(X_test)

print('classification_report:\n',classification_report(y_test, y_pred_knn))
confusion_matrix(y_test, y_pred_knn)

Whoa! Note that this kNN model only scores about 75%.

### Logistic Regression

In [1]:
# Note: scikit-learn 1.2+ expects penalty=None in place of the string 'none'
param_grid_lr = {'C': [0.001,0.01,0.1,1,10,100],
                 'penalty': ['none','l2'],
                 'fit_intercept': [True,False],
                 'solver': ['newton-cg','lbfgs','sag','saga']}

grid_lr = GridSearchCV(LogisticRegression(), 
                    param_grid_lr, scoring='roc_auc',
                    cv=5)

grid_lr.fit(X_train, y_train)

print("LR Best Parameters: ", grid_lr.best_params_)

model_lr = grid_lr.best_estimator_
print("LR Best Score: ", grid_lr.best_score_)

model_lr.fit(X_train, y_train)
y_pred_lr = model_lr.predict(X_test)

print('classification_report:\n',classification_report(y_test, y_pred_lr))
confusion_matrix(y_test, y_pred_lr)

### Random Forest

In [1]:
param_grid_rf = {'max_depth': np.arange(1, 20),
                'n_estimators':[1,10,20,50,100]}

grid_rf = GridSearchCV(RandomForestClassifier(), 
                    param_grid_rf, scoring='roc_auc',
                    cv=5)

grid_rf.fit(X_train, y_train)

print("RF Best Parameters: ", grid_rf.best_params_)

model_rf = grid_rf.best_estimator_
print("RF Best Score: ", grid_rf.best_score_)

model_rf.fit(X_train,y_train)
y_pred_rf = model_rf.predict(X_test)

print('classification_report:\n',classification_report(y_test, y_pred_rf))
confusion_matrix(y_test, y_pred_rf)

### MultiLayer Perceptron

In [1]:
param_grid_mlp = {'hidden_layer_sizes': [(20,),(20,20),(20,20,20)],  # (20,) is a one-element tuple; (20) is just the int 20
    'activation': ['logistic', 'relu'],
    'solver': ['adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive']}

grid_mlp = GridSearchCV(MLPClassifier(),
                        param_grid_mlp,
                        scoring='roc_auc',
                        cv=5)

grid_mlp.fit(X_train, y_train)

print("MLP Best Parameters: ", grid_mlp.best_params_)

model_mlp = grid_mlp.best_estimator_
print("MLP Best Score: ", grid_mlp.best_score_)

model_mlp.fit(X_train,y_train)
y_pred_mlp = model_mlp.predict(X_test)

print('classification_report:\n',classification_report(y_test, y_pred_mlp))
confusion_matrix(y_test, y_pred_mlp)

# Model Training: With StandardScaler()

### k-nearest neighbour (kNN)

In [1]:
param_grid_knn_sc = {'n_neighbors': np.arange(1, 20),
              'p': [1,2],
              'weights': ['uniform','distance']}

grid_knn_sc = GridSearchCV(KNeighborsClassifier(), 
                    param_grid_knn_sc, scoring='roc_auc',
                    cv=5)

grid_knn_sc.fit(X_train_sc, y_train)

print("KNN Best Parameters: ", grid_knn_sc.best_params_)

model_knn_sc = grid_knn_sc.best_estimator_
print("KNN Best Score: ", grid_knn_sc.best_score_)

model_knn_sc.fit(X_train_sc,y_train)
y_pred_knn_sc = model_knn_sc.predict(X_test_sc)

print('classification_report:\n',classification_report(y_test, y_pred_knn_sc))
confusion_matrix(y_test, y_pred_knn_sc)

Hey! Looks like our kNN model's score **improves** when we use the scaled set.

### Logistic Regression

In [1]:
param_grid_lr_sc = {'C': [0.001,0.01,0.1,1,10,100],
             'penalty': ['none','l2'],
             'fit_intercept':[True,False],
                'solver':['newton-cg','lbfgs','sag','saga']}

grid_lr_sc = GridSearchCV(LogisticRegression(), 
                    param_grid_lr_sc, scoring='roc_auc',
                    cv=5)

grid_lr_sc.fit(X_train_sc, y_train)

print("LR Best Parameters: ", grid_lr_sc.best_params_)

model_lr_sc = grid_lr_sc.best_estimator_
print("LR Best Score: ", grid_lr_sc.best_score_)

model_lr_sc.fit(X_train_sc,y_train)
y_pred_lr_sc = model_lr_sc.predict(X_test_sc)

print('classification_report:\n',classification_report(y_test, y_pred_lr_sc))
confusion_matrix(y_test, y_pred_lr_sc)

### Random Forest

In [1]:
param_grid_rf_sc = {'max_depth': np.arange(1, 20),
                'n_estimators':[1,10,20,50,100]}

grid_rf_sc = GridSearchCV(RandomForestClassifier(), 
                    param_grid_rf_sc, scoring='roc_auc',
                    cv=5)

grid_rf_sc.fit(X_train_sc, y_train)

print("RF Best Parameters: ", grid_rf_sc.best_params_)

model_rf_sc = grid_rf_sc.best_estimator_
print("RF Best Score: ", grid_rf_sc.best_score_)

model_rf_sc.fit(X_train_sc,y_train)
y_pred_rf_sc = model_rf_sc.predict(X_test_sc)

print('classification_report:\n',classification_report(y_test, y_pred_rf_sc))
confusion_matrix(y_test, y_pred_rf_sc)

### MultiLayer Perceptron

In [1]:
param_grid_mlp_sc = {'hidden_layer_sizes': [(20,),(20,20),(20,20,20)],
    'activation': ['logistic', 'relu'],
    'solver': ['adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive']}

grid_mlp_sc = GridSearchCV(MLPClassifier(),
                        param_grid_mlp_sc,
                        scoring='roc_auc',
                        cv=5)

grid_mlp_sc.fit(X_train_sc, y_train)

print("MLP Best Parameters: ", grid_mlp_sc.best_params_)

model_mlp_sc = grid_mlp_sc.best_estimator_
print("MLP Best Score: ", grid_mlp_sc.best_score_)

model_mlp_sc.fit(X_train_sc,y_train)
y_pred_mlp_sc = model_mlp_sc.predict(X_test_sc)

print('classification_report:\n',classification_report(y_test, y_pred_mlp_sc))
confusion_matrix(y_test, y_pred_mlp_sc)

Note that when we use the scaled set, the models' scores either improve or stay about the same. No harm done, right? :)
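One thing to keep in mind: X_train_sc was scaled using the whole training set before cross-validation, so each validation fold has already contributed to the scaler's mean and variance. A possible alternative, sketched below on synthetic data, is to put the scaler inside a pipeline so it is re-fitted on the training folds only. The `toy_*` names and the small grid are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart-disease data.
toy_X, toy_y = make_classification(n_samples=200, n_features=8, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# The scaler is now a step of the model, so GridSearchCV fits it
# separately on the training part of every fold.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name,
# so hyperparameters are addressed as '<step>__<parameter>'.
param_grid_pipe = {"kneighborsclassifier__n_neighbors": [3, 5, 7]}

grid_pipe = GridSearchCV(pipe, param_grid_pipe, scoring="roc_auc", cv=5)
grid_pipe.fit(toy_X_train, toy_y_train)

print(grid_pipe.best_params_, grid_pipe.best_score_)

# The fitted pipeline scales the raw test rows itself before predicting.
toy_pred = grid_pipe.predict(toy_X_test)
```

With this setup there is no separate scaled copy of the data to keep track of, which also removes the risk of passing unscaled test rows to a model trained on scaled ones.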

# Conclusion

**Best Model:** Random Forest without StandardScaler

**Reason:**
* We have an **excellent roc_auc score**. Given a randomly chosen patient with heart disease and one without, our model ranks the patient with heart disease higher more than 90% of the time.

* We have **the highest Precision & Recall scores** amongst all our models, which means that if we chose those scores as our metrics, Random Forest would still generate a better score overall.

* The **misclassified portions** (False Negatives and False Positives) in the Confusion Matrix of Random Forest without scaling are **lower than the others**. When it comes to disease classification, we want to avoid misclassification as much as we can.

* The model gives the **highest True Positive and True Negative portions** amongst all the models. Back to our objective, this model would be the best at identifying both the patients with heart disease and those who truly do not have it.
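A side-by-side table makes this kind of comparison easier to audit than scrolling back through the individual reports. The sketch below shows one way to build it on synthetic data; `compare_models` is a hypothetical helper, and the two toy models stand in for the tuned estimators above.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split


def compare_models(models, X_te, y_te):
    """Score already-fitted classifiers on the same test set."""
    rows = []
    for name, model in models.items():
        y_pred = model.predict(X_te)
        # ROC AUC needs a continuous score: the probability of class 1.
        y_score = model.predict_proba(X_te)[:, 1]
        rows.append({
            "model": name,
            "test_roc_auc": roc_auc_score(y_te, y_score),
            "precision": precision_score(y_te, y_pred),
            "recall": recall_score(y_te, y_pred),
        })
    table = pd.DataFrame(rows).set_index("model")
    return table.sort_values("test_roc_auc", ascending=False)


# Synthetic stand-in for the heart-disease data.
toy_X, toy_y = make_classification(n_samples=200, n_features=8, random_state=0)
toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

toy_models = {
    "logistic_regression": LogisticRegression(max_iter=1000).fit(toy_X_train, toy_y_train),
    "random_forest": RandomForestClassifier(random_state=0).fit(toy_X_train, toy_y_train),
}

summary = compare_models(toy_models, toy_X_test, toy_y_test)
print(summary)
```

In the notebook, the dictionary would hold the fitted models such as `model_knn`, `model_lr`, `model_rf` and `model_mlp`, each paired with the matching (scaled or unscaled) test set.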