# Predicting Heart Disease using Machine Learning



## Problem Definition
 Given clinical parameters about a patient, can we predict whether or not they have heart disease?



### Heart Disease Data Dictionary

The following are the features we'll use to predict our target variable (heart disease or no heart disease).

1. age - age in years 
2. sex - (1 = male; 0 = female) 
3. cp - chest pain type 
    * 0: Typical angina: chest pain related to decreased blood supply to the heart
    * 1: Atypical angina: chest pain not related to the heart
    * 2: Non-anginal pain: typically esophageal spasms (not heart related)
    * 3: Asymptomatic: chest pain not showing signs of disease
4. trestbps - resting blood pressure (in mm Hg on admission to the hospital)
    * anything above 130-140 is typically cause for concern
5. chol - serum cholesterol in mg/dl 
    * serum = LDL + HDL + 0.2 * triglycerides
    * above 200 is cause for concern
6. fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false) 
    * a value above 126 mg/dL signals diabetes
7. restecg - resting electrocardiographic results
    * 0: Nothing to note
    * 1: ST-T Wave abnormality
        - can range from mild symptoms to severe problems
        - signals non-normal heart beat
    * 2: Possible or definite left ventricular hypertrophy
        - enlargement of the heart's main pumping chamber
8. thalach - maximum heart rate achieved 
9. exang - exercise induced angina (1 = yes; 0 = no) 
10. oldpeak - ST depression induced by exercise relative to rest 
    * looks at stress of the heart during exercise
    * an unhealthy heart will stress more
11. slope - the slope of the peak exercise ST segment
    * 0: Upsloping: better heart rate with exercise (uncommon)
    * 1: Flatsloping: minimal change (typical healthy heart)
    * 2: Downsloping: signs of an unhealthy heart
12. ca - number of major vessels (0-3) colored by fluoroscopy 
    * a colored vessel means the doctor can see the blood passing through
    * the more blood movement the better (no clots)
13. thal - thallium stress result
    * 1,3: normal
    * 6: fixed defect: used to be a defect but OK now
    * 7: reversible defect: no proper blood movement when exercising 
14. target - have disease or not (1=yes, 0=no) (= the predicted attribute)

**Note:** No personally identifiable information (PII) can be found in the dataset.

It's a good idea to save these to a Python dictionary or in an external file, so we can look at them later without coming back here.
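As a minimal sketch of that idea, the cell below stores a short description for each column in a plain dictionary. The names `data_dictionary` and `describe_feature` are illustrative choices rather than part of the original notebook, and the descriptions are abbreviated from the list above.

```python
# Hypothetical name: a plain dict mapping column names to short descriptions
data_dictionary = {
    "age": "age in years",
    "sex": "1 = male; 0 = female",
    "cp": "chest pain type (0-3)",
    "trestbps": "resting blood pressure (mm Hg on admission)",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl (1 = true; 0 = false)",
    "restecg": "resting electrocardiographic results (0-2)",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina (1 = yes; 0 = no)",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment (0-2)",
    "ca": "number of major vessels (0-3) colored by fluoroscopy",
    "thal": "thallium stress result",
    "target": "has disease or not (1 = yes; 0 = no)",
}


def describe_feature(name):
    """Return the stored description, or a fallback for unknown columns."""
    return data_dictionary.get(name, "unknown feature")


print(describe_feature("thalach"))
```

Looking up a column this way avoids scrolling back to the list whenever a plot or table uses one of the abbreviated names.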

## Preparing the tools

At the start of any project, it's customary to import the required libraries in one big chunk, as in the cell below.

However, in practice, your projects may import libraries as you go. After you've spent a couple of hours working on your problem, you'll probably want to do some tidying up. This is where you may want to consolidate every library you've used at the top of your notebook (like the cell below).

The libraries you use will differ from project to project, but there are a few you'll likely take advantage of during almost every structured data project. 

* [pandas](https://pandas.pydata.org/) for data analysis.
* [NumPy](https://numpy.org/) for numerical operations.
* [Matplotlib](https://matplotlib.org/)/[seaborn](https://seaborn.pydata.org/) for plotting or data visualization.
* [Scikit-Learn](https://scikit-learn.org/stable/) for machine learning modelling and evaluation.

In [None]:
# imports

In [None]:
# Regular EDA and plotting libraries
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt
import seaborn as sns 

# for plots to appear in the notebook
%matplotlib inline 

# Pipeline
from sklearn.pipeline import make_pipeline

# preprocessing
from sklearn.preprocessing import StandardScaler

## Models
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

## Model evaluators
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
from sklearn.metrics import RocCurveDisplay # replaces plot_roc_curve, which was removed in scikit-learn 1.2

## Loading Data



In [None]:
df = pd.read_csv("/kaggle/input/heart-disease-uci/heart.csv") 
df.shape # (rows, columns)

## Data Exploration (Exploratory Data Analysis or EDA)


In [None]:
# Let's check the top 5 rows of our dataframe (the default for head())
df.head()

In [None]:
# Let's see how many positive (1) and negative (0) samples we have in our dataframe
df.target.value_counts()

In [None]:
# Normalized value counts
df.target.value_counts(normalize=True)

In [None]:
# Plotting the value counts with a bar graph
df.target.value_counts().plot(kind="bar", color=["salmon", "lightblue"])
plt.xticks(rotation=0); # for keeping the labels on the x-axis horizontal

In [None]:
df.info()

In [None]:
df.describe()

### Heart Disease Frequency according to Gender



In [None]:
df.sex.value_counts()

There are 207 males and 96 females in our data.

In [None]:
# Comparing target column with sex column
pd.crosstab(df.target, df.sex)

In [None]:
pd.crosstab(df.target, df.sex).plot(kind="bar", 
                                    figsize=(10,6), 
                                    color=["salmon", "lightblue"])
plt.xticks(rotation=0); # for keeping the labels on the x-axis horizontal

In [None]:
pd.crosstab(df.target, df.sex).plot(kind="bar", figsize=(10,6), color=["salmon", "lightblue"])

# Adding some attributes to it
plt.title("Heart Disease Frequency for Sex")
plt.xlabel("0 = No Disease, 1 = Disease")
plt.ylabel("Amount")
plt.legend(["Female", "Male"])
plt.xticks(rotation=0);

### Age vs Max Heart rate for Heart Disease


In [None]:
plt.figure(figsize=(10,6))

# Starting with positive examples
plt.scatter(df.age[df.target==1], 
            df.thalach[df.target==1], 
            c="salmon")

# Now for negative examples, we want them on the same plot, so we call plt again
plt.scatter(df.age[df.target==0], 
            df.thalach[df.target==0], 
            c="lightblue")

# Adding some helpful info
plt.title("Heart Disease as a Function of Age and Max Heart Rate")
plt.xlabel("Age")
plt.legend(["Disease", "No Disease"])
plt.ylabel("Max Heart Rate");



Let's check the age **distribution**.

In [None]:
# Histograms are a great way to check the distribution of a variable
df.age.plot.hist(edgecolor='black', bins=15)
plt.title('Age Distribution')
plt.xlabel('Age');

The age distribution looks roughly **normal** (bell-shaped).

### Heart Disease Frequency per Chest Pain Type



In [None]:
pd.crosstab(df.cp, df.target)

In [None]:
pd.crosstab(df.cp, df.target).plot(kind="bar", 
                                   figsize=(10,6), 
                                   color=["lightblue", "salmon"])

# Adding attributes to the plot
plt.title("Heart Disease Frequency Per Chest Pain Type")
plt.xlabel("Chest Pain Type")
plt.ylabel("Frequency")
plt.legend(["No Disease", "Disease"])
plt.xticks(rotation = 0);

### Correlation between independent variables



In [None]:
corr_matrix = df.corr()
corr_matrix 

In [None]:
# Let's make it look a little prettier
corr_matrix = df.corr()
plt.figure(figsize=(15, 10))
sns.heatmap(corr_matrix, 
            annot=True, 
            linewidths=0.5, 
            fmt= ".2f", 
            cmap="YlGnBu")
plt.title('Correlation');

##  Model Creation


In [None]:
# Independent variables
X = df.drop("target", axis=1)

# Target variable / dependent variable
y = df.target.values

In [None]:
X.head()

In [None]:
# Targets
y

### Training and test split


In [None]:
# Split into train & test set
X_train, X_test, y_train, y_test = train_test_split(X, # independent variables 
                                                    y, # dependent variable
                                                    test_size=0.2, # fraction of data to use for test set
                                                    random_state=3)

In [None]:
X_train.head()

In [None]:
y_train, len(y_train)

Beautiful, we can see we're using 242 samples to train on. Let's look at our test data.

In [None]:
X_test.head()

In [None]:
y_test, len(y_test)

And we've got 61 examples we'll test our model(s) on. Let's build some.
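The 242/61 split follows from `test_size=0.2` on 303 rows, because scikit-learn rounds the test set size up. The sketch below reproduces that arithmetic on synthetic stand-in data (not the heart disease dataset) and also shows the optional `stratify` argument, which the notebook does not use; the class balance here is an arbitrary assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (NOT the heart disease dataset): 303 rows, 2 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(303, 2))
y_demo = np.array([1] * 160 + [0] * 143)  # arbitrary class balance for illustration

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.2,      # test set size is rounded up: ceil(0.2 * 303) = 61
    stratify=y_demo,    # keep the class ratio similar in both splits
    random_state=3,
)

print(len(y_tr), len(y_te))
print(round(y_demo.mean(), 3), round(y_te.mean(), 3))
```

With a small dataset, stratifying can help keep the share of positive cases in the test set close to the share in the full data.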

### Model choices

I will be using the following models and comparing their results.

1. Logistic Regression  
2. Gaussian Naive Bayes
3. RandomForest Classifier

In [None]:
# Logistic Regression

In [None]:
model_scores = { }

pipe = make_pipeline(StandardScaler(), LogisticRegression())

pipe.fit(X_train, y_train)  # fit the scaler and the model on the training data only

model_scores['LogisticRegression'] = pipe.score(X_test, y_test)  # scale the test data with the training statistics, so no test information leaks into training
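To illustrate why the pipeline avoids leakage, here is a small sketch on synthetic stand-in data (the `*_demo` names and the data itself are assumptions, not part of the notebook). It checks that the scaler inside the fitted pipeline learned its mean from the training rows only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, used only to illustrate the mechanics
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_pipe = make_pipeline(StandardScaler(), LogisticRegression())
demo_pipe.fit(X_tr, y_tr)

# make_pipeline names each step after its lowercased class name
scaler = demo_pipe.named_steps["standardscaler"]

# The scaler's statistics come from the training rows only
uses_train_stats = np.allclose(scaler.mean_, X_tr.mean(axis=0))
print(uses_train_stats)

# score() transforms the test rows with those same training statistics
test_accuracy = demo_pipe.score(X_te, y_te)
print(test_accuracy)
```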

In [None]:
# Gaussian Naive Bayes

In [None]:
pipe = make_pipeline(StandardScaler(), GaussianNB())

pipe.fit(X_train, y_train)  # fit the scaler and the model on the training data only

model_scores['GaussianNB'] = pipe.score(X_test, y_test) # scale the test data with the training statistics, so no test information leaks into training

In [None]:
# Random Forest Classifier

In [None]:
pipe = make_pipeline(StandardScaler(), RandomForestClassifier())

pipe.fit(X_train, y_train)  # fit the scaler and the model on the training data only

model_scores['RandomForestClassifier'] = pipe.score(X_test, y_test)  # scale the test data with the training statistics, so no test information leaks into training

In [None]:
model_scores

Beautiful! Now that our models are fitted, let's compare them visually.

## Model Comparison



In [None]:
model_compare = pd.DataFrame(model_scores, index=['accuracy'])
model_compare.T.plot.bar(figsize=(12,6))
plt.xticks(rotation=0)
plt.legend(loc='center left', bbox_to_anchor=(1, 0.5))
plt.tight_layout();

## Hyperparameter tuning and cross-validation





### Tuning models with `RandomizedSearchCV`

In [None]:
# Different LogisticRegression hyperparameters

log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

# Different RandomForestClassifier hyperparameters

rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

# Different Gaussian Naive Bayes hyperparameters
nb_grid = {
    'var_smoothing': np.logspace(0,-9, num=100)
}
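To get a feel for how large these search spaces are, the sketch below counts the combinations in two of the grids using scikit-learn's `ParameterGrid`. The grids are copied from the cell above; the variable names `n_log_reg` and `n_rf` are illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grids as defined above
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

n_log_reg = len(ParameterGrid(log_reg_grid))  # 20 values of C x 1 solver
n_rf = len(ParameterGrid(rf_grid))            # 20 x 4 x 9 x 10 combinations

print(n_log_reg, n_rf)
```

The random forest grid has 7,200 combinations, so a randomized search with `n_iter=20` tries only a small sample of them, while the logistic regression grid has just 20 candidates in total.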


#### Tuning LogisticRegression

In [None]:
#  random hyperparameter search for LogisticRegression
rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                                param_distributions=log_reg_grid,
                                cv=5,
                                n_iter=20,
                                verbose=True)

rs_log_reg.fit(X_train, y_train);

In [None]:
rs_log_reg.best_params_

In [None]:
rs_log_reg.score(X_test, y_test)

After tuning, the LogisticRegression model's accuracy increased from 86% to 88%.

#### Tuning RandomForestClassifier

In [None]:
# random hyperparameter search for RandomForestClassifier
rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                           param_distributions=rf_grid,
                           cv=5,
                           n_iter=20,
                           verbose=True)

rs_rf.fit(X_train, y_train);

In [None]:
# the best parameters for RandomForestClassifier
rs_rf.best_params_

In [None]:
# Evaluating the randomized search random forest model
rs_rf.score(X_test, y_test)

#### Tuning Gaussian Naive Bayes

In [None]:
# random hyperparameter search for GaussianNB
rs_gnb = RandomizedSearchCV(GaussianNB(),
                            param_distributions=nb_grid,
                            cv=5,
                            n_iter=20,
                            verbose=True)

rs_gnb.fit(X_train, y_train);

In [None]:
# the best parameters for Gaussian Naive Bayes
rs_gnb.best_params_

In [None]:
# Evaluating the randomized search GaussianNB model
rs_gnb.score(X_test, y_test)

##### The model's accuracy increased from 85% to 87%



### Tuning a model with `GridSearchCV`

In [None]:
# Logistic Regression

In [None]:
# Different LogisticRegression hyperparameters
log_reg_grid = {"C": np.logspace(-4, 4, 20),
                "solver": ["liblinear"]}

# grid hyperparameter search for LogisticRegression
gs_log_reg = GridSearchCV(LogisticRegression(),
                          param_grid=log_reg_grid,
                          cv=5,
                          verbose=True)

gs_log_reg.fit(X_train, y_train);

In [None]:
# best parameters 
gs_log_reg.best_params_

In [None]:
# Evaluate the model
gs_log_reg.score(X_test, y_test)

#### Logistic Regression gave us the highest model accuracy of 89% after tuning.

## Evaluating our Classification model, beyond accuracy



In [None]:
y_preds = gs_log_reg.predict(X_test)


### ROC Curve and AUC Scores


In [None]:
# Import the ROC curve display class from the metrics module
# (plot_roc_curve was removed in scikit-learn 1.2)
from sklearn.metrics import RocCurveDisplay

# Plot ROC curve and calculate AUC metric
RocCurveDisplay.from_estimator(gs_log_reg, X_test, y_test);
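If the AUC is needed as a number rather than a plot annotation, it can be computed directly. The following is a minimal sketch on synthetic stand-in data with a plain `LogisticRegression` (not the tuned `gs_log_reg` model); all `*_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, used only to illustrate the calls
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_clf = LogisticRegression().fit(X_tr, y_tr)

# Predicted probability of the positive class (column 1)
pos_scores = demo_clf.predict_proba(X_te)[:, 1]

# False positive rate and true positive rate at each threshold
fpr, tpr, thresholds = roc_curve(y_te, pos_scores)

# Two equivalent ways to get the area under the ROC curve
area_from_curve = auc(fpr, tpr)
area_direct = roc_auc_score(y_te, pos_scores)

print(area_from_curve, area_direct)
```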



### Confusion matrix 



In [None]:
# Display confusion matrix
print(confusion_matrix(y_test, y_preds))

In [None]:
sns.set(font_scale=1.5) # Increasing font size

def plot_conf_mat(y_test, y_preds):
    """
    Plots a confusion matrix using Seaborn's heatmap().
    """
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot=True, # Annotating the boxes
                     cbar=False)
    plt.xlabel("predicted label") # columns of the confusion matrix are predictions
    plt.ylabel("true label") # rows of the confusion matrix are true labels
    
plot_conf_mat(y_test, y_preds)
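For a binary problem, the four cells of the confusion matrix can be unpacked into named counts. The sketch below uses a tiny hand-made example (the label lists are invented for illustration) and derives precision and recall from those counts.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hand-made labels for illustration only
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_demo = [0, 1, 0, 1, 1, 0, 1, 0]

# Rows are true labels, columns are predicted labels,
# so ravel() yields: true neg, false pos, false neg, true pos
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(tn, fp, fn, tp)  # 3 1 1 3

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
print(precision, recall)    # 0.75 0.75
```

In a medical setting such as this one, the false negatives (patients with disease predicted as healthy) are usually the count to watch most closely.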

### Classification report



In [None]:
print(classification_report(y_test, y_preds))

In [None]:
# Best hyperparameters
gs_log_reg.best_params_

In [None]:
# Instantiating model with best hyperparameters (found with GridSearchCV)
clf = LogisticRegression(C=0.23357214690901212,
                         solver="liblinear")

In [None]:
# Cross-validated accuracy score
cv_acc = cross_val_score(clf,
                         X,
                         y,
                         cv=5, # 5-fold cross-validation
                         scoring="accuracy") # accuracy as scoring
cv_acc

Since `cv=5` gives five accuracy scores (one per fold), we'll take the average.

In [None]:
cv_acc = np.mean(cv_acc)
cv_acc

#### Precision

In [None]:
# Cross-validated precision score
cv_precision = np.mean(cross_val_score(clf,
                                       X,
                                       y,
                                       cv=5, # 5-fold cross-validation
                                       scoring="precision")) # precision as scoring
cv_precision

#### Recall

In [None]:
# Cross-validated recall score
cv_recall = np.mean(cross_val_score(clf,
                                    X,
                                    y,
                                    cv=5, # 5-fold cross-validation
                                    scoring="recall")) # recall as scoring
cv_recall

#### F1 Score

In [None]:
# Cross-validated F1 score
cv_f1 = np.mean(cross_val_score(clf,
                                X,
                                y,
                                cv=5, # 5-fold cross-validation
                                scoring="f1")) # f1 as scoring
cv_f1
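The four cross-validated metrics above can also be collected in a single call with `cross_validate`, which accepts a list of scorers. This is a sketch on synthetic stand-in data; `demo_clf`, `cv_results`, and `mean_scores` are illustrative names, and the scores it prints say nothing about the heart disease data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data, used only to illustrate the call
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
demo_clf = LogisticRegression(solver="liblinear")

metric_names = ["accuracy", "precision", "recall", "f1"]

# One 5-fold run that evaluates every metric on the same folds
cv_results = cross_validate(demo_clf, X_demo, y_demo, cv=5, scoring=metric_names)

# Results are stored under "test_<metric>", one score per fold
mean_scores = {name: cv_results[f"test_{name}"].mean() for name in metric_names}
print(mean_scores)
```

Besides saving repeated fits, this guarantees that all metrics are computed on identical folds.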

In [None]:
# Visualizing cross-validated metrics
cv_metrics = pd.DataFrame({"Accuracy": cv_acc,
                            "Precision": cv_precision,
                            "Recall": cv_recall,
                            "F1": cv_f1},
                          index=[0])
cv_metrics.T.plot.bar(title="Cross-Validated Metrics", legend=False)
plt.xticks(rotation=0);

# Thank You :)