# Heart disease classification using machine learning

Given a patient's medical parameters, predict whether the patient has heart disease.


### Data

The dataset used for this project is available on Kaggle: https://www.kaggle.com/ronitf/heart-disease-uci
<br>
The features are described below; the same information can be found at the link above.

<br>

1. age
2. sex
3. chest pain type (4 values)
4. resting blood pressure
5. serum cholesterol in mg/dl
6. fasting blood sugar > 120 mg/dl
7. resting electrocardiographic results (values 0, 1, 2)
8. maximum heart rate achieved
9. exercise induced angina
10. oldpeak = ST depression induced by exercise relative to rest
11. the slope of the peak exercise ST segment
12. number of major vessels (0-3) colored by fluoroscopy
13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
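
To keep these descriptions at hand while exploring the data, they can be stored in a small lookup table. This is only a sketch: the column names are assumed from the Kaggle `heart.csv` file (the notebook itself only uses `age`, `sex`, `cp`, `thalach` and `target`), and `describe` is a hypothetical helper.

```python
# Column names assumed from heart.csv; descriptions copied from the list above.
FEATURE_DESCRIPTIONS = {
    "age": "age",
    "sex": "sex",
    "cp": "chest pain type (4 values)",
    "trestbps": "resting blood pressure",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl",
    "restecg": "resting electrocardiographic results (values 0, 1, 2)",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "the slope of the peak exercise ST segment",
    "ca": "number of major vessels (0-3) colored by fluoroscopy",
    "thal": "thal: 3 = normal; 6 = fixed defect; 7 = reversible defect",
}


def describe(column):
    """Return the description of a column, or a placeholder if it is unknown."""
    return FEATURE_DESCRIPTIONS.get(column, "no description available")


print(describe("thalach"))
```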

### We will use scikit-learn pipelines for prediction, and RandomizedSearchCV for cross-validated hyperparameter search.
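
A scikit-learn `Pipeline` can be passed to `RandomizedSearchCV` like any other estimator. Below is a minimal sketch of that combination, assuming synthetic data from `make_classification` in place of `heart.csv`; the step names `scale` and `clf` and the scaling step itself are illustrative choices, not something taken from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the heart disease table (13 features, binary target)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The scaler lives inside the pipeline, so it is refit on the training part of every CV fold
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(solver="liblinear")),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_distributions = {"clf__C": np.logspace(-4, 4, 20)}

search = RandomizedSearchCV(pipe,
                            param_distributions=param_distributions,
                            n_iter=10,
                            cv=5,
                            random_state=42)
search.fit(X_tr, y_tr)

demo_score = search.score(X_te, y_te)
print(search.best_params_, demo_score)
```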



## Preparing the tools


In [None]:
# Import all the tools we need

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Models from Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model evaluators
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import precision_score, recall_score, f1_score
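from sklearn.metrics import RocCurveDisplay

# plot_roc_curve was removed in scikit-learn 1.2; alias the old name to its replacement
plot_roc_curve = RocCurveDisplay.from_estimator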

## Loading data 

In [None]:
df = pd.read_csv("../input/heart-disease-uci/heart.csv")
df.shape

In [None]:
df.head() # first 5 rows

In [None]:
# finding the number of positive and negative tests on the dataset
df["target"].value_counts()

In the above cell, "1" refers to patients with heart disease and "0" to patients without it. Let's plot the counts as a bar chart.

In [None]:
df["target"].value_counts().plot(kind="bar", color=["orange", "purple"]);

In [None]:
df.info()

In [None]:
df.isna().sum()

Our dataset has no missing values.

## Exploring how individual features relate to heart disease

### Heart disease versus gender

In [None]:
df.sex.value_counts()

In [None]:
# Comparing target column with sex column
pd.crosstab(df.target, df.sex)

In the above crosstab, the sex value "0" indicates female patients and "1" indicates male patients.
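
Raw counts are hard to compare when the two groups differ in size. As a small sketch, `pd.crosstab` can return proportions instead; the table here is made up for illustration and only reuses the notebook's column names.

```python
import pandas as pd

# Tiny made-up table (not real patient data)
demo = pd.DataFrame({
    "sex":    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    "target": [1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
})

# normalize="columns" turns each sex column into proportions that sum to 1
rates = pd.crosstab(demo.target, demo.sex, normalize="columns")
print(rates)
```

The row for `target == 1` then reads as the share of each sex with heart disease.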

In [None]:
pd.crosstab(df.target, df.sex).plot(kind="bar",
                                    figsize=(10, 6),
                                    color=["orange", "purple"])

plt.title("Heart Disease Frequency for Sex")
plt.xlabel("0 = No Disease, 1 = Disease")
plt.ylabel("Number of patients")
plt.legend(["Female", "Male"]);
plt.xticks(rotation=0);

### Age versus max heart rate

In [None]:
# create a matplotlib figure
plt.figure(figsize=(10, 8))

# Scatter plot with positively tested patients
plt.scatter(df.age[df.target==1],
            df.thalach[df.target==1],
            c="orange")

# Scatter with negatively tested patients
plt.scatter(df.age[df.target==0],
            df.thalach[df.target==0],
            c="purple")

plt.title("Age versus max heart rate")
plt.xlabel("Age")
plt.ylabel("Max Heart Rate")
plt.legend(["Disease", "No Disease"]);

### Heart Disease Frequency versus Chest Pain Type


In [None]:
pd.crosstab(df.cp, df.target)

In [None]:

pd.crosstab(df.cp, df.target).plot(kind="bar",
                                   figsize=(10, 6),
                                   color=["orange", "purple"])

# Add a title, axis labels and a legend
plt.title("Heart Disease Frequency versus Chest Pain Type")
plt.xlabel("Chest Pain Type")
plt.ylabel("Number of patients")
plt.legend(["No Disease", "Disease"])
plt.xticks(rotation=0);

In [None]:
# correlation matrix
df.corr()

## Modelling

In [None]:
# Splitting data into X and y
x = df.drop("target", axis=1)

y = df["target"]

In [None]:
x

In [None]:
y

In [None]:
# Split data into train and test sets
np.random.seed(42)

# Split into train & test set
x_train, x_test, y_train, y_test = train_test_split(x,
                                                    y,
                                                    test_size=0.2)

In [None]:
len(x_train)

In [None]:
len(y_train)
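
The split above relies on NumPy's global seed. A possible alternative, sketched here on made-up data, is to pass `random_state` to `train_test_split` directly and add `stratify` so that both parts keep the same class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 60 positives and 40 negatives
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 60 + [0] * 40)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.2,
    random_state=42,   # seed local to this call
    stratify=y_demo,   # keep the 60/40 class ratio in both parts
)

print(len(X_tr), len(X_te), y_te.sum())
```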

We will try two classification models for this problem: logistic regression and a random forest classifier.

In [None]:
# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(),
          "Random Forest": RandomForestClassifier()}

# Creating a function to fit and score models
def fit_and_score(models, x_train, x_test, y_train, y_test):
    
    np.random.seed(42)
    
    # Make a dictionary to keep model scores
    model_scores = {}
    
    # Loop through models
    for name, model in models.items():
        
        # Fit the model to the data
        model.fit(x_train, y_train)
        
        # append the evaluated score to model_scores
        model_scores[name] = model.score(x_test, y_test)
    return model_scores

In [None]:
model_scores = fit_and_score(models=models,
                             x_train=x_train,
                             x_test=x_test,
                             y_train=y_train,
                             y_test=y_test)

model_scores

Here we can see that, on this test split, logistic regression has outperformed the random forest classifier.

### Model Comparison

In [None]:
model_compare = pd.DataFrame(model_scores, index=["accuracy"])
model_compare.T.plot.bar();

## Hyperparameter tuning with RandomizedSearchCV

In [None]:
# Hyperparameter grid for LogisticRegression
grid_1 = {"C": np.logspace(-4, 4, 20),
          "solver": ["liblinear"]}

# Hyperparameter grid for RandomForestClassifier
grid_2 = {"n_estimators": np.arange(10, 100, 10),
          "max_depth": [None, 3, 5, 7, 10],
          "min_samples_split": np.arange(2, 20, 2),
          "min_samples_leaf": np.arange(1, 20, 2),
          "max_features": [0.5, 1, "sqrt"],
          "max_samples": [100]}

First, we tune the logistic regression model.

In [None]:
# Tune LogisticRegression

np.random.seed(42)

# Setup random hyperparameter search for LogisticRegression
model_1 = RandomizedSearchCV(LogisticRegression(),
                                param_distributions=grid_1,
                                cv=5,
                                n_iter=20,  # grid_1 defines only 20 candidate settings
                                verbose=True)

# Fitting random hyperparameter search model for LogisticRegression
model_1.fit(x_train, y_train)

In [None]:
model_1.best_params_  # get the best parameters for the model

In [None]:
model_1.score(x_test, y_test)  # evaluate the model on the test set

Next, we tune the random forest classifier.

In [None]:
# Setup random seed
np.random.seed(42)

# Setup random hyperparameter search for RandomForestClassifier
model_2 = RandomizedSearchCV(RandomForestClassifier(), 
                           param_distributions=grid_2,
                           cv=5,
                           n_iter=100,
                           verbose=True)

# Fitting random hyperparameter search model for RandomForestClassifier()
model_2.fit(x_train, y_train)

In [None]:
model_2.best_params_

In [None]:
# Evaluate the model on the test set
final_score=model_2.score(x_test, y_test)
final_score

Tuning increased the score of the random forest classifier, while the score of the logistic regression model stayed the same. Further improvement may be possible with an exhaustive search using GridSearchCV.
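
As a rough sketch of what such a GridSearchCV follow-up could look like, the example below searches a narrow range of `C` for logistic regression. The synthetic data from `make_classification` and the chosen range are assumptions for illustration, not values from this notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the heart disease table
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

# GridSearchCV tries every combination in the grid (here 10 values of C)
grid = {"C": np.logspace(-2, 1, 10),
        "solver": ["liblinear"]}

gs = GridSearchCV(LogisticRegression(), param_grid=grid, cv=5)
gs.fit(X_demo, y_demo)

print(gs.best_params_, gs.best_score_)
```

Unlike RandomizedSearchCV, the cost grows with the product of all grid sizes, so it fits best once the random search has narrowed the range.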

In [None]:
# predicting the outcome for the test set
y_preds = model_2.predict(x_test)
y_preds

In [None]:
y_test
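
Comparing the two arrays by eye is tedious. A confusion matrix summarises the same comparison; the sketch below uses made-up labels rather than the notebook's `y_test` and `y_preds`.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up labels, for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Per-class precision, recall and F1 in one text table
print(classification_report(y_true_demo, y_pred_demo))
```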

In [None]:
# Plot the ROC curve and calculate the AUC metric
plot_roc_curve(model_2, x_test, y_test)

The area under this ROC curve (AUC) is 0.93.
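
The AUC can also be computed as a number instead of being read off the plot. A minimal sketch, assuming synthetic data and a plain logistic regression in place of the tuned `model_2`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the heart disease table
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

clf = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)

# AUC is computed from scores (probability of class 1), not from hard 0/1 predictions
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```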

## Now we can calculate evaluation metrics such as accuracy, precision, recall and F1 score.

In [None]:
# Re-create both models with the best parameters found by RandomizedSearchCV
reg_model = LogisticRegression(solver="liblinear",
                               C=0.23357214690901212)

rf_model = RandomForestClassifier(n_estimators=60,
                                  min_samples_split=12,
                                  min_samples_leaf=1,
                                  max_samples=100,
                                  max_features=1,
                                  max_depth=None)

In [None]:
# defining a function to get cross-validated evaluation metrics
def get_eval_metrics(model):
    """Return mean 5-fold CV accuracy, precision, recall and F1 on the full dataset."""
    metrics = ["accuracy", "precision", "recall", "f1"]

    # One cross_val_score run per metric, each averaged over the 5 folds
    return tuple(np.mean(cross_val_score(model, x, y, cv=5, scoring=metric))
                 for metric in metrics)

In [None]:
get_eval_metrics(reg_model)

In [None]:
get_eval_metrics(rf_model)
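
The function above runs a separate cross-validation for every metric. scikit-learn's `cross_validate` accepts several metrics at once, so the same folds are reused for all of them. A sketch on synthetic data (an assumption, standing in for `x` and `y`):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the heart disease table
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

scoring = ["accuracy", "precision", "recall", "f1"]
results = cross_validate(LogisticRegression(solver="liblinear"),
                         X_demo,
                         y_demo,
                         cv=5,
                         scoring=scoring)

# Result keys are "test_<metric>"; average each one over the 5 folds
mean_scores = {name: np.mean(results[f"test_{name}"]) for name in scoring}
print(mean_scores)
```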

### In this project, I used a random forest classifier and logistic regression to train and test on the dataset, then used the best parameters found by RandomizedSearchCV to calculate cross-validated evaluation metrics.