# Predicting heart disease using ML

In this notebook we use several Python ML libraries to build a model that predicts whether a person has heart disease.

## Approach followed

1. Problem definition
2. Data 
3. Evaluation
4. Features
5. Modelling
6. Experimenting (carried out at every step)

### 1. Problem Definition

> Given the required clinical data, can we predict whether a person has heart disease or not?

### 2. Data

* UCI : https://archive.ics.uci.edu/ml/datasets/heart+Disease
* Kaggle : https://www.kaggle.com/ronitf/heart-disease-uci

We use the 14 widely used attributes out of the 76 attributes in the original dataset.

### 3. Evaluation

> If the model reaches an accuracy of at least 95%, we will consider it a good model.

### 4. Features

***Creating Data Dictionary***

 
#### Attributes (13) : (Independent variables)

* age
* sex
* chest pain type (4 values)
* resting blood pressure
* serum cholesterol in mg/dl
* fasting blood sugar > 120 mg/dl
* resting electrocardiographic results (values 0,1,2)
* maximum heart rate achieved
* exercise induced angina
* oldpeak = ST depression induced by exercise relative to rest
* the slope of the peak exercise ST segment
* number of major vessels (0-3) colored by fluoroscopy
* thal: 3 = normal; 6 = fixed defect; 7 = reversible defect

#### Target (1) : (Dependent variable)

* target (yes - 1 / no - 0)
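
The data dictionary above can also be kept in code, so that column names can be looked up while exploring. The sketch below is only an illustration: the short column names are the ones commonly seen in the Kaggle `heart.csv` file, but this notebook only confirms `age`, `sex`, `cp`, `thalach` and `target`, so the remaining names are assumptions to check against `heart_disease.columns`. The helper `describe_columns` is hypothetical.

```python
# Hypothetical data dictionary: short column name -> description.
# Names other than age, sex, cp, thalach and target are assumptions.
data_dictionary = {
    "age": "age in years",
    "sex": "sex (1 = male, 0 = female)",
    "cp": "chest pain type (4 values)",
    "trestbps": "resting blood pressure",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl",
    "restecg": "resting electrocardiographic results (values 0, 1, 2)",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment",
    "ca": "number of major vessels (0-3) colored by fluoroscopy",
    "thal": "3 = normal; 6 = fixed defect; 7 = reversible defect",
    "target": "heart disease (1 = yes, 0 = no)",
}


def describe_columns(columns, dictionary):
    """Return (column, description) pairs, flagging undocumented columns."""
    return [(col, dictionary.get(col, "UNDOCUMENTED")) for col in columns]


# Example lookup, including a column that is not in the dictionary
for name, meaning in describe_columns(["age", "cp", "unknown_col"], data_dictionary):
    print(f"{name:12s} {meaning}")
```

Once the data is loaded, calling `describe_columns(heart_disease.columns, data_dictionary)` would show at a glance whether any column lacks a description.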

## Preparing the tools

For data analysis and manipulation:

* pandas
* numpy
* seaborn
* matplotlib

Scikit-learn models:

* Regression
* Classification

For evaluation:

* Splitting
* Cross validation
* Evaluation method libraries


In [1]:
# Data analysis libraries
import pandas as pd
import numpy as np
import seaborn as sns
# to make the plots appear in the notebook
%matplotlib inline 
import matplotlib.pyplot as plt

# Models from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

# Model Evaluation
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score
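# plot_roc_curve was removed in scikit-learn 1.2; fall back to its replacement there
try:
    from sklearn.metrics import plot_roc_curve
except ImportError:
    from sklearn.metrics import RocCurveDisplay
    plot_roc_curve = RocCurveDisplay.from_estimator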


## Load the data

In [1]:

heart_disease = pd.read_csv("/kaggle/input/heart-disease-uci/heart.csv")

# Display the first 10 rows of the dataset
heart_disease.head(10)

### Data Exploration
The goal here is to learn more about the data by asking:

1. What question are we trying to answer?
2. What kinds of data do we have, and how should each type be treated?
3. What is missing from the data, and how do we deal with it?
4. Where are the outliers, and why should we care about them?
5. How can we add, change or remove features to get more out of the data?

In [1]:
# How many samples of each class do we have?
heart_disease["target"].value_counts()

In [1]:
heart_disease["target"].value_counts().plot(kind = "bar", color = ["red","green"]);

In [1]:
# Checking whether our data has any missing values
heart_disease.isna().sum()

There are no missing values in this dataset.

In [1]:
# Describing our data
heart_disease.describe()

> [NOTE]: The upcoming steps can be performed on any attribute, but here we look at a few columns that catch the eye.

## Heart Disease Frequency VS Sex

### SEX
* 1 - MALE
* 0 - FEMALE

### Target
* 1 - YES
* 0 - NO

In [1]:
heart_disease.sex.value_counts()

In [1]:
pd.crosstab(heart_disease.target,heart_disease.sex)

This is based on our present dataset; the proportions may be different in the real world.

In [1]:
# Plotting the crosstab

pd.crosstab(heart_disease.target,heart_disease.sex).plot(kind = "bar",
                                                         figsize = (10,6),
                                                         color=["hotpink","blue"]);
plt.title("Heart Disease Frequency VS Sex")
plt.xlabel("1 - affected    0 - not affected")
plt.ylabel("Number of people")
plt.legend(["Female","Male"]);

### Max heart rate(thalach)  VS  Age for heart disease

In [1]:
heart_disease["thalach"].value_counts()

`thalach` has 91 different values, too many for a bar graph, so we use a scatter plot instead.

In [1]:
# Scatter plot
plt.figure(figsize = (10,6))

# Scatter with positive examples
plt.scatter(heart_disease.age[heart_disease.target == 1],
            heart_disease.thalach[heart_disease.target == 1],
            c="red")

#Scatter with negative examples
plt.scatter(heart_disease.age[heart_disease.target == 0],
            heart_disease.thalach[heart_disease.target == 0],
            c="green")

# Labeling
plt.title("Affected VS Not affected: AGE and MAX HEART RATE")
plt.xlabel("AGE")
plt.ylabel("HEART RATE")
plt.legend(["Affected","Not affected"]);

Since finding the trend by eye is difficult, we will let an ML model do the work for us.

In [1]:
# Check how the age values are distributed
heart_disease.age.plot.hist();

Normally distributed data looks like this : https://www.simplypsychology.org/normal-distribution.html

We are looking for outliers here, but the histogram does not show any.
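
Reading outliers off a histogram is a judgment call. As a complement, here is a minimal sketch of the common 1.5 x IQR rule, which flags values far outside the middle 50% of the data. The function name `iqr_outliers` and the toy ages are made up for illustration; this is not the check the notebook itself performs.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Toy ages with one implausible entry (hypothetical data)
ages = pd.Series([29, 41, 45, 52, 54, 57, 58, 60, 63, 120])
outliers = iqr_outliers(ages)
print(outliers.tolist())
```

On the real data the same call would be `iqr_outliers(heart_disease.age)`; an empty result would support the reading of the histogram.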

## Heart Disease vs Chest pain types
Chest pain:
* 0: Typical angina - related to heart
* 1: Atypical angina - not related to heart
* 2: Non-anginal - not related to heart
* 3: Asymptomatic - not showing signs of disease

In [1]:
pd.crosstab(heart_disease.cp,heart_disease.target)

In [1]:
# Plot the crosstab

pd.crosstab(heart_disease.cp,heart_disease.target).plot(kind = "bar",
                                                        figsize = (10,6),
                                                        color = ["green","red"])

plt.title("Heart disease frequency for different chest pain")
plt.xlabel("Chest pain type")
plt.ylabel("Frequency")
plt.legend(["No Disease","Disease"]);

## Correlation between the Attributes and the target

In [1]:
# Tabular format
heart_disease.corr()

In [1]:
# Visual format
corr_mat = heart_disease.corr()
fig, ax = plt.subplots(figsize = (15,10))
ax = sns.heatmap(corr_mat,
                 annot=True,
                 linewidths=0.5,
                 fmt=".2f",
                 cmap="YlGnBu")

To learn more about correlation matrices : https://www.displayr.com/what-is-a-correlation-matrix/#:~:text=A%20correlation%20matrix%20is%20a,a%20diagnostic%20for%20advanced%20analyses.
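
Each cell of the correlation matrix is a Pearson correlation coefficient between two columns (the pandas default for `DataFrame.corr`). The sketch below computes one coefficient by hand on a toy table and compares it with pandas. The helper `pearson` and the toy values are illustrative assumptions, not data from the notebook.

```python
import numpy as np
import pandas as pd


def pearson(a, b):
    """Pearson correlation: covariance divided by the product of spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float(numerator / denominator)


# Toy data: heart rate falling as age rises (hypothetical values)
toy = pd.DataFrame({"age": [40, 50, 60, 70], "thalach": [180, 170, 150, 140]})

manual = pearson(toy["age"], toy["thalach"])
from_pandas = toy.corr().loc["age", "thalach"]
print(round(manual, 4), round(from_pandas, 4))
```

A value near -1 or +1 means a strong linear relationship, a value near 0 means little linear relationship; it says nothing about causation.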

## Modeling 
* We have a classification problem

To learn more about supervised learning, classification vs regression : https://www.google.com/search?q=classification+vs+regression&rlz=1C1ONGR_enIN973IN973&oq=classification+vs+&aqs=chrome.0.0i433i512j0i512j69i57j0i512l7.9790j1j15&sourceid=chrome&ie=UTF-8

In [1]:
# Separate the features (x) from the target (y)

x = heart_disease.drop("target",axis=1)
y = heart_disease["target"]

Inspecting x and y

In [1]:
x

In [1]:
y

In [1]:
# Splitting the data into train and test datasets

# Seed NumPy's global random state so the split is reproducible
np.random.seed(42) 

# Actual splitting
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2)


Why we need to split the data into train and test sets : https://docs.microsoft.com/en-us/analysis-services/data-mining/training-and-testing-data-sets?view=asallproducts-allversions#:~:text=Separating%20data%20into%20training%20and,of%20evaluating%20data%20mining%20models.&text=Because%20the%20data%20in%20the,the%20model's%20guesses%20are%20correct.
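
A variation worth knowing about, shown here as a sketch on made-up data rather than as what this notebook does: `train_test_split` accepts `random_state` to fix the shuffle for that one call, and `stratify` to keep the class proportions the same in both parts. The arrays `features` and `labels` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples, 2 features, perfectly balanced classes
features = np.arange(40).reshape(20, 2)
labels = np.array([0] * 10 + [1] * 10)

x_tr, x_te, y_tr, y_te = train_test_split(
    features,
    labels,
    test_size=0.2,      # 20% of the rows go to the test set
    random_state=42,    # fixes the shuffle for this call only
    stratify=labels,    # keep the 50/50 class balance in both parts
)

print(len(x_tr), len(x_te), int(y_te.sum()))
```

With a small dataset, stratifying can matter because a purely random 20% sample may end up with noticeably different class proportions from the training set.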

In [1]:
x_train

In [1]:
y_train,len(y_train)

### Train and test by fitting the data to a model

* LogisticRegression
* KNeighborsClassifier
* Ensemble (RandomForestClassifier)

Yes, we can use Logistic "Regression" for classification. For more details : https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression

In [1]:
# Put models in a dictionary
models = {
          "Logistic Regression": LogisticRegression(solver='liblinear'), 
          "KNN": KNeighborsClassifier(),
          "Random Forest": RandomForestClassifier()
          }

# Create a function to fit and score models
def fit_and_score(models, x_train, x_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    x_train : training data
    x_test : testing data
    y_train : labels associated with training data
    y_test : labels associated with test data
    """
    # Random seed for reproducible results
    np.random.seed(42)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(x_train, y_train)
        # Evaluate the model and store its score in model_scores
        model_scores[name] = model.score(x_test, y_test)
    return model_scores

In [1]:
model_scores = fit_and_score(models = models,
                             x_train = x_train,
                             x_test = x_test,
                             y_train = y_train,
                             y_test = y_test)

model_scores

In the results above, Logistic Regression has the highest score.

The warning above suggests a way to improve the logistic regression model.

## Model comparison

In [1]:
model_compare = pd.DataFrame(model_scores, index = ["accuracy"])
model_compare.T.plot.bar();

## Improving the model

We are going to look at the following:

* Hyperparameter tuning
* Feature importance
* Confusion Matrix
* Cross-validation
* Precision
* Recall
* F1 score
* Classification report
* ROC curve
* Area Under the curve
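
Precision, recall and F1 in the list above are classification metrics derived from the four cells of a confusion matrix (they are unrelated to regression errors such as mean absolute error). As a worked example, the sketch below computes them from raw counts. The function `classification_metrics` and the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    # Precision: of everything predicted positive, how much was right?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of everything actually positive, how much did we find?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts: 6 true positives, 2 false positives,
# 4 false negatives, 8 true negatives
metrics = classification_metrics(tp=6, fp=2, fn=4, tn=8)
for name, value in metrics.items():
    print(f"{name:10s} {value:.3f}")
```

Here precision is 6 / 8 = 0.75 and recall is 6 / 10 = 0.60. In a medical setting a false negative (a missed patient) is usually the costlier mistake, so recall deserves attention alongside accuracy.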


### Hyperparameter Tuning

> KNN

In [1]:
# Tuning KNN

train_score = []
test_score = []

# Create a list of different values for n_neighbors

neighbors = range(1,21)

# Setup KNN instance

knn = KNeighborsClassifier()

# Loop through different n_neighbors

for i in neighbors:
    knn.set_params(n_neighbors = i)
    
    # Fit the model
    knn.fit(x_train,y_train)
    
    # Update the training score and test scores
    
    train_score.append(knn.score(x_train,y_train))
    
    test_score.append(knn.score(x_test,y_test))

In [1]:
train_score

In [1]:
test_score

In [1]:
plt.plot(neighbors, train_score, label = "Train Score")
plt.plot(neighbors, test_score, label = "Test Score")
plt.xticks(np.arange(1,21,1))
plt.xlabel("Number of neighbors")
plt.ylabel("Model Score")
plt.legend()

print(f"Maximum KNN score on the test data: {max(test_score)*100:.2f}%")

The highest score is 75.41%.
* [note] We could still try a different range of n_neighbors; I tried a few in this case and still did not get anywhere near the mark.

Goodbye KNN...
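
One thing not tried above: KNN is distance based, so features measured on large scales can dominate the distance. The sketch below shows how a `StandardScaler` can be chained in front of KNN with `make_pipeline`. It runs on synthetic data built to exaggerate the effect, so it says nothing about how much (or whether) scaling would help on the heart disease data. All variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_samples = 300
labels = rng.integers(0, 2, size=n_samples)

# One informative feature on a small scale, one pure-noise feature on a large scale
small_feature = labels + rng.normal(0, 0.3, size=n_samples)
large_feature = rng.normal(0, 1000, size=n_samples)
features = np.column_stack([small_feature, large_feature])

x_tr, x_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.2, random_state=42
)

# KNN on raw features vs KNN on standardized features
raw_knn = KNeighborsClassifier().fit(x_tr, y_tr)
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier()).fit(x_tr, y_tr)

raw_score = raw_knn.score(x_te, y_te)
scaled_score = scaled_knn.score(x_te, y_te)
print(f"raw: {raw_score:.2f}  scaled: {scaled_score:.2f}")
```

Because the scaler sits inside the pipeline, it is fitted on the training data only, which avoids leaking test-set statistics into the model.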

## Hyperparameter tuning with RandomizedSearchCV

To learn more about it : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

> Logistic Regression and RandomForest

In [1]:
# Create a hyperparameter grid for logistic regression

log_reg_grid = {"C" : np.logspace(-4, 4, 20),
                "solver" : ["liblinear"]}

# Create a hyperparameter grid for RandomForestClassifier

rf_grid = {"n_estimators" : np.arange(10, 1000, 50),
           "max_depth" : [None, 3, 5, 10],
           "min_samples_split" : np.arange(2, 20, 2),
           "min_samples_leaf" : np.arange(1, 20, 2)}
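
To see why a randomized search is attractive for the random forest grid, it helps to count the combinations. The sketch below rebuilds the same grid and uses scikit-learn's `ParameterGrid` and `ParameterSampler` utilities to compare an exhaustive search with a 20-candidate random sample. It is an illustration of the idea and does not train any model.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same random forest grid as defined in the notebook
rf_grid = {"n_estimators": np.arange(10, 1000, 50),
           "max_depth": [None, 3, 5, 10],
           "min_samples_split": np.arange(2, 20, 2),
           "min_samples_leaf": np.arange(1, 20, 2)}

# Every combination an exhaustive grid search would have to try
total_candidates = len(ParameterGrid(rf_grid))

# A randomized search only draws n_iter combinations from that space
sampled = list(ParameterSampler(rf_grid, n_iter=20, random_state=42))

print(f"exhaustive: {total_candidates} candidates")
print(f"randomized: {len(sampled)} candidates")
print("first sampled candidate:", sampled[0])
```

With 5-fold cross-validation every candidate is fitted five times, so the gap between the two approaches in fitting cost is large. The much smaller logistic regression grid has only 20 combinations, which is why an exhaustive search is practical there.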

In [1]:
# Tune using RandomizedSearchCV

np.random.seed(42)

# LogisticRegression
rs_log_reg = RandomizedSearchCV(LogisticRegression(),
                                param_distributions = log_reg_grid,
                                cv = 5,
                                n_iter = 20,
                                verbose = True)
# Fit the model

rs_log_reg.fit(x_train, y_train)

In [1]:
# Check the best params
rs_log_reg.best_params_

In [1]:
rs_log_reg.score(x_test,y_test)

The model did not improve; the score stayed the same. Let's leave it as it is for now.

***RandomForestClassifier***

In [1]:
# Set the random seed
np.random.seed(42)

#RandomForest

rs_rf = RandomizedSearchCV(RandomForestClassifier(),
                           param_distributions = rf_grid,
                           cv = 5,
                           n_iter = 20,
                           verbose = True)

# Fit the model
rs_rf.fit(x_train,y_train)

In [1]:
# Best parameters

rs_rf.best_params_

In [1]:
rs_rf.score(x_test,y_test)

The score has increased by 0.32.

But Logistic Regression still has the upper hand here.

Bye bye RandomForestClassifier...

### Improving the Logistic model

Let's revisit what we did while improving the models:
* By hand - KNN eliminated
* RandomizedSearchCV - RandomForestClassifier eliminated
* GridSearchCV - upcoming...

## Tuning hyperparameters using GridSearchCV
> LogisticRegression


To learn more about GridSearchCV : https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html

In [1]:
# Set up the hyperparameter grid
log_reg_grid = {"C" : np.logspace(-4, 4, 20),
                "solver" : ["liblinear"]}

# Set up the grid search
gs_log_reg = GridSearchCV(LogisticRegression(),
                          param_grid = log_reg_grid,
                          cv = 5,
                          verbose = True)

# Fit the model
gs_log_reg.fit(x_train,y_train)

In [1]:
# Check the best parameters
gs_log_reg.best_params_

In [1]:
# Evaluate

gs_log_reg.score(x_test,y_test)

We still got the same score as the baseline and RandomizedSearchCV :\

## Evaluating the models beyond score
* ROC curve & AUC score 
* Confusion matrix
* Classification report
    * Precision
    * Recall
    * F1 score

We have to make predictions in order to evaluate the model.

In [1]:
y_pred = gs_log_reg.predict(x_test)

y_pred

> ROC curve & AUC score

* Check out this link : https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc#:~:text=An%20ROC%20curve%20

A perfect model has an AUC score of 1.
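
One way to read the AUC score: it is the probability that a randomly chosen positive example receives a higher predicted score than a randomly chosen negative one. The sketch below computes it that way on four made-up predictions and compares the result with scikit-learn's `roc_auc_score`. The helper `pairwise_auc` is hypothetical and is only practical for small inputs.

```python
from sklearn.metrics import roc_auc_score


def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, y_score) if t == 1]
    negatives = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities (hypothetical)
y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]

manual_auc = pairwise_auc(y_true_demo, y_score_demo)
sklearn_auc = roc_auc_score(y_true_demo, y_score_demo)
print(manual_auc, sklearn_auc)
```

Three of the four positive/negative pairs are ordered correctly here, so both values come out at 0.75. A model that guesses at random would score around 0.5.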

In [1]:
plot_roc_curve(gs_log_reg, x_test, y_test);

AUC score = 0.92

> Confusion matrix

In [1]:
def plot_conf_mat(y_test,y_preds):
    fig, ax = plt.subplots(figsize=(3,3))
    ax = sns.heatmap(confusion_matrix(y_test,y_preds),
                     annot = True,
                     cbar = False)
    
    plt.xlabel("Predicted label")
    plt.ylabel("True label")
    
plot_conf_mat(y_test,y_pred)

> Classification report

* Precision
* Recall
* F1 score

In [1]:
print(classification_report(y_test,y_pred))

These metrics are calculated from a single train/test split only.

### Using cross validation
> Classification report

* Accuracy
* Precision
* Recall
* F1 score

In [1]:
# Create a new classifier using best params

clf  = LogisticRegression(C=0.23357214690901212,
                          solver = "liblinear")

In [1]:
# Cross-validated Accuracy

cv_acc = cross_val_score(clf,x,y,cv=5,scoring="accuracy")

cv_acc

acc_mean = np.mean(cv_acc)
acc_mean

In [1]:
# Cross-validated Precision

cv_pre = cross_val_score(clf,x,y,cv=5,scoring="precision")

cv_pre

pre_mean = np.mean(cv_pre)
pre_mean

In [1]:
# Cross-validated Recall

cv_re = cross_val_score(clf,x,y,cv=5,scoring="recall")

cv_re

re_mean = np.mean(cv_re)
re_mean

In [1]:
# Cross-validated F1 score

cv_f1 = cross_val_score(clf,x,y,cv=5,scoring="f1")

cv_f1

f1_mean = np.mean(cv_f1)
f1_mean

In [1]:
# Visualizing cross-validated metrics

cv_metrics = pd.DataFrame({"Accuracy":acc_mean,
                           "Precision":pre_mean,
                           "Recall":re_mean,
                           "F1":f1_mean},
                            index = [0])
cv_metrics.T.plot.bar(title = "Cross-validated Classification Report",legend = False);
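
The four `cross_val_score` calls above each refit the model five times. As an alternative sketch, scikit-learn's `cross_validate` accepts a list of metrics and computes them all in a single pass over the folds. The example uses synthetic data from `make_classification` so that it runs on its own; `X_demo`, `y_demo` and `clf_demo` are stand-ins, not the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the heart disease features and target
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)

clf_demo = LogisticRegression(solver="liblinear")
metric_names = ["accuracy", "precision", "recall", "f1"]

# One call, five folds, four metrics
scores = cross_validate(clf_demo, X_demo, y_demo, cv=5, scoring=metric_names)

# Results are stored under "test_<metric>" keys, one value per fold
mean_scores = {name: float(scores[f"test_{name}"].mean()) for name in metric_names}
for name, value in mean_scores.items():
    print(f"{name:10s} {value:.3f}")
```

The resulting dictionary can be passed to `pd.DataFrame(mean_scores, index=[0])` to reproduce the bar chart with less code.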

## Important Features

> This is different for different models

for LogisticRegression

In [1]:
clf  = LogisticRegression(C=0.23357214690901212,
                          solver = "liblinear")

# fit the model

clf.fit(x_train,y_train);

In [1]:
# Coefficients : how much each attribute contributes to the prediction
clf.coef_

In [1]:
# Match the coefficients to the feature columns (x excludes the target)
feature_dict = dict(zip(x.columns,list(clf.coef_[0])))
feature_dict

In [1]:
# Visualize it
feature_df = pd.DataFrame(feature_dict, index = [0])
feature_df.T.plot.bar(title="Feature Importance",legend = False);

We can analyse the graph above and use it to improve our dataset further.
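
To help read the coefficients: in logistic regression, `exp(coefficient)` is the odds ratio, that is, the factor by which the odds of the positive class are multiplied when that feature increases by one unit and the others are held fixed. The sketch below converts a few coefficients to odds ratios and ranks them. The coefficient values are invented for illustration and are not the notebook's results.

```python
import math

# Hypothetical coefficients, NOT taken from the fitted model above
example_coefs = {"cp": 0.66, "sex": -0.86, "age": 0.003}


def odds_ratios(coefs):
    """Convert logistic regression coefficients to odds ratios."""
    return {name: math.exp(value) for name, value in coefs.items()}


ratios = odds_ratios(example_coefs)

# Rank features by the size of their effect, regardless of direction
ranked = sorted(ratios.items(), key=lambda item: abs(math.log(item[1])), reverse=True)
for name, ratio in ranked:
    direction = "raises" if ratio > 1 else "lowers"
    print(f"{name:5s} odds ratio {ratio:.2f} ({direction} the odds)")
```

One caveat: coefficients are per unit of each feature, so features measured on very different scales (age in years versus a 0/1 flag) are not directly comparable unless the inputs were standardized first.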

# Conclusion
* We got an accuracy of 88.5%, which is below our 95% target
* So we cannot use the model in practice yet

> What can we do?

* Try a different model such as CatBoost or XGBoost
* Try to collect more data