# Predicting heart disease using machine learning 

This notebook uses Python-based machine learning and data science libraries to build a machine learning model capable of predicting whether or not someone has heart disease based on their medical attributes.

We are going to take the following approach:
1. Problem definition
2. Data
3. Evaluation
4. Features
5. Modelling
6. Experimentation


## 1. Problem Definition

In a statement,
> Given clinical parameters about a patient, can we predict whether or not they have heart disease?

## 2. Data

The original data is the Cleveland dataset from the UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/heart+disease

There is also a version on Kaggle. https://www.kaggle.com/datasets/johnsmith88/heart-disease-dataset

## 3. Evaluation

> If we can reach 95% accuracy at predicting whether or not a patient has heart disease during the proof of concept, we will pursue the project.
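
The target above can be written down as an explicit check. This is a minimal sketch, assuming accuracy is measured on a held-out test set; `ACCURACY_TARGET` and `meets_target` are illustrative names that do not appear elsewhere in this notebook, and the labels are made up.

```python
from sklearn.metrics import accuracy_score

ACCURACY_TARGET = 0.95  # proof-of-concept threshold from the statement above


def meets_target(y_true, y_pred, target=ACCURACY_TARGET):
    """Return the accuracy and whether it reaches the target."""
    acc = accuracy_score(y_true, y_pred)
    return acc, bool(acc >= target)


# Toy labels for illustration only (not the heart disease data)
acc, ok = meets_target([1, 0, 1, 1, 0], [1, 0, 1, 0, 0])
print(f"accuracy={acc:.2f}, target met: {ok}")
```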

## 4. Features

This is where you get different information about each of the features in your data.

**Create data dictionary**

1. age
>age in years

2. sex
>(1 = male; 0 = female)

3. cp
>chest pain type

4. trestbps
>resting blood pressure (in mm Hg on admission to the hospital)

5. chol
>serum cholesterol in mg/dl

6. fbs
>(fasting blood sugar &gt; 120 mg/dl) (1 = true; 0 = false)

7. restecg
>resting electrocardiographic results

8. thalach
>maximum heart rate achieved

9. exang
>exercise induced angina (1 = yes; 0 = no)

10. oldpeak
>ST depression induced by exercise relative to rest

11. slope
>the slope of the peak exercise ST segment

12. ca
>number of major vessels (0-3) colored by fluoroscopy

13. thal
>1 = normal; 2 = fixed defect; 3 = reversible defect

14. target
>1 or 0 (1 = heart disease; 0 = no heart disease)
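
The data dictionary can also be kept in code, which makes it easy to check that every column in a DataFrame is documented. A small sketch, assuming the column names listed above; `data_dictionary` and `undocumented_columns` are illustrative names, and the toy frame with its `bmi` column is made up for the example.

```python
import pandas as pd

# Data dictionary as a plain dict, copied from the list above
data_dictionary = {
    "age": "age in years",
    "sex": "1 = male; 0 = female",
    "cp": "chest pain type",
    "trestbps": "resting blood pressure (mm Hg on admission to the hospital)",
    "chol": "serum cholesterol in mg/dl",
    "fbs": "fasting blood sugar > 120 mg/dl (1 = true; 0 = false)",
    "restecg": "resting electrocardiographic results",
    "thalach": "maximum heart rate achieved",
    "exang": "exercise induced angina (1 = yes; 0 = no)",
    "oldpeak": "ST depression induced by exercise relative to rest",
    "slope": "slope of the peak exercise ST segment",
    "ca": "number of major vessels (0-3) colored by fluoroscopy",
    "thal": "1 = normal; 2 = fixed defect; 3 = reversible defect",
    "target": "1 or 0",
}


def undocumented_columns(frame, dictionary):
    """Return the columns present in the frame but missing from the dictionary."""
    return [col for col in frame.columns if col not in dictionary]


# Tiny made-up frame; `bmi` is a hypothetical column that is not in the dictionary
toy = pd.DataFrame({"age": [63], "sex": [1], "bmi": [27.1], "target": [1]})
print(undocumented_columns(toy, data_dictionary))
```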

## Preparing the tools

We are going to use pandas, NumPy and Matplotlib for data analysis and manipulation.

In [None]:
# Import all the tools we need

# Regular EDA and plotting libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import dill
import sys

import warnings

# Suppress FutureWarnings
warnings.simplefilter(action='ignore', category=FutureWarning)


# So that our plot will appear inside the notebook
%matplotlib inline 



# Models from scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
import lightgbm as lgb
from xgboost import XGBClassifier

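# Model evaluations
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score, RandomizedSearchCV
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import precision_score, recall_score, f1_score, RocCurveDisplay

# plot_roc_curve was removed in scikit-learn 1.2; import it only where it still exists
try:
    from sklearn.metrics import plot_roc_curve
except ImportError:
    plot_roc_curve = None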




## Load data

In [None]:
df =pd.read_csv("D:\\OneDrive\\Documents\\Personal Project Portfolio\\heart-diesease-project_Classification\\app\\data\\heart-disease.csv")
df.shape # (Rows, Columns)

In [None]:
df.head()

## Exploratory Data Analysis (EDA)

The goal here is to find out more about the data and become a subject matter expert on the dataset you are working with.

1. What question(s) are you trying to solve?
2. What kind of data do you have and how do you treat different types?
3. What is missing from the data and how do you deal with it?
4. Where are the outliers and why should you care about them?
5. How can you add, change or remove features to get more out of your data?
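
Questions 2 to 4 can be answered in one pass per column. The sketch below is one possible way to do it, not the method used in the rest of this notebook: `quick_eda_summary` is an illustrative helper, the 1.5 x IQR rule is just one common outlier convention, and the values are made up.

```python
import numpy as np
import pandas as pd


def quick_eda_summary(frame):
    """Per-column dtype, missing count and number of 1.5*IQR outliers (numeric columns only)."""
    rows = []
    for col in frame.columns:
        series = frame[col]
        n_outliers = np.nan
        if pd.api.types.is_numeric_dtype(series):
            q1, q3 = series.quantile([0.25, 0.75])
            iqr = q3 - q1
            mask = (series < q1 - 1.5 * iqr) | (series > q3 + 1.5 * iqr)
            n_outliers = int(mask.sum())
        rows.append({"column": col,
                     "dtype": str(series.dtype),
                     "missing": int(series.isna().sum()),
                     "iqr_outliers": n_outliers})
    return pd.DataFrame(rows).set_index("column")


# Made-up values for illustration only
toy = pd.DataFrame({"chol": [210, 220, 230, 240, 250, 600],
                    "sex": [1, 0, 1, 1, None, 0]})
summary = quick_eda_summary(toy)
print(summary)
```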


In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.target.value_counts() # method 1 of referencing columns

In [None]:
df['target'].value_counts() # method 2 of referencing columns

In [None]:
df.target.value_counts().plot(kind='bar',color=['black','gray'],
                              ylabel='counts', xlabel='target', title='Target Value Count');

In [None]:
df.info()

In [None]:
df.isna().sum()

In [None]:
df.describe()

### Heart Disease Frequency according to Sex

In [None]:
df.sex.value_counts()

In [None]:
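# Rename the index so each bar is labelled directly
# (a legend on a single Series only shows one entry)
df.sex.value_counts().rename({1: 'Male', 0: 'Female'}).plot(kind='bar', color=['black', 'gray'],
                                                            xlabel='sex', ylabel='count');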



In [None]:
# compare target column with the sex column
pd.crosstab(df.target,df.sex)

In [None]:
#create a plot of the crosstab
pd.crosstab(df.target,df.sex).plot(kind='bar',
                                  figsize=(10,6),
                                  color=['black','gray']);

plt.title('Heart Disease Frequency for Sex')
plt.xlabel('0 = No Disease, 1 = Disease')
plt.ylabel('Number of People')
plt.legend(['Female','Male']);
plt.xticks(rotation=0);

In [None]:
df.thalach.value_counts()

#### Age Vs Max Heart Rate for Heart Disease

In [None]:
# Create another figure
plt.figure(figsize=(10,6))

# Scatter with positive examples
plt.scatter(df.age[df.target==1],
            df.thalach[df.target==1],
            color='salmon')

# Scatter with negative examples
plt.scatter(df.age[df.target==0],
            df.thalach[df.target==0],
            color='black');

# Add some helpful info
plt.title('Heart Disease as a Function of Age and Max Heart Rate')
plt.xlabel('Age')
plt.ylabel('Max Heart Rate')
plt.legend(['Disease', 'No Disease']);

In [None]:
# Check the distribution of the Age column (Spread of the data)
df.age.plot.hist(color='gray');

## Compare the Chest Pain Type to Target
Heart disease frequency per chest pain type

cp: chest pain type (coded 0-3 in this dataset, matching the plot labels below)
* Value 0: typical angina
* Value 1: atypical angina
* Value 2: non-anginal pain
* Value 3: asymptomatic

In [None]:
pd.crosstab(df.cp,df.target)

In [None]:
## make the crosstab more visual

pd.crosstab(df.cp,df.target).plot(kind='bar',color=['black','gray'],
                                 figsize=(10,6))

# Add some communication
plt.title('Heart Disease and Chest Pain Type')
plt.xlabel('Chest pain type (0 = typical angina, 1 = atypical angina, 2 = non-anginal pain, 3 = asymptomatic)')
plt.ylabel('Number of people')
plt.legend(['No Disease','Disease']);
plt.xticks(rotation=0);

In [None]:
df.head(2)

In [None]:
# One histogram per column; the 3x6 grid leaves four unused axes, which are hidden
fig, axs = plt.subplots(3, 6, figsize=(17, 17), sharey=True)
axes = axs.ravel()

for ax, column in zip(axes, df.columns):
    sns.histplot(df, x=column, kde=True, ax=ax)

for ax in axes[len(df.columns):]:
    ax.set_visible(False)

plt.tight_layout()

In [None]:
# Make a correlation matrix of all the features
df.corr()

In [None]:
# Let's make our correlation matrix a little better
corr_matrix = df.corr()
fig, ax = plt.subplots(figsize=(15, 10))

# `linewidths` is the documented seaborn.heatmap parameter name
ax = sns.heatmap(corr_matrix,
                 annot=True,
                 linewidths=0.5,
                 fmt='.2f',
                 cmap='YlGnBu')

## 5. Modelling

In [None]:
df.head()

In [None]:
# Split the data into X and y
X= df.drop('target',axis=1)
X

In [None]:
y = df.target
y

In [None]:
# Split the data into train and test sets
np.random.seed(42)

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)
X_train.shape,y_train.shape,X_test.shape,y_test.shape

Now that we have split the data into training and test sets, it's time to build a machine learning model.

We will train it (find the patterns) on the training set, and we will test it (use the patterns) on the test set.

We are going to try out 6 different machine learning models:
1. Logistic Regression
2. K-Nearest Neighbours Classifier
3. Random Forest Classifier
4. Support Vector Machine (SVC)
5. XGBoost Classifier
6. LightGBM Classifier
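
The train/test idea described above can also be made reproducible without a global seed. The sketch below is an alternative to the `np.random.seed(42)` approach used in this notebook, not a replacement for it: it passes `random_state` directly and adds `stratify` so both sets keep the same class balance. The data is synthetic, generated with `make_classification`, and the `_demo` names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and labels (not the heart disease data)
X_demo, y_demo = make_classification(n_samples=300, n_features=13,
                                     weights=[0.45, 0.55], random_state=42)

# random_state fixes the shuffle; stratify keeps the class ratio equal in both sets
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2,
                                          stratify=y_demo,
                                          random_state=42)

print(X_tr.shape, X_te.shape)
print(f"positive rate: train={y_tr.mean():.3f}, test={y_te.mean():.3f}")
```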

In [None]:
# Put models in a dictionary
models = {'Logistic Regression': LogisticRegression(),
          'KNN': KNeighborsClassifier(),
          'Random Forest': RandomForestClassifier(),
          'SVM': SVC(),
          'XGB': XGBClassifier(),
          'LightGBM': lgb.LGBMClassifier()}

# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    '''
    Fits and evaluates given machine learning models.
    models: a dict of different Scikit-Learn compatible machine learning models
    X_train: training data (no labels)
    X_test: test data (no labels)
    y_train: training labels
    y_test: test labels
    '''
    # Set random seed
    np.random.seed(42)

    # Make a dictionary to keep model scores
    model_scores = {}

    # Loop through the models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model and store its score in model_scores
        model_scores[name] = model.score(X_test, y_test)
    return model_scores

In [None]:
model_scores = fit_and_score(models=models,
                             X_train=X_train,
                             X_test=X_test,
                             y_train=y_train,
                             y_test=y_test)

model_scores

## Model Comparison

In [None]:
model_compare=pd.DataFrame(model_scores,index=['accuracy'])
model_compare


In [None]:
model_compare.T.plot(kind='bar');

Now we have a baseline model, and we know a model's first predictions aren't always what we should base our next steps on.
What should we do?

Let's look at the following:
* Hyperparameter tuning (*all types of models*)
* Feature importance (*all types of models*)
* Confusion matrix (*classification models only*)
* Cross-validation (*all types of models*)
* Precision (*classification models only*)
* Recall (*classification models only*)
* F1 Score (*classification models only*)
* Classification report (*classification models only*)
* ROC curve (*classification models only*)
* Area under the curve (AUC) (*classification models only*)

## Hyperparameter tuning (by hand)



In [None]:
# Let's tune KNN
train_scores =[]
test_scores =[]

# Create a list of different values for n_neighbors
neighbors = range(1,21)

#Setup KNN instance 
knn = KNeighborsClassifier()

# loop through different n_neighbors
for i in neighbors:
    knn.set_params(n_neighbors=i)
    
    #fit the algorithm
    knn.fit(X_train,y_train)
    
    #update the training scores list
    train_scores.append(knn.score(X_train,y_train))
    
    #update the test scores list
    test_scores.append(knn.score(X_test,y_test))
    



In [None]:
train_scores

In [None]:
test_scores

In [None]:
plt.plot(neighbors, train_scores, label='train score')
plt.plot(neighbors, test_scores, label='test score')

# Add some communication
plt.title('KNN n_neighbors tuning')
plt.xlabel('Number of neighbors')
plt.xticks(np.arange(1, 21, 1))
plt.ylabel('Score')
plt.legend()

print(f'Maximum KNN score on the test data: {max(test_scores)*100:.2f}%')

## Hyperparameter tuning (with RandomizedSearchCV)

We are going to tune the following models using RandomizedSearchCV:
* LogisticRegression()
* RandomForestClassifier()

In [None]:
# Create a hyperparameter grid for LogisticRegression()
log_reg_grid = {"C":np.logspace(-4,4,20),
               'solver':['liblinear']}

# Create a hyperparameter grid for RandomForestClassifier()
rf_grid= {'n_estimators':np.arange(10,1000,50),
         'max_depth':[None,3,5,10],
         'min_samples_split':np.arange(2,20,2),
         'min_samples_leaf':np.arange(1,20,2)}

Now that we have hyperparameter grids set up for each of our models, let's tune them using RandomizedSearchCV.
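
To see what a randomized search does with a grid like the random forest one, the sketch below counts every combination with `ParameterGrid` and then draws 20 of them with `ParameterSampler`, the scikit-learn utility that produces the candidates for a randomized search. `rf_grid_demo` is a copy of the grid above under an illustrative name, and `random_state=42` is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same values as the random forest grid above, under a separate name
rf_grid_demo = {'n_estimators': np.arange(10, 1000, 50),
                'max_depth': [None, 3, 5, 10],
                'min_samples_split': np.arange(2, 20, 2),
                'min_samples_leaf': np.arange(1, 20, 2)}

# Every possible combination versus the handful a randomized search tries
total_combinations = len(ParameterGrid(rf_grid_demo))
sampled = list(ParameterSampler(rf_grid_demo, n_iter=20, random_state=42))

print(f'{len(sampled)} of {total_combinations} combinations sampled')
print(sampled[0])
```

With `cv=5`, each sampled combination is fitted five times, so `n_iter=20` costs 100 fits instead of the tens of thousands an exhaustive search over the same grid would need.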

In [None]:
# Tune LogisticRegression

np.random.seed(42)

# setup random hyperparameter search for LogisticRegression
rs_log_reg =RandomizedSearchCV(LogisticRegression(),
                              param_distributions=log_reg_grid,
                              cv=5,
                              n_iter=20,
                              verbose=True)

# Fit random hyperparameter search model for LogisticRegression
rs_log_reg.fit(X_train,y_train)


In [None]:
rs_log_reg.best_params_

In [None]:
rs_log_reg.score(X_test,y_test)

Now that we have tuned LogisticRegression(), let's do the same for RandomForestClassifier().

In [None]:
# Setup random seed
np.random.seed(42)

# setup random hyperparameter search for RandomForestClassifier
rf =RandomizedSearchCV(RandomForestClassifier(),
                              param_distributions=rf_grid,
                              cv=5,
                              n_iter=20,
                              verbose=True)

# Fit random hyperparameter search model for RandomForestClassifier
rf.fit(X_train,y_train)


In [None]:
# Find the best hyperparameters
rf.best_params_

In [None]:
# Evaluate the randomized search RandomForestClassifier model
rf.score(X_test,y_test)

## Hyperparameter Tuning with GridSearchCV

Since our LogisticRegression model provides the best score so far, we'll try to improve it again using GridSearchCV.
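
Unlike a randomized search, GridSearchCV fits every combination in the grid on every fold, so it is worth estimating the cost before running it. A rough sketch, assuming a grid of 100 `C` values, one solver and 5-fold cross-validation; `grid_demo` and `n_folds` are illustrative names.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

grid_demo = {'C': np.logspace(-4, 4, 100),
             'solver': ['liblinear']}
n_folds = 5

# One candidate per combination, each fitted once per fold
n_candidates = len(ParameterGrid(grid_demo))
n_fits = n_candidates * n_folds

print(f'{n_candidates} candidates x {n_folds} folds = {n_fits} fits')
```

With the default `refit=True`, the search fits the best candidate one more time on the full training set after these fits.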

In [None]:
# Different parameters for our LogisticRegression model
log_reg_grid= {'C': np.logspace(-4,4,100),
              'solver':['liblinear']}

#Setup grid hyperparameter search for LogisticRegression
gs_log_reg =GridSearchCV(LogisticRegression(),
                        param_grid=log_reg_grid,
                        cv=5,
                        verbose=True)

#Fit grid hyperparameter search model
gs_log_reg.fit(X_train,y_train)

In [None]:
gs_log_reg.best_params_

In [None]:
# Evaluate the grid search LogisticRegression() model
gs_log_reg.score(X_test,y_test)

## Evaluating our tuned machine learning classifier, beyond accuracy

* ROC curve and AUC
* Confusion matrix
* Classification report
* Precision
* Recall
* F1-score

...and it would be great if cross-validation was used where possible.

To make comparisons and evaluate our trained model, first we need to make predictions
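
Precision, recall and F1 all come from the four counts in the confusion matrix. The sketch below works them out by hand on made-up labels, so the definitions are visible before scikit-learn's functions are used on the real predictions; the `_demo` arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
y_true_demo = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred_demo = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# For binary 0/1 labels, ravel() gives the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f'precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}')
```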




In [None]:
# Make predictions with tuned model
y_preds =gs_log_reg.predict(X_test)

In [None]:
y_preds

In [None]:
# Plot the ROC curve and calculate the AUC metric.
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator replaces it.
RocCurveDisplay.from_estimator(gs_log_reg, X_test, y_test);


In [None]:
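# from_predictions needs scores or probabilities, not hard 0/1 labels, to trace the full curve
y_probs = gs_log_reg.predict_proba(X_test)[:, 1]
RocCurveDisplay.from_predictions(y_test, y_probs);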

In [None]:
# Confusion matrix
print (confusion_matrix(y_test,y_preds))

In [None]:
#Visualise using seaborn
sns.set(font_scale=1.5)

def plot_conf_mat(y_test, y_preds):
    '''
    Plot a nice looking confusion matrix using seaborn's heatmap().
    Rows of confusion_matrix() are true labels, columns are predicted labels.
    '''
    fig, ax = plt.subplots(figsize=(3, 3))
    ax = sns.heatmap(confusion_matrix(y_test, y_preds),
                     annot=True,
                     cbar=False)
    plt.xlabel('Predicted Label')
    plt.ylabel('True Label')

plot_conf_mat(y_test, y_preds)

Now that we've got the ROC curve, an AUC metric and a confusion matrix, let's get a classification report as well as cross-validated precision, recall and f1-score.

In [None]:
print(classification_report(y_test,y_preds))

### Calculate evaluation metrics (precision, recall and f1-score) using cross-validation

We are going to calculate precision, recall and f1-score of our model using cross-validation and to do so we will be using cross_val_score().
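
As an aside, scikit-learn's `cross_validate` accepts a list of metrics, so all four scores can come from a single pass over the folds instead of one `cross_val_score` call per metric. A minimal sketch on synthetic data; `X_demo`, `y_demo` and `clf_demo` are stand-ins, not the notebook's data or tuned model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate

# Synthetic stand-in data (not the heart disease data)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)
clf_demo = LogisticRegression(solver='liblinear')

scoring = ['accuracy', 'precision', 'recall', 'f1']
cv_results = cross_validate(clf_demo, X_demo, y_demo, cv=5, scoring=scoring)

# Results are stored under 'test_<metric>', one value per fold
cv_means = {metric: np.mean(cv_results[f'test_{metric}']) for metric in scoring}
print(cv_means)
```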


In [None]:
#check the best hyperparameters 
gs_log_reg.best_params_

In [None]:
# Create a new classifier with best parameters
clf =LogisticRegression(C=0.20565123083486536,solver ='liblinear')



In [None]:
#cross-validated accuracy
cv_acc=cross_val_score(clf,
                       X,
                       y,
                       cv=5,
                       scoring ='accuracy')
cv_acc

In [None]:
cv_acc=np.mean(cv_acc)
cv_acc

In [None]:
#cross-validated precision
cv_precision=cross_val_score(clf,
                       X,
                       y,
                       cv=5,
                       scoring ='precision')
cv_precision

In [None]:
cv_precision=np.mean(cv_precision)
cv_precision

In [None]:
#cross-validated recall
cv_recall =cross_val_score(clf,
                       X,
                       y,
                       cv=5,
                       scoring ='recall')
cv_recall=np.mean(cv_recall)
cv_recall


In [None]:
#cross-validated f1-score
cv_f1 =cross_val_score(clf,
                       X,
                       y,
                       cv=5,
                       scoring ='f1')
cv_f1=np.mean(cv_f1)
cv_f1

In [None]:
# Visualise the cross-validated metrics
cv_metrics =pd.DataFrame({'Accuracy':cv_acc,
                        'Precision':cv_precision,
                        'Recall':cv_recall,
                        'F1':cv_f1},index =[0])
cv_metrics

In [None]:
cv_metrics.T.plot(kind='bar',color='black',title='Cross-validated classification metrics',legend=False);

### Feature Importance

Feature importance is another way of asking, "which features contributed most to the outcomes of the model and how did they contribute?"

Finding feature importance is different for each machine learning model. One way to find out how is to search for "(MODEL NAME) feature importance".

Let's find the feature importance for our LogisticRegression model
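
For logistic regression, each coefficient is a change in log-odds, so `exp(coefficient)` reads as an odds ratio per one-unit increase in that feature. The sketch below shows the idea on synthetic data; the feature names are hypothetical, and the ranking by absolute coefficient is only meaningful when features are on comparable scales.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data with hypothetical feature names
X_demo, y_demo = make_classification(n_samples=300, n_features=4, n_informative=2,
                                     n_redundant=0, random_state=42)
feature_names = ['feat_a', 'feat_b', 'feat_c', 'feat_d']

model_demo = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

# exp(coef) > 1 raises the odds of class 1, exp(coef) < 1 lowers them
coef_table = pd.DataFrame({'coefficient': model_demo.coef_[0],
                           'odds_ratio': np.exp(model_demo.coef_[0])},
                          index=feature_names)
coef_table = coef_table.sort_values('coefficient', key=np.abs, ascending=False)
print(coef_table)
```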


In [None]:
# Fit an instance of LogisticRegression

clf= LogisticRegression(C=0.20565123083486536,solver ='liblinear')

clf.fit(X_train,y_train)

In [None]:
# Check coef_
clf.coef_

In [None]:
# Match coefficients of features to columns (X.columns excludes the target)
feature_dict = dict(zip(X.columns, list(clf.coef_[0])))
feature_dict

In [None]:
# Visualise feature importances
feature_df = pd.DataFrame(feature_dict, index=[0])
feature_df.T.plot.barh(title='Feature Importance', legend=False);

In [None]:
pd.crosstab(df['sex'],df['target'])

In [None]:
pd.crosstab(df['slope'],df['target'])

slope: the slope of the peak exercise ST segment
* 0: Upsloping: better heart rate with exercise (uncommon)
* 1: Flatsloping: minimal change (typical healthy heart)
* 2: Downsloping: signs of unhealthy heart

## 6. Experimentation

If you haven't hit your evaluation metric target yet, ask yourself:

* Could you collect more data?
* Could you try a better model, like CatBoost or XGBoost?
* Could you improve the current models (beyond what has been done so far)?

If the model is good enough (you have hit your evaluation metric), how would you export it and share it with others?
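
One common way to export a fitted scikit-learn model is to serialise it to disk and load it back elsewhere. The sketch below uses `joblib`, which is an assumption for this example (the helper cells further down use `dill` instead); the model, data and file name are all stand-ins, and the file is written to a temporary directory.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data and model (not the tuned heart disease model)
X_demo, y_demo = make_classification(n_samples=200, n_features=13, random_state=42)
model_demo = LogisticRegression(solver='liblinear').fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.joblib')  # hypothetical file name
    joblib.dump(model_demo, model_path)                 # save to disk
    restored = joblib.load(model_path)                  # load it back

# The restored model should predict exactly like the original
same_predictions = bool((restored.predict(X_demo) == model_demo.predict(X_demo)).all())
print(same_predictions)
```

Serialised models should only be loaded from trusted sources and with library versions compatible with the ones used to save them.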

In [None]:
# Let's try XGBoost, LightGBM and SVM classifiers
# !pip install xgboost
# !pip install lightgbm
# SVC ships with scikit-learn (sklearn.svm), so it needs no separate install

In [None]:
# Helper function: hyperparameter tuning with GridSearchCV
def evaluate_models(X_train, y_train, X_test, y_test, models, param):
    """
    Fit and score the given models, running a grid search with
    cross-validation over the parameter grid provided for each one.

    Parameters:
        X_train - training data input features
        y_train - training data labels
        X_test - test data input features
        y_test - test data labels
        models: dict - model name -> ML model to experiment with
        param: dict - model name -> parameter grid to search

    Returns: a dictionary of model name -> test score
    """
    report = {}

    for name, model in models.items():
        gs = GridSearchCV(model, param[name], cv=3, verbose=3)
        gs.fit(X_train, y_train)

        # Refit the original model object with the best parameters found
        model.set_params(**gs.best_params_)
        model.fit(X_train, y_train)

        report[name] = model.score(X_test, y_test)

    return report

In [None]:
def load_object(file_path):
    """Load an object saved with dill from file_path."""
    with open(file_path, "rb") as file_obj:
        return dill.load(file_obj)
        
        
def save_object(file_path, obj):
    with open(file_path, 'wb') as file_obj:
        dill.dump(obj, file_obj)

In [None]:
# Extend the fit-and-score helper to tune each model, pick the best one and save it

def fit_and_score_best_model(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates models passed to it.
    
    Parameters:
    -----------
    models : Dictionary of machine learning models
    X_train : training data (without labels)
    X_test : test data (without labels)
    y_train : training labels
    y_test : test labels
    
    Returns:
    -----------
    The best performing model out of the model passed to it
    
    """
    
    # Set random seed
    np.random.seed(42)

    # Create the parameter grid
    params = {
    'Logistic Regression': {
        'penalty': ['l2'],  # Removing 'None', 'l1', 'elasticnet'
        'dual': [False],    # Keeping only False
        'solver': ['lbfgs', 'liblinear']  # Most common solvers
    },
    'SVM': {
        'C': [0.1, 1, 10],  # Reduced range
        'kernel': ['linear', 'rbf'],  # Most common kernels
        'degree': [3],  # Commonly used degree for polynomial kernel
        'gamma': ['scale']  # Default value
    },
    'Random Forest': {
        'criterion': ['gini'],  # Keeping only 'gini'
        'max_features': ['sqrt'],  # Common practice
        'n_estimators': [32, 128],  # Reduced range
        'max_leaf_nodes': [None],  # Default value
        'bootstrap': [True],  # Common practice
        'n_jobs': [1],  # Using single core to avoid memory issues
        'max_samples': [None],  # Default value
        'min_samples_split': [2, 10],  # Reduced range
        'min_samples_leaf': [1, 5]  # Reduced range
    },
    'XGB': {
        'learning_rate': [0.01, 0.1],  # Reduced range
        'max_depth': [6, 8],  # Commonly used depths
        'gamma': [2, 9],  # Reduced range
        'sampling_method': ['uniform'],  # Default value
        'grow_policy': ['depthwise'],  # Common practice
        'n_estimators': [32, 128]  # Reduced range
    },
    'LightGBM': {
        'boosting_type': ['gbdt'],  # Common practice
        'max_depth': [-1, 2],  # Reduced range
        'learning_rate': [0.01, 0.1],  # Reduced range
        'n_estimators': [100, 200],  # Reduced range
        'num_leaves': [31, 50]  # Reduced range
    }
}

               
    # Evaluate the model and append its score to model_report
    model_report:dict=evaluate_models(X_train=X_train, y_train=y_train, X_test=X_test, y_test=y_test,
                                                models=models, param=params)
        
    # Get the name and score of the best model from the report
    best_model_name = max(model_report, key=model_report.get)
    best_model_score = model_report[best_model_name]
    best_model = models[best_model_name]

    if best_model_score < 0.6:
        print("No model reached the minimum score of 0.6")

    save_object(
                file_path="D:\\OneDrive\\Documents\\Personal Project Portfolio\\"
                "heart-diesease-project_Classification\\app\\artifacts\\best_model\\model.pkl",
                obj=best_model
            )

    print(f'Best Model: {best_model}, Score: {best_model_score}')

    # Return the fitted model so the caller can use it (print() alone returns None)
    return best_model

In [None]:
# put models in a dictionary
models = {'Logistic Regression':LogisticRegression(),'SVM': SVC(),'Random Forest': RandomForestClassifier(),
           'XGB': XGBClassifier(), 'LightGBM': lgb.LGBMClassifier() }

best_model = fit_and_score_best_model(models, X_train=X_train, X_test=X_test,
                                      y_train=y_train, y_test=y_test)

best_model

## Add Data Transformation Step
We will standardise the features, since a few of them are skewed (e.g. oldpeak).
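
When standardisation is combined with cross-validation, one option is to put the scaler and the model in a single `Pipeline`, so the scaler is fitted on the training folds only and no information from the validation fold leaks in. This is a sketch of that option on synthetic data, not the approach taken in the cells below; `pipe` and the step names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the heart disease data)
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=42)

# The scaler is re-fitted inside each training fold, then applied to the held-out fold
pipe = Pipeline(steps=[('scaler', StandardScaler()),
                       ('model', LogisticRegression(solver='liblinear'))])

scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')
print(f'mean accuracy: {scores.mean():.3f}')
```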

In [None]:
# Create the transformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

imputer = SimpleImputer()
preprocessor= Pipeline(steps=[
                    ('scaler', StandardScaler()),
                    ('imputer', imputer)
                    ]
                    )

X_train = preprocessor.fit_transform(X_train)

# save the transformer

save_object('artifacts/preprocessor.pkl', preprocessor)

In [None]:
# Transform the test data
X_test = preprocessor.transform(X_test)

X_train.shape, X_test.shape