<a href="https://colab.research.google.com/github/vijay313v/c/blob/main/Copy_of_Cardiovascular_health_risk.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project title: Cardiovascular risk prediction**

# **Problem Description:**
Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 18.6 million lives each year, which accounts for 33% of all the global deaths. CVDs are a group of disorders of the heart and blood vessels and include coronary heart disease, cerebrovascular disease, rheumatic heart disease and other conditions. More than four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age.

It is important to detect cardiovascular disease as early as possible so that management with counselling and medicines can begin.

Our main aim here is to predict whether the patient has a 10-year risk of future coronary heart disease (CHD).

## **Data description:**
The problem and aim stated above can be solved with the help of machine learning models and the data that we have. The dataset is from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The dataset provides the patients’ information. It includes over 4,000 records and 15 attributes.

## **Defining the columns:**

### **Demographic:**
• Sex: male or female ("M" or "F")

• Age: age of the patient (Continuous: although the recorded ages are truncated to whole numbers, age itself is a continuous quantity)

### **Behavioral:**
• is_smoking: whether or not the patient is a current smoker ("YES" or "NO")

• Cigs Per Day: the number of cigarettes the person smoked on average in one day (can be treated as continuous, since any number of cigarettes is possible, even half a cigarette)

### **Medical (history):**
• BP Meds: whether or not the patient was on blood pressure medication (Nominal)

• Prevalent Stroke: whether or not the patient had previously had a stroke (Nominal)

• Prevalent Hyp: whether or not the patient was hypertensive (Nominal)

• Diabetes: whether or not the patient had diabetes (Nominal)

### **Medical (current):**
• Tot Chol: total cholesterol level (Continuous)

• Sys BP: systolic blood pressure (Continuous)

• Dia BP: diastolic blood pressure (Continuous)

• BMI: Body Mass Index (Continuous)

• Heart Rate: heart rate (Continuous: in medical research, variables such as heart rate are in fact discrete but are treated as continuous because of the large number of possible values)

• Glucose: glucose level (Continuous)

### **Predict variable (desired target):**
• 10-year risk of coronary heart disease (CHD) (binary: “1” means “Yes”, “0” means “No”)

In [None]:
#Importing libraries
import pandas as pd
import numpy as np
import math
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix,classification_report,roc_auc_score

import warnings
warnings.filterwarnings("ignore")

In [None]:
#Mounting the drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
#loading dataset
cv_df =pd.read_csv(r"/content/data_cardiovascular_risk (1).csv")

In [None]:
# shape of dataset
cv_df.shape

**3390 observations and 17 columns**

In [None]:
#summary of dataset
cv_df.info()

In [None]:
#statistical description of dataframe
cv_df.describe()

In [None]:
#Checking the distribution of the target variable
cv_df['TenYearCHD'].value_counts()

# **EDA:**

### **Which age group has a high chance of positive CHD?**

In [None]:
#Analysis on Age column
plt.figure(figsize=(8, 6))

# Create a custom palette
my_palette = {0: 'blue', 1: 'orange'}

# Create the countplot
ax = sns.countplot(x=cv_df['age'], hue=cv_df['TenYearCHD'], palette=my_palette)

plt.title('Which age is more prone to CHD')
plt.legend(['No Risk', 'At Risk'])

plt.show()

**Patients aged between 47 and 65 have a high chance of positive CHD.**

## **Education:**

In [None]:
#Analysis on Education column
plt.figure(figsize=(8, 6))

# Create a custom palette
my_palette = {0: 'blue', 1: 'orange'}

# Create the countplot
ax = sns.countplot(x=cv_df['education'], hue=cv_df['TenYearCHD'], palette=my_palette)

plt.title('Education level vs. CHD')
plt.legend(['No Risk', 'At Risk'])

plt.show()

**Most of the "At Risk" cases are in the 1st level of education and the fewest are in the 4th level. This is misleading, because the number of "At Risk" cases in each level is driven by the total number of people in that level. Raw counts are therefore not a good point of comparison here.**
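One way to compare categories of different sizes is to look at the share of at-risk patients within each category instead of the raw counts. A minimal sketch of that idea on a small made-up frame (the values are illustrative and are not taken from the dataset; only the column names `education` and `TenYearCHD` come from this notebook):

```python
import pandas as pd

# Toy data for illustration only; these are not the study records.
toy_df = pd.DataFrame({
    "education":  [1, 1, 1, 1, 2, 2, 4, 4],
    "TenYearCHD": [1, 0, 0, 0, 1, 0, 1, 0],
})

# The mean of a 0/1 target within a group is the at-risk rate of that group.
chd_rate = toy_df.groupby("education")["TenYearCHD"].mean()
print(chd_rate)
```

In this toy example level 1 has the most rows but the lowest rate, which is exactly the effect that raw counts hide.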

## **Which sex is most likely to suffer from positive CHD?**

In [None]:
# Analysis of sex column
plt.figure(figsize=(8, 6))

# Create a custom palette
my_palette = {0: 'blue', 1: 'orange'}

# Create the countplot
ax = sns.countplot(x=cv_df['sex'], hue=cv_df['TenYearCHD'], palette=my_palette)

plt.title('Gender more prone to CHD')
plt.legend(['No Risk', 'At Risk'])

# Calculate and add percentage text to the plot
total = len(cv_df)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2, height + 5, f'{height/total*100:.2f}%', ha='center')

plt.show()

**Males have a higher chance of CHD compared to females.**

## **is_smoking**

In [None]:
# Analysis on is_smoking
plt.figure(figsize=(8, 6))

# Create a custom palette
my_palette = {0: 'blue', 1: 'orange'}

# Create the countplot
ax = sns.countplot(x=cv_df['is_smoking'], hue=cv_df['TenYearCHD'], palette=my_palette)

plt.title('Smoking status vs. CHD')
plt.legend(['No Risk', 'At Risk'])

# Calculate and add percentage text to the plot
total = len(cv_df)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2, height + 5, f'{height/total*100:.2f}%', ha='center')

plt.show()

**Smoking increases the chances of positive CHD.**

## **Does taking BP medication affect CHD?**

In [None]:
# Analysis on BPMeds
plt.figure(figsize=(8, 6))

# Create a custom palette
my_palette = {0: 'blue', 1: 'orange'}

# Create the countplot
ax = sns.countplot(x=cv_df['BPMeds'], hue=cv_df['TenYearCHD'], palette=my_palette)

plt.title('BP medication vs. CHD')
plt.legend(['No Risk', 'At Risk'])

# Calculate and add percentage text to the plot
total = len(cv_df)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2, height + 5, f'{height/total*100:.2f}%', ha='center')

plt.show()

**Persons taking BP medication have a higher chance of CHD.**

## **How does a prevalent stroke affect the chance of positive CHD?**

In [None]:
#Analysis on prevalent stroke column
plt.figure(figsize=(8, 6))

# Create a custom palette
my_palette = {0: 'blue', 1: 'orange'}

# Create the countplot
ax = sns.countplot(x=cv_df['prevalentStroke'], hue=cv_df['TenYearCHD'], palette=my_palette)

plt.title('Effect of prevalent stroke on CHD')
plt.legend(['No Risk', 'At Risk'])

# Calculate and add percentage text to the plot
total = len(cv_df)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2, height + 5, f'{height/total*100:.2f}%', ha='center')
plt.show()

**Prevalent stroke increases chances of CHD in future.**

## **Does prevalent hypertension have an effect on positive CHD in future?**

In [None]:
#Analysis on prevalent hypertension column
plt.figure(figsize=(8, 6))

# Create a custom palette
my_palette = {0: 'blue', 1: 'orange'}

# Create the countplot
ax = sns.countplot(x=cv_df['prevalentHyp'], hue=cv_df['TenYearCHD'], palette=my_palette)

plt.title('Effect of prevalent hypertension on CHD')
plt.legend(['No Risk', 'At Risk'])

# Calculate and add percentage text to the plot
total = len(cv_df)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2, height + 5, f'{height/total*100:.2f}%', ha='center')

plt.show()

**Prevalent hypertension increases the chances of CHD in patients.**

## **Does Diabetes affect the chances of having a positive CHD risk factor**

In [None]:
#Analysis of Diabetes column
plt.figure(figsize=(8, 6))

# Create a custom palette
my_palette = {0: 'blue', 1: 'orange'}

# Create the countplot
ax = sns.countplot(x=cv_df['diabetes'], hue=cv_df['TenYearCHD'], palette=my_palette)

plt.title(' Effect of Diabetes on CHD')
plt.legend(['No Risk', 'At Risk'])

# Calculate and add percentage text to the plot
total = len(cv_df)
for p in ax.patches:
    height = p.get_height()
    ax.text(p.get_x() + p.get_width() / 2, height + 5, f'{height/total*100:.2f}%', ha='center')

plt.show()

**Diabetes increases chances of positive Coronary Heart Disease**

## **Others**

In [None]:
sns.boxplot(data=cv_df[['totChol','sysBP','diaBP','BMI','heartRate','glucose']])
plt.title('Box Plot  of Continuous Variables')

plt.show()

**The boxplot above gives an idea of the range of each column and of its outliers.**
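The points drawn beyond the whiskers of a boxplot follow the usual 1.5 × IQR rule, so the same rule can be used to count outliers per column. A rough sketch, assuming a hypothetical helper name `iqr_outlier_counts` and a toy frame with made-up values (not rows from the dataset):

```python
import pandas as pd

def iqr_outlier_counts(df: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = df.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the frame's columns.
    outside = (numeric < (q1 - k * iqr)) | (numeric > (q3 + k * iqr))
    return outside.sum()

# Illustrative values only.
toy_df = pd.DataFrame({
    "sysBP": [110, 118, 120, 125, 130, 250],
    "BMI":   [22.0, 24.5, 25.0, 26.1, 27.3, 28.0],
})
counts = iqr_outlier_counts(toy_df)
print(counts)
```

On the real data the same call could be applied to the continuous columns plotted above to see how many readings each boxplot flags.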

In [None]:

sns.displot(data=cv_df, x='totChol', hue='TenYearCHD', kind='kde', fill=True, height=5, aspect=2)
plt.axvline(200, 0,1,color='yellow')
plt.axvline(240, 0,1,color='red')
plt.xlabel('total cholesterol (mg/dL)')
plt.show()

**Total cholesterol includes low-density lipoprotein (bad) cholesterol and high-density lipoprotein (good) cholesterol.**

**Less than 200 mg/dL is the desirable level, 200 to 239 mg/dL is borderline high, and 240 mg/dL and above is high.**

**In our dataset most people fall either in the borderline range (between the yellow and red vertical lines) or in the high range (beyond the red line).**

**People who are at risk of CHD have total cholesterol ranging from below 100 mg/dL to almost 700 mg/dL.**

In [None]:

sns.displot(data=cv_df, x='glucose', hue='TenYearCHD', kind='kde', fill=True, height=5, aspect=2)
plt.axvline(70, color='black')
plt.axvline(140, color='black')
plt.xlabel('glucose (mg/dL)')
plt.show()

**Normal glucose ranges from 70 mg/dL to 140 mg/dL, depending on when the measurement was taken: before a meal, after a meal, while fasting, or before bedtime.**

**The distribution shows cases with glucose levels as low as 20 to 25 mg/dL and as high as 400 to 440 mg/dL.**

**Both extremes occur in both groups, whether or not the risk of coronary heart disease is present.**

In [None]:
# Create the displot with the modified hue
sns.displot(data=cv_df, x='heartRate', hue='TenYearCHD', kind='kde', fill=True, height=5, aspect=2)
plt.axvline(60, 0,1,color='black')
plt.axvline(100, 0,1,color='black')
plt.xlabel('heart rate (bpm)')
plt.show()

**A healthy resting heart rate is between 60 bpm and 100 bpm, but in our dataset it ranges from 38 bpm to 155 bpm.**

**In patients with known coronary heart disease, an elevated heart rate reduces diastolic filling time and increases cardiac workload, resulting in a supply-demand mismatch with consequent ischemia (a condition in which blood flow, and thus oxygen, is restricted or reduced in a part of the body) and angina (chest pain caused by reduced blood flow to the heart).**

**Surprisingly, heart rate does not distinguish people at risk of CHD from those not at risk in our dataset: the distribution is similar for both groups.**

In [None]:

sns.set_style("darkgrid")
sns.relplot(data=cv_df, x="BMI", y="age",hue="TenYearCHD",col='TenYearCHD',style='TenYearCHD',kind="scatter",palette='RdBu')
plt.show()

**BMI in our dataset ranges from 15 to almost 60.**

**People with a BMI of 18.5 to 24.9 are considered healthy, 25.0 to 29.9 overweight, and 30.0 or above obese.**

**People at risk of coronary heart disease are spread quite evenly across this range.**

**So factors other than BMI must be contributing to the risk of coronary heart disease. There are people in the obese category who are not at risk of CHD, and many people in the healthy category who are.**

In [None]:
sns.displot(data=cv_df, x='sysBP', hue='TenYearCHD', kind='kde', fill=True, height=5, aspect=2)
plt.axvline(120, 0,1,color='black')
plt.xlabel('systolic BP (mmHg)')
plt.show()

In [None]:
sns.displot(data=cv_df, x='diaBP', hue='TenYearCHD', kind='kde', fill=True, height=5, aspect=2)
plt.axvline(80, 0,1,color='black')
plt.xlabel('diastolic BP (mmHg)')
plt.show()

**For systolic and diastolic blood pressure, a SysBP below 120 mmHg and a DiaBP below 80 mmHg are considered normal.**

**For systolic BP there are readings of almost 380 mmHg, which is very high. It is also visible that the readings of people at risk of CHD reach up to 380 mmHg.**

**However, these plots alone give no clear evidence of whether BP readings (systolic and diastolic) contribute to the risk of CHD.**

## **Data Cleaning :**

In [None]:

#Checking for null values in our dataset
cv_df.isnull().sum()

I'll use a mixed approach: imputing some null values with a meaningful value and deleting the remaining observations that contain nulls.

Since the glucose column has a lot of null values, I'll impute them with the mean glucose value. After this, the number of remaining null values is very small compared to the size of the dataset, so I'll simply delete those rows.

In [None]:
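# List of columns whose missing values are filled with the column mean
columns_to_fill = ['glucose']

# Iterate through the columns and fill NaN values with the mean of each column.
# Assigning the result back avoids calling inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to cv_df.
for col in columns_to_fill:
    mean_value = cv_df[col].mean()
    cv_df[col] = cv_df[col].fillna(mean_value)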


In [None]:

#Deleting the rest of the null values
cv_df.dropna(inplace=True)

In [None]:
#Checking if the null values are dropped properly
cv_df.isnull().sum()

## **Handling duplicate values:**

In [None]:

#Checking for duplicate values
cv_df.duplicated().sum()

**The dataset doesn't contain any duplicate rows.**

# **Outliers:**

## **Feature Engineering:**

### **Feature Encoding :**
Machine learning models work with numerical values, so categorical columns have to be converted (encoded) into numerical variables. This process is known as feature encoding.

Here two columns require encoding: "sex" and "is_smoking".

In [None]:
#Encoding the categorical columns
cv_df['sex'] = cv_df['sex'].apply(lambda x: 1 if x == 'M' else 0)
cv_df['is_smoking'] = cv_df['is_smoking'].apply(lambda x: 1 if x == 'YES' else 0)

### **Grouping columns for better Understanding :**

In [None]:
def smoke_pattern(cigperday: float):
  """A function that returns the smoking level
     by taking cigarettes per day as an input."""

  if cigperday==0:                    #Non smoker
    return 1
  elif cigperday>0 and cigperday<=10:       #Smoker with more than 0 and up to 10 cigs per day
    return 2
  elif cigperday>10 and cigperday<=20:      #Smoker with more than 10 and up to 20 cigs per day
    return 3
  elif cigperday>20 and cigperday<=30:      #Smoker with more than 20 and up to 30 cigs per day
    return 4
  elif cigperday>30 and cigperday<=40:      #Smoker with more than 30 and up to 40 cigs per day
    return 5
  else:                         #Smoker with more than 40 cigs per day
    return 6

In [None]:
#Creating the smoke_pattern column
cv_df['smoke_pattern'] = cv_df['cigsPerDay'].apply(smoke_pattern)

In [None]:

#Removing the columns that have been grouped
cv_df.drop(columns=['is_smoking','cigsPerDay'],inplace=True)

In [None]:
cv_df.head()

Smoke pattern column created with correct values.
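The same grouping can also be expressed with `pd.cut`, which bins a whole column at once. A small sketch on made-up values (not dataset rows), assuming non-negative inputs and the same right-closed intervals as the function above:

```python
import numpy as np
import pandas as pd

# Illustrative values only.
cigs = pd.Series([0, 5, 10, 11, 20, 25, 40, 60], name="cigsPerDay")

# Right-closed bins: (-inf, 0], (0, 10], (10, 20], (20, 30], (30, 40], (40, inf)
bins = [-np.inf, 0, 10, 20, 30, 40, np.inf]
levels = pd.cut(cigs, bins=bins, labels=[1, 2, 3, 4, 5, 6]).astype(int)
print(levels.tolist())
```

The explicit `if`/`elif` version is easier to read line by line; `pd.cut` is mainly convenient when the bin edges need to change.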

## **BPLevel:**

Next, I will combine the "sysBP" and "diaBP" columns to create a new column called the "BPLevel".

In [None]:
# Defining a function to assign blood pressure levels
def bp_level(row):
    sys_bp, dia_bp = row['sysBP'], row['diaBP']
    # Check the most severe category first: the higher categories apply when
    # EITHER reading crosses its threshold, while "Normal" and "Elevated"
    # both require the diastolic reading to be below 80.
    if sys_bp >= 180 or dia_bp >= 120:
        return 5  # Hypertensive crisis
    elif sys_bp >= 140 or dia_bp >= 90:
        return 4  # High BP stage 2
    elif sys_bp >= 130 or dia_bp >= 80:
        return 3  # High BP stage 1
    elif sys_bp >= 120:
        return 2  # Elevated level (diaBP is below 80 here)
    else:
        return 1  # Normal level (sysBP below 120 and diaBP below 80)

# Create the 'BPLevel' column using the function
cv_df['BPLevel'] = cv_df.apply(bp_level, axis=1)

# Remove the 'sysBP' and 'diaBP' columns
cv_df.drop(columns=['sysBP', 'diaBP'], inplace=True)

# Checking if the 'BPLevel' column is created properly
cv_df.head()

## **DiabetesLevel:**

In [None]:
# Define a function to assign diabetes levels
def diabetes_level(glucose):
    if glucose < 53:
        return 1  # Severe Hypoglycemia
    elif 53 <= glucose < 70:
        return 2  # Hypoglycemia
    elif 70 <= glucose < 125:
        return 3  # Normal
    elif 125 <= glucose < 200:
        return 4  # Pre Diabetic
    else:
        return 5  # Severe Diabetes

# Create the 'DiabetesLevel' column using the function
cv_df['DiabetesLevel'] = cv_df['glucose'].apply(diabetes_level)

# Remove the 'diabetes' and 'glucose' columns
cv_df.drop(columns=['diabetes', 'glucose'], inplace=True)

# Checking if the 'DiabetesLevel' column is created properly
cv_df.head()

## **Checking correlation for feature removal:**

In [None]:
#Plotting correlation matrix using sns heatmap
corr_matrix= cv_df.corr()
plt.figure(figsize=(10,10))
sns.heatmap(corr_matrix,annot=True,cmap='coolwarm')
plt.title("Correlation between the variables of the dataset")
plt.show()

**There is no high correlation between most of the variables, but there is a high correlation between "prevalentHyp" and "BPLevel". I will remove "prevalentHyp" because, in medical terms, it is directly related to "BPLevel".**
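Reading the pairs off a large heatmap by eye is error-prone, so it can help to list the strongly correlated pairs directly. A minimal sketch, assuming a hypothetical helper `high_corr_pairs` and an arbitrary threshold of 0.7 (neither comes from this notebook), shown on toy columns:

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df: pd.DataFrame, threshold: float = 0.7) -> pd.Series:
    """Return feature pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Illustrative columns only.
toy_df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 13],   # nearly a multiple of "a"
    "c": [5, 1, 4, 2, 6, 3],     # roughly unrelated to "a"
})
result = high_corr_pairs(toy_df)
print(result)
```

Only the pair ("a", "b") is reported here; on `cv_df` the same call would list the pairs worth inspecting before dropping a column.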

In [None]:
# Remove columns with high correlation
cv_df.drop('prevalentHyp', axis=1, inplace=True)

## **Checking the distribution of the data:**

Inspecting the distributions helps identify variables that do not contribute much to predicting the target variable.

In [None]:
# Creating a list of all the independent variables
independent_cols=list(set(cv_df.columns)-{'TenYearCHD'})

In [None]:
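n=1
plt.figure(figsize=(14,30))
for i in independent_cols:
  plt.subplot(12,4,n)
  n= n+1
  # histplot with kde=True replaces the deprecated sns.distplot
  sns.histplot(cv_df[i], kde=True, stat='density', color='teal')
  plt.title(i)
# Adjust the layout once, after all subplots are drawn
plt.tight_layout()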


As we can see from the distributions, the columns BPMeds and prevalentStroke are highly imbalanced, so they won't be able to impact the prediction of the target variable much, and therefore we'll delete them.

From the EDA we also saw that education is not a great contributing factor, so I'll remove the education column as well.

The id column is also not useful, so I am removing it.

In [None]:
#Removing columns that are not useful
cv_df.drop(columns=['BPMeds','prevalentStroke','education','id'],inplace=True)

## **Dealing with class imbalance:**

A dataset is imbalanced if the classification categories are not approximately equally represented. This affects the quality of our machine learning model and tends to make it misclassify the minority class as the majority class. Therefore we will deal with this class imbalance if it exists in our dataset.
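To see why imbalance is a problem, consider a "model" that always predicts the majority class. A small sketch with made-up labels (85 negatives and 15 positives, chosen for illustration and not the dataset's actual counts):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Illustrative labels only: 85 negatives and 15 positives.
y_true = np.array([0] * 85 + [1] * 15)

# A "model" that always predicts the majority class.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)        # share of correct predictions
minority_recall = recall_score(y_true, y_pred)   # share of positives that were found
print(accuracy, minority_recall)
```

The accuracy looks respectable while not a single at-risk case is detected, which is why accuracy alone is a poor guide on imbalanced data.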

In [None]:
#Checking for class imbalance for the target variable
cv_df['TenYearCHD'].value_counts()

As we can see, there is a high class imbalance here.

A few techniques to solve class imbalance:

1. Resampling (undersampling or oversampling)

2. SMOTE

3. Using BalancedBaggingClassifier

and more.

In this project, to deal with class imbalance I will be using the SMOTE technique (Synthetic Minority Oversampling Technique).

## **SMOTE**


Synthetic Minority Oversampling Technique (SMOTE) is a statistical technique for increasing the number of cases in your dataset in a balanced way. The component works by generating new instances from existing minority cases that you supply as input.
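The core idea can be sketched in a few lines of NumPy: pick a minority sample, pick one of its nearest minority neighbours, and place a new point somewhere on the line segment between them. This is a simplified illustration under my own assumptions (the function name `smote_like_samples`, `k=2` and the toy points are hypothetical); it is not the `imblearn` implementation used in the next cell.

```python
import numpy as np

def smote_like_samples(minority: np.ndarray, n_new: int, k: int = 2, seed: int = 0) -> np.ndarray:
    """Generate n_new synthetic points by interpolating between minority samples
    and one of their k nearest minority neighbours (simplified SMOTE idea)."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))         # pick a minority sample
        neigh = neighbours[base, rng.integers(k)]  # pick one of its k neighbours
        gap = rng.random()                         # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[neigh] - minority[base])
    return synthetic

# Illustrative minority points only.
minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [3.0, 3.0]])
new_points = smote_like_samples(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies between two existing minority points, the new rows stay inside the region the minority class already occupies.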

In [None]:
# Creating the dataset for the independent and dependent variables.
X = cv_df.drop('TenYearCHD', axis=1)
Y = cv_df['TenYearCHD'].reset_index(drop=True)

# Applying the SMOTE technique to solve class imbalance
# random_state makes the synthetic samples reproducible between runs
smote = SMOTE(sampling_strategy='minority', random_state=12)
X_resampled, Y_resampled = smote.fit_resample(X, Y)

# Displaying the first few rows of the resampled independent variables (X_resampled)
X_resampled.head()

In [None]:
Y_resampled.value_counts()

**Class Imbalance is now removed.**

## **Splitting and scaling the Data:**

In [None]:
#Splitting the data
X_train,X_test,Y_train,Y_test = train_test_split(X_resampled,Y_resampled,test_size=0.25,random_state=12)

### **Feature Scaling:**

Feature scaling is a technique to standardize the independent features in the data to a common range. It is performed during data pre-processing to handle features with very different magnitudes or units. Without feature scaling, many machine learning algorithms give more weight to features with larger values and less weight to features with smaller values, regardless of the units.

**StandardScaler rescales each feature so that its mean is 0 and its standard deviation is 1.**
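In formula terms each value becomes z = (x - mean) / std, computed per feature. A quick check on made-up readings (illustrative values only) that the manual formula and `StandardScaler` agree:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative values only (one feature, five rows).
x = np.array([[120.0], [130.0], [140.0], [150.0], [160.0]])

# Manual standardisation: subtract the mean, divide by the population standard deviation.
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# StandardScaler applies the same transformation.
z_scaler = StandardScaler().fit_transform(x)

print(np.allclose(z_manual, z_scaler))
```

Note that `np.std` and `StandardScaler` both use the population standard deviation (`ddof=0`), which is why the two results match.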

In [None]:
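def scale_numeric_columns(train, test):
    """Fit a StandardScaler on the training set only and apply it to both sets,
    so that the test set is scaled with the training mean and standard deviation."""
    numeric_cols = train.select_dtypes(include=['number']).columns
    scaler = StandardScaler()
    scaled_train = pd.DataFrame(scaler.fit_transform(train[numeric_cols]),
                                columns=numeric_cols, index=train.index)
    scaled_test = pd.DataFrame(scaler.transform(test[numeric_cols]),
                               columns=numeric_cols, index=test.index)
    return scaled_train, scaled_test

In [None]:
#Scaling the independent dataset
X_train, X_test = scale_numeric_columns(X_train, X_test)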

## **Model Implementation:**

Machine learning models can be described as programs that are trained to find patterns or trends within data and predict the result for new data.

In this project we are dealing with a classification problem, therefore we will be using classification models.

In this project we will be including the following models:

1. Logistic regression.

2. Decision tree classifier.

3. Random forest classifier.

4. Gradient Boosting classifier.

**NOTE:**

All these models have similar training and prediction steps, so writing separate code for each of them would be repetitive and lengthy. To solve this we can borrow the idea of ML pipelines: I will use one function to train the models and another to evaluate them.
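For comparison, scikit-learn also ships a `Pipeline` class that chains preprocessing and a model into a single estimator. This notebook uses its own helper functions instead, so the following is only a sketch of the alternative, run on synthetic stand-in data (`X_demo`, `y_demo` and the step names are my own choices):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the notebook's X and Y are not used here.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=12)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=12)

# Scaling and the classifier are chained, so the scaler is fitted on the
# training data only and reused unchanged when predicting.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(random_state=12)),
])
pipe.fit(X_tr, y_tr)
test_accuracy = pipe.score(X_te, y_te)
print(test_accuracy)
```

A `Pipeline` object can also be passed to `GridSearchCV` as the estimator, in which case the scaler is refitted inside every cross-validation fold.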

## **Performance Metrics:**

Different performance metrics are used to evaluate machine learning models, and the choice depends on the task. Our task is binary classification: whether or not a patient has a 10-year risk of CHD.

Here we will be using the ROC AUC score.
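`roc_auc_score` accepts either predicted scores or hard 0/1 labels, and the two give different numbers. With hard labels (which is what the output of `model.predict` provides) the score reduces to balanced accuracy, while with `predict_proba` it measures how well the model ranks patients. A small sketch on synthetic stand-in data (all names below are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook's X and Y are not used here.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=12)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=12)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# ROC AUC from predicted probabilities of the positive class (ranking quality).
auc_from_scores = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

# ROC AUC from hard 0/1 predictions.
auc_from_labels = roc_auc_score(y_te, clf.predict(X_te))

# With hard labels the ROC curve has a single threshold, so the area equals
# the average of sensitivity and specificity, i.e. balanced accuracy.
bal_acc = balanced_accuracy_score(y_te, clf.predict(X_te))
print(auc_from_scores, auc_from_labels, bal_acc)
```

Which variant to report is a design choice; the important part is to state which one a given number refers to.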

In [None]:
# Performance Metrics
def model_evaluator(actual, preds, ml_model, mode):
    cm = confusion_matrix(actual, preds)
    print("Confusion Matrix:\n", cm, '\n')
    sns.heatmap(cm, annot=True, cmap='coolwarm', fmt='d')
    plt.xlabel('Predicted Labels')
    plt.ylabel('Actual Labels')
    plt.title(f'Confusion Matrix for {ml_model} on the {mode} set')
    plt.show()

    roc_auc = roc_auc_score(actual, preds)
    print('\nROC AUC Score:', roc_auc)

    print('\nClassification Report:\n')
    target_names = ['Class 0', 'Class 1']
    print(classification_report(actual, preds, target_names=target_names))

# Model Pipeline
def model_pipeline(X_train, X_test, Y_train, Y_test, ml_model, param_grid=None, kind='evaluate'):
    model = None

    if ml_model == 'Logistic Regression':
        model = LogisticRegression(random_state=12)
    elif ml_model == 'Decision Tree Classifier':
        model = DecisionTreeClassifier()
    elif ml_model == 'Random Forest Classifier':
        model = RandomForestClassifier()
    elif ml_model == 'Gradient Boosting Classifier':
        model = GradientBoostingClassifier()
    else:
        print("Enter correct model name: Logistic Regression, Decision Tree Classifier, Random Forest Classifier, or Gradient Boosting Classifier.")
        return

    if param_grid:
        gs_model = GridSearchCV(estimator=model, param_grid=param_grid, cv=5, scoring='roc_auc', verbose=True)
        gs_model.fit(X_train, Y_train)
        print("Best parameters for", ml_model, ":", gs_model.best_params_, '\n')
        # best_estimator_ has already been refit on the whole training set (refit=True by default)
        model = gs_model.best_estimator_
    else:
        model.fit(X_train, Y_train)

    train_predictions = model.predict(X_train)
    test_predictions = model.predict(X_test)

    if kind == 'evaluate':
        print("1. Train set evaluation:")
        model_evaluator(Y_train, train_predictions, ml_model, 'Train')
        print("\n2. Test set evaluation:")
        model_evaluator(Y_test, test_predictions, ml_model, 'Test')
    elif kind == 'model_explainability':
        return model

In [None]:
# Define parameter grids
param_grid_dt = {
    'max_depth': [4, 6, 8, 10],
    'min_samples_split': [5, 10, 20, 30, 40, 50],
    'min_samples_leaf': [5, 10, 15, 20]
}

param_grid_rf = {
    'n_estimators': [50, 65, 80, 95, 120],
    'max_depth': [3, 5, 7, 9, 10]
}

param_grid_gb = {
    'n_estimators': [80, 100],
    'max_depth': [5, 7, 8],
    'learning_rate': [0.001, 0.01, 0.05]
}
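
Grid search tries every combination in a grid, so the cost grows multiplicatively with each added parameter. A quick way to gauge this before running a search is to count the candidates with `ParameterGrid`; the sketch below repeats the decision tree grid from the cell above and assumes the 5-fold setting used by `model_pipeline`:

```python
from sklearn.model_selection import ParameterGrid

# Same values as the decision tree grid defined in this notebook.
param_grid_dt = {
    'max_depth': [4, 6, 8, 10],
    'min_samples_split': [5, 10, 20, 30, 40, 50],
    'min_samples_leaf': [5, 10, 15, 20],
}

n_candidates = len(ParameterGrid(param_grid_dt))   # 4 * 6 * 4 combinations
n_fits = n_candidates * 5                          # one fit per fold with cv=5
print(n_candidates, n_fits)
```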

### **Logistic Regression:**

In [None]:
#Evaluate Logistic Regression model
model_pipeline(X_train, X_test, Y_train, Y_test, ml_model='Logistic Regression')

### **Decision Tree Classifier:**

In [None]:
model_pipeline(X_train, X_test, Y_train, Y_test, ml_model='Decision Tree Classifier', param_grid=param_grid_dt, kind='evaluate')

### **Random Forest Classifier**

In [None]:
# Use the random forest grid (the decision tree grid was passed here by mistake)
model_pipeline(X_train, X_test, Y_train, Y_test, ml_model='Random Forest Classifier', param_grid=param_grid_rf, kind='evaluate')

### **Gradient Boosting Classifier**

In [None]:
model_pipeline(X_train, X_test, Y_train, Y_test, ml_model='Gradient Boosting Classifier', param_grid=param_grid_gb, kind='evaluate')

## **Model Explainability:**

In [None]:
#Installing the shap library
!pip install shap

In [None]:
#Importing the SHAP library
import shap

In [None]:
#Creating an object for the logistic regression model
lr_classifier = model_pipeline(X_train,X_test,Y_train,Y_test, ml_model='Logistic Regression',kind='model_explainability')

In [None]:
# Explainer for SHAP values
explainer = shap.Explainer(lr_classifier, X_train)
shap_values = explainer(X_train)

# Plot SHAP summary
shap.summary_plot(shap_values, X_train, feature_names=X_train.columns)

In [None]:
# Feature importance function for tree-based models
def feature_importance(model, feature_names):
    importances = model.feature_importances_
    indices = np.argsort(importances)
    plt.figure(figsize=(8, 6))
    plt.title('Feature Importance')
    plt.barh(range(len(indices)), importances[indices], color='lightgreen', align='center')
    plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
    plt.xlabel('Relative Importance')
    plt.show()

In [None]:

# Train and explain Decision Tree model using feature importance
dt_classifier = model_pipeline(X_train, X_test, Y_train, Y_test, ml_model='Decision Tree Classifier', param_grid=param_grid_dt, kind='model_explainability')
feature_importance(dt_classifier, X_train.columns)

In [None]:
# Train and explain Random Forest model using feature importance
rf_classifier = model_pipeline(X_train, X_test, Y_train, Y_test, ml_model='Random Forest Classifier', param_grid=param_grid_rf, kind='model_explainability')
feature_importance(rf_classifier, X_train.columns)

In [None]:
# Train and explain Gradient Boosting model using feature importance
gb_classifier = model_pipeline(X_train, X_test, Y_train, Y_test, ml_model='Gradient Boosting Classifier', param_grid=param_grid_gb, kind='model_explainability')
feature_importance(gb_classifier, X_train.columns)

## **Conclusion:**

**EDA Insights:**

1. People aged between 47 and 65 have a high chance of positive CHD.

2. Being male increases the chance of positive CHD.

3. People who smoke have a high chance of positive CHD.

4. Persons taking BP medication have a higher chance of positive CHD.

5. Prevalent stroke and prevalent hypertension are also major factors that
   increase the chance of CHD.

6. Persons with diabetes are more prone to positive coronary heart disease.

**Results from ML Model:**

1. Logistic regression gives a ROC AUC score of 0.6277 on the testing set. This is the worst performing model.

2. The decision tree model gives a ROC AUC score of 0.7069 on the testing set.

3. The Random Forest Classifier gives a ROC AUC score of 0.7533 on the testing set. This is the best performing model.

4. The Gradient Boosting Classifier gives a ROC AUC score of 0.7520 on the testing set.

5. Model explainability has been achieved with the SHAP library's summary plot and the feature_importances_ attribute of the tree-based algorithms.

6. Total cholesterol and age are the two most important factors for predicting the CHD risk.