# **Project Name    - Cardiovascular Risk Prediction**

##### **Project Type**    - Classification
##### **Contribution**    - Individual
##### **Name -** Ratul Dutta


# **Project Summary -**

The goal of this project was to use machine learning techniques to predict the 10-year risk of future coronary heart disease (CHD) in patients using data from an ongoing cardiovascular study on residents of the town of Framingham, Massachusetts. The dataset provided information on over 4,000 patients and included 15 attributes, each representing a potential risk factor for CHD. These attributes included demographic, behavioral, and medical risk factors.

To prepare the data for analysis, extensive preprocessing was performed to clean and transform the data. This included handling missing values using the median and mode, as well as identifying and removing outliers using the Interquartile Range (IQR) method. Skewed continuous variables were also transformed using log and square root transformations to reduce skewness and improve model performance.

Feature selection was performed using the variance inflation factor to remove multicollinearity, and a new feature called pulse pressure was created to capture the relationship between systolic and diastolic blood pressure. Redundant columns were also removed to simplify the dataset. The most important features for predicting CHD risk were identified as 'age', 'sex', 'education', 'cigs_per_day', 'bp_meds', 'prevalent_stroke', 'prevalent_hyp', 'diabetes', 'total_cholesterol', 'bmi', 'heart_rate', 'glucose', and 'pulse_pressure'.

To handle the imbalanced nature of the dataset, SMOTE combined with Tomek links undersampling was used to balance the class distribution and improve model performance. The data was also scaled using the standard scaler method to ensure that all features were on the same scale.

Several machine learning models were evaluated on the primary evaluation metric of recall. After careful analysis, the Neural Network (tuned) was chosen as the final prediction model because it had the highest recall score among the models evaluated. By selecting a model with a high recall score, the goal was to correctly identify as many patients with CHD risk as possible, even if it meant having some false positives.

Overall, this project demonstrated the potential of machine learning techniques to accurately predict CHD risk in patients using data from a
cardiovascular study. By carefully preprocessing and transforming the data, selecting relevant features, and choosing an appropriate model
based on its performance on a relevant evaluation metric, it was possible to achieve a positive business impact by accurately predicting CHD
risk in patients.

# **GitHub Link -**

https://github.com/ratul837/CardioClassification.git

# **Problem Statement**


Cardiovascular diseases (CVDs) are the leading cause of mortality worldwide. According to the WHO, 17.9 million people died from CVDs in 2019, accounting for 32% of all global deaths. Although CVDs cannot always be cured, predicting the risk of the disease and taking the necessary precautions and medications can help to avoid severe symptoms and, in some cases, even death. As a result, it is critical that we accurately predict the risk of heart disease in order to avert as many fatalities as possible.

Therefore, the problem statement is to develop a machine learning model that can accurately predict an individual's risk of developing CVD by incorporating a wide range of variables, including demographic, clinical, and lifestyle factors. The model should be interpretable, reliable, and able to provide personalized risk predictions that can aid in the development of prevention strategies for high risk individuals.

# **General Guidelines** : -  

1.   Well-structured, formatted, and commented code is required.
2.   Exception Handling, Production Grade Code & Deployment Ready Code will be a plus. Those students will be awarded some additional credits.
     
     The additional credits will have advantages over other students during Star Student selection.
       
             [ Note: - Deployment Ready Code is defined as, the whole .ipynb notebook should be executable in one go
                       without a single error logged. ]

3.   Each and every logic should have proper comments.
4. You may add as many number of charts you want. Make Sure for each and every chart the following format should be answered.
        

```
# Chart visualization code
```
            

*   Why did you pick the specific chart?
*   What is/are the insight(s) found from the chart?
* Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

5. You have to create at least 20 logical & meaningful charts having important insights.


[ Hints : - Do the Visualization in a structured way while following "UBM" Rule.

U - Univariate Analysis,

B - Bivariate Analysis (Numerical - Categorical, Numerical - Numerical, Categorical - Categorical)

M - Multivariate Analysis
 ]





# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import warnings
import datetime as dt

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import missingno as mno

# preprocessing, resampling and feature selection
from sklearn.preprocessing import LabelEncoder, MinMaxScaler, StandardScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split, GridSearchCV
from imblearn.combine import SMOTETomek
from statsmodels.stats.outliers_influence import variance_inflation_factor

# models
from sklearn.linear_model import Ridge, Lasso, LinearRegression, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (RandomForestClassifier, GradientBoostingClassifier,
                              HistGradientBoostingClassifier, AdaBoostClassifier)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier, XGBRFClassifier

# evaluation metrics
from sklearn.metrics import (accuracy_score, classification_report, confusion_matrix,
                             roc_curve, roc_auc_score, make_scorer,
                             recall_score, f1_score, precision_score)

warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset
from google.colab import drive
drive.mount('/content/drive')


df = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/Supervised Machine Learning Classification Project/data_cardiovascular_risk.csv')
cardio=df.copy()

### Dataset First View

In [None]:
# Dataset First Look
cardio.head()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
print("This dataset has",cardio.shape[0]," rows & ",cardio.shape[1]," columns")

### Dataset Information

In [None]:
# Dataset Info
cardio.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
print("This dataset has",cardio.duplicated().sum(),"duplicate rows")

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
cardio.isnull().sum()

In [None]:
# Visualizing the missing values
#create the missing value matrix
mno.matrix(cardio, figsize=(18,6),sparkline=False,color=(0.27,0.52,1.0))
plt.title('Missing values in dataset')
# Show the graph
plt.show()

### What did you know about your dataset?

1. The dataset contains 510 null/missing values.
2. The dataset has 3390 rows and 17 columns.
3. No duplicate rows were found in the dataset.
4. The text columns ('sex' and 'is_smoking') need to be converted to numeric values before ML model training.

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
cardio.columns

In [None]:
# Dataset Describe
cardio.describe().T

### Variables Description

'id': This column represents a unique identifier for each individual in the dataset.

'age': It represents the age of the individual in years.

'education': This column represents the education level of the individual, which could be encoded categorically (e.g., high school, college, postgraduate) or numerically (e.g., years of education completed).

'sex': It indicates the biological sex of the individual, typically encoded as binary values (e.g., 0 for female, 1 for male).

'is_smoking': This column indicates whether the individual is currently smoking or not, typically encoded as binary values (e.g., 0 for non-smoker, 1 for smoker).

'cigsPerDay': It represents the number of cigarettes smoked per day by the individual.

'BPMeds': This column indicates whether the individual is taking blood pressure medication, typically encoded as binary values (e.g., 0 for not taking medication, 1 for taking medication).

'prevalentStroke': It indicates whether the individual has a history of stroke, typically encoded as binary values (e.g., 0 for no stroke history, 1 for stroke history).

'prevalentHyp': This column indicates whether the individual has prevalent hypertension (high blood pressure), typically encoded as binary values (e.g., 0 for no hypertension, 1 for hypertension).

'diabetes': It indicates whether the individual has diabetes, typically encoded as binary values (e.g., 0 for no diabetes, 1 for diabetes).

'totChol': This column represents the total cholesterol level of the individual in mg/dL (milligrams per deciliter).

'sysBP': It represents the systolic blood pressure of the individual in mmHg (millimeters of mercury).

'diaBP': This column represents the diastolic blood pressure of the individual in mmHg.

'BMI': It indicates the body mass index of the individual, which is a measure of body fat based on height and weight.

'heartRate': This column represents the resting heart rate of the individual in beats per minute.

'glucose': It represents the blood glucose (sugar) level of the individual in mg/dL.

'TenYearCHD': This column indicates the presence or absence of the ten-year risk of developing coronary heart disease (CHD) for the individual, typically encoded as binary values (e.g., 0 for no risk, 1 for risk).

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
cardio.nunique()

## **3. Data Wrangling Code**

In [None]:
# missing value percentage
def missing_value():
  missing=cardio.columns[cardio.isnull().any()].tolist()
  return missing
print(round(cardio[missing_value()].isnull().sum().sort_values(ascending=False)/len(cardio)*100,2))

In [None]:
# visualizing the data distribution of the numerical columns which have missing values
for i in ['glucose','totChol','BMI','heartRate','cigsPerDay']:
  plt.figure(figsize=(15,9))
  sns.histplot(cardio[i],kde=True)  # histplot replaces the deprecated distplot

In [None]:
# filling the numerical missing values with the mean value of the column
cardio['BMI']=cardio['BMI'].fillna(cardio['BMI'].mean())
cardio['totChol']=cardio['totChol'].fillna(cardio['totChol'].mean())
cardio['glucose']=cardio['glucose'].fillna(cardio['glucose'].mean())
cardio['heartRate']=cardio['heartRate'].fillna(cardio['heartRate'].mean())
cardio['cigsPerDay']=cardio['cigsPerDay'].fillna(cardio['cigsPerDay'].mean())

In [None]:
cardio.isnull().sum()

In [None]:
print(type(cardio['education'].mode()))
print(type(cardio['education'].mode()[0]))

In [None]:
# filling the categorical missing values
cardio['education']=cardio['education'].fillna(cardio['education'].mode()[0])
cardio['BPMeds']=cardio['BPMeds'].fillna(cardio['BPMeds'].mode()[0])

In [None]:
cardio.isnull().sum()

### Dividing Numerical and Categorical data

In [None]:
cardio.columns

In [None]:
# to make clean dataset drop the id column as it is not used in prediction
cardio=cardio.drop(columns='id')

In [None]:
# separate categorical variables
categorical_var=[x for x in cardio.columns if cardio[x].dtype=='O']
print("This dataset has",len(categorical_var),"categorical variables")

# separate numerical variables
numerical_var=[x for x in cardio.columns if cardio[x].dtype!='O']
print("This dataset has",len(numerical_var),"numerical variables")

In [None]:
numerical_var

In [None]:
new_list=[]
for x in cardio.columns:
  if len(cardio[x].unique())<20:
    print(x,'Values:',cardio[x].unique())
    new_list.append(x)
print("In total there are",len(new_list),"discrete variables")

In [None]:
# continuous features: every column that is not in the discrete list
continous=[x for x in cardio.columns if x not in new_list]
print("In total there are",len(continous),"continuous variables")

### What all manipulations have you done and insights you found?

1. Most of the numerical columns with missing values are skewed. Their missing values were imputed with the column mean, and the missing values in the categorical columns ('education' and 'BPMeds') were imputed with the mode.

2. Segregated the features into discrete and continuous variables for better understanding and analysis of the independent variables.
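
For skewed columns the median is usually a more robust fill value than the mean, because a few extreme readings pull the mean away from the bulk of the data. The sketch below illustrates the difference on a hypothetical, made-up series (not the project data):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed glucose-like readings with two missing values
glucose = pd.Series([70, 75, 78, 80, 82, 85, 90, 300, np.nan, np.nan])

# Two candidate imputations of the same column
mean_filled = glucose.fillna(glucose.mean())
median_filled = glucose.fillna(glucose.median())

print("mean:  ", glucose.mean())    # pulled upward by the single extreme value
print("median:", glucose.median())  # stays near the bulk of the data
```

Here the mean is 107.5 while the median is 81.0, so mean imputation would insert values that are higher than almost every observed reading.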

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#### **Chart - 1 - Analysis of the dependent variable**

In [None]:
# Chart - 1 visualization code
cardio['TenYearCHD'].value_counts().plot.pie(autopct='%1.1f%%')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

A pie chart shows the share of each class of the dependent variable at a glance.

##### 2. What is/are the insight(s) found from the chart?

The data is imbalanced: the 'TenYearCHD' variable is 0 for about 85% of the records and 1 for about 15%.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

The class imbalance can reduce the effectiveness of the models, so a resampling technique is needed to treat it before training.
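
To make the concern concrete, the sketch below uses synthetic labels with the same 85/15 split (not the project data) and a trivial classifier that always predicts "no risk". It is only an illustration of why accuracy alone is misleading here and why recall is the more informative metric:

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels with the same 85/15 split as TenYearCHD (illustrative only)
y_true = np.array([0] * 85 + [1] * 15)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_pred)
baseline_recall = recall_score(y_true, y_pred, zero_division=0)

print(f"accuracy = {baseline_accuracy:.2f}, recall = {baseline_recall:.2f}")
```

The baseline reaches 85% accuracy while identifying none of the at-risk patients, so any real model has to be judged on how many positives it recovers.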

#### **Chart - 2 - Effect of education on Risk analysis**

In [None]:
# Chart - 2 visualization code
df1=cardio.groupby('TenYearCHD')['education'].value_counts().unstack(0)
print(df1)
df1=df1.divide(df1.sum(axis=1),axis=0)*100
df1.plot(kind="bar")
plt.ylim(0,100)
plt.ylabel('Percentage')
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart makes it easy to compare the percentage of CHD risk across the education levels.

##### 2. What is/are the insight(s) found from the chart?

People with education level 1 have the highest percentage of CHD risk, but the other levels are not far behind.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

People with education level 1 tend to develop CHD more often than the others. Surprisingly, people with education level 4 rank second.

#### **Chart - 3 - Analysing Risk on Gender basis**

In [None]:
# Chart - 3 visualization code
df2=cardio.groupby('TenYearCHD')['sex'].value_counts().unstack(0)
print(df2)
df2=df2.divide(df2.sum(axis=1),axis=0)*100
df2.plot(kind="bar")
plt.ylim(0,100)
plt.ylabel('Percentage')
#plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is used because I need to compare the CHD risk of males and females.

##### 2. What is/are the insight(s) found from the chart?

Males have a higher chance of CHD compared to females.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

Yes. Knowing that males have a higher risk of CHD can create a positive business impact and lead to business opportunities, such as targeted screening and prevention.

The higher risk of CHD in males may also have negative implications for certain sectors or businesses, for example higher healthcare costs associated with treating CHD-related conditions in males.

#### **Chart - 4 - Effect of Smoking**

In [None]:
# Chart - 4 visualization code
df3=cardio.groupby('TenYearCHD')['is_smoking'].value_counts().unstack(0)
print(df3)
df3=df3.divide(df3.sum(axis=1),axis=0)*100
df3.plot(kind="bar")  # plot the smoking table (df3), not the sex table (df2)
plt.ylim(0,100)
plt.ylabel('Percentage')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart gives a clear comparison and makes it easy to understand the effect of smoking on the disease.

##### 2. What is/are the insight(s) found from the chart?

Males are more likely to smoke than females, although the difference is small. This is one of the reasons males have a higher chance of CHD.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If the insight highlights the increased risk of CHD in smokers, it can create positive business opportunities in the healthcare industry and related sectors, such as increased demand for smoking cessation programs, products, and services aimed at helping individuals quit smoking and reduce their risk of CHD.


#### **Chart - 5 - Effect of Blood Pressure Medicines**

In [None]:
# Chart - 5 visualization code
df4=cardio.groupby('TenYearCHD')['BPMeds'].value_counts().unstack(0)
print(df4)
df4=df4.divide(df4.sum(axis=1),axis=0)*100
df4.plot(kind="bar")  # plot the BPMeds table (df4), not the sex table (df2)
plt.ylim(0,100)
plt.ylabel('Percentage')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart gives a clear comparison and makes it easy to understand the effect of blood pressure medicine on the disease.

##### 2. What is/are the insight(s) found from the chart?

People who have blood pressure problems and already take medicine are more prone to CHD.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

If people with blood pressure issues who are already on medication are at higher risk, this can lead to positive business opportunities in the healthcare industry and related sectors, such as increased demand for advanced diagnostic tools, treatments, and medications for managing blood pressure and reducing the risk of CHD in this specific population segment.

The same insight may have negative implications for certain businesses or sectors. For example, businesses offering blood pressure medications may face challenges if the reduction in CHD risk is not significant enough to motivate individuals to continue using the medications.


#### **Chart - 6 - Effect of Prevalent Stroke**

In [None]:
# Chart - 6 visualization code
df5=cardio.groupby('TenYearCHD')['prevalentStroke'].value_counts().unstack(0)
print(df5)
df5=df5.divide(df5.sum(axis=1),axis=0)*100
df5.plot(kind="bar")  # plot the prevalentStroke table (df5), not the sex table (df2)
plt.ylim(0,100)
plt.ylabel('Percentage')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart gives a clear comparison and makes it easy to understand the effect of prevalent stroke on the disease.

##### 2. What is/are the insight(s) found from the chart?

Patients with a prevalent stroke are more prone to CHD risk.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can use this information to make people with a previous stroke aware that they are more prone to the risk of CHD.

#### **Chart - 7 - Effect of Diabetes**

In [None]:
# Chart - 7 visualization code
df6=cardio.groupby('TenYearCHD')['diabetes'].value_counts().unstack(0)
print(df6)
df6=df6.divide(df6.sum(axis=1),axis=0)*100
df6.plot(kind="bar")  # plot the diabetes table (df6), not the sex table (df2)
plt.ylim(0,100)
plt.ylabel('Percentage')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart gives a clear comparison and makes it easy to understand the effect of diabetes on the disease.

##### 2. What is/are the insight(s) found from the chart?

Diabetic patients are more prone to CHD risk.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can use this information to make people with diabetes aware that they have a higher risk of CHD.

#### **Chart - 8 - Effect of Prevalent Hypertension**

In [None]:
# Chart - 8 visualization code
df7=cardio.groupby('TenYearCHD')['prevalentHyp'].value_counts().unstack(0)
print(df7)
df7=df7.divide(df7.sum(axis=1),axis=0)*100
df7.plot(kind="bar")  # plot the prevalentHyp table (df7), not the sex table (df2)
plt.ylim(0,100)
plt.ylabel('Percentage')
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart gives a clear comparison and makes it easy to understand the effect of prevalent hypertension on the disease.

##### 2. What is/are the insight(s) found from the chart?

Hypertension patients are more prone to the risk of CHD.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

We can use this information to make people with hypertension aware that they have a higher risk of CHD, so that they can take precautionary measures.

#### **Chart - 9 - Continuous data distributions**

In [None]:
for i in continous:
  plt.figure()
  sns.histplot(cardio[i],kde=True)  # histplot replaces the deprecated distplot

##### 1. Why did you pick the specific chart?

A distribution plot shows how each continuous variable is distributed.

##### 2. What is/are the insight(s) found from the chart?

Most of the variables are skewed, so the data needs to be transformed.

#### **Chart - 10 Effect of BMI and Gender**

In [None]:
# Chart - 10 visualization code
plt.figure(figsize=(15,8))
sns.violinplot(x='sex',y='BMI',data=cardio)
plt.show()

##### 1. Why did you pick the specific chart?

Violin plots display the shape, spread, and density of the data distribution. They provide information about the location of the median, quartiles, and outliers, similar to a box plot. Additionally, the width of the violin at different points indicates the density of the data at those values.

##### 2. What is/are the insight(s) found from the chart?

A healthy BMI is considered to be between 20 and 25. Most males lie slightly above 25 and most females lie below 25, but the upper tail of the violin plot for females extends up to 55-60.

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

These insights suggest that, on average, males tend to have slightly higher BMI scores than females, with more females falling within the healthy BMI range. The presence of outliers among females with higher BMI scores may indicate a subgroup of individuals who are at greater risk of weight-related health issues.

#### **Chart - 11 - Smoking and Gender**

In [None]:
# Chart - 11 visualization code
# plot.bar creates its own figure, so no separate plt.figure call is needed
cardio.groupby('sex')['cigsPerDay'].value_counts().unstack(0).plot.bar(figsize=(20,8))
plt.show()

##### 1. Why did you pick the specific chart?

A bar chart is used to compare the number of cigarettes smoked per day by gender.

##### 2. What is/are the insight(s) found from the chart?

The majority of non-smokers are female. Females are also in the lead at some counts, such as 1, 2, 3, 5, 8, 9 and 10 cigarettes per day.

#### **Chart - 12 - Systolic Blood Pressure and Gender**

In [None]:
# Chart - 12 visualization code
plt.figure(figsize=(15,8))
sns.violinplot(x='sex',y='sysBP',data=cardio)
plt.show()

##### 1. Why did you pick the specific chart?

Violin plots display the shape, spread, and density of the data distribution. They provide information about the location of the median, quartiles, and outliers, similar to a box plot. Additionally, the width of the violin at different points indicates the density of the data at those values.

##### 2. What is/are the insight(s) found from the chart?

Females experience higher systolic blood pressure in some instances.

#### **Chart - 13 - Diabetes and Glucose**

In [None]:
# Chart - 13 visualization code
plt.figure(figsize=(15,8))
sns.violinplot(x='diabetes',y='glucose',data=cardio)
plt.show()

##### 1. Why did you pick the specific chart?

Violin plots display the shape, spread, and density of the data distribution. They provide information about the location of the median, quartiles, and outliers, similar to a box plot. Additionally, the width of the violin at different points indicates the density of the data at those values.

##### 2. What is/are the insight(s) found from the chart?

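Normal blood glucose is about 70 - 100 mg/dL. Most non-diabetic individuals fall in this range, whereas the diabetic group spreads up to more than 400 mg/dL. The part of the violin that extends below zero is an artifact of the density smoothing, not real negative readings.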

#### **Chart - 14 - Correlation Heatmap**

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(15,8))
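# numeric_only=True skips the text columns ('sex', 'is_smoking'), which recent pandas versions reject in corr()
sns.heatmap(cardio.corr(numeric_only=True),annot= True, cmap='coolwarm')
plt.show()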


##### 1. Why did you pick the specific chart?

Use heatmap to visualize the correlation between the features

##### 2. What is/are the insight(s) found from the chart?

Systolic BP - Diastolic BP, Systolic BP - Prevalent Hypertension and Diastolic BP - Prevalent Hypertension have high correlation.
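
Because systolic and diastolic pressure are strongly correlated, the project summary mentions combining them into a single pulse pressure feature. A minimal sketch of that step, assuming pulse pressure is defined as systolic minus diastolic pressure and using made-up readings; the column name `pulse_pressure` follows the summary, and the small frame `bp` is hypothetical:

```python
import pandas as pd

# Made-up blood pressure readings in mmHg (not taken from the dataset)
bp = pd.DataFrame({"sysBP": [120.0, 140.0, 165.0], "diaBP": [80.0, 90.0, 100.0]})

# Pulse pressure = systolic pressure - diastolic pressure
bp["pulse_pressure"] = bp["sysBP"] - bp["diaBP"]

# The two original columns become redundant once the combined feature exists
bp = bp.drop(columns=["sysBP", "diaBP"])
print(bp)
```

Replacing two highly correlated columns with one derived column is one way to reduce multicollinearity without discarding the blood pressure information entirely.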

#### **Chart - 15 - Pair Plot**

In [None]:
# Pair Plot visualization code
sns.pairplot(cardio)

##### 1. Why did you pick the specific chart?

Used Pairplot for visualizing the relationships between multiple variables in dataset.

##### 2. What is/are the insight(s) found from the chart?

Here we can also see that Systolic BP - Diastolic BP, Systolic BP - Prevalent Hypertension and Diastolic BP - Prevalent Hypertension have high correlation.

## **5. Hypothesis Testing**

### Hypothetical Statement

#### 1. State my research hypothesis as a null hypothesis and alternative hypothesis

Null Hypothesis : There is no statistically significant relationship between the 15 attributes (demographic, behavioral, and medical risk factors) and the 10-year risk of CHD.

Alternative Hypothesis : There is a statistically significant relationship between at least one of the 15 attributes and the 10-year risk of CHD.

#### 2. Perform a Statistical Test

In [None]:
import statsmodels.api as sm

#defining Null and alternative hypothesis
null_hypo='There is no relationship between prevalent stroke and TenYearCHD'
alt_hypo="There is relationship between prevalent stroke and TenYearCHD"

#performing linear regression
X=sm.add_constant(cardio['prevalentStroke'].apply(lambda x: 1 if x==1 else 0))
y=cardio['TenYearCHD']
model=sm.OLS(y,X).fit()

#printing summary statistics
print(model.summary())

#Extracting p-value for the prevalentStroke coefficient (selected by label, not by position)
p_value=model.pvalues['prevalentStroke']
print('p-value:',p_value)

#### Which statistical test have you done to obtain P-Value?


We fit an ordinary least squares (OLS) regression with the OLS function from the statsmodels package and read the p-value of the prevalentStroke coefficient.
In this case, since the p-value (0.00006) is less than the significance level of 0.05, we can reject the null hypothesis that "There is no relationship between prevalentStroke and TenYearCHD". Therefore, based on the available data, there is evidence to support the alternative hypothesis that "There is a relationship between prevalentStroke and TenYearCHD".

#### Why did you choose the specific statistical test?

The OLS regression model is commonly used to assess the relationship between a dependent variable and one or more independent variables.
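
With a single binary predictor, the OLS slope is simply the difference between the two groups' outcome rates, which helps when reading the regression summary. The sketch below checks this with NumPy on made-up 0/1 arrays (the data and the names `stroke` and `chd` are illustrative, not the project data):

```python
import numpy as np

# Made-up binary data: predictor (e.g. prior stroke) and outcome (e.g. CHD)
stroke = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
chd = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])

# Design matrix with an intercept column, similar to what sm.add_constant builds
X = np.column_stack([np.ones_like(stroke), stroke]).astype(float)
intercept, slope = np.linalg.lstsq(X, chd.astype(float), rcond=None)[0]

# Group-wise outcome rates
rate_without = chd[stroke == 0].mean()
rate_with = chd[stroke == 1].mean()

print("slope:", round(slope, 4))
print("rate difference:", round(rate_with - rate_without, 4))
```

The intercept equals the outcome rate in the group without the condition, and the slope equals how much higher the rate is in the group with it; the p-value then tests whether that difference is distinguishable from zero.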

## **6. Feature Engineering & Data Pre-processing**

#### 1. Checking and Handling Outliers

In [None]:
# checking outliers
for x in continous:
  plt.figure(figsize=(10,5))

  plt.subplot(1,2,1)
  fig=sns.boxplot(y=cardio[x])
  fig.set_ylabel(x)

  plt.subplot(1,2,2)
  fig1=sns.histplot(cardio[x].dropna(),kde=True)  # histplot replaces the deprecated distplot
  fig1.set_xlabel(x)

  plt.show()

In [None]:
# handling outliers

# capping the outlier rows with percentile
for x in continous:
  up=cardio[x].quantile(.95)
  down=cardio[x].quantile(.05)
  cardio.loc[(cardio[x]>up),x]=up
  cardio.loc[(cardio[x]<down),x]=down

#again checking outliers
for x in continous:
  plt.figure(figsize=(9,5))
  plt.subplot(1,2,1)
  fig=sns.boxplot(y=cardio[x])

  plt.subplot(1,2,2)
  fig= sns.histplot(cardio[x].dropna(),kde=True)  # histplot replaces the deprecated distplot
  fig.set_xlabel(x)

  plt.show()
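
The capping loop above can also be expressed with `Series.clip`. The sketch below applies 5th/95th percentile capping to a made-up series and, for comparison, computes the IQR fences mentioned in the project summary. The data and variable names are illustrative assumptions:

```python
import pandas as pd

# Made-up values with one extreme point on each side
s = pd.Series([1, 20, 21, 22, 23, 24, 25, 26, 27, 100], dtype=float)

# Percentile capping: values outside the 5th-95th percentile range
# are pulled back to the boundary instead of being dropped
lower, upper = s.quantile(0.05), s.quantile(0.95)
capped = s.clip(lower=lower, upper=upper)

# IQR fences: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] count as outliers
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]

print(capped.min(), capped.max())
print(outliers.tolist())
```

Capping keeps every row and only shrinks the extremes, whereas IQR-based removal would drop the flagged rows, so the two approaches leave different sample sizes for modelling.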

#### 2. Categorical  Encoding

In [None]:
df=cardio.copy()

# use label encoding
le=LabelEncoder()
df['sex']= le.fit_transform(df['sex'])
df['is_smoking']=le.fit_transform(df['is_smoking'])

#### 3. Calculate VIF

In [None]:
# calculate VIF
def calc_vif(X):
  vif=pd.DataFrame()
  vif["Variables"]=X.columns
  vif["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
  return (vif.sort_values(by="VIF",ascending=False).reset_index(drop=True))

#the_variance_Inflation_Factor_(VIF)
test_df=calc_vif(df[[i for i in df.describe().columns if i not in ['TenYearCHD']]])
print(test_df)

In [None]:
# Eliminating the features having higher VIF ( >5 points)
x=df.drop(columns=['TenYearCHD','sysBP','diaBP','glucose','BMI','totChol','heartRate','is_smoking'])
y=df['TenYearCHD']
calc_vif(x)
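
For reference, the VIF of a feature is 1 / (1 - R²), where R² comes from regressing that feature on all the other features. The sketch below computes this directly with NumPy on synthetic data in which one column is nearly the sum of two others. The helper name `vif_from_definition` is hypothetical, and it adds an intercept to each auxiliary regression, which statsmodels' `variance_inflation_factor` does not do unless a constant column is included in its input, so the numbers need not match `calc_vif` exactly:

```python
import numpy as np

def vif_from_definition(X):
    """Return the VIF of each column of X as 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        design = np.column_stack([np.ones(len(target)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1.0 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r_squared))
    return np.array(vifs)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)  # nearly a + b, so highly collinear
d = rng.normal(size=500)                     # independent of the other columns

vifs = vif_from_definition(np.column_stack([a, b, c, d]))
print(np.round(vifs, 1))
```

The collinear columns `a`, `b` and `c` get very large VIF values, while the independent column `d` stays close to 1, which is the pattern the threshold of 5 is meant to detect.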

#### 4. Data Transformation

In [None]:
x.head()

In [None]:
# Data Transformation needed as the data was skewed
# log transformation
x=np.log(x+1)

In [None]:
x.head()

In [None]:
# now visualize the data after the data transformation
for a in x.describe().columns:
  plt.figure(figsize=(9,5))

  plt.subplot(1,2,1)
  fig=sns.boxplot(y=x[a])
  fig.set_ylabel(a)

  plt.subplot(1,2,2)
  fig1=sns.histplot(x[a].dropna(),kde=True)  # histplot replaces the deprecated distplot
  fig1.set_xlabel(a)

  plt.show()

In [None]:
# split the dataset into train and test dataset
X_train,X_test,y_train,y_test=train_test_split(x,y,test_size=0.3,random_state=42)
print(X_test.shape)
print(X_train.shape)

In [None]:
# scaling the dataset in between range from 0 to 1
ms=MinMaxScaler()
X_train=ms.fit_transform(X_train)
# reuse the scaling learned on the training data; refitting on the test set would leak information
X_test=ms.transform(X_test)
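
A scaler should learn its minimum and maximum from the training split only and then reuse them on the test split, so that no information from the test data leaks into preprocessing. A small sketch with made-up numbers (the arrays `train` and `test` are illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature splits (illustrative values only)
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[15.0], [40.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=10 and max=30 from train
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())
print(test_scaled.ravel())
```

The test value 40 maps to 1.5, above the nominal 0 to 1 range. That is expected: it is larger than anything seen in training, and the model is evaluated under the same transformation it was trained with.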

In [None]:
# checking data imbalance
print(cardio['TenYearCHD'].value_counts())

# handling data imbalance with SMOTE oversampling combined with Tomek links cleaning (SMOTETomek), on the training data only
smote=SMOTETomek(random_state=42)
X_train,y_train=smote.fit_resample(X_train,y_train)

print(X_train.shape[0])
print(y_train.shape[0])
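
The oversampling half of SMOTETomek creates new minority samples by interpolating between existing minority points and their neighbours. The sketch below is a simplified NumPy illustration of that interpolation idea only: it always uses the single nearest neighbour, omits the Tomek links cleaning step, and is not the imblearn implementation. The points and the helper `synthetic_point` are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up minority-class points in two dimensions
minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])

def synthetic_point(points, rng):
    """Create one synthetic sample between a random point and its nearest neighbour."""
    i = rng.integers(len(points))
    distances = np.linalg.norm(points - points[i], axis=1)
    distances[i] = np.inf                     # exclude the point itself
    neighbour = points[np.argmin(distances)]
    gap = rng.random()                        # position along the segment, in [0, 1)
    return points[i] + gap * (neighbour - points[i])

new_points = np.array([synthetic_point(minority, rng) for _ in range(6)])
print(new_points.round(2))
```

Every synthetic sample lies on a line segment between two real minority points, which is why resampling is applied to the training split only: the test set should contain real observations.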

## **7. ML Model Implementation**

In [None]:
x.columns.values.tolist()

In [None]:
score_df = pd.DataFrame()

scoring = make_scorer(f1_score,pos_label=1)

features=[i for i in df.columns if i not in ['TenYearCHD']]

def analyse_model(model,X_train,X_test,y_train,y_test):
  model.fit(X_train,y_train)

  try:
    try:
      importance = model.feature_importances_
      feature = features
    except:
      importance = np.abs(model.coef_[0])
      feature = x.columns.values.tolist()
    indicies = np.argsort(importance)
    indicies = indicies[::-1]
  except:
    pass

  for x,act,label in ((X_train,y_train,'Train_set'),(X_test,y_test,'Test_set')):

    # plotting evaluation matrix for train and test dataset
    pred=model.predict(x)
    pred_proba=model.predict_proba(x)[:,1]
    report=pd.DataFrame(classification_report(y_pred=pred,y_true=act,output_dict=True))
    fpr,tpr,thresholds = roc_curve(act,pred_proba)

    #classification  report
    plt.figure(figsize=(18,3))
    plt.subplot(1,3,1)
    sns.heatmap(report.iloc[:-1,:-1].T,annot=True,cmap='coolwarm')
    plt.title(f'{label} Report')

    # Confusion matrix
    plt.subplot(1,3,2)
    sns.heatmap(confusion_matrix(y_true=act,y_pred=pred),annot=True,cmap='coolwarm')
    plt.title(f'{label} Report')
    plt.xlabel('Predicted labels')

    global score_df
    score_df[model]={'precision': precision_score(act,pred),'recall': recall_score(act,pred),'f1_score':f1_score(act,pred),'accuracy':accuracy_score(act,pred)}

    # AUC_ROC Curve
    plt.subplot(1,3,3)
    plt.plot([0,1],[0,1],'k--')
    plt.plot(fpr,tpr,label=f'AUC={np.round(np.trapz(tpr,fpr),3)}')
    plt.legend(loc=4)
    plt.title(f'{label} AUC_ROC Curve')
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.tight_layout()

  # plotting feature importance (skipped when no importances were collected above)
  try:
    plt.figure(figsize=(18,3))
    plt.bar(range(len(indicies)),importance[indicies])
    plt.xticks(range(len(indicies)),[feature[i] for i in indicies])
    plt.title('Feature Importance')
  except:
    #print(indicies)
    pass
  plt.show()

  return model
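
The AUC shown in the plot legend is the area under the ROC curve, computed with the trapezoidal rule over the `(fpr, tpr)` points. A minimal pure-Python sketch of that calculation follows, using made-up labels and scores. The helper names `roc_points` and `trapezoid_auc` are hypothetical, and tied scores are not grouped the way `roc_curve` would group them.

```python
def roc_points(y_true, scores):
    """ROC points obtained by lowering the threshold one sample at a time."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    # visit samples from the highest score to the lowest
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    fpr, tpr = [0.0], [0.0]
    for i in order:
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Invented labels and predicted probabilities, for illustration only
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_points(y_true, scores)
auc_value = trapezoid_auc(fpr, tpr)
print(round(auc_value, 3))  # prints 0.778
```

The value 7/9 matches the fraction of (positive, negative) pairs in which the positive sample receives the higher score, which is the usual interpretation of AUC.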

### **1. Logistic Regression**

In [None]:
lr=LogisticRegression()
grid={'penalty':['l1','l2'],                 # Regularization type: l1 or l2
      'C':[0.1,0.2,0.3],                     # Inverse of Regularization Strength
      'solver':['liblinear','saga'],         # solver algorithm
      'max_iter':[100,200,300,10000]}        # Maximum number of iteration

lr=GridSearchCV(lr,param_grid=grid,cv=5)

lr.fit(X_train,y_train)

In [None]:
analyse_model(lr.best_estimator_,X_train,X_test,y_train,y_test)

### **2. Support Vector Classifier**

In [None]:
svc=SVC(random_state=0,probability=True)

# Cross  Validation & Hyper Parameter Tuning

# Hyperparameter Grid

grid={
    'kernel':['linear','rbf','poly','sigmoid'],   # kernel names are lowercase in scikit-learn
    'C':[0.1,1,10,100],
    'max_iter':[1000]
}

svc = GridSearchCV(svc,param_grid=grid,cv=5)
svc.fit(X_train,y_train)

In [None]:
# analyse the model

analyse_model(svc.best_estimator_,X_train,X_test,y_train,y_test)

### **3. Naive Bayes Classifier**

In [None]:
nvc = GaussianNB()

# analyse the model performance

analyse_model(nvc,X_train,X_test,y_train,y_test)

### **4. XGBoost Random Forest Classifier**

In [None]:
xgb=XGBRFClassifier(random_state=0)

grid={
    'n_estimators':[150],
    'max_depth':[8,10],
    'eta':[0.05,0.08,0.1]
}

xgb=GridSearchCV(xgb,scoring=scoring,param_grid=grid,cv=5)
xgb.fit(X_train,y_train)

In [None]:
# analyse the model performance

analyse_model(xgb.best_estimator_,X_train,X_test,y_train,y_test)

### **5. Neural Network Classifier**

In [None]:
nnc=MLPClassifier(random_state=0)

# Cross  Validation & Hyper Parameter Tuning

# Hyperparameter Grid

grid={
    'hidden_layer_sizes':[(50,),(100,)],
    'activation':['relu','tanh'],
    'solver':['adam'],
    'alpha':[0.0001,0.001],
    'learning_rate':['constant','adaptive']
}

# GridSearchCV to find the best parameters

NNC=GridSearchCV(nnc,scoring=scoring,param_grid=grid,cv=5)
NNC.fit(X_train,y_train)

In [None]:
# analyse the model performance

analyse_model(NNC.best_estimator_,X_train,X_test,y_train,y_test)

### **Comparing The Evaluation Scores Of ML Models**

In [None]:
score_df=score_df.T

score_df['model'] = ['Logistic Regression','SVC','Gaussian NB','XGBRFClassifier','Neural Network Classifier']

score_df = score_df.set_index('model')

score_df

### **1. Which Evaluation metrics did you consider for a positive business impact and why?**


1. Accuracy: Accuracy measures the overall correctness of predictions and is useful when class distribution is balanced and all classes are equally important.

2. Precision: Precision represents the proportion of true positive predictions among the total positive predictions. It is valuable when the cost of false positives is high, such as in scenarios where false positives could lead to significant consequences or financial losses.

3. Recall (Sensitivity): Recall measures the proportion of true positive predictions among the actual positive instances. It is important when the cost of false negatives is high, as it focuses on minimizing the number of false negatives or missed positive instances.

4. F1-score: The F1-score is the harmonic mean of precision and recall. It provides a single balanced measure of the two, making it useful when both false positives and false negatives need to be kept low.
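
The four definitions above can be made concrete with a small worked example. The sketch below computes each metric from confusion-matrix counts; the function name `classification_metrics` and the counts themselves are hypothetical, chosen only to mimic an imbalanced test set.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {'accuracy': accuracy, 'precision': precision,
            'recall': recall, 'f1_score': f1}

# Hypothetical counts: 50 true CHD cases among 500 patients
metrics = classification_metrics(tp=40, fp=60, fn=10, tn=390)
for name, value in metrics.items():
    print(f'{name}: {value:.3f}')
```

With these counts, accuracy is 0.86 even though precision is only 0.40, while recall is 0.80 and the F1-score is about 0.53. This is why accuracy alone can be misleading when the positive class is rare.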

### **2. Which ML model did you choose from the above created models as your final prediction model and why?**

After evaluating several machine learning models on the Framingham Heart Study dataset, I selected Logistic Regression as the final prediction model. The decision was based on recall, the primary evaluation metric, which measures the model's ability to correctly identify patients at risk of CHD. In this analysis, Logistic Regression and the Support Vector Classifier had the highest recall scores among the models evaluated.

I chose recall as the primary evaluation metric because correctly identifying patients at risk of CHD is critical to the business objective: a missed at-risk patient (a false negative) is costlier than a false alarm. Selecting a model with a high recall score aims to identify as many at-risk patients as possible, even at the cost of some false positives. Overall, Logistic Regression is the best choice for this need among the models evaluated.
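
The trade-off between recall and false positives can be illustrated by varying the probability cut-off used to flag a patient. The sketch below is an assumption-laden illustration, not something the notebook does: the helper `recall_precision_at`, the labels, and the probabilities are all invented, and the models above use the default 0.5 cut-off.

```python
def recall_precision_at(y_true, proba, threshold):
    """Recall and precision when samples with proba >= threshold are flagged."""
    pred = [1 if p >= threshold else 0 for p in proba]
    tp = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, pred) if t == 1 and p == 0)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision

# Invented labels (1 = CHD risk) and predicted probabilities
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
proba = [0.9, 0.6, 0.35, 0.4, 0.32, 0.55, 0.1, 0.45]

results = {}
for threshold in (0.5, 0.3):
    results[threshold] = recall_precision_at(y_true, proba, threshold)
    recall, precision = results[threshold]
    print(f'threshold={threshold}: recall={recall:.2f}, precision={precision:.2f}')
```

On this toy data, lowering the cut-off from 0.5 to 0.3 raises recall from 0.50 to 1.00 while precision falls from about 0.67 to about 0.57: every at-risk patient is caught, at the cost of more false positives.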


# **Conclusion**

In conclusion, this project demonstrated the potential of machine learning techniques to accurately predict the 10-year risk of future coronary heart disease (CHD) in patients using data from an ongoing cardiovascular study. Key points from this project include:

1. Careful data preprocessing and transformation improved the performance of machine learning models and enabled more accurate predictions.

2. Feature selection was important for identifying the most relevant predictors of CHD risk.

3. The Logistic Regression model was chosen as the final prediction model due to its high recall score.

4. SMOTE combined with Tomek links was used to handle the class imbalance, and the features were scaled so that all of them were on a comparable scale.

5. This project provides a valuable example of how machine learning techniques can be applied to real-world problems to achieve positive business impact.

Overall, this project highlights the importance of careful data preparation and analysis in machine learning.