<a href="https://colab.research.google.com/github/sanjananasa/cardiovascular-risk-prediction/blob/main/CardioVascular_Risk_Prediction_(Classification_Capstone_Project).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Name**    - CardioVascular Disease Risk Prediction



## **Project Type**    - Classification
## **Contribution**    - Individual
## **Name**- Sanjana Nasa

# **Project Summary -**

The goal of this project is to use machine learning techniques to predict whether a patient has a 10-year risk of future coronary heart disease (CHD). The dataset contains patient information: over 4,000 records and 16 attributes, each of which is a potential risk factor. The risk factors are demographic, behavioral, and medical.

To prepare the data for analysis, extensive processing is performed to clean and transform it. This includes handling missing values using median, mode, and KNN imputation techniques, and identifying and treating outliers. Skewed continuous variables were transformed using a log transformation to improve model performance.

Feature selection is performed using the VIF method to remove multicollinearity. A few new variables were created and redundant variables were removed.

# **GitHub Link -**

Provide your GitHub Link here.

# **Problem Statement**


**What exactly are cardiovascular diseases?**

Cardiovascular diseases are a group of conditions affecting the heart and blood vessels. They include coronary heart disease, which affects the blood vessels that supply the heart muscle. Heart attacks and strokes are typically sudden events, and most of the time they are caused by a blockage that prevents blood from flowing to the heart or brain. The most common cause of such a blockage is a buildup of fatty deposits on the inner walls of the blood vessels that supply the heart or brain.

**WHY DO WE NEED CARDIOVASCULAR RISK PREDICTION?**

The greatest obstacle facing the medical industry is accurately predicting and diagnosing heart disease. Heart diseases are influenced by numerous factors.

Heart disease is even referred to as a "silent killer" because it kills people without showing any obvious symptoms.

When high-risk patients are diagnosed with heart disease early, it is easier to make lifestyle changes, which in turn lowers the risk of complications.
Based on the way people currently live, machine learning can help predict the likelihood of heart disease in the coming years.

# ***Let's Begin !***

## ***1. Know Your Data***

### Import Libraries

In [None]:
# Import Libraries
import pandas as pd
import numpy as np
from datetime import datetime as dt

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline


# Importing warnings library. Would help to throw away warnings caused.
import warnings
warnings.filterwarnings('ignore')

### Dataset Loading

In [None]:
# Load Dataset

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
data=pd.read_csv("/content/drive/MyDrive/data_cardiovascular_risk.csv")

### Dataset First View

In [None]:
# Dataset First Look
#first five rows of the dataset
data.head()


In [None]:
#last five rows of the dataset
data.tail()

### Dataset Rows & Columns count

In [None]:
# Dataset Rows & Columns count
data.shape
print("number of rows",data.shape[0])
print("number of columns",data.shape[1])

### Dataset Information

In [None]:
# Dataset Info
data.info()

#### Duplicate Values

In [None]:
# Dataset Duplicate Value Count
data.duplicated().sum()

#### Missing Values/Null Values

In [None]:
# Missing Values/Null Values Count
miss_val=data.isnull().sum().sort_values(ascending=False)
print(miss_val)

In [None]:
# Visualizing the missing values
import missingno as msno
msno.matrix(data)
plt.show()

## ***2. Understanding Your Variables***

In [None]:
# Dataset Columns
data.columns

In [None]:
# Dataset Describe
data.describe().round(2)

### Variables Description

#**Demographic**

**age** : Age of the patient (Continuous - Although the recorded ages have been truncated to whole numbers, the concept of age is continuous)

**education** : level of education from 1 to 4 (Ordinal Variable)

**sex** : male or female ("M" or "F")


#**Behavioral**

**is_smoking** : whether or not the patient is a current smoker ("YES" or "NO")

**cigsPerDay** : the number of cigarettes that the person smoked on average in one day (can be considered continuous as one can have any number of cigarettes, even half a cigarette.)


#**Medical( history)**

**BPMeds** : whether or not the patient was on blood pressure medication (Nominal)

**prevalentStroke** : whether or not the patient had previously had a stroke (Nominal)

**prevalentHyp** : whether or not the patient was hypertensive (Nominal)

**diabetes** : whether or not the patient had diabetes (Nominal)


#**Medical(current)**

**totChol** : total cholesterol level (Continuous)

**sysBP** : systolic blood pressure (Continuous)

**diaBP** : diastolic blood pressure (Continuous)

**BMI** : Body Mass Index (Continuous)

**heartRate** : heart rate (Continuous - In medical research, variables such as heart rate though in fact discrete, yet are considered continuous because of large number of possible values.)

**glucose** : glucose level (Continuous)


#**Predict variable (desired target)**

**TenYearCHD** : (binary: “1”, means “Yes”, “0” means “No”)

### Check Unique Values for each variable.

In [None]:
# Check Unique Values for each variable.
unique_values = {}
for column in data.columns:
    unique_values[column] = data[column].unique()

# Display the unique values for each variable
for column, values in unique_values.items():
    print(f"Unique values for column '{column}':")
    print(values)
    print()

## 3. ***Data Wrangling***

### Data Wrangling Code

In [None]:
#dropping the id column
data=data.drop(columns='id')

In [None]:
# Write your code to make your dataset analysis ready.

#converting datatype
data['glucose'] =data['glucose'].astype(float)

In [None]:


#Handling missing values
miss_val[:7]


In [None]:
#checking missing value percentage of each variable
miss_val[:7]/len(data)*100

The percentage of missing/null values is not very high in any of the columns, so we can handle these missing values by imputation.

First, let's visualise these columns so that we can choose an appropriate imputation strategy.

In [None]:
#visualising columns with missing values
for i in ['glucose','education','BPMeds','totChol','cigsPerDay','BMI','heartRate']:
  plt.figure(figsize=(6,4))
  sns.histplot(data[i], kde=True)  # distplot is deprecated in recent seaborn releases

The plots above show that the continuous columns with missing values (**glucose, totChol, BMI, heartRate, cigsPerDay**) have skewed distributions, so in those columns we can fill the missing values with the **median**.

For the categorical variables, i.e. **education and BPMeds**, we can fill the missing values with the **mode**.

In [None]:
#imputing continuous missing values with median()
data['glucose']=data['glucose'].fillna(data['glucose'].median())
data['totChol']=data['totChol'].fillna(data['totChol'].median())
data['BMI']=data['BMI'].fillna(data['BMI'].median())
data['heartRate']=data['heartRate'].fillna(data['heartRate'].median())
data['cigsPerDay']=data['cigsPerDay'].fillna(data['cigsPerDay'].median())

In [None]:
#imputing categorical missing data with mode
data['education']=data['education'].fillna(data['education'].mode()[0])
data['BPMeds']=data['BPMeds'].fillna(data['BPMeds'].mode()[0])

In [None]:
#checking for missing values again
data.isnull().sum()

Now there are no null values in the data

In [None]:
#segregating data into continuous and discrete variables

#discrete variables
discrete=[]

for var in data.columns:
  if var not in ['TenYearCHD']:
    if len(data[var].unique())<10:
      discrete.append(var)
      print(var,'values:',data[var].unique())


In [None]:
#continuous variables
continuous=[var for var in data.columns if var not in discrete and var not in ['TenYearCHD']]
print(continuous)


### What all manipulations have you done and insights you found?

 1. Dropped the 'id' column because it duplicates the index, so there was no point in keeping it.

 2. Converted the "glucose" column data type from 'object' to 'float'.

 3. Found that some variables in the dataset have missing values, and imputed them with the median (continuous variables) or the mode (categorical variables).

 4. Segregated discrete and continuous variables for better understanding and analysis.

## ***4. Data Visualization, Storytelling & Experimenting with charts : Understand the relationships between variables***

#**UNIVARIATE ANALYSIS**

## Visualising the Dependent(target) variable distribution

In [None]:
#  visualization code
data['TenYearCHD'].value_counts().plot.pie(autopct='%1.1f%%')
plt.legend()
plt.show()

##### 1. Why did you pick the specific chart?

**To show the proportion of dependent variable**

##### 2. What is/are the insight(s) found from the chart?

**We can see that the target is imbalanced: about 85% of the 'TenYearCHD' values are 0 and about 15% are 1.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**The imbalance in the data can affect model performance, so we need to apply a sampling technique to treat the imbalance.**

#distribution of continuous variables

In [None]:
#distribution of continuous variables
for col in continuous:
  plt.figure()
  sns.histplot(data[col], kde=True)  # distplot is deprecated in recent seaborn releases

For numerical features, we can see that the majority of distributions are right-skewed. The distributions of totChol (total cholesterol) and BMI are roughly comparable. The distribution of glucose is highly skewed to the right. It demonstrates that glucose has many outliers.

#Outlier Analysis of Numeric feature

In [None]:
plt.figure(figsize=(15,5))
plt.suptitle('Outlier Analysis of Numeric Features', fontsize=20, fontweight='bold', y=1.02)

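for i,col in enumerate(continuous):
  plt.subplot(2, 4, i+1)                       # subplots 2 rows, 4 columns

  # boxplots (pass the column as x so each feature gets one horizontal box)
  sns.boxplot(x=data[col])
  # x-axis label
  plt.xlabel(col)

# adjust spacing once, after all subplots are drawn
plt.tight_layout()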

Outliers are visible in the 'cigsPerDay', 'totChol', 'sysBP', 'diaBP', 'BMI', 'heartRate', and 'glucose' columns.

#**Bivariate and Multivariate Analysis**

## Education and risk of disease

In [None]:
# visualization code
freq=data.groupby('TenYearCHD')['education'].value_counts().unstack(0)
(freq.divide(freq.sum(axis=1),axis=0)*100).plot(kind='bar')
plt.ylim(0,100)
plt.ylabel("percentage")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

**To show the comparison between education level and risk of disease**

##### 2. What is/are the insight(s) found from the chart?

**People with education level 1 have the highest percentage of CHD risk. The other levels are not far behind.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**People with education level 1 tend to get CHD more often than others. Surprisingly, people with education level 4 rank second.**

## Gender and risk of disease

In [None]:
# visualization code
freq=data.groupby('TenYearCHD')['sex'].value_counts().unstack(0)
(freq.divide(freq.sum(axis=1),axis=0)*100).plot(kind='bar')
plt.ylim(0,100)
plt.ylabel("percentage")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

**To show the comparison between the risk of CHD with respect to 'sex'.**

##### 2. What is/are the insight(s) found from the chart?

**We can clearly see that males have higher percentage of CHD risk as compared to females.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**If the insight reveals that males are more prone to CHD, it can lead to positive business opportunities in the healthcare industry.**

## Smoking and risk of disease

In [None]:
# visualization code
freq=data.groupby('TenYearCHD')['is_smoking'].value_counts().unstack(0)
(freq.divide(freq.sum(axis=1),axis=0)*100).plot(kind='bar')
plt.ylim(0,100)
plt.ylabel("percentage")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

**To compare the risk of CHD with respect to smoking,i.e, whether a person smokes or not.**

##### 2. What is/are the insight(s) found from the chart?

**We found that people who smoke have a higher percentage of CHD risk than people who do not smoke.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**If the insight highlights the increased risk of CHD in smokers, then the potential impacts could be increased demand for smoking cessation programmes , products and services aimed at helping people quit smoking and reduce the risk of CHD.**

## Blood pressure medicines (BPMeds) and risk of CHD

In [None]:
#  visualization code
freq=data.groupby('TenYearCHD')['BPMeds'].value_counts().unstack(0)
(freq.divide(freq.sum(axis=1),axis=0)*100).plot(kind='bar')
plt.ylim(0,100)
plt.ylabel("percentage")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

**To show the comparison of percentage of risk of CHD with respect to BP medications**



##### 2. What is/are the insight(s) found from the chart?

**Here we can see that people who are already on BP medication show a higher percentage of CHD risk than people who are not.**

## Prevalent stroke and risk of CHD

In [None]:
#  visualization code
freq=data.groupby('TenYearCHD')['prevalentStroke'].value_counts().unstack(0)
(freq.divide(freq.sum(axis=1),axis=0)*100).plot(kind='bar')
plt.ylim(0,100)
plt.ylabel("percentage")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

**Used bar chart to show the comparison of percentage of CHD risk with respect to 'prevalent stroke'.**

##### 2. What is/are the insight(s) found from the chart?

**Patients who have previously had a stroke have a higher percentage of CHD risk.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**We can use this information to create awareness among people so that they could take precautionary measures.**

## Diabetes and risk of CHD

In [None]:
# visualization code
freq=data.groupby('TenYearCHD')['diabetes'].value_counts().unstack(0)
(freq.divide(freq.sum(axis=1),axis=0)*100).plot(kind='bar')
plt.ylim(0,100)
plt.ylabel("percentage")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

**Used bar chart to show the comparison of percentage of CHD risk with respect to 'diabetes'.**

##### 2. What is/are the insight(s) found from the chart?

**Diabetic patients tend to have higher risk of CHD.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**We can use this insight to create awareness among the people so that people who are already diabetic can be more careful and accordingly take precautionary measures to avoid CHD.**

## Prevalent Hypertension and risk of CHD

In [None]:
#  visualization code
freq=data.groupby('TenYearCHD')['prevalentHyp'].value_counts().unstack(0)
(freq.divide(freq.sum(axis=1),axis=0)*100).plot(kind='bar')
plt.ylim(0,100)
plt.ylabel("percentage")
plt.xticks(rotation=0)
plt.show()

##### 1. Why did you pick the specific chart?

**Used bar chart to show the comparison of percentage of CHD risk with respect to 'prevalent hypertension'.**

##### 2. What is/are the insight(s) found from the chart?

**Patients who have history of hypertension are at higher risk of CHD.**

##### 3. Will the gained insights help creating a positive business impact?
Are there any insights that lead to negative growth? Justify with specific reason.

**We can use this information to create awareness among people so that they can be more cautious and make the necessary changes to their lifestyle.
For example, a person with hypertension can lower the risk of CHD by consuming less sodium, quitting smoking, losing weight, etc.**

## smoking frequency among male and female

In [None]:
#  visualization code
smoke_freq=data.groupby('sex')['cigsPerDay'].value_counts().unstack(0).plot.bar(figsize=(12,8))
plt.show()


##### 1. Why did you pick the specific chart?

**Used bar chart in order to show the comparison between male and female with respect to number of cigarettes they smoke per day.**

##### 2. What is/are the insight(s) found from the chart?

**Most non-smokers are female, and at some levels, such as 1, 3, 5, 9, 10 and 15 cigarettes per day, females also outnumber males.**

## Systolic and Diastolic Blood Pressure and Gender

In [None]:
# visualization code

#systolic BP
plt.figure(figsize=(10,6))
sns.violinplot(x='sex',y='sysBP',data=data)
plt.show()



In [None]:
#Diastolic BP
plt.figure(figsize=(10,6))
sns.violinplot(x='sex',y='diaBP',data=data)
plt.show()

##### 1. Why did you pick the specific chart?

**A violin plot shows the spread and density of the data distribution. It provides information about the median and quartiles, and its tails show how far the extreme values reach. The width of the plot at different points shows the density of data at those values.**

##### 2. What is/are the insight(s) found from the chart?

**In some instances, females show higher blood pressure readings than males.**

##  Regression plot between target variable and numerical features

In [None]:

# Checking Linearity of all numerical features with our target variable

# figsize
plt.figure(figsize=(15,5))
# title
plt.suptitle('Bivariate Analysis of Numerical features', fontsize=20, fontweight='bold', y=1.02)

for i,col in enumerate(continuous):
  plt.subplot(2, 4, i+1)                     # subplots of 2 rows and 4 columns

  # regression plots
  sns.regplot(x=data[col], y='TenYearCHD', data=data)
  # x-axis label
  plt.xlabel(col)
  plt.tight_layout()

Several of the independent numerical variables show a positive relationship with the target variable: the fitted lines slope upward as the feature value increases.

## **Correlation Heatmap**

In [None]:
# Correlation Heatmap visualization code
plt.figure(figsize=(15,5))
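# 'sex' and 'is_smoking' are still text columns here, so restrict to numeric columns
sns.heatmap(data.corr(numeric_only=True), annot=True)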

The correlation coefficient is a numerical measure of the strength and direction of a linear relationship between two variables. In other words, it measures the extent to which changes in one variable are associated with changes in the other variable. The correlation coefficient ranges from -1 to 1, with -1 indicating a perfect negative correlation, 1 indicating a perfect positive correlation, and 0 indicating no correlation.

The correlation coefficient is an important tool in data analysis and machine learning, as it can help to identify relationships between variables and can be used in feature selection techniques to remove highly correlated features, which can reduce overfitting and improve the performance of the model.
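To make the definition concrete, here is a minimal sketch that computes the Pearson correlation coefficient by hand. The helper `pearson_r` and the toy values are assumptions made for illustration; they are not taken from the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance scaled by the spread of both variables."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    numerator = (x_centered * y_centered).sum()
    denominator = np.sqrt((x_centered ** 2).sum() * (y_centered ** 2).sum())
    return float(numerator / denominator)

# Hypothetical toy values, chosen so the relationships are exactly linear
age = [40, 50, 60, 70]
sys_bp = [120, 130, 140, 150]     # rises with age
heart_rate = [90, 80, 70, 60]     # falls with age

r_pos = pearson_r(age, sys_bp)
r_neg = pearson_r(age, heart_rate)
print("age vs sys_bp:", round(r_pos, 3))
print("age vs heart_rate:", round(r_neg, 3))
```

A perfectly increasing linear relationship gives a coefficient of 1 and a perfectly decreasing one gives -1; real features usually fall somewhere in between.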

## ***5. Hypothesis Testing***

### Hypothetical Statement - 1

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null hypothesis-** There is no relationship between 'is_smoking' and 'TenYearCHD'

**Alternate hypothesis-** There is a relationship between 'is_smoking' and 'TenYearCHD'


#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
import statsmodels.api as sm

#defining null and alternate hypothesis
null="there is no relationship between is_smoking and TenYearCHD"
alt="there is relationship between is_smoking and TenYearCHD"

#performing linear regression
x=sm.add_constant(data['is_smoking'].apply(lambda x: 1 if x=='YES' else 0))
y=data['TenYearCHD']
model=sm.OLS(y,x).fit()

print(model.summary())

#extracting p-value of the is_smoking coefficient (position 0 is the constant)
p_val=model.pvalues.iloc[1]
print("P-value",p_val)


##### Which statistical test have you done to obtain P-Value?

**I have used the OLS (ordinary least squares) function from the statsmodels package.**

**In this case, the p-value is 0.0468, which is less than the significance level of 0.05, so we can reject the null hypothesis. Therefore, based on the given data, we have enough evidence to support the alternate hypothesis that "There is a relationship between is_smoking and TenYearCHD".**

### Hypothetical Statement - 2

#### 1. State Your research hypothesis as a null hypothesis and alternate hypothesis.

**Null hypothesis-** There is no relation between 'diabetes' and 'TenYearCHD'

**Alternate hypothesis-** There is a relation between 'diabetes' and 'TenYearCHD'

#### 2. Perform an appropriate statistical test.

In [None]:
# Perform Statistical Test to obtain P-Value
#defining null and alternate hypothesis
null="there is no relationship between diabetes and TenYearCHD"
alt="there is relationship between diabetes and TenYearCHD"

#performing linear regression
x=sm.add_constant(data['diabetes'])
y=data['TenYearCHD']
model=sm.OLS(y,x).fit()

print(model.summary())

#extracting p-value of the diabetes coefficient (position 0 is the constant)
p_val=round(model.pvalues.iloc[1], 5)
print("P-value",p_val)

##### Which statistical test have you done to obtain P-Value?

**I have used the OLS (ordinary least squares) function from the statsmodels package.**

**In this case, the p-value rounds to 0.0, which is less than the significance level of 0.05, so we can reject the null hypothesis. Therefore, based on the given data, we have enough evidence to support the alternate hypothesis that "There is a relation between diabetes and TenYearCHD".**
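Because both variables in these hypotheses are binary, a chi-square test of independence is a common alternative to regressing one on the other. The sketch below is only an illustration of that alternative, not the test used above. The function `chi_square_2x2` and the counts in `toy_table` are hypothetical; with the real data the table could be built with `pd.crosstab`.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts
    (no continuity correction)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # expected count if the two variables were independent
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    # With 1 degree of freedom the chi-square tail probability is erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Hypothetical counts: rows = diabetes (no, yes), columns = TenYearCHD (0, 1)
toy_table = [[90, 10], [30, 20]]
stat, p = chi_square_2x2(toy_table)
print("chi-square statistic:", round(stat, 2))
print("reject null at 0.05:", p < 0.05)
```

The decision rule is the same as before: if the p-value is below the chosen significance level, the null hypothesis of no relationship is rejected.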

## ***6. Feature Engineering & Data Pre-processing***

Feature engineering is the process of creating new features from existing ones to improve the performance of a machine learning model. This involves transforming raw data into a more useful and informative form, by either creating new features from the existing data, or selecting only the most relevant features from the raw data.

The goal of feature engineering is to extract relevant information from the raw data and represent it in a way that can be easily understood by the machine learning model. The success of a machine learning model depends heavily on the quality of the features used as inputs, so feature engineering plays an important role in model performance.

### 1. Handling Missing Values

In [None]:
# Handling Missing Values & Missing Value Imputation

# NOTE-missing values has already been imputed

### 2. Handling Outliers

In [None]:
# Handling Outliers & Outlier treatments

# figsize
plt.figure(figsize=(15,5))
# boxplot of numerical features
sns.boxplot(data=data[continuous])
plt.show()

In [None]:
# we are going to replace the datapoints with upper and lower bound of all the outliers

def clip_outliers(data):
    for col in data[continuous]:
        # using IQR method to define range of upper and lower limit.
        q1 = data[col].quantile(0.25)
        q3 = data[col].quantile(0.75)
        iqr = q3 - q1
        lower_bound = q1 - 1.5 * iqr
        upper_bound = q3 + 1.5 * iqr

        # replacing the outliers with upper and lower bound
        data[col] = data[col].clip(lower_bound, upper_bound)
    return data

In [None]:
# using the function to treat outliers
data = clip_outliers(data)

In [None]:
# checking the boxplot after outlier treatment

# figsize
plt.figure(figsize=(15,5))
# boxplot of numerical features
sns.boxplot(data=data[continuous])
plt.show()

##### What all outlier treatment techniques have you used and why did you use those techniques?

Since we have a limited number of data points, we do not simply remove the outliers; instead, we use the clipping method.

**Clipping Method:** In this method, we set a cap on the outlier values: any value that is higher than the upper threshold or lower than the lower threshold is treated as an outlier. Such values are replaced with the nearest boundary of the specified range, that is, the lower bound or the upper bound.
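As a small worked example of IQR-based clipping, the sketch below applies the same 1.5 * IQR rule to a handful of made-up numbers. The array `values` is hypothetical and only meant to show how one extreme value is pulled back to the upper bound.

```python
import numpy as np

# Hypothetical measurements; 95 is far away from the rest
values = np.array([10, 12, 13, 14, 15, 16, 18, 95], dtype=float)

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

# Values outside [lower_bound, upper_bound] are replaced by the nearest bound
clipped = np.clip(values, lower_bound, upper_bound)

print("bounds:", lower_bound, upper_bound)
print("clipped:", clipped)
```

Only the extreme value changes; every point inside the bounds is left untouched, so no rows are lost.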

In [None]:
# checking for distribution after treating outliers.
for col in continuous:
  fig, ax =plt.subplots(1,2, figsize=(10,3))
  sns.histplot(data[col], kde=True, ax=ax[0]).set(title="Distribution")
  sns.boxplot(x=data[col], ax=ax[1]).set(title="Outlier")
  plt.suptitle(f'{col.title()}',weight='bold')
  plt.tight_layout()

**We can also observe some shifts in the distributions after treating outliers. Some features were skewed before outlier handling, but afterwards they approximately follow a normal distribution. Therefore, we do not apply any further numerical feature transformation.**

### 3. Categorical Encoding

**Except for the 'sex' and 'is_smoking' columns, almost all of the categories in the dataset are already represented numerically (ordinal). Therefore, we are encoding these two columns**

In [None]:
# Encode your categorical columns
data['sex'] = data['sex'].map({'M':1, 'F':0})
data['is_smoking'] =data['is_smoking'].map({'YES':1, 'NO':0})

### 4. Feature Manipulation & Selection

#### 1. Feature Manipulation

**Blood pressure readings are given as two numbers.**

* The top (higher) number is the systolic blood pressure.

* The bottom (lower) number is the diastolic blood pressure.

Pulse pressure is the difference between the top number (systolic) and the bottom number (diastolic). For instance, if the resting blood pressure is 120/80 millimeters of mercury (mm Hg), the pulse pressure is 40, which is considered healthy. A pulse pressure greater than 40 mm Hg is typically considered unhealthy.

The risk of a heart event, such as a heart attack or stroke, can be predicted by measuring pulse pressure. High pulse pressure, especially in older people, is considered a risk factor for cardiovascular disease.

In [None]:
# adding new column PulsePressure
data['pulse_pressure'] = data['sysBP'] - data['diaBP']

# dropping the sysBP and diaBP columns
data.drop(columns=['sysBP', 'diaBP'], inplace=True)

#### 2. Feature Selection

In [None]:
plt.figure(figsize=(15,5))
sns.heatmap(data.corr(), annot=True)

If a person smokes (is_smoking is 'YES', encoded above as 1) but the number of cigarettes smoked per day (cigsPerDay) is 0, the record is contradictory, and we must treat such records.

In [None]:
# checking whether the provided information is consistent
# (is_smoking was encoded above: 'YES' -> 1, 'NO' -> 0)
data[(data.is_smoking == 1) & (data.cigsPerDay == 0)]

Since the is_smoking and cigsPerDay columns do not contain any conflicting cases, the cigsPerDay column alone is sufficient to convey whether a person smokes.

In [None]:
# dropping is_smoking column due to multi-collinearity

data.drop('is_smoking', axis=1, inplace=True)

#Chi-square test

In feature selection, the chi-square test can be used to determine whether a variable is related to the target variable. A low p-value indicates a significant relationship between the two variables, so the variable can be selected as an important feature for the model.

In [None]:
X = data.drop('TenYearCHD', axis=1)
y= data['TenYearCHD']

In [None]:
# importing library
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

# model fitting
ordered_rank_features = SelectKBest(score_func=chi2, k='all')
model = ordered_rank_features.fit(X,y)

# ranking feature based on importance
feature_imp = pd.Series(model.scores_,index=X.columns)
feature_imp.sort_values(ascending=False)

#Extra Trees Classifier

ExtraTreesClassifier is a tree-based method that naturally ranks features according to how well they decrease the Gini impurity (a measure of how mixed the classes in a node are) across all trees. Nodes with the least impurity are found at the ends of the trees, while the nodes near the top of the trees give the greatest decrease in impurity. As a result, we can select a subset of the most important features by pruning trees below a particular node.
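To illustrate what "decrease in Gini impurity" means, here is a minimal sketch for a single split. The helpers `gini_impurity` and `impurity_decrease` and the toy labels are assumptions for illustration; scikit-learn computes the same kind of quantity internally and averages it over all trees to produce `feature_importances_`.

```python
from collections import Counter

def gini_impurity(labels):
    """1 minus the sum of squared class proportions; 0 means a pure node."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def impurity_decrease(parent, left, right):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini_impurity(left) \
        + (len(right) / n) * gini_impurity(right)
    return gini_impurity(parent) - weighted_children

# Hypothetical node with 4 negatives and 4 positives
parent = [0, 0, 0, 0, 1, 1, 1, 1]

# A weak split leaves both children mixed
weak = impurity_decrease(parent, [0, 0, 0, 1], [0, 1, 1, 1])
# A perfect split leaves both children pure
perfect = impurity_decrease(parent, [0, 0, 0, 0], [1, 1, 1, 1])

print("weak split decrease:", weak)
print("perfect split decrease:", perfect)
```

A feature that repeatedly produces splits with a large decrease ends up with a high importance score.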

In [None]:
# importing library
from sklearn.ensemble import ExtraTreesClassifier

# model fitting
model = ExtraTreesClassifier()
model.fit(X,y)

# ranking feature based on importance
ranked_features = pd.Series(model.feature_importances_,index=X.columns)
print(ranked_features.sort_values(ascending=False))

## Feature importance

In [None]:
# plotting graph ---> Feature Importance
fig, axs = plt.subplots(1,2, figsize=(15,4))

ranked_features.sort_values(ascending=False).plot(kind='barh', title='ExtraTreesClassifier', ax=axs[0])
feature_imp.sort_values(ascending=False).plot(kind='barh', title='Chi Square test', ax=axs[1])


plt.suptitle('Feature Importance', fontsize=20, fontweight='bold', y=1.1)
plt.tight_layout()

From these graphs, we can say that the two most important features are **'age'** and **'pulse_pressure'** to predict the target variable

As we discussed earlier, in the healthcare industry, every piece of data is crucial for analyzing or forecasting the target variable. The entries in this dataset are person-specific, the values vary between individuals, and all of the features are very important.

That is why I am using all features, except multi-collinear features, to train the model.

In [None]:
# plotting correlation heatmap to check multicollinearity.
plt.figure(figsize=(15,4))
sns.heatmap(data.drop(columns='TenYearCHD').corr(),annot=True)

There is no multicollinearity among the independent variables (assuming a correlation threshold of 0.7).

### 6. Handling Imbalanced Dataset

##### Do you think the dataset is imbalanced? Explain Why.

Yes, the target variable is imbalanced, as we saw while performing the univariate analysis.

When there are significantly more instances of certain classes than others, the issue of class imbalance typically arises. Class imbalance in the target class is a problem for machine learning models because it can result in biased predictions. That is why we need to balance the target class.



In [None]:
# Handling Imbalanced Dataset (If needed)

## Handling target class imbalance using SMOTE
from collections import Counter
from imblearn.over_sampling import SMOTE

print(f'Before Handling Imbalanced class {Counter(y)}')

# Resampling the minority class
smote = SMOTE(random_state=42)
# fit predictor and target variable
X, y = smote.fit_resample(X, y)

print(f'After Handling Imbalanced class {Counter(y)}')

##### What technique did you use to handle the imbalance dataset and why? (If needed to be balanced)

**The two classes of the target variable differ significantly in size, so the data are imbalanced. We use the Synthetic Minority Oversampling Technique (SMOTE) to resolve this issue.**

SMOTE (Synthetic Minority Oversampling Technique) works by randomly selecting a minority class point and finding its k-nearest neighbors within the minority class. Synthetic points are then added between the selected point and its neighbors. These steps are repeated until the data is balanced.
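The interpolation step described above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the imbalanced-learn implementation: the function `smote_like_sample`, its parameters, and the toy points are all hypothetical, and the points are assumed to be distinct.

```python
import numpy as np

def smote_like_sample(minority, k=2, n_new=4, seed=0):
    """Create n_new synthetic points on segments between minority neighbors."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    synthetic = []
    for _ in range(n_new):
        # 1. pick a random minority point
        point = minority[rng.integers(len(minority))]
        # 2. find its k nearest minority neighbors (index 0 is the point itself)
        distances = np.linalg.norm(minority - point, axis=1)
        neighbor_idx = np.argsort(distances)[1:k + 1]
        neighbor = minority[rng.choice(neighbor_idx)]
        # 3. place a new point at a random position between the two
        gap = rng.random()
        synthetic.append(point + gap * (neighbor - point))
    return np.array(synthetic)

# Hypothetical minority-class points with two features
minority_points = [[1.0, 1.0], [2.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
new_points = smote_like_sample(minority_points)
print(new_points.shape)
```

Because each synthetic point lies on a line segment between two real minority points, the new samples stay inside the region already occupied by the minority class.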

### 7. Data Splitting

In [None]:
# Split your data to train and test. Choose Splitting ratio wisely.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=33)

print(X_train.shape)
print(X_test.shape)

##### What data splitting ratio have you used and why?

**I have used train-test splitting ratio as 80-20 which means 80% of data will be used to train the model while 20% will be used to test the model performance.**

### 8. Data Scaling

In [None]:
# Scaling your data
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
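The reason for calling `fit_transform` on the training set and only `transform` on the test set is that the mean and standard deviation must come from the training data alone. Below is a minimal sketch of the same arithmetic in NumPy; the arrays `train` and `test` hold made-up values and are not part of the dataset.

```python
import numpy as np

# Hypothetical rows of [age, totChol]
train = np.array([[50.0, 200.0], [60.0, 240.0], [70.0, 280.0]])
test = np.array([[60.0, 300.0]])

# "fit": learn the statistics from the training data only
mean = train.mean(axis=0)
std = train.std(axis=0)          # population standard deviation (ddof=0)

# "transform": apply the same statistics to both sets
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1, while the test row is expressed in the same units without its own values influencing the statistics.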

## ***7. ML Model Implementation***

In [None]:
# Importing metrics for model evaluation.
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve
from sklearn.metrics import classification_report,ConfusionMatrixDisplay

**Evaluation Metrics Used -**

**Accuracy:** Simply put, accuracy is the percentage of times that the classifier predicts correctly. It is defined as the ratio of the number of correct predictions to the total number of predictions. If a model has a 99 percent accuracy rate, you might think it is doing very well. However, this is not always the case and can be misleading in some situations.
Accuracy is useful when the target classes are well balanced, but it is not a good choice for imbalanced classes.
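A small sketch of why accuracy misleads on imbalanced classes. The labels below are made up for illustration: a "model" that always predicts the majority class scores high accuracy while missing every positive case.

```python
# Hypothetical test set: 95 patients without CHD risk (0) and 5 with CHD risk (1)
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class
y_pred = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f'accuracy = {accuracy:.2f}')  # 0.95
print(f'recall   = {recall:.2f}')    # 0.00
```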

**Confusion Matrix:** The Confusion Matrix is a performance measurement for classification problems with two or more output classes. It is a table of actual versus predicted value combinations, computed on a set of test data for which the true values are known. Accuracy, precision, and recall are computed directly from its entries, and the ROC curve is built from the confusion matrices obtained at different thresholds.

**Precision:** Precision describes the proportion of predicted positive cases that are actually positive. When False Positives are more of a concern than False Negatives, precision can be useful.

A label's precision is calculated by dividing the number of true positives by the total number of predicted positives.

**Recall:** Recall describes the proportion of actual positive cases that our model correctly predicted. When False Negatives matter more than False Positives, this metric is helpful. In medical cases, raising a false alarm matters less than missing a case: the actual positive cases should not go unnoticed. Recall for a label is defined as the number of true positives divided by the total number of actual positives.

**F1 Score:** This score incorporates both Precision and Recall metrics. It reaches its maximum of 1 when both Precision and Recall are 1, and it is pulled toward the lower of the two when they differ.

The F1 Score is the harmonic mean of recall and precision.
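The definitions above can be worked through on a single set of confusion-matrix counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts
tp, fp, fn, tn = 40, 10, 20, 30

accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct / total
precision = tp / (tp + fp)                          # true positives / predicted positives
recall = tp / (tp + fn)                             # true positives / actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f'accuracy  = {accuracy:.3f}')   # 0.700
print(f'precision = {precision:.3f}')  # 0.800
print(f'recall    = {recall:.3f}')     # 0.667
print(f'f1        = {f1:.3f}')         # 0.727
```

Note that F1 (0.727) sits between precision and recall but closer to the lower value, which is the characteristic behaviour of a harmonic mean.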

**AUC-ROC:** The Receiver Operating Characteristic (ROC) is a probability curve that separates the "signal" from the "noise" by plotting the TPR (True Positive Rate) against the FPR (False Positive Rate) at various threshold values. The Area Under the Curve (AUC) measures a classifier's ability to differentiate between classes. When AUC is equal to 1, the classifier perfectly separates all Positive and Negative class points. When AUC is 0.5, the classifier does no better than random guessing. When AUC is 0, the classifier ranks every negative above every positive, i.e. its predictions are exactly inverted.
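One way to build intuition for AUC is its ranking interpretation: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The sketch below computes it that way in plain Python; the function name `auc_by_ranking` and the scores are illustrative assumptions, and the notebook itself relies on `roc_auc_score`.

```python
def auc_by_ranking(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive example gets the higher score; ties count as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1]
perfect = [0.1, 0.2, 0.8, 0.9]   # every positive scored above every negative
inverted = [0.9, 0.8, 0.2, 0.1]  # every positive scored below every negative
constant = [0.5, 0.5, 0.5, 0.5]  # scores carry no information

print(auc_by_ranking(y_true, perfect))   # 1.0
print(auc_by_ranking(y_true, inverted))  # 0.0
print(auc_by_ranking(y_true, constant))  # 0.5
```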

In [None]:
# empty list for appending performance metric score
model_result = []

# defining function to train the input model and print its evaluation metrics


def predict(ml_model, model_name):

  '''
  Fit the given model, then predict classes and probabilities for train and test data.
  Calculates all the evaluation metrics and appends the scores to model_result.
  Plots the confusion matrix and ROC curve for the test data.
  '''

  # model fitting
  model = ml_model.fit(X_train, y_train)

  # predicting value and probability
  y_train_pred = model.predict(X_train)
  y_test_pred = model.predict(X_test)
  y_train_prob = model.predict_proba(X_train)[:,1]
  y_test_prob = model.predict_proba(X_test)[:,1]


  ''' Performance Metrics '''
  # accuracy score  ---->  (TP+TN)/(TP+FP+TN+FN)
  train_accuracy = accuracy_score(y_train, y_train_pred)
  test_accuracy = accuracy_score(y_test, y_test_pred)
  print(f'train accuracy : {round(train_accuracy,3)}')
  print(f'test accuracy : {round(test_accuracy,3)}')

  # precision score  ---->  TP/(TP+FP)
  train_precision = precision_score(y_train, y_train_pred)
  test_precision = precision_score(y_test, y_test_pred)
  print(f'train precision : {round(train_precision,3)}')
  print(f'test precision : {round(test_precision,3)}')

  # recall score  ---->  TP/(TP+FN)
  train_recall = recall_score(y_train, y_train_pred)
  test_recall = recall_score(y_test, y_test_pred)
  print(f'train recall : {round(train_recall,3)}')
  print(f'test recall : {round(test_recall,3)}')

  # f1 score  ---->  Harmonic Mean of Precision and Recall
  train_f1 = f1_score(y_train, y_train_pred)
  test_f1 = f1_score(y_test, y_test_pred)
  print(f'train f1 : {round(train_f1,3)}')
  print(f'test f1 : {round(test_f1,3)}')

  # roc_auc score  ---->  It shows how well the model can differentiate between classes.
  train_roc_auc = roc_auc_score(y_train, y_train_prob)
  test_roc_auc = roc_auc_score(y_test, y_test_prob)
  print(f'train roc_auc : {round(train_roc_auc,3)}')
  print(f'test roc_auc : {round(test_roc_auc,3)}')
  print('-'*80)

  # classification report
  print(f'classification report for test data \n{classification_report(y_test, y_test_pred)}')
  print('-'*80)


  ''' plotting Confusion Matrix '''
  ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred)
  plt.title('confusion matrix on Test data', weight='bold')
  plt.show()
  print('-'*80)

  ''' plotting ROC curve '''
  fpr, tpr, threshold = roc_curve(y_test, y_test_prob)
  plt.plot(fpr,tpr, label=f'ROC - {model_name}')
  plt.plot([0,1], [0,1], '--')
  plt.title('ROC curve on Test data', weight='bold')
  plt.xlabel('False Positive Rate----->')
  plt.ylabel('True Positive Rate----->')
  plt.legend(loc=4)
  plt.show()   # render the ROC curve before the printed output that follows


  ''' actual value vs predicted value on test data'''
  d = {'y_actual':y_test, 'y_predict':y_test_pred}
  print(pd.DataFrame(data=d).head(10).T)                   # constructing a dataframe with both actual and predicted values
  print('-'*80)

  # using the scores from the performance metrics to build the final model_result.
  model_result.append({'model':model_name,
                       'train_accuracy':train_accuracy,
                       'test_accuracy':test_accuracy,
                       'train_precision':train_precision,
                       'test_precision':test_precision,
                       'train_recall':train_recall,
                       'test_recall':test_recall,
                       'train_f1':train_f1,
                       'test_f1':test_f1,
                       'train_roc_auc':train_roc_auc,
                       'test_roc_auc':test_roc_auc})



### **ML Model - 1**

#**Logistic Regression**

Logistic regression is one of the simplest algorithms for estimating the relationship between independent variables and a single dependent binary variable and determining the likelihood of an event occurring.

The regularization parameter C controls the trade-off between keeping the model simple (risking underfitting) and letting it become more complex (risking overfitting). In scikit-learn, C is the inverse of the regularization strength: as C increases, regularization weakens and the model becomes more complex, which can lead to overfitting.
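The effect of C can be seen in the size of the fitted coefficients. The sketch below is illustrative only: it uses synthetic data from `make_classification` rather than the project dataset, and the variable names are assumptions. Smaller C means stronger regularization, which shrinks the coefficients toward zero.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the CHD dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, n_informative=3,
                                     n_redundant=0, random_state=0)

coef_norms = {}
for C in [0.01, 1.0, 100.0]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    # L2 norm of the coefficient vector: a rough measure of model complexity
    coef_norms[C] = float(np.linalg.norm(clf.coef_))
    print(f'C={C:<6} coefficient norm = {coef_norms[C]:.3f}')
```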

In [None]:
# ML Model - 1 Implementation
from sklearn.linear_model import LogisticRegression

# Fit the model, predict, and report the evaluation metrics
predict(LogisticRegression(), 'LogisticRegression')


### **ML Model - 2**

#**SVM (Support Vector Machine)**

Classification is carried out by a Support Vector Machine (SVM) by locating the hyperplane with the greatest margin between the two classes. The support vectors are the vectors (cases) that define the hyperplane. Finding a hyperplane in an N-dimensional space that clearly classifies the data points is the goal of the SVM algorithm.

The number of features determines the hyperplane's dimension. The hyperplane is just a line if there are two input features. The hyperplane transforms into a two-dimensional plane when there are three input features. When the number of features is greater than three, it becomes difficult to imagine.
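For two input features, the separating hyperplane is a line, which makes it easy to inspect. The following sketch fits a linear-kernel SVM on a tiny, made-up, linearly separable dataset and prints the line, the support vectors, and the margin width. The data and the large `C` value (used to approximate a hard margin) are assumptions for illustration; the notebook itself uses the default kernel.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data with two features
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                  [4.0, 4.0], [4.0, 5.0], [5.0, 4.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard-margin SVM on separable data
svm = SVC(kernel='linear', C=1e3).fit(X_toy, y_toy)

w, b = svm.coef_[0], svm.intercept_[0]
print(f'hyperplane: {w[0]:.2f}*x1 + {w[1]:.2f}*x2 + {b:.2f} = 0')
print('support vectors:')
print(svm.support_vectors_)
print('margin width:', 2 / np.linalg.norm(w))
```

Only the support vectors determine the hyperplane; moving any of the other points (without crossing the margin) would leave the fitted line unchanged.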

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# ML Model - 2 Implementation
from sklearn.svm import SVC

# probability=True enables predict_proba, which predict() needs for the ROC-AUC score
predict(SVC(probability=True), 'SVM')

### **ML Model - 3**

##**(KNN) K-Nearest Neighbours**

K-nearest neighbors (KNN) is a supervised machine learning algorithm that can be used to solve classification and regression problems. KNN is non-parametric, i.e. it makes no assumptions about the underlying distribution of the data. The algorithm classifies an unseen data point based on the labels of the closest training points. These nearest neighbors are determined by the distance between points; common distance measures include Euclidean distance, Manhattan distance, Minkowski distance, and cosine similarity.
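The core of the algorithm fits in a few lines. The sketch below is a from-scratch illustration with Euclidean distance and majority voting, not scikit-learn's `KNeighborsClassifier`; the function name `knn_predict` and the toy points are assumptions.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    points, using Euclidean distance."""
    distances = sorted((math.dist(x, query), label)
                       for x, label in zip(train_X, train_y))
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training points: two well-separated clusters
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0)))  # 0
print(knn_predict(train_X, train_y, (4.0, 4.0)))  # 1
```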

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# Checking the optimum value of k:
accuracy=[]

# Iterating over candidate values of k
for i in range(1,15):
  knn=KNeighborsClassifier(n_neighbors=i)
  knn.fit(X_train,y_train)
  accuracy.append(knn.score(X_test, y_test))

#plotting the k-value vs accuracy
plt.title('k-NN Varying number of neighbors')
plt.plot(range(1,15), accuracy)
plt.xlabel('number of neighbours')
plt.ylabel('Accuracy')
plt.show()

The best test accuracy is at k=1, so we concentrate on low values of k. Among the other low values, k=3 is superior to k=2. For binary classification, k is typically an odd number (to prevent ties) of at least three; here we keep k=1 because it gives the highest accuracy.

#### 1. Explain the ML Model used and its performance using Evaluation metric Score Chart.

In [None]:
# Visualizing evaluation Metric Score chart
predict(KNeighborsClassifier(n_neighbors=1), 'KNN')

#**Model Result**

In [None]:
model_result = pd.DataFrame(model_result)
round(model_result,3)

Accuracy scores aren't very helpful when dealing with imbalanced data or classes.

Classifying a healthy individual as having a 10-year risk of coronary heart disease (CHD), a false positive, is acceptable because it only leads to additional medical tests. Failing to identify a patient who is at risk by classifying them as healthy (a false negative) is categorically unacceptable. As a result, the **model's recall score** will be the primary focus of our project.

Balanced accuracy is a better metric than F1 when positives and negatives are equally important. When the positive class deserves more attention, F1 is an excellent scoring metric for imbalanced data.
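The difference between the two metrics shows up clearly on a worked example. The counts below are hypothetical: balanced accuracy averages the recall of both classes, while F1 ignores true negatives and therefore reacts more strongly to errors on the positive class.

```python
def balanced_accuracy(tp, fp, fn, tn):
    # mean of recall on the positive class and recall on the negative class
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))

def f1(tp, fp, fn, tn):
    # F1 does not use true negatives at all
    return 2 * tp / (2 * tp + fp + fn)

# Hypothetical imbalanced test set: 10 positives, 90 negatives
counts = dict(tp=6, fp=9, fn=4, tn=81)

plain_accuracy = (counts['tp'] + counts['tn']) / sum(counts.values())
print(f'accuracy          = {plain_accuracy:.2f}')              # 0.87
print(f'balanced accuracy = {balanced_accuracy(**counts):.2f}') # 0.75
print(f'f1                = {f1(**counts):.2f}')                # 0.48
```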

In [None]:
# plotting graph to compare model performance of all the models
fig, axs = plt.subplots(1,2, figsize=(15,5))
sns.barplot(x=model_result['model'], y=model_result['test_recall'], ax=axs[0])   # Model vs Recall score
sns.barplot(x=model_result['model'], y=model_result['test_f1'], ax=axs[1])       # Model vs F1 score
plt.tight_layout()

KNN is clearly the best-performing model based on these metrics. We do not want to predict that a person is safe when they actually have a 10-year risk of CHD, so the final model we chose was KNN.

# **Conclusion**

In this project, we tackled a classification problem in which we had to classify and predict the 10-year risk of future coronary heart disease (CHD) for patients. The goal of the project was to develop a tool for the early detection and prevention of CHD, addressing a significant public health concern using machine learning techniques.

1. There were approximately 3390 records and 16 attributes in the dataset.

2. We started by importing the dataset and the necessary libraries, then conducted exploratory data analysis (EDA) to gain a clear insight into each feature by separating the dataset into numeric and categorical features. We performed univariate, bivariate, and multivariate analyses.

3. After that, outliers and null values in the raw data were treated or removed. The data was then transformed to ensure that it was compatible with machine learning models.

4. In feature engineering we transformed raw data into a more useful and informative form, by creating new features, encoding, and understanding important features. We handled target class imbalance using SMOTE.

5. Finally, the cleaned and scaled data was passed to several models, evaluation metrics were computed for each, and we tuned hyperparameters to make sure suitable parameter values were being used. To select the final model based on our requirements, we compared the scores in model_result.

6. When developing a machine learning model, it is generally recommended to track multiple metrics because each one highlights distinct aspects of model performance. We are, however, focusing more on the Recall score and F1 score because we are dealing with healthcare data and our data is unbalanced.

7. Our highest recall score, 0.951 (95.1%), came from KNN.


The recall score is of the utmost significance in the medical field, where we place a greater emphasis on reducing false negatives because we do not want to predict that a person is safe when they are actually at risk.


