# Model for Cardiovascular Disease Prediction

In this project, we analyze the [Cardiovascular Disease dataset](https://www.kaggle.com/sulianova/cardiovascular-disease-dataset) to find which variables are related to the disease. We then use several machine learning models to predict whether a patient has cardiovascular disease. <br>The dataset contains information about patients undergoing a cardiovascular disease examination.<br><br>
**Data features:**
   - Age | Objective Feature | age | int (days) |<br>
   - Height | Objective Feature | height | int (cm) |<br>
   - Weight | Objective Feature | weight | float (kg) |<br>
   - Gender | Objective Feature | gender | categorical code |<br>
   - Systolic blood pressure | Examination Feature | ap_hi | int |<br>
   - Diastolic blood pressure | Examination Feature | ap_lo | int |<br>
   - Cholesterol | Examination Feature | cholesterol | 1: normal, 2: above normal, 3: well above normal |<br>
   - Glucose | Examination Feature | gluc | 1: normal, 2: above normal, 3: well above normal |<br>
   - Smoking | Subjective Feature | smoke | binary |<br>
   - Alcohol intake | Subjective Feature | alco | binary |<br>
   - Physical activity | Subjective Feature | active | binary |<br>
   - Presence or absence of cardiovascular disease | Target Variable | cardio | binary |<br>


# 1. Data preprocessing and cleaning

In this stage, we're working to achieve a clean dataset by removing duplicates and extracting the variables we need, such as the patient's age, gender, height, etc.<br>
Below, we import the libraries we'll use: pandas, a popular Python data analysis library, and other utility libraries.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

Here, we read the data file, which is saved as CSV (Comma-Separated Values). We tell pandas that the values are separated by semicolons; the first row is used as the data heading by default.<br>
The head method in the second line lists the first 5 rows of the file, excluding the data heading.

In [None]:
df = pd.read_csv('/kaggle/input/cardiovascular-disease-dataset/cardio_train.csv', sep=";")
df.head()
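As a small illustration of why the separator matters, the sketch below parses a made-up two-row sample (not taken from the real file) with and without `sep=";"`. The sample text and the variable names are assumptions for demonstration only.

```python
import io
import pandas as pd

# Hypothetical sample in the same semicolon-separated layout as the dataset
sample = io.StringIO(
    "id;age;gender;height;weight\n"
    "0;18393;2;168;62.0\n"
    "1;20228;1;156;85.0\n"
)

with_sep = pd.read_csv(sample, sep=";")   # five separate columns
sample.seek(0)                            # rewind the buffer to read it again
without_sep = pd.read_csv(sample)         # default sep="," leaves one wide column

print(with_sep.shape)
print(without_sep.shape)
```

Without the separator argument, pandas finds no commas and keeps each row as a single string column, which is why the argument is needed for this file.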

We need to know the number of rows and columns in our data, which we get from the shape attribute.

In [None]:
print(f"Number of columns: { df.shape[1] }")

In [None]:
print(f"Number of rows: { df.shape[0] }")

Checking whether any cell contains a NULL value.

In [None]:
df.info()

There are no missing values.

In our dataset, the patient's age is given in days, so we convert it to years and round it to 2 decimal places. We also replace the gender column with two columns, one for male and one for female. If the patient is male, the male column holds 1 and the female column holds 0, and vice versa.

In [None]:
df['age'] = round(df['age']/365.25,2)

In [None]:
df.insert(3, "female", (df['gender']==1).astype(int))
df.insert(4, 'male', (df['gender']==2).astype(int))
df.drop(['gender', 'id'], axis=1, inplace=True)
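The two transformations above can be checked on a tiny hand-made frame. This is only a sketch: the two rows are invented, and the coding of gender (1 for female, 2 for male) is assumed from the code above.

```python
import pandas as pd

# Hypothetical two-patient frame: age in days, gender coded 1 (female) / 2 (male)
toy = pd.DataFrame({"age": [18393, 20228], "gender": [2, 1]})

toy["age"] = (toy["age"] / 365.25).round(2)        # days -> years
toy["female"] = (toy["gender"] == 1).astype(int)   # 1 if female, else 0
toy["male"] = (toy["gender"] == 2).astype(int)     # 1 if male, else 0
toy = toy.drop(columns="gender")

print(toy)
```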

Checking whether there are any duplicate rows and printing their count.

In [None]:
df.duplicated().sum()

Dropping all duplicated rows.

In [None]:
df.drop_duplicates(inplace=True)

Below, we calculate the patient's BMI (Body Mass Index) using the formula $\frac{Weight}{Height^{2}}$, with weight in kilograms and height in meters.<br>
In our dataset, height is recorded in centimeters, so we divide it by 100 to convert it to meters.

In [None]:
df.insert(5, 'bmi', round((df['weight']/(df['height']/100)**2), 2))
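As a quick worked example of the formula, here is a plain-Python sketch for a single hypothetical patient (70 kg, 175 cm). The helper name `bmi` is an illustration and is not used elsewhere in the notebook.

```python
def bmi(weight_kg, height_cm):
    """Body mass index: weight in kg divided by squared height in meters."""
    height_m = height_cm / 100          # centimeters -> meters
    return round(weight_kg / height_m ** 2, 2)

print(bmi(70, 175))  # 70 / 1.75**2 = 22.857..., rounded to 22.86
```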

The describe method used below gives us ready-made summary statistics such as the mean, the standard deviation (std), the minimum and maximum, and the three quartiles.

In [None]:
df.describe()

From this summary, we've found the following:
- The **mean age** of patients is 53.
- The **percentage of males** is 35%.
- The **percentage of females** is 65%.
- The **percentage of smokers** is 8%.
- The **percentage of patients who drink alcohol** is 5%.
- The **percentage of physically active patients** is 80%.

There seem to be many outliers in the body mass index, probably due to data entry mistakes. So, let's drop them.

In [None]:
df.drop(df.query('bmi >60 or bmi <15').index, axis=0, inplace=True)

<p>Here we need to categorize the blood pressure stages according to the systolic and diastolic pressure.</p>
<img src="https://img.webmd.com/dtmcms/live/webmd/consumer_assets/site_images/article_thumbnails/other/blood_pressure_charts/basic_blood_pressure_chart.png" width="50%">

Here, we create a function that builds a column called bp_cat (Blood Pressure Category). The function reads the ap_hi and ap_lo columns of each row and, based on their values, categorizes the patient's blood pressure as Normal, Elevated, High Blood Pressure Stage 1, High Blood Pressure Stage 2 or Hypertensive Crisis.

In [None]:
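def BPCategorize(x, y):
    # x: systolic pressure (ap_hi), y: diastolic pressure (ap_lo).
    # A reading belongs to a category only if BOTH values stay within its upper
    # bounds; with `or`, a reading such as 200/85 would be labelled 'high 1'.
    if x<=120 and y<=80:
        return 'normal'
    elif x<=129 and y<=80:
        return 'elevated'
    elif x<=139 and y<=89:
        return 'high 1'    # high blood pressure stage 1
    elif x<=180 and y<=120:
        return 'high 2'    # high blood pressure stage 2
    else:
        return 'high 3'    # hypertensive crisis: x>180 or y>120

df.insert(8, "bp_cat", df.apply(lambda row: BPCategorize(row['ap_hi'], row['ap_lo']), axis=1))
df['bp_cat'].value_counts()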
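Applying a Python function row by row can be slow on tens of thousands of rows. The sketch below shows one possible vectorized alternative with `numpy.select`, using the same upper bounds as the function above. It assumes a reading falls into a category only when both values stay within that category's bounds; the six readings are invented for illustration.

```python
import numpy as np

# Hypothetical readings: systolic (ap_hi) and diastolic (ap_lo)
ap_hi = np.array([110, 125, 135, 150, 200, 200])
ap_lo = np.array([70, 75, 85, 95, 130, 85])

# Conditions are checked in order; the first one that holds wins
conditions = [
    (ap_hi <= 120) & (ap_lo <= 80),
    (ap_hi <= 129) & (ap_lo <= 80),
    (ap_hi <= 139) & (ap_lo <= 89),
    (ap_hi <= 180) & (ap_lo <= 120),
]
labels = ['normal', 'elevated', 'high 1', 'high 2']

# Anything that matches no condition exceeds 180 or 120: hypertensive crisis
bp_cat = np.select(conditions, labels, default='high 3')
print(bp_cat.tolist())
```

On a DataFrame, the same conditions could be built from `df['ap_hi']` and `df['ap_lo']` and the result assigned to a new column.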

We can also drop outliers from the blood pressure variables.

In [None]:
df.drop(df.query('ap_hi >220 or ap_lo >180 or ap_hi<40 or ap_lo<40').index, axis=0, inplace=True)

**Finally, we've finished cleaning and sorting our dataset according to our needs.**

In [None]:
df.head()

# 2. Data analysis

In this stage, we use the pyplot and seaborn libraries to analyze our data through visualizations such as pie charts, bar charts and boxplots instead of plain numbers and tables.

Here, we define a single base color for the visualizations to avoid distracting the reader.

In [None]:
base_color = sb.color_palette()[0]

We wanted to show the percentage of males and females using a pie chart, but we replaced the gender column earlier with two indicator columns to make the data easier to use in Machine Learning models. As a workaround, we create a function that merges the gender columns (male, female) back into a single gender label.

In [None]:
def gender(x, y):
    if x==1:
        return 'female'
    else:
        return 'male'

Here, we're using the libraries mentioned before in creating the visualizations we want.

In [None]:
fig, ax = plt.subplots(1,2, figsize=(14,20))
plt.tight_layout(pad=10)
# sort_index() fixes the order to [0, 1] so the labels always match the slices
cardio_counts = df['cardio'].value_counts().sort_index()
ax[0].pie(x=cardio_counts, labels=['No cardio', 'Cardio'],autopct='%1.1f%%', shadow=True, startangle=90, explode=(0.05,0.0))
ax[0].title.set_text('Cardio percentage')
# use a new name so the gender() function is not overwritten when the cell is re-run
gender_counts = df.query("cardio == 1").apply(lambda row: gender(row['female'], row['male']), axis=1).value_counts()
ax[1].pie(x=gender_counts, labels=gender_counts.index.str.capitalize(),autopct='%1.1f%%', shadow=True, startangle=90, explode=(0.05,0.0))
ax[1].title.set_text('Cardiovascular patients gender percentage')
plt.show()

**From the charts, we've found the following:**
- About 50% of the people in the dataset have cardiovascular disease.
- Of the patients with cardiovascular disease, 35.3% are male.
- Of the patients with cardiovascular disease, 64.7% are female.

<hr>

**Here we're making boxplots to compare the age and body mass index for the cardio and non-cardio patients.**

In [None]:
fig, ax = plt.subplots(ncols=2, figsize=(20,10))
plt.tight_layout(pad=18)
sb.boxplot(data=df, x='cardio', y='age', ax=ax[0], color=base_color)
sb.boxplot(data=df, x='cardio', y='bmi', showfliers=False, ax=ax[1], color=base_color)
ax[0].title.set_text('Age')
ax[0].set_xticklabels(['No-cardio', 'Cardio'])
ax[1].title.set_text('body mass index')
ax[1].set_xticklabels(['No-cardio', 'Cardio'])
ax[0].set_xlabel("")
ax[1].set_xlabel("")
;

**From these boxplots, we've found the following:**
- There is a relation between age and cardiovascular disease: older people are more likely to have it.
- There is also a relation between BMI and cardiovascular disease: people with a higher BMI are more likely to have it.

<hr>

In the following code, we're trying to find other general relations between our variables in the dataset.

In [None]:
fig, ax = plt.subplots(ncols=3, nrows=2, figsize=(20,13), sharey=True)
plt.tight_layout(pad=3)

# The mean of the binary cardio column within each group is the share of cardio patients.
# numeric_only=True skips the text column bp_cat when averaging.
df_gluc = df.groupby('gluc').mean(numeric_only=True)
sb.barplot(data=df_gluc, x=df_gluc.index, y='cardio', ax=ax[0][0], color=base_color)
ax[0][0].set_xticklabels(['normal', 'above normal', 'well above normal'])
ax[0][0].set_yticks(np.arange(0, 1.2, 0.1))
ax[0][0].set_yticklabels(np.arange(0, 120, 10))

df_cholesterol = df.groupby('cholesterol').mean(numeric_only=True)
sb.barplot(data=df_cholesterol, x=df_cholesterol.index, y='cardio', ax=ax[0][1], color=base_color)
ax[0][1].set_xticklabels(['normal', 'above normal', 'well above normal'])

df_smoke = df.groupby('smoke').mean(numeric_only=True)
sb.barplot(data=df_smoke, x=df_smoke.index, y='cardio', ax=ax[0][2], color=base_color)

df_alco = df.groupby('alco').mean(numeric_only=True)
sb.barplot(data=df_alco, x=df_alco.index, y='cardio', ax=ax[1][0], color=base_color)

df_active = df.groupby('active').mean(numeric_only=True)
sb.barplot(data=df_active, x=df_active.index, y='cardio', ax=ax[1][1], color=base_color)

df_bp = df.groupby('bp_cat').mean(numeric_only=True)
sb.barplot(data=df_bp, x=df_bp.index, y='cardio', ax=ax[1][2], color=base_color,
           order=['normal', 'elevated', 'high 1', 'high 2', 'high 3'])
plt.setp(ax[:, :], ylabel='')
plt.setp(ax[:, 0], ylabel='Cardio Percentage')
plt.show()

**From these bar charts, we've found the following:**
- A relation is found between glucose levels and cardiovascular disease: about 60% of people with well above normal glucose levels have a cardiovascular disease.
- Another significant relation is found between cholesterol levels and cardiovascular disease: higher cholesterol levels mean a higher chance of having a cardiovascular disease.
- We've found no strong relationship between smoking or alcohol intake and cardiovascular disease.
- A minor relationship is found between physical activity and cardiovascular disease: inactive people are somewhat more likely to have it.
- A major direct relationship is found between blood pressure and cardiovascular disease: people with the highest blood pressure levels have the highest chance of having a cardiovascular disease.
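The bar heights in these charts are group means of the binary cardio column, which equal the share of cardio patients in each group. A minimal sketch on an invented six-row frame (the values are assumptions, not dataset results):

```python
import pandas as pd

# Hypothetical data: cholesterol level and disease flag
toy = pd.DataFrame({
    "cholesterol": [1, 1, 1, 1, 3, 3],
    "cardio":      [0, 0, 1, 0, 1, 1],
})

# Mean of a 0/1 column = fraction of ones; multiply by 100 for a percentage
rates = toy.groupby("cholesterol")["cardio"].mean() * 100
print(rates)
```

Selecting the `cardio` column before calling `mean` also avoids averaging columns that are not needed for the chart.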

# 3. Probability and Statistics 

**Probability that a person has cardio diseases given that the person is 50 or older**

In [None]:
df_age_50 = df.query('age >=50')
df_age_50_cardio = df_age_50.query('cardio==1')
round(df_age_50_cardio.shape[0]*100/df_age_50.shape[0],2)

**Probability that a person has cardio diseases given that the person has a body mass index of 37 or higher**

In [None]:
df_bmi37 = df.query('bmi >=37')
df_bmi37_cardio = df_bmi37.query('cardio ==1')
round(df_bmi37_cardio.shape[0]*100/df_bmi37.shape[0],2)

**Probability that a person has cardio diseases given that the patient has a hypertensive crisis**

In [None]:
df_high3 = df.query("bp_cat == 'high 3'")
df_high_cardio = df_high3.query('cardio == 1')
round(df_high_cardio.shape[0]*100/df_high3.shape[0],2)

**Probability that a person drinks alcohol or smokes**

In [None]:
df_cohol_smoke = df.query("alco==1 or smoke==1")
print(df_cohol_smoke.shape[0]*100/df.shape[0])

**Probability that a person has cardio diseases given that the patient drinks alcohol or smokes**

In [None]:
df_cohol_smoke_cardio = df_cohol_smoke.query('cardio==1')
df_cohol_smoke_cardio.shape[0]*100/df_cohol_smoke.shape[0]

**Probability that a person has cardio diseases given that the patient is not active**

In [None]:
df_not_active = df.query('active==0')
df_not_active_cardio = df_not_active.query('cardio==1')
df_not_active_cardio.shape[0]*100/df_not_active.shape[0]
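All of the cells in this section follow the same pattern: filter the rows that satisfy a condition, then measure the share of cardio patients among them. One way to express that pattern once is sketched below. The helper name `conditional_rate` and the toy data are assumptions for illustration.

```python
import pandas as pd

def conditional_rate(frame, condition, target="cardio"):
    """Estimate P(target == 1 | condition) in percent, rounded to 2 decimals."""
    subset = frame.query(condition)          # rows where the condition holds
    return round(subset[target].mean() * 100, 2)

# Hypothetical four-patient frame
toy = pd.DataFrame({"age": [45, 52, 60, 70], "cardio": [0, 1, 0, 1]})
print(conditional_rate(toy, "age >= 50"))    # 2 of the 3 patients aged 50+ have cardio
```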

# 4. Predicting using Machine Learning

In this stage, we use Machine Learning (ML) to predict the presence of cardiovascular disease in patients from our dataset. Since many ML algorithms are in wide use, we try several of them and compare their results.

In the cell below, we import the libraries that let us use Machine Learning algorithms.<br>These are the main algorithms we use:
- Random Forest Classifier
- Support Vector Classifier
- K Neighbors Classifier
- XGBoost (Extreme Gradient Boosting) Classifier

We also import metrics to evaluate our predictions. For prediction evaluation, we use the following metrics:
- Accuracy Score
- Confusion Matrix

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV, cross_val_predict, cross_validate
from sklearn.metrics import accuracy_score, r2_score, confusion_matrix, plot_confusion_matrix, plot_roc_curve
from xgboost import plot_importance
import warnings
warnings.filterwarnings('ignore')

In [None]:
df.head()

Here, we drop the cardio column from the features, as it is our prediction target.<br>Since the Machine Learning models used here work with numerical values only, we also drop the blood pressure category column because its data type is string.

In [None]:
X = df.drop(['cardio', 'bp_cat'], axis=1)
y = df['cardio']

We need to split our data in two groups, one is used for training our model and the other is for testing and evaluating.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)
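By default, `train_test_split` holds out 25% of the rows and shuffles them differently on every run, so scores change between runs. The sketch below shows one way to make a split reproducible and class-balanced using `random_state` and `stratify`. The toy arrays and the seed value 0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples, 2 features, perfectly balanced labels
X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0, 1] * 10)

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)
print(X_tr.shape, X_te.shape)
```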

In [None]:
# check for linearity: fit a linear regression and report R^2 on the full data
m9 = LinearRegression().fit(X, y)
r2_score(y, m9.predict(X))  # r2_score expects (y_true, y_pred)

<hr>

#### 4.1 Random Forest Model

A Random Forest Classifier consists of multiple decision trees, each of which provides its own prediction. The class with the most votes becomes the model's prediction.

<img src="https://upload.wikimedia.org/wikipedia/commons/7/76/Random_forest_diagram_complete.png" width="30%">
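The voting step can be illustrated in plain Python. This sketch only shows how a majority vote over per-tree predictions is taken; the five votes are invented, and a real library implementation may combine the trees differently (for example by averaging class probabilities).

```python
from collections import Counter

def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Hypothetical predictions from five trees for one patient (1 = cardio)
votes = [1, 0, 1, 1, 0]
print(majority_vote(votes))  # three of five trees vote for class 1
```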

In this piece of code, we're using our trained model to predict new values.

As we can see in the output, the model accuracy reached 71.1%, with 2642 type-one errors.

<hr>

In [None]:
# rrfp = {'bootstrap': [True],
#  'max_depth': [10],
#  'max_features': ['sqrt'],
#  'min_samples_leaf': [1],
#  'min_samples_split': [2],
#  'n_estimators': [55,51]}

In [None]:
# rrfm = RandomizedSearchCV(RandomForestClassifier(),
#                               param_distributions = rrfp,
#                               n_iter = 100,
#                               cv = 5, verbose=0,
#                               random_state=0,
#                               n_jobs = -1)

In [None]:
# rrfm.fit(X_train, y_train)
# ;

In [None]:
# rrfm.best_params_

Here, we're creating a model through our classifier and training it.

In [None]:
df.head()

In [None]:
random_model = RandomForestClassifier(n_estimators=51,
                          max_depth=10,
                          random_state=0)

random_model.fit(X_train, y_train)
# accuracy_score expects (y_true, y_pred); round after scaling to avoid float artifacts
print(f"Testing accuracy: {round(accuracy_score(y_test, random_model.predict(X_test))*100, 2)}%")
print(f"Average testing accuracy: {round(cross_validate(random_model, X, y, cv=5)['test_score'].mean()*100,2)}%")

In [None]:
plot_confusion_matrix(random_model, X_test, y_test, values_format='d')
;

In [None]:
plot_roc_curve(random_model, X_test, y_test)

#### 4.2 Support Vector Model

In SVC, each data item (row) is plotted as a point in n-dimensional space, where n is the number of features (columns) in our dataset and the value of each column is the value of one coordinate. Classification is then done by finding the hyperplane that best separates the two classes.
<br><br><img width="20%" src="https://miro.medium.com/max/1088/1*6U9NrruycDBsPOyivpn8UQ.png">
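To make the hyperplane idea concrete, the sketch below fits a linear-kernel SVC on six invented 2-D points and reproduces its predictions by hand from the learned weights. It uses a linear kernel for readability, whereas the notebook's commented-out search uses an RBF kernel, so this is an illustration rather than the model evaluated here.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, clearly separable points: class 0 near (1, 1), class 1 near (6, 6)
X_toy = np.array([[1.0, 1.0], [1.5, 0.5], [2.0, 1.0],
                  [6.0, 6.0], [6.5, 5.5], [7.0, 6.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear").fit(X_toy, y_toy)

# For a linear kernel the hyperplane is w . x + b = 0
w, b = clf.coef_[0], clf.intercept_[0]

# Points on the positive side of the hyperplane get class 1
manual = (X_toy @ w + b > 0).astype(int)
print(manual, clf.predict(X_toy))
```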

Here, we're creating a model and training it.

In [None]:
# svc_param_grid = {'C': [100,150],  
#               'gamma': [0.00001, 0.000001], 
#               'kernel': ['rbf']}  
  
# grid = GridSearchCV(SVC(), svc_param_grid, refit = True, verbose = 0) 
# # fitting the model for grid search 
# grid.fit(X_train, y_train) 
# ;

In [None]:
# grid.best_params_

In [None]:
# svc_model = SVC(C=100, gamma=0.00001, kernel="rbf", random_state=42)
# svc_cv = cross_validate(svc_model, X, y, cv=5)
# svc_cv

Below, we're using our trained model in order to provide a prediction for new values.

In [None]:
# svc_cv['test_score'].mean()

In [None]:
# svc_model.fit(X_train, y_train)
# csv_pred = svc_model.predict(X_train)

In this algorithm, the precision is higher, but the type-one error count is also higher than in the Random Forest Classifier that we used before, so the overall performance of this model is considered unreliable.

<hr>

#### 4.3 K Neighbors Model

In the K Neighbors Classifier, the algorithm assumes that similar things exist in close proximity to each other. It calculates the distance between the new value and the existing values, finds the k nearest neighbors, and lets them vote on the prediction.<br>
E.g., in the following picture, if we take K = 3, the new point is predicted as class B, but if we take K = 7, it is predicted as class A.
<br><br><img WIDTH="20%" src="https://res.cloudinary.com/dyd911kmh/image/upload/f_auto,q_auto:best/v1531424125/KNN_final_a1mrv9.png">
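The effect of K described above can be reproduced with a few lines of plain Python. The seven labelled points and the query point are invented so that the two nearest neighbors are class B while class A dominates a wider neighborhood; the helper `knn_predict` is a sketch, not the scikit-learn implementation.

```python
import math
from collections import Counter

# Hypothetical labelled points around the query point (0, 0)
points = [
    ((1.0, 0.0), "B"), ((0.0, 1.2), "B"),
    ((1.5, 0.0), "A"), ((0.0, 2.0), "A"), ((-2.2, 0.0), "A"),
    ((0.0, -2.5), "A"), ((3.0, 0.0), "A"),
]

def knn_predict(query, labelled_points, k):
    """Vote among the k labelled points closest to the query."""
    by_distance = sorted(labelled_points, key=lambda item: math.dist(query, item[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

print(knn_predict((0.0, 0.0), points, k=3))  # neighbors: B, B, A
print(knn_predict((0.0, 0.0), points, k=7))  # neighbors: 2 x B, 5 x A
```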

Here, we're also training our model as usual.<br>
**N.B.** By trying numbers of neighbors from 5 to 300 in a for loop, we found that 200 gives the most accurate results.<br>

In [None]:
# kparams = {'n_neighbors':[5,10],
#           'leaf_size':[1,5],
#           'weights':['uniform', 'distance'],
#           'algorithm':['auto']}

# kparams = {'n_neighbors':[300],
#           'leaf_size':[1],
#           'weights':['uniform'],
#           'algorithm':['ball_tree']}

In [None]:
# krsv = RandomizedSearchCV(KNeighborsClassifier(),
#                               param_distributions = kparams,
#                               n_iter = 100,
#                               cv = 5, verbose=3,
#                               random_state=0,
#                               n_jobs = 7)

In [None]:
# krsv.fit(X_train, y_train)

In [None]:
# krsv.best_params_

In [None]:
k_model = KNeighborsClassifier(weights = 'uniform',
                               n_neighbors = 300,
                               leaf_size = 1,
                               algorithm = 'ball_tree')
k_model.fit(X_train, y_train)

In [None]:
cross_validate(k_model, X, y, cv=5)['test_score'].mean()

After training our model, we need to predict new values.

In [None]:
k_pred = k_model.predict(X_test)
print(f"score: {round((accuracy_score(k_pred, y_test)*100),2)}%")
plot_confusion_matrix(k_model, X_test, y_test, values_format='d')

In [None]:
plot_roc_curve(k_model, X_test, y_test)

As shown in the results, the precision of this algorithm is higher than that of SVC. However, its type-one error count is still higher than that of the Random Forest Classifier.

<hr>

#### 4.4 XGBoost (Extreme Gradient Boosting)

Just like Random Forest Classifier, the XGB classifier uses a decision-tree algorithm but also with a gradient boosting framework. Rather than training all of the models in isolation of one another, boosting trains models in succession, with each new model being trained to correct the errors made by the previous ones.<br><br><img width="20%" src="https://miro.medium.com/max/461/1*A9myadIB_CqJv-EJA-G_bA.png">
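The idea of training models in succession can be sketched with a simplified boosting loop: each small tree is fitted to the errors (residuals) left by the models before it. This is a bare squared-error version for a toy regression problem, written with scikit-learn's `DecisionTreeRegressor`. XGBoost itself adds regularization and other refinements, so treat this as an illustration of the principle only. The data, the learning rate of 0.5 and the 20 rounds are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Hypothetical noisy sine curve
rng = np.random.default_rng(0)
X_toy = np.linspace(0, 6, 80).reshape(-1, 1)
y_toy = np.sin(X_toy).ravel() + rng.normal(scale=0.1, size=80)

learning_rate = 0.5
prediction = np.full_like(y_toy, y_toy.mean())     # start from a constant guess
errors = [np.mean((y_toy - prediction) ** 2)]      # training error before boosting

for _ in range(20):
    residual = y_toy - prediction                  # what is still unexplained
    stump = DecisionTreeRegressor(max_depth=1).fit(X_toy, residual)
    prediction += learning_rate * stump.predict(X_toy)   # correct previous errors
    errors.append(np.mean((y_toy - prediction) ** 2))

print(round(errors[0], 4), round(errors[-1], 4))
```

The training error shrinks as rounds are added, which is also why boosted models need limits such as `max_depth`, `learning_rate` and `reg_lambda` to avoid overfitting.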

As usual, we're training our model.

In [None]:
# ROUND 1
# param_grid = {
#     'max_depth': [3, 4, 5],
#     'learning_rate': [0.1, 0.01, 0.05],
#     'gamma': [0, 0.25, 0.1],
#     'reg_lambda': [0, 1.0, 10.0],
#     'scale_pos_weight': [1, 3, 5]
# }

# ROUND 2
# param_grid = {
#     'max_depth': [3],
#     'learning_rate': [0.6, 0.5, 0.7],
#     'gamma': [0.25],
#     'reg_lambda': [50.0, 100, 150],
#     'scale_pos_weight': [3]
# }

# ROUND 3
# param_grid = {
#     'max_depth': [3],
#     'learning_rate': [0.6, 0.5, 0.7],
#     'gamma': [0.25],
#     'reg_lambda': [50.0, 100, 150],
#     'scale_pos_weight': [3]
# }

# ROUND 4  
# param_grid = {
#     'max_depth': [3],
#     'learning_rate': [0.6, 0.65, 0.55],
#     'gamma': [0.25],
#     'reg_lambda': [40.0, 50.0, 60.0],
#     'scale_pos_weight': [3]
# }

## Winner Winner Chicken Dinner!!!!!
# param_grid={'gamma': [0.24],
#  'learning_rate': [.13],
#  'max_depth': [5],
#  'reg_lambda': [50],
#     'n_estimators': [150]}

In [None]:
# optimal_params = GridSearchCV(
#     estimator=XGBClassifier(objective="binary:logistic",
#                             seed=0,
#                             subsample=0.9),
#     param_grid=param_grid,
#     scoring='roc_auc',
#     verbose=1,
#     n_jobs=7,
#     cv=5
# )

In [None]:
# optimal_params.fit(
#                 X_train, 
#                 y_train, 
#                 verbose=False,
#                 early_stopping_rounds=10,
#                 eval_metric='aucpr',
#                 eval_set=[(X_test, y_test)])

In [None]:
# optimal_params.best_params_

In [None]:
boost_model = XGBClassifier(verbosity=0, seed=0, n_estimators=150,
                            gamma= 0.24, max_depth=4, learning_rate=0.13,
                            reg_lambda=50.0, scale_pos_weight=1)

boost_model.fit(X_train, y_train)
boost_pred = boost_model.predict(X_test)
print(f"Testing accuracy: {round((accuracy_score(boost_pred, y_test)*100),2)}%")
xgb_cross = cross_validate(boost_model, X, y, cv=11)
print(f"Average testing accuracy: {round((xgb_cross['test_score'].mean()*100),4)}%")

Again, we're predicting our new values.

In [None]:
plot_confusion_matrix(boost_model, X_test, y_test, values_format='d')
;

In [None]:
plot_roc_curve(boost_model, X_test, y_test)

As seen in the output, the accuracy score is the highest of all the models used so far. However, the number of type-one errors equals 2728.

In [None]:
plot_importance(boost_model)

## Conclusion

To conclude, the models we've used differ noticeably in accuracy score and type-one error. However, accuracy matters less than the error count, because it would be dangerous if our model classified a cardio patient as a non-cardio patient (a false negative, which standard terminology calls a type II error). We must therefore rely on the model with the lowest error value, so the **XGBoost Classifier is considered the winner** in such a critical field as health.
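To compare models on missed patients rather than on accuracy alone, the error counts can be separated explicitly. The sketch below counts the four outcomes in plain Python for an invented set of labels. In standard terminology a false positive is a type I error and a false negative is a type II error. The helper name `confusion_counts` is an assumption for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = cardio)."""
    pairs = list(zip(y_true, y_pred))
    return {
        "tp": sum(1 for t, p in pairs if t == 1 and p == 1),
        "fp": sum(1 for t, p in pairs if t == 0 and p == 1),  # healthy flagged as cardio
        "fn": sum(1 for t, p in pairs if t == 1 and p == 0),  # cardio patient missed
        "tn": sum(1 for t, p in pairs if t == 0 and p == 0),
    }

# Hypothetical true labels and model predictions
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 1, 0, 0, 0]

counts = confusion_counts(y_true_demo, y_pred_demo)
recall = counts["tp"] / (counts["tp"] + counts["fn"])  # share of cardio patients caught
print(counts, recall)
```

Recall summarizes the false negatives in a single number, which makes it a natural companion to accuracy when missed patients are the main concern.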