# CSMODEL MCO1

### Group 6:
### Nieva, Samuel
### Opalla, Rijan
### Quinzon, Christopher


## I. Dataset Description

The dataset covers possible key indicators of heart disease among US citizens. It is a public online resource available on Kaggle.com and originates from the Centers for Disease Control and Prevention (CDC), a United States government agency that conducts the Behavioral Risk Factor Surveillance System (BRFSS). The data is observational, since no form of treatment was applied to the respondents. It was collected through an annual telephone survey in all 50 states, the District of Columbia, and three U.S. territories. The dataset contains 319,795 rows and 18 columns; each row represents one respondent, and each column represents that respondent's answer to a question about their health status. The responses are stored as 9 boolean, 5 string, and 4 decimal variables. No external files are combined with this single file for this machine project.

#### Heart Disease

The dataset describes this variable as: "Respondents that have ever reported having coronary heart disease (CHD) or myocardial infarction (MI)".

It has two unique values: `Yes` and `No`.

#### BMI

The dataset describes this variable as the respondent's "BMI". Body Mass Index (BMI) is a numerical value derived from a person's height and weight and is used as an estimate of body fat.

Its value is the numerical value of the respondent's BMI.

#### Smoking

The dataset describes this variable as whether or not: "Respondents have smoked at least 100 cigarettes in their lives". The dataset notes that 5 packs of cigarettes are equivalent to 100 cigarettes.

It has two unique values: `Yes` and `No`.

#### Diabetic

The variable describes this as whether or not the respondents have diabetes.

It has four unique values: `Yes`, `No`, `Yes (during pregnancy)`, and `No, borderline diabetes`. This project treats the last value as prediabetes or type 2 diabetes.

## II. Data Cleaning
The following code blocks import the necessary libraries and load the dataset.

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

In [None]:
heart_df = pd.read_csv("heart_2020_cleaned.csv")

The next segment displays general information about the dataset and the datatypes of its features.

In [None]:
heart_df.info()

In [None]:
heart_df.head()

According to the dataset's authors, the dataset has already been cleaned. However, the data cleaning procedures discussed in class will still be performed to ensure that all entries are valid in terms of their content and datatype.

### Checking for missing values

In [None]:
heart_df.isnull().any()
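As a complement to the check above, a per-column count of missing entries can make the result easier to read than a True/False flag. The snippet below is a minimal sketch on a made-up `toy_df` (a hypothetical stand-in for `heart_df`, with invented values), not output from the actual dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for heart_df; the values are made up.
toy_df = pd.DataFrame({
    "HeartDisease": ["No", "Yes", "No", None],
    "BMI": [22.5, np.nan, 31.0, 27.3],
    "Smoking": ["Yes", "No", "No", "Yes"],
})

# Number of missing entries in each column.
missing_counts = toy_df.isnull().sum()
print(missing_counts)

# Total number of missing cells in the whole frame.
total_missing = int(missing_counts.sum())
print("Total missing cells:", total_missing)
```

On the real data, the same two calls could be applied to `heart_df` directly.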

### Checking for inconsistencies and multiple representations

In [None]:
print("No. of unique values per column:\n",heart_df.nunique())

The unique values of each variable used are checked below.

### Heart Disease

In [None]:
heart_df['HeartDisease'].unique()

### BMI

In [None]:
heart_df['BMI'].unique()

### Smoking

In [None]:
heart_df['Smoking'].unique()

### Diabetic

In [None]:
heart_df['Diabetic'].unique()

Based on the results above, the data does not contain any missing values. Also, all features contain valid entries and datatypes. 
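The manual inspection of unique values could also be expressed as an automated check. The sketch below compares each categorical column against the values listed in the dataset description and confirms that `BMI` is numeric. The helper `unexpected_values` and the frame `toy_df` are hypothetical names introduced only for this illustration, and the toy values are invented.

```python
import pandas as pd

# Hypothetical toy frame; in the notebook this would be heart_df before encoding.
toy_df = pd.DataFrame({
    "HeartDisease": ["No", "Yes", "no", "No"],  # "no" is a deliberate inconsistency
    "Diabetic": ["No", "Yes (during pregnancy)", "No, borderline diabetes", "Yes"],
    "BMI": [22.5, 31.0, 27.3, 18.9],
})

# Allowed categories, taken from the dataset description in Section I.
allowed = {
    "HeartDisease": {"Yes", "No"},
    "Diabetic": {"Yes", "No", "Yes (during pregnancy)", "No, borderline diabetes"},
}

def unexpected_values(df, allowed_values):
    """Return, per column, the set of values that are not in the allowed set."""
    return {
        col: set(df[col].dropna().unique()) - valid
        for col, valid in allowed_values.items()
    }

problems = unexpected_values(toy_df, allowed)
print(problems)

# Numeric columns should have a numeric dtype.
bmi_is_numeric = pd.api.types.is_numeric_dtype(toy_df["BMI"])
print("BMI is numeric:", bmi_is_numeric)
```

An empty set for every column would indicate that no inconsistent representations were found.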

## III. Exploratory Data Analysis

The group has identified several questions that will be answered through exploratory data analysis. The following code segment imports the libraries needed.

In [None]:
import matplotlib.pyplot as plt
from scipy import stats
# sets the theme of the charts
# plt.style.use('seaborn-darkgrid')

%matplotlib inline

Since several columns contain `Yes`/`No` values, we first encode them as 1 and 0 to aid the analysis. This is a binary encoding rather than one-hot encoding, because each column is replaced in place by a single 0/1 column. For the `Diabetic` variable, `Yes (during pregnancy)` is merged into `Yes` (1) and `No, borderline diabetes` is merged into `No` (0). This study focuses only on type 1 diabetes, so "borderline diabetes" is treated as prediabetes/type 2 diabetes and counted as `No`.

In [None]:
heart_df = heart_df.replace({'Yes': 1, 'No':0, 'No, borderline diabetes': 0, 'Yes (during pregnancy)': 1})
heart_df
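Calling `replace` on the whole dataframe rewrites every `Yes`/`No` cell in every column. If a narrower, column-by-column mapping is preferred, it could look like the sketch below. The frame `toy_df` and the dictionaries `yes_no` and `diabetic_map` are hypothetical names with made-up values; only the category labels come from the dataset description.

```python
import pandas as pd

# Hypothetical toy frame standing in for the raw heart_df.
toy_df = pd.DataFrame({
    "HeartDisease": ["No", "Yes", "No", "Yes"],
    "Smoking": ["Yes", "No", "No", "Yes"],
    "Diabetic": ["No", "Yes (during pregnancy)", "No, borderline diabetes", "Yes"],
    "Sex": ["Female", "Male", "Female", "Male"],
})

yes_no = {"Yes": 1, "No": 0}
# Same merging rule as in the text: pregnancy -> 1, borderline -> 0.
diabetic_map = {**yes_no, "Yes (during pregnancy)": 1, "No, borderline diabetes": 0}

encoded = toy_df.copy()
for col in ["HeartDisease", "Smoking"]:
    encoded[col] = encoded[col].map(yes_no)
encoded["Diabetic"] = encoded["Diabetic"].map(diabetic_map)

# map() yields NaN for unmapped values, so this doubles as a sanity check.
print(encoded.isnull().sum())
print(encoded)
```

Columns that are not listed, such as `Sex` in this toy frame, are left untouched.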


### Is there a relationship between heart disease and BMI?

Body Mass Index (BMI) is a measure of body fat based on an individual's weight and height. It has four categories: underweight, normal weight, overweight, and obesity. A BMI below 18.5 is considered underweight, a BMI from 18.5 to 24.9 falls under normal weight, a BMI from 25 to 29.9 falls under overweight, and a BMI of 30 or greater falls under obesity.
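The four categories above can be sketched with `pd.cut`. The bin edges below assume left-closed intervals (for example, a BMI of exactly 25.0 counts as overweight), which is one reading of the ranges given in the text; the sample values are made up.

```python
import numpy as np
import pandas as pd

# Made-up BMI values chosen to sit on and around the category boundaries.
bmi = pd.Series([17.0, 18.5, 24.9, 25.0, 29.9, 30.0, 41.2], name="BMI")

# Assumed bin edges: [0, 18.5), [18.5, 25), [25, 30), [30, inf)
bins = [0, 18.5, 25, 30, np.inf]
labels = ["Underweight", "Normal weight", "Overweight", "Obesity"]

# right=False makes each interval include its left edge and exclude its right edge.
bmi_category = pd.cut(bmi, bins=bins, labels=labels, right=False)
print(pd.concat([bmi, bmi_category.rename("Category")], axis=1))
```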

Given this, we would like to find out if the dataset shows a relationship between heart disease and an individual's BMI.

To answer this question, we first select the two relevant columns from the dataset.

In [None]:
hd_bmi = heart_df[['HeartDisease', 'BMI']]
hd_bmi

The correlation between the two is calculated as follows:

In [None]:
stats.pointbiserialr(hd_bmi['HeartDisease'],hd_bmi['BMI'])

Since `HeartDisease` is binary and `BMI` is continuous, the point-biserial correlation was used. Like other correlation coefficients, it ranges from -1 to +1, with a value of 0 implying no correlation. The point-biserial coefficient is mathematically equivalent to Pearson's correlation coefficient computed with the binary variable coded as 0 and 1. This can be seen in the following code segment.

In [None]:
hd_bmi.corr()

The calculated correlation between the two variables is **0.0518**. From the result, the two variables are positively correlated. However, the r-value suggests a very weak relationship between the two variables. 
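The equivalence between the point-biserial and Pearson coefficients can also be checked on synthetic data. The sketch below uses randomly generated values with a fixed seed, not the heart disease dataset, and the variable names are chosen only for this illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data: a 0/1 variable and a continuous variable loosely tied to it.
rng = np.random.default_rng(0)
binary = rng.integers(0, 2, size=500)
continuous = rng.normal(loc=28, scale=6, size=500) + 0.5 * binary

# Point-biserial correlation between the binary and continuous variables.
r_pb, p_pb = stats.pointbiserialr(binary, continuous)

# Pearson correlation on the same inputs, with the binary variable as 0/1.
r_pearson, p_pearson = stats.pearsonr(binary, continuous)

print("point-biserial r:", r_pb)
print("Pearson r:       ", r_pearson)
print("same coefficient:", np.isclose(r_pb, r_pearson))
```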

To visualize the data, the following stacked histogram plots the distribution of BMI for respondents with and without heart disease:

In [None]:
pd.DataFrame({'Without Heart Disease': hd_bmi.groupby('HeartDisease').get_group(0).BMI,
              'With Heart Disease':   hd_bmi.groupby('HeartDisease').get_group(1).BMI}).plot.hist(bins=50, stacked=True)
plt.title('BMI')
plt.xlabel('BMI')

Based on the visualization above, it can be observed that the BMI distribution is positively skewed. Because the histogram is stacked and shows raw counts, the height of each series reflects the size of its group as well as its BMI values, so the plot alone cannot show whether heart disease is more common at higher BMI ranges.
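When two groups differ in size, per-group summary statistics are one way to compare them that does not depend on raw counts. The sketch below assumes a 0/1 `HeartDisease` column as produced by the encoding step; the frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with unequal group sizes.
toy = pd.DataFrame({
    "HeartDisease": [0, 0, 0, 0, 0, 0, 1, 1],
    "BMI": [22.0, 24.0, 26.0, 28.0, 30.0, 32.0, 27.0, 33.0],
})

# Count, mean, and median BMI within each heart disease group.
summary = toy.groupby("HeartDisease")["BMI"].agg(["count", "mean", "median"])
print(summary)
```

In the notebook, the same aggregation could be applied to `hd_bmi` to compare the two groups directly.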

### Is there a relationship between Diabetes and Heart Disease?

According to the World Health Organization (WHO), diabetes is classified as a chronic, metabolic disease characterized by elevated blood sugar levels, which may damage the heart, blood vessels, eyes, kidneys, and nerves over time.

Our variables of interest for this EDA question are `HeartDisease` and `Diabetic`. We therefore take these two columns from the dataframe and group them by the presence of diabetes:

In [None]:
hd_diabetic = heart_df.groupby('Diabetic')['HeartDisease'].value_counts()
diabetic_df = pd.DataFrame([hd_diabetic[0], hd_diabetic[1]], index=['Not diabetic', 'Diabetic']).transpose()
diabetic_df


We visualize the relationship between the two variables with a bar graph:

In [None]:
diabetic_df.plot.bar()

plt.xlabel('Heart Disease')
plt.ylabel('Count')


In [None]:
heart_df['Diabetic'].value_counts()

Since the numbers of diabetic and non-diabetic respondents differ greatly, we analyze the relationship by comparing the proportion of non-diabetic respondents who have heart disease with the proportion of diabetic respondents who have heart disease:

In [None]:
non_diabetic_df = heart_df[heart_df['Diabetic'] == 0]
no_diabetes_no_hd = non_diabetic_df[non_diabetic_df['HeartDisease'] == 0].shape[0]
no_diabetes_with_hd = non_diabetic_df[non_diabetic_df['HeartDisease'] == 1].shape[0]

with_diabetic_df = heart_df[heart_df['Diabetic'] == 1]
with_diabetes_no_hd = with_diabetic_df[with_diabetic_df['HeartDisease'] == 0].shape[0]
with_diabetes_with_hd = with_diabetic_df[with_diabetic_df['HeartDisease'] == 1].shape[0]

In [None]:
print('No diabetes and no Heart Disease:', no_diabetes_no_hd)
print('No diabetes but with Heart Disease:', no_diabetes_with_hd, '\n')

print('Diabetic with no Heart Disease:', with_diabetes_no_hd)
print('Diabetic with Heart Disease:', with_diabetes_with_hd, '\n')

print('Percentage of no diabetes but with Heart Disease: {:.2f}%'.format(no_diabetes_with_hd/non_diabetic_df.shape[0] * 100))
print('Percentage of diabetic with Heart Disease: {:.2f}%'.format(with_diabetes_with_hd/with_diabetic_df.shape[0] * 100))
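The same counts and percentages can be obtained in one step with `pd.crosstab`. The sketch below assumes 0/1 columns as produced by the encoding step and uses a made-up frame named `toy`; `normalize="index"` divides each row by its own total, so each row sums to 100%.

```python
import pandas as pd

# Hypothetical toy data: 8 non-diabetic and 4 diabetic respondents.
toy = pd.DataFrame({
    "Diabetic":     [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    "HeartDisease": [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
})

# Raw counts: rows are Diabetic (0/1), columns are HeartDisease (0/1).
counts = pd.crosstab(toy["Diabetic"], toy["HeartDisease"])
print(counts)

# Row percentages: share of each diabetes group with and without heart disease.
rates = pd.crosstab(toy["Diabetic"], toy["HeartDisease"], normalize="index") * 100
print(rates.round(2))
```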

Based on the computed proportions, respondents who are diabetic are more likely to have heart disease than those without diabetes. To test whether this association between diabetes and heart disease is statistically significant, we use the chi-square test of independence, which is suited to categorical data.

In [None]:
chi2_contingency(diabetic_df, correction=True)

The result of `chi2_contingency()` shows that the p-value is less than the significance levels of 0.01, 0.05, and 0.10. Therefore, there is a statistically significant association between diabetes and heart disease.
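For readability, the values returned by `chi2_contingency` can be unpacked and compared against each significance level explicitly. The sketch below uses a made-up 2x2 table named `observed`, not the counts from the dataset.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table with made-up counts.
# Rows: heart disease (0, 1); columns: not diabetic, diabetic.
observed = np.array([[900, 80],
                     [60, 40]])

# correction=True applies Yates' continuity correction, which only affects 2x2 tables.
chi2_stat, p_value, dof, expected = chi2_contingency(observed, correction=True)

print("chi-square statistic:", chi2_stat)
print("p-value:", p_value)
print("degrees of freedom:", dof)
print("expected counts under independence:\n", expected)

for alpha in (0.01, 0.05, 0.10):
    decision = "reject" if p_value < alpha else "fail to reject"
    print(f"alpha = {alpha:.2f}: {decision} the null hypothesis of independence")
```

The `expected` array holds the counts that would be expected if the two variables were independent, which can help when interpreting where the observed table deviates.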

### Is there a relationship between Smoking and Heart Disease?

In this dataset, the variable `Smoking` indicates whether the respondent has smoked at least a hundred cigarettes in their life, which, according to the dataset, is equivalent to 5 packs.

With this in mind, we would like to find out whether respondents who have smoked at least a hundred cigarettes are more likely to have heart disease. To do that, we analyze the relationship between the two variables.

In [None]:
temp = heart_df.groupby('Smoking')['HeartDisease'].value_counts()
table = pd.DataFrame([temp[0], temp[1]], index=['Does not Smoke', 'Smokes']).transpose()
table

In [None]:
table.plot.bar()

From the graph, it can be seen that among people without heart disease, there are more non-smokers than smokers. However, the two groups differ in their total number of respondents:

In [None]:
heart_df['Smoking'].value_counts()

There are more non-smokers than smokers, so the raw counts in the bar graph can be misleading if used alone. We can further assess the relationship by looking at the percentages:

In [None]:
non_smoking_df = heart_df[heart_df['Smoking'] == 0]
non_smoking_no_disease = non_smoking_df[non_smoking_df['HeartDisease']==0].shape[0]
non_smoking_with_disease = non_smoking_df[non_smoking_df['HeartDisease']==1].shape[0]

# The smoker counts must be taken from smoking_df, not non_smoking_df
smoking_df = heart_df[heart_df['Smoking'] == 1]
smoking_no_disease = smoking_df[smoking_df['HeartDisease']==0].shape[0]
smoking_with_disease = smoking_df[smoking_df['HeartDisease']==1].shape[0]

print('Non-smokers with no Heart Disease:', non_smoking_no_disease)
print('Non-smokers with Heart Disease:', non_smoking_with_disease, '\n')

print('Smokers with no Heart Disease:', smoking_no_disease)
print('Smokers with Heart Disease:', smoking_with_disease, '\n')

print('Percentage of non-smokers with Heart Disease: {:.2f}%'.format(non_smoking_with_disease/non_smoking_df.shape[0] * 100))
print('Percentage of smokers with Heart Disease: {:.2f}%'.format(smoking_with_disease/smoking_df.shape[0] * 100))
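Because the same percentage calculation is repeated for each risk factor, it could be wrapped in a small helper. The function `heart_disease_rate` below is a hypothetical name introduced for this sketch. It relies on the fact that the mean of a 0/1 column equals the proportion of ones, and the toy values are made up.

```python
import pandas as pd

def heart_disease_rate(df, factor, outcome="HeartDisease"):
    """Percentage of rows with outcome == 1 within each level of `factor`.

    Assumes `outcome` is encoded as 0/1, so its mean is a proportion.
    """
    return df.groupby(factor)[outcome].mean() * 100

# Hypothetical toy data: 5 non-smokers and 5 smokers.
toy = pd.DataFrame({
    "Smoking":      [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "HeartDisease": [0, 0, 0, 0, 1, 0, 0, 0, 1, 1],
})

rates = heart_disease_rate(toy, "Smoking")
print(rates)
```

Reusing one function for every factor avoids copying the filtering code and reduces the chance of referencing the wrong dataframe.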

As seen above, when proportions are taken into account, the percentage of smokers with heart disease is higher than the percentage of non-smokers with heart disease. One final step is needed to check whether this association is statistically significant: Pearson's chi-squared test, which tests whether two categorical variables are independent.

In [None]:
chi2_contingency(table, correction=True)

Although the function reports a p-value of 0.0, this is an approximation: the actual value is too small to be represented as a floating-point number. Because the p-value is below the significance levels of 0.01, 0.05, and 0.10 (equivalently, the chi-square statistic exceeds the critical value at each of those levels), it can be concluded that there is a statistically significant association between the two variables, `HeartDisease` and `Smoking`.
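The underflow and the critical values can be illustrated with `scipy.stats`. The sketch below uses a made-up statistic of 2000 with one degree of freedom, not the value from the notebook. For one degree of freedom, the chi-square tail probability equals twice the upper tail of a standard normal at the square root of the statistic, which allows the logarithm of the p-value to be computed without underflow.

```python
import numpy as np
from scipy.stats import chi2, norm

dof = 1             # a 2x2 table has (2 - 1) * (2 - 1) = 1 degree of freedom
chi2_stat = 2000.0  # hypothetical statistic, chosen to be very large

# The tail probability is too small for a float and is reported as 0.0.
p_value = chi2.sf(chi2_stat, dof)
print("p-value as a float:", p_value)

# For dof = 1: P(X > x) = 2 * P(Z > sqrt(x)), so work in log space instead.
log_p = np.log(2) + norm.logsf(np.sqrt(chi2_stat))
print("log10 of the p-value:", log_p / np.log(10))

# Critical values of the chi-square distribution at each significance level.
critical_values = {alpha: chi2.ppf(1 - alpha, dof) for alpha in (0.10, 0.05, 0.01)}
for alpha, crit in critical_values.items():
    print(f"alpha = {alpha:.2f}: critical value = {crit:.3f}, "
          f"statistic exceeds it: {chi2_stat > crit}")
```

Note that the statistic is compared with the critical values, while the p-value is compared with the significance levels themselves; both comparisons lead to the same decision.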

## IV. Formulated Research Question

Our exploratory data analysis has tackled only 3 of the 17 variables whose impact on an individual's risk of heart disease can be measured. It suggests that certain health conditions and lifestyles may indicate or form a pattern that can lead to heart disease. Thus, our research aims to answer the question:

**Which variables have a significant effect on the likelihood of heart disease?**

In the context of the Philippines, heart disease remained a top cause of death among Filipinos in 2021, even during the COVID-19 pandemic. The Philippine Statistics Authority reported that 125,913 (17.9%) of the deaths during the pandemic were caused by ischemic heart disease. Ranked after heart disease are cerebrovascular disease (68,180 or 9.7%) and COVID-19 (8,390 or 1.5%) (Kabagani, 2022).

The World Health Organization (2021) has identified several risk factors for heart disease, namely unhealthy eating habits, physical inactivity, tobacco usage, and excessive alcohol intake. Some of these risk factors are already present in the dataset. Thus, the main goal of this research is to validate these claims using the given dataset and to find other features that are potential risk factors.

### References
Kabagani, L. (2022, February 23). Heart disease remains top cause of death in PH in 2021: PSA. Retrieved from: https://www.pna.gov.ph/articles/1168439

World Health Organization. (2021). Cardiovascular diseases (CVDs). World Health Organization. Retrieved from: https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)