# COGS 118B - Project Proposal

# Project Description

You will design and execute a machine learning project. There are a few constraints on the nature of the allowed project. 
- The problem addressed will not be a "toy problem" or "common training students problem" like mtcars, iris, palmer penguins etc.
- The dataset will have >1k observations and >5 variables. I'd prefer more like >10k observations and >10 variables. A general rule is that if you have >100x more observations than variables, your solution will likely generalize a lot better. The goal of training an unsupervised machine learning model is to learn the underlying pattern in a dataset in order to generalize well to unseen data, so choosing a large dataset is very important.

- The project must include some elements of unsupervised learning, but you are welcome to include some supervised or other learning approaches as well.
- The project will include a model selection and/or feature selection component where you will be looking for the best setup to maximize the performance of your ML system.
- You will evaluate the performance of your ML system using more than one appropriate metric
- You will be writing a report describing and discussing these accomplishments


Feel free to delete this description section when you hand in your proposal.

# Names

Hopefully your team is at least this good. Obviously you should replace these with your names.

- Samantha Prestrelski  
- Jeffrey Yang
- Yash Patki
- Fayaz Shaik
- Denny Yoo

# Abstract 
This section should be short and clearly stated. It should be a single paragraph <200 words.  It should summarize: 
- what your goal/problem is
- what the data used represents and how they are measured
- what you will be doing with the data
- how performance/success will be measured

We aim to predict which people do and do not have heart disease based on 17 health factors. 

The data comes from annual telephone surveys conducted by the CDC as part of the Behavioral Risk Factor Surveillance System (BRFSS). Each row represents a different person. Each of the 18 columns records a single health attribute: the HeartDisease outcome plus 17 factors that may play a part in an individual having heart disease. These columns are HeartDisease, BMI, Smoking, AlcoholDrinking, Stroke, PhysicalHealth, MentalHealth, DiffWalking, Sex, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, SleepTime, Asthma, KidneyDisease, SkinCancer. Some of these columns can be converted to a binary representation; the remaining categorical columns will need to be one-hot encoded. 

We will be creating unsupervised machine learning models that use the data in order to predict which people do and don't have heart disease. We will be using models such as K-Means and GMM.

We will measure success by using the columns HadHeartAttack, HadAngina, and HadStroke as true indicators for having heart disease. 

# Background

Heart disease affects millions of people each year and is a leading cause of mortality worldwide. The early prediction and diagnosis of heart disease is crucial for early intervention, improved treatment, and better patient outcomes. It is thus critical that medical practitioners have access to tools that would grant them the ability to make such early detections of heart disease in patients, a challenge that our project seeks to address.

As our model utilizes various health factors in its prediction of heart disease, it is crucial to understand how these factors influence the probability of an individual having heart disease. Extensive prior research exists which meticulously analyzes how factors such as BMI and age contribute to the likelihood of heart disease, illuminating the complex relationships that our model seeks to capture. A 2021 study by the American Heart Association found that obesity, as quantified by BMI, is directly responsible for a range of cardiovascular risk factors such as diabetes and hypertension which heavily contribute to an increased likelihood of heart disease <a name="cite_ref-1"></a>[<sup>1</sup>](#cite_note-1). A WebMD article corroborates that individuals past the age of 65 are drastically more susceptible to heart failure and other conditions that are linked to heart disease <a name="cite_ref-2"></a>[<sup>2</sup>](#cite_note-2). Due to their high impact nature, it is clearly imperative that BMI and age, along with the other 15 health factors, be integrated into our model should we want to comprehensively predict heart disease in patients.

Past research has also shown machine learning models to be excellent predictors of medical conditions like heart disease. A 2023 study found various neural network models to be capable of achieving up to 94.78% accuracy in heart disease prediction <a name="cite_ref-3"></a>[<sup>3</sup>](#cite_note-3). Given the past success of these models, machine learning algorithms appear well suited to our problem of predicting heart disease. 

1. <a name="cite_note-1"></a>[](#cite_ref-1) Powell-Wiley TM, Poirier P, Burke LE, Després J-P, Gordon-Larsen P, Lavie CJ, Lear SA, Ndumele CE, Neeland IJ, Sanders P, St-Onge M-P; on behalf of the American Heart Association Council on Lifestyle and Cardiometabolic Health; Council on Cardiovascular and Stroke Nursing; Council on Clinical Cardiology; Council on Epidemiology and Prevention; and Stroke Council. Obesity and cardiovascular disease: a scientific statement from the American Heart Association. Circulation. 2021;143:e984–e1010. doi: 10.1161/CIR.0000000000000973
2. <a name="cite_note-2"></a>[](#cite_ref-2) “What to Know about Your Heart as You Age.” WebMD, WebMD, www.webmd.com/healthy-aging/what-happens-to-your-heart-as-you-age. Accessed 16 Feb. 2024. 
3. <a name="cite_note-3"></a>[](#cite_ref-3) Srinivasan, S., Gunasekaran, S., Mathivanan, S.K. et al. An active learning machine technique based prediction of cardiovascular heart disease from UCI-repository database. Sci Rep 13, 13588 (2023). https://doi.org/10.1038/s41598-023-40717-1


# Problem Statement



Clearly describe the problem that you are solving. Avoid ambiguous words. The problem described should be well defined and should have at least one ML-relevant potential solution. Additionally, describe the problem thoroughly such that it is clear that the problem is quantifiable (the problem can be expressed in mathematical or logical terms), measurable (the problem can be measured by some metric and clearly observed), and replicable (the problem can be reproduced and occurs more than once).

As described earlier, heart disease is a major health problem and, according to our dataset, affects a large number of patients. Fortunately, heart disease is treatable, and the chances of successful treatment are far better when it is caught early. The problem we hope to solve is this: given our dataset of various factors relating to a patient, can we build an ML model that accurately predicts a person's chance of having heart disease? Our dataset features a variety of patient information, such as BMI, sex, and race, which we believe can be used to surface patterns of the form "there are ... patients who have heart disease" or "...% of patients between the ages of 50 and 60 have heart disease". 

The problem is quantifiable because the risk of heart disease can be expressed as a percentage, or as a single yes/no indicating whether a person should be worried about heart disease if they continue their current lifestyle. The model's performance can be measured by how accurately it tracks a person's actual risk of heart disease. This can be done by splitting the dataset in two: a training dataset and a validation dataset. If the assumption holds that heart disease can be predicted from certain factors, then halving the dataset should not materially change the patterns the model learns, and a model fit on one half should perform comparably on the other. Finally, the problem is replicable because our dataset and ML training methods can be shared and applied by others, including to other datasets, to reproduce the results.

We can use ML training methods such as logistic regression to train a predictive model on one half of the dataset (the training dataset) and test it on the other half for validation purposes. We could also run several feature selection tasks to ensure that the most predictive features are used to predict heart disease.
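The half-and-half split described above could be sketched as follows. This is only an illustration under stated assumptions: the helper name `split_in_half` and the fixed seed are hypothetical, and the final project may well use a different split ratio or a library routine instead.

```python
import numpy as np

def split_in_half(n_rows: int, seed: int = 0):
    """Hypothetical helper: shuffle row positions and cut them into two halves."""
    rng = np.random.default_rng(seed)      # fixed seed so the split is reproducible
    shuffled = rng.permutation(n_rows)     # random ordering of 0..n_rows-1
    midpoint = n_rows // 2
    return shuffled[:midpoint], shuffled[midpoint:]

# Toy usage with 10 rows; the real dataset has far more.
train_idx, val_idx = split_in_half(10)
```

With a DataFrame `df`, `df.iloc[train_idx]` and `df.iloc[val_idx]` would then give the training and validation halves.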

# Data

We will use 17 variables to predict Heart Disease and a binary value for whether or not the person has heart disease.  

Link to data: https://www.kaggle.com/datasets/kamilpytlak/personal-key-indicators-of-heart-disease  
- 18 variables
    - HeartDisease, BMI, Smoking, AlcoholDrinking, Stroke, PhysicalHealth, MentalHealth, DiffWalking, Sex, AgeCategory, Race, Diabetic, PhysicalActivity, GenHealth, SleepTime, Asthma, KidneyDisease, SkinCancer.
- 319,795 observations
    - Each observation is a U.S. resident that provided their health status as part of the Behavioral Risk Factor Surveillance System (BRFSS)'s telephone surveys.

According to the [CDC](https://www.cdc.gov/heartdisease/risk_factors.htm), 
> About half of all Americans (47%) have at least 1 of 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking.

Although high blood pressure and high cholesterol are hard to measure, diabetes and obesity are indicators of high blood pressure, making `BMI`, `Diabetic`, and `PhysicalActivity` critical variables, as well as `Smoking`. BMI is numerical, Diabetic is Yes/No/Other, PhysicalActivity is Yes/No, and Smoking is Yes/No. PhysicalActivity is not very well described, so we will need to check how this dataset was [created](https://github.com/kamilpytlak/data-science-projects/blob/main/heart-disease-prediction/2022/documentation/vars_list_with_descriptions.txt) and match up the raw data with the processed data to figure out how to interpret some of the variables. 

We will need to convert binary columns that are listed as Yes/No to 0s and 1s, and categorical columns like `GenHealth` to a numerical representation, in order to vectorize all of the data. 
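A minimal sketch of that conversion with pandas is shown below. The helper name `encode_features`, the worst-to-best ordering of the `GenHealth` levels, and the toy rows are all assumptions made for illustration; the actual category values should be checked against the data before relying on this.

```python
import pandas as pd

# Assumed ordering of the GenHealth levels, from worst to best; verify against the data.
GEN_HEALTH_ORDER = ["Poor", "Fair", "Good", "Very good", "Excellent"]

def encode_features(df, binary_cols, one_hot_cols):
    """Hypothetical helper: map Yes/No to 1/0, rank GenHealth, one-hot encode the rest."""
    out = df.copy()
    for col in binary_cols:
        out[col] = out[col].map({"No": 0, "Yes": 1})
    out["GenHealth"] = out["GenHealth"].map(
        {level: rank for rank, level in enumerate(GEN_HEALTH_ORDER)}
    )
    # One indicator column per category, named <column>_<value>.
    return pd.get_dummies(out, columns=one_hot_cols, dtype=int)

# Toy rows (hypothetical values) standing in for the real dataset.
toy = pd.DataFrame({
    "HeartDisease": ["No", "Yes"],
    "Smoking": ["Yes", "No"],
    "GenHealth": ["Very good", "Poor"],
    "Race": ["White", "Black"],
})
encoded = encode_features(toy, binary_cols=["HeartDisease", "Smoking"], one_hot_cols=["Race"])
```

Ranking `GenHealth` keeps its natural order in a single column, whereas an unordered column such as `Race` is spread across indicator columns so that no artificial ordering is implied.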

# Proposed Solution

In this section, clearly describe a solution to the problem. The solution should be applicable to the project domain and appropriate for the dataset(s) or input(s) given. Provide enough detail (e.g., algorithmic description and/or theoretical properties) to convince us that your solution is applicable. Why might your solution work? Make sure to describe how the solution will be tested.  

If you know details already, describe how (e.g., library used, function calls) you plan to implement the solution in a way that is reproducible.

If it is appropriate to the problem statement, describe a benchmark model<a name="sota"></a>[<sup>[3]</sup>](#sotanote) against which your solution will be compared. 

# Evaluation Metrics

Propose at least one evaluation metric that can be used to quantify the performance of both the benchmark model and the solution model. The evaluation metric(s) you propose should be appropriate given the context of the data, the problem statement, and the intended solution. Describe how the evaluation metric(s) are derived and provide an example of their mathematical representations (if applicable). Complex evaluation metrics should be clearly defined and quantifiable (can be expressed in mathematical or logical terms).

One evaluation metric that can be used is the distortion score. Distortion measures only how tight the clusters are, by summing the squared distances between each point and the center of its assigned cluster. With $r_{nk} = 1$ if point $x_n$ is assigned to cluster $k$ (and 0 otherwise) and $\mu_k$ the center of cluster $k$, it is written as: 
$$
J = \sum_{n=1}^N\sum_{k=1}^Kr_{nk} ||x_n-\mu_k||^2
$$
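As a rough illustration of the formula, the sketch below computes $J$ with NumPy, representing the assignment indicator $r_{nk}$ as one integer cluster label per point. The function name `distortion` and the toy points are assumptions made for this example.

```python
import numpy as np

def distortion(X, labels, centers):
    """Sum of squared distances from each point to its assigned cluster center."""
    diffs = X - centers[labels]   # centers[labels] picks mu_k for each x_n
    return float(np.sum(diffs ** 2))

# Toy data: two clusters, every point exactly 1 unit from its center.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centers = np.array([[0.0, 1.0], [10.0, 1.0]])
J = distortion(X, labels, centers)
```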

Another possible evaluation metric is the silhouette score, which, unlike the distortion score, also accounts for how far each point is from the nearest other cluster. For each point, cohesion is the mean distance to the other points in its own cluster, and separation is the mean distance to the points in the nearest neighboring cluster. The per-point score can be generalized as such:

$$
\frac{\text{separation} - \text{cohesion}}{\max(\text{separation}, \text{cohesion})}
$$
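To make the formula concrete, here is a small NumPy sketch of a per-point silhouette computation using the usual definition (cohesion as the mean distance to the other points in the same cluster, separation as the mean distance to the points of the nearest other cluster). The function name `silhouette_per_point` and the toy data are assumptions for illustration, and the function expects at least two clusters.

```python
import numpy as np

def silhouette_per_point(X, labels):
    """Per-point (separation - cohesion) / max(separation, cohesion)."""
    # Full pairwise Euclidean distance matrix; fine for small samples only.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    cluster_ids = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False                      # exclude the point itself
        if not same.any():
            continue                         # singleton cluster: score stays 0
        cohesion = dists[i, same].mean()
        separation = min(
            dists[i, labels == k].mean() for k in cluster_ids if k != labels[i]
        )
        scores[i] = (separation - cohesion) / max(separation, cohesion)
    return scores

# Toy data: two tight, well-separated clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
scores = silhouette_per_point(X, labels)
```

Because the distance matrix grows quadratically with the number of rows, a dataset of this size would likely need to be subsampled before scoring.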

These metrics can be used to determine the optimal number of clusters for our data as certain factors could lead to specific kinds of heart disease, or we can see if there is great overlap between them with a smaller number of clusters. 
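One possible way to compare cluster counts with both metrics is sketched below using scikit-learn's `KMeans` (whose `inertia_` attribute is the distortion) and `silhouette_score`. The helper name `score_cluster_counts`, the candidate range, and the synthetic blobs are assumptions for illustration, not results from our data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def score_cluster_counts(X, candidate_ks, seed=0):
    """Hypothetical helper: fit K-Means per k, return {k: (distortion, silhouette)}."""
    results = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        results[k] = (model.inertia_, silhouette_score(X, model.labels_))
    return results

# Synthetic stand-in data: three well-separated blobs of 50 points each.
rng = np.random.default_rng(0)
blob_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])

results = score_cluster_counts(X, range(2, 6))
best_k = max(results, key=lambda k: results[k][1])   # highest silhouette wins
```

Distortion always shrinks as k grows, so it is usually read with an elbow plot, while the silhouette score can be maximized directly.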

# Ethics & Privacy

Using the provided data science ethics checklist from https://deon.drivendata.org, we discuss the following potential concerns with ethics and data privacy:

Data Collection
- **Informed Consent**: The human subjects opted in, as they could have refused the telephone survey, hung up the phone at any time, or refused to answer questions. 
- **Collection Bias/Bias Mitigation**: Some bias is towards people that are willing to give their information, which might be affected by age or location. It also restricts the survey to those that have access to phone services. This dataset might also be affected by access to healthcare: certain diagnoses like diabetes, heart disease, and kidney disease might be missed for lower-income people that don't have the resources to get diagnosed. While we cannot fix the collection process, we will need to do exploratory data analysis to see what demographics are represented and whether any specific groups are over- or underrepresented. 
- **Limit PII exposure**: Health information is inherently personally identifiable. However, the dataset has been cleaned to only include Sex, Age, and Race as the most PII. We can do research into whether these factors are very important in predicting heart disease, or if there's negligible difference. If there are not significant differences, we can anonymize this dataset further. 
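
The representation check mentioned under collection bias could start from something like the sketch below, which reports each group's share of the rows for a demographic column. The helper name `group_shares` and the toy values are assumptions for illustration only.

```python
import pandas as pd

def group_shares(df, column):
    """Hypothetical helper: fraction of rows in each category, largest first."""
    return df[column].value_counts(normalize=True)

# Toy rows (hypothetical values) standing in for the real dataset.
toy = pd.DataFrame({"Race": ["White", "White", "White", "Black"]})
shares = group_shares(toy, "Race")
```

Comparing these shares against census figures for the same groups would be one way to judge over- or underrepresentation.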

Data Storage
- **Data retention plan**: This dataset is public and managed by someone else. However, in the testing phase, we should not store any results of people who test our model if they input their own information. 

Deployment
- **Monitoring and Evaluation**: If this ML project were to go into production, we would not collect user data. Any computations would be done on the client side, meaning we have no access to any of the user inputs and thus cannot store them.
- **Redress**: To prevent unintentional user harm, we would also put a warning that this model is not a recommendation from medical professionals and is purely based on data. If this were to be in production for a while, we could update our model based on yearly new releases from the CDC's BRFSS. We can also provide a feedback form for any user complaints. 

As we continue to work with the data and develop our model/metrics, we will revisit the data ethics checklist to ensure we address potential ethical concerns.


# Team Expectations 

Put things here that cement how you will interact/communicate as a team, how you will handle conflict and difficulty, how you will handle making decisions and setting goals/schedule, how much work you expect from each other, how you will handle deadlines, etc...

* We will communicate openly and transparently within the team, sharing information, updates, and concerns in a timely manner.

* We will make decisions collectively by discussing and weighing options together, considering input from all team members, and striving for consensus.

* We will set clear goals and schedules for projects, break down tasks, allocate responsibilities fairly, and hold each other accountable for meeting deadlines and deliverables. 

* We will collaborate respectfully by valuing each team member's ideas, opinions, and contributions, encouraging open dialogue, active listening, and resolving conflicts professionally.

* We will provide and receive feedback constructively, offering praise and areas for improvement in a timely and respectful manner, fostering a culture of continuous growth and development within the team.

* We will be adaptable and flexible in our approach to work, recognizing that priorities and circumstances may change. 

* We will take ownership of our work and hold ourselves and each other accountable for meeting our commitments. This includes being proactive in seeking help when needed, taking responsibility for our actions, and following through on our commitments.

* We will strive for continuous improvement in our work and processes, seeking feedback, learning from our experiences, and finding ways to work more effectively as a team. 

# Project Timeline Proposal

Replace this with something meaningful that is appropriate for your needs. It doesn't have to be something that fits this format.  It doesn't have to be set in stone... "no battle plan survives contact with the enemy". But you need a battle plan nonetheless, and you need to keep it updated so you understand what you are trying to accomplish, who's responsible for what, and what the expected due dates are for each item.

| Meeting Date  | Meeting Time| Completed Before Meeting  | Discuss at Meeting |
|---|---|---|---|
| 2/17  |  1 PM |  Brainstorm topics/questions (all)  | Determine best form of communication; Discuss and decide on final project topic; discuss hypothesis; begin background research | 
| 1/26  |  10 AM |  Do background research on topic (Pelé) | Discuss ideal dataset(s) and ethics; draft project proposal | 
| 2/1  | 10 AM  | Edit, finalize, and submit proposal; Search for datasets (Beckenbaur)  | Discuss Wrangling and possible analytical approaches; Assign group members to lead each specific part   |
| 2/14  | 6 PM  | Import & wrangle data, do some EDA (Maradonna) | Review/Edit wrangling/EDA; Discuss Analysis Plan   |
| 2/23  | 12 PM  | Finalize wrangling/EDA; Begin programming for project (Cruyff) | Discuss/edit project code; Complete project |
| 3/13  | 12 PM  | Complete analysis; Draft results/conclusion/discussion (Carlos)| Discuss/edit full project |
| 3/19  | Before 11:59 PM  | NA | Turn in Final Project  |

# Data Cleaning

In [2]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt 

In [8]:
heart_2020_cleaned = pd.read_csv("heart_2020_cleaned.csv")
heart_2020_cleaned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 319795 entries, 0 to 319794
Data columns (total 18 columns):
 #   Column            Non-Null Count   Dtype  
---  ------            --------------   -----  
 0   HeartDisease      319795 non-null  object 
 1   BMI               319795 non-null  float64
 2   Smoking           319795 non-null  object 
 3   AlcoholDrinking   319795 non-null  object 
 4   Stroke            319795 non-null  object 
 5   PhysicalHealth    319795 non-null  float64
 6   MentalHealth      319795 non-null  float64
 7   DiffWalking       319795 non-null  object 
 8   Sex               319795 non-null  object 
 9   AgeCategory       319795 non-null  object 
 10  Race              319795 non-null  object 
 11  Diabetic          319795 non-null  object 
 12  PhysicalActivity  319795 non-null  object 
 13  GenHealth         319795 non-null  object 
 14  SleepTime         319795 non-null  float64
 15  Asthma            319795 non-null  object 
 16  KidneyDisease     31

In [39]:
print(list(heart_2020_cleaned.columns))
columns = list(heart_2020_cleaned.columns)
print("------------------------------------------------")
print(heart_2020_cleaned.dtypes)
print("------------------------------------------------")
print(heart_2020_cleaned.isna().sum())

['HeartDisease', 'BMI', 'Smoking', 'AlcoholDrinking', 'Stroke', 'PhysicalHealth', 'MentalHealth', 'DiffWalking', 'Sex', 'AgeCategory', 'Race', 'Diabetic', 'PhysicalActivity', 'GenHealth', 'SleepTime', 'Asthma', 'KidneyDisease', 'SkinCancer']
------------------------------------------------
HeartDisease         object
BMI                 float64
Smoking              object
AlcoholDrinking      object
Stroke               object
PhysicalHealth      float64
MentalHealth        float64
DiffWalking          object
Sex                  object
AgeCategory          object
Race                 object
Diabetic             object
PhysicalActivity     object
GenHealth            object
SleepTime           float64
Asthma               object
KidneyDisease        object
SkinCancer           object
dtype: object
------------------------------------------------
HeartDisease        0
BMI                 0
Smoking             0
AlcoholDrinking     0
Stroke              0
PhysicalHealth      0
MentalHeal

In [40]:
for column in columns:
    # Show every distinct value so categorical levels are easy to inspect.
    print(column, heart_2020_cleaned[column].unique())
    # Numeric columns also get summary statistics (count, mean, quartiles, min/max).
    if heart_2020_cleaned[column].dtype != "object":
        print(heart_2020_cleaned[column].describe())
    print("------------------------------------------------")

HeartDisease ['No' 'Yes']
------------------------------------------------
BMI [16.6  20.34 26.58 ... 62.42 51.46 46.56]
count    319795.000000
mean         28.325399
std           6.356100
min          12.020000
25%          24.030000
50%          27.340000
75%          31.420000
max          94.850000
Name: BMI, dtype: float64
------------------------------------------------
Smoking ['Yes' 'No']
------------------------------------------------
AlcoholDrinking ['No' 'Yes']
------------------------------------------------
Stroke ['No' 'Yes']
------------------------------------------------
PhysicalHealth [ 3.  0. 20. 28.  6. 15.  5. 30.  7.  1.  2. 21.  4. 10. 14. 18.  8. 25.
 16. 29. 27. 17. 24. 12. 23. 26. 22. 19.  9. 13. 11.]
count    319795.00000
mean          3.37171
std           7.95085
min           0.00000
25%           0.00000
50%           0.00000
75%           2.00000
max          30.00000
Name: PhysicalHealth, dtype: float64
------------------------------------------------

In [7]:
heart_2022_no = pd.read_csv("heart_2022_no_nans.csv")
heart_2022_no.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 246022 entries, 0 to 246021
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      246022 non-null  object 
 1   Sex                        246022 non-null  object 
 2   GeneralHealth              246022 non-null  object 
 3   PhysicalHealthDays         246022 non-null  float64
 4   MentalHealthDays           246022 non-null  float64
 5   LastCheckupTime            246022 non-null  object 
 6   PhysicalActivities         246022 non-null  object 
 7   SleepHours                 246022 non-null  float64
 8   RemovedTeeth               246022 non-null  object 
 9   HadHeartAttack             246022 non-null  object 
 10  HadAngina                  246022 non-null  object 
 11  HadStroke                  246022 non-null  object 
 12  HadAsthma                  246022 non-null  object 
 13  HadSkinCancer              24

In [9]:
heart_2022 = pd.read_csv("heart_2022_with_nans.csv")
heart_2022.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 445132 entries, 0 to 445131
Data columns (total 40 columns):
 #   Column                     Non-Null Count   Dtype  
---  ------                     --------------   -----  
 0   State                      445132 non-null  object 
 1   Sex                        445132 non-null  object 
 2   GeneralHealth              443934 non-null  object 
 3   PhysicalHealthDays         434205 non-null  float64
 4   MentalHealthDays           436065 non-null  float64
 5   LastCheckupTime            436824 non-null  object 
 6   PhysicalActivities         444039 non-null  object 
 7   SleepHours                 439679 non-null  float64
 8   RemovedTeeth               433772 non-null  object 
 9   HadHeartAttack             442067 non-null  object 
 10  HadAngina                  440727 non-null  object 
 11  HadStroke                  443575 non-null  object 
 12  HadAsthma                  443359 non-null  object 
 13  HadSkinCancer              44

# Footnotes
<a name="lorenznote"></a>1.[^](#lorenz): Lorenz, T. (9 Dec 2021) Birds Aren’t Real, or Are They? Inside a Gen Z Conspiracy Theory. *The New York Times*. https://www.nytimes.com/2021/12/09/technology/birds-arent-real-gen-z-misinformation.html<br> 
<a name="admonishnote"></a>2.[^](#admonish): Also refs should be important to the background, not some randomly chosen vaguely related stuff. Include a web link if possible in refs as above.<br>
<a name="sotanote"></a>3.[^](#sota): Perhaps the current state of the art solution such as you see on [Papers with code](https://paperswithcode.com/sota). Or maybe not SOTA, but rather a standard textbook/Kaggle solution to this kind of problem
