# STAT207 Final Project - Predicting Obesity Risk


Idunnuoluwa Akinola, Zach Goldstein, Anh Do

## 1. Introduction

According to the WHO, "Rates of overweight and obesity continue to grow in adults and children. From 1975 to 2016, the prevalence of overweight or obese children and adolescents aged 5–19 years increased more than four-fold from 4% to 18% globally" (WHO, 2020). Obesity results from a combination of factors, including lifestyle, diet, physical activity, and the broader environment (Mayo Clinic, 2023). Changes in work, transportation, and leisure activities have led to more sedentary lifestyles: desk jobs, reliance on cars, and increased screen time all contribute to a lack of physical activity. Some individuals may also have a genetic predisposition to obesity, making it more challenging for them to maintain a healthy weight. A model that predicts whether a person in new data may be at risk for obesity would help those affected seek preventative treatment.

We are going to explore a dataset about factors that contribute to obesity. Our goal is to build a predictive model that predicts whether a person has a family history of overweight, for new data, using:
* Weight
* Frequency of consumption of vegetables (FCVC)
* Physical activity frequency (FAF)
* Number of main meals (NCP), and
* Frequent consumption of high caloric food (FAVC)

Scientists, nutritionists, doctors, and individuals will find this model useful because it can help them pinpoint common causes of obesity and develop more effective treatments and preventative measures. These groups would prefer a classifier that is equally good at classifying positives and negatives, because poor performance on either could lead to undetected health conditions or medical overtreatment. We would also like our chosen model to provide reliable, interpretable insights about the relationships between the variables in the dataset.
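To make "equally good at classifying positives and negatives" concrete, the sketch below computes sensitivity (true positive rate) and specificity (true negative rate) from a pair of label lists, then averages them into balanced accuracy. The function name `sensitivity_specificity` and the example labels are hypothetical and used only for illustration. They are not results from our dataset.

```python
def sensitivity_specificity(y_true, y_pred, positive="yes"):
    """Return (sensitivity, specificity) for two equal-length label sequences."""
    tp = fn = tn = fp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            if predicted == positive:
                tp += 1   # positive correctly flagged
            else:
                fn += 1   # positive missed (undetected condition)
        else:
            if predicted == positive:
                fp += 1   # negative wrongly flagged (possible overtreatment)
            else:
                tn += 1   # negative correctly cleared
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical labels, made up for this illustration only
y_true = ["yes", "yes", "yes", "no", "no", "yes"]
y_pred = ["yes", "no", "yes", "no", "yes", "yes"]

sens, spec = sensitivity_specificity(y_true, y_pred)
balanced_accuracy = (sens + spec) / 2
print(sens, spec, balanced_accuracy)
```

A classifier that serves both groups well would keep these two rates close to each other rather than maximizing one at the expense of the other.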

----

**Citations**

Mayo Clinic. (2023, July 22). Obesity - Symptoms and causes - Mayo Clinic. Retrieved December 5, 2023, from https://www.mayoclinic.org/diseases-conditions/obesity/symptoms-causes/syc-20375742 

WHO. (2020, February 21). Obesity. Retrieved December 5, 2023, from https://www.who.int/health-topics/obesity#tab=tab_1

## 2. Dataset Discussion

The dataset contains estimated obesity levels for people from Mexico, Peru, and Colombia, aged 14 to 61, with diverse eating habits and physical conditions. Each row represents data collected from one person. The data was collected through a web survey in which anonymous users answered questions about attributes such as Frequency of consumption of vegetables (FCVC), Number of main meals (NCP), and Physical activity frequency (FAF). The dataset does not include all possible types of observations: data was collected from only one region of the world, which may have specific eating habits not reflected elsewhere, so populations from other continents are left out. The dataset was downloaded on 12/04/23 from Kaggle (https://www.kaggle.com/datasets/aravindpcoder/obesity-or-cvd-risk-classifyregressorcluster/data). Since only one population is represented, the groups described in our research motivation will have to conduct more research or adjust for their own demographic before drawing conclusions from model predictions.

Our response variable is family_history_with_overweight, which indicates whether the respondent has or had a family member who is or was overweight; this is correlated with obesity risk. We will use the following explanatory variables:
* Weight: weight of the person surveyed, in kilograms
* Frequency of consumption of vegetables (FCVC): number of vegetables consumed per day
* Physical activity frequency (FAF): number of times per week person engages in physical activity
* Number of main meals (NCP): number of main meals eaten per day
* Frequent consumption of high caloric food (FAVC): whether the person frequently consumes high caloric food (yes/no)

We chose to focus on these explanatory variables as they relate to diet quality and physical condition, which are very relevant to obesity.

In [17]:
# Import the libraries used throughout the analysis
import pandas as pd                    # data loading and manipulation, imported as 'pd'
import matplotlib.pyplot as plt        # plotting, imported as 'plt'
import seaborn as sns                  # statistical plots built on matplotlib, imported as 'sns'
sns.set()                              # apply seaborn's default plot theme

In [18]:
df=pd.read_csv("ObesityDataSet.csv")
df.head(5)

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


In [19]:
original_row_num=df.shape[0]
print("Rows before data cleaning: " + str(original_row_num))

Rows before data cleaning: 2111


In [20]:
df.isna().sum()

Gender                            0
Age                               0
Height                            0
Weight                            0
family_history_with_overweight    0
FAVC                              0
FCVC                              0
NCP                               0
CAEC                              0
SMOKE                             0
CH2O                              0
SCC                               0
FAF                               0
TUE                               0
CALC                              0
MTRANS                            0
NObeyesdad                        0
dtype: int64

## 3. Dataset Cleaning

We only intend to use family history with overweight, Frequency of consumption of vegetables (FCVC), Physical activity frequency (FAF), Number of main meals (NCP), Weight, and Frequent consumption of high caloric food (FAVC). After inspecting the selected variables, we found no implicit missing values: all data types are consistent with what the values are supposed to be, and neither binary variable contains a third value. Our categorical explanatory variable (FAVC) also has enough observations at every level. Therefore, we did not need to drop any rows.

In [21]:
# filter columns
df = df[['family_history_with_overweight', 'FAVC', 'FCVC', 'NCP', 'FAF', 'Weight']]
df

Unnamed: 0,family_history_with_overweight,FAVC,FCVC,NCP,FAF,Weight
0,yes,no,2.0,3.0,0.000000,64.000000
1,yes,no,3.0,3.0,3.000000,56.000000
2,yes,no,2.0,3.0,2.000000,77.000000
3,no,no,3.0,3.0,2.000000,87.000000
4,no,no,2.0,1.0,0.000000,89.800000
...,...,...,...,...,...,...
2106,yes,yes,3.0,3.0,1.676269,131.408528
2107,yes,yes,3.0,3.0,1.341390,133.742943
2108,yes,yes,3.0,3.0,1.414209,133.689352
2109,yes,yes,3.0,3.0,1.139107,133.346641


In [22]:
# all datatypes are consistent with the datatype that values are supposed to be
df.dtypes


family_history_with_overweight     object
FAVC                               object
FCVC                              float64
NCP                               float64
FAF                               float64
Weight                            float64
dtype: object

In [23]:
df['family_history_with_overweight'].unique()

array(['yes', 'no'], dtype=object)

In [24]:
df['FAVC'].unique()

array(['no', 'yes'], dtype=object)

In [29]:
# categorical explanatory variable has large number of observations for every value
print(df[df['FAVC'] == 'yes'].shape[0])
print(df[df['FAVC'] == 'no'].shape[0])

1866
245
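The checks in this section can also be bundled into one reusable step. The sketch below is a minimal illustration, assuming the binary columns should contain only 'yes' and 'no'. The helper name `unexpected_levels` is hypothetical, and the small `sample` frame is a stand-in built from the first five rows displayed earlier rather than the full dataset.

```python
import pandas as pd

# Stand-in data: the first five rows shown in the df.head(5) output above
sample = pd.DataFrame({
    "family_history_with_overweight": ["yes", "yes", "yes", "no", "no"],
    "FAVC": ["no", "no", "no", "no", "no"],
    "FCVC": [2.0, 3.0, 2.0, 3.0, 2.0],
    "NCP": [3.0, 3.0, 3.0, 3.0, 1.0],
    "FAF": [0.0, 3.0, 2.0, 2.0, 0.0],
    "Weight": [64.0, 56.0, 77.0, 87.0, 89.8],
})


def unexpected_levels(frame, columns, allowed=("yes", "no")):
    """Return {column: unexpected values} for columns holding values outside `allowed`."""
    problems = {}
    for col in columns:
        extra = set(frame[col].dropna().unique()) - set(allowed)
        if extra:
            problems[col] = sorted(extra)
    return problems


binary_cols = ["family_history_with_overweight", "FAVC"]

problems = unexpected_levels(sample, binary_cols)   # empty dict means no third value
missing_total = int(sample.isna().sum().sum())      # explicit missing values across all columns
favc_counts = sample["FAVC"].value_counts()         # observations per level of FAVC

print(problems)
print(missing_total)
print(favc_counts)
```

Applied to the full `df`, the same calls would replace the separate `unique()` and row-count cells with a single check that fails loudly if a third level ever appears.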


## 4. Preliminary Analysis

## 5. Model Data Preprocessing

## 6. Feature Selection with k-Fold Cross-Validation

## 7. Best Model Discussion

## 8. Additional Analysis/Insight

## 9. Conclusion

## References