# STAT207 Final Project - Predicting Obesity Risk


Idunnuoluwa Akinola, Zach Goldstein, Anh Do

## 1. Introduction

According to the WHO, "Rates of overweight and obesity continue to grow in adults and children. From 1975 to 2016, the prevalence of overweight or obese children and adolescents aged 5–19 years increased more than four-fold from 4% to 18% globally" (WHO, 2020). Obesity results from a combination of factors, including lifestyle, diet, physical activity, and the broader environment (Mayo Clinic, 2023). Changes in work, transportation, and leisure activities have led to more sedentary lifestyles: desk jobs, reliance on cars, and increased screen time all contribute to a lack of physical activity. Some individuals may also have a genetic predisposition to obesity, making it more challenging for them to maintain a healthy weight. A model that predicts, for new data, whether a person may be at risk for obesity would help those affected seek preventative treatment.

We explore a dataset on factors that contribute to obesity. Our goal is to build a predictive model that predicts family history with overweight (`family_history_with_overweight`) for new data using:
* Weight
* Frequency of consumption of vegetables (FCVC)
* Physical activity frequency (FAF)
* Number of main meals (NCP), and
* Frequent consumption of high caloric food (FAVC)

Scientists, nutritionists, doctors, and individuals will find this model useful because it will help them pinpoint common causes of obesity and develop more effective treatments and preventative measures. These groups would prefer a classifier that is equally good at classifying positives and negatives, because poor performance on either could lead to undetected health conditions or to medical overtreatment. We would also like our chosen model to provide reliable interpretive insights into the relationships between the variables in the dataset.
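One way to make "equally good at classifying positives and negatives" concrete is to report sensitivity (true positive rate) and specificity (true negative rate) side by side. The sketch below is illustrative only: the helper name `sensitivity_specificity`, the 0/1 coding (1 = family history of overweight), and the example labels are assumptions made for demonstration, not results from this project.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1.

    Assumes both classes appear at least once in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positives correctly flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # positives missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negatives correctly cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # negatives wrongly flagged
    return tp / (tp + fn), tn / (tn + fp)


# Made-up labels for illustration (not taken from the dataset).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

sens, spec = sensitivity_specificity(y_true, y_pred)
balanced_accuracy = (sens + spec) / 2  # averages the two rates with equal weight
print(sens, spec, balanced_accuracy)
```

A classifier that suits the stated preference would have these two rates close to each other, rather than a high overall accuracy driven by only one class.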

----

**Citations**

Mayo Clinic. (2023, July 22). Obesity - Symptoms and causes - Mayo Clinic. Retrieved December 5, 2023, from https://www.mayoclinic.org/diseases-conditions/obesity/symptoms-causes/syc-20375742 

WHO. (2020, February 21). Obesity. Retrieved December 5, 2023, from https://www.who.int/health-topics/obesity#tab=tab_1

## 2. Dataset Discussion

The data consist of estimates of obesity levels in people from Mexico, Peru, and Colombia, aged between 14 and 61, with diverse eating habits and physical conditions. The data were collected through a web platform with a survey in which anonymous users answered each question; the responses were then processed to obtain 17 attributes and 2111 records.
The attributes related to eating habits are: Frequent consumption of high caloric food (FAVC), Frequency of consumption of vegetables (FCVC), Number of main meals (NCP), Consumption of food between meals (CAEC), Consumption of water daily (CH2O), and Consumption of alcohol (CALC). The attributes related to physical condition are: Calories consumption monitoring (SCC), Physical activity frequency (FAF), Time using technology devices (TUE), and Transportation used (MTRANS). The dataset was downloaded on 12/04/23 from https://www.kaggle.com/datasets/aravindpcoder/obesity-or-cvd-risk-classifyregressorcluster/data

In [1]:
#Run this
import pandas as pd                    # imports pandas and calls the imported version 'pd'
import matplotlib.pyplot as plt        # imports the package and calls it 'plt'
import seaborn as sns                  # imports the seaborn package with the imported name 'sns'
sns.set()  


In [2]:
df=pd.read_csv("ObesityDataSet.csv")
df

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.000000,1.620000,64.000000,yes,no,2.0,3.0,Sometimes,no,2.000000,no,0.000000,1.000000,no,Public_Transportation,Normal_Weight
1,Female,21.000000,1.520000,56.000000,yes,no,3.0,3.0,Sometimes,yes,3.000000,yes,3.000000,0.000000,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.000000,1.800000,77.000000,yes,no,2.0,3.0,Sometimes,no,2.000000,no,2.000000,1.000000,Frequently,Public_Transportation,Normal_Weight
3,Male,27.000000,1.800000,87.000000,no,no,3.0,3.0,Sometimes,no,2.000000,no,2.000000,0.000000,Frequently,Walking,Overweight_Level_I
4,Male,22.000000,1.780000,89.800000,no,no,2.0,1.0,Sometimes,no,2.000000,no,0.000000,0.000000,Sometimes,Public_Transportation,Overweight_Level_II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2106,Female,20.976842,1.710730,131.408528,yes,yes,3.0,3.0,Sometimes,no,1.728139,no,1.676269,0.906247,Sometimes,Public_Transportation,Obesity_Type_III
2107,Female,21.982942,1.748584,133.742943,yes,yes,3.0,3.0,Sometimes,no,2.005130,no,1.341390,0.599270,Sometimes,Public_Transportation,Obesity_Type_III
2108,Female,22.524036,1.752206,133.689352,yes,yes,3.0,3.0,Sometimes,no,2.054193,no,1.414209,0.646288,Sometimes,Public_Transportation,Obesity_Type_III
2109,Female,24.361936,1.739450,133.346641,yes,yes,3.0,3.0,Sometimes,no,2.852339,no,1.139107,0.586035,Sometimes,Public_Transportation,Obesity_Type_III


In [8]:
original_row_num=df.shape[0]
original_row_num

2111

In [9]:
df.isna().sum()

Gender                            0
Age                               0
Height                            0
Weight                            0
family_history_with_overweight    0
FAVC                              0
FCVC                              0
NCP                               0
CAEC                              0
SMOKE                             0
CH2O                              0
SCC                               0
FAF                               0
TUE                               0
CALC                              0
MTRANS                            0
NObeyesdad                        0
dtype: int64

## 3. Dataset Cleaning

We intend to use only the following variables: family history with overweight (`family_history_with_overweight`), Frequent consumption of high caloric food (FAVC), Frequency of consumption of vegetables (FCVC), Number of main meals (NCP), Physical activity frequency (FAF), and Weight.

In [12]:
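# Keep only the response and the five explanatory variables; .copy() gives an independent DataFrame
df2 = df[['family_history_with_overweight', 'FAVC', 'FCVC', 'NCP', 'FAF', 'Weight']].copy()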
df2

Unnamed: 0,family_history_with_overweight,FAVC,FCVC,NCP,FAF,Weight
0,yes,no,2.0,3.0,0.000000,64.000000
1,yes,no,3.0,3.0,3.000000,56.000000
2,yes,no,2.0,3.0,2.000000,77.000000
3,no,no,3.0,3.0,2.000000,87.000000
4,no,no,2.0,1.0,0.000000,89.800000
...,...,...,...,...,...,...
2106,yes,yes,3.0,3.0,1.676269,131.408528
2107,yes,yes,3.0,3.0,1.341390,133.742943
2108,yes,yes,3.0,3.0,1.414209,133.689352
2109,yes,yes,3.0,3.0,1.139107,133.346641


In [13]:
df2.dtypes


family_history_with_overweight     object
FAVC                               object
FCVC                              float64
NCP                               float64
FAF                               float64
Weight                            float64
dtype: object

In [15]:
df2['family_history_with_overweight'].unique()

array(['yes', 'no'], dtype=object)

In [16]:
df2['FAVC'].unique()

array(['no', 'yes'], dtype=object)

After inspecting the dataset restricted to the selected variables, we find no missing values, and the two categorical variables take only the values 'yes' and 'no'. We therefore conclude that no values need to be cleaned and no rows need to be dropped.
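The checks in this section can be bundled into one small report that counts explicit `NaN` values and unexpected category labels per column. This is a minimal sketch, not the project's actual code: the helper `cleaning_report` is hypothetical, and the `sample` frame is a small stand-in that only mirrors the column names of `df2`.

```python
import pandas as pd

# Small stand-in frame with the same column names as df2 (values are illustrative only).
sample = pd.DataFrame({
    "family_history_with_overweight": ["yes", "no", "yes", "no"],
    "FAVC": ["no", "yes", "yes", "no"],
    "FCVC": [2.0, 3.0, 2.0, 3.0],
    "NCP": [3.0, 3.0, 1.0, 3.0],
    "FAF": [0.0, 3.0, 2.0, 0.0],
    "Weight": [64.0, 56.0, 77.0, 89.8],
})


def cleaning_report(frame, categorical_cols, allowed=("yes", "no")):
    """Count explicit NaNs and unexpected category labels for each column."""
    report = {}
    for col in frame.columns:
        n_missing = int(frame[col].isna().sum())
        n_unexpected = 0
        if col in categorical_cols:
            # Non-missing labels outside the allowed set (e.g. "?" or "unknown")
            # would act as implicit missing values.
            n_unexpected = int((~frame[col].isin(allowed) & frame[col].notna()).sum())
        report[col] = {"missing": n_missing, "unexpected": n_unexpected}
    return report


report = cleaning_report(sample, ["family_history_with_overweight", "FAVC"])
cells_flagged = sum(v["missing"] + v["unexpected"] for v in report.values())
print(report)
print("cells needing cleaning:", cells_flagged)
```

Applied to `df2`, a total of zero flagged cells would support the conclusion that no cleaning is required; any non-zero count would point to the column that needs attention.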

## 4. Preliminary Analysis

## 5. Model Data Preprocessing

## 6. Feature Selection with k-Fold Cross-Validation

## 7. Best Model Discussion

## 8. Additional Analysis/Insight

## 9. Conclusion

## References