# **Basic Set-up**

In [None]:
# Basic Libraries
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
sb.set() # set the default Seaborn style for graphics

In [None]:
df = pd.read_csv('rawdata.csv')
df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2111 non-null   object 
 1   Age                             2111 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2111 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object 
 12  FAF                             21

# **Data Preparation & Cleaning**
- Remove irrelevant columns
- Add BMI as a column
- Rename columns to improve readability
- Add an Obesity column (True or False; BMI >= 27.5 means Obesity = True)
- Recode the levels 'yes' and 'no' of family_ob_hist and freq_high_cal_food as 1 and 0 respectively
- Recode alcohol consumption (CALC) as numeric values
- Remove outliers

**Content of original columns in raw data set**

Gender: male or female

Age: age

Height: height

Weight: weight

family_history_with_overweight: Has a family member suffered or suffers from overweight? - yes or no

FAVC: Frequent consumption of high caloric food - yes or no

FCVC: Frequency of consumption of vegetables - Never, Sometimes, Always

NCP: Number of main meals - 1, 2, 3, 4

CAEC: Consumption of food between meals - No, Sometimes, Frequently, Always

SMOKE: Do you smoke - yes or no

CH2O: Consumption of water daily - less than a liter, between 1 and 2 l, more than 2 l

SCC: Calories consumption monitoring - yes or no

FAF: Physical activity frequency - 0, 1 to 2, 2 to 4, 4 to 5

TUE: Time using technology devices - 0 to 2, 3 to 5, >5

CALC: Consumption of alcohol - no, sometimes, frequently, always

MTRANS: Transportation used - automobile, motorbike, bike, public_transportation, walking

NObeyesdad: Type of obesity - insufficient_weight, normal_weight, overweight-level_i, overweight-level_ii, obesity_type_i, obesity_type_ii, obesity_type_iii

**Justification for why we removed some columns (not important in determining obesity) and why the ones we kept are important**
- Frequency of consumption of vegetables (FCVC): This factor influences diet quality but does not necessarily prevent obesity as a person can eat vegetables regularly while still consuming excessive high-caloric foods, leading to obesity.

- Consumption of water daily (CH2O): Compared to other factors, water intake alone does not play a big part in determining obesity.

- Time using technology devices (TUE) & transportation used (MTRANS): While the amount of screen time and the type of commute are linked to how sedentary a person's lifestyle is, we chose to keep FAF, which already captures this.

- Number of Main Meals (NCP): This is similar to FAVC, so we chose to keep FAVC and drop NCP.

We chose to focus on the 4 predictors below because, from our research, they are the top 4 contributors to obesity. We would like to find out which of these 4 factors contributes most and which contributes least to obesity.
1. Frequent consumption of high-caloric food (FAVC) – Regular intake of energy-dense foods leads to excess calorie intake, increasing obesity risk.
2. Low physical activity frequency (FAF) – A sedentary lifestyle with little to no exercise reduces calorie expenditure, promoting weight gain.
3. Family history of overweight (family_history_with_overweight) – Genetic predisposition and shared lifestyle habits contribute to obesity risk.
4. Consumption of alcohol (CALC) – Alcohol consumption increases obesity risk by adding empty calories, slowing fat metabolism, and stimulating appetite.

In [None]:
# Columns to remove (the raw CALC text column is replaced by a numeric version below)
columns_to_drop = ['FCVC', 'MTRANS', 'SMOKE', 'CH2O', 'SCC', 'CALC', 'CAEC', 'TUE', 'NObeyesdad', 'NCP', 'Age', 'Gender']

# Drop the columns
df_cleaned = df.drop(columns=columns_to_drop)

# Standardize the CALC labels in the original df to lowercase so they match the mapping keys
df['CALC'] = df['CALC'].str.lower()

# Mapping values for alcohol consumption
alc_mapping = {'always': 3, 'frequently': 2, 'sometimes': 1, 'no': 0}

# Map the original CALC labels to numbers and store the result in df_cleaned
df_cleaned['CALC_numeric'] = df['CALC'].map(alc_mapping)

# Rename Columns to increase readability
df_cleaned = df_cleaned.rename(columns={
    'family_history_with_overweight': 'family_ob_hist',
    'FAVC': 'freq_high_cal_food',
    'CALC_numeric': 'consumption_of_alcohol',
    'FAF': 'phy_act_freq'
})

# Recode values (0 = no; 1 = yes); map() avoids the downcasting FutureWarning raised by replace()
df_cleaned['family_ob_hist'] = df_cleaned['family_ob_hist'].map({'no': 0, 'yes': 1})
df_cleaned['freq_high_cal_food'] = df_cleaned['freq_high_cal_food'].map({'no': 0, 'yes': 1})

# Calculate BMI
df_cleaned['BMI'] = df['Weight'] / (df['Height']**2)

# Add new column ('Obesity' - True ; False)
df_cleaned['Obesity'] = df_cleaned['BMI'] >= 27.5

# Display the first few rows to confirm changes
df_cleaned.head(10)


Unnamed: 0,Height,Weight,family_ob_hist,freq_high_cal_food,phy_act_freq,consumption_of_alcohol,BMI,Obesity
0,1.62,64.0,1,0,0.0,0,24.386526,False
1,1.52,56.0,1,0,3.0,1,24.238227,False
2,1.8,77.0,1,0,2.0,2,23.765432,False
3,1.8,87.0,0,0,2.0,2,26.851852,False
4,1.78,89.8,0,0,0.0,1,28.342381,True
5,1.62,53.0,0,1,0.0,1,20.195092,False
6,1.5,55.0,1,1,1.0,1,24.444444,False
7,1.64,53.0,0,0,3.0,1,19.705532,False
8,1.78,64.0,1,1,1.0,2,20.19947,False
9,1.72,68.0,1,1,1.0,0,22.985398,False
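
As a quick sanity check on the BMI and Obesity columns, the formula can be applied by hand to two rows of the table above. This is a minimal sketch in plain Python; the helper name `bmi` and the constant `OBESITY_THRESHOLD` are our own labels for illustration, while the 27.5 cut-off is the one used in this notebook.

```python
# Hypothetical helper names; the formula and threshold follow the cleaning cell above.
OBESITY_THRESHOLD = 27.5

def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2

# (row label, height in m, weight in kg) taken from rows 3 and 4 of the output
rows = [(3, 1.80, 87.0), (4, 1.78, 89.8)]

for label, height, weight in rows:
    value = bmi(weight, height)
    print(label, round(value, 6), value >= OBESITY_THRESHOLD)
```

Row 3 falls just below the threshold and row 4 just above it, which agrees with the Obesity values shown in the table.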


In [None]:
# Check for missing values
missing_values = df_cleaned.isnull().sum()

# Display the result
print(missing_values)

#Sample size
total_rows = df_cleaned.shape[0]
print("Sample size in df_cleaned:", total_rows)

#Count no. of Obese people (BMI >= 27.5) in the original sample
obese_count = df_cleaned['Obesity'].sum()
print("Number of obese individuals:", obese_count)

Height                    0
Weight                    0
family_ob_hist            0
freq_high_cal_food        0
phy_act_freq              0
consumption_of_alcohol    0
BMI                       0
Obesity                   0
dtype: int64
Sample size in df_cleaned: 2111
Number of obese individuals: 1206
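
To see how balanced the two classes are, the counts printed above can be turned into a share. This is a minimal sketch that uses only the two printed figures (2111 rows, 1206 of them flagged as obese, both before any outliers are removed); the variable names `obese_share` and `non_obese_count` are our own.

```python
# Figures copied from the printed output above (before outlier removal)
total_rows = 2111
obese_count = 1206

non_obese_count = total_rows - obese_count
obese_share = obese_count / total_rows

print(f"Non-obese individuals: {non_obese_count}")
print(f"Share flagged as obese: {obese_share:.1%}")
```

Roughly 57% of the sample is flagged as obese, so the two classes are not severely imbalanced.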


# **Removing Outliers**

In [None]:
# Define a function to find outliers using IQR
def detect_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return data[(data[column] < lower_bound) | (data[column] > upper_bound)]

# Check for outliers in Height, Weight and BMI
outliers_height = detect_outliers_iqr(df_cleaned, 'Height')
outliers_weight = detect_outliers_iqr(df_cleaned, 'Weight')
outliers_BMI = detect_outliers_iqr(df_cleaned, 'BMI')

print("Outliers in Height:")
print(outliers_height)
print("\nOutliers in Weight:")
print(outliers_weight)
print("\nOutliers in BMI:")
print(outliers_BMI)
print(df_cleaned['BMI'].describe())

Outliers in Height:
     Height  Weight  family_ob_hist  freq_high_cal_food  phy_act_freq  \
349    1.98   125.0               1                   1           1.0   

     consumption_of_alcohol        BMI  Obesity  
349                       1  31.884502     True  

Outliers in Weight:
     Height  Weight  family_ob_hist  freq_high_cal_food  phy_act_freq  \
344    1.87   173.0               1                   1           2.0   

     consumption_of_alcohol       BMI  Obesity  
344                       1  49.47239     True  

Outliers in BMI:
Empty DataFrame
Columns: [Height, Weight, family_ob_hist, freq_high_cal_food, phy_act_freq, consumption_of_alcohol, BMI, Obesity]
Index: []
count    2111.000000
mean       29.700159
std         8.011337
min        12.998685
25%        24.325802
50%        28.719089
75%        36.016501
max        50.811753
Name: BMI, dtype: float64
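
The empty BMI result can be verified from the summary statistics alone. The sketch below recomputes the IQR fences from the quartiles printed above and compares them with the printed minimum and maximum; it assumes nothing beyond those printed numbers, and the variable names are our own.

```python
# Quartiles, minimum and maximum copied from df_cleaned['BMI'].describe() above
q1, q3 = 24.325802, 36.016501
bmi_min, bmi_max = 12.998685, 50.811753

iqr = q3 - q1
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

print("IQR:", round(iqr, 6))
print("Lower fence:", round(lower_bound, 6))
print("Upper fence:", round(upper_bound, 6))
print("All BMI values inside the fences:", lower_bound <= bmi_min and bmi_max <= upper_bound)
```

Both the smallest and the largest BMI lie inside the fences, which is why `detect_outliers_iqr` returns an empty DataFrame for BMI.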


In [None]:
# Remove the Height and Weight outliers found above
# (the index union avoids a KeyError if one row is flagged for both columns)
outlier_index = outliers_height.index.union(outliers_weight.index)
df_cleaned = df_cleaned.drop(outlier_index)

There is 1 outlier in Height and 1 outlier in Weight, so we removed these 2 rows. We then drop the Height and Weight columns, since we have already used them to calculate BMI, which determines obesity.
There are **no BMI outliers**, so nothing further needs to be removed.
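
The detect-then-drop steps above could also be folded into one helper that filters several columns in a single pass. The sketch below is one possible way to do it, not the method used in this notebook: `drop_outliers_iqr` and the toy DataFrame are hypothetical, and the fences are computed on the unfiltered data for every column, as in the detection cell above.

```python
import pandas as pd

def drop_outliers_iqr(data, columns, k=1.5):
    """Return the rows of `data` that lie within the IQR fences of every listed column."""
    keep = pd.Series(True, index=data.index)
    for column in columns:
        q1 = data[column].quantile(0.25)
        q3 = data[column].quantile(0.75)
        iqr = q3 - q1
        # between() is inclusive, so values exactly on a fence are kept
        keep &= data[column].between(q1 - k * iqr, q3 + k * iqr)
    return data[keep]

# Toy data (made up for illustration): one extreme weight and one extreme height
toy = pd.DataFrame({
    "Height": [1.60, 1.65, 1.70, 1.72, 1.75, 2.60],
    "Weight": [60.0, 65.0, 70.0, 72.0, 300.0, 75.0],
})

trimmed = drop_outliers_iqr(toy, ["Height", "Weight"])
print(trimmed)
```

Because the mask is combined across columns before filtering, a row flagged in more than one column is removed once and no index lookup can fail.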

In [None]:
df_cleaned = df_cleaned.drop(columns=['Height', 'Weight'])
df_cleaned.head()

Unnamed: 0,family_ob_hist,freq_high_cal_food,phy_act_freq,consumption_of_alcohol,BMI,Obesity
0,1,0,0.0,0,24.386526,False
1,1,0,3.0,1,24.238227,False
2,1,0,2.0,2,23.765432,False
3,0,0,2.0,2,26.851852,False
4,0,0,0.0,1,28.342381,True


In [None]:
#CSV of cleaned data after data preparation and cleaning
df_cleaned.to_csv('df_cleaned.csv')
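
Note that `to_csv` writes the row index as an unnamed first column by default, so the file needs to be read back with `index_col=0` to recover the same frame. The sketch below illustrates this with a small made-up frame and an in-memory buffer instead of `df_cleaned.csv`; the names `sample`, `naive` and `restored` are ours.

```python
import io
import pandas as pd

# Small stand-in for df_cleaned, with a gap in the index as left by outlier removal
sample = pd.DataFrame(
    {"BMI": [24.386526, 28.342381], "Obesity": [False, True]},
    index=[0, 4],
)

buffer = io.StringIO()
sample.to_csv(buffer)  # same default settings as the cell above

buffer.seek(0)
naive = pd.read_csv(buffer)  # the index comes back as an extra column

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # the index is restored

print(list(naive.columns))
print(list(restored.columns))
print(list(restored.index))
```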