# Data understanding

1. https://archive.ics.uci.edu/dataset/544

2. This dataset includes data for estimating obesity levels in individuals from Mexico, Peru and Colombia, based on their eating habits and physical condition.

3. The dataset is a single .csv file to load; it is comma-separated and has a header row.

4. The dataset has 17 features and 2111 observations.

5. The 3 numerical features are the following:
    - Height (m): Continuous
    - Weight (kg): Continuous
    - Age: Discrete

6. The 10 categorical features are the following:
    - Gender: Nominal
    - FCVC: Ordinal
    - NCP: Ordinal
    - CAEC: Ordinal
    - CH2O: Ordinal
    - FAF: Ordinal
    - TUE: Ordinal
    - CALC: Ordinal
    - MTRANS: Nominal
    - NObeyesdad: Ordinal

7. The 4 binary features are the following:
    - family_history_with_overweight
    - FAVC
    - SMOKE
    - SCC
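
The grouping above can be written down as a small lookup, which makes it easy to confirm that the three groups add up to the 17 columns. This is only a sketch: the name `feature_types` is an assumption, and the column names follow the CSV header (`Height`, `Weight`) rather than the labels used in the list.

```python
# Hypothetical lookup of the feature groups listed above (names follow the CSV header).
feature_types = {
    "numerical": ["Height", "Weight", "Age"],
    "categorical": [
        "Gender", "FCVC", "NCP", "CAEC", "CH2O",
        "FAF", "TUE", "CALC", "MTRANS", "NObeyesdad",
    ],
    "binary": ["family_history_with_overweight", "FAVC", "SMOKE", "SCC"],
}

# Flatten the groups and count the features.
all_features = [name for group in feature_types.values() for name in group]
n_features = len(all_features)
print(n_features)
```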

# Data preprocessing

- Loading the dataset

In [1]:
import pandas as pd
import numpy as np

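# The file stores "yes"/"no" in lowercase, so true_values=["Yes"] / false_values=["No"] would never match.
# The binary columns are converted explicitly in a later cell instead.
df = pd.read_csv("../assets/ObesityDataSet_raw_and_data_sinthetic.csv", sep=",")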
df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


**Check structure:**


In [2]:
df.shape

(2111, 17)

**Check whether there are any null values:**


In [3]:
df.isnull().sum()

Gender                            0
Age                               0
Height                            0
Weight                            0
family_history_with_overweight    0
FAVC                              0
FCVC                              0
NCP                               0
CAEC                              0
SMOKE                             0
CH2O                              0
SCC                               0
FAF                               0
TUE                               0
CALC                              0
MTRANS                            0
NObeyesdad                        0
dtype: int64

**Check datatypes of the features:**

In [4]:
df.dtypes

Gender                             object
Age                               float64
Height                            float64
Weight                            float64
family_history_with_overweight     object
FAVC                               object
FCVC                              float64
NCP                               float64
CAEC                               object
SMOKE                              object
CH2O                              float64
SCC                                object
FAF                               float64
TUE                               float64
CALC                               object
MTRANS                             object
NObeyesdad                         object
dtype: object
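
The `float64` dtype of survey-coded columns such as `FCVC` and `NCP` hints that they may contain non-integer values. One way to check this is sketched below on a toy frame; the values are made up for illustration and `fractional_share` is a hypothetical name, so the real columns may look different.

```python
import pandas as pd

# Toy stand-in for two survey-coded columns; the values are made up for illustration.
toy = pd.DataFrame({
    "FCVC": [2.0, 3.0, 2.45],
    "NCP": [3.0, 1.0, 3.0],
})

# Share of values per column that are not whole numbers.
fractional_share = (toy % 1 != 0).mean()
print(fractional_share)
```

A share above zero suggests the column needs rounding before it can be treated as categorical.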

**Adjusting the synthetic data: the synthetic data generation turned several categorical features into continuous float values, so they are rounded and mapped back to their categories**

In [5]:
# NCP is a categorical variable stored as numerical values.
# Convert it to an ordered categorical with three categories: "Between 1 and 2", "Three" and "More than three".

df["NCP"] = df["NCP"].apply(lambda x: round(x))

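# Randomly assign the rounded 2s to 1 or 3.
# A seeded generator keeps the result reproducible; the seed value 42 is arbitrary.
rng = np.random.default_rng(42)
mask_2 = df["NCP"] == 2
df.loc[mask_2, "NCP"] = rng.choice([1, 3], size=mask_2.sum())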

df["NCP"] = df["NCP"].astype("category")
df["NCP"] = df["NCP"].cat.rename_categories({1: "Between 1 and 2", 3: "Three", 4: "More than three"})
df["NCP"] = pd.Categorical(
    df["NCP"],
    categories=["Between 1 and 2", "Three", "More than three"],
    ordered=True
)

# FCVC is a categorical variable with numerical values as well
df["FCVC"] = df["FCVC"].apply(lambda x: round(x)).astype("category")
df["FCVC"] = df["FCVC"].cat.rename_categories({1: "Never", 2: "Sometimes", 3: "Always"})
df["FCVC"] = pd.Categorical(
    df["FCVC"],
    categories=["Never", "Sometimes", "Always"],
    ordered=True
)

# CH2O is a categorical variable with numerical values as well

df["CH2O"] = df["CH2O"].apply(lambda x: round(x)).astype("category")
df["CH2O"] = df["CH2O"].cat.rename_categories({1: "Less than a liter", 2: "Between 1 and 2 L", 3: "More than 2 L"})
df["CH2O"] = pd.Categorical(
    df["CH2O"],
    categories=["Less than a liter", "Between 1 and 2 L", "More than 2 L"],
    ordered=True
)

#FAF is a categorical variable with numerical values as well
df["FAF"] = df["FAF"].apply(lambda x: round(x)).astype("category")
df["FAF"] = df["FAF"].cat.rename_categories({0: "I do not", 1: "1 or 2 days", 2: "2 or 4 days", 3: "4 or 5 days"})
df["FAF"] = pd.Categorical(
    df["FAF"],
    categories=["I do not", "1 or 2 days", "2 or 4 days", "4 or 5 days"],
    ordered=True
)

#TUE is a categorical variable with numerical values as well
df["TUE"] = df["TUE"].apply(lambda x: round(x)).astype("category")
df["TUE"] = df["TUE"].cat.rename_categories({0: "0-2 hours", 1: "3-5 hours", 2: "More than 5 hours"})
df["TUE"] = pd.Categorical(
    df["TUE"],
    categories=["0-2 hours", "3-5 hours", "More than 5 hours"],
    ordered=True
)


# Round the numerical features and convert them to int/float

df["Age"] = df["Age"].apply(lambda x: round(x)).astype("int")
df["Height"] = df["Height"].apply(lambda x: round(x, 2)).astype("float")
df["Weight"] = df["Weight"].apply(lambda x: round(x, 2)).astype("float")
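
The round, rename and order steps repeated above for `FCVC`, `CH2O`, `FAF` and `TUE` could be factored into one helper. The sketch below is an assumption about how that might look, not the notebook's own code: `to_ordered_labels` is a hypothetical name and the input values are made up.

```python
import pandas as pd

def to_ordered_labels(values: pd.Series, labels: dict) -> pd.Categorical:
    """Round float codes to integers and map them to ordered labels.

    `labels` maps integer code -> label; its key order defines the category order.
    Codes that are missing from `labels` become NaN.
    """
    codes = values.round().astype(int)
    return pd.Categorical(
        codes.map(labels),
        categories=list(labels.values()),
        ordered=True,
    )

# Toy values standing in for the FCVC column (made up for illustration).
toy_fcvc = pd.Series([2.0, 2.9, 1.2])
fcvc = to_ordered_labels(toy_fcvc, {1: "Never", 2: "Sometimes", 3: "Always"})
print(list(fcvc))
```

Keeping the code-to-label mapping in a single dictionary avoids listing the labels twice, which is where a mismatch between the renamed and the declared categories could otherwise slip in.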

**Converting the remaining variables to categorical and bool datatypes, because they were loaded as objects. First the nominal features, then the ordinal ones, then the booleans; finally the datatypes are printed again as a check:**

In [6]:
#Nominals
df["Gender"] = df["Gender"].astype("category")
df["MTRANS"] = df["MTRANS"].astype("category")

#Ordinals
basic_categories = ['no', 'Sometimes', 'Frequently', 'Always']
overweight_categories = ['Insufficient_Weight', 'Normal_Weight', 'Overweight_Level_I', 'Overweight_Level_II', 'Obesity_Type_I', 'Obesity_Type_II', 'Obesity_Type_III']

df["CAEC"] = pd.Categorical(df["CAEC"], categories=basic_categories, ordered=True)
df["CALC"] = pd.Categorical(df["CALC"], categories=basic_categories, ordered=True)
df["NObeyesdad"] = pd.Categorical(df["NObeyesdad"], categories=overweight_categories, ordered=True)

#Bools
bool_map = {"yes": True, "no": False}
df["family_history_with_overweight"] = df["family_history_with_overweight"].map(bool_map).astype("boolean")
df["FAVC"] = df["FAVC"].map(bool_map).astype("boolean")
df["SMOKE"] = df["SMOKE"].map(bool_map).astype("boolean")
df["SCC"] = df["SCC"].map(bool_map).astype("boolean")

#Check types again
df.dtypes

Gender                            category
Age                                  int64
Height                             float64
Weight                             float64
family_history_with_overweight     boolean
FAVC                               boolean
FCVC                              category
NCP                               category
CAEC                              category
SMOKE                              boolean
CH2O                              category
SCC                                boolean
FAF                               category
TUE                               category
CALC                              category
MTRANS                            category
NObeyesdad                        category
dtype: object
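
Declaring the ordinal features as ordered categoricals is what allows comparisons and min/max on them later. A minimal sketch on a toy series, with made-up values standing in for `CAEC`:

```python
import pandas as pd

basic_categories = ["no", "Sometimes", "Frequently", "Always"]

# Toy stand-in for the CAEC column (values made up for illustration).
caec = pd.Series(pd.Categorical(
    ["Sometimes", "Always", "no", "Frequently"],
    categories=basic_categories,
    ordered=True,
))

# Ordered categoricals can be compared against one of their category labels.
at_least_frequently = caec >= "Frequently"
print(at_least_frequently.tolist())

# min/max follow the declared category order, not alphabetical order.
print(caec.max())
```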

**Let's check it again:**

In [7]:
df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21,1.62,64.0,True,False,Sometimes,Three,Sometimes,False,Between 1 and 2 L,False,I do not,3-5 hours,no,Public_Transportation,Normal_Weight
1,Female,21,1.52,56.0,True,False,Always,Three,Sometimes,True,More than 2 L,True,4 or 5 days,0-2 hours,Sometimes,Public_Transportation,Normal_Weight
2,Male,23,1.8,77.0,True,False,Sometimes,Three,Sometimes,False,Between 1 and 2 L,False,2 or 4 days,3-5 hours,Frequently,Public_Transportation,Normal_Weight
3,Male,27,1.8,87.0,False,False,Always,Three,Sometimes,False,Between 1 and 2 L,False,2 or 4 days,0-2 hours,Frequently,Walking,Overweight_Level_I
4,Male,22,1.78,89.8,False,False,Sometimes,Between 1 and 2,Sometimes,False,Between 1 and 2 L,False,I do not,0-2 hours,Sometimes,Public_Transportation,Overweight_Level_II


**Let's rename some of the columns to make them easier to understand:**

In [8]:
df = df.rename(columns={
    "family_history_with_overweight": "Family_history_overweight",
    "FAVC": "High_caloric_food",
    "FCVC": "Veggie_consumption_freq",
    "NCP": "Main_meals_count",
    "CAEC": "Food_between_meals_freq",
    "SMOKE": "Smokes",
    "CH2O": "Water_consumption",
    "SCC": "Monitors_calories",
    "FAF": "Physical_activity",
    "TUE": "Screen_time",
    "CALC": "Alcohol_consumption_freq",
    "MTRANS": "Transportation_mode",
    "NObeyesdad": "Obesity_level"
})

df.head()

Unnamed: 0,Gender,Age,Height,Weight,Family_history_overweight,High_caloric_food,Veggie_consumption_freq,Main_meals_count,Food_between_meals_freq,Smokes,Water_consumption,Monitors_calories,Physical_activity,Screen_time,Alcohol_consumption_freq,Transportation_mode,Obesity_level
0,Female,21,1.62,64.0,True,False,Sometimes,Three,Sometimes,False,Between 1 and 2 L,False,I do not,3-5 hours,no,Public_Transportation,Normal_Weight
1,Female,21,1.52,56.0,True,False,Always,Three,Sometimes,True,More than 2 L,True,4 or 5 days,0-2 hours,Sometimes,Public_Transportation,Normal_Weight
2,Male,23,1.8,77.0,True,False,Sometimes,Three,Sometimes,False,Between 1 and 2 L,False,2 or 4 days,3-5 hours,Frequently,Public_Transportation,Normal_Weight
3,Male,27,1.8,87.0,False,False,Always,Three,Sometimes,False,Between 1 and 2 L,False,2 or 4 days,0-2 hours,Frequently,Walking,Overweight_Level_I
4,Male,22,1.78,89.8,False,False,Sometimes,Between 1 and 2,Sometimes,False,Between 1 and 2 L,False,I do not,0-2 hours,Sometimes,Public_Transportation,Overweight_Level_II


In [9]:
df.dtypes

Gender                       category
Age                             int64
Height                        float64
Weight                        float64
Family_history_overweight     boolean
High_caloric_food             boolean
Veggie_consumption_freq      category
Main_meals_count             category
Food_between_meals_freq      category
Smokes                        boolean
Water_consumption            category
Monitors_calories             boolean
Physical_activity            category
Screen_time                  category
Alcohol_consumption_freq     category
Transportation_mode          category
Obesity_level                category
dtype: object

Save df as parquet


In [10]:
# df.to_parquet("../assets/ObesityDataSet_cleaned.parquet", index=False)

Add an extra column, BMI: weight (kg) divided by the square of height (m).

In [11]:
df['BMI'] = df['Weight'] / (df['Height']) ** 2
df.dtypes

Gender                       category
Age                             int64
Height                        float64
Weight                        float64
Family_history_overweight     boolean
High_caloric_food             boolean
Veggie_consumption_freq      category
Main_meals_count             category
Food_between_meals_freq      category
Smokes                        boolean
Water_consumption            category
Monitors_calories             boolean
Physical_activity            category
Screen_time                  category
Alcohol_consumption_freq     category
Transportation_mode          category
Obesity_level                category
BMI                           float64
dtype: object

In [12]:
df.head()

Unnamed: 0,Gender,Age,Height,Weight,Family_history_overweight,High_caloric_food,Veggie_consumption_freq,Main_meals_count,Food_between_meals_freq,Smokes,Water_consumption,Monitors_calories,Physical_activity,Screen_time,Alcohol_consumption_freq,Transportation_mode,Obesity_level,BMI
0,Female,21,1.62,64.0,True,False,Sometimes,Three,Sometimes,False,Between 1 and 2 L,False,I do not,3-5 hours,no,Public_Transportation,Normal_Weight,24.386526
1,Female,21,1.52,56.0,True,False,Always,Three,Sometimes,True,More than 2 L,True,4 or 5 days,0-2 hours,Sometimes,Public_Transportation,Normal_Weight,24.238227
2,Male,23,1.8,77.0,True,False,Sometimes,Three,Sometimes,False,Between 1 and 2 L,False,2 or 4 days,3-5 hours,Frequently,Public_Transportation,Normal_Weight,23.765432
3,Male,27,1.8,87.0,False,False,Always,Three,Sometimes,False,Between 1 and 2 L,False,2 or 4 days,0-2 hours,Frequently,Walking,Overweight_Level_I,26.851852
4,Male,22,1.78,89.8,False,False,Sometimes,Between 1 and 2,Sometimes,False,Between 1 and 2 L,False,I do not,0-2 hours,Sometimes,Public_Transportation,Overweight_Level_II,28.342381


In [13]:
df.shape

(2111, 18)
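
As a sanity check on the BMI column, the value for the first row can be recomputed by hand. The sketch uses the weight and height shown in the first row of the output above; the variable names are only illustrative.

```python
# Values taken from the first row shown above: Weight = 64.0 kg, Height = 1.62 m.
weight_kg = 64.0
height_m = 1.62

# BMI = weight / height^2, with height in meters.
bmi = weight_kg / height_m ** 2
print(round(bmi, 6))
```

The result agrees with the 24.386526 displayed for that row. The formula only gives sensible values because height is stored in meters; a height in centimeters would have to be divided by 100 first.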

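Save the df with the BMI column as parquet. This needs the optional `pyarrow` or `fastparquet` dependency; without either, the call fails with the ImportError shown below.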

In [None]:
# df.to_parquet("../assets/ObesityDataSet_BMI.parquet", index=False)

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
A suitable version of pyarrow or fastparquet is required for parquet support.
Trying to import the above resulted in these errors:
 - Missing optional dependency 'pyarrow'. pyarrow is required for parquet support. Use pip or conda to install pyarrow.
 - Missing optional dependency 'fastparquet'. fastparquet is required for parquet support. Use pip or conda to install fastparquet.
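
One way to keep the notebook running when no parquet engine is installed is to fall back to another format. The sketch below is an assumption, not part of the original notebook: `save_frame` is a hypothetical helper, pickle is just one possible fallback, and the toy frame and temporary directory stand in for the real data and path.

```python
import os
import tempfile

import pandas as pd

def save_frame(df: pd.DataFrame, base_path: str) -> str:
    """Try parquet first; fall back to pickle if no parquet engine is installed."""
    try:
        path = base_path + ".parquet"
        df.to_parquet(path, index=False)
    except ImportError:
        # Neither pyarrow nor fastparquet is available.
        path = base_path + ".pkl"
        df.to_pickle(path)
    return path

# Toy frame and temporary directory, used only for illustration.
toy = pd.DataFrame({"Weight": [64.0], "Height": [1.62]})
out_dir = tempfile.mkdtemp()
saved_path = save_frame(toy, os.path.join(out_dir, "toy"))
print(saved_path)
```

Installing `pyarrow` with pip or conda, as the error message suggests, remains the more direct fix.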