## Data: Estimation of obesity levels based on eating habits and physical condition

* This [obesity dataset](https://archive-beta.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition) 
includes data for estimating obesity levels in individuals from Mexico, Peru, and Colombia, based on their eating habits and physical condition.

* The dataset is available from the UC Irvine Machine Learning Repository via this [link](https://archive-beta.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition). The related paper is available 
[here](https://www.sciencedirect.com/science/article/pii/S2352340919306985).

* The variables in the dataset are: 

   * `Gender`: A binary variable with levels **Female** and **Male**.
   * `Age`: A numerical variable in years.
   * `Height`: A numerical variable in meters.
   * `Weight`: A numerical variable in kilograms.
   * `family_history_with_overweight`: A binary variable with levels **Yes** and **No** showing whether a family member suffered/suffers from overweight.
   * `FAVC` : A binary variable with levels **Yes** and **No** showing whether high-calorie food is consumed frequently.
   * `FCVC` : A numerical variable here. Frequency of consumption of vegetables (**Interestingly, it is a categorical variable in the paper**).
   * `NCP` : A numerical variable here. Number of main meals (**Interestingly, it is a categorical variable in the paper**).
   * `CAEC` : An ordinal variable with four levels **No**, **Sometimes**, **Frequently**, and **Always** showing consumption of food between meals.    
   * `SMOKE` : A binary variable with levels **Yes** and **No** showing smoking habit.
   * `CH2O` : A numerical variable here. Consumption of water daily (**Interestingly, it is a categorical variable in the paper**).
   * `SCC` : A binary variable with levels **Yes** and **No** showing calories consumption monitoring.
   * `FAF` : A numerical variable here. Physical activity frequency (**Interestingly, it is a categorical variable in the paper**).
   * `TUE` : A numerical variable here. Time using technology devices (**Interestingly, it is  a categorical variable in the paper**). 
   * `CALC` : An ordinal variable with four levels **No**, **Sometimes**, **Frequently**, and **Always** showing the frequency of alcohol consumption.
   * `MTRANS` : A nominal variable with five levels **Public_Transportation**, **Automobile**, **Walking**, **Motorbike**, 
   and **Bike** showing the type of transportation used.
   * `NObeyesdad`: A categorical variable giving the obesity level (for example **Normal_Weight**, **Overweight_Level_I**, **Obesity_Type_III**). 
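
Whether a column such as `FCVC` or `CH2O` behaves like a categorical or a continuous variable can be checked by testing if all of its values are whole numbers. The sketch below uses a small hypothetical frame; the values and the helper name `is_integer_valued` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample; the values only illustrate the idea
sample = pd.DataFrame({
    "FCVC": [2.0, 3.0, 2.0, 3.0],
    "CH2O": [2.0, 3.0, 1.728139, 2.005130],
})

def is_integer_valued(series: pd.Series) -> bool:
    """Return True if every non-missing value is a whole number."""
    values = series.dropna()
    return bool((values == values.round()).all())

# Map each column name to whether it only holds whole numbers
integer_like = {col: is_integer_valued(sample[col]) for col in sample.columns}
print(integer_like)
```

A column that only holds whole numbers could reasonably be treated as ordinal, while one with fractional values has to be treated as numeric.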
   
* A portion of the data set is shown below:   

In [1]:
# import the data set
import pandas as pd

df = pd.read_csv('ObesityDataSet_raw_and_data_sinthetic.csv')
df

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.000000,1.620000,64.000000,yes,no,2.0,3.0,Sometimes,no,2.000000,no,0.000000,1.000000,no,Public_Transportation,Normal_Weight
1,Female,21.000000,1.520000,56.000000,yes,no,3.0,3.0,Sometimes,yes,3.000000,yes,3.000000,0.000000,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.000000,1.800000,77.000000,yes,no,2.0,3.0,Sometimes,no,2.000000,no,2.000000,1.000000,Frequently,Public_Transportation,Normal_Weight
3,Male,27.000000,1.800000,87.000000,no,no,3.0,3.0,Sometimes,no,2.000000,no,2.000000,0.000000,Frequently,Walking,Overweight_Level_I
4,Male,22.000000,1.780000,89.800000,no,no,2.0,1.0,Sometimes,no,2.000000,no,0.000000,0.000000,Sometimes,Public_Transportation,Overweight_Level_II
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2106,Female,20.976842,1.710730,131.408528,yes,yes,3.0,3.0,Sometimes,no,1.728139,no,1.676269,0.906247,Sometimes,Public_Transportation,Obesity_Type_III
2107,Female,21.982942,1.748584,133.742943,yes,yes,3.0,3.0,Sometimes,no,2.005130,no,1.341390,0.599270,Sometimes,Public_Transportation,Obesity_Type_III
2108,Female,22.524036,1.752206,133.689352,yes,yes,3.0,3.0,Sometimes,no,2.054193,no,1.414209,0.646288,Sometimes,Public_Transportation,Obesity_Type_III
2109,Female,24.361936,1.739450,133.346641,yes,yes,3.0,3.0,Sometimes,no,2.852339,no,1.139107,0.586035,Sometimes,Public_Transportation,Obesity_Type_III


In [2]:
# check the first five lines
df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,Sometimes,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II


We intentionally created some missing values in the variables `Age`, `family_history_with_overweight`, and `CALC` as given below:

In [3]:
import numpy as np

np.random.seed(42)

for col in ['Age','family_history_with_overweight','CALC']:
    
    df.loc[df.sample(frac=np.random.randint(5,15)/100).index, col] = np.nan

df.head()

Unnamed: 0,Gender,Age,Height,Weight,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,NObeyesdad
0,Female,21.0,1.62,64.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,Normal_Weight
1,Female,21.0,1.52,56.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,,Public_Transportation,Normal_Weight
2,Male,23.0,1.8,77.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,Normal_Weight
3,Male,27.0,1.8,87.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,Overweight_Level_I
4,Male,22.0,1.78,89.8,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,Overweight_Level_II
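
For each column, the loop above draws a random fraction between 5% and 14% (`np.random.randint(5, 15)` excludes the upper bound) and blanks that share of rows. A minimal sketch of how the resulting share of missing values could be verified, using a hypothetical toy frame rather than the real data:

```python
import numpy as np
import pandas as pd

np.random.seed(42)

# Hypothetical toy frame standing in for the obesity data
toy = pd.DataFrame({
    "Age": np.arange(100, dtype=float),
    "CALC": ["Sometimes"] * 100,
})

for col in ["Age", "CALC"]:
    frac = np.random.randint(5, 15) / 100   # a value from 0.05 to 0.14
    rows = toy.sample(frac=frac).index      # rows to blank in this column
    toy.loc[rows, col] = np.nan

# Share of missing values per column
missing_share = toy.isna().mean()
print(missing_share)
```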


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2111 entries, 0 to 2110
Data columns (total 17 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   Gender                          2111 non-null   object 
 1   Age                             1879 non-null   float64
 2   Height                          2111 non-null   float64
 3   Weight                          2111 non-null   float64
 4   family_history_with_overweight  2005 non-null   object 
 5   FAVC                            2111 non-null   object 
 6   FCVC                            2111 non-null   float64
 7   NCP                             2111 non-null   float64
 8   CAEC                            2111 non-null   object 
 9   SMOKE                           2111 non-null   object 
 10  CH2O                            2111 non-null   float64
 11  SCC                             2111 non-null   object 
 12  FAF                             21

In [5]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# mlxtend's selector (not scikit-learn's class of the same name) is used below as SFS
from mlxtend.feature_selection import SequentialFeatureSelector as SFS


**Step 1:** To turn this problem into a regression problem, follow the [Centers for Disease Control and Prevention (CDC)](https://www.cdc.gov/healthyweight/assessing/bmi/adult_bmi/index.html#:~:text=With%20the%20metric%20system%2C%20the,to%20obtain%20height%20in%20meters.) and create the BMI variable as follows:


$$
\mathrm{BMI} = \frac{\text{weight (kg)}}{[\text{height (m)}]^2},
$$

and also include the variables `Gender`, `Age`, `family_history_with_overweight`, `FAVC`, `FCVC`, `NCP`, `CAEC`,
`SMOKE`, `CH2O`, `SCC`, `FAF`, `TUE`, `CALC`, and `MTRANS` in your dataset. Show the first 5 lines of your dataset.

In [6]:
# Calculating BMI and create a new column 'BMI' in the dataset.
df['BMI'] = df['Weight'] / (df['Height'] ** 2)

# Selecting the specified columns
selected_columns = ['Gender', 'Age', 'family_history_with_overweight', 'FAVC', 'FCVC', 'NCP',
                    'CAEC', 'SMOKE', 'CH2O', 'SCC', 'FAF', 'TUE', 'CALC', 'MTRANS', 'BMI']

# Creating a new dataset with the selected columns
df = df[selected_columns]

# Showing the first 5 lines of the new dataset
df.head(5)

Unnamed: 0,Gender,Age,family_history_with_overweight,FAVC,FCVC,NCP,CAEC,SMOKE,CH2O,SCC,FAF,TUE,CALC,MTRANS,BMI
0,Female,21.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,0.0,1.0,no,Public_Transportation,24.386526
1,Female,21.0,yes,no,3.0,3.0,Sometimes,yes,3.0,yes,3.0,0.0,,Public_Transportation,24.238227
2,Male,23.0,yes,no,2.0,3.0,Sometimes,no,2.0,no,2.0,1.0,Frequently,Public_Transportation,23.765432
3,Male,27.0,no,no,3.0,3.0,Sometimes,no,2.0,no,2.0,0.0,Frequently,Walking,26.851852
4,Male,22.0,no,no,2.0,1.0,Sometimes,no,2.0,no,0.0,0.0,Sometimes,Public_Transportation,28.342381
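
As a quick sanity check of the formula, the BMI of the first row can be recomputed by hand from its weight (64 kg) and height (1.62 m). The helper name `bmi` is an assumption for illustration.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by squared height in meters."""
    return weight_kg / height_m ** 2

# Row 0 of the dataset: 64 kg and 1.62 m
first_row_bmi = bmi(64.0, 1.62)
print(round(first_row_bmi, 6))
```

The result agrees with the `BMI` value shown for row 0 in the table above.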


**Step 2:** Create a SINGLE **pipeline object** with [Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) such that
 
   - Numeric features are:
 
     - imputed with strategy="mean" and
     - scaled with Z-transformation;

   - Nominal features are:

     - imputed with strategy="most_frequent" and
     - one-hot encoded appropriately;

   - Ordinal features are:

      - imputed with strategy="most_frequent" and
      - ordinal encoded appropriately,
      
   - and all these **transform** steps are finally combined with the [SequentialFeatureSelector](https://rasbt.github.io/mlxtend/user_guide/feature_selection/SequentialFeatureSelector/) algorithm from the [mlxtend](https://rasbt.github.io/mlxtend/) library to select the best subset of features with forward=False (in other words, backward elimination), scoring='r2', and cv=5. Tell us the optimum combination of features for predicting BMI. How did you select this combination?
   
- Note that this question involves use of [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) in the intermediate steps. Answers avoiding use of [ColumnTransformer](https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html) will NOT BE ACCEPTED.  
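
To make the idea of backward elimination concrete, here is a hand-rolled sketch that starts from all features and repeatedly removes the one whose removal gives the highest mean cross-validated $R^2$. It is only an illustration on synthetic data, not mlxtend's implementation; the function name `backward_elimination` and the data are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def backward_elimination(X, y, min_features=1, cv=5):
    """Greedy backward elimination scored by mean cross-validated R^2."""
    remaining = list(range(X.shape[1]))
    model = LinearRegression()
    history = {}
    score = cross_val_score(model, X[:, remaining], y, cv=cv, scoring="r2").mean()
    history[len(remaining)] = (tuple(remaining), score)
    while len(remaining) > min_features:
        # Try removing each feature in turn and keep the best remaining subset
        candidates = []
        for j in remaining:
            subset = [k for k in remaining if k != j]
            s = cross_val_score(model, X[:, subset], y, cv=cv, scoring="r2").mean()
            candidates.append((s, subset))
        score, remaining = max(candidates, key=lambda c: c[0])
        history[len(remaining)] = (tuple(remaining), score)
    return history

# Synthetic data: only columns 0 and 2 carry signal
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

history = backward_elimination(X, y)
for k in sorted(history, reverse=True):
    print(k, history[k][0], round(history[k][1], 4))
```

The dictionary `history` plays a role similar to the metric dictionary of a sequential selector: for each subset size it stores the retained column indices and their cross-validated score, from which the preferred size can be chosen.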

In [7]:
# Defining the features and target variable
X = df.drop('BMI', axis=1)
y = df['BMI']

# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Creating a pipeline with a ColumnTransformer
numeric_features = ['Age', 'NCP', 'FCVC', 'CH2O', 'FAF', 'TUE']
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

nominal_features = ['MTRANS']
nominal_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder())
])

ordinal_features = ['CAEC', 'CALC']
ordinal_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(categories=[['no', 'Sometimes', 'Frequently', 'Always']] * len(ordinal_features)))
])

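# Note: columns that appear in none of the three lists above (the binary
# features) are dropped, because remainder='drop' is the default.
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('nom', nominal_transformer, nominal_features),
        ('ord', ordinal_transformer, ordinal_features)
    ])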

regr = LinearRegression()

sfs = Pipeline([
    ('preprocessor', preprocessor),
    ('sfs', SFS(regr, k_features=(1, 10), forward=False, scoring='r2', cv=5))
])

sfs.fit(X_train, y_train)

In [8]:
metric_dict = sfs.named_steps['sfs'].get_metric_dict()

for k, v in metric_dict.items():
    print(f"\nNumber of Features: {k}")
    print(f"Selected Feature Indices: {v['feature_idx']}")
    
    valid_indices = [idx for idx in v['feature_idx'] if idx < len(X_train.columns)]
    print(f"Selected Feature Names: {X_train.columns[valid_indices]}")
    
    print(f"CV Scores: {v['cv_scores']}")
    print(f"R2 Score: {v['avg_score']}")
    print(f"Confidence Interval Bound: {v['ci_bound']}")
    print(f"Standard Deviation: {v['std_dev']}")
    print(f"Standard Error: {v['std_err']}\n")



Number of Features: 13
Selected Feature Indices: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
Selected Feature Names: Index(['Gender', 'Age', 'family_history_with_overweight', 'FAVC', 'FCVC',
       'NCP', 'CAEC', 'SMOKE', 'CH2O', 'SCC', 'FAF', 'TUE', 'CALC'],
      dtype='object')
CV Scores: [0.3245496  0.31833423 0.22966971 0.33461035 0.300086  ]
R2 Score: 0.3014499773021953
Confidence Interval Bound: 0.04833974894189038
Standard Deviation: 0.03760996693694074
Standard Error: 0.01880498346847037


Number of Features: 12
Selected Feature Indices: (0, 1, 2, 3, 4, 6, 7, 8, 9, 10, 11, 12)
Selected Feature Names: Index(['Gender', 'Age', 'family_history_with_overweight', 'FAVC', 'FCVC',
       'CAEC', 'SMOKE', 'CH2O', 'SCC', 'FAF', 'TUE', 'CALC'],
      dtype='object')
CV Scores: [0.32323728 0.31688041 0.22995179 0.33348769 0.30430407]
R2 Score: 0.30157225028597046
Confidence Interval Bound: 0.04760527871496628
Standard Deviation: 0.0370385241480262
Standard Error: 0.0185192620740131


Numb

In [9]:
best_subset_key = max(metric_dict, key=lambda k: metric_dict[k]['avg_score'])
best_subset_info = metric_dict[best_subset_key]

print("Subset Info")
print("Number of Features:", best_subset_key)
print("Best R2 Score:", best_subset_info['avg_score'])
print("Best Feature Indices:", best_subset_info['feature_names'])

Subset Info
Number of Features: 10
Best R2 Score: 0.30216012062613523
Best Feature Indices: ('0', '1', '2', '3', '4', '6', '8', '10', '11', '12')


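The difference between the 10-feature model ($R^2$ = 0.3022) and the 9-feature model ($R^2$ = 0.3012) is negligible, so we continue with the more parsimonious 9-feature model.

The selected indices refer to columns of the preprocessed matrix produced by the `ColumnTransformer` (six scaled numeric columns, then five one-hot columns for `MTRANS`, then the two ordinal columns), not to positions in `X_train.columns`, so the feature names printed by the loop above are mislabeled. Assuming the encoder's default alphabetical ordering of the `MTRANS` levels, the 9 selected features with indices (0, 1, 2, 3, 4, 6, 10, 11, 12) are: 'Age', 'NCP', 'FCVC', 'CH2O', 'FAF', 'MTRANS_Automobile', 'MTRANS_Walking', 'CAEC', 'CALC'.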

In [10]:
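# Indices of the 9-feature subset reported by the selector. They refer to
# columns of the preprocessed matrix, not to positions in X_train.columns.
selected_idx = [0, 1, 2, 3, 4, 6, 10, 11, 12]

# Output column names of the fitted preprocessor, in ColumnTransformer order
# (requires a scikit-learn version that provides get_feature_names_out)
transformed_names = sfs.named_steps['preprocessor'].get_feature_names_out()

names = pd.DataFrame(transformed_names[selected_idx], columns=['Optimum Features'])
names

Unnamed: 0,Optimum Features
0,num__Age
1,num__NCP
2,num__FCVC
3,num__CH2O
4,num__FAF
5,nom__MTRANS_Automobile
6,nom__MTRANS_Walking
7,ord__CAEC
8,ord__CALC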


In [11]:
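# Note: with a tuple, SFS returns the best-scoring subset whose size lies in the
# given range (here 1 to 9 features); k_features=9 would force exactly nine.
sfs = SFS(regr, k_features=(1, 9), forward=False, scoring='r2', cv=5)
sfs9 = Pipeline([('preprocessor', preprocessor), 
                 ('sequentialfeatureselector', sfs)])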



**Step 3:** Report $R^2$ of the **best model (i.e. the model with optimum features)** found in **Step 2** on train and test data, respectively. (Note here that the number of features in train and test datasets should be less than the original number of features due to feature selection).

In [12]:
# Fitting the pipeline for training data
sfs9.fit(X_train, y_train)

# Transforming both training and test data with the feature selector
X_train_selected = sfs9.transform(X_train)
X_test_selected = sfs9.transform(X_test)

regr.fit(X_train_selected, y_train)

# Predicting on the training set
y_train_pred9 = regr.predict(X_train_selected)

# Calculating R2 score for training set
r2_train9 = r2_score(y_train, y_train_pred9)

# Predicting on the test set
y_test_pred9 = regr.predict(X_test_selected)

# Calculating R2 score for test set
r2_test9 = r2_score(y_test, y_test_pred9)

# Displaying the R^2 scores
print(f"R2 score on training set: {r2_train9}")
print(f"R2 score on test set: {r2_test9}")

R2 score on training set: 0.31657158686913567
R2 score on test set: 0.22675806283364242
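
For reference, `r2_score` computes $R^2 = 1 - SS_{res} / SS_{tot}$, where $SS_{res}$ is the sum of squared prediction errors and $SS_{tot}$ is the sum of squared deviations of the observed values from their mean. The sketch below checks this on made-up numbers; both arrays are hypothetical and not taken from the model above.

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([24.4, 24.2, 23.8, 26.9, 28.3])   # hypothetical BMI values
y_pred = np.array([25.0, 24.0, 24.5, 26.0, 27.5])   # hypothetical predictions

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

Because $SS_{tot}$ is computed on the data being scored, $R^2$ on a test set can even be negative when the predictions are worse than the test-set mean.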


**Step 4:** Repeat **Step 2** when the final estimator is only linear regression (i.e., without feature selection). 

In [13]:
X_train_WFS, X_test_WFS, y_train_WFS, y_test_WFS = train_test_split(X, y, test_size=0.2, random_state=42)

# Defining the pipeline without feature selection 
pipeline_ = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', LinearRegression())
])

**Step 5:** Report the $R^2$ of the linear regression model without any feature selection on train and test data, respectively.

In [14]:
# Fitting the pipeline on the training data
pipeline_.fit(X_train_WFS, y_train_WFS)

# Predicting on the training set
y_train_pred = pipeline_.predict(X_train_WFS)

# Predicting on the test set
y_test_pred = pipeline_.predict(X_test_WFS)


# Calculating R^2 scores
r2_train = r2_score(y_train_WFS, y_train_pred)
r2_test = r2_score(y_test_WFS, y_test_pred)

# Displaying the R^2 scores
print(f"R2 score on training set: {r2_train}")
print(f"R2 score on test set: {r2_test}")

R2 score on training set: 0.318286120484243
R2 score on test set: 0.22089502614551104


**Step 6:** Compare these results in **Step 5** with the ones in **Step 3** and comment on it.

The $R^2$ scores in **Step 5** and **Step 3** are very similar, indicating that feature selection does not substantially change the performance of the model in this case. There are still slight differences, which can be interpreted as follows: 

The test $R^2$ is slightly higher with feature selection (0.2268) than without it (0.2209), so the model generalizes marginally better with the selected features: it keeps the important information while avoiding unnecessary complexity. The training $R^2$ is slightly lower with feature selection (0.3166 versus 0.3183), which is expected, because a model with fewer features fits the training set less closely and is somewhat more resistant to overfitting.

In both models the training $R^2$ exceeds the test $R^2$ by roughly 0.09 to 0.10, so some loss in generalization is present, but the main problem is underfitting: a training $R^2$ of about 0.32 is low and means that the model explains only a small proportion of the variance in our target.
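
The comparison can be made explicit by computing the train-test gap for both models from the scores reported in **Step 3** and **Step 5**. The dictionary layout below is just one convenient way to organize the numbers.

```python
# R^2 values copied from the outputs of Step 3 and Step 5
scores = {
    "with feature selection": {"train": 0.31657158686913567, "test": 0.22675806283364242},
    "without feature selection": {"train": 0.318286120484243, "test": 0.22089502614551104},
}

# Difference between training and test R^2 for each model
gaps = {name: s["train"] - s["test"] for name, s in scores.items()}

for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:.4f}")
```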