<a href="https://colab.research.google.com/github/roklp/MLP34/blob/main/notebook87a4c88bd9.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 style = "font-size:300%; text-align:center;color:#0000FF; letter-spacing: 2px;padding: 10px;border-bottom: 5px solid #407A68"> Obesity Risk Prediction (Multi-Class) </h1>

In this Kaggle project, we explored the task of predicting obesity risk as a multi-class classification problem.
This report outlines the key steps taken in feature engineering, ensemble modeling, and encoding techniques to achieve accurate predictions.
# Feature Engineering
In this section, various transformations are applied to preprocess the data and derive new features that could potentially enhance the predictive performance of machine learning models.
**Age and Height Rounding**  <br>
To give the models integer-valued inputs, the 'Age' and 'Height' columns are rescaled. <br>The 'Age' values are multiplied by 100 and cast to uint16, which keeps two decimal places of the original value and truncates the rest (the cast does not round to the nearest integer). The 'Height' values are treated the same way, so a height in metres becomes a height in whole centimetres.
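
As a small illustration of the cast, the sketch below applies the multiply-by-100 and uint16 conversion to made-up values and compares it with explicit rounding. The numbers are hypothetical and are not taken from the competition data.

```python
import numpy as np
import pandas as pd

# Toy values (hypothetical, not from the competition data)
df = pd.DataFrame({"Age": [21.688, 24.443011], "Height": [1.699998, 1.75]})

# Same transformation as described above: scale by 100, then cast to uint16.
# The cast drops the fractional part instead of rounding it.
scaled = (df[["Age", "Height"]] * 100).astype(np.uint16)

# For comparison: round explicitly before casting
rounded = (df[["Age", "Height"]] * 100).round().astype(np.uint16)

print(scaled)
print(rounded)
```

The two results differ whenever the scaled value has a fractional part of 0.5 or more, for example an age of 21.688.
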
**Feature Extraction** <br>
New features are extracted to provide additional insights into the dataset. Specifically, the 'BMI' (Body Mass Index) feature is computed by dividing the weight by the square of the height. <br>This metric is commonly used to assess an individual's body composition. Additionally, a 'PseudoTarget' feature is created based on the BMI values. The BMI values are segmented into predefined bins, and each observation is assigned a categorical label corresponding to its BMI category.
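
A minimal worked example of the two derived features, using the same bin edges as the notebook's `extract_features` function but made-up weights and heights (the `people` frame is hypothetical):

```python
import pandas as pd

# Hypothetical people: weight in kg, height in metres
people = pd.DataFrame({
    "Weight": [50.0, 65.0, 80.0, 120.0],
    "Height": [1.70, 1.70, 1.75, 1.70],
})

# BMI = weight / height^2
people["BMI"] = people["Weight"] / people["Height"] ** 2

# Bucket BMI into six ordered bins; each interval is closed on the right
people["PseudoTarget"] = pd.cut(
    people["BMI"],
    bins=[0, 18.4, 24.9, 29, 34.9, 39.9, 100],
    labels=[0, 1, 2, 3, 4, 5],
)

print(people["BMI"].round(2).tolist())
print(people["PseudoTarget"].tolist())
```

Note that BMI only makes sense here while 'Height' is still in metres, so the order of the preprocessing steps matters.
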
**Column Rounding** <br>
Certain columns in the dataset have their values rounded to integers for simplification and standardization. This operation is performed on columns such as 'FCVC', 'NCP', 'CH2O', 'FAF', and 'TUE'. <br>By rounding these numerical values, the model can focus on broader patterns and trends within the data.
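
The effect of this step can be seen on a few made-up 'NCP' values. The sketch uses `Series.round`, which follows NumPy's round-half-to-even rule, so a value of exactly 2.5 becomes 2 rather than 3.

```python
import pandas as pd

# Hypothetical float answers for "number of main meals"
ncp = pd.Series([1.4, 2.5, 2.6, 3.5], name="NCP")

# Round to the nearest integer (ties go to the even neighbour), then cast
ncp_int = ncp.round().astype("int")

print(ncp_int.tolist())
```

Since the synthetic data rarely lands exactly on a half, the tie rule should seldom matter in practice.
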
**Feature Dropping** <br>
A custom transformer, known as FeatureDropper, is employed to remove specified columns from the dataset. This allows for the elimination of redundant or irrelevant features that may hinder model performance. <br>The FeatureDropper class is initialized with a list of columns to be dropped, and during the transform process, these columns are excluded from the dataset.
# Encoding
In this code, we primarily used OneHotEncoder and MEstimateEncoder.
Looking at the train data, we have a mixture of numerical and categorical columns.
Furthermore, the dependent variable we need to predict is categorical, not numerical.
Therefore, we encoded the ['Gender', 'family_history_with_overweight', 'FAVC', 'CAEC', 'SMOKE', 'SCC', 'CALC', 'MTRANS'] feature columns,
and mapped the 'NObeyesdad' target column of the train data to integer class labels (the test data has no target column).
MEstimateEncoder is a target encoder for categorical variables: it replaces each category with an estimate of the target mean within that category. Rather than using the raw per-category mean, it shrinks that mean toward the overall target mean by a pre-specified amount `m`, so a category seen only a few times is pulled strongly toward the global mean, while a frequent category keeps a value close to its own mean. This is useful when some categories have few samples or behave unusually.
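
The smoothing can be reproduced by hand on a toy column. The sketch below assumes the m-estimate form `(sum of target in category + prior * m) / (count in category + m)`, where `prior` is the overall target mean. This is how the category_encoders documentation describes MEstimateEncoder, but the data, the column names, and `m = 1.0` are illustrative only.

```python
import pandas as pd

# Toy data (hypothetical): one categorical column and a numeric target
df = pd.DataFrame({"cat": ["A", "A", "A", "B"], "y": [1, 1, 0, 0]})

m = 1.0                  # smoothing strength (assumed value for this sketch)
prior = df["y"].mean()   # overall target mean, here 0.5

# Per-category target sum and row count
stats = df.groupby("cat")["y"].agg(["sum", "count"])

# m-estimate: category mean shrunk toward the prior
encoding = (stats["sum"] + prior * m) / (stats["count"] + m)

df["cat_encoded"] = df["cat"].map(encoding)
print(encoding)
```

Category `B` has a raw mean of 0 from a single row, but its encoded value is pulled halfway toward the prior. Because the notebook maps the seven classes to the integers 0 to 6 before fitting, the encoder there presumably smooths the mean of that integer label rather than a class probability.
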
# Ensemble
We employ an ensemble model consisting of Random Forest Model, LGBM Model, XGB Model, and CatBoost Model. For each model, we utilize Optuna to find the optimal hyperparameters, followed by cross-validation. Additionally, we assign weights to each model and create the final ensemble model. Finally, we compute the accuracy.
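
The weighting step can be sketched as a weighted average of each model's class probabilities, followed by an argmax. The probabilities, the model names `a` and `b`, and the weights below are made up for illustration and are not the weights used in this notebook.

```python
import numpy as np

# Hypothetical class probabilities from two models (2 samples, 3 classes)
proba_a = np.array([[0.6, 0.3, 0.1],
                    [0.2, 0.5, 0.3]])
proba_b = np.array([[0.3, 0.6, 0.1],
                    [0.1, 0.2, 0.7]])

# Hypothetical ensemble weights; they sum to 1 so rows stay valid distributions
weights = {"a": 0.7, "b": 0.3}

# Weighted average of the probability matrices
blended = weights["a"] * proba_a + weights["b"] * proba_b

# Final prediction: the class with the highest blended probability
pred = blended.argmax(axis=1)
print(blended)
print(pred)
```

For the second sample the two models disagree, and the blended probabilities decide between them.
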

# List of libraries used in statistical analysis
- matplotlib
- numpy
- pandas
- scipy
- seaborn
# List of libraries used in the final model
- os
- tensorflow
- random
- warnings
- numpy
- pandas
- matplotlib
- seaborn
- sklearn(make_pipeline, Pipeline, StandardScaler, MinMaxScaler, StratifiedGroupKFold, RandomForestClassifier, RidgeClassifier, LogisticRegression, set_config, FunctionTransformer, StratifiedKFold, ColumnTransformer, make_column_transformer, clone, BaseEstimator, TransformerMixin, accuracy_score, confusion_matrix, ConfusionMatrixDisplay, PCA, KMeans)
- category_encoders(OneHotEncoder, CatBoostEncoder, MEstimateEncoder)
- xgboost(XGBClassifier)
- catboost(CatBoostClassifier)
- lightgbm(LGBMClassifier)
- optuna
- prettytable(PrettyTable)

**The original work is available at https://www.kaggle.com/code/ksevta/ps4e2-xgb-lgbm-0-92**

**Purpose**: To learn from high scoring notebooks.

The following modifications were made to the notebook to improve the score:

- Random seeds fixed.
- `n_splits` changed to 9.

In [None]:
!pip install tensorflow

In [None]:
import os
import tensorflow as tf  # TensorFlow; imported here so that its random seed can be fixed for reproducibility
import random as rn  # Python's built-in random number generator
os.listdir('/kaggle/input/playground-series-s4e2/')  # List the files available in the competition input directory
os.environ['PYTHONHASHSEED'] = '51'  # Fix the hash seed; it only affects Python processes started after this point
rn.seed(89)  # Seed Python's random module so random draws are repeatable
tf.random.set_seed(40)  # Seed TensorFlow's global random generator (used, for example, by dropout)


# Introduction
<div style="font-size:120%">
    <b>Goal:</b> We have to predict obesity risk in individuals.<br><br>
    <b>Dataset Description:</b>
</div>

| Column | Full Form | Description|
|---|---|---|
| 'id'| id | Unique for each person(row)|
|'Gender'| Gender| person's Gender|
| 'Age' | Age| Dtype is float. Age is between 14 years to 61 years |
|'Height'| Height | Height is in meter it's between 1.45m to 1.98m|
| 'Weight' | Weight| Weight is between 39 to 165. I think it's in KG.|
|'family_history_with_overweight'| family history <br> with overweight| yes or no question|
| 'FAVC'| Frequent consumption <br> of high calorie food| it's yes or no question. i think question they asked is <br>do you consume high calorie food|
|'FCVC'|  Frequency of <br>consumption of vegetables| FCVC takes values between 1 & 3. It's given as float in our data, <br>probably because the data is synthetic|
|'NCP'| Number of main meals| dtype is float, NCP is between 1 & 4. I think it should be 1,2,3,4 <br>but our data is synthetic so it's taking float values|
|'CAEC'| Consumption of <br>food between meals| takes 4 values `Sometimes`, `Frequently`, `no` & `Always` <br>|
| 'SMOKE'| Smoke | yes or no question. i think the question is "Do you smoke?" |
|'CH2O'| Consumption of <br>water daily| CH2O takes values between 1 & 3. again it's given as <br>float may be because of synthetic data. it's values should be 1,2 or 3|
|'SCC'|  Calories consumption <br>monitoring| yes or no question|
|'FAF'| Physical activity <br>frequency| FAF is between 0 to 3, 0 means no physical activity<br> and 3 means high workout. and again, in our data it's given as float|
|'TUE'| Time using <br>technology devices| TUE is between 0 to 2. I think question will be "How long you have <br>been using technology devices to track your health." in our data it's given as float |
|'CALC'| Consumption of alcohol | Takes 3 values: `Sometimes`, `no`, `Frequently`|
| 'MTRANS' | Transportation used| MTRANS takes 5 values `Public_Transportation`, `Automobile`, <br>`Walking`, `Motorbike`, & `Bike`|
|'NObeyesdad'| TARGET | This is our target. It takes 7 values, and in this competition we have to submit <br>the class name (not the probability, which is the case in most competitions)|


<div style="font-size:120%">
    <b>NObeyesdad (Target Variable), by BMI range:</b>
</div>

* Insufficient_Weight : Less than 18.5
* Normal_Weight       : 18.5 to 24.9
* Obesity_Type_I      : 30.0 to 34.9
* Obesity_Type_II     : 35.0 to 39.9
* Obesity_Type_III   : Higher than 40
* Overweight_Level_I, Overweight_Level_II : 25 to 29



# Import Libraries

In [None]:
import warnings
warnings.filterwarnings("ignore")

import numpy as np, pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from category_encoders import OneHotEncoder, CatBoostEncoder, MEstimateEncoder
from sklearn.model_selection import StratifiedGroupKFold


from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import RidgeClassifier, LogisticRegression

from sklearn import set_config
import os
from sklearn.preprocessing import FunctionTransformer
from sklearn.model_selection import StratifiedKFold
import optuna
from sklearn.compose import ColumnTransformer
from prettytable import PrettyTable

from sklearn.compose import make_column_transformer
from sklearn.base import clone
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Parameters

In [None]:
# Set Parameters for Reproducibility
pd.set_option("display.max_rows", 100)
FILE_PATH = "/kaggle/input/playground-series-s4e2/"
TARGET = "NObeyesdad"
n_splits = 9  # Used to evaluate the performance of the model by dividing the data into multiple parts and using each part for training and validation sequentially
RANDOM_SEED = 73  # Set the random seed to 73, ensuring reproducibility by obtaining the same result every time random operations are performed


# Load Data

In [None]:
# load all data
train = pd.read_csv(os.path.join(FILE_PATH, "train.csv"))
test = pd.read_csv(os.path.join(FILE_PATH, "test.csv"))
sample_sub = pd.read_csv(os.path.join(FILE_PATH, "sample_submission.csv"))
train_org = pd.read_csv("/kaggle/input/obesity-or-cvd-risk-classifyregressorcluster/ObesityDataSet.csv")

# Explore Data

In [None]:
def prettify_df(df):
    table = PrettyTable()
    table.field_names = df.columns

    for row in df.values:
        table.add_row(row)
    print(table)
# Define a function to prettify the DataFrame (tabular data)
# Use the PrettyTable library to display the contents of the DataFrame in a visually appealing tabular format


In [None]:
train.head(10)

In [None]:
# Train Data
print("Train Data")
print(f"Total number of rows: {len(train)}")
print(f"Total number of columns: {train.shape[1]}\n")

# Test Data
print("Test Data")
print(f"Total number of rows: {len(test)}")
print(f"Total number of columns: {test.shape[1]}")
# Print the number of rows and columns in the Train and Test Data


In [None]:
# check null and unique count
# FHWO: family_history_with_overweight
train_copy = train.rename(columns={"family_history_with_overweight":"FHWO"}) # Renaming column 'family_history_with_overweight' to 'FHWO'
tmp = pd.DataFrame(index=train_copy.columns) # Creating a new DataFrame for columns
tmp['count'] = train_copy.count() # Calculating the non-null count
tmp['dtype'] = train_copy.dtypes # Finding the data types of columns
tmp['nunique'] = train_copy.nunique() # Calculating the number of unique values for each column
tmp['%nunique'] = (tmp['nunique']/len(train_copy))*100 # Calculating the percentage of unique values for each column relative to the total number of rows
tmp['%null'] = (train_copy.isnull().sum()/len(train_copy))*100 # Calculating the percentage of null values for each column
tmp['min'] = train_copy.min() # Finding the minimum value for each column
tmp['max'] = train_copy.max() # Finding the maximum value for each column
tmp

tmp.reset_index(inplace=True) # Resetting the index of the resulting DataFrame
tmp = tmp.rename(columns = {"index":"Column Name"}) # Renaming the 'index' column to 'Column Name'
tmp = tmp.round(3) # Rounding the values in the DataFrame to 3 decimal places
prettify_df(tmp) # Printing the prettified DataFrame
del tmp, train_copy # Deleting temporary variables to free up memory
# Explains the process of examining important information about each column (feature) in the training dataset and pretty printing the results.
# Specifically, it calculates and displays the percentage of null values, number and percentage of unique values, data type, minimum, and maximum values for each column.


In [None]:
# Target Distribution with Gender
# Analyzing the number and proportion of data points by gender
# Checking the proportion of each target class in the entire dataset
pd.set_option('display.float_format', '{:.2f}'.format) # Setting Pandas to display numbers up to two decimal places
tmp = pd.DataFrame(train.groupby([TARGET,'Gender'])["id"].agg('count')) # Grouping by variables with the count of the 'id' column
tmp.columns = ['Count']
train[TARGET].value_counts() # Calculating the overall distribution of values for the TARGET variable
tmp = pd.merge(tmp,train[TARGET].value_counts(),left_index=True, right_index=True)
tmp.columns = ['gender_count','target_class_count']
tmp['%gender_count'] = tmp['gender_count']/tmp['target_class_count']
tmp["%target_class_count"] = tmp['target_class_count']/len(train)
tmp = tmp[['gender_count','%gender_count','target_class_count','%target_class_count']]
print("Target Distribution with Gender")
tmp


In [None]:
raw_num_cols = list(train.select_dtypes("float").columns)
raw_cat_cols = list(train.columns.drop(raw_num_cols+[TARGET]))

full_form = dict({'FAVC' : "Frequent consumption of high caloric food",
                  'FCVC' : "Frequency of consumption of vegetables",
                  'NCP' :"Number of main meal",
                  'CAEC': "Consumption of food between meals",
                  'CH2O': "Consumption of water daily",
                  'SCC':  "Calories consumption monitoring",
                  'FAF': "Physical activity frequency",
                  'TUE': "Time using technology devices",
                  'CALC': "Consumption of alcohol" ,
                  'MTRANS' : "Transportation used"})

# Dividing the columns of the dataset into numerical and categorical, and providing full descriptions for some column abbreviations


**From Above Table, We Can See**
* Nearly all people in `Obesity_Type_II` are **Male**, and nearly all in `Obesity_Type_III` are **Female**
* `Overweight_Level_II` is about `70%` **Male**, and `Insufficient_Weight` is more than `60%` **Female**
* From these points we can say that `Gender` is an important feature for obesity prediction


# Data Visualization
In this Section we will see:
* Individual Numerical Plots
* Individual Categorical Plots
* Numerical Correlation Plot
* Combined Numerical Plots

### Target Distribution with Gender


In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

fig, axs = plt.subplots(1,2,figsize = (12,5))
plt.suptitle("Target Distribution")

sns.histplot(binwidth=0.5, x=TARGET, data=train, hue='Gender', palette="Blues", ax=axs[0], discrete=True)

axs[0].tick_params(axis='x', rotation=60)

axs[1].pie(
    train[TARGET].value_counts(),
    shadow=True,
    explode=[.1 for i in range(train[TARGET].nunique())],
    labels=train[TARGET].value_counts().index,
    autopct='%1.f%%',
    colors=sns.color_palette("Blues", n_colors=len(train[TARGET].value_counts()))
)

plt.tight_layout()
plt.show()

# Generate two graphs to display the distribution of the target variable (with respect to gender): a histogram and a pie chart.

<a id = "section_1"> </a>
# Individual Numerical Plots

In [None]:
fig,axs = plt.subplots(len(raw_num_cols),1,figsize=(12,len(raw_num_cols)*2.5),sharex=False)
for i, col in enumerate(raw_num_cols):
    sns.violinplot(x=TARGET, y=col,hue="Gender", data=train,ax = axs[i], split=False)
    if col in full_form.keys():
        axs[i].set_ylabel(full_form[col])

plt.tight_layout()
plt.show()
# Create violin plots for each numerical variable (raw_num_cols) against the target variable (TARGET), with gender (Gender) as hue.
# Each plot shows the distribution of the variable per target class, with a separate violin for each gender.






Insights from the plots above:
* We should ignore the **Female** distribution in the `Obesity_Type_II` class and the **Male** distribution in the `Obesity_Type_III` class, because the sample sizes are very small
* People in the `Insufficient_Weight` category report a higher `Number of main meals`, perhaps because they are trying to gain weight
* `Frequency of consumption of vegetables` is **three** for everyone in the `Obesity_Type_III` class
* `Weight`, `Height` & `Gender` look like the most important features. `Weight` separates the classes very clearly



<a id = "section_2"> </a>
# Individual Categorical Plots

In [None]:
_,axs = plt.subplots(int(len(raw_cat_cols)-1),2,figsize=(12,len(raw_cat_cols)*3),width_ratios=[1, 4])
for i,col in enumerate(raw_cat_cols[1:]):
    sns.countplot(y=col,data=train,palette="bright",ax=axs[i,0])
    sns.countplot(x=col,data=train,hue=TARGET,palette="bright",ax=axs[i,1])
    if col in full_form.keys():
        axs[i,0].set_ylabel(full_form[col])


plt.tight_layout()
plt.show()
# Generate two types of count plots for each categorical variable
# The plot on the left shows the distribution of the variable
# The plot on the right shows the distribution of the variable based on the target variable (TARGET)


<a id = "section_3"></a>
# Numerical Correlation Plot

In [None]:
tmp = train[raw_num_cols].corr("pearson")
sns.heatmap(tmp,annot=True,cmap ="Blues")
# Compute the Pearson correlation coefficient among numerical variables (raw_num_cols) in the train dataset and visualize the result as a heatmap
# Pearson correlation coefficient:
# - Helps to understand the relationship between variables
# - Assists in selecting important variables for machine learning modeling
# - Enables early detection and resolution of multicollinearity issues in machine learning


* `Height` has a positive correlation with `Weight` and `FAF` (physical activity frequency). We look at combined plots next
* People with higher `Weight` tend to drink more water (`CH2O`).


<a id = "section_4"></a>
# Combined Numerical Plots

In [None]:
sns.jointplot(data=train, x="Height", y="Weight", hue=TARGET, height=6, palette="Blues")

#Creating a joint plot to visualize the relationship between "Height" and "Weight" using the train dataset.
#The TARGET variable is used to distinguish the color of each data point.

In [None]:
sns.jointplot(data=train, x="Age", y="Height", hue=TARGET,height=6, palette="Blues")
#Creating a joint plot to visualize the relationship between "Height" and "Age" using the train dataset.
#The TARGET variable is used to distinguish the color of each data point.

# Principal Component Analysis (PCA) & KMeans
These plots are inspired by [This](https://www.kaggle.com/competitions/playground-series-s4e2/discussion/472471) discussion.

In [None]:
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# PCA
# Enhances computational efficiency and reduces the risk of overfitting.
# Allows reduction of data complexity while retaining important information.
# Visualization: Makes it easier to understand patterns or structures in the data and facilitates clearer interpretation of relationships between variables.
pca = PCA(n_components=2)  # Reduce to 2 principal components
pca_top_2 = pca.fit_transform(train[raw_num_cols])
# Apply PCA to the train dataset containing only numerical variables to transform each data point into a new form represented by 2 principal components

tmp = pd.DataFrame(data=pca_top_2, columns=['pca_1', 'pca_2'])
# Results are stored in pca_top_2, which is then used to create the tmp DataFrame
tmp['TARGET'] = train[TARGET]
# This DataFrame includes the transformed principal component values (pca_1, pca_2) and the target variable (TARGET)

fig, axs = plt.subplots(2, 1, figsize=(12, 6))
sns.scatterplot(data=tmp, y="pca_1", x="pca_2", hue='TARGET', ax=axs[0])
axs[0].set_title("Top 2 Principal Components")
# Set the title "Top 2 Principal Components" for the first subplot

# KMeans
# Reveals characteristics of each group.
# Enables identification of hidden insights in the data.
# By applying K-means clustering to data dimensionally reduced by PCA, the performance of the clustering algorithm can be further improved.
kmeans = KMeans(7, random_state=RANDOM_SEED)  # Create a K-means clustering model with 7 clusters
kmeans.fit(tmp[['pca_1', 'pca_2']])  # Apply K-means clustering to the PCA-transformed data
sns.scatterplot(y=tmp['pca_1'], x=tmp['pca_2'], c=kmeans.labels_, cmap='viridis', marker='o', edgecolor='k', s=50, alpha=0.8, ax=axs[1])
# Plot the clustering results in the second subplot (axs[1]) as a scatter plot
axs[1].set_title("KMeans Clustering on First 2 Principal Components")
plt.tight_layout()
plt.show()


# Feature Engineering & Processing

In [None]:
# In the age_rounder and height_rounder functions, values are multiplied by 100 and cast to integers, which sometimes improves the model's CV score.
# In the extract_features function, we combine features to generate new features, transforming the data to potentially improve the model's performance.

def age_rounder(x):
    x_copy = x.copy()
    x_copy['Age'] = (x_copy['Age'] * 100).astype(np.uint16)  # Multiply the values in the 'Age' column by 100 and convert to integers to adjust scaling for better modeling.
    return x_copy

def height_rounder(x):
    x_copy = x.copy()
    x_copy['Height'] = (x_copy['Height'] * 100).astype(np.uint16)  # Multiply the values in the 'Height' column by 100 and convert to integers.
    return x_copy

def extract_features(x):
    x_copy = x.copy()
    x_copy['BMI'] = (x_copy['Weight'] / x_copy['Height'] ** 2)  # Create a new feature 'BMI'. This new feature helps the model better understand the data.
    x_copy['PseudoTarget'] = pd.cut(x_copy['BMI'], bins=[0, 18.4, 24.9, 29, 34.9, 39.9, 100], labels=[0, 1, 2, 3, 4, 5])
    return x_copy

def col_rounder(x):  # Round the values in specific columns ('FCVC', 'NCP', 'CH2O', 'FAF', 'TUE') and convert them to integers to simplify for the model.
    x_copy = x.copy()
    cols_to_round = ['FCVC', "NCP", "CH2O", "FAF", "TUE"]
    for col in cols_to_round:
        x_copy[col] = round(x_copy[col])
        x_copy[col] = x_copy[col].astype('int')
    return x_copy

# Each function operates on a copy of the original DataFrame and returns the transformed DataFrame.
# These preprocessing steps clarify the features of the data and reduce unnecessary variability to potentially enhance model performance.

AgeRounder = FunctionTransformer(age_rounder)
HeightRounder = FunctionTransformer(height_rounder)
ExtractFeatures = FunctionTransformer(extract_features)
ColumnRounder = FunctionTransformer(col_rounder)
# Use FunctionTransformer to make each preprocessing function compatible with Scikit-learn pipelines.
# This allows for more efficient management of data preprocessing steps and easy integration into the model training pipeline.


In [None]:
# The FeatureDropper class, which inherits from Scikit-learn's transformers, is defined to remove specific columns from a DataFrame.
# This is important when passing different sets of features to different models.
# It provides flexibility to remove unnecessary features during data preprocessing, potentially positively impacting model performance.

from sklearn.base import BaseEstimator, TransformerMixin

class FeatureDropper(BaseEstimator, TransformerMixin):
    # FeatureDropper is a tool for removing unwanted columns (i.e., features or variables) from the data.
    # It follows the conventions of the Scikit-learn library provided in Python.

    def __init__(self, cols):  # In this part, FeatureDropper decides which columns to remove.
        self.cols = cols  # When instantiated, it remembers the names of columns to remove.

    def fit(self, x, y=None):  # Nothing is learned here; fit exists only to satisfy the Scikit-learn transformer interface.
        return self

    def transform(self, x):  # It actually removes the unwanted columns from the data.
        return x.drop(self.cols, axis=1)

# FeatureDropper is used to remove specific columns from the data when desired.
# It facilitates easy experimentation with different data for different models.


### Next we will define `cross_val_model`, which will be used to train and validate all the models in this notebook
The `cross_val_model` function returns three things: **val_scores**, **valid_predictions**, **test_predictions**
* <b>val_scores:</b> The accuracy score on the validation data of each fold.
* <b>valid_predictions:</b> An array holding the model's class-probability predictions for each row when that row was in the validation fold.
* <b>test_predictions:</b> The class-probability predictions on the test data, averaged over the number of splits.
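
The three outputs can be illustrated on synthetic data. This is a minimal sketch, not the notebook's function: the dataset comes from `make_classification`, the model is a plain `LogisticRegression`, and the names `oof`, `test_pred`, and `scores` are chosen here for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Synthetic 3-class problem standing in for the real train/test data
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, y_train, X_test = X[:240], y[:240], X[240:]

cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
base = LogisticRegression(max_iter=1000)

oof = np.zeros((len(X_train), 3))        # out-of-fold probabilities
test_pred = np.zeros((len(X_test), 3))   # fold-averaged test probabilities
scores = []                              # validation accuracy per fold

for train_idx, valid_idx in cv.split(X_train, y_train):
    model = clone(base)  # fresh, unfitted copy for every fold
    model.fit(X_train[train_idx], y_train[train_idx])

    # Each training row is predicted exactly once, while held out
    oof[valid_idx] = model.predict_proba(X_train[valid_idx])

    # Every fold model contributes an equal share to the test prediction
    test_pred += model.predict_proba(X_test) / cv.get_n_splits()

    scores.append(accuracy_score(y_train[valid_idx],
                                 model.predict(X_train[valid_idx])))

print(f"Mean validation accuracy: {np.mean(scores):.3f}")
```

Because every row of `oof` comes from a model that never saw that row, these predictions can later be used to choose ensemble weights without leaking the training labels.
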


In [None]:
# In cross_val_model we cross-validate models using Stratified K-Fold.

# Encoding target values with integers
target_mapping = {
                  'Insufficient_Weight': 0,
                  'Normal_Weight': 1,
                  'Overweight_Level_I': 2,
                  'Overweight_Level_II': 3,
                  'Obesity_Type_I': 4,
                  'Obesity_Type_II': 5 ,
                  'Obesity_Type_III': 6
                  }

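# Define the cross-validation splitter. Here we are using StratifiedKFold.
# NOTE: `skf` is not defined anywhere else in this notebook; an unshuffled splitter with n_splits folds is assumed.
skf = StratifiedKFold(n_splits=n_splits)
def cross_val_model(estimators, cv=skf, verbose=True):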
    '''
        estimators : pipeline consisting preprocessing, encoder & model
        cv : Method for cross-validation (default: StratifiedKFold)
        verbose : print train/valid score (yes/no)
    '''
    # Data preparation
    X = train.copy()  # Create a copy of the training data
    y = X.pop(TARGET)  # Separate the target variable from X

    # Initialize arrays to store predictions
    y = y.map(target_mapping)  # Convert target variable values to integers using 'target_mapping'
    test_predictions = np.zeros((len(test), 7))  # Initialize array to store predictions for test data
    valid_predictions = np.zeros((len(X), 7))  # Initialize array to store predictions for validation data

    # Cross-validation
    val_scores, train_scores = [], []
    for fold, (train_ind, valid_ind) in enumerate(cv.split(X, y)):  # Use the splitter passed in as `cv`
        model = clone(estimators)  # Create a fresh, unfitted copy of the given pipeline with the same parameters

        # Define train set
        X_train = X.iloc[train_ind]
        y_train = y.iloc[train_ind]

        # Define valid set
        X_valid = X.iloc[valid_ind]
        y_valid = y.iloc[valid_ind]

        # Train and evaluate the model
        model.fit(X_train, y_train)
        if verbose:
            print("-" * 100)
            print(f"Fold: {fold}")
            print(f"Train Accuracy Score: {accuracy_score(y_true=y_train, y_pred=model.predict(X_train))}")
            print(f"Valid Accuracy Score: {accuracy_score(y_true=y_valid, y_pred=model.predict(X_valid))}")
            print("-" * 100)

        # Calculate predictions for test data
        test_predictions += model.predict_proba(test) / cv.get_n_splits()

        # Save predictions for validation data
        valid_predictions[valid_ind] = model.predict_proba(X_valid)

        # Store validation scores
        val_scores.append(accuracy_score(y_true=y_valid, y_pred=model.predict(X_valid)))

    if verbose:
        print(f"Average Mean Accuracy Score: {np.array(val_scores).mean()}")

    return val_scores, valid_predictions, test_predictions


In [None]:
# Combine Original & Synthetic Data
# Combine the original data with additional provided data
# Remove duplicates from the data to process it into a clean format
# The processed data is used for training machine learning models.

# Remove the 'id' column
train.drop(['id'], axis=1, inplace=True)  # Remove the 'id' column from the training data
test_ids = test['id']  # Save the 'id' column of the test dataset to test_ids for later use
test.drop(['id'], axis=1, inplace=True)  # Also remove the 'id' column from the test dataset

# Combine original and synthetic data
train = pd.concat([train, train_org], axis=0)  # Use 'pd.concat' to append the original dataset (train_org) to the synthetic competition training data

# Remove duplicates
train = train.drop_duplicates()  # Remove duplicate rows from the combined dataset to maintain data consistency and eliminate unnecessary duplicate data

# Reset the index
train.reset_index(drop=True, inplace=True)  # After removing duplicates, reset the index; drop=True discards the old index without adding it as a new column


In [None]:
# Empty dataframes to store scores, train/test predictions.
# These three dataframes are used to systematically store and manage various results generated during the modeling process,
# providing necessary information for analyzing the final prediction results.

score_list, oof_list, predict_list = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

# score_list: Empty dataframe to store the scores of the models.
# oof_list: Empty dataframe to store the predictions made by the model on the training/validation data.
# predict_list: Empty dataframe to store the predictions made by the model on the test data.


# Model

<div style = "font-size:120%">Rather than focusing on a single model, in this competition it is better to combine predictions from several high-performing models. In this notebook we train four different types of models and combine their predictions for the final submission.</div>

* [Random Forest Model](#rfc)
* [LGBM Model](#lgbm)
* [XGB Model](#xgb)
* [Catboost Model](#cat)



<a id = "rfc"> </a>
# Random Forest Model

In [None]:
# Define Random Forest Model Pipeline
# Connect data preprocessing steps and modeling steps using the make_pipeline function.

RFC = make_pipeline(
                        ExtractFeatures,  # Preprocessing step to extract new features from the data, such as calculating BMI
                        MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
                                           'SMOKE','SCC','CALC','MTRANS']),  # Step to encode categorical variables into numerical ones
                       RandomForestClassifier(random_state=RANDOM_SEED)  # Specify the Random Forest classification model, setting random_state=RANDOM_SEED to ensure reproducibility of the model
                    )


In [None]:
# Execute Random Forest Pipeline
# Execute the defined Random Forest pipeline and store predictions for training and test data.

val_scores, val_predictions, test_predictions = cross_val_model(RFC)
# Call cross_val_model(RFC) to perform cross-validation on the Random Forest pipeline (RFC).
# This function returns scores for validation data (val_scores), predictions for validation data (val_predictions), and predictions for test data (test_predictions).

# Store out-of-fold (validation) predictions, one column per class

for k, v in target_mapping.items():
    oof_list[f"rfc_{k}"] = val_predictions[:, v]

# Store test predictions, one column per class
for k, v in target_mapping.items():
    predict_list[f"rfc_{k}"] = test_predictions[:, v]


<a id = "lgbm"></a>
# LGBM Model

In [None]:
# Define Optuna Function To Tune LGBM Model

def lgbm_objective(trial):
    params = {
        'learning_rate' : trial.suggest_float('learning_rate', .001, .1, log = True),
        'max_depth' : trial.suggest_int('max_depth', 2, 20),
        'subsample' : trial.suggest_float('subsample', .5, 1),
        'min_child_weight' : trial.suggest_float('min_child_weight', .1, 15, log = True),
        'reg_lambda' : trial.suggest_float('reg_lambda', .1, 20, log = True),
        'reg_alpha' : trial.suggest_float('reg_alpha', .1, 10, log = True),
        'n_estimators' : 1000,
        'random_state' : RANDOM_SEED,
        'device_type' : "gpu",
        'num_leaves': trial.suggest_int('num_leaves', 10, 1000),

        #'boosting_type' : 'dart',
    }

    optuna_model = make_pipeline(
                                 ExtractFeatures,
                                 MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
                                           'SMOKE','SCC','CALC','MTRANS']),
                                LGBMClassifier(**params,verbose=-1)
                                )
    val_scores, _, _ = cross_val_model(optuna_model,verbose = False)
    return np.array(val_scores).mean()

lgbm_study = optuna.create_study(direction = 'maximize',study_name="LGBM")

In [None]:
# Run LGBM tuning. Set `TUNE` to True to tune (this takes a long time)
TUNE = False

warnings.filterwarnings("ignore")
if TUNE:
    lgbm_study.optimize(lgbm_objective, 50)


In [None]:
numerical_columns = train.select_dtypes(include=['int64', 'float64']).columns.tolist()
categorical_columns = train.select_dtypes(include=['object']).columns.tolist()
categorical_columns.remove('NObeyesdad')

<div style = "font-size:120%">The LGBM parameters in the next cell are taken from @moazeldsokyx's notebook. You may want to check out his great work here:<br></div>

https://www.kaggle.com/code/moazeldsokyx/pgs4e2-highest-score-lgbm-hyperparameter-tuning/notebook


In [None]:
# Define the LGBM pipeline:
# OneHotEncoder for categorical encoding,
# StandardScaler for numerical column scaling.
# `params` is kept for reference only; the pipeline below uses `best_params`.


params = {'learning_rate': 0.04325905707439143, 'max_depth': 4,
          'subsample': 0.6115083405793659, 'min_child_weight': 0.43633356137010687,
          'reg_lambda': 9.231766981717822, 'reg_alpha': 1.875987414096491, 'num_leaves': 373,
          'n_estimators' : 1000,'random_state' : RANDOM_SEED, 'device_type' : "gpu",
         }

best_params = {
    "objective": "multiclass",          # Objective function for the model
    "metric": "multi_logloss",          # Evaluation metric
    "verbosity": -1,                    # Verbosity level (-1 for silent)
    "boosting_type": "gbdt",            # Gradient boosting type
    "random_state": 42,       # Random state for reproducibility
    "num_class": 7,                     # Number of classes in the dataset
    'learning_rate': 0.030962211546832760,  # Learning rate for gradient boosting
    'n_estimators': 500,                # Number of boosting iterations
    'lambda_l1': 0.009667446568254372,  # L1 regularization term
    'lambda_l2': 0.04018641437301800,   # L2 regularization term
    'max_depth': 10,                    # Maximum depth of the trees
    'colsample_bytree': 0.40977129346872643,  # Fraction of features to consider for each tree
    'subsample': 0.9535797422450176,    # Fraction of samples to consider for each boosting iteration
    'min_child_samples': 26             # Minimum number of data needed in a leaf
}

lgbm = make_pipeline(
                        ColumnTransformer(
                        transformers=[('num', StandardScaler(), numerical_columns),
                                  ('cat', OneHotEncoder(handle_unknown="ignore"), categorical_columns)]),
                        LGBMClassifier(**best_params,verbose=-1)
                    )

In [None]:
# Train LGBM Model

val_scores,val_predictions,test_predictions = cross_val_model(lgbm)

for k,v in target_mapping.items():
    oof_list[f"lgbm_{k}"] = val_predictions[:,v]

for k,v in target_mapping.items():
    predict_list[f"lgbm_{k}"] = test_predictions[:,v]

# Average validation accuracy: 0.91420543252078

<a id = "xgb"></a>
# XGB Model

In [None]:
# Optuna study for XGB Model
def xgb_objective(trial):
    params = {
        'grow_policy': trial.suggest_categorical('grow_policy', ["depthwise", "lossguide"]),
        'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 1.0),
        'gamma' : trial.suggest_float('gamma', 1e-9, 1.0),
        'subsample': trial.suggest_float('subsample', 0.25, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.25, 1.0),
        'max_depth': trial.suggest_int('max_depth', 0, 24),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 30),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-9, 10.0, log=True),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-9, 10.0, log=True),
    }

    params['booster'] = 'gbtree'
    params['objective'] = 'multi:softmax'
    params["device"] = "cuda"
    params["verbosity"] = 0
    params['tree_method'] = "hist"  # with device="cuda" this runs on the GPU; "gpu_hist" is deprecated in XGBoost 2.0+


    optuna_model = make_pipeline(
#                     ExtractFeatures,
                    MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
                                           'SMOKE','SCC','CALC','MTRANS']),
                    XGBClassifier(**params,seed=RANDOM_SEED)
                   )

    val_scores, _, _ = cross_val_model(optuna_model,verbose = False)
    return np.array(val_scores).mean()

xgb_study = optuna.create_study(direction = 'maximize')


In [None]:
# Tune using Optuna
TUNE = False
if TUNE:
    xgb_study.optimize(xgb_objective, 50)

In [None]:
# XGB Pipeline

params = {
    'n_estimators': 1312,
    'learning_rate': 0.018279520260162645,
    'gamma': 0.0024196354156454324,
    'reg_alpha': 0.9025931173755949,
    'reg_lambda': 0.06835667255875388,
    'max_depth': 5,
    'min_child_weight': 5,
    'subsample': 0.883274050086088,
    'colsample_bytree': 0.6579828557036317
}
# {'eta': 0.018387615982905264, 'max_depth': 29, 'subsample': 0.8149303101087905, 'colsample_bytree': 0.26750463604831476, 'min_child_weight': 0.5292380065098192, 'reg_lambda': 0.18952063379457604, 'reg_alpha': 0.7201451827004944}

params = {'grow_policy': 'depthwise', 'n_estimators': 690,
               'learning_rate': 0.31829021594473056, 'gamma': 0.6061120644431842,
               'subsample': 0.9032243794829076, 'colsample_bytree': 0.44474031945048287,
               'max_depth': 10, 'min_child_weight': 22, 'reg_lambda': 4.42638097284094,
               'reg_alpha': 5.927900973354344e-07,'seed':RANDOM_SEED}

best_params = {'grow_policy': 'depthwise', 'n_estimators': 982,
               'learning_rate': 0.050053726931263504, 'gamma': 0.5354391952653927,
               'subsample': 0.7060590452456204, 'colsample_bytree': 0.37939433412123275,
               'max_depth': 23, 'min_child_weight': 21, 'reg_lambda': 9.150224029846654e-08,
               'reg_alpha': 5.671063656994295e-08}
best_params['booster'] = 'gbtree'
best_params['objective'] = 'multi:softmax'
best_params["device"] = "cuda"
best_params["verbosity"] = 0
best_params['tree_method'] = "hist"  # with device="cuda" this runs on the GPU; "gpu_hist" is deprecated in XGBoost 2.0+

XGB = make_pipeline(
#                     ExtractFeatures,
#                     MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
#                                            'SMOKE','SCC','CALC','MTRANS']),
#                     FeatureDropper(['FAVC','FCVC']),
#                     ColumnRounder,
#                     ColumnTransformer(
#                     transformers=[('num', StandardScaler(), numerical_columns),
#                                   ('cat', OneHotEncoder(handle_unknown="ignore"), categorical_columns)]),
                    MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
                                           'SMOKE','SCC','CALC','MTRANS']),
                    XGBClassifier(**best_params,seed=RANDOM_SEED)
                   )

In [None]:
val_scores,val_predictions,test_predictions = cross_val_model(XGB)

for k,v in target_mapping.items():
    oof_list[f"xgb_{k}"] = val_predictions[:,v]

for k,v in target_mapping.items():
    predict_list[f"xgb_{k}"] = test_predictions[:,v]

# 0.90634942296329
#0.9117093455898445 with rounder
#0.9163506382522121

<a id = "cat"></a>
# Catboost Model


In [None]:
# Optuna Function For Catboost Model
def cat_objective(trial):

    params = {

        'iterations': 1000,  # High number of estimators
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),
        'depth': trial.suggest_int('depth', 3, 10),
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 0.01, 10.0),
        'bagging_temperature': trial.suggest_float('bagging_temperature', 0.0, 1.0),
        'random_seed': RANDOM_SEED,
        'verbose': False,
        'task_type':"GPU"
    }

    cat_features = ['Gender','family_history_with_overweight','FAVC','FCVC','NCP',
                'CAEC','SMOKE','CH2O','SCC','FAF','TUE','CALC','MTRANS']
    optuna_model = make_pipeline(
                        ExtractFeatures,
#                         AgeRounder,
#                         HeightRounder,
#                         MEstimateEncoder(cols = raw_cat_cols),
                        CatBoostClassifier(**params,cat_features=cat_features)
                        )
    val_scores,_,_ = cross_val_model(optuna_model,verbose = False)
    return np.array(val_scores).mean()

cat_study = optuna.create_study(direction = 'maximize')

In [None]:
params = {'learning_rate': 0.13762007048684638, 'depth': 5,
          'l2_leaf_reg': 5.285199432056192, 'bagging_temperature': 0.6029582154263095,
         'random_seed': RANDOM_SEED,
        'verbose': False,
        'task_type':"GPU",
         'iterations':1000}


CB = make_pipeline(
#                         ExtractFeatures,
#                         AgeRounder,
#                         HeightRounder,
#                         MEstimateEncoder(cols = raw_cat_cols),
#                         CatBoostEncoder(cols = cat_features),
                        CatBoostClassifier(**params, cat_features=categorical_columns)
                        )

In [None]:
# Train Catboost Model
val_scores,val_predictions,test_predictions = cross_val_model(CB)
for k,v in target_mapping.items():
    oof_list[f"cat_{k}"] = val_predictions[:,v]

for k,v in target_mapping.items():
    predict_list[f"cat_{k}"] = test_predictions[:,v]

# best 0.91179835368868 with extract features, n_splits = 10
# best 0.9121046227778054 without extract features, n_splits = 10

# Model Evaluation

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
import matplotlib.pyplot as plt
import numpy as np

# skf = StratifiedKFold(n_splits=5)
weights = {"rfc_": 0,
           "lgbm_": 3,
           "xgb_": 1,
           "cat_": 0}

tmp = oof_list.copy()
for k, v in target_mapping.items():
    tmp[f"{k}"] = (weights['rfc_'] * tmp[f"rfc_{k}"] +
                   weights['lgbm_'] * tmp[f"lgbm_{k}"] +
                   weights['xgb_'] * tmp[f"xgb_{k}"] +
                   weights['cat_'] * tmp[f"cat_{k}"])
tmp['pred'] = tmp[target_mapping.keys()].idxmax(axis=1)
tmp['label'] = train[TARGET]

ensemble_accuracy = accuracy_score(train[TARGET], tmp['pred'])

cm = confusion_matrix(y_true=tmp['label'].map(target_mapping),
                      y_pred=tmp['pred'].map(target_mapping),
                      normalize='true')

cm = cm.round(2)
fig, ax = plt.subplots(figsize=(8, 8))
disp = ConfusionMatrixDisplay(confusion_matrix=cm,
                              display_labels=list(target_mapping.keys()))
# cmap is an argument of plot(), not of the constructor; use a blue color palette
disp.plot(ax=ax, cmap='Blues', xticks_rotation=50)
plt.tight_layout()
plt.show()

print(f"Ensemble Accuracy Score: {ensemble_accuracy}")

"""   BEST     """

# Best LB [0,1,0,0]
# Average Train Score:0.9142044335854003
# Average Valid Score:0.91420543252078

# Best CV [1,3, 1,1]
# Average Train Score:0.9168308163711971
# Average Valid Score:0.9168308163711971
# adding the original data improves the score


# Final Submission

In [None]:
weights

In [None]:
for k,v in target_mapping.items():
    predict_list[f"{k}"] = (weights['rfc_']*predict_list[f"rfc_{k}"]+
                            weights['lgbm_']*predict_list[f"lgbm_{k}"]+
                            weights['xgb_']*predict_list[f"xgb_{k}"]+
                            weights['cat_']*predict_list[f"cat_{k}"])

final_pred = predict_list[target_mapping.keys()].idxmax(axis = 1)

sample_sub[TARGET] = final_pred
sample_sub.to_csv("submission.csv",index=False)
sample_sub


### Feel free to fork and experiment. Some things you may try:
* Try different weights and see how CV and LB change. `weights = {'rfc_': 0.0, 'lgbm_': 1.0, 'xgb_': 0.0, 'cat_': 0.0}` gives the best LB score, so you may want to try that first.
* This notebook uses `StandardScaler`; next, you can try a log transform or `MinMaxScaler`.
* Define new features in the `extract_features` function by combining existing features.
* Tune the models again using Optuna.
* This notebook uses a weighted average; next, you could use a linear model to combine the predictions.
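
One way the last suggestion could look is sketched below: a logistic regression (a linear model) is fit on stacked out-of-fold probabilities instead of using fixed weights. Everything here is synthetic and hypothetical: `fake_oof` merely stands in for columns such as those in `oof_list`, and the scores say nothing about this competition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_samples, n_classes = 300, 3
y = rng.integers(0, n_classes, size=n_samples)

def fake_oof(noise):
    """Stand-in for one model's out-of-fold probabilities (synthetic data)."""
    logits = np.eye(n_classes)[y] * 2.0 + rng.normal(scale=noise, size=(n_samples, n_classes))
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

# Stack the per-model probabilities side by side: one block of columns per model.
oof_features = np.hstack([fake_oof(1.0), fake_oof(1.5)])

# The meta-model learns how much to trust each model's probability columns.
meta_model = LogisticRegression(max_iter=1000)
stack_scores = cross_val_score(meta_model, oof_features, y, cv=5, scoring="accuracy")
print(stack_scores.mean())
```

Scoring the meta-model with its own cross-validation matters, because fitting and evaluating it on the same out-of-fold rows would overstate the gain.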

**This notebook is inspired by the awesome work of Iqbal Syah Akbar. I highly recommend checking out his work.**

https://www.kaggle.com/code/iqbalsyahakbar/ps4e1-3rd-place-solution





**Happy Coding !!**


### Juyoung's Model

In [None]:
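import os
import random as rn

import tensorflow as tf

# List the available Kaggle input directories
os.listdir('/kaggle/input')
# Seed the random module and TensorFlow for reproducibility.
# Note: PYTHONHASHSEED set here does not change hashing in the already-running interpreter.
os.environ['PYTHONHASHSEED'] = '51'
rn.seed(89)
tf.random.set_seed(40)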

In [None]:
# Set parameters for reproducibility
pd.set_option("display.max_rows",100)
FILE_PATH = "/kaggle/input/playground-series-s4e2/"
TARGET = "NObeyesdad"
n_splits = 9
RANDOM_SEED = 73

### Load Data

In [None]:
# load all data
train = pd.read_csv(os.path.join(FILE_PATH, "train.csv"))
test = pd.read_csv(os.path.join(FILE_PATH, "test.csv"))
sample_sub = pd.read_csv(os.path.join(FILE_PATH, "sample_submission.csv"))
train_org = pd.read_csv("/kaggle/input/obesity-or-cvd-risk-classifyregressorcluster/ObesityDataSet.csv")

In [None]:
# numeric data
raw_num_cols = list(train.select_dtypes("float").columns)
# categorical data
raw_cat_cols = list(train.columns.drop(raw_num_cols+[TARGET]))

# Feature Engineering & Processing

This section defines the functions and custom transformers that preprocess the data before it reaches a model. The `age_rounder` and `height_rounder` functions multiply the 'Age' and 'Height' columns by 100 and cast them to `uint16`, which keeps two decimal places as an integer (the cast truncates any remaining digits rather than rounding them). The `extract_features` function computes BMI (Body Mass Index) as 'Weight' divided by the square of 'Height' and adds it as a new column. The `col_rounder` function rounds the columns 'FCVC', 'NCP', 'CH2O', 'FAF', and 'TUE' to the nearest integer. Each of these functions is wrapped in a `FunctionTransformer` so it can be used as a pipeline step. Finally, the `FeatureDropper` class is a custom transformer that drops the specified columns from the input DataFrame.
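
The difference between truncating and rounding is easy to miss, so here is a small sketch on made-up values. It compares the `uint16` cast used by the rounders with an explicit `round()` before the cast, and computes BMI the same way `extract_features` does. The toy dataframe is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up values: height in meters, weight in kilograms.
df = pd.DataFrame({"Height": [1.699, 1.75], "Weight": [60.0, 81.0]})

# Casting to uint16 drops the fractional part: 169.9 becomes 169.
truncated = (df["Height"] * 100).astype(np.uint16)

# Rounding first gives the nearest integer instead: 169.9 becomes 170.
rounded = (df["Height"] * 100).round().astype(np.uint16)

# BMI = weight / height^2
bmi = df["Weight"] / df["Height"] ** 2

print(truncated.tolist(), rounded.tolist())
print(bmi.round(2).tolist())
```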

In [None]:
# Convert age to an integer scale for potential model performance improvement
def age_rounder(x):
    # Make a copy of the input dataframe
    x_copy = x.copy()
    # Multiply 'Age' by 100 and cast to uint16 (the cast truncates; it does not round)
    x_copy['Age'] = (x_copy['Age']*100).astype(np.uint16)
    return x_copy

# Convert height to an integer scale for potential model performance improvement
def height_rounder(x):
    # Make a copy of the input dataframe
    x_copy = x.copy()
    # Multiply 'Height' by 100 and cast to uint16 (the cast truncates; it does not round)
    x_copy['Height'] = (x_copy['Height']*100).astype(np.uint16)
    return x_copy

# Extract a new feature: BMI (the PseudoTarget binning is left commented out)
def extract_features(x):
    # Make a copy of the input dataframe
    x_copy = x.copy()
    # Calculate BMI using 'Weight' and 'Height' columns
    x_copy['BMI'] = (x_copy['Weight']/x_copy['Height']**2)
#     x_copy['PseudoTarget'] = pd.cut(x_copy['BMI'], bins=[0,18.4,24.9,29,34.9,39.9,100], labels=[0,1,2,3,4,5])  # Define 'PseudoTarget' based on BMI categories
    return x_copy

# Round specific columns' values
def col_rounder(x):
    # Make a copy of the input dataframe
    x_copy = x.copy()
    # Define columns to round
    cols_to_round = ['FCVC',"NCP","CH2O","FAF","TUE"]
    # Round values in the specified columns and convert them to integers
    for col in cols_to_round:
        x_copy[col] = x_copy[col].round().astype('int')
    return x_copy

# Create FunctionTransformer objects for each transformation
AgeRounder = FunctionTransformer(age_rounder)
HeightRounder = FunctionTransformer(height_rounder)
ExtractFeatures = FunctionTransformer(extract_features)
ColumnRounder = FunctionTransformer(col_rounder)

In [None]:
# Custom transformer to drop specified columns
from sklearn.base import BaseEstimator, TransformerMixin

class FeatureDropper(BaseEstimator, TransformerMixin):
    # Initialize FeatureDropper with columns to drop
    def __init__(self, cols):
        self.cols = cols

    # Fit method (nothing to learn); pipelines call fit before transform,
    # and y is optional so the transformer also works without a target
    def fit(self, x, y=None):
        return self

    # Transform method to drop specified columns from input DataFrame
    def transform(self, x):
        return x.drop(self.cols, axis=1)


# Define cross_val_model (used to train and validate all the models)

The cells below cross-validate each model with Stratified K-Fold. They first map the target labels to integers and create the `StratifiedKFold` splitter. Within each fold, `cross_val_model` fits a fresh clone of the pipeline, stores the out-of-fold class probabilities for the validation rows, and adds that fold's test-set probabilities to a running average. If `verbose` is enabled, it prints the training and validation accuracy for each fold and the mean validation accuracy at the end. The remaining cells drop the `id` column, concatenate the original dataset with the synthetic competition data, remove duplicate rows, reset the index, and initialize empty dataframes for scores and predictions.
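
The out-of-fold pattern can be sketched in isolation as follows. This is a minimal sketch on synthetic data, assuming a plain `LogisticRegression` as a stand-in for the pipelines; `X_new` plays the role of the test set, and none of the numbers relate to the competition.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# Synthetic 3-class problem (assumption for illustration).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=0)
X_new = X[:20]  # stand-in for an unlabeled test set

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
base_model = LogisticRegression(max_iter=1000)

oof = np.zeros((len(X), 3))            # one row per training sample
test_pred = np.zeros((len(X_new), 3))  # averaged over the folds

for train_idx, valid_idx in cv.split(X, y):
    # A fresh clone per fold, so no state leaks between folds.
    model = clone(base_model).fit(X[train_idx], y[train_idx])
    # Each row is predicted only by the model that never saw it.
    oof[valid_idx] = model.predict_proba(X[valid_idx])
    test_pred += model.predict_proba(X_new) / cv.get_n_splits()

oof_accuracy = (oof.argmax(axis=1) == y).mean()
print(oof_accuracy)
```

Because every row of `oof` comes from a model that did not train on that row, the out-of-fold predictions can be used later to pick ensemble weights without leaking training labels.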

In [None]:
# Cross-validate models using Stratified K-Fold

# Encode target values as integers
target_mapping = {
                  'Insufficient_Weight':0,
                  'Normal_Weight':1,
                  'Overweight_Level_I':2,
                  'Overweight_Level_II':3,
                  'Obesity_Type_I':4,
                  'Obesity_Type_II':5 ,
                  'Obesity_Type_III':6
                  }

# Method for cross-validation using StratifiedKFold by default
skf = StratifiedKFold(n_splits=n_splits)

def cross_val_model(estimators, cv=skf, verbose=True):
    '''
        estimators : pipeline consisting of preprocessing, encoder & model
        cv : cross-validation splitter (default: StratifiedKFold)
        verbose : whether to print train/valid scores
    '''

    # Copy the original data
    X = train.copy()
    # Extract target variable and remove it from training data
    y = X.pop(TARGET)

    # Map target variable values to integers
    y = y.map(target_mapping)

    # Initialize arrays to store predictions (7 target classes)
    test_predictions = np.zeros((len(test),7))
    valid_predictions = np.zeros((len(X),7))

    # Initialize list to store validation scores
    val_scores = []

    # Iterate over each fold, using the splitter passed in as `cv`
    for fold, (train_ind, valid_ind) in enumerate(cv.split(X,y)):
        # Clone the model to ensure a fresh instance for each fold
        model = clone(estimators)

        # Define train set
        X_train = X.iloc[train_ind]
        y_train = y.iloc[train_ind]
        # Define valid set
        X_valid = X.iloc[valid_ind]
        y_valid = y.iloc[valid_ind]

        # Train the model
        model.fit(X_train, y_train)

        # Print training and validation accuracies if verbose is True
        if verbose:
            print("-" * 100)
            print(f"Fold: {fold}")
            print(f"Train Accuracy Score:-{accuracy_score(y_true=y_train,y_pred=model.predict(X_train))}")
            print(f"Valid Accuracy Score:-{accuracy_score(y_true=y_valid,y_pred=model.predict(X_valid))}")
            print("-" * 100)

        # Store predictions for test and validation data
        test_predictions += model.predict_proba(test)/cv.get_n_splits()
        valid_predictions[valid_ind] = model.predict_proba(X_valid)

        # Calculate validation scores and store them
        val_scores.append(accuracy_score(y_true=y_valid,y_pred=model.predict(X_valid)))

    # Print average mean accuracy score if verbose is True
    if verbose:
        print(f"Average Mean Accuracy Score:- {np.array(val_scores).mean()}")

    # Return validation scores and predictions, and test predictions
    return val_scores, valid_predictions, test_predictions

In [None]:
# Combine Original & Synthetic Data
# Drop 'id' column from the training data
train.drop(['id'],axis = 1, inplace = True)

# Store test IDs before dropping 'id' column from the test data
test_ids = test['id']
# Drop 'id' column from the test data
test.drop(['id'],axis = 1, inplace=True)

# Combine original and synthetic data by concatenating them along axis 0 (rows)
train = pd.concat([train,train_org],axis = 0)
# Remove duplicate rows from the combined data
train = train.drop_duplicates()
# Reset index after removing duplicates
train.reset_index(drop=True, inplace=True)

In [None]:
# Initialize empty dataframes to store scores & train/test predictions
score_list, oof_list, predict_list = pd.DataFrame(), pd.DataFrame(), pd.DataFrame()

# Random Forest Model

This section defines and runs a Random Forest pipeline. The pipeline first adds the BMI feature, then encodes the categorical columns with `MEstimateEncoder`, and finally fits a `RandomForestClassifier` with a fixed random state. It is evaluated with `cross_val_model`. The out-of-fold probabilities for each class are stored in separate columns of the `oof_list` dataframe, and the test-set probabilities are stored the same way in `predict_list`.
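
For intuition about the encoding step, the sketch below implements the usual m-estimate formula by hand: each category is replaced by `(sum of target in category + m * global mean) / (count in category + m)`. It illustrates the idea under that assumption and is not the `category_encoders` implementation; the helper name `m_estimate_encode` and the toy data are hypothetical.

```python
import pandas as pd

def m_estimate_encode(categories, target, m=1.0):
    """Hand-rolled m-estimate target encoding (illustrative sketch only)."""
    prior = target.mean()  # global mean of the integer-coded target
    stats = target.groupby(categories).agg(["sum", "count"])
    # Smooth each category mean toward the prior; larger m means more smoothing.
    encoding = (stats["sum"] + m * prior) / (stats["count"] + m)
    return categories.map(encoding), encoding

# Made-up rows: a categorical column and an integer-coded target.
toy = pd.DataFrame({
    "CAEC": ["no", "no", "Sometimes", "Sometimes", "Sometimes", "Always"],
    "target": [0, 1, 2, 2, 3, 6],
})
encoded, table = m_estimate_encode(toy["CAEC"], toy["target"], m=1.0)
print(table)
```

Rare categories such as "Always" (a single row) are pulled strongly toward the global mean, which is the point of the smoothing term.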

In [None]:
# Define Random Forest Model Pipeline
RFC = make_pipeline(
                    # Extract features
                    ExtractFeatures,
                    # Encode categorical features using M-Estimate Encoder
                    MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
                                           'SMOKE','SCC','CALC','MTRANS']),
                    # Random Forest Classifier
                    RandomForestClassifier(random_state=RANDOM_SEED)
                    )

In [None]:
# Execute Random Forest Pipeline
val_scores,val_predictions,test_predictions = cross_val_model(RFC)

# Save out-of-fold (validation) predictions in dataframes
for k,v in target_mapping.items():
    # Save validation set predictions for each class in a separate column in the 'oof_list' dataframe
    oof_list[f"rfc_{k}"] = val_predictions[:,v]

# Save test predictions in dataframes
for k,v in target_mapping.items():
    # Save test set predictions for each class in a separate column in the 'predict_list' dataframe
    predict_list[f"rfc_{k}"] = test_predictions[:,v]

# LGBM Model

This section tunes and trains a LightGBM model. The objective function `lgbm_objective` lets Optuna search over hyperparameters such as learning rate, max depth, and subsample ratio, scoring each trial by mean cross-validated accuracy, and the study `lgbm_study` is created to maximize that score. Tuning only runs if `TUNE` is set to True. Two parameter sets are then defined: `params`, which is not used further, and `best_params`, taken from another notebook, which the final model uses. The final LightGBM pipeline standard-scales the numerical features and one-hot encodes the categorical features. The model is trained with cross-validation, and its validation and test predictions are stored in `oof_list` and `predict_list`.
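
To show what the final preprocessing step produces, here is a small sketch of a `ColumnTransformer` that scales one numeric column and one-hot encodes one categorical column. The toy frames are made up, and `sparse_threshold=0` is set only so the example returns a dense array that is easy to inspect.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training and test rows.
toy_train = pd.DataFrame({"Weight": [60.0, 80.0, 100.0],
                          "MTRANS": ["Walking", "Automobile", "Walking"]})
toy_test = pd.DataFrame({"Weight": [80.0], "MTRANS": ["Bike"]})

pre = ColumnTransformer(
    transformers=[("num", StandardScaler(), ["Weight"]),
                  ("cat", OneHotEncoder(handle_unknown="ignore"), ["MTRANS"])],
    sparse_threshold=0,  # always return a dense array (for readability here)
)

# Statistics and categories are learned from the training rows only.
train_matrix = pre.fit_transform(toy_train)
test_matrix = pre.transform(toy_test)

print(train_matrix)
print(test_matrix)
```

The test row has the training mean weight, so its scaled value is 0, and its category "Bike" was never seen during fitting, so `handle_unknown="ignore"` encodes it as all zeros.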

In [None]:
# Define Optuna function to tune LGBM model
def lgbm_objective(trial):
    # Define hyperparameters to be tuned
    params = {
        'learning_rate' : trial.suggest_float('learning_rate', .001, .1, log = True),  # Learning rate tuning
        'max_depth' : trial.suggest_int('max_depth', 2, 20),  # Max depth tuning
        'subsample' : trial.suggest_float('subsample', .5, 1),  # Subsample ratio tuning
        'min_child_weight' : trial.suggest_float('min_child_weight', .1, 15, log = True),  # Min child weight tuning
        'reg_lambda' : trial.suggest_float('reg_lambda', .1, 20, log = True),  # L2 regularization tuning
        'reg_alpha' : trial.suggest_float('reg_alpha', .1, 10, log = True),  # L1 regularization tuning
        'n_estimators' : 1000,  # Number of estimators
        'random_state' : RANDOM_SEED,  # Random state for reproducibility
        'device_type' : "gpu",  # Specify device type (GPU)
        'num_leaves': trial.suggest_int('num_leaves', 10, 1000),  # Number of leaves tuning
#         'boosting_type' : 'dart',  # Boosting type (DART)
    }

    # Construct LightGBM model pipeline with Optuna suggested hyperparameters
    optuna_model = make_pipeline(
                                 # Extract features
                                 ExtractFeatures,
                                 # Encode categorical features
                                 MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
                                           'SMOKE','SCC','CALC','MTRANS']),
                                 # LightGBM classifier with tuned hyperparameters
                                 LGBMClassifier(**params,verbose=-1)
                                 )

    # Perform cross-validation and obtain validation scores
    val_scores, _, _ = cross_val_model(optuna_model,verbose = False)

    # Return the mean of validation scores as the optimization objective
    return np.array(val_scores).mean()

# Create an Optuna study for maximizing the validation score
lgbm_study = optuna.create_study(direction = 'maximize',study_name="LGBM")

In [None]:
# Run LGBM tuning. Set `TUNE` to True to tune (this takes a long time)
TUNE = False

# Ignore warning messages
warnings.filterwarnings("ignore")

if TUNE:
    # Perform optimization with 50 trials
    lgbm_study.optimize(lgbm_objective, 50)

In [None]:
# Get list of numerical columns
numerical_columns = train.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Get list of categorical columns
categorical_columns = train.select_dtypes(include=['object']).columns.tolist()

# Remove target column from categorical columns
categorical_columns.remove('NObeyesdad')

The LGBM parameters in the next cell are taken from @moazeldsokyx's notebook. You may want to check out his great work here:
https://www.kaggle.com/code/moazeldsokyx/pgs4e2-highest-score-lgbm-hyperparameter-tuning/notebook

In [None]:
# Earlier parameter set, kept for reference only (the pipeline below uses `best_params`)
params = {'learning_rate': 0.04325905707439143, 'max_depth': 4,
          'subsample': 0.6115083405793659, 'min_child_weight': 0.43633356137010687,
          'reg_lambda': 9.231766981717822, 'reg_alpha': 1.875987414096491, 'num_leaves': 373,
          'n_estimators' : 1000,'random_state' : RANDOM_SEED, 'device_type' : "gpu",
         }

# Parameters taken from @moazeldsokyx's notebook (linked above); these are the ones the pipeline uses
best_params = {
    "objective": "multiclass",          # Objective function for the model
    "metric": "multi_logloss",          # Evaluation metric
    "verbosity": -1,                    # Verbosity level (-1 for silent)
    "boosting_type": "gbdt",            # Gradient boosting type
    "random_state": 42,                 # Random state for reproducibility
    "num_class": 7,                     # Number of classes in the dataset
    'learning_rate': 0.030962211546832760,  # Learning rate for gradient boosting
    'n_estimators': 500,                # Number of boosting iterations
    'lambda_l1': 0.009667446568254372,  # L1 regularization term
    'lambda_l2': 0.04018641437301800,   # L2 regularization term
    'max_depth': 10,                    # Maximum depth of the trees
    'colsample_bytree': 0.40977129346872643,  # Fraction of features to consider for each tree
    'subsample': 0.9535797422450176,    # Fraction of samples to consider for each boosting iteration
    'min_child_samples': 26             # Minimum number of data needed in a leaf
}

# Define LGBM pipeline
lgbm = make_pipeline(
                        ColumnTransformer(
                        transformers=[('num', StandardScaler(), numerical_columns),
                                  ('cat', OneHotEncoder(handle_unknown="ignore"), categorical_columns)]),
                        LGBMClassifier(**best_params,verbose=-1)
                    )

In [None]:
# Train LGBM Model
val_scores,val_predictions,test_predictions = cross_val_model(lgbm)

# Save out-of-fold (validation) predictions in DataFrame
for k,v in target_mapping.items():
    oof_list[f"lgbm_{k}"] = val_predictions[:,v]

# Save test predictions in DataFrame
for k,v in target_mapping.items():
    predict_list[f"lgbm_{k}"] = test_predictions[:,v]

# XGB Model

This section tunes and trains an XGBoost model. The objective function `xgb_objective` asks Optuna for hyperparameter values within the specified ranges and returns the mean cross-validated accuracy. The study `xgb_study` is created to maximize that score, and tuning only runs if `TUNE` is set to True. Initial and tuned hyperparameters are then defined, an XGBoost pipeline is constructed, the model is trained with cross-validation, and its validation and test predictions are stored in `oof_list` and `predict_list`.
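
Some of the search ranges in `xgb_objective` use `log=True`, which makes Optuna sample the value in the log domain. The sketch below imitates that with NumPy to show the effect on a range such as `1e-9` to `10.0`; `sample_log_uniform` is a hypothetical helper, not an Optuna function.

```python
import numpy as np

def sample_log_uniform(low, high, size, rng):
    """Draw values uniformly in log space (hypothetical stand-in for log=True)."""
    return np.exp(rng.uniform(np.log(low), np.log(high), size=size))

rng = np.random.default_rng(0)
linear_draws = rng.uniform(1e-9, 10.0, size=10_000)
log_draws = sample_log_uniform(1e-9, 10.0, size=10_000, rng=rng)

# Fraction of draws below 1.0 under each scheme.
share_below_one_linear = (linear_draws < 1.0).mean()
share_below_one_log = (log_draws < 1.0).mean()
print(share_below_one_linear, share_below_one_log)
```

With linear sampling only about a tenth of the draws fall below 1.0, while log sampling spends roughly nine tenths of them there, which is why regularization strengths spanning many orders of magnitude are usually searched on a log scale.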

In [None]:
# Optuna study for XGB Model
# Define the objective function for the Optuna study to optimize XGBoost hyperparameters
def xgb_objective(trial):
    # Define hyperparameters to be tuned
    params = {
        'grow_policy': trial.suggest_categorical('grow_policy', ["depthwise", "lossguide"]),  # Choose between 'depthwise' and 'lossguide' for tree growth policy
        'n_estimators': trial.suggest_int('n_estimators', 100, 2000),  # Number of boosting iterations
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 1.0),  # Learning rate
        'gamma' : trial.suggest_float('gamma', 1e-9, 1.0),  # Minimum loss reduction required to make a further partition on a leaf node
        'subsample': trial.suggest_float('subsample', 0.25, 1.0),  # Subsample ratio of training instances
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.25, 1.0),  # Subsample ratio of columns when constructing each tree
        'max_depth': trial.suggest_int('max_depth', 0, 24),  # Maximum depth of the tree
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 30),  # Minimum sum of instance weight needed in a child
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-9, 10.0, log=True),  # L2 regularization term on weights
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-9, 10.0, log=True),  # L1 regularization term on weights
    }

    # Additional parameters for XGBoost model
    params['booster'] = 'gbtree'  # Use tree-based models
    params['objective'] = 'multi:softmax'  # Objective function for multiclass classification
    params["device"] = "cuda"  # Use GPU for computation
    params["verbosity"] = 0  # Verbosity level
    params['tree_method'] = "gpu_hist"  # Method for tree construction using GPU


    # Construct XGB model pipeline with suggested hyperparameters
    optuna_model = make_pipeline(
#                     ExtractFeatures,  # Extract features from data
                    # Encode categorical features using M-Estimate Encoder
                    MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
                                           'SMOKE','SCC','CALC','MTRANS']),
                    # XGBoost classifier with tuned hyperparameters
                    XGBClassifier(**params,seed=RANDOM_SEED)
                   )

    # Perform cross-validation with the model pipeline and return mean validation scores
    val_scores, _, _ = cross_val_model(optuna_model,verbose = False)
    return np.array(val_scores).mean()

# Create an Optuna study for maximizing validation scores
xgb_study = optuna.create_study(direction = 'maximize')

In [None]:
# Run XGB tuning. Set `TUNE` to True to tune (this takes a long time)
TUNE = False

if TUNE:
    # Perform optimization with 50 trials
    xgb_study.optimize(xgb_objective, n_trials=50)

In [None]:
# Initial hyperparameters for the XGBoost model (kept for reference; overwritten below and not used by the final pipeline)
params = {
    'n_estimators': 1312,
    'learning_rate': 0.018279520260162645,
    'gamma': 0.0024196354156454324,
    'reg_alpha': 0.9025931173755949,
    'reg_lambda': 0.06835667255875388,
    'max_depth': 5,
    'min_child_weight': 5,
    'subsample': 0.883274050086088,
    'colsample_bytree': 0.6579828557036317
}
# {'eta': 0.018387615982905264, 'max_depth': 29, 'subsample': 0.8149303101087905, 'colsample_bytree': 0.26750463604831476, 'min_child_weight': 0.5292380065098192, 'reg_lambda': 0.18952063379457604, 'reg_alpha': 0.7201451827004944}

# Updated hyperparameters for the XGBoost model (also kept for reference; the pipeline below uses `best_params`)
params = {'grow_policy': 'depthwise', 'n_estimators': 690,
               'learning_rate': 0.31829021594473056, 'gamma': 0.6061120644431842,
               'subsample': 0.9032243794829076, 'colsample_bytree': 0.44474031945048287,
               'max_depth': 10, 'min_child_weight': 22, 'reg_lambda': 4.42638097284094,
               'reg_alpha': 5.927900973354344e-07,'seed':RANDOM_SEED}

# Best hyperparameters found for the XGBoost model
best_params = {'grow_policy': 'depthwise', 'n_estimators': 982,
               'learning_rate': 0.050053726931263504, 'gamma': 0.5354391952653927,
               'subsample': 0.7060590452456204, 'colsample_bytree': 0.37939433412123275,
               'max_depth': 23, 'min_child_weight': 21, 'reg_lambda': 9.150224029846654e-08,
               'reg_alpha': 5.671063656994295e-08}

# Adding additional parameters to the best_params dictionary
best_params['booster'] = 'gbtree'
best_params['objective'] = 'multi:softmax'
best_params["device"] = "cuda"
best_params["verbosity"] = 0
best_params['tree_method'] = "gpu_hist"

# XGBoost pipeline
XGB = make_pipeline(
                    # Feature encoding and XGBoost classifier
#                     ExtractFeatures,  # Extract features from data
#                     MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
#                                            'SMOKE','SCC','CALC','MTRANS']),  # Encode categorical features
#                     FeatureDropper(['FAVC','FCVC']),  # Drop features from data
#                     ColumnRounder,  # Round column values
#                     ColumnTransformer(
#                     transformers=[('num', StandardScaler(), numerical_columns),
#                                   ('cat', OneHotEncoder(handle_unknown="ignore"), categorical_columns)]),
                    # Encode categorical features using M-Estimate Encoder
                    MEstimateEncoder(cols=['Gender','family_history_with_overweight','FAVC','CAEC',
                                           'SMOKE','SCC','CALC','MTRANS']),
                    # XGBoost classifier with tuned hyperparameters
                    XGBClassifier(**best_params,seed=RANDOM_SEED)
                   )

In [None]:
# Cross-validate XGBoost model and get validation scores and predictions
val_scores,val_predictions,test_predictions = cross_val_model(XGB)

# Store validation predictions in respective categories
for k,v in target_mapping.items():
    oof_list[f"xgb_{k}"] = val_predictions[:,v]

# Store test predictions in respective categories
for k,v in target_mapping.items():
    predict_list[f"xgb_{k}"] = test_predictions[:,v]

# CatBoost Model

This section defines an Optuna objective function, `cat_objective`, for tuning the CatBoost model. With the number of iterations fixed at 1000, it searches over the learning rate, tree depth, L2 regularization coefficient, and bagging temperature using Optuna's `suggest_*` methods. For each trial it builds a pipeline that applies `ExtractFeatures` and then fits a `CatBoostClassifier` with the categorical columns passed through `cat_features`; the separate rounding and encoding steps are commented out. It then returns the mean cross-validation score. A study object, `cat_study`, is created to maximize this objective, although no call to `cat_study.optimize` appears in this section. The final model uses a fixed set of hyperparameters, is trained with cross-validation, and stores its validation and test predictions in `oof_list` and `predict_list`.
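
`cross_val_model` is defined earlier in the notebook and is not reproduced in this section. As a rough sketch of the out-of-fold idea it is assumed to follow, the standard-library example below gives each training row a validation prediction from a fold model that never saw it, and averages test predictions across the fold models. `PriorModel`, `k_fold_indices`, and `out_of_fold` are hypothetical names chosen to keep the example dependency-free, and the interleaved, unshuffled folds are a simplification rather than the notebook's actual splitting strategy.

```python
from collections import Counter


class PriorModel:
    """Hypothetical toy classifier: predicts the training class frequencies."""

    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.probs = [1.0 / n_classes] * n_classes

    def fit(self, labels):
        counts = Counter(labels)
        self.probs = [counts.get(c, 0) / len(labels) for c in range(self.n_classes)]
        return self

    def predict_proba(self, n_rows):
        return [list(self.probs) for _ in range(n_rows)]


def k_fold_indices(n_rows, n_splits):
    """Yield (train_idx, val_idx) pairs using interleaved, unshuffled folds."""
    for fold in range(n_splits):
        val_idx = [i for i in range(n_rows) if i % n_splits == fold]
        train_idx = [i for i in range(n_rows) if i % n_splits != fold]
        yield train_idx, val_idx


def out_of_fold(labels, n_test_rows, n_classes, n_splits=3):
    """Return per-row validation probabilities and fold-averaged test probabilities."""
    oof = [None] * len(labels)
    test_pred = [[0.0] * n_classes for _ in range(n_test_rows)]
    for train_idx, val_idx in k_fold_indices(len(labels), n_splits):
        model = PriorModel(n_classes).fit([labels[i] for i in train_idx])
        # Each validation row is predicted by a model that never saw it.
        for i, row in zip(val_idx, model.predict_proba(len(val_idx))):
            oof[i] = row
        # Test predictions are averaged over the fold models.
        for acc, row in zip(test_pred, model.predict_proba(n_test_rows)):
            for c in range(n_classes):
                acc[c] += row[c] / n_splits
    return oof, test_pred


toy_labels = [0, 1, 2, 1, 0, 1, 2, 1, 0]
toy_oof, toy_test_pred = out_of_fold(toy_labels, n_test_rows=2, n_classes=3)
print(toy_oof[0], toy_test_pred[0])
```

Under this assumption, column `v` of the validation array holds the out-of-fold probability of class `v` for every training row, which is why the notebook can slice it with `val_predictions[:,v]` and score the result against the training labels without leakage.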

In [None]:
# Optuna Function For Catboost Model
def cat_objective(trial):
    # Define parameters to be optimized
    params = {
        'iterations': 1000,  # Number of boosting iterations
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),  # Learning rate for gradient boosting
        'depth': trial.suggest_int('depth', 3, 10),  # Depth of the trees
        'l2_leaf_reg': trial.suggest_float('l2_leaf_reg', 0.01, 10.0),  # L2 regularization coefficient
        'bagging_temperature': trial.suggest_float('bagging_temperature', 0.0, 1.0),  # Temperature for Bayesian bootstrap
        'random_seed': RANDOM_SEED,  # Random seed for reproducibility
        'verbose': False,  # Whether to print information during training
        'task_type':"GPU"  # Whether to use GPU for training
    }

    # Define categorical features for CatBoost
    cat_features = ['Gender','family_history_with_overweight','FAVC','FCVC','NCP',
                'CAEC','SMOKE','CH2O','SCC','FAF','TUE','CALC','MTRANS']

    # Create a pipeline for the CatBoost model with optimized parameters
    optuna_model = make_pipeline(
                        ExtractFeatures,  # Extract features from data
#                         AgeRounder,  # Round age feature
#                         HeightRounder,  # Round height feature
#                         MEstimateEncoder(cols = raw_cat_cols),  # Encode categorical features
                        CatBoostClassifier(**params,cat_features=cat_features)  # CatBoost Classifier
                        )

    # Perform cross-validation and get mean validation score
    val_scores,_,_ = cross_val_model(optuna_model,verbose = False)
    return np.array(val_scores).mean()

# Create an Optuna study object for maximizing the objective function
cat_study = optuna.create_study(direction = 'maximize')

In [None]:
# Define parameters for CatBoost model
params = {'learning_rate': 0.13762007048684638, 'depth': 5,
          'l2_leaf_reg': 5.285199432056192, 'bagging_temperature': 0.6029582154263095,
          'random_seed': RANDOM_SEED,
          'verbose': False,
          'task_type': "GPU",
          'iterations': 1000}

# Create a pipeline for the CatBoost model with specified parameters
CB = make_pipeline(
                   # Feature preprocessing steps (currently commented out)
#                    ExtractFeatures,  # Extract features from data
#                    AgeRounder,  # Round age feature
#                    HeightRounder,  # Round height feature
#                    MEstimateEncoder(cols = raw_cat_cols),  # Encode categorical features
#                    CatBoostEncoder(cols = cat_features),  # CatBoost Encoder
                   CatBoostClassifier(**params, cat_features=categorical_columns)  # CatBoost Classifier
                   )
# Train Catboost Model
val_scores,val_predictions,test_predictions = cross_val_model(CB)

# Store validation predictions in respective categories
for k,v in target_mapping.items():
    oof_list[f"cat_{k}"] = val_predictions[:,v]

# Store test predictions in respective categories
for k,v in target_mapping.items():
    predict_list[f"cat_{k}"] = test_predictions[:,v]

# Model Evaluation

This section assigns a weight to each model in the ensemble (Random Forest, LightGBM, XGBoost, and CatBoost) and combines their per-class out-of-fold predictions as a weighted sum. With the weights used here (0, 3, 1, and 0), the ensemble is effectively a 3:1 blend of LightGBM and XGBoost; the Random Forest and CatBoost predictions are ignored. The predicted class for each row is the one with the highest combined score. The ensemble is evaluated by its accuracy against the actual labels of the training data, and a row-normalized confusion matrix is plotted to show how well each target class is classified.
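
The weighted blend can be illustrated with a small dependency-free sketch. The functions `blend` and `accuracy`, the two placeholder class names, and all probabilities below are made up for illustration; the notebook itself performs the same kind of computation on DataFrame columns.

```python
def blend(prob_tables, model_weights, class_names):
    """Weighted sum of per-model class probabilities, then argmax per row.

    prob_tables maps a model name to a list of rows, one probability per class.
    """
    n_rows = len(next(iter(prob_tables.values())))
    predictions = []
    for i in range(n_rows):
        scores = [
            sum(model_weights[name] * rows[i][c] for name, rows in prob_tables.items())
            for c in range(len(class_names))
        ]
        predictions.append(class_names[scores.index(max(scores))])
    return predictions


def accuracy(y_true, y_pred):
    """Fraction of rows where the prediction matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical two-class example with made-up probabilities.
toy_classes = ["Normal_Weight", "Obesity_Type_I"]
toy_probs = {
    "lgbm": [[0.6, 0.4], [0.3, 0.7], [0.55, 0.45]],
    "xgb": [[0.3, 0.7], [0.4, 0.6], [0.1, 0.9]],
}
toy_weights = {"lgbm": 3, "xgb": 1}
toy_truth = ["Normal_Weight", "Obesity_Type_I", "Normal_Weight"]

toy_pred = blend(toy_probs, toy_weights, toy_classes)
print(toy_pred, accuracy(toy_truth, toy_pred))
```

Because only the largest score per row matters, the weights do not need to sum to one, and a weight of zero removes a model from the blend entirely.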

In [None]:
# Define weights for different models in the ensemble
weights = {"rfc_":0,  # Weight for Random Forest Classifier
           "lgbm_":3,  # Weight for LightGBM Classifier
           "xgb_":1,  # Weight for XGBoost Classifier
           "cat_":0}  # Weight for CatBoost Classifier

# Make a copy of the oof_list
tmp = oof_list.copy()

# Combine predictions from different models with specified weights
for k,v in target_mapping.items():
    tmp[f"{k}"] = (weights['rfc_']*tmp[f"rfc_{k}"]+  # Combine predictions with Random Forest weight
              weights['lgbm_']*tmp[f"lgbm_{k}"]+  # Combine predictions with LightGBM weight
              weights['xgb_']*tmp[f"xgb_{k}"]+  # Combine predictions with XGBoost weight
              weights['cat_']*tmp[f"cat_{k}"])  # Combine predictions with CatBoost weight

# Combine predictions to get ensemble prediction
tmp['pred'] = tmp[list(target_mapping.keys())].idxmax(axis = 1)  # Pick the class with the highest weighted score for each row
tmp['label'] = train[TARGET]  # Actual labels of the training data

# Calculate ensemble accuracy score
print(f"Ensemble Accuracy Score: {accuracy_score(train[TARGET],tmp['pred'])}")

# Generate confusion matrix for the ensemble prediction
cm = confusion_matrix(y_true = tmp['label'].map(target_mapping),  # Generate confusion matrix based on actual labels
                      y_pred = tmp['pred'].map(target_mapping),  # and ensemble prediction
                     normalize='true')  # Normalize the confusion matrix

cm = cm.round(2)  # Round confusion matrix values

# Plot the confusion matrix
fig, ax = plt.subplots(figsize=(8,8))  # Create the axes explicitly so the figure size applies to the plot
disp = ConfusionMatrixDisplay(confusion_matrix = cm,
                              display_labels = list(target_mapping.keys()))
disp.plot(ax=ax, xticks_rotation=50)  # Plot confusion matrix with rotated xticks for better readability
plt.tight_layout()
plt.show()

# Submission

In [None]:
# Combine predictions from different models with specified weights to generate final predictions
for k,v in target_mapping.items():
    predict_list[f"{k}"] = (weights['rfc_']*predict_list[f"rfc_{k}"]+
                            weights['lgbm_']*predict_list[f"lgbm_{k}"]+
                            weights['xgb_']*predict_list[f"xgb_{k}"]+
                            weights['cat_']*predict_list[f"cat_{k}"])

# Combine predictions to get final prediction
final_pred = predict_list[list(target_mapping.keys())].idxmax(axis = 1)

# Update submission dataframe with final predictions and save it to a CSV file
sample_sub[TARGET] = final_pred
sample_sub.to_csv("submission.csv",index=False)
sample_sub