# Obesity Risk Prediction Project

## Introduction
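This project predicts obesity risk as a multi-class classification problem. The dataset and the challenge come from a Kaggle obesity risk prediction competition. The objective is to train two gradient-boosted tree models, XGBoost and LightGBM, on features such as age, height, weight, and gender, and to submit their predictions to the competition leaderboard.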

## Competition Result

My submissions for the Kaggle obesity risk prediction competition resulted in the following scores:

- XGBoost Model:
  - Public Score: 0.90634
  - Private Score: 0.90968

- LightGBM Model:
  - Public Score: 0.90281
  - Private Score: 0.90823

Both models scored above 90% accuracy on the public and private leaderboards. The winning score was 0.91157, so my best private score (XGBoost, 0.90968) was about 0.002 below it. To close that gap, future iterations could explore additional feature engineering, model ensembling, or more thorough hyperparameter tuning.
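
Of the follow-ups listed above, model ensembling is the simplest to sketch. The snippet below is a minimal illustration of soft voting, assuming each model can supply per-class probabilities (for example from a `predict_proba` call). The probability rows and the `weight_a` parameter are made up for illustration and are not results from this project.

```python
def soft_vote(probs_a, probs_b, weight_a=0.5):
    """Blend two per-class probability rows and return the winning class index."""
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same classes")
    # Weighted average of the two models' probabilities, class by class
    blended = [weight_a * pa + (1.0 - weight_a) * pb
               for pa, pb in zip(probs_a, probs_b)]
    best_class = max(range(len(blended)), key=blended.__getitem__)
    return best_class, blended


# Hypothetical probabilities for one sample and three classes
probs_xgb = [0.50, 0.40, 0.10]
probs_lgb = [0.20, 0.70, 0.10]

winner, blended = soft_vote(probs_xgb, probs_lgb)
print(winner, [round(p, 2) for p in blended])
```

With equal weights the second class wins here even though the first model alone would have picked the first class. Whether blending helps on this dataset would have to be checked on a validation split.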


## Dataset Description
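The training data combines two files: the competition training set (`data/train.csv`) and an additional obesity dataset (`data/ObesityDataSet.csv`). The target variable is `NObeyesdad`, a categorical obesity level with seven classes. After merging the files and holding out 20% of the rows for evaluation, the training split contains 18,295 rows and 17 features, including the engineered `BMI` and `BFP` columns. Preprocessing consists of dropping the `id` and `SMOKE` columns, encoding the categorical features, and standardizing the numerical features.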


## Data Preparation and Modeling Overview

In this section, I walk through the process of preparing the data and training models to predict obesity risk. Here's a summary of what I've done:

### Initial Steps
- I started by importing the libraries needed for data manipulation and model training.
- I loaded the competition training set and the additional obesity dataset, removing any `id` column so that both frames share the same columns before concatenation.
- After merging the datasets, I dropped the `SMOKE` feature because I judged it to have little relevance to the prediction goal.

### Feature Engineering
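- I calculated the BMI for each individual as weight divided by height squared, providing a foundational metric for obesity risk.
- Gender was temporarily encoded as a binary value (1 for male, 0 for female) for use in the BFP formula. The helper column is dropped afterwards, and the original `Gender` column stays categorical.
- Using BMI, age, and gender, I derived the Body Fat Percentage (BFP), a potentially more insightful feature for obesity risk.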
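
The two formulas can be checked on a single person with plain Python. This is a minimal sketch that uses the same coefficients as the notebook code; the function names and the example person (70 kg, 1.75 m, 30 years old) are illustrative assumptions.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight divided by height squared."""
    return weight_kg / height_m ** 2


def body_fat_percentage(bmi_value, age_years, is_male):
    """BFP estimate with the coefficients used in the notebook."""
    gender_encoded = 1 if is_male else 0
    return 1.20 * bmi_value + 0.23 * age_years - 10.8 * gender_encoded - 5.4


# Hypothetical example person, not a row from the dataset
example_bmi = bmi(70.0, 1.75)
bfp_male = body_fat_percentage(example_bmi, 30, is_male=True)
bfp_female = body_fat_percentage(example_bmi, 30, is_male=False)

print(f"BMI: {example_bmi:.2f}")
print(f"BFP (male): {bfp_male:.2f}, BFP (female): {bfp_female:.2f}")
```

For a fixed BMI and age, the male and female estimates differ by exactly the 10.8 gender term, so BFP adds information beyond BMI only through age and gender.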

### Preparing Data for Modeling
- I split the features and target, identified and converted categorical features for encoding, and encoded the target variable.
- The dataset was then split into training and testing sets to evaluate the model's performance.
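
Target encoding is easy to picture without any library. The sketch below mimics what a label encoder does (sort the distinct class names, map each to an integer, and map back for the submission); the helper name and the three class labels are illustrative assumptions, not the full label set of the dataset.

```python
def fit_label_mapping(labels):
    """Return the sorted class names and a name -> integer lookup."""
    classes = sorted(set(labels))
    return classes, {name: index for index, name in enumerate(classes)}


# Hypothetical target values for four rows
labels = ["Obesity_Type_I", "Normal_Weight", "Obesity_Type_I", "Overweight_Level_II"]

classes, to_index = fit_label_mapping(labels)
encoded = [to_index[name] for name in labels]   # integers used for training
decoded = [classes[index] for index in encoded]  # names restored for submission

print(classes)
print(encoded)
```

The round trip matters later in the notebook: predictions come back as integers and must be turned into class names again before they are written to the submission file.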

### Scaling and Model Training
- I standardized the numerical features with a `StandardScaler` fitted on the training split only. For XGBoost, a `ColumnTransformer` additionally one-hot encodes the categorical features; LightGBM receives the scaled data with its categorical columns kept as `category` dtype.
- I trained and evaluated two models, XGBoost and LightGBM, choosing them for their performance in classification tasks.
- I calculated each model's accuracy on the held-out test split to assess how well it predicts obesity risk.

Through these steps, I aimed to build a robust predictive model that combines the engineered features with gradient-boosted tree ensembles.
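
The scaling and scoring steps can be sketched in plain Python. The point of the example is that the mean and standard deviation are computed on the training values only and then reused for the test values. The helper names and the small age lists are illustrative assumptions.

```python
from statistics import fmean, pstdev


def fit_standardizer(train_values):
    """Learn mean and population standard deviation from training data only."""
    mean = fmean(train_values)
    std = pstdev(train_values)
    return mean, (std if std > 0 else 1.0)  # avoid dividing by zero


def standardize(values, mean, std):
    return [(value - mean) / std for value in values]


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ages, not rows from the dataset
train_ages = [20.0, 30.0, 40.0]
test_ages = [30.0, 50.0]

mean, std = fit_standardizer(train_ages)
train_scaled = standardize(train_ages, mean, std)
test_scaled = standardize(test_ages, mean, std)  # reuses the training statistics

print([round(v, 3) for v in train_scaled], [round(v, 3) for v in test_scaled])
print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))
```

Fitting the statistics on the training split alone keeps information from the test rows out of the preprocessing, which is why the notebook calls `fit_transform` on the training data and only `transform` on the test data.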


In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.compose import ColumnTransformer
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Load the datasets
file_path_orig = 'data/train.csv'
orig_data = pd.read_csv(file_path_orig)
file_path_new = 'data/ObesityDataSet.csv'
new_data = pd.read_csv(file_path_new)

# Drop 'id' column if present in both datasets before concatenation
if 'id' in orig_data.columns:
    orig_data.drop('id', axis=1, inplace=True)
if 'id' in new_data.columns:
    new_data.drop('id', axis=1, inplace=True)

# Merge the datasets
data = pd.concat([orig_data, new_data], ignore_index=True)

# Drop the 'SMOKE' feature, which is not used by the models
data.drop('SMOKE', axis=1, inplace=True)

# BMI calculation
data['BMI'] = data['Weight'] / (data['Height'] ** 2)

# Encode Gender as 0 for females and 1 for males
data['Gender_encoded'] = data['Gender'].apply(lambda x: 1 if x == 'Male' else 0)

# Body Fat Percentage (BFP) calculation
data['BFP'] = (1.20 * data['BMI']) + (0.23 * data['Age']) - (10.8 * data['Gender_encoded']) - 5.4

# Drop the intermediate columns used for calculation
data.drop(['Gender_encoded'], axis=1, inplace=True)

# Prepare the features (X) and target (y)
X = data.drop(['NObeyesdad'], axis=1)
y = data['NObeyesdad']

# Specify categorical and numerical features
categorical_features = X.select_dtypes(include=['object', 'category']).columns.tolist()
numerical_features = X.select_dtypes(include=[np.number]).columns.tolist()

# Convert categorical features to 'category' dtype
for col in categorical_features:
    X[col] = X[col].astype('category')

# Initialize the LabelEncoder and encode the target variable
le = LabelEncoder()
y_encoded = le.fit_transform(y)

# Split the dataset into training and testing sets
X_train, X_test, y_train_encoded, y_test_encoded = train_test_split(X, y_encoded, test_size=0.2, random_state=42)

# Initialize the StandardScaler and fit it on the numerical features
scaler = StandardScaler()
X_train_scaled = X_train.copy()
X_test_scaled = X_test.copy()
X_train_scaled[numerical_features] = scaler.fit_transform(X_train[numerical_features])
X_test_scaled[numerical_features] = scaler.transform(X_test[numerical_features])

# One-hot encode categorical features and scale numerical features in a single transformer
column_transformer = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown='ignore'), categorical_features),
    ("num", StandardScaler(), numerical_features)
], remainder='passthrough')

# Apply transformations for models that require them (e.g., XGBoost)
X_train_transformed = column_transformer.fit_transform(X_train)
X_test_transformed = column_transformer.transform(X_test)

# Determine the number of classes for the multi-class classification problem
num_class = len(np.unique(y_encoded))

# Train and evaluate XGBoost on the one-hot encoded, scaled features.
# The deprecated `use_label_encoder` argument is omitted; the target is already integer-encoded.
xgb_model = XGBClassifier(objective='multi:softmax', num_class=num_class, eval_metric='mlogloss')
xgb_model.fit(X_train_transformed, y_train_encoded)
y_pred_xgb = xgb_model.predict(X_test_transformed)
accuracy_xgb = accuracy_score(y_test_encoded, y_pred_xgb)
print(f'XGBoost Accuracy: {accuracy_xgb}')

# LightGBM Model
lgb_model = LGBMClassifier()
lgb_model.fit(X_train_scaled, y_train_encoded, categorical_feature=[X_train.columns.get_loc(name) for name in categorical_features])
y_pred_lgb = lgb_model.predict(X_test_scaled)
accuracy_lgb = accuracy_score(y_test_encoded, y_pred_lgb)
print(f'LightGBM Accuracy: {accuracy_lgb}')

XGBoost Accuracy: 0.9011805859204197
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001121 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2545
[LightGBM] [Info] Number of data points in the train set: 18295, number of used features: 17
[LightGBM] [Info] Start training from score -2.113635
[LightGBM] [Info] Start training from score -1.912636
[LightGBM] [Info] Start training from score -1.954584
[LightGBM] [Info] Start training from score -1.867412
[LightGBM] [Info] Start training from score -1.646747
[LightGBM] [Info] Start training from score -2.124554
[LightGBM] [Info] Start training from score -2.093921
LightGBM Accuracy: 0.9035854831657193


## Model Performance Summary

On the 20% hold-out split, both models reached about 90% accuracy:

- **XGBoost Model:** Achieved an accuracy of approximately 90.12%.
- **LightGBM Model:** Achieved an accuracy of around 90.36%, about 0.24 percentage points above XGBoost, after training on 18,295 rows with 17 features.

LightGBM's log output also documents its training setup: it auto-selected row-wise multi-threading, used 2,545 histogram bins in total, and trained on 18,295 data points with 17 features. The seven "Start training from score" lines are the initial scores for the seven target classes. Their exponentials sum to 1, which is consistent with each score being the logarithm of that class's share of the training split.
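
The "Start training from score" values in the log can be turned back into class shares by exponentiating them. The sketch below copies the seven scores from the output above; reading them as log class proportions is an interpretation that the sum check supports, not something the log states directly.

```python
import math

# Initial scores copied from the LightGBM log, one per target class
init_scores = [-2.113635, -1.912636, -1.954584, -1.867412,
               -1.646747, -2.124554, -2.093921]

# exp(score) gives the implied share of each class in the training split
class_shares = [math.exp(score) for score in init_scores]
total_share = sum(class_shares)
largest_class = max(range(len(class_shares)), key=class_shares.__getitem__)

for index, share in enumerate(class_shares):
    print(f"class {index}: {share:.4f}")
print(f"sum of shares: {total_share:.4f}")
```

The shares range from roughly 12% to 19%, so the classes are only mildly imbalanced, which makes plain accuracy a reasonable metric for this split.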

In conclusion, both models reached roughly 90% hold-out accuracy, with LightGBM ahead by a margin too small to be conclusive from a single train/test split. On the competition leaderboard the order was reversed, with XGBoost scoring slightly higher. For this specific task, both models are viable options.
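
How much weight the accuracy gap can carry is easy to estimate. The sketch below is a rough check under stated assumptions: the hold-out size of 4,574 rows is inferred from the 18,295 training rows and the 20% split, and the binomial standard error treats test rows as independent and ignores that both models were scored on the same rows.

```python
import math

n_test = 4574                   # assumption: inferred hold-out size (20% split)
acc_xgb = 0.9011805859204197    # printed by the first XGBoost run
acc_lgb = 0.9035854831657193    # printed by the first LightGBM run


def accuracy_standard_error(accuracy, n_samples):
    """Binomial standard error of an accuracy estimate."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


se_xgb = accuracy_standard_error(acc_xgb, n_test)
gap = acc_lgb - acc_xgb
extra_correct = round(acc_lgb * n_test) - round(acc_xgb * n_test)

print(f"standard error: {se_xgb:.4f}")
print(f"accuracy gap:   {gap:.4f}")
print(f"extra correct predictions: {extra_correct}")
```

The gap is smaller than one standard error of either estimate, so this split alone does not show that one model is better than the other.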


In [2]:
# XGBoost Model with specified parameters
xgb_model_final = XGBClassifier(
    objective='multi:softmax',
    num_class=num_class,
    eval_metric='mlogloss',
    colsample_bytree=0.8,
    learning_rate=0.05,
    max_depth=5,
    min_child_weight=6,
    n_estimators=300,
    subsample=0.9
)

xgb_model_final.fit(X_train_transformed, y_train_encoded)
y_pred_xgb = xgb_model_final.predict(X_test_transformed)
accuracy_xgb = accuracy_score(y_test_encoded, y_pred_xgb)
print(f'XGBoost Accuracy: {accuracy_xgb}')

XGBoost Accuracy: 0.9020550940096196


In [3]:
# LightGBM Model with specified parameters
lgb_model_final = LGBMClassifier(
    feature_fraction=0.8,
    learning_rate=0.08,
    max_depth=3,
    n_estimators=300,
    num_leaves=7
)
lgb_model_final.fit(X_train_scaled, y_train_encoded, categorical_feature=[X_train.columns.get_loc(name) for name in categorical_features])
y_pred_lgb = lgb_model_final.predict(X_test_scaled)
accuracy_lgb = accuracy_score(y_test_encoded, y_pred_lgb)
print(f'LightGBM Accuracy: {accuracy_lgb}')

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.000936 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2545
[LightGBM] [Info] Number of data points in the train set: 18295, number of used features: 17
[LightGBM] [Info] Start training from score -2.113635
[LightGBM] [Info] Start training from score -1.912636
[LightGBM] [Info] Start training from score -1.954584
[LightGBM] [Info] Start training from score -1.867412
[LightGBM] [Info] Start training from score -1.646747
[LightGBM] [Info] Start training from score -2.124554
[LightGBM] [Info] Start training from score -2.093921
LightGBM Accuracy: 0.9031482291211194


## Test Data Preparation and Prediction Submission

- I loaded the test dataset and calculated the same additional features (BMI and BFP) as in the training phase.
- I dropped the `SMOKE` column, set the `id` column aside for the submission file, and converted the categorical features to `category` dtype.
- I prepared the data separately for each model: scaled numerical features for LightGBM, and the output of the fitted `ColumnTransformer` for XGBoost.
- I predicted obesity risk with both models, decoded the predictions back to class labels, and saved one submission CSV per model.

Reusing the scaler, transformer, and label encoder fitted during training keeps the test-time preprocessing consistent with the training phase.
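
A cheap safeguard for this step is to compare the test columns with the training columns before predicting. The helper below is a hypothetical sketch, not part of the notebook, and the column lists are an illustrative subset of the real feature set.

```python
def compare_feature_columns(train_columns, test_columns):
    """Report columns missing from, or unexpected in, the test data."""
    train_columns, test_columns = list(train_columns), list(test_columns)
    missing = [c for c in train_columns if c not in test_columns]
    unexpected = [c for c in test_columns if c not in train_columns]
    same_order = train_columns == test_columns
    return missing, unexpected, same_order


# Illustrative subset of columns; the raw test file has not been prepared yet
train_cols = ["Gender", "Age", "Height", "Weight", "BMI", "BFP"]
raw_test_cols = ["id", "Gender", "Age", "Height", "Weight", "SMOKE"]

missing, unexpected, same_order = compare_feature_columns(train_cols, raw_test_cols)
print("missing:", missing)
print("unexpected:", unexpected)
print("same order:", same_order)
```

Run against the raw test file, the check flags the engineered features that still need to be computed and the columns that still need to be dropped. After preparation, both lists should be empty and the order should match.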


In [4]:
# Load the uploaded dataset
file_path = 'test/test.csv'
test_data = pd.read_csv(file_path)

# Calculate BMI and add it to the dataset
test_data['BMI'] = test_data['Weight'] / (test_data['Height'] ** 2)

# Encode Gender as 0 for females and 1 for males
test_data['Gender_encoded'] = test_data['Gender'].apply(lambda x: 1 if x == 'Male' else 0)

# Calculate Body Fat Percentage using the formula
test_data['BFP'] = (1.20 * test_data['BMI']) + (0.23 * test_data['Age']) - (10.8 * test_data['Gender_encoded']) - 5.4

# Drop the intermediate columns used for calculation
test_data.drop(['Gender_encoded'], axis=1, inplace=True)

# Assuming 'SMOKE' column should also be dropped if it's not used in the model
if 'SMOKE' in test_data.columns:
    test_data.drop('SMOKE', axis=1, inplace=True)

# Save 'id' column for submission, then drop it for prediction if it exists
if 'id' in test_data.columns:
    test_ids = test_data['id'].copy()
    test_data = test_data.drop('id', axis=1)

# Convert categorical features to 'category' dtype for LightGBM and encode for XGBoost
for col in categorical_features:
    test_data[col] = test_data[col].astype('category')

# Prepare the data for LightGBM: no one-hot encoding is needed because it handles
# categorical features natively, so only the numerical features are scaled
X_test_lgb = test_data.copy()
X_test_lgb[numerical_features] = scaler.transform(X_test_lgb[numerical_features])

# Apply the transformations for XGBoost
X_test_xgb = column_transformer.transform(test_data)

# Make predictions with both models
y_pred_xgb = xgb_model_final.predict(X_test_xgb)
y_pred_lgb = lgb_model_final.predict(X_test_lgb)

# XGB
submission_predictions_xgb = le.inverse_transform(y_pred_xgb)

# Create the submission DataFrame for XGB
submission_xgb = pd.DataFrame({
    'id': test_ids,
    'NObeyesdad': submission_predictions_xgb
})

# Export to CSV for XGB
submission_file_path_xgb = 'test/submission_xgb_bfp2.csv'
submission_xgb.to_csv(submission_file_path_xgb, index=False)

print(f'Submission file saved to: {submission_file_path_xgb}')

#LGB
submission_predictions_lgb = le.inverse_transform(y_pred_lgb)

# Create the submission DataFrame for LGB
submission_lgb = pd.DataFrame({
    'id': test_ids,
    'NObeyesdad': submission_predictions_lgb
})

# Export to CSV for LGB
submission_file_path_lgb = 'test/submission_lgb_bfp2.csv'
submission_lgb.to_csv(submission_file_path_lgb, index=False)

print(f'Submission file saved to: {submission_file_path_lgb}')


Submission file saved to: test/submission_xgb_bfp2.csv
Submission file saved to: test/submission_lgb_bfp2.csv


## Competition Conclusion



![image.png](attachment:df469260-13b8-488b-846a-8fd88a96b51a.png)

![image.png](attachment:286802f4-d19e-49e7-a617-accba3b30ddf.png)