# Step 4: XGBoost Hyperparameter Tuning

In this notebook, we use Optuna, a hyperparameter optimization framework, to find the best XGBoost configuration for our calorie expenditure prediction task.

## Why XGBoost?
- Gradient boosting algorithms typically perform well on tabular data
- Handles a mix of feature types and scales effectively
- Built-in regularization to prevent overfitting
- Captures both linear and non-linear relationships

## Why Optuna?
- Efficient Bayesian optimization algorithms
- Automatic pruning of unpromising trials
- Parallel computation support
- Visualization capabilities for understanding hyperparameter importance

In [None]:
# Install required libraries if not already present
!pip install optuna xgboost

# Import necessary libraries
import optuna  # For hyperparameter optimization
import pandas as pd  # For data manipulation
import numpy as np  # For numerical operations
from sklearn.model_selection import cross_val_score, KFold  # For cross-validation
from xgboost import XGBRegressor  # XGBoost implementation





## 1. Load Feature-Engineered Training Data

This dataset contains all the features we created in the feature engineering step (feature_engineering_step2b.ipynb).
The dataset includes both original features and engineered features like polynomial terms and interaction features.

Note: For XGBoost, we use the raw Calories values rather than log-transformed ones, as XGBoost can handle
non-normal distributions well. However, we'll still evaluate using RMSLE (Root Mean Squared Log Error).
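
For reference, RMSLE is the root mean squared difference between `log1p` of the predictions and `log1p` of the targets. The following is a minimal sketch in plain Python. The `rmsle` helper and the calorie values are hypothetical and are not taken from the dataset or from the notebook's `rmsle_cv` function.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Squared difference of log1p-transformed prediction and target
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Made-up calorie values, for illustration only
y_true = [150.0, 34.0, 29.0]
y_pred = [146.0, 36.0, 29.0]
print(rmsle(y_true, y_pred))
```

Because the metric works on `log1p` values, it penalizes relative errors rather than absolute ones, and it is undefined for predictions below -1.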

In [None]:
# Load the feature-engineered training dataset
train = pd.read_csv('datasets/train_fe.csv')

# Separate target variable (Calories) and features
y = train['Calories']  # Target variable
X = train.drop(columns='Calories')  # Feature matrix

In [None]:
# Define function to evaluate models using Root Mean Squared Log Error (RMSLE)
# via 5-fold cross-validation
def rmsle_cv(model):
    # Create 5-fold cross-validation splits with fixed random seed for reproducibility
    kf = KFold(n_splits=5, shuffle=True, random_state=42)
    
    # Compute the MSLE for each fold
    # Note: sklearn returns negated scores for error metrics, so we flip the sign
    scores = -cross_val_score(model, X, y, scoring="neg_mean_squared_log_error", cv=kf, n_jobs=-1)
    
    # Return the square root of the mean score (RMSLE)
    return np.sqrt(scores.mean())

## 2. Define Optuna Objective Function

The objective function defines what we want to optimize (minimize RMSLE in our case).
It creates an XGBoost model with parameters suggested by Optuna and evaluates it using cross-validation.


In [None]:
# Define the objective function for Optuna to minimize
def objective(trial):
    # Define the hyperparameter search space
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 300, 1000),  # Number of boosting rounds
        'max_depth': trial.suggest_int('max_depth', 4, 10),            # Maximum tree depth
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3),  # Learning rate (eta)
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),       # Subsample ratio of training data
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),  # Feature subsampling ratio
        'gamma': trial.suggest_float('gamma', 0, 5),                   # Minimum loss reduction for split
        'reg_alpha': trial.suggest_float('reg_alpha', 0, 5),           # L1 regularization
        'reg_lambda': trial.suggest_float('reg_lambda', 0, 5),         # L2 regularization
        'random_state': 42,  # For reproducibility
        'n_jobs': -1         # Use all available CPU cores
    }

    # Create XGBoost model with the suggested parameters
    model = XGBRegressor(**params)
    
    # Return RMSLE score (lower is better)
    return rmsle_cv(model)

In [None]:
# Create an Optuna study object
# direction='minimize' because we want to minimize RMSLE
study = optuna.create_study(direction='minimize')

# Run the optimization process with 30 trials
# Each trial tests a different hyperparameter combination
study.optimize(objective, n_trials=30, show_progress_bar=True)

# Print the best trial information
print("Best trial:")
print(study.best_trial)


[I 2025-05-20 16:53:52,718] A new study created in memory with name: no-name-3e057a1c-902a-4a5b-ac05-6e37b99b6a65


  0%|          | 0/30 [00:00<?, ?it/s]

[I 2025-05-20 16:54:22,695] Trial 0 finished with value: 0.0175326463663772 and parameters: {'n_estimators': 578, 'max_depth': 8, 'learning_rate': 0.28150303899633333, 'subsample': 0.8451286232509485, 'colsample_bytree': 0.8621780128576577, 'gamma': 0.04994641396930677, 'reg_alpha': 1.0994169717138007, 'reg_lambda': 2.806855569115347}. Best is trial 0 with value: 0.0175326463663772.
[I 2025-05-20 16:54:42,568] Trial 1 finished with value: 0.01841559479917221 and parameters: {'n_estimators': 406, 'max_depth': 5, 'learning_rate': 0.1665008027620495, 'subsample': 0.891553116896902, 'colsample_bytree': 0.9017646206016794, 'gamma': 2.4755901597006087, 'reg_alpha': 4.572622877455029, 'reg_lambda': 4.127360945205066}. Best is trial 0 with value: 0.0175326463663772.
[I 2025-05-20 16:54:58,658] Trial 2 finished with value: 0.018115600101071418 and parameters: {'n_estimators': 322, 'max_depth': 10, 'learning_rate': 0.2243484540748487, 'subsample': 0.9799274954878197, 'colsample_bytree': 0.666108

## 3. Train Final XGBoost Model with Best Parameters

Now that Optuna has identified the best hyperparameters among the 30 trials, we'll:
1. Extract the best parameter set
2. Train a final model using these parameters
3. Use this model for making predictions

In [None]:
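# Extract the best parameters found during optimization
best_params = study.best_trial.params

# Train the final XGBoost model with the best parameters.
# study.best_trial.params holds only the tuned values, so the fixed settings
# from the objective (random_state, n_jobs) are passed again here.
best_model = XGBRegressor(**best_params, random_state=42, n_jobs=-1)
best_model.fit(X, y)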

### 🏆 Best XGBoost Parameters from Trial 29

These optimized hyperparameters represent our best configuration after 30 trials.
Each parameter plays a specific role in the model's performance:

| Parameter          | Value                       | Purpose                                             |
| ------------------ | --------------------------- | --------------------------------------------------- |
| `n_estimators`     | 761                         | Number of trees in the ensemble                      |
| `max_depth`        | 8                           | Maximum tree depth (controls model complexity)       |
| `learning_rate`    | 0.0433                      | Step size shrinkage to prevent overfitting           |
| `subsample`        | 0.8292                      | Fraction of samples used for fitting trees           |
| `colsample_bytree` | 0.6293                      | Fraction of features used for fitting each tree      |
| `gamma`            | 0.0251                      | Minimum loss reduction for a split                   |
| `reg_alpha`        | 0.8449                      | L1 regularization on weights                        |
| `reg_lambda`       | 2.7842                      | L2 regularization on weights                        |
| **RMSLE**          | **0.01711** ✅ (best so far) | Our evaluation metric - lower is better             |

In [None]:
from xgboost import XGBRegressor

# Load the feature-engineered test data
test = pd.read_csv("datasets/test_fe.csv")

# Load test IDs for submission file
test_ids = pd.read_csv("datasets/test_ids.csv")['id']

# Recreate the best model with exact parameters from optimization
best_model = XGBRegressor(
    n_estimators=761,
    max_depth=8,
    learning_rate=0.04329564685888236,
    subsample=0.829156412199964,
    colsample_bytree=0.6293472330739741,
    gamma=0.025125225652620986,
    reg_alpha=0.8448922499045819,
    reg_lambda=2.784200742308772,
    random_state=42,
    n_jobs=-1
)

# Fit on the full feature-engineered training data
best_model.fit(X, y)

# Make predictions on the test data
test_preds_log = best_model.predict(test)

# Reverse a log1p transformation to get predictions on the original scale.
# Note: this is only correct if the Calories target in train_fe.csv is
# log1p-transformed; if the model was trained on raw Calories, use
# test_preds_log directly instead.
test_preds = np.expm1(test_preds_log)

In [None]:
# Create a submission dataframe with the required format
submission = pd.DataFrame({
    'id': test_ids,        # Test sample IDs
    'Calories': test_preds # Predicted calorie values
})

# Save the submission file
submission.to_csv("datasets/submissions/submission_xgb_tuned_may20.csv", index=False)
print("Submission file 'submission_xgb_tuned_may20.csv' created.")

Submission file 'submission_xgb_tuned_may20.csv' created.


# Summary of XGBoost Tuning

## Key Findings:

1. **Best RMSLE:** 0.01711 - A significant improvement over baseline models

2. **Important Parameters:**
   - The high number of estimators (761) indicates that the model benefits from a larger ensemble
   - A tree depth of 8 suggests that moderate complexity is sufficient
   - The low learning rate (0.0433) helps with generalization
   - L2 regularization (2.7842) higher than L1 (0.8449) suggests that weight smoothing is important

3. **Next Steps:**
   - Use these parameters for ensemble models
   - Try SHAP analysis to understand feature contributions
   - Consider further feature engineering based on model insights