# Step 3: Baseline Modeling
Train initial regression models and evaluate them with RMSLE.

## Overview
This notebook establishes baseline performance using several regression models:
- Linear Regression (simplest approach)
- Random Forest (tree-based ensemble)
- XGBoost (gradient boosting)
- LightGBM (gradient boosting)

We use 5-fold cross-validation to estimate each model's error on held-out data and to identify the most promising algorithm.

In [None]:
# Import necessary libraries for data manipulation, modeling and evaluation
import pandas as pd  # For data manipulation and analysis
import numpy as np   # For numerical operations
from sklearn.linear_model import LinearRegression  # Basic linear regression model
from sklearn.ensemble import RandomForestRegressor  # Tree-based ensemble model
from sklearn.model_selection import cross_val_score, KFold  # For cross-validation
from sklearn.metrics import mean_squared_log_error  # For evaluation metric

## 1. Load Preprocessed Data
Use the preprocessed files, in which `Calories` has been log1p-transformed and the `id` column has been removed.

The preprocessed data includes:
- One-hot encoded categorical variables (Sex)
- log1p-transformed target variable (Calories)
- No identifier columns (id) that would interfere with modeling
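
For context, a preprocessing step that produces files in this shape might look roughly like the sketch below. It is not the actual preprocessing code from the previous step: the `preprocess` helper, the `Age` column, and the toy values are assumptions made for illustration; only `id`, `Sex`, and `Calories` come from this notebook.

```python
import numpy as np
import pandas as pd


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: drop id, one-hot encode Sex, log1p-transform Calories."""
    out = df.drop(columns=["id"])
    # drop_first=True keeps a single indicator column for a two-level category
    out = pd.get_dummies(out, columns=["Sex"], drop_first=True, dtype=int)
    # The test set has no target, so only transform Calories when it is present
    if "Calories" in out.columns:
        out["Calories"] = np.log1p(out["Calories"])
    return out


# Toy rows standing in for the raw training file (values are made up)
raw = pd.DataFrame({
    "id": [0, 1, 2],
    "Sex": ["male", "female", "male"],
    "Age": [36, 64, 51],
    "Calories": [150.0, 34.0, 29.0],
})
processed = preprocess(raw)
print(processed)
```

Applying `np.expm1` to the transformed `Calories` column recovers the raw values, which is the property the prediction step later relies on.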

In [None]:
# Load preprocessed training and test datasets
train = pd.read_csv('datasets/train_preprocessed.csv')  # Load training data
test = pd.read_csv('datasets/test_preprocessed.csv')    # Load test data

# Separate features and target variable
y = train['Calories']  # Target variable (log1p-transformed calories)
X = train.drop(columns='Calories')  # Feature matrix (all columns except target)

In [None]:
# Define function to evaluate models using k-fold cross-validation
def rmsle_cv(model, X, y):
    """
    Cross-validated error score used to compare the baseline models.

    Parameters:
    - model: The machine learning model to evaluate
    - X: Feature matrix
    - y: Target values (already log1p-transformed)

    Returns:
    - Square root of the mean per-fold MSLE

    Note: y is already on the log1p scale, so this scoring takes a second
    log on top of it. The scores reported below are therefore not on the
    competition's RMSLE scale; plain RMSE on y
    (scoring='neg_root_mean_squared_error') would give RMSLE in original units.
    """
    # 5-fold cross-validation, shuffled with a fixed seed for reproducibility
    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    # cross_val_score returns negated errors (higher is better), so flip the sign
    msle_scores = -cross_val_score(
        model, X, y, scoring='neg_mean_squared_log_error', cv=kf
    )

    # Square root of the mean fold MSLE
    return np.sqrt(msle_scores.mean())
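
Because `y` is already log-transformed, plain RMSE between predictions and `y` is numerically the same as RMSLE on raw calories, whereas the `neg_mean_squared_log_error` scorer applies another log to values that are already logged. The small check below illustrates the difference; the arrays are synthetic values chosen for illustration, not rows from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, mean_squared_log_error

# Synthetic raw-scale targets and predictions (illustrative values only)
y_raw = np.array([150.0, 34.0, 29.0, 140.0, 146.0])
pred_raw = np.array([148.0, 36.0, 27.5, 143.0, 150.0])

# RMSLE computed directly on the raw scale
rmsle_raw = np.sqrt(mean_squared_log_error(y_raw, pred_raw))

# Plain RMSE after moving both arrays to the log1p scale
rmse_log = np.sqrt(mean_squared_error(np.log1p(y_raw), np.log1p(pred_raw)))

# MSLE applied to already-logged values: a second log is taken
double_log = np.sqrt(mean_squared_log_error(np.log1p(y_raw), np.log1p(pred_raw)))

print(f"RMSLE on raw values:        {rmsle_raw:.5f}")
print(f"RMSE on log1p values:       {rmse_log:.5f}")
print(f"MSLE-based on log1p values: {double_log:.5f}")
```

The first two numbers agree, while the third is smaller because the second log compresses the differences further.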

In [None]:
# Train and evaluate Linear Regression model
# Linear Regression is the simplest model and provides a good baseline
lr = LinearRegression()  # Initialize the model
lr_score = rmsle_cv(lr, X, y)  # Calculate cross-validated RMSLE score
print(f"Linear Regression RMSLE: {lr_score:.5f}")

Linear Regression RMSLE: 0.04522


In [None]:
# Train and evaluate Random Forest model
# Random Forest is an ensemble of decision trees that can capture non-linear relationships
rf = RandomForestRegressor(
    n_estimators=100,  # Number of trees in the forest
    random_state=42,   # For reproducibility
    n_jobs=-1          # Use all available CPU cores
)
rf_score = rmsle_cv(rf, X, y)  # Calculate cross-validated RMSLE score
print(f"Random Forest RMSLE: {rf_score:.5f}")

Random Forest RMSLE: 0.01813


In [6]:
print(f"Linear Regression RMSLE: {lr_score:.5f}")
print(f"Random Forest RMSLE:   {rf_score:.5f}")

Linear Regression RMSLE: 0.04522
Random Forest RMSLE:   0.01813


In [None]:
# Import XGBoost library for gradient boosting
from xgboost import XGBRegressor

# Train and evaluate XGBoost model
# XGBoost is a powerful gradient boosting library that often performs well on tabular data
xgb = XGBRegressor(
    n_estimators=100,    # Number of boosting rounds
    learning_rate=0.1,   # Step size shrinkage to prevent overfitting
    max_depth=6,         # Maximum depth of trees
    random_state=42,     # For reproducibility
    n_jobs=-1            # Use all available CPU cores
)
xgb_score = rmsle_cv(xgb, X, y)  # Calculate cross-validated RMSLE score
print(f"XGBoost RMSLE: {xgb_score:.5f}")

XGBoost RMSLE: 0.01741


In [None]:
# Import LightGBM library for another gradient boosting implementation
from lightgbm import LGBMRegressor

# Train and evaluate LightGBM model
# LightGBM is a gradient boosting framework built on decision trees
# It uses histogram-based splits and leaf-wise tree growth, which usually makes training fast on large tabular data
lgb = LGBMRegressor(
    n_estimators=100,    # Number of boosting rounds
    learning_rate=0.1,   # Step size shrinkage to prevent overfitting
    max_depth=6,         # Maximum depth of trees
    random_state=42,     # For reproducibility
    n_jobs=-1            # Use all available CPU cores
)
lgb_score = rmsle_cv(lgb, X, y)  # Calculate cross-validated RMSLE score
print(f"LightGBM RMSLE: {lgb_score:.5f}")

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002456 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 358
[LightGBM] [Info] Number of data points in the train set: 600000, number of used features: 7
[LightGBM] [Info] Start training from score 4.141163

In [None]:
# Print a comparison of all model RMSLE scores to easily compare performance
# Lower RMSLE values indicate better model performance
print("Baseline Model Comparison:")
print(f"Linear Regression RMSLE: {lr_score:.5f}")
print(f"Random Forest RMSLE:     {rf_score:.5f}")
print(f"XGBoost RMSLE:           {xgb_score:.5f}")
print(f"LightGBM RMSLE:          {lgb_score:.5f}")
# From these results, we can identify which model performs best on this dataset

Baseline Model Comparison:
Linear Regression RMSLE: 0.04522
Random Forest RMSLE:     0.01813
XGBoost RMSLE:           0.01741
LightGBM RMSLE:          0.01764
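
The comparison can also be kept in a sorted pandas Series, which makes the ranking explicit. In this sketch the scores are copied from the output above so that it runs on its own; inside the notebook one would pass `lr_score`, `rf_score`, `xgb_score`, and `lgb_score` instead.

```python
import pandas as pd

# Scores copied from the printed comparison above
scores = pd.Series(
    {
        "Linear Regression": 0.04522,
        "Random Forest": 0.01813,
        "XGBoost": 0.01741,
        "LightGBM": 0.01764,
    },
    name="RMSLE",
).sort_values()  # ascending: lowest error first

best_model = scores.idxmin()
print(scores.to_string())
print(f"Best baseline: {best_model}")
```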


## 2. Train Best Model (XGBoost) on Full Data and Predict

Based on our cross-validation results, XGBoost has the lowest error among the baseline models (0.01741), narrowly ahead of LightGBM (0.01764). We'll now:

1. Train XGBoost on the entire training dataset
2. Generate predictions on the test dataset
3. Transform predictions back to the original scale (reversing the log transformation)
4. Create a submission file for Kaggle

In [None]:
# Fit best model (XGBoost) on full training data
# Using all available data rather than just a training split for final model
xgb.fit(X, y)

# Generate predictions on the test dataset
# Select columns in the training order so the features line up with X
# The predictions will be in log-transformed scale
test_preds_log = xgb.predict(test[X.columns])

# Reverse the log1p transform to get predictions in the original scale
# We use expm1() which is the inverse of log1p()
# This converts our predictions from log(calories+1) back to calories
test_preds = np.expm1(test_preds_log)
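
`np.expm1` undoes `np.log1p`, but a regressor can occasionally return a log-scale prediction slightly below zero, which maps to a negative calorie value. The sketch below shows the round trip and an optional clip at zero. The clip is an assumption added for illustration and is not part of the cell above; the arrays are made-up values.

```python
import numpy as np

# Round trip: log1p followed by expm1 recovers the original values
calories = np.array([1.0, 34.0, 150.0, 300.0])
log_scale = np.log1p(calories)
restored = np.expm1(log_scale)

# Hypothetical log-scale predictions, including one slightly negative value
preds_log = np.array([-0.02, 3.5, 5.0])

# Back-transform, then clip at zero so no prediction is negative
preds = np.clip(np.expm1(preds_log), 0, None)

print(restored)
print(preds)
```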

In [None]:
# Load test IDs - we need these to match predictions with the correct test instances
# We stored IDs separately when preprocessing since they weren't needed for modeling
test_ids = pd.read_csv('datasets/test_ids.csv')['id']

# Prepare submission dataframe with two columns:
# - id: The identifier for each test instance
# - Calories: The predicted calorie expenditure (in original scale)
submission = pd.DataFrame({
    'id': test_ids,
    'Calories': test_preds
})

# Save to CSV file in the format required by Kaggle
# The index=False parameter prevents pandas from adding an additional index column
submission.to_csv('submission_xgb.csv', index=False)
print("Submission file 'submission_xgb.csv' created.")
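
Before uploading, a few quick checks can catch common submission problems such as a wrong row count or missing predictions. This is a minimal sketch: `check_submission` is a hypothetical helper, the toy frame stands in for `submission`, and the expected column names follow the dataframe built above.

```python
import numpy as np
import pandas as pd


def check_submission(sub: pd.DataFrame, expected_rows: int) -> None:
    """Hypothetical helper: raise AssertionError if the submission looks malformed."""
    assert list(sub.columns) == ["id", "Calories"], "unexpected columns"
    assert len(sub) == expected_rows, "row count does not match the test set"
    assert sub["id"].is_unique, "duplicate ids"
    assert sub["Calories"].notna().all(), "missing predictions"
    assert np.isfinite(sub["Calories"]).all(), "non-finite predictions"
    assert (sub["Calories"] >= 0).all(), "negative predictions"


# Toy frame standing in for `submission` (values are made up)
toy = pd.DataFrame({"id": [10, 11], "Calories": [27.3, 108.9]})
check_submission(toy, expected_rows=2)
print("Submission checks passed.")
```

In the notebook, the call would be `check_submission(submission, expected_rows=len(test))`.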
