<a href="https://colab.research.google.com/github/subhashpolisetti/Decision-Tree-Ensemble-Algorithms/blob/main/GradientBoosting_Regression_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Comparison of Gradient Boosting Models: XGBoost, LightGBM, and CatBoost on California Housing Dataset

This notebook demonstrates the implementation and comparison of three popular gradient boosting algorithms—**XGBoost**, **LightGBM**, and **CatBoost**—on the **California Housing dataset**. The goal is to evaluate and compare the performance of these models for a regression task using Root Mean Squared Error (RMSE) as the evaluation metric.

## Overview

### 1. **Dataset**:
The **California Housing dataset** describes California districts (census block groups) using eight numeric features, including median income, median house age, average rooms, and population. The target variable is the **median house value** for each district, which scikit-learn's `fetch_california_housing` reports in units of $100,000.

### 2. **Gradient Boosting Models**:
We train and evaluate the following models:
   - **XGBoost**: A scalable gradient boosting library that adds regularization terms to the boosting objective and is widely used on tabular data.
   - **LightGBM**: A gradient boosting framework that buckets feature values into histograms to speed up split finding and reduce memory use.
   - **CatBoost**: A gradient boosting library developed by Yandex whose main distinguishing feature is native handling of categorical features. This dataset has only numeric features, so that advantage does not come into play here.
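
The histogram idea mentioned for LightGBM can be illustrated with a toy example. The sketch below is not LightGBM's implementation; it only shows the general technique under simplifying assumptions (one feature, equal-width bins, squared-error gain). The function names `build_histogram` and `best_split_from_histogram` are hypothetical.

```python
def build_histogram(values, targets, n_bins):
    """Bucket one feature into equal-width bins; accumulate target sums and counts."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for v, t in zip(values, targets):
        b = min(int((v - lo) / width), n_bins - 1)  # clamp the maximum into the last bin
        sums[b] += t
        counts[b] += 1
    return sums, counts, lo, width


def best_split_from_histogram(sums, counts, lo, width):
    """Scan bin boundaries only and return the (threshold, gain) with the largest gain."""
    total_sum, total_n = sum(sums), sum(counts)
    best = (None, 0.0)
    left_sum, left_n = 0.0, 0
    for b in range(len(sums) - 1):
        left_sum += sums[b]
        left_n += counts[b]
        right_n = total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        right_sum = total_sum - left_sum
        # Squared-error gain: children's S^2 / n minus the parent's S^2 / n.
        gain = left_sum**2 / left_n + right_sum**2 / right_n - total_sum**2 / total_n
        if gain > best[1]:
            best = (lo + (b + 1) * width, gain)
    return best


# Toy data: six points, two clearly separated groups.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
targets = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
sums, counts, lo, width = build_histogram(values, targets, n_bins=4)
threshold, gain = best_split_from_histogram(sums, counts, lo, width)
print(counts, threshold, gain)
```

With `n_bins` candidate boundaries instead of one per distinct value, the scan cost depends on the number of bins rather than the number of rows, which is where the speed and memory savings come from.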

### 3. **Training**:
Each model is trained using the same set of training data:
   - **Training Parameters**: Each model uses 100 estimators (trees), a learning rate of 0.1, a maximum tree depth of 3 to limit overfitting, and a random seed of 42. CatBoost calls the first and third of these `iterations` and `depth`.
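
To make the roles of the number of estimators and the learning rate concrete, here is a minimal sketch of gradient boosting for squared error on a single feature. It is a toy, not how XGBoost, LightGBM, or CatBoost are implemented: each round fits a depth-1 tree (a stump) to the current residuals and adds its output scaled by the learning rate. The names `fit_stump`, `boost`, and `predict` are hypothetical, and the sketch assumes at least two distinct feature values.

```python
def fit_stump(x, residuals):
    """Find the threshold on x that best splits residuals into two constant leaves."""
    best = None
    distinct = sorted(set(x))
    for lo, hi in zip(distinct, distinct[1:]):
        thr = (lo + hi) / 2  # candidate split halfway between neighbouring values
        left = [r for xi, r in zip(x, residuals) if xi <= thr]
        right = [r for xi, r in zip(x, residuals) if xi > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    _, thr, left_mean, right_mean = best
    return thr, left_mean, right_mean


def boost(x, y, n_estimators=100, learning_rate=0.1):
    """Return the initial prediction and a list of (threshold, left, right) stumps."""
    base = sum(y) / len(y)  # start from the mean target
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        # For squared error, the residuals are the negative gradient (up to a constant).
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        thr, left_mean, right_mean = fit_stump(x, residuals)
        # Store leaf values already shrunk by the learning rate.
        left, right = learning_rate * left_mean, learning_rate * right_mean
        stumps.append((thr, left, right))
        preds = [p + (left if xi <= thr else right) for xi, p in zip(x, preds)]
    return base, stumps


def predict(base, stumps, xi):
    """Sum the base value and every stump's contribution for one input."""
    return base + sum(left if xi <= thr else right for thr, left, right in stumps)


x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
base, stumps = boost(x, y, n_estimators=100, learning_rate=0.1)
print(predict(base, stumps, 2), predict(base, stumps, 5))
```

On this toy data each round removes a tenth of the remaining residual, so the fit after 10 rounds is still visibly off while the fit after 100 rounds is close to the targets. That interaction is why the number of trees and the learning rate are usually tuned together.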

### 4. **Evaluation**:
The models are evaluated using the **Root Mean Squared Error (RMSE)** metric on the held-out test set (20% of the data):
   - RMSE is the square root of the mean squared difference between predicted and actual values. It is expressed in the same units as the target, and lower values indicate better predictive accuracy.
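
As a quick illustration of the metric, the sketch below computes RMSE by hand with the standard library only. The helper name `rmse` is hypothetical; the notebook itself relies on scikit-learn for this calculation.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared difference."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Errors are 1, 0 and -2, so the result is sqrt((1 + 0 + 4) / 3), about 1.291.
print(rmse([3.0, 5.0, 2.0], [2.0, 5.0, 4.0]))
```

Because the errors are squared before averaging, one large miss raises RMSE more than several small ones.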

## Results:
Test-set RMSE for each model, as printed by the code cells in this notebook:
   - **XGBoost RMSE**: 0.5433
   - **LightGBM RMSE**: 0.5380
   - **CatBoost RMSE**: 0.5704

## Key Libraries Used:
- `xgboost`: For training and evaluating the XGBoost model.
- `lightgbm`: For training and evaluating the LightGBM model.
- `catboost`: For training and evaluating the CatBoost model.
- `scikit-learn`: For loading the dataset, splitting it into training and test sets, and calculating the RMSE.
- `pandas`: Optional, for inspecting the data as a DataFrame; not imported in the cells below.
- `numpy`: Provides the arrays returned by scikit-learn; not imported directly.

### Final Thoughts:
This notebook compares three gradient boosting libraries on a real-world regression task under identical, untuned settings. LightGBM (RMSE 0.5380) and XGBoost (0.5433) are nearly tied, while CatBoost (0.5704) trails by roughly 6% on this dataset. These numbers come from a single train/test split with shared hyperparameters, so they say little about how the libraries would rank after tuning.

Feel free to experiment with hyperparameter tuning or different datasets to see how these models perform under different conditions!
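
One simple way to start tuning is an exhaustive grid search. The sketch below is a generic, library-free helper, not part of this notebook: `grid_search` and `score_fn` are hypothetical names, and `score_fn` is assumed to train a model with the given parameters and return a validation RMSE (lower is better).

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Try every parameter combination and return (best_params, best_score).

    Lower scores are treated as better, which matches RMSE.
    """
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical scorer for this notebook. X_tr, y_tr, X_val and y_val are assumed
# to come from a further split of the training data, so the test set stays unseen:
#
# def score_fn(params):
#     model = XGBRegressor(random_state=42, **params).fit(X_tr, y_tr)
#     return mean_squared_error(y_val, model.predict(X_val)) ** 0.5
#
# best_params, best_rmse = grid_search(
#     {"learning_rate": [0.05, 0.1, 0.3], "max_depth": [3, 4, 5]}, score_fn
# )
```

Scoring on a validation split rather than the test set keeps the final test RMSE an honest estimate of performance on unseen data.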


In [8]:
from xgboost import XGBRegressor
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the California Housing dataset
data = fetch_california_housing()
X, y = data.data, data.target  # X: 8 numeric features; y: median house value in units of $100,000

# Split the data into training and testing sets (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the XGBoost model
xgb_model = XGBRegressor(
    n_estimators=100,        # Number of trees (boosting rounds)
    learning_rate=0.1,       # The step size for each boosting iteration
    max_depth=3,             # Maximum depth of each tree to prevent overfitting
    random_state=42          # Ensures reproducibility by controlling randomness
)

# Fit the model on the training data
xgb_model.fit(X_train, y_train)

# Predict the target values (housing prices) on the test set
y_pred = xgb_model.predict(X_test)

# Calculate and print the Root Mean Squared Error (RMSE) on the test set.
# `squared=False` is deprecated since scikit-learn 1.4 and removed in later releases,
# so take the square root of the MSE instead.
print("XGBoost RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)


XGBoost RMSE: 0.543347735766591




In [2]:
import lightgbm as lgb

# Train LightGBM
lgb_model = lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
lgb_model.fit(X_train, y_train)

# Predict and evaluate (square root of the MSE; `squared=False` is deprecated in scikit-learn)
y_pred = lgb_model.predict(X_test)
print("LightGBM RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001818 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 16512, number of used features: 8
[LightGBM] [Info] Start training from score 2.071947
LightGBM RMSE: 0.5379575878564891




In [3]:

%pip install catboost

Collecting catboost
  Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl.metadata (1.2 kB)
Downloading catboost-1.2.7-cp310-cp310-manylinux2014_x86_64.whl (98.7 MB)
Installing collected packages: catboost
Successfully installed catboost-1.2.7


In [6]:
from catboost import CatBoostRegressor

# Train CatBoost
catboost_model = CatBoostRegressor(
    iterations=100,        # The number of boosting iterations (trees)
    learning_rate=0.1,     # The step size to update the model in each iteration
    depth=3,               # The maximum depth of the trees
    random_seed=42,        # Seed for random number generation, ensuring reproducibility
    verbose=0              # Suppresses the output during training
)

# Fit the CatBoost model on the training data (X_train and y_train)
catboost_model.fit(X_train, y_train)

# Predict the target variable using the trained model on the test data (X_test)
y_pred = catboost_model.predict(X_test)

# Evaluate the model by calculating the Root Mean Squared Error (RMSE) on the test set.
# Take the square root of the MSE; `squared=False` is deprecated in scikit-learn.
print("CatBoost RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)


CatBoost RMSE: 0.5704136974196125




In [7]:
# Define a dictionary with model names as keys and model instances as values
models = {
    "XGBoost": XGBRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
    "LightGBM": lgb.LGBMRegressor(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42),
    "CatBoost": CatBoostRegressor(iterations=100, learning_rate=0.1, depth=3, random_seed=42, verbose=0)
}

# Iterate through each model in the models dictionary
for name, model in models.items():
    # Train the model on the training data (X_train, y_train)
    model.fit(X_train, y_train)

    # Use the trained model to predict the target variable on the test data (X_test)
    y_pred = model.predict(X_test)

    # Calculate the Root Mean Squared Error (RMSE) on the test set.
    # Take the square root of the MSE; `squared=False` is deprecated in scikit-learn.
    rmse = mean_squared_error(y_test, y_pred) ** 0.5

    # Print the RMSE for each model, formatted to 4 decimal places
    print(f"{name} RMSE: {rmse:.4f}")




XGBoost RMSE: 0.5433
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.001333 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1838
[LightGBM] [Info] Number of data points in the train set: 16512, number of used features: 8
[LightGBM] [Info] Start training from score 2.071947
LightGBM RMSE: 0.5380
CatBoost RMSE: 0.5704
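
To put the three numbers side by side, the short sketch below ranks the models and reports each one's gap to the best score. The RMSE values are copied from the output above; the variable names are illustrative.

```python
# RMSE values copied from the printed output of the comparison loop.
results = {"XGBoost": 0.5433, "LightGBM": 0.5380, "CatBoost": 0.5704}

ranked = sorted(results.items(), key=lambda item: item[1])  # lowest RMSE first
best_name, best_rmse = ranked[0]

for name, rmse in ranked:
    gap_pct = 100 * (rmse - best_rmse) / best_rmse
    print(f"{name:<9} RMSE={rmse:.4f}  (+{gap_pct:.1f}% vs. {best_name})")
```

Since these scores come from a single train/test split, small gaps such as the one between LightGBM and XGBoost may not hold up under a different split or seed.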


