# Gradient Boosting Models Exercise: Advanced Ensemble Methods

**ML2 Course - Extra Points Assignment (5 points)**

**Objective:**
The goal of this exercise is to explore gradient boosting algorithms for panel data modeling. You will implement and compare seven boosting regressors, ranging from the classic AdaBoost to the distributional XGBoostLSS.

**Models to Implement:**

1. **AdaBoost** ([AdaBoostRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostRegressor.html)) - Adaptive Boosting, the pioneering boosting algorithm
2. **GBM** ([GradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingRegressor.html)) - Classic Gradient Boosting Machine from scikit-learn
3. **GBM Histogram** ([HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html)) - Histogram-based Gradient Boosting (faster, inspired by LightGBM)
4. **XGBoost** ([XGBRegressor](https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.XGBRegressor)) - Extreme Gradient Boosting, industry standard
5. **LightGBM** ([LGBMRegressor](https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMRegressor.html)) - Light Gradient Boosting Machine, optimized for speed and memory
6. **CatBoost** ([CatBoostRegressor](https://catboost.ai/en/docs/concepts/python-reference_catboostregressor)) - Categorical Boosting, handles categorical features natively
7. **XGBoostLSS** ([XGBoostLSS](https://github.com/StatMixedML/XGBoostLSS)) - XGBoost for Location, Scale and Shape, probabilistic predictions

**Tasks Workflow:**

Follow a process similar to the SVM / KNN notebook (`notebooks/07.knn-model.ipynb`):

1. **Load the prepared training data** from the preprocessing step
2. **Feature Engineering** (if necessary):
   - Note: Tree-based models do NOT require standardization/normalization
   - Their splits depend only on the ordering of feature values, so they are invariant to strictly monotonic transformations of features
3. **Feature Selection**:
   - Use existing feature rankings from `feature_ranking.xlsx` for initial feature selection
   - Consider feature importance from tree-based models
   - Test multiple feature sets (top 20, 30, 50 features, etc.), using feature importances taken directly from the fitted models
4. **Hyperparameter Tuning**: (2 points)
   - Use GridSearchCV, RandomizedSearchCV, or Optuna
   - Focus on key parameters: learning rate, number of boosting iterations, maximum tree depth, and regularization (if applicable)
   - Use rolling window cross-validation to avoid data leakage
5. **Identify Local Champions**: (1 point)
   - Select the best model for each algorithm class
   - Compare based on RMSE on validation sets
6. **Save Models**:
   - Pickle the best models for each algorithm
   - Save to `../models/` directory
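
The invariance claim in step 2 can be checked empirically. The following is a minimal sketch on synthetic data (not the course data): it fits the same decision tree on raw features and on their logarithms, a strictly increasing transform, and compares the predictions on the training rows. All variable names are illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 10.0, size=(200, 3))  # synthetic, strictly positive features
y = 2.0 * X[:, 0] + np.sin(X[:, 1]) + rng.normal(0, 0.1, size=200)

# Strictly increasing transform: the ordering within every feature is preserved
X_log = np.log(X)

tree_raw = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X, y)
tree_log = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_log, y)

# Compare predictions on the rows the trees were trained on
max_diff = float(np.max(np.abs(tree_raw.predict(X) - tree_log.predict(X_log))))
print("max abs prediction difference:", max_diff)
```

Agreement is only expected on the training rows. Split thresholds sit at midpoints between observed values, and the midpoint in log space is not the log of the raw midpoint, so an unseen point could occasionally land on a different side of a split.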

**Important Notes:**

- Gradient boosting models are powerful but prone to overfitting - pay attention to regularization
- Learning rate and number of estimators trade off against each other: a lower learning rate generally needs more boosting iterations to reach the same fit
- Early stopping can be used to prevent overfitting
- XGBoostLSS provides distributional forecasts (not just point estimates)
- Use time-series aware cross-validation (rolling window) for final model selection
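
The interplay between learning rate, number of estimators, and early stopping can be illustrated with scikit-learn's `GradientBoostingRegressor`. This is a sketch on synthetic data, and the parameter values are arbitrary choices for illustration. Note that `validation_fraction` holds out a random subset, which is not time-aware, so for the actual assignment the stopping set would need to respect the time ordering.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

fitted_trees = {}
for lr in (0.3, 0.03):
    model = GradientBoostingRegressor(
        learning_rate=lr,
        n_estimators=2000,       # upper bound; early stopping decides the actual count
        n_iter_no_change=10,     # stop after 10 iterations without validation improvement
        validation_fraction=0.2,
        max_depth=2,
        random_state=42,
    )
    model.fit(X, y)
    fitted_trees[lr] = model.n_estimators_  # number of trees actually fitted
    print(f"learning_rate={lr}: stopped after {model.n_estimators_} trees")
```

The smaller learning rate typically stops later, which is why the two parameters are usually tuned together rather than independently.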

**Model Evaluation:** (2 points)

After completing this notebook:
- Load your models in `notebooks/09.final-comparison-and-summary.ipynb`
- Compare them against existing models (OLS, ARMA, ARDL, KNN, SVR)
- Check if any gradient boosting model becomes the new champion!

---

## Submission Requirements

- Complete this notebook with code and outputs
- Save best model(s) as pickle files in `models/` directory
- Commit and push to your GitHub repository
- Send repository link to: **mj.wozniak9@uw.edu.pl**

**Deadline:** [To be announced by instructor]
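
Saving and reloading a model with `pickle` might look like the sketch below. A temporary directory stands in for the project's `models/` folder, and the file name `gbm_champion.pkl` is hypothetical. The model is fitted on synthetic data.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=100)
model = GradientBoostingRegressor(n_estimators=20, random_state=0).fit(X, y)

# Temporary directory used here instead of the real models/ folder
models_dir = Path(tempfile.mkdtemp()) / "models"
models_dir.mkdir(parents=True, exist_ok=True)
model_path = models_dir / "gbm_champion.pkl"  # hypothetical file name

with open(model_path, "wb") as f:
    pickle.dump(model, f)

with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions
same_predictions = bool(np.allclose(model.predict(X), restored.predict(X)))
print("round-trip predictions match:", same_predictions)
```

Pickled estimators are tied to the library versions used to create them, so the comparison notebook should run in the same environment.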

In [1]:
# Basic imports
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error

# Scikit-learn boosting models
from sklearn.ensemble import (
    AdaBoostRegressor,
    GradientBoostingRegressor,
    HistGradientBoostingRegressor
)

# External gradient boosting models
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# XGBoostLSS
try:
    from xgboostlss import XGBoostLSS
except ImportError:
    # XGBoostLSS is optional; continue without it if the package is not installed
    XGBoostLSS = None


### 1. Data Load


In [2]:
import os
os.getcwd()

'/Users/sevintan/ML-in-Finance-I-case-study-forecasting-tax-avoidance-rates/notebooks'

In [3]:
data = pd.read_csv("../data/output/train_fe.csv")

print("Shape:", data.shape)
data.head()



Shape: (3993, 116)


| Unnamed: 0.1 | Unnamed: 0 | Ticker | Nazwa2 | rok | ta | txt | pi | str | xrd | ni | ... | intan_ma | ppe_ma | sale_ma | cash_holdings_ma | roa_past | lev_past | intan_past | ppe_past | sale_past | cash_holdings_past |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 11B PW Equity | 11 bit studios SA | 2005 | 21.127613 | 1.24185 | 6.329725 | 0.19 | 0.0 | 5.0879 | ... | 0.198598 | 0.013076 | 0.445954 | 0.574744 | 0.240818 | 0.0 | 0.198598 | 0.013076 | 0.445954 | 0.574744 |
| 1 | 1 | 11B PW Equity | 11 bit studios SA | 2006 | 21.127613 | 1.24185 | 6.329725 | 0.19 | 0.0 | 5.0879 | ... | 0.198598 | 0.013076 | 0.445954 | 0.574744 | 0.240818 | 0.0 | 0.198598 | 0.013076 | 0.445954 | 0.574744 |
| 2 | 2 | 11B PW Equity | 11 bit studios SA | 2007 | 21.127613 | 1.24185 | 6.329725 | 0.19 | 0.0 | 5.0879 | ... | 0.198598 | 0.013076 | 0.445954 | 0.574744 | 0.240818 | 0.0 | 0.198598 | 0.013076 | 0.445954 | 0.574744 |
| 3 | 3 | 11B PW Equity | 11 bit studios SA | 2008 | 21.127613 | 1.24185 | 6.329725 | 0.19 | 0.0 | 5.0879 | ... | 0.198598 | 0.013076 | 0.445954 | 0.574744 | 0.240818 | 0.0 | 0.198598 | 0.013076 | 0.445954 | 0.574744 |
| 4 | 4 | 11B PW Equity | 11 bit studios SA | 2009 | 21.127613 | 1.24185 | 6.329725 | 0.19 | 0.0 | 5.0879 | ... | 0.198598 | 0.013076 | 0.445954 | 0.574744 | 0.240818 | 0.0 | 0.198598 | 0.013076 | 0.445954 | 0.574744 |


In [4]:

# Define target
target = "etr"

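# Drop non-feature columns
# NOTE: only the identifier columns are dropped here, so X still contains the
# target `etr` and the leftover index columns (`Unnamed: 0`, `Unnamed: 0.1`).
# Drop those as well before modeling to avoid target leakage.
X = data.drop(columns=["Ticker", "Nazwa2"])
y = data[target]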


print(X.shape, y.shape)


(3993, 114) (3993,)
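
Since `X` above is built by dropping only `Ticker` and `Nazwa2`, the target `etr` and the exported index columns stay among the features. A sketch of a stricter selection on a toy frame follows. The column names mirror the notebook, the values are made up, and `X_clean` / `y_clean` are illustrative names.

```python
import pandas as pd

# Toy frame with the same kinds of columns as train_fe.csv (values are invented)
toy = pd.DataFrame({
    "Unnamed: 0.1": [0, 1, 2],
    "Unnamed: 0": [0, 1, 2],
    "Ticker": ["AAA", "AAA", "BBB"],
    "Nazwa2": ["A SA", "A SA", "B SA"],
    "rok": [2005, 2006, 2005],
    "ta": [21.1, 21.3, 19.8],
    "etr": [0.19, 0.21, 0.17],
})

target = "etr"
id_cols = ["Ticker", "Nazwa2"]
# Index columns left over from earlier to_csv calls
index_cols = [c for c in toy.columns if c.startswith("Unnamed")]

# Remove identifiers, leftover indices, and the target itself from the features
X_clean = toy.drop(columns=id_cols + index_cols + [target])
y_clean = toy[target]

print(list(X_clean.columns))
```

Validation errors that look implausibly small are a common symptom of the target remaining among the features, so a check like `target not in X.columns` is cheap insurance.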


In [5]:
# Fill missing values with 0. X was created before this step, so it is filled
# separately; otherwise the NaNs would remain in the model inputs.
data = data.fillna(0)
X = X.fillna(0)

print("Remaining NAs:", data.isna().sum().sum())


Remaining NAs: 0


In [9]:
%pip install openpyxl

Collecting openpyxl
  Downloading openpyxl-3.1.5-py2.py3-none-any.whl.metadata (2.5 kB)
Collecting et-xmlfile (from openpyxl)
  Downloading et_xmlfile-2.0.0-py3-none-any.whl.metadata (2.7 kB)
Downloading openpyxl-3.1.5-py2.py3-none-any.whl (250 kB)
Downloading et_xmlfile-2.0.0-py3-none-any.whl (18 kB)
Installing collected packages: et-xmlfile, openpyxl
Successfully installed et-xmlfile-2.0.0 openpyxl-3.1.5
Note: you may need to restart the kernel to use updated packages.


In [10]:
ranking = pd.read_excel("../data/output/feature_ranking.xlsx")

ranking.head()



| Unnamed: 0.1 | Unnamed: 0 | mi_score | sign_fscore | sign_fscore_0_1 | corr | EN_coef | boruta_rank |
|---|---|---|---|---|---|---|---|
| 0 | rok | 0.032073 | 0.1179353 | 0 | -0.032669 | 0.0 | 19 |
| 1 | ta | 0.582922 | 0.001464884 | 1 | 0.26734 | -1.404307e-07 | 49 |
| 2 | txt | 0.633067 | 5.246456e-13 | 1 | 0.368732 | 1.466269e-05 | 1 |
| 3 | pi | 0.608157 | 8.614688e-12 | 1 | 0.299593 | 8.453656e-06 | 3 |
| 4 | str | 0.293955 | 1.578384e-46 | 1 | 0.37287 | NaN | 9 |


In [11]:
# Select feature names sorted by significance (MI score)
ranking_sorted = ranking.sort_values(by="mi_score", ascending=False)

# Top feature lists
top20 = ranking_sorted["Unnamed: 0"].values[:20]
top30 = ranking_sorted["Unnamed: 0"].values[:30]
top50 = ranking_sorted["Unnamed: 0"].values[:50]

top20, top30[:5], top50[:5]



(array(['etr_y_past', 'etr_y_ma', 'txt', 'diff', 'ni', 'pi', 'intant',
        'intant_sqrt', 'ta', 'revenue', 'roa', 'roa_clip', 'diff_ma',
        'capex', 'dlc', 'ta_log', 'cce', 'intan_past', 'dltt', 'sale'],
       dtype=object),
 array(['etr_y_past', 'etr_y_ma', 'txt', 'diff', 'ni'], dtype=object),
 array(['etr_y_past', 'etr_y_ma', 'txt', 'diff', 'ni'], dtype=object))
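
The task description also asks for feature importance taken directly from the models. One possible approach, sketched on synthetic data with made-up feature names (`f0` to `f5`), is to fit a boosting model once and rank features by its impurity-based `feature_importances_`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"f{i}" for i in range(6)])
# Only f0 and f3 carry signal in this toy target
y_demo = 3.0 * X_demo["f0"] + 1.5 * X_demo["f3"] + rng.normal(scale=0.1, size=300)

gbm = GradientBoostingRegressor(n_estimators=100, random_state=0).fit(X_demo, y_demo)

# Importances are normalized to sum to 1; sort them to get a ranking
importance = pd.Series(gbm.feature_importances_, index=X_demo.columns)
importance = importance.sort_values(ascending=False)

top2 = importance.index[:2].tolist()
print(importance.round(3))
print("top 2 features:", top2)
```

Slicing `importance.index[:k]` for k = 20, 30, 50 would give model-based counterparts to the `top20`, `top30`, and `top50` lists above. Impurity-based importances can favor high-cardinality features, so permutation importance is a reasonable cross-check.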

### 4. Hyperparameter Tuning 


In [9]:
from sklearn.model_selection import train_test_split

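# shuffle=False keeps the row order. The preview above suggests rows are ordered
# by ticker and then year, so this holds out the last rows, not the latest years.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=False
)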

print(X_train.shape, X_test.shape)

(3194, 114) (799, 114)


In [13]:
# Fix feature names for XGBoost compatibility: map ( ) [ ] , to "_" and drop spaces
_name_fix = str.maketrans({"(": "_", ")": "_", "[": "_", "]": "_", ",": "_", " ": ""})
X_train.columns = [str(c).translate(_name_fix) for c in X_train.columns]
X_test.columns = [str(c).translate(_name_fix) for c in X_test.columns]

In [18]:
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import mean_squared_error
import numpy as np

def tune_model(model, param_grid, X_train, y_train, cv_splits=5, randomized=True):
    """
    Tune a model with TimeSeriesSplit CV.
    Uses numpy arrays to avoid issues with special characters in feature names
    (XGBoost / LightGBM JSON / feature_names errors).
    """
    tscv = TimeSeriesSplit(n_splits=cv_splits)

    # Convert to numpy arrays so models don't see pandas column names
    X_array = np.asarray(X_train)
    y_array = np.asarray(y_train)

    if randomized:
        search = RandomizedSearchCV(
            estimator=model,
            param_distributions=param_grid,
            n_iter=20,
            scoring="neg_mean_squared_error",
            cv=tscv,
            n_jobs=-1,
            verbose=1,
            random_state=42
        )
    else:
        search = GridSearchCV(
            estimator=model,
            param_grid=param_grid,
            scoring="neg_mean_squared_error",
            cv=tscv,
            n_jobs=-1,
            verbose=1
        )

    search.fit(X_array, y_array)
    return search
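
`TimeSeriesSplit` forms folds by row position. For panel data whose rows are ordered by company rather than by date, position-based folds do not correspond to time. One alternative is to build folds from the year column and pass them as the `cv` argument, which accepts an iterable of `(train_idx, test_idx)` pairs. The helper below is a hypothetical sketch and uses an expanding window rather than a fixed-width rolling one; it is not part of scikit-learn.

```python
import numpy as np

def expanding_year_splits(years, n_splits=3):
    """Yield (train_idx, test_idx); each test fold is one year, trained on all earlier years."""
    years = np.asarray(years)
    unique_years = np.sort(np.unique(years))
    if n_splits >= len(unique_years):
        raise ValueError("n_splits must be smaller than the number of distinct years")
    for test_year in unique_years[-n_splits:]:
        train_idx = np.flatnonzero(years < test_year)
        test_idx = np.flatnonzero(years == test_year)
        yield train_idx, test_idx

# Toy panel: 3 companies observed 2005-2009, rows sorted by company
years = np.tile(np.arange(2005, 2010), 3)
splits = list(expanding_year_splits(years, n_splits=3))

for train_idx, test_idx in splits:
    print(np.unique(years[train_idx]).tolist(), "->", np.unique(years[test_idx]).tolist())
```

With the notebook's data this could be used as `cv=list(expanding_year_splits(X_train["rok"]))`, assuming `rok` holds the observation year. The indices are positional, so they remain valid after the conversion to NumPy arrays inside `tune_model`.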

In [20]:
# Hyperparameter grids for each model

param_grids = {
    "AdaBoost": {
        "n_estimators": [50, 100, 200],
        "learning_rate": [0.01, 0.05, 0.1]
    },

    "GBM": {
        "n_estimators": [100, 200],
        "learning_rate": [0.01, 0.05],
        "max_depth": [2, 3, 4],
        "subsample": [0.8, 1.0]
    },

    "HGB": {
        "learning_rate": [0.01, 0.05],
        "max_depth": [None, 3, 5],
        "max_bins": [128, 255]
    },

    "XGB": {
        "n_estimators": [200, 400],
        "learning_rate": [0.01, 0.05],
        "max_depth": [3, 5],
        "subsample": [0.8, 1.0],
        "colsample_bytree": [0.8, 1.0]
    },

    "LGBM": {
        "n_estimators": [200, 400],
        "learning_rate": [0.01, 0.05],
        "num_leaves": [31, 63],
        "subsample": [0.8, 1.0]
    },

    "CatBoost": {
        "iterations": [300, 500],
        "learning_rate": [0.01, 0.05],
        "depth": [4, 6, 8]
    }
}
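
When every entry in `param_distributions` is a list, `RandomizedSearchCV` samples without replacement, so it cannot evaluate more candidates than the grid contains even if `n_iter` is larger. A quick way to see how many candidates each grid yields, shown here for the AdaBoost grid with `n_iter=20` as used in `tune_model`:

```python
from sklearn.model_selection import ParameterGrid

adaboost_grid = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.01, 0.05, 0.1],
}

n_iter = 20
n_candidates = len(ParameterGrid(adaboost_grid))  # 3 * 3 combinations
n_evaluated = min(n_candidates, n_iter)           # the search is capped by the grid size

print(n_candidates, n_evaluated)
```

For grids smaller than `n_iter`, the randomized search is therefore equivalent to an exhaustive grid search.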

### 5. Identify Local Champions


In [21]:
from sklearn.ensemble import AdaBoostRegressor, GradientBoostingRegressor, HistGradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor

# Model dictionary
models = {
    "AdaBoost": AdaBoostRegressor(random_state=42),
    "GBM": GradientBoostingRegressor(random_state=42),
    "HGB": HistGradientBoostingRegressor(random_state=42),
    "XGB": XGBRegressor(
        objective="reg:squarederror",
        eval_metric="rmse",
        random_state=42,
        verbosity=0
    ),
    "LGBM": LGBMRegressor(random_state=42),
    "CatBoost": CatBoostRegressor(
        verbose=0,
        random_state=42
    )
}

# Dictionary to store best results
results = {}

for name, model in models.items():
    print(f"🔍 Tuning {name}...")

    # get model's parameter grid
    param_grid = param_grids[name]

    # tune model
    search = tune_model(model, param_grid, X_train, y_train, cv_splits=5, randomized=True)

    # store best model + best rmse
    best_model = search.best_estimator_
    best_rmse = np.sqrt(-search.best_score_)  # because scoring="neg_mean_squared_error"

    results[name] = {
        "best_model": best_model,
        "best_rmse": best_rmse,
        "best_params": search.best_params_
    }

    print(f"✔ {name} done | Best RMSE: {best_rmse:.4f}")
    print("-" * 50)


🔍 Tuning AdaBoost...
Fitting 5 folds for each of 9 candidates, totalling 45 fits
✔ AdaBoost done | Best RMSE: 0.0133
--------------------------------------------------
🔍 Tuning GBM...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
✔ GBM done | Best RMSE: 0.0025
--------------------------------------------------
🔍 Tuning HGB...
Fitting 5 folds for each of 12 candidates, totalling 60 fits
✔ HGB done | Best RMSE: 0.0282
--------------------------------------------------
🔍 Tuning XGB...
Fitting 5 folds for each of 20 candidates, totalling 100 fits
✔ XGB done | Best RMSE: 0.0075
--------------------------------------------------
🔍 Tuning LGBM...
Fitting 5 folds for each of 16 candidates, totalling 80 fits

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002304 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8453
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003550 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12426
[LightGBM] [Info] Number of data points in the train set: 534, number of used features: 99
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.004102 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 12774
[LightGBM] [Info] Number of data points in the train set: 1066, number of used features: 106
[LightGBM] [Info] Start training from score 0.214568
[LightGBM] [Info] Start training from score 0.217071
[LightGBM] [Info] Number of data points in the train set



[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003933 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 13075
[LightGBM] [Info] Number of data points in the train set: 2662, number of used features: 110
[LightGBM] [Info] Start training from score 0.214169




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.007722 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8453
[LightGBM] [Info] Number of data points in the train set: 534, number of used features: 99
[LightGBM] [Info] Start training from score 0.214568
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002581 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 12426
[LightGBM] [Info] Number of data points in the train set: 1066, number of used features: 106
[LightGBM] [Info] Start training from score 0.217071
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002005 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bin







[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003146 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 12982
[LightGBM] [Info] Number of data points in the train set: 2130, number of used features: 109
[LightGBM] [Info] Start training from score 0.215806
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002027 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 13075
[LightGBM] [Info] Number of data points in the train set: 2662, number of used features: 110
[LightGBM] [Info] Start training from score 0.214169




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003803 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8453
[LightGBM] [Info] Number of data points in the train set: 534, number of used features: 99
[LightGBM] [Info] Start training from score 0.214568
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002573 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12426
[LightGBM] [Info] Number of data points in the train set: 1066, number of used features: 106
[LightGBM] [Info] Start training from score 0.217071









[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.010019 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 12774
[LightGBM] [Info] Number of data points in the train set: 1598, number of used features: 107
[LightGBM] [Info] Start training from score 0.221296




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.148677 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 12982
[LightGBM] [Info] Number of data points in the train set: 2130, number of used features: 109
[LightGBM] [Info] Start training from score 0.215806




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.005107 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 13075
[LightGBM] [Info] Number of data points in the train set: 2662, number of used features: 110
[LightGBM] [Info] Start training from score 0.214169




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003860 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8453
[LightGBM] [Info] Number of data points in the train set: 534, number of used features: 99
[LightGBM] [Info] Start training from score 0.214568




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003837 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12426
[LightGBM] [Info] Number of data points in the train set: 1066, number of used features: 106
[LightGBM] [Info] Start training from score 0.217071




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.006153 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12774
[LightGBM] [Info] Number of data points in the train set: 1598, number of used features: 107
[LightGBM] [Info] Start training from score 0.221296

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004257 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12982
[LightGBM] [Info] Number of data points in the train set: 2130, number of used features: 109
[LightGBM] [Info] Start training from score 0.215806




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004799 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 13075
[LightGBM] [Info] Number of data points in the train set: 2662, number of used features: 110
[LightGBM] [Info] Start training from score 0.214169




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003805 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8453
[LightGBM] [Info] Number of data points in the train set: 534, number of used features: 99
[LightGBM] [Info] Start training from score 0.214568








[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002566 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12426

[LightGBM] [Info] Start training from score 0.217071




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002684 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 12774
[LightGBM] [Info] Number of data points in the train set: 1598, number of used features: 107
[LightGBM] [Info] Start training from score 0.221296




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.005274 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12982
[LightGBM] [Info] Number of data points in the train set: 2130, number of used features: 109
[LightGBM] [Info] Start training from score 0.215806




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.006607 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 13075
[LightGBM] [Info] Number of data points in the train set: 2662, number of used features: 110
[LightGBM] [Info] Start training from score 0.214169




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001368 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8453
[LightGBM] [Info] Number of data points in the train set: 534, number of used features: 99
[LightGBM] [Info] Start training from score 0.214568
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003361 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12426

[LightGBM] [Info] Start training from score 0.217071








[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002509 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 12774
[LightGBM] [Info] Number of data points in the train set: 1598, number of used features: 107
[LightGBM] [Info] Start training from score 0.221296




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.055405 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 12982
[LightGBM] [Info] Number of data points in the train set: 2130, number of used features: 109
[LightGBM] [Info] Start training from score 0.215806




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.006539 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 13075
[LightGBM] [Info] Number of data points in the train set: 2662, number of used features: 110
[LightGBM] [Info] Start training from score 0.214169




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002839 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8453
[LightGBM] [Info] Number of data points in the train set: 534, number of used features: 99
[LightGBM] [Info] Start training from score 0.214568
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004206 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12426
[LightGBM] [Info] Number of data points in the train set: 1066, number of used features: 106
[LightGBM] [Info] Start training from score 0.217071




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.008362 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 12774
[LightGBM] [Info] Number of data points in the train set: 1598, number of used features: 107
[LightGBM] [Info] Start training from score 0.221296








[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002858 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 12982
[LightGBM] [Info] Number of data points in the train set: 2130, number of used features: 109
[LightGBM] [Info] Start training from score 0.215806
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.009176 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 13075
[LightGBM] [Info] Number of data points in the train set: 2662, number of used features: 110
[LightGBM] [Info] Start training from score 0.214169




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003623 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8453
[LightGBM] [Info] Number of data points in the train set: 534, number of used features: 99
[LightGBM] [Info] Start training from score 0.214568








[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.012372 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12426
[LightGBM] [Info] Number of data points in the train set: 1066, number of used features: 106
[LightGBM] [Info] Start training from score 0.217071




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003334 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12774
[LightGBM] [Info] Number of data points in the train set: 1598, number of used features: 107
[LightGBM] [Info] Start training from score 0.221296
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004612 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12982




[LightGBM] [Info] Number of data points in the train set: 2130, number of used features: 109
[LightGBM] [Info] Start training from score 0.215806




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.008856 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 13075
[LightGBM] [Info] Number of data points in the train set: 2662, number of used features: 110
[LightGBM] [Info] Start training from score 0.214169

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003878 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 8453
[LightGBM] [Info] Number of data points in the train set: 534, number of used features: 99
[LightGBM] [Info] Start training from score 0.214568








[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.003760 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 12426
[LightGBM] [Info] Number of data points in the train set: 1066, number of used features: 106
[LightGBM] [Info] Start training from score 0.217071




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.005148 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12774
[LightGBM] [Info] Number of data points in the train set: 1598, number of used features: 107
[LightGBM] [Info] Start training from score 0.221296




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004037 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12982
[LightGBM] [Info] Number of data points in the train set: 2130, number of used features: 109
[LightGBM] [Info] Start training from score 0.215806




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.018808 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 13075
[LightGBM] [Info] Number of data points in the train set: 2662, number of used features: 110
[LightGBM] [Info] Start training from score 0.214169




[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.001305 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 8453
[LightGBM] [Info] Number of data points in the train set: 534, number of used features: 99
[LightGBM] [Info] Start training from score 0.214568
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.007333 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 12426
[LightGBM] [Info] Number of data points in the train set: 1066, number of used features: 106
[LightGBM] [Info] Start training from score 0.217071




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.002664 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12774
[LightGBM] [Info] Number of data points in the train set: 1598, number of used features: 107
[LightGBM] [Info] Start training from score 0.221296




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004637 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 12982
[LightGBM] [Info] Number of data points in the train set: 2130, number of used features: 109
[LightGBM] [Info] Start training from score 0.215806




[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003991 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 13167
[LightGBM] [Info] Number of data points in the train set: 3194, number of used features: 111
[LightGBM] [Info] Start training from score 0.214828
✔ LGBM done | Best RMSE: 0.0257
--------------------------------------------------
🔍 Tuning CatBoost...
Fitting 5 folds for each of 12 candidates, totalling 60 fits




✔ CatBoost done | Best RMSE: 0.0110
--------------------------------------------------


### 6. Results

In [22]:
results_df = pd.DataFrame(
    [
        [name, res["best_rmse"], res["best_params"]]
        for name, res in results.items()
    ],
    columns=["model", "best_rmse", "best_params"]
)

results_df.sort_values("best_rmse")


|   | model | best_rmse | best_params |
|---|-------|-----------|-------------|
| 1 | GBM | 0.002534 | `{'subsample': 1.0, 'n_estimators': 200, 'max_d...` |
| 3 | XGB | 0.00755 | `{'subsample': 1.0, 'n_estimators': 200, 'max_d...` |
| 5 | CatBoost | 0.010992 | `{'learning_rate': 0.05, 'iterations': 500, 'de...` |
| 0 | AdaBoost | 0.013323 | `{'n_estimators': 200, 'learning_rate': 0.1}` |
| 4 | LGBM | 0.025715 | `{'subsample': 0.8, 'num_leaves': 63, 'n_estima...` |
| 2 | HGB | 0.028216 | `{'max_depth': 5, 'max_bins': 255, 'learning_ra...` |


In [23]:
best_row = results_df.sort_values("best_rmse").iloc[0]
best_row


model                                                        GBM
best_rmse                                               0.002534
best_params    {'subsample': 1.0, 'n_estimators': 200, 'max_d...
Name: 1, dtype: object
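
The DataFrame display truncates `best_params`, so the winning configuration cannot be read in full from the output. A small sketch of picking the best entry straight from the `results` dict and printing its parameters untruncated. The dict here is a toy stand-in: the scores are copied from the table, while the parameter dicts for GBM and HGB are placeholders because the real ones are cut off in the output.

```python
from pprint import pprint

# Toy stand-in for the notebook's `results` dict (hypothetical contents).
# Scores come from the results table; truncated parameter dicts are
# replaced by placeholders rather than guessed.
results = {
    "AdaBoost": {
        "best_rmse": 0.013323,
        "best_params": {"n_estimators": 200, "learning_rate": 0.1},
    },
    "GBM": {"best_rmse": 0.002534, "best_params": {"placeholder": True}},
    "HGB": {"best_rmse": 0.028216, "best_params": {"placeholder": True}},
}

# Lowest cross-validated RMSE wins.
best_name, best_res = min(results.items(), key=lambda item: item[1]["best_rmse"])

print(best_name, best_res["best_rmse"])
pprint(best_res["best_params"])  # full dict, no "..." truncation
```

To keep the DataFrame view instead, `pd.set_option("display.max_colwidth", None)` should stop pandas from shortening the column.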

In [24]:
import joblib
import os

os.makedirs("../models", exist_ok=True)

for name, res in results.items():
    model_path = f"../models/{name}_best.pkl"
    joblib.dump(res["best_model"], model_path)
    print(f"Saved {name} to {model_path}")


Saved AdaBoost to ../models/AdaBoost_best.pkl
Saved GBM to ../models/GBM_best.pkl
Saved HGB to ../models/HGB_best.pkl
Saved XGB to ../models/XGB_best.pkl
Saved LGBM to ../models/LGBM_best.pkl
Saved CatBoost to ../models/CatBoost_best.pkl
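
A saved model can be restored later with `joblib.load`. The sketch below is a round-trip check under stated assumptions: it trains a small `GradientBoostingRegressor` on synthetic data and writes to a temporary directory rather than `../models`, so it does not touch the files saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic stand-in data; not the notebook's training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = GradientBoostingRegressor(n_estimators=20, random_state=0).fit(X, y)
preds_before = model.predict(X)

# Dump and reload inside a temporary directory that is cleaned up afterwards.
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "GBM_best.pkl")
    joblib.dump(model, model_path)
    reloaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions.
preds_after = reloaded.predict(X)
print(np.allclose(preds_before, preds_after))
```

Pickle-based files are best loaded with the same library versions that wrote them, and only from trusted sources, because unpickling can execute arbitrary code.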
