# Forecasting with XGBoost, LightGBM and other Gradient Boosting models


Gradient boosting models have gained popularity in the machine learning community due to their ability to achieve excellent results in a wide range of use cases, including both regression and classification. Although these models have traditionally been less common in forecasting, recent research has shown that they can be highly effective in this domain. Some of the key advantages of using gradient boosting models for forecasting include:

+ The ease with which exogenous variables, in addition to autoregressive variables, can be incorporated into the model.

+ The ability to capture non-linear relationships between variables.

+ High scalability, which enables the models to handle large volumes of data.
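
To make the first point concrete, here is a minimal sketch of how lagged values of the series and exogenous variables can sit side by side in one training matrix that any regressor can consume. The helper `make_training_matrix` and the toy data are hypothetical illustrations, not part of skforecast, and the library's own internals may differ.

```python
import pandas as pd


def make_training_matrix(y: pd.Series, exog: pd.DataFrame, n_lags: int):
    """Hypothetical helper: build (X, y) with lag_1..lag_n of y plus exog columns."""
    # lag_i holds the value of the series i steps before the row's timestamp
    lag_cols = {f"lag_{i}": y.shift(i) for i in range(1, n_lags + 1)}
    X = pd.concat([pd.DataFrame(lag_cols), exog], axis=1)
    # The first n_lags rows have incomplete lags, so they are dropped
    return X.iloc[n_lags:], y.iloc[n_lags:]


# Toy monthly data (made up for illustration)
index = pd.date_range("2020-01-01", periods=6, freq="MS")
y = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0, 15.0], index=index, name="y")
exog = pd.DataFrame({"exog_1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}, index=index)

X_train, y_train = make_training_matrix(y, exog, n_lags=2)
print(X_train)
```

Each row pairs the target at time t with its own recent past and the exogenous values known at time t, which is why adding a new exogenous variable only means adding a column.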

Four of the most widely used gradient boosting implementations in Python are [XGBoost](https://xgboost.readthedocs.io/en/stable/index.html), [LightGBM](https://lightgbm.readthedocs.io/en/latest/), [scikit-learn HistGradientBoostingRegressor](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html#sklearn.ensemble.HistGradientBoostingRegressor) and [CatBoost](https://catboost.ai/). All of them follow the scikit-learn API, which makes them compatible with skforecast.



<script src="https://kit.fontawesome.com/d20edc211b.js" crossorigin="anonymous"></script>

<div class="admonition note" name="html-admonition" style="background: rgba(0,184,212,.1); padding-top: 0px; padding-bottom: 6px; border-radius: 8px; border-left: 8px solid #00b8d4;">

<p class="title">
    <i class="fa-circle-exclamation fa" style="font-size: 18px; color:#00b8d4;"></i>
    <b> &nbsp; Note</b>
</p>

All of the gradient boosting libraries mentioned above (XGBoost, LightGBM, HistGradientBoostingRegressor, and CatBoost) can handle categorical features natively, but each requires its own encoding technique, which may not be entirely intuitive. Detailed information can be found in <a href="https://skforecast.org/latest/user_guides/categorical-features.html#native-implementation-for-categorical-features">categorical features</a> and in this <a href="https://www.cienciadedatos.net/documentos/py39-forecasting-time-series-with-skforecast-xgboost-lightgbm-catboost.html">example</a>.

</div>

## Libraries

In [3]:
# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg

## Data

In [4]:
# Download data
# ==============================================================================
url = (
    'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/'
    'data/h2o_exog.csv'
)
data = pd.read_csv(
        url, sep=',', header=0, names=['date', 'y', 'exog_1', 'exog_2']
       )

# Data preprocessing
# ==============================================================================
data['date'] = pd.to_datetime(data['date'], format='%Y/%m/%d')
data = data.set_index('date')
data = data.asfreq('MS')

steps = 36
data_train = data.iloc[:-steps, :]
data_test  = data.iloc[-steps:, :]
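
The preprocessing above relies on `asfreq('MS')` to give the index an explicit monthly frequency and on `iloc` to hold out the last observations. The sketch below, on made-up data, illustrates one thing worth checking: `asfreq` inserts a `NaN` row for any month missing from the original index. The variable names are illustrative only.

```python
import pandas as pd

# Toy series with March 2020 missing (synthetic data, for illustration only)
dates = pd.to_datetime(["2020-01-01", "2020-02-01", "2020-04-01"])
series = pd.Series([1.0, 2.0, 4.0], index=dates)

# asfreq sets a regular month-start frequency; absent months become NaN rows
regular = series.asfreq("MS")
n_missing = int(regular.isna().sum())
print(regular)
print(f"Rows introduced as NaN: {n_missing}")

# Chronological hold-out split: the last `steps` rows form the test set
steps = 1
train, test = regular.iloc[:-steps], regular.iloc[-steps:]
print(f"Train ends {train.index.max().date()}, test starts {test.index.min().date()}")
```

If the count of missing rows is not zero, the gaps would need to be imputed or otherwise handled before fitting a forecaster.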

## Forecaster LightGBM

In [5]:
# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(random_state = 123),
                 lags      = 8
             )

forecaster.fit(y=data_train['y'], exog=data_train[['exog_1', 'exog_2']])
forecaster

ForecasterAutoreg 
Regressor: LGBMRegressor(random_state=123) 
Lags: [1 2 3 4 5 6 7 8] 
Transformer for y: None 
Transformer for exog: None 
Window size: 8 
Weight function included: False 
Exogenous included: True 
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'> 
Exogenous variables names: ['exog_1', 'exog_2'] 
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2005-06-01 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: MS 
Regressor parameters: {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': 'warn', 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0} 
fit_kwargs: {} 
Creation date: 2023-05-16 08:48:18 
Last fit d

In [6]:
# Predict
# ==============================================================================
forecaster.predict(steps=10, exog=data_test[['exog_1', 'exog_2']])

2005-07-01    0.939158
2005-08-01    0.931943
2005-09-01    1.072937
2005-10-01    1.090429
2005-11-01    1.087492
2005-12-01    1.170073
2006-01-01    0.964073
2006-02-01    0.760841
2006-03-01    0.829831
2006-04-01    0.800095
Freq: MS, Name: pred, dtype: float64
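
Conceptually, a multi-step call such as `predict(steps=10, ...)` on an autoregressive forecaster works recursively: each prediction is fed back as the newest lag for the next step, while the exogenous values for the forecast horizon must be supplied up front. The following is a simplified sketch of that idea in plain Python, not skforecast's actual implementation; `recursive_forecast` and `toy_model` are hypothetical names.

```python
from typing import Callable, List, Sequence


def recursive_forecast(
    predict_one: Callable[[List[float]], float],
    history: Sequence[float],
    exog_future: Sequence[Sequence[float]],
    n_lags: int,
    steps: int,
) -> List[float]:
    """Forecast `steps` values, feeding each prediction back in as a lag."""
    window = list(history[-n_lags:])         # ordered oldest ... newest
    predictions = []
    for step in range(steps):
        lags = window[::-1]                  # lag_1 (most recent) comes first
        features = lags + list(exog_future[step])
        y_hat = predict_one(features)
        predictions.append(y_hat)
        window = window[1:] + [y_hat]        # slide the window forward
    return predictions


def toy_model(features: List[float]) -> float:
    """Hypothetical stand-in for a fitted regressor: mean of two lags plus exog."""
    lag_1, lag_2, exog_1 = features
    return (lag_1 + lag_2) / 2 + exog_1


preds = recursive_forecast(
    toy_model,
    history=[1.0, 2.0, 4.0],
    exog_future=[[0.0], [1.0], [0.0]],
    n_lags=2,
    steps=3,
)
print(preds)
```

Because later steps depend on earlier predictions rather than on observed values, errors can accumulate as the horizon grows.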

In [9]:
# Feature importances
# ==============================================================================
forecaster.get_feature_importances()

| | feature | importance |
|---|---|---|
| 0 | lag_1 | 61 |
| 1 | lag_2 | 91 |
| 2 | lag_3 | 14 |
| 3 | lag_4 | 38 |
| 4 | lag_5 | 35 |
| 5 | lag_6 | 49 |
| 6 | lag_7 | 25 |
| 7 | lag_8 | 26 |
| 8 | exog_1 | 43 |
| 9 | exog_2 | 127 |


## Forecaster XGBoost

In [10]:
# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = XGBRegressor(random_state = 123),
                 lags      = 8
             )

forecaster.fit(y=data_train['y'], exog=data_train[['exog_1', 'exog_2']])
forecaster

ForecasterAutoreg 
Regressor: XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, gpu_id=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             n_estimators=100, n_jobs=None, num_parallel_tree=None,
             predictor=None, random_state=123, ...) 
Lags: [1 2 3 4 5 6 7 8] 
Transformer for y: None 
Transformer for exog: None 
Window size: 8 
Weight function included: False 
Exogenous included: True 
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'> 
Exogenous

In [11]:
# Predict
# ==============================================================================
forecaster.predict(steps=10, exog=data_test[['exog_1', 'exog_2']])

2005-07-01    0.882285
2005-08-01    0.971786
2005-09-01    1.106107
2005-10-01    1.064638
2005-11-01    1.094615
2005-12-01    1.139401
2006-01-01    0.948508
2006-02-01    0.784839
2006-03-01    0.774227
2006-04-01    0.789593
Freq: MS, Name: pred, dtype: float64

In [12]:
# Feature importances
# ==============================================================================
forecaster.get_feature_importances()

| | feature | importance |
|---|---|---|
| 0 | lag_1 | 0.286422 |
| 1 | lag_2 | 0.125064 |
| 2 | lag_3 | 0.001548 |
| 3 | lag_4 | 0.027828 |
| 4 | lag_5 | 0.07502 |
| 5 | lag_6 | 0.011337 |
| 6 | lag_7 | 0.058954 |
| 7 | lag_8 | 0.045198 |
| 8 | exog_1 | 0.07561 |
| 9 | exog_2 | 0.293018 |
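
The two importance outputs are on different scales. The LightGBM regressor printed above uses `'importance_type': 'split'`, so its values are integer split counts, whereas the XGBoost values sum to roughly 1, which suggests a normalized score (to my knowledge, gain-based by default for tree boosters). One rough way to put them side by side, sketched below with the numbers copied from the outputs in this guide, is to rescale each to shares of its own total. The two measures still capture different things, so this is only an approximate comparison.

```python
import pandas as pd

features = [f"lag_{i}" for i in range(1, 9)] + ["exog_1", "exog_2"]

# Values copied from the LightGBM and XGBoost outputs shown in this guide
lgbm_splits = pd.Series([61, 91, 14, 38, 35, 49, 25, 26, 43, 127], index=features)
xgb_scores = pd.Series(
    [0.286422, 0.125064, 0.001548, 0.027828, 0.07502,
     0.011337, 0.058954, 0.045198, 0.07561, 0.293018],
    index=features,
)

# Rescale each model's importances so that they sum to 1
comparison = pd.DataFrame({
    "lgbm_share": lgbm_splits / lgbm_splits.sum(),
    "xgb_share": xgb_scores / xgb_scores.sum(),
}).sort_values("lgbm_share", ascending=False)

print(comparison.round(3))
```

In both models `exog_2` comes out as the most important feature, although the ranking of the lags differs between them.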

