# **XGBoost for time series forecasting**

The purpose of this notebook is to develop an XGBoost model that makes 24-hour-ahead predictions of the electricity price in the Spanish electricity market for 2020. The first section explains how XGBoost works, and then I proceed with the implementation of the model in Python. This notebook assumes that you understand the fundamentals of decision trees, gradient boosting and time series.

# **Table of Contents** 

- [1. XGBoost theoretical explanation](#1)
    - [1.1. XGBoost Intuitive explanation](#1.1)
    - [1.2. Mathematics behind XGBoost ](#1.2)
- [2. XGBoost in practice ](#2)
    - [2.1. XGBoost model](#2.1)


# **1. XGBoost theoretical explanation** <a class="anchor" id="1"></a>

In this section I will first explain the intuitive idea of how XGBoost works and then the mathematics behind it. If you want to complement this information with other sources, here are the ones that helped me the most to understand this algorithm:

- StatQuest with Josh Starmer:  https://www.youtube.com/channel/UCtYLUTtgS3k1Fg4y5tAhLbw

This is a YouTube channel (that you will probably know if you study statistics and machine learning) where you can find lots of videos about decision trees, gradient boosting and XGBoost. It is ideal if you want to have an intuitive idea of how XGBoost works.

- XGBoost: A Scalable Tree Boosting System. https://arxiv.org/pdf/1603.02754.pdf

This is the original XGBoost paper, where you can delve into the nature of the algorithm.

- XGBoost: A Scalable Tree Boosting System (presentation). https://www.youtube.com/watch?v=Vly8xGnNiWs&ab_channel=RealDataScienceUSA%28formerlyDataScience.LA%29

Here you can find a presentation of XGBoost by its author, Tianqi Chen. 

- XGBoost documentation: https://xgboost.readthedocs.io/en/latest/index.html

Of course, the official documentation, indispensable in order to understand the code and tune hyperparameters.

## **1.1 XGBoost Intuitive explanation** <a class="anchor" id="1.1"></a>

As in all supervised learning problems, the goal here is to find the model parameters that minimize a loss function. A loss function compares predicted values with real values. In addition, it usually contains a regularization term in order to deal with overfitting. The simplified loss function for XGBoost is the following:
$$\sum^{n}_{i=1} L(y_i, p_i) + \frac{1}{2} \lambda \omega^2$$

where $y_i$ is the real value, $p_i$ the predicted value, $\omega$ the output value of a leaf and $\lambda$ a regularization parameter.

In order to make predictions, XGBoost starts with a default prediction of 0.5. Then, the first tree starts with a single leaf, and all of the residuals for the 0.5 prediction go to that leaf. With this, we calculate the *similarity score* for that leaf. For now, we will define the *similarity score* as a number that measures how well the leaf does its job.
That being said, the question now is whether or not we can do a better job clustering similar residuals if we split them into two groups. In order to answer this question, we split the leaf (which becomes a node) into 2 leaves by a threshold (computed as in the decision tree algorithm), and we compute the gain of splitting the residuals into two groups:
$$Gain = Left_{similarity} + Right_{similarity} - Root_{similarity}$$

If the gain value is lower than a hyperparameter called gamma ($\gamma$), we remove the branch. If not, we continue creating new leaves. As we can see, we will prune an XGBoost tree based on its Gain values, and therefore based on the similarity scores.

## **1.2 Mathematics behind XGBoost** <a class="anchor" id="1.2"></a>

In order to minimize the loss function, XGBoost uses the second order Taylor approximation to rewrite the loss function as follows: 

$$ L(y, p_i + \omega) \approx L(y, p_i) + \frac{\partial L(y,p_i)}{\partial p_i} \omega + \frac{1}{2} \frac{\partial^2 L(y,p_i)}{\partial p_i^2}\omega^2 $$


The first term is the loss at the previous prediction, the second term contains the first derivative of the loss function with respect to the previous prediction (the gradient), and the third term contains the second derivative of the loss function (the hessian). Thus, we can rewrite:

$$ L(y, p_i + \omega) \approx L(y, p_i) + g \omega + \frac{1}{2} h \omega^2 $$


where $g$ is the gradient and $h$ the hessian.
Knowing this, if we sum over the $n$ residuals in a leaf and add the regularization term, the optimal output value $\omega$ is the one at which the derivative of the approximated objective is zero:

$$ \frac{\partial}{\partial \omega} \left[ (g_1 + g_2 + \dots + g_n) \omega + \frac{1}{2} (h_1 + h_2 + \dots + h_n +\lambda) \omega^2 \right] = 0$$


And the solution turns out to be:

$$ \omega = \frac{- (g_1 + g_2 + \dots + g_n)}{h_1 + h_2 + \dots + h_n + \lambda} $$
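
As a concrete instance, assume the squared-error loss $L(y_i, p_i) = \frac{1}{2}(y_i - p_i)^2$ (an assumption made here for illustration, since the derivation above keeps the loss generic). The gradient is then the negative residual and the hessian is constant, so the general solution reduces to:

$$ g_i = \frac{\partial L(y_i, p_i)}{\partial p_i} = -(y_i - p_i), \qquad h_i = \frac{\partial^2 L(y_i, p_i)}{\partial p_i^2} = 1 \quad \Rightarrow \quad \omega = \frac{\sum_{i=1}^{n} (y_i - p_i)}{n + \lambda} $$

With $\lambda = 0$ this is simply the mean residual in the leaf, and a larger $\lambda$ shrinks the output value towards zero.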

 
 
In order to compute the similarity score, we start with the following approximation of the loss function, which is the result of using the second-order Taylor approximation and removing constant terms:

$$(g_1 + g_2 + \dots + g_n)\omega + \frac{1}{2}(h_1 + h_2 + \dots + h_n + \lambda)\omega^2$$

If we multiply by $-1$, replace $\omega$ by its optimal expression and simplify, we have:

$$ \frac{1}{2}\frac{(g_1 + g_2 + \dots + g_n)^2}{(h_1 + h_2 + \dots + h_n + \lambda)} $$


The factor $\frac{1}{2}$ is omitted because the similarity score is only used as a relative measure, so the final expression is:
 
 $$Similarity \: Score = \frac{(g_1 + g_2 + \dots + g_n)^2}{(h_1 + h_2 + \dots + h_n + \lambda)}$$
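
The formulas above can be tried out on a handful of numbers. The following is a minimal sketch that assumes the squared-error loss, for which $g_i$ is the negative residual and $h_i = 1$; the function names, the toy residuals and the value of `gamma` are hypothetical and only serve the illustration.

```python
import numpy as np

def similarity_score(residuals, lam):
    # Squared-error case: (sum of residuals)^2 / (number of residuals + lambda)
    residuals = np.asarray(residuals, dtype=float)
    return residuals.sum() ** 2 / (len(residuals) + lam)

def output_value(residuals, lam):
    # Optimal leaf output: sum of residuals / (number of residuals + lambda)
    residuals = np.asarray(residuals, dtype=float)
    return residuals.sum() / (len(residuals) + lam)

# Toy residuals (hypothetical numbers) and a candidate split into two leaves
residuals = np.array([-10.0, -8.0, 7.0, 9.0])
left, right = residuals[:2], residuals[2:]
lam = 1.0     # regularization parameter lambda
gamma = 50.0  # pruning threshold

root_sim = similarity_score(residuals, lam)
gain = similarity_score(left, lam) + similarity_score(right, lam) - root_sim

# The branch is removed when the gain is lower than gamma
keep_split = gain >= gamma

print("root similarity:", root_sim)
print("gain:", gain, "keep split:", keep_split)
print("left/right outputs:", output_value(left, lam), output_value(right, lam))
```

Residuals with opposite signs cancel in the root, so its similarity is low, while the two leaves group similar residuals and score high. This is why the split yields a large gain here.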
 



# **2. XGBoost in practice** <a class="anchor" id="2"></a>





In order to develop the code, we first need to import the following libraries:

In [None]:
import numpy as np
import pandas as pd

from statsmodels.graphics.tsaplots import plot_acf #acf
from statsmodels.graphics.tsaplots import plot_pacf #pacf

import xgboost as xgb # Library for implementing XGBoost algorithm
from matplotlib import pyplot # Data visualization
from sklearn.model_selection import GridSearchCV # For hyperparameters optimization
from sklearn.metrics import mean_absolute_error


Then, we create a function that splits the dataset into a training set and a test set. We will use data from 2014 to 2018 as the training set, data from 2019 as the validation set (in order to optimize hyperparameters), and data from 2020 as the test set. The function takes the data we want to split and the year we want to use as the test set (the training set will be all of the previous data). It is worth mentioning that with time series we cannot use ordinary shuffled cross-validation, because the model would then be trained on observations that come after the ones it is evaluated on.

In [None]:
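# Split dataset into train and test data.
# Offsets are counted in hourly rows from the end of the data; each test
# slice starts 168 rows early so its first targets have all their lags.
def split_dataset(data, test_year):
    if test_year == 2019:
        train, test = data[:-17544], data[-17712:-8784]
    elif test_year == 2020:
        train, test = data[:-8784], data[-8952:]
    else:
        raise ValueError("test_year must be 2019 or 2020")
    return np.array(train), np.array(test)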
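
The hard-coded offsets in `split_dataset` can be reproduced from the calendar. The sketch below assumes that the data is hourly with 24 rows per day, that it ends with the last hour of 2020, and that the longest lag is 168 hours; `hours_in_year` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime

def hours_in_year(year):
    # Hourly rows in a calendar year, assuming 24 rows per day (no DST gaps)
    delta = datetime(year + 1, 1, 1) - datetime(year, 1, 1)
    return delta.days * 24

max_lag = 168                  # longest lag, one week of hourly data
h2019 = hours_in_year(2019)    # regular year
h2020 = hours_in_year(2020)    # leap year

train_end_2019 = h2019 + h2020                # data[:-17544]
test_start_2019 = train_end_2019 + max_lag    # data[-17712:-8784]
test_start_2020 = h2020 + max_lag             # data[-8952:]

print(h2019, h2020, train_end_2019, test_start_2019, test_start_2020)
```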


After doing this, we need to transform the time series problem into a supervised learning problem (labeled data). This is key in order to predict time series with supervised learning algorithms. To do this, the selected lags of the series (for example, $p_{t-24}$ is useful to predict $p_t$) become the input (what we call x_train or x_test), and the value we want to predict becomes the output (y_train or y_test). The following function takes the data and the array of selected lags, which must be sorted in increasing order because its last element is used as the longest lag:

In [None]:
# Time series data to supervised learning problem
def to_supervised(data, lags):
    X = list()
    y = np.array([])
    j = lags[-1]
    i = 0
    for _ in range(len(data)-lags[-1]):
        x_data = data[i:j]
        x_input = np.array([])
        for k in lags:
            x_input = np.append(x_input, x_data[-k])
        X.append(x_input)
        y = np.append(y, data[j:j+1,0])
        j += 1
        i += 1

    return np.array(X), y
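
For long series, the repeated `np.append` calls make the loop above slow. As an alternative, here is a minimal sketch of the same transformation written with NumPy index arithmetic. The name `to_supervised_vectorized` is hypothetical, and the function assumes, like the original, a two-dimensional array with the series in column 0.

```python
import numpy as np

def to_supervised_vectorized(data, lags):
    data = np.asarray(data, dtype=float)
    lags = np.asarray(lags)
    max_lag = int(lags.max())
    # Row indices of the values we want to predict
    targets = np.arange(max_lag, len(data))
    # For each target t, pick the values at t - lag for every selected lag
    X = data[targets[:, None] - lags[None, :], 0]
    y = data[targets, 0]
    return X, y

# Toy series 0, 1, ..., 9 with lags 1 and 3
toy = np.arange(10, dtype=float).reshape(-1, 1)
X_toy, y_toy = to_supervised_vectorized(toy, [1, 3])
print(X_toy[0], y_toy[0])
```

In the toy example the first target is the value at position 3, and its inputs are the values one and three steps earlier.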

These two functions are enough to prepare the data for the supervised learning scheme. The next step is reading the data (note: precio is the Spanish word for price):

In [None]:
df = pd.read_csv("../input/spanishelectricitymarket/precios_14_20.txt", sep=",", index_col=0)
df.head()


For the lag selection, we use multiples of 24 up to the same hour seven days before (which is 7x24=168). These lags are the most significant ones when we look at the ACF and PACF:

In [None]:
plot_acf(df,lags=(200), alpha=None)
plot_pacf(df,lags=(200), alpha=None)
lags = np.array([24,48,72,96,120,144,168])
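
To see why multiples of 24 stand out in an autocorrelation plot, the sketch below computes the sample autocorrelation of a synthetic hourly series with a daily cycle. The data is made up (it is not the price series) and `autocorr` is a hypothetical helper written for this illustration.

```python
import numpy as np

def autocorr(x, lag):
    # Sample autocorrelation at a given positive lag
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.dot(x[:-lag], x[lag:]) / np.dot(x, x))

# Synthetic hourly series: daily sine cycle plus noise, 60 days long
rng = np.random.default_rng(0)
t = np.arange(24 * 60)
series = np.sin(2 * np.pi * t / 24) + 0.3 * rng.standard_normal(t.size)

acf_by_lag = {lag: autocorr(series, lag) for lag in (12, 24, 48, 168)}
for lag, value in acf_by_lag.items():
    print(lag, round(value, 3))
```

The lags that are multiples of 24 are strongly positive, while lag 12 (half a cycle) is negative.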

Then we transform the data to a supervised learning problem:

In [None]:
train, test = split_dataset(df, 2019)

train_x, train_y = to_supervised(train, lags)

test_x, test_y = to_supervised(test, lags)



We are also going to use exogenous variables. We select wind energy and demand, as they are the exogenous variables most correlated with the electricity price.

In [None]:

# Exogenous variables

# wind energy
df2 = pd.read_csv("../input/spanishelectricitymarket/eolica_14_20.csv", sep=",", index_col=0)
# demand
df3 = pd.read_csv("../input/spanishelectricitymarket/demanda_14_20.csv", sep=",", index_col=0)


train_wind, test_wind = split_dataset(df2, 2019)
train_demand, test_demand = split_dataset(df3, 2019)

train_x_wind = train_wind[144:-24]
train_x_demand = train_demand[144:-24]

test_x_wind = test_wind[144:-24]
test_x_demand = test_demand[144:-24]

train_x_final = np.append(train_x, train_x_wind, axis=1)
train_x_final = np.append(train_x_final, train_x_demand, axis=1)

test_x_final = np.append(test_x, test_x_wind, axis=1)
test_x_final = np.append(test_x_final, test_x_demand, axis=1)

As we can see, for the exogenous variables we only use lag 24, i.e., we pick the value of each exogenous variable at $t-24$ to predict the price $p_{t}$.
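
The slice `[144:-24]` is what aligns the exogenous values with the rows produced by `to_supervised`. The following sketch checks this on plain row indices, assuming a longest price lag of 168 and an exogenous lag of 24; the array length `n` is arbitrary.

```python
import numpy as np

n = 400                        # arbitrary number of hourly rows
max_lag, exog_lag = 168, 24
idx = np.arange(n)

# to_supervised produces one row per target index 168, 169, ..., n-1
target_idx = idx[max_lag:]
# The exogenous slice used above, written in terms of the two lags
exog_idx = idx[max_lag - exog_lag:-exog_lag]   # same as idx[144:-24]

print(len(target_idx), len(exog_idx))
print(target_idx[:3], exog_idx[:3])
```

Both arrays have the same length, and each exogenous row sits exactly 24 hours before its target.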

## **2.1 XGBoost model** <a class="anchor" id="2.1"></a>

We need to create an XGBoost model for regression. In order to optimize hyperparameters, we will use GridSearchCV from scikit-learn. The following code is commented out because it takes a long time to execute.

In [None]:

## FIRST ROUND
## select the hyperparameters we want to tune and the values
# param_grid = {
#     'learning_rate': [0.1, 0.01, 0.05],
#     'min_split_loss': [1000, 2000, 500], # gamma
#     'reg_lambda': [0, 0.5, 10.0], # lambda
#     'max_depth': [8, 9, 10]
# }
# 
## build XGBoost for regression 
# optimal_params = GridSearchCV(
#     estimator = xgb.XGBRegressor(objective = 'reg:squarederror',
#                          eval_metric='mae', n_estimators = 300),
#     param_grid = param_grid,
#     scoring = 'neg_mean_absolute_error', 
#     verbose = 2
# )
# 
## fit the model
# optimal_params.fit(train_x_final,
#                    train_y,
#                    early_stopping_rounds=10,
#                    eval_set=[(test_x_final, test_y)],
#                    verbose=False)
# 
# print(optimal_params.best_params_)
## {'learning_rate': 0.05, 'max_depth': 8, 'min_split_loss': 2000, 'reg_lambda': 10.0}




In [None]:
## SECOND ROUND
# param_grid = {
#     'min_split_loss': [2000, 3000, 5000],
#     'reg_lambda': [10, 30, 100],
#     'max_depth': [8, 7, 6]
# }
# optimal_params = GridSearchCV(
#     estimator = xgb.XGBRegressor(objective = 'reg:squarederror',
#                          eval_metric='mae', n_estimators = 300, learning_rate=0.05),
#     param_grid = param_grid,
#     scoring = 'neg_mean_absolute_error',
#     verbose = 2
# )
# optimal_params.fit(train_x_final,
#                    train_y,
#                    early_stopping_rounds=10,
#                    eval_set=[(test_x_final, test_y)],
#                    verbose=False)
# 
# print(optimal_params.best_params_)
# 
## {'max_depth': 8, 'min_split_loss': 2000, 'reg_lambda': 100}
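
The grid searches above do not pass a `cv` argument, so `GridSearchCV` falls back to its default 5-fold split, in which some training folds lie after the validation fold in time. One option that respects the time order (a sketch of an alternative, not what this notebook ran) is scikit-learn's `TimeSeriesSplit`. The array `X_demo` below is a stand-in for `train_x_final`.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(100).reshape(-1, 1)   # stand-in for train_x_final

# Each fold trains on an initial block and validates on the block after it
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X_demo))

for train_idx, val_idx in folds:
    print("train:", train_idx[0], "-", train_idx[-1],
          "| validation:", val_idx[0], "-", val_idx[-1])
```

A splitter like this can be handed to the search as `GridSearchCV(..., cv=tscv)`.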

Now that we have the values for the hyperparameters, we set 2020 as the test set and rebuild the training and test arrays that the final model will use.

In [None]:
# Set 2020 as test set for price
train, test = split_dataset(df, 2020)

train_x, train_y = to_supervised(train, lags)
test_x, test_y = to_supervised(test, lags)

# Set 2020 as test set for wind and demand
train_wind, test_wind = split_dataset(df2, 2020)
train_demand, test_demand = split_dataset(df3, 2020)

train_x_wind = train_wind[144:-24]
train_x_demand = train_demand[144:-24]

test_x_wind = test_wind[144:-24]
test_x_demand = test_demand[144:-24]

# Append all variables
train_x_final = np.append(train_x, train_x_wind, axis=1)
train_x_final = np.append(train_x_final, train_x_demand, axis=1)

test_x_final = np.append(test_x, test_x_wind, axis=1)
test_x_final = np.append(test_x_final, test_x_demand, axis=1)



Once all this is done, we can now define the final model.

In [None]:
model = xgb.XGBRegressor(objective = 'reg:squarederror',
                         eval_metric='mae',
                         early_stopping_rounds=10,  # constructor argument in XGBoost >= 1.6
                         learning_rate = 0.05,
                         min_split_loss = 2000,  # gamma
                         reg_lambda = 100,  # lambda
                         max_depth = 8,
                         n_estimators = 300)


# MAE is tracked on both sets; early stopping monitors the last entry of eval_set
eval_set = [(train_x_final, train_y), (test_x_final, test_y)]
model.fit(train_x_final, train_y, eval_set=eval_set, verbose=True)


If you want to compare these results with other state-of-the-art models, you can check my end-of-degree project, where I developed ARIMA, transfer function models, Recurrent Neural Networks and Convolutional Neural Networks for this same problem (code and data in: https://github.com/Adricarpin/TFG.git). I also built a naive model in order to see whether these models were skillful. Comparing the XGBoost results with the results from these models, we find that XGBoost outperforms the naive model, so it can be considered skillful, but it does not outperform the ARIMA and neural network models.
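
The naive model itself is not described in this notebook. A common choice for this kind of benchmark, assumed here purely for illustration, is a persistence forecast that predicts each hour with the value observed 24 hours earlier. The sketch below computes its MAE on a made-up series; `persistence_mae` and `toy_prices` are hypothetical names.

```python
import numpy as np

def persistence_mae(series, horizon=24):
    # Forecast for time t is the observed value at t - horizon
    series = np.asarray(series, dtype=float)
    forecast = series[:-horizon]
    actual = series[horizon:]
    return float(np.mean(np.abs(actual - forecast)))

# Made-up hourly series: daily cycle plus a slow upward trend, two weeks long
hours = np.arange(24 * 14)
toy_prices = 50 + 10 * np.sin(2 * np.pi * hours / 24) + 0.05 * hours

baseline = persistence_mae(toy_prices)
print("persistence MAE:", baseline)
```

A model is usually called skillful when its test MAE is below the MAE of such a baseline on the same period.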

We can compare train errors with test errors in order to know if the model is overfitted or underfitted. We do this with the following graph:

In [None]:
results = model.evals_result()

epochs = len(results['validation_0']['mae'])
x_axis = range(0, epochs)


fig, ax = pyplot.subplots()
ax.set_ylim([0, 40])
ax.plot(x_axis, results['validation_0']['mae'], label='Train')
ax.plot(x_axis, results['validation_1']['mae'], label='Test')
ax.legend()
pyplot.ylabel('mae')
pyplot.title('XGBoost mae')
pyplot.show()


We can also have a look at the tree structure. The following code depicts the structure of the first tree.

In [None]:

clf_xgb = xgb.XGBRegressor(objective = 'reg:squarederror',
                         eval_metric='mae',
                         learning_rate = 0.05,
                         min_split_loss = 2000,
                         reg_lambda = 100,
                         max_depth = 8,
                         n_estimators = 1)

clf_xgb.fit(train_x_final, train_y)

node_params = {'shape': 'box',
               'style': 'filled, rounded',
               'fillcolor': '#78cbe1'}
leaf_params = {'shape': 'box',
               'style': 'filled',
               'fillcolor': '#e48038'}

graph_data = xgb.to_graphviz(clf_xgb, num_trees=0, size="10,10",
                condition_node_params=node_params,
                leaf_node_params=leaf_params)

graph_data.view(filename='xgboost_tree')
