### XGBoost for Time Series Forecasting

Using XGBoost (a variant of gradient boosting) to forecast a stock price. [XGBoost documentation](https://xgboost.readthedocs.io/en/latest/)
- XGBoost is an ensemble of decision trees in which each new tree corrects the errors of the trees already in the model. Trees are added one at a time, until a fixed number is reached or the model stops improving.
- To use XGBoost for time series, we need to evaluate the model with `walk-forward validation` instead of `k-fold cross validation`. K-fold ignores temporal order, so a model can be trained on observations that come after the ones it is tested on, which gives optimistically biased results.
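
The idea in the first bullet, where each new tree corrects the errors of the current ensemble, can be sketched with plain NumPy. This is a minimal illustration of boosting with squared error, not XGBoost's actual algorithm (which adds regularization, second-order gradient information, and deeper trees). `fit_stump`, `predict_stump`, and `boost` are hypothetical helpers, and the sine data is made up for the example.

```python
import numpy as np


def fit_stump(x, residual):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in np.unique(x)[:-1]:  # splits that leave both sides non-empty
        left, right = residual[x <= threshold], residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    return best[1:]  # (threshold, left_value, right_value)


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_trees=20, learning_rate=0.3):
    pred = np.full(len(y), y.mean())  # start from a constant prediction
    mse_per_round = []
    for _ in range(n_trees):
        residual = y - pred  # errors of the current ensemble
        stump = fit_stump(x, residual)  # the new tree is fit to those errors
        pred = pred + learning_rate * predict_stump(stump, x)
        mse_per_round.append(float(np.mean((y - pred) ** 2)))
    return pred, mse_per_round


# made-up toy data, only to show the training error shrinking round by round
x = np.arange(10, dtype=float)
y = np.sin(x)
_, mse = boost(x, y)
print(mse[0], mse[-1])
```

Because each stump is fit to the residuals of the ensemble so far, the training error stored in `mse` does not increase from one round to the next.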

#### Importing Extensions and Libraries

In [1]:
%load_ext watermark
%load_ext lab_black

In [75]:
import os
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBRegressor

np.random.seed(22)

plt.style.use(style="seaborn")
%matplotlib

# make autocomplete work
%config Completer.use_jedi = False

Using matplotlib backend: agg


In [2]:
%watermark -iv -v

Python implementation: CPython
Python version       : 3.8.5
IPython version      : 7.22.0

numpy     : 1.20.2
pandas    : 1.2.3
matplotlib: 3.4.1



#### Loading the dataset from Yahoo Finance

In [3]:
# data is Microsoft's daily stock price for the past year (April 3 2020 - April 1 2021)
data_url = "https://query1.finance.yahoo.com/v7/finance/download/MSFT?period1=1585908430&period2=1617444430&interval=1d&events=history&includeAdjustedClose=true"
data = pd.read_csv(data_url)
data

| | Date | Open | High | Low | Close | Adj Close | Volume |
|---|---|---|---|---|---|---|---|
| 0 | 2020-04-03 | 155.100006 | 157.380005 | 152.190002 | 153.830002 | 152.282501 | 41243300 |
| 1 | 2020-04-06 | 160.320007 | 166.500000 | 157.580002 | 165.270004 | 163.607422 | 67111700 |
| 2 | 2020-04-07 | 169.589996 | 170.000000 | 163.259995 | 163.490005 | 161.845322 | 62769000 |
| 3 | 2020-04-08 | 165.669998 | 166.669998 | 163.500000 | 165.130005 | 163.468811 | 48318200 |
| 4 | 2020-04-09 | 166.360001 | 167.369995 | 163.330002 | 165.139999 | 163.478729 | 51431800 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 246 | 2021-03-26 | 231.550003 | 236.710007 | 231.550003 | 236.479996 | 236.479996 | 25471700 |
| 247 | 2021-03-29 | 236.589996 | 236.800003 | 231.880005 | 235.240005 | 235.240005 | 25227500 |
| 248 | 2021-03-30 | 233.529999 | 233.850006 | 231.100006 | 231.850006 | 231.850006 | 24792000 |
| 249 | 2021-03-31 | 232.910004 | 239.100006 | 232.389999 | 235.770004 | 235.770004 | 43623500 |


In [4]:
# for simplicity, I am just taking the closing price in this forecast (making it a univariate time series problem)
df = data[["Close"]].round(2).copy()
print(df.shape)
df.head()

(251, 1)


| | Close |
|---|---|
| 0 | 153.83 |
| 1 | 165.27 |
| 2 | 163.49 |
| 3 | 165.13 |
| 4 | 165.14 |


#### Transforming the univariate problem to a supervised problem
For this problem, the target is the next day's closing price, obtained by shifting `Close` up by one row. The last row has no next-day value, so its target is NaN and that row is dropped.

In [5]:
# adding target column
df["target"] = df.Close.shift(-1)
df

| | Close | target |
|---|---|---|
| 0 | 153.83 | 165.27 |
| 1 | 165.27 | 163.49 |
| 2 | 163.49 | 165.13 |
| 3 | 165.13 | 165.14 |
| 4 | 165.14 | 165.51 |
| ... | ... | ... |
| 246 | 236.48 | 235.24 |
| 247 | 235.24 | 231.85 |
| 248 | 231.85 | 235.77 |
| 249 | 235.77 | 242.35 |


In [6]:
# drop the NaN values in the last row in target column
df.dropna(inplace=True)
print(df.shape)
df.head()

(250, 2)


| | Close | target |
|---|---|---|
| 0 | 153.83 | 165.27 |
| 1 | 165.27 | 163.49 |
| 2 | 163.49 | 165.13 |
| 3 | 165.13 | 165.14 |
| 4 | 165.14 | 165.51 |
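
The framing above uses a single lag (today's close) as the only feature. A minimal sketch of the same idea generalized to several lags is shown here. The helper `make_supervised` and its `n_lags` parameter are hypothetical and not part of this notebook; the input is the first five closing prices from the table above.

```python
import numpy as np


def make_supervised(values, n_lags=1):
    """Turn a 1-D series into (X, y): each row of X holds n_lags consecutive
    values and y is the value that immediately follows them."""
    values = np.asarray(values, dtype=float)
    if n_lags < 1 or len(values) <= n_lags:
        raise ValueError("need at least one lag and more observations than lags")
    n_rows = len(values) - n_lags
    X = np.array([values[i : i + n_lags] for i in range(n_rows)])
    y = values[n_lags:]
    return X, y


# first five closing prices from the notebook
close = [153.83, 165.27, 163.49, 165.13, 165.14]

X, y = make_supervised(close, n_lags=1)  # same pairs as the Close/target columns
X3, y3 = make_supervised(close, n_lags=3)  # three previous closes per row
print(X.shape, y.shape, X3.shape, y3.shape)
```

With `n_lags=1` the rows match the `Close`/`target` pairs produced by `shift(-1)`; larger windows give the model more history per row at the cost of losing the first `n_lags` observations as targets.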


#### Splitting the dataset into train and test dataset
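Usually we first split the data into features (independent variables) and target (dependent variable). Here I split by rows instead, so the train and test arrays each keep both columns. The split is chronological, with no shuffling, so the test set holds the most recent 20% of the days.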

In [61]:
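# custom chronological train/test split (no shuffling, the test set is the most recent data)
def train_test_split(data, perc):
    """Return the first (1 - perc) share of the rows as train and the rest as test."""
    data = data.values  # DataFrame -> NumPy array
    n = int(len(data) * (1 - perc))  # index of the first test row
    return data[:n], data[n:]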

In [62]:
# 80-20 split of data into train and test respectively
train, test = train_test_split(df, 0.2)

In [66]:
print(len(df))
print(len(train))
print(len(test))

250
200
50


In [68]:
df.shape, train.shape, test.shape

((250, 2), (200, 2), (50, 2))

#### Training with XGBRegressor
I am going to use the `XGBRegressor` model for training and prediction. It is XGBoost's regressor exposed through a scikit-learn-compatible API.

In [69]:
# let's split the train set into features (independent, X) and target (dependent, y) to fit the model
X_train = train[:, :-1]
y_train = train[:, -1]

In [71]:
X_train.shape, y_train.shape

((200, 1), (200,))

In [74]:
# XGBRegressor??

In [81]:
%%time
# using 100 trees and all cpu cores
model = XGBRegressor(n_estimators=100, n_jobs=-1)
model.fit(X_train, y_train)

CPU times: user 2.77 s, sys: 0 ns, total: 2.77 s
Wall time: 365 ms


XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=6,
             min_child_weight=1, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=-1, num_parallel_tree=1, random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
             tree_method='exact', validate_parameters=1, verbosity=None)

#### Prediction

In [96]:
# let's take the first row of the test data
test[0]

array([224.34, 224.97])

In [97]:
# the first element is the input (today's close) and the second is what the model should predict (next day's close).
# extract the input as a 2-D array `val`, the shape that predict() expects.
val = np.array(test[0, 0]).reshape(1, -1)
val

array([[224.34]])

In [98]:
# now let's predict the next day's price from val. ideally the prediction would be close to 224.97
pred = model.predict(val)
pred[0]

221.64655

It's not close: the prediction is off by about 3.3 dollars.
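
To put that single error in context, one option is to compare it with a naive persistence forecast that simply repeats today's close. The persistence baseline is an addition for illustration and not something this notebook computes; the three numbers are copied from the cells above.

```python
# values copied from the cells above
today_close = 224.34  # test[0, 0], the model input
actual_next = 224.97  # test[0, 1], the true next-day close
model_pred = 221.64655  # model.predict(val)[0]

# absolute error of the XGBoost prediction
model_error = abs(actual_next - model_pred)

# absolute error of the naive "tomorrow equals today" forecast (assumed baseline)
persistence_error = abs(actual_next - today_close)

print(f"model error:       {model_error:.2f}")
print(f"persistence error: {persistence_error:.2f}")
```

A single test row is too little to draw conclusions from; the same comparison averaged over all 50 test rows would be more informative.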

**The next step would be to use walk-forward validation, refitting the model as each new day arrives, which should make the predictions more accurate.**
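
A minimal sketch of what such a walk-forward loop could look like, under a few assumptions: `walk_forward_validation` and `PersistenceModel` are hypothetical names, the data layout follows this notebook (last column is the target), and the stand-in model only exists so the example runs without XGBoost installed.

```python
import numpy as np


class PersistenceModel:
    """Hypothetical stand-in regressor: predicts the target equal to the first feature."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return np.asarray(X, dtype=float)[:, 0]


def walk_forward_validation(data, n_test, make_model):
    """Refit on all rows seen so far, predict the next row, then add that row to the history.

    data: 2-D array whose last column is the target.
    Returns the mean absolute error and the one-step-ahead predictions.
    """
    data = np.asarray(data, dtype=float)
    history, test = data[:-n_test], data[-n_test:]
    predictions = []
    for row in test:
        model = make_model()
        model.fit(history[:, :-1], history[:, -1])  # train only on past rows
        predictions.append(float(model.predict(row[:-1].reshape(1, -1))[0]))
        history = np.vstack([history, row])  # the true value is now known
    predictions = np.array(predictions)
    mae = float(np.mean(np.abs(test[:, -1] - predictions)))
    return mae, predictions


# first five Close/target rows from the notebook, used as toy data
toy = np.array(
    [
        [153.83, 165.27],
        [165.27, 163.49],
        [163.49, 165.13],
        [165.13, 165.14],
        [165.14, 165.51],
    ]
)
mae, preds = walk_forward_validation(toy, n_test=2, make_model=PersistenceModel)
print(mae, preds)
```

Passing `lambda: XGBRegressor(n_estimators=100, n_jobs=-1)` as `make_model`, with `df.values` as `data` and `n_test=50`, would apply the same loop to the model trained above, at the cost of one fit per test row.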