# Forecasting time series

The task is to build Decision Tree and Random Forest models that predict the sales of chairs in a furniture shop from past sales of chairs and other types of furniture. The accuracy of the models should be measured in terms of RMSE and compared to a persistence baseline.

Experiment with different hyperparameter settings for the Decision Tree and Random Forest algorithms to find the best models. Comment on how these models compare with forecasts of sales of chairs achieved with a VAR model from the previous exercise.

Please use "furniture_subcategories.csv", which contains the same preprocessed data as in the previous exercise.

Complete the solution by writing code and comments in the places indicated with "???".

In [None]:
# setting logging to print only error messages from Sklearnex
import logging
logging.basicConfig()
logging.getLogger("SKLEARNEX").setLevel(logging.ERROR)

import warnings
warnings.filterwarnings("ignore")

import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns

sns.set_theme(palette="Set2")

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

# Load data

We'll read the pre-processed data, and then re-arrange the columns so that "Chairs" is the last column. It will later be used to create the target variable.

In [None]:
# parse_dates=["Order Date"] converts the column to datetime automatically,
# guessing the date format 
df = pd.read_csv("furniture_subcategories.csv",
                 index_col="Order Date", 
                 parse_dates=["Order Date"])
df = df[['Bookcases', 'Tables', 'Furnishings', 'Chairs']]

In [None]:
df.head()

# Train-test split

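We will split the data chronologically rather than by random sampling, because the differencing, lagging, and persistence steps below rely on the rows staying in time order; with `train_test_split` this means passing `shuffle=False`.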

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = ???

# make sure the training and test sets have the same column names as df
train_set.columns = df.columns
test_set.columns = df.columns

print(f"{train_set.shape[0]} train and {test_set.shape[0]} test instances")

# Exploratory Data Analysis

Let's plot the data.

In [None]:
train_set.plot(figsize=(16,3))

There does not appear to be any seasonality or trend in the series.

# Data cleaning and transformation

Before we can start building a model, we need to ensure that all the columns are **stationary**. We will use the Augmented Dickey-Fuller (ADF) and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) tests to check each series for stationarity.

In [None]:
???

Comment ??? (2-3 sentences)

In [None]:
???
train_diff

In [None]:
for x in train_diff.columns:
    print(x)
    ???
    print(f"ADF, p-value: {adf_pval}")
    ???
    print(f"KPSS, p-value: {kpss_pval}")

Comment??? (2-3 sentences)

In [None]:
test_diff = test_set.diff().dropna()
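First differencing replaces each value with its change from the previous observation. The first row has no predecessor, which is why `dropna()` is applied. A small sketch with made-up numbers (the array `levels` is hypothetical), using NumPy instead of pandas:

```python
import numpy as np

levels = np.array([10.0, 12.0, 15.0, 14.0, 18.0])  # hypothetical sales levels
diffs = np.diff(levels)  # same values as Series.diff().dropna()
print(diffs.tolist())  # [2.0, 3.0, -1.0, 4.0]

# Differences can be mapped back to levels given the first observation
restored = np.concatenate([[levels[0]], levels[0] + np.cumsum(diffs)])
print(np.allclose(restored, levels))  # True
```

Because the models below are fit on differenced data, their predictions and RMSE values refer to changes in sales, not to sales levels.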

# Build models

## Baseline

The persistence baseline outputs the previous day's sales of chairs as the prediction for the current day.

In [None]:
baseline_predictions = test_diff["Chairs"].shift()[1:]
mse = mean_squared_error(test_diff["Chairs"][1:], baseline_predictions)
baseline_rmse = np.sqrt(mse)
print(f"Baseline RMSE: {baseline_rmse:.3f}")
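The persistence calculation can be checked by hand on a tiny example. This sketch uses a hypothetical four-value series `y` and plain NumPy instead of `mean_squared_error`:

```python
import numpy as np

y = np.array([3.0, 5.0, 4.0, 6.0])  # hypothetical series
actual = y[1:]       # values to be predicted
predicted = y[:-1]   # persistence: the previous value is the forecast

# Errors are 2, -1, 2, so MSE = (4 + 1 + 4) / 3 = 3
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(round(float(rmse), 3))  # 1.732
```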

## Extra transformation steps

A few more transformation steps are required before the data can be passed to scikit-learn's implementations of the ML algorithms.

In [None]:
def create_ar_vars(endog, exog, lags=2):
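    """Create lagged (autoregressive) features.

    Each row of X holds the previous `lags` values of the target
    (column 0 of `endog`) followed by the previous `lags` rows of `exog`;
    y holds the target value that follows them.
    """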
    X, y = [], []
    for i in range(len(endog)-lags):
        endog_row = endog[i:i + lags, 0]
        exog_row = exog[i:i + lags,:].flatten()
        X.append(np.concatenate([endog_row, exog_row]))
        y.append(endog[i + lags, 0])
    return np.array(X), np.array(y)

We first create separate arrays for the predictors and the target, for both the training and test data. Similar to the VAR model, we'll use 2 lags to create autoregressive variables.

In [None]:
# exog keeps its (n_rows, 3) shape so that each lag contributes one value per column
Xtrain, ytrain = create_ar_vars(endog=train_diff["Chairs"].values.reshape(-1, 1),
                                exog=train_diff[["Bookcases", "Tables", "Furnishings"]].values,
                                lags=2)

Xtest, ytest = create_ar_vars(endog=test_diff["Chairs"].values.reshape(-1, 1),
                              exog=test_diff[["Bookcases", "Tables", "Furnishings"]].values,
                              lags=2)
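To see what the lagged arrays look like, here is a sketch that applies the same function to a tiny made-up data set (the arrays `endog` and `exog` are hypothetical). The function body is repeated only so that the block runs on its own.

```python
import numpy as np


def create_ar_vars(endog, exog, lags=2):
    """Same logic as the function above, repeated so the sketch runs alone."""
    X, y = [], []
    for i in range(len(endog) - lags):
        endog_row = endog[i:i + lags, 0]
        exog_row = exog[i:i + lags, :].flatten()
        X.append(np.concatenate([endog_row, exog_row]))
        y.append(endog[i + lags, 0])
    return np.array(X), np.array(y)


endog = np.array([[1], [2], [3], [4]])        # hypothetical target
exog = np.array([[10, 100], [20, 200],
                 [30, 300], [40, 400]])       # two hypothetical predictors
X, y = create_ar_vars(endog, exog, lags=2)
print(X.tolist())  # [[1, 2, 10, 100, 20, 200], [2, 3, 20, 200, 30, 300]]
print(y.tolist())  # [3, 4]
```

Each row contains `lags` target values followed by `lags` rows of the other variables, so with one target and three exogenous columns the real feature matrix has 2 + 2 × 3 = 8 columns. Note that `exog` is passed as a 2-D array with one column per variable.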

Both predictor arrays need to be scaled (but the target variable should not be transformed).

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = ???
Xtrain = ???
Xtest = ???
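The key point in scaling is that the mean and standard deviation are estimated from the training data only and then reused for the test data. The sketch below does this by hand in NumPy on made-up arrays (`X_train` and `X_test` are hypothetical); scikit-learn's `StandardScaler` applies the same idea.

```python
import numpy as np

X_train = np.array([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])  # hypothetical
X_test = np.array([[7.0, 20.0]])                              # hypothetical

# Statistics come from the training data only
mean = X_train.mean(axis=0)
std = X_train.std(axis=0)   # population std (ddof=0), as StandardScaler uses

X_train_scaled = (X_train - mean) / std
X_test_scaled = (X_test - mean) / std   # reuse the training statistics

print(X_train_scaled.mean(axis=0))  # approximately zero per column
print(X_test_scaled)
```

The scaled test values need not have zero mean or unit variance, since they are expressed relative to the training distribution.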

Then we can use a grid search to find the best hyperparameter settings.

## Decision Tree regression

We'll tune `min_samples_split` (the minimum number of instances required at a node before it can be split) and `max_depth` (the maximum depth of the tree).

In [None]:
dtree = DecisionTreeRegressor(random_state=7)
param_grid = [
    {'max_depth': [???, None],
    'min_samples_split': [2, ???]}
]

tscv = TimeSeriesSplit(n_splits=5)
dtree_grid_search = GridSearchCV(estimator=dtree, cv=tscv,
                        param_grid=param_grid,
                        scoring='neg_mean_squared_error', 
                        return_train_score=True)

start = time.time()
dtree_grid_search.fit(Xtrain, ytrain)
duration = time.time() - start
print(f'Took {duration:.3f} seconds')
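`TimeSeriesSplit` keeps every validation fold later in time than its training fold, so a model is never validated on data that precedes what it was trained on. The sketch below is a simplified pure-Python illustration of that expanding-window idea; `expanding_window_splits` is a hypothetical helper written for this example, not scikit-learn's implementation.

```python
def expanding_window_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs in which the training
    data always precedes the test data. Hypothetical helper for
    illustration only."""
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for start in range(first_test_start, n_samples, test_size):
        train = list(range(0, start))
        test = list(range(start, start + test_size))
        yield train, test


for train, test in expanding_window_splits(n_samples=12, n_splits=5):
    print(f"train up to index {train[-1]}, test {test}")
```

With 12 samples and 5 splits, the training window grows from 2 to 10 observations while each test fold holds the next 2.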

In [None]:
cv_results = pd.DataFrame(dtree_grid_search.cv_results_)[['params', 'mean_train_score', 
                                                    'mean_test_score']]
cv_results["mean_train_score"] = np.sqrt(-cv_results["mean_train_score"])
cv_results["mean_test_score"] = np.sqrt(-cv_results["mean_test_score"])
cv_results["diff, %"] = 100*(cv_results["mean_train_score"]-cv_results["mean_test_score"]
                                                     )/cv_results["mean_train_score"]

cv_results.sort_values('mean_test_score')
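The conversion in the cell above can be followed on a small made-up example. The scores below are hypothetical; they only show how the negated MSE values become RMSE and how the `diff, %` column is computed.

```python
import numpy as np

# Hypothetical scores, as returned with scoring='neg_mean_squared_error'
neg_mse_train = np.array([-4.0, -9.0])
neg_mse_test = np.array([-16.0, -9.0])

rmse_train = np.sqrt(-neg_mse_train)
rmse_test = np.sqrt(-neg_mse_test)
gap_pct = 100 * (rmse_train - rmse_test) / rmse_train

print(rmse_train.tolist())  # [2.0, 3.0]
print(rmse_test.tolist())   # [4.0, 3.0]
print(gap_pct.tolist())     # [-100.0, 0.0]
```

A strongly negative `diff, %` means the validation error is much larger than the training error, which points to overfitting. Because the square root is taken after averaging over folds, each value is the root of the mean MSE, not the mean of per-fold RMSE values.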

## Random Forest regression

We'll tune `n_estimators` (the number of decision trees in the random forest) as well as `min_samples_split` and `max_depth` (hyperparameters of the individual trees).

In [None]:
rf = RandomForestRegressor(random_state=7)
param_grid = [
    {'n_estimators': [10, ???], 
     'max_depth': [???, None],
     'min_samples_split': [2, ???]
    },
]

tscv = TimeSeriesSplit(n_splits=5)
rf_grid_search = GridSearchCV(estimator=rf, cv=tscv,
                        param_grid=param_grid,
                        scoring='neg_mean_squared_error', 
                        return_train_score=True)

start = time.time()
rf_grid_search.fit(Xtrain, ytrain)
duration = time.time() - start
print(f'Took {duration:.3f} seconds')

Let's print the RMSE scores for every hyperparameter combination evaluated during the grid search.

In [None]:
cv_results = pd.DataFrame(rf_grid_search.cv_results_)[['params', 'mean_train_score', 
                                                    'mean_test_score']]
cv_results["mean_train_score"] = np.sqrt(-cv_results["mean_train_score"])
cv_results["mean_test_score"] = np.sqrt(-cv_results["mean_test_score"])
cv_results["diff, %"] = 100*(cv_results["mean_train_score"]-cv_results["mean_test_score"]
                                                     )/cv_results["mean_train_score"]

cv_results.sort_values('mean_test_score', inplace=True)

# set the width of the params column
cv_results.style.set_properties(subset=['params'], **{'width': '200px'})

Comment??? (one-two sentences)

# Evaluate the best DT and RF models on the test data

## Decision tree

In [None]:
best_model = dtree_grid_search.best_estimator_

yhat = best_model.predict(Xtest)

dtree_mse = mean_squared_error(ytest, yhat)
dtree_rmse = np.sqrt(dtree_mse)
dtree_rmse

By how much did the Decision Tree model improve on the persistence baseline, percent-wise?

In [None]:
???
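One common way to express the improvement is the relative reduction in RMSE. The sketch below uses made-up RMSE values, and `pct_improvement` is a hypothetical helper name; the real values come from `baseline_rmse` and the model's RMSE computed above.

```python
def pct_improvement(baseline_rmse, model_rmse):
    """Relative RMSE reduction versus the baseline, in percent."""
    return 100 * (baseline_rmse - model_rmse) / baseline_rmse


# Hypothetical numbers: baseline RMSE 200, model RMSE 150
print(pct_improvement(200.0, 150.0))  # 25.0
```

A negative result would mean the model is worse than the baseline.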

## Random Forest

In [None]:
best_model = rf_grid_search.best_estimator_

yhat = best_model.predict(Xtest)

rf_mse = mean_squared_error(ytest, yhat)
rf_rmse = np.sqrt(rf_mse)
rf_rmse

By how much did the Random Forest model improve on the persistence baseline, percent-wise?

In [None]:
???

# Conclusion

Comment ??? (one-two sentences)

Mention how the accuracy of the ML models compares to that of the VAR model.