# Forecasting time series

The task is to build a Decision Tree model and a Random Forest model that predict furniture sales in a store from past observations. The accuracy of the models should be measured in terms of RMSE and compared to a persistence baseline.

The source of the data is [here](https://www.kaggle.com/pruthvi1995/superstore-sales).

You need to complete the code sections and provide comments in the places marked with "???".

In [None]:
# setting logging to print only error messages from Sklearnex
import logging
logging.basicConfig()
logging.getLogger("SKLEARNEX").setLevel(logging.ERROR)

import time

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV

# import a decision tree regressor class
???

# import a random forest regressor class
???

import warnings
warnings.filterwarnings("ignore")

# Step 1. Load data

We will select only sales relating to Furniture, and will use only the columns "Order Date" and "Sales".

Note that `read_excel` will guess that "Order Date" contains dates and will convert the column to the datetime type.

In [None]:
df = pd.read_excel("superstore.xlsx", usecols=["Order Date", "Sales", "Category"])
df = df.loc[df['Category'] == 'Furniture']

# once the relevant rows have been selected, delete the Category column
del df["Category"]

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.shape

In [None]:
# Check if there are any missing values
df.isnull().sum()

In [None]:
# Obtain daily amounts of sales
df = df.groupby('Order Date').sum()

In [None]:
df.plot(figsize=(16,3))

In [None]:
df.shape

The daily values appear to be quite volatile, so we will group the data into weeks and record the total sales for each week.

This can be achieved with the `resample` method of the dataframe. Before the method can be used, the dataframe must have a datetime index. The `resample` method takes an argument indicating how the series should be grouped. For example, "D" groups the observations into days, "W" into weeks, and "M" into months ("ME" in pandas 2.2 and later, where "M" is deprecated).

One can also group the data into a certain number of days (weeks, months, etc). For example, "2D" groups the observations into two-day "bins".

A complete list of the "offset aliases" can be found [here](https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases). 
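
As a small illustration of how the offset aliases behave, the sketch below resamples a hypothetical ten-day series (the dates and values are made up, not taken from the superstore data) into weekly and two-day bins. Note that "W" bins end on Sunday by default.

```python
import pandas as pd

# Hypothetical toy data: ten consecutive days of sales, starting on a Monday
dates = pd.date_range("2024-01-01", periods=10, freq="D")
toy = pd.DataFrame({"Sales": range(1, 11)}, index=dates)

# "W" bins end on Sunday by default, so the first bin covers 1-7 January
weekly = toy.resample("W").sum()

# "2D" groups the observations into two-day bins
two_day = toy.resample("2D").sum()

print(weekly)
print(two_day)
```

The last weekly bin here is only partially filled (three days), which is worth keeping in mind when the first or last week of a real series looks unusually low.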

In [None]:
df = df.resample('W').sum()

In [None]:
# check the length of the time series grouped by the weeks
???

In [None]:
df.plot(figsize=(16,3))

# Step 2. Train-test split

In [None]:
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(df, test_size=0.1, random_state=7, shuffle=???)

train_set.columns = df.columns
test_set.columns = df.columns

print(f"{train_set.shape[0]} train and {test_set.shape[0]} test instances")

# Step 3. Exploratory Data Analysis

Let's plot the data.

In [None]:
train_set.plot(figsize=(16,3))

There does not appear to be any seasonality or trend in the series.

# Step 4. Data cleaning and transformation

Before we can start building a model, we need to ensure the data is **stationary**. We will use the Augmented Dickey-Fuller (ADF) test and the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test to check the series for stationarity.

In [None]:
from statsmodels.tsa.stattools import adfuller, kpss

# 'n' means no constant and no trend (spelled 'nc' in statsmodels versions before 0.13)
adf_pval = adfuller(train_set['Sales'], maxlag=10, regression='n')[1]

print("ADF, p-value:", adf_pval)

In [None]:
kpss_stat, kpss_pval, lags, crit_vals = kpss(train_set['Sales'])

In [None]:
print("KPSS, p-value:", kpss_pval)

Conclusion from these tests???

In [None]:
train_diff = train_set['Sales'].diff().dropna()

# 'n' means no constant and no trend (spelled 'nc' in statsmodels versions before 0.13)
adf_pval = adfuller(train_diff, maxlag=10, regression="n")[1]
print("ADF, p-value:", adf_pval)

In [None]:
kpss_stat, kpss_pval, lags, crit_vals = kpss(train_diff)
print("KPSS, p-value:", kpss_pval)

Conclusions from these tests ???

In [None]:
test_diff = test_set['Sales'].diff().dropna()

# Step 5. Build models

## 5.1 Baseline

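The persistence baseline uses the previous observation as the prediction for the current one. Since the series has been resampled to weeks, the forecast for each week is the previous week's value.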
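
A minimal sketch of the idea on a hypothetical four-point series (the numbers are invented for illustration): the forecast array is the series shifted by one step, and the RMSE is computed over the overlapping part.

```python
import numpy as np

# Hypothetical weekly values, for illustration only
series = np.array([10.0, 12.0, 11.0, 15.0])

actual = series[1:]      # observations from the second one onwards
predicted = series[:-1]  # each forecast is simply the preceding observation

# errors are 2, -1 and 4, so the mean squared error is (4 + 1 + 16) / 3 = 7
rmse = np.sqrt(np.mean((actual - predicted) ** 2))
print(rmse)
```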

In [None]:
# baseline_predictions
yhat = ???

mse = mean_squared_error(test_diff[1:], yhat)

baseline_rmse = np.sqrt(mse)
baseline_rmse

## 5.2. Extra transformation steps

Some further transformation steps are needed before the data can be passed to scikit-learn's implementations of the ML algorithms.

In [None]:
def create_ar_vars(ts, lags=2):
    """Create autoregressive X variables
    """
    X, y = [], []
    for i in range(len(ts)-lags):
        X.append(ts[i:i + lags, 0])
        y.append(ts[i + lags, 0])
    return np.array(X), np.array(y)

We first create separate arrays for the predictors and the target, for both the training and test data. We'll use 3 lags to create autoregressive variables.

In [None]:
Xtrain, ytrain = create_ar_vars(train_diff.values.reshape(-1, 1), lags=3)
Xtest, ytest = create_ar_vars(test_diff.values.reshape(-1, 1), lags=3)
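
To see what the lagged arrays look like, the sketch below applies the same function to a hypothetical six-value series (the function is repeated here only so that the snippet runs on its own).

```python
import numpy as np

def create_ar_vars(ts, lags=2):
    """Create autoregressive X variables (same logic as the function above)."""
    X, y = [], []
    for i in range(len(ts) - lags):
        X.append(ts[i:i + lags, 0])
        y.append(ts[i + lags, 0])
    return np.array(X), np.array(y)

toy = np.arange(1, 7).reshape(-1, 1)  # column vector with the values 1..6
X, y = create_ar_vars(toy, lags=3)

print(X)  # each row holds three consecutive values
print(y)  # the value that follows each row
```

With six observations and three lags, only three (X, y) pairs can be formed, so the first `lags` observations never appear as targets.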

Both predictor arrays need to be scaled (but the target variable should not be transformed).

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
Xtrain = ???
Xtest = ???
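
One common pattern, shown here on hypothetical toy arrays rather than the notebook's data, is to learn the scaling statistics from the training predictors only and then reuse them for the test predictors, so that no information from the test period leaks into the model. The variable names below are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical predictor arrays, for illustration only
X_train_toy = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
X_test_toy = np.array([[7.0, 8.0]])

toy_scaler = StandardScaler()
# learn the column means and standard deviations from the training data only
X_train_scaled = toy_scaler.fit_transform(X_train_toy)
# reuse the training statistics for the test data
X_test_scaled = toy_scaler.transform(X_test_toy)

print(X_train_scaled.mean(axis=0))
print(X_test_scaled)
```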

Then we can use a grid search to find the best hyperparameter settings.

## 5.3 Decision Tree regression

We'll fine-tune `min_samples_split` (the minimum number of instances required at a node before it can be split) and `max_depth` (the maximum depth of the tree).

In [None]:
dtree = DecisionTreeRegressor(random_state=7)
param_grid = [
    {'max_depth': [None, ???], # try different values and observe the effects on the accuracy
    'min_samples_split': [2, ???] # try different values and observe the effects on the accuracy
    }
]

tscv = TimeSeriesSplit(n_splits=5)
dtree_grid_search = GridSearchCV(estimator=dtree, cv=tscv,
                        param_grid=param_grid,
                        scoring='neg_mean_squared_error', 
                        return_train_score=True)

start = time.time()
dtree_grid_search.fit(Xtrain, ytrain)
duration = time.time() - start
print(f'Took {duration:.3f} seconds')
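
The grid search above uses `TimeSeriesSplit` so that every validation fold lies strictly after its training fold. The sketch below prints the folds for a hypothetical series of eight observations with three splits (both numbers chosen only to keep the output short).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_toy = np.arange(8).reshape(-1, 1)  # eight observations in time order

folds = list(TimeSeriesSplit(n_splits=3).split(X_toy))
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

The training window grows from fold to fold, which means the early folds are fitted on relatively few observations.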

In [None]:
cv_results = pd.DataFrame(dtree_grid_search.cv_results_)[['params', 'mean_train_score', 
                                                    'mean_test_score']]
cv_results["mean_train_score"] = -cv_results["mean_train_score"]
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
cv_results["diff, %"] = 100*(cv_results["mean_train_score"]-cv_results["mean_test_score"]
                                                     )/cv_results["mean_train_score"]

cv_results.sort_values('mean_test_score')

## 5.4 Random Forest regression

We'll fine-tune `n_estimators` (the number of decision trees in the random forest) as well as `min_samples_split` and `max_depth` (hyperparameters of the individual trees).

In [None]:
rf = RandomForestRegressor(random_state=7)
param_grid = [
    {'n_estimators': [10, ???],  # try different values and observe the effects on the accuracy
     'max_depth': [None, ???],  # try different values and observe the effects on the accuracy
     'min_samples_split': [2, ???]  # try different values and observe the effects on the accuracy
    },
]

tscv = TimeSeriesSplit(n_splits=5)
rf_grid_search = GridSearchCV(estimator=rf, cv=tscv,
                        param_grid=param_grid,
                        scoring='neg_mean_squared_error', 
                        return_train_score=True)

start = time.time()
rf_grid_search.fit(Xtrain, ytrain)
duration = time.time() - start
print(f'Took {duration:.3f} seconds')

Let's print the accuracy scores for every model evaluated during the grid search.

In [None]:
cv_results = pd.DataFrame(rf_grid_search.cv_results_)[['params', 'mean_train_score', 
                                                    'mean_test_score']]
cv_results["mean_train_score"] = -cv_results["mean_train_score"]
cv_results["mean_test_score"] = -cv_results["mean_test_score"]
cv_results["diff, %"] = 100*(cv_results["mean_train_score"]-cv_results["mean_test_score"]
                                                     )/cv_results["mean_train_score"]

cv_results.sort_values('mean_test_score', inplace=True)

# set the width of the params column
cv_results.style.set_properties(subset=['params'], **{'width': '200px'})

The best DT and RF models do not seem to overfit too much, and their cross-validation errors (reported above as MSE) are quite a bit above the baseline.
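
Because the grid searches use `scoring='neg_mean_squared_error'`, the tables above hold MSE values, while the baseline is expressed as an RMSE. A square root puts them on the same scale. The sketch below uses a hypothetical results table with invented numbers.

```python
import numpy as np
import pandas as pd

# Hypothetical MSE values standing in for a negated cv_results_ table
toy_results = pd.DataFrame({
    "mean_train_score": [100.0, 400.0],
    "mean_test_score": [225.0, 625.0],
})

# the square root of the mean MSE is on the same scale as an RMSE
toy_rmse = np.sqrt(toy_results)
print(toy_rmse)
```

The square root of a mean MSE is not identical to the mean of per-fold RMSEs. Where the latter is wanted, scikit-learn also offers the `'neg_root_mean_squared_error'` scorer.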

# Step 6. Evaluate the best DT and RF models on the test data

## Decision tree

In [None]:
best_model = ???

yhat = best_model.predict(Xtest)

dtree_mse = mean_squared_error(ytest, yhat)
dtree_rmse = np.sqrt(dtree_mse)
dtree_rmse

## Random Forest

In [None]:
best_model = ???

yhat = best_model.predict(Xtest)

rf_mse = mean_squared_error(ytest, yhat)
rf_rmse = np.sqrt(rf_mse)
rf_rmse

By how much did the decision tree model improve on the persistence baseline, percent-wise?

In [None]:
100*(baseline_rmse - dtree_rmse)/baseline_rmse

By how much did the RF model improve on the persistence baseline, percent-wise?

In [None]:
100*(baseline_rmse - rf_rmse)/baseline_rmse
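
The two expressions above compute the relative reduction in RMSE. As a worked example with made-up numbers, wrapped in a hypothetical helper function: a positive result means the model beats the baseline, and a negative one means it does worse.

```python
def improvement_pct(baseline_rmse, model_rmse):
    """Percentage reduction in RMSE relative to the baseline (negative = worse)."""
    return 100 * (baseline_rmse - model_rmse) / baseline_rmse

better = improvement_pct(200.0, 150.0)  # model error is lower than the baseline
worse = improvement_pct(200.0, 250.0)   # model error is higher than the baseline
print(better, worse)
```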

# Conclusion

???