## Linear Regression Modelling

Linear regression is a comparatively simple, well-established algorithm for predicting continuous target variables. Because it is less flexible than the other two models used in the project, it serves as the baseline against which their results are compared.

The following evaluation metrics are used to assess the model's performance:

* R-squared
* Mean Absolute Percentage Error (MAPE)

The reasoning behind their selection is described in the notebook with the CatBoost model.
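As a quick illustration of the two metrics, the sketch below computes them with scikit-learn on a handful of made-up values (the numbers are not project data). Note that scikit-learn returns MAPE as a fraction, so 0.05 corresponds to 5 %.

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error, r2_score

# Made-up sales figures, for illustration only
y_true = np.array([100.0, 200.0, 400.0])
y_pred = np.array([110.0, 180.0, 400.0])

# Share of the target's variance explained by the predictions
r2 = r2_score(y_true, y_pred)

# Mean of |error| / |actual|, returned as a fraction rather than a percentage
mape = mean_absolute_percentage_error(y_true, y_pred)

# The same MAPE computed by hand
mape_manual = np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

print(f"R-squared: {r2:.4f}, MAPE: {mape:.2%}")
```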

In [18]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, RobustScaler, FunctionTransformer, PolynomialFeatures
from sklearn.model_selection import cross_validate
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.metrics import mean_absolute_percentage_error

from sktime.transformations.series.summarize import WindowSummarizer

In [2]:
%run functions_model.py
%run functions_vis.py

### Loading the dataset

In [3]:
d = pd.read_csv("data/train_df.csv", parse_dates=[0])
d_test = pd.read_csv("data/test_df.csv", parse_dates=["date"])  # parse dates as for the training data

## Modelling

### Selecting and defining features

The following features were selected to be included in the baseline model:

* Item Category
* Store
* Weather 
    * Temperature
    * Precipitation
    * Sunshine duration
* Time variables:
    * Timestep (number of days since the first recorded sale)
    * Day of the year
    * Weekday
    * Week of the year 
    * Month
    * Year
* Window variables
    * Lagged features
    * Rolling averages
    * Rolling standard deviation
* Special events 
    * New Year's Eve
    * Halloween
    * Valentine's Day
* Public Space dummy variable
* Public holidays
* Street market dummy variable

<br>

The variable "school holidays" was left out of the model because it does not appear to have a significant effect on sales (see the p-value in the Visualisation notebook). All the other variables are likely to influence sales to some degree. For more information, please refer to the Visualisation notebook.
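The time variables and special-event dummies are already present in the loaded dataset, and this notebook does not show how they were built. As a minimal sketch of how such columns could be derived from a date column with pandas (the example dates and the `timestep` and `day_of_year` column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical dates; only the column name "date" is taken from the notebook
df = pd.DataFrame(
    {"date": pd.to_datetime(
        ["2021-01-01", "2021-01-02", "2021-02-14", "2021-10-31", "2021-12-31"]
    )}
)

# Timestep: number of days since the first recorded date
df["timestep"] = (df["date"] - df["date"].min()).dt.days

# Calendar features
df["day_of_year"] = df["date"].dt.dayofyear
df["weekday"] = df["date"].dt.weekday  # Monday = 0
df["week_year"] = df["date"].dt.isocalendar().week.astype(int)
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year

# Special-event dummies (1 on the event date, 0 otherwise)
month, day = df["date"].dt.month, df["date"].dt.day
df["valentines_day"] = ((month == 2) & (day == 14)).astype(int)
df["halloween"] = ((month == 10) & (day == 31)).astype(int)
df["nye"] = ((month == 12) & (day == 31)).astype(int)
```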

In [5]:
date = ["date"]

numfeat = ["days_back", "temperature_2m_mean", "sunshine_duration", "precipitation_hours"]

catfeat = ["store_name", "item_category", "day", "halloween", "hol_pub", "month", "nye", "public_space", "street_market", "valentines_day", "week_year", "weekday", "year"]

In [13]:
d2 = d[date + catfeat + numfeat + ["total_amount"]]
d_test2 = d_test[date + catfeat + numfeat + ["total_amount"]]


In [16]:
# Aggregate to one row per date, store and item category:
# the target is summed, every other column keeps its first value in the group.
agg_columns = d2.columns.difference(["date", "store_name", "item_category", "total_amount"])
agg_dict = {col: "first" for col in agg_columns}
agg_dict["total_amount"] = "sum"

d2 = (
    d2.groupby(["date", "store_name", "item_category"])
    .agg(agg_dict)
    .reset_index()
    .sort_values(by="date", ascending=False)
    .reset_index(drop=True)
)
d2["hol_pub"] = d2["hol_pub"].apply(np.int64)

# Index by store, category and date so that window features are computed per series
d2 = d2.set_index(["store_name", "item_category", "date"]).sort_index()

d_test2 = (
    d_test2.groupby(["date", "store_name", "item_category"])
    .agg(agg_dict)
    .reset_index()
    .sort_values(by="date", ascending=False)
    .reset_index(drop=True)
)
d_test2["hol_pub"] = d_test2["hol_pub"].apply(np.int64)

d_test2 = d_test2.set_index(["store_name", "item_category", "date"]).sort_index()

In [19]:
kwargs = {"lag_feature": {
    "lag":[1,2,3],
    "mean": [[1,7], [1, 15], [1,30]],
    "std": [[1,4]]
    },
    "target_cols":["total_amount"]}

transformer = WindowSummarizer(**kwargs, n_jobs= -1)

In [22]:
d2wind = transformer.fit_transform(d2)
d2wind = pd.concat([d2["total_amount"], d2wind], axis = 1).dropna()

In [30]:
d2wind.columns

Index(['total_amount', 'total_amount_lag_1', 'total_amount_lag_2',
       'total_amount_lag_3', 'total_amount_mean_1_7', 'total_amount_mean_1_15',
       'total_amount_mean_1_30', 'total_amount_std_1_4', 'day', 'days_back',
       'halloween', 'hol_pub', 'month', 'nye', 'precipitation_hours',
       'public_space', 'street_market', 'sunshine_duration',
       'temperature_2m_mean', 'valentines_day', 'week_year', 'weekday',
       'year'],
      dtype='object')
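To make the window columns above more concrete, the following sketch reproduces the idea behind a lag and a lagged rolling mean with plain pandas on toy data. It is not the `WindowSummarizer` implementation; it assumes that a specification such as `[1, 7]` means "a window of length 7 that ends one step before the current row", and it uses a shorter window of 3 so the toy series stays small. The store names and values are invented.

```python
import pandas as pd

# Toy panel with two stores and five days each (invented values)
sales = pd.DataFrame(
    {
        "store_name": ["A"] * 5 + ["B"] * 5,
        "date": list(pd.date_range("2021-01-01", periods=5)) * 2,
        "total_amount": [10.0, 20.0, 30.0, 40.0, 50.0, 1.0, 2.0, 3.0, 4.0, 5.0],
    }
).set_index(["store_name", "date"]).sort_index()

by_store = sales.groupby(level="store_name")["total_amount"]

# Lag 1: yesterday's sales of the same store
sales["lag_1"] = by_store.shift(1)

# Rolling mean over the 3 days that end yesterday (shift first, then roll),
# so the current day's target never leaks into its own feature
sales["mean_1_3"] = by_store.transform(lambda s: s.shift(1).rolling(3).mean())

print(sales)
```

Because both features are computed per group, the first rows of each store are missing rather than filled from another store's history, which is why the notebook drops rows with missing values after the transformation.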

### Transforming features

In order for the linear regression model to work properly, some additional feature preprocessing is necessary. This includes the following transformations of the variables:

* Scaling numerical variables, including window variables
* Adding polynomial features (i.e. powers and interactions of the numerical features up to degree 3, excluding window variables)
* One-hot-encoding of categorical features
* Log transforming the target variable
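The cells in this notebook do not show where the log transformation of the target is applied; it may be handled inside the helper functions. One possible way to do it, sketched here on synthetic data, is scikit-learn's `TransformedTargetRegressor`, which fits the regressor on the transformed target and converts predictions back to the original scale. The choice of `np.log1p` and `np.expm1` is an assumption, made so that zero sales would not cause problems.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic data in which log1p(y) is exactly linear in X
rng = np.random.default_rng(0)
X = rng.uniform(0, 2, size=(200, 1))
y = np.expm1(1.0 + 0.5 * X[:, 0])

# The regressor is fitted on log1p(y); predictions are mapped back with expm1
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log1p,
    inverse_func=np.expm1,
)
model.fit(X, y)
y_pred = model.predict(X)  # on the original scale of y

slope = model.regressor_.coef_[0]
```

The same wrapper also accepts a full `Pipeline` as its `regressor`, so it could be combined with a preprocessing step.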

In [None]:
num_tr = Pipeline(
    steps=[
        ("scaling", RobustScaler()),
        ("polyint", PolynomialFeatures(3, include_bias=False)),
    ]
)

In [None]:
cat_tr = Pipeline(
    steps=[("ohe", OneHotEncoder(drop="first", sparse_output=False))]
)

In [None]:
wind_tr = Pipeline(
    steps=[("scaling", RobustScaler())]
)

In [None]:
prepro = ColumnTransformer(
    transformers=[
        ("num", num_tr, numfeat),
        ("cat", cat_tr, catfeat),
        ("wind_tr", wind_tr, [
            "total_amount_lag_1", "total_amount_lag_2", "total_amount_lag_3",
            "total_amount_mean_1_7", "total_amount_mean_1_15",
            "total_amount_mean_1_30", "total_amount_std_1_4",
        ]),
    ]
)

### Time-series cross-validation

For more on how time-series k-fold cross validation is performed, please refer to the CatBoost model notebook.
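The exact scheme used in the project is described in the CatBoost notebook. As a generic illustration only, the sketch below shows the principle with scikit-learn's `TimeSeriesSplit` on synthetic data: each fold trains on earlier observations and validates on later ones. The data, the number of splits and the plain `LinearRegression` estimator are assumptions for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_validate

# Synthetic, time-ordered data: a linear trend plus noise
rng = np.random.default_rng(0)
n = 120
X = np.arange(n, dtype=float).reshape(-1, 1)
y = 3.0 * X[:, 0] + rng.normal(scale=1.0, size=n)

# Expanding-window splits: the training indices always precede the test indices
tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(X))

scores = cross_validate(
    LinearRegression(),
    X,
    y,
    cv=tscv,
    scoring=["r2", "neg_mean_absolute_percentage_error"],
)

# scikit-learn reports MAPE as a negated score, so flip the sign for reading
mape_per_fold = -scores["test_neg_mean_absolute_percentage_error"]
```

This kind of split only makes sense if the rows are sorted by time before they are passed to the splitter.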

In [25]:
train = d2wind.sample(frac=1, random_state=21)

In [32]:
x_train = train.reset_index().drop("total_amount", axis = 1).set_index("date")
y_train = train.reset_index()[["date","total_amount"]].set_index("date")

Model fit and prediction on the training data

In [None]:
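# Preprocessing and linear regression combined in one pipeline
lr = Pipeline(
    steps=[
        ("prepro", prepro),
        ("lr", LinearRegression()),
    ]
)
# x_train and y_train are the frames defined in the cross-validation section
lr.fit(x_train, y_train)
ytrainpred = lr.predict(x_train)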

Evaluating the model on the test data: prediction

In [None]:
ytestpred = pred_test(train, test, lr, numfeat, catfeat)

Fit statistics for train and test

In [None]:
fit_overview(y_train, ytrainpred, ytest, ytestpred)