# Predicting Units Sold by Store and Product

## Business Understanding

## Data Understanding

I received my data from the M5 Forecasting - Accuracy competition on Kaggle. It contains five files: calendar.csv, sales_train_validation.csv, sample_submission.csv, sell_prices.csv, and sales_train_evaluation.csv. The calendar.csv file contains the date for every day, as well as the weekday, month, and Wal-Mart week. It also records the events (such as holidays) that fall on each day and the days when SNAP is available in each state. The sales_train_validation.csv file contains the number of units sold every day for 3,049 different products across 10 different Wal-Mart stores in three different states, which gives 30,490 item-store series. The sales_train_evaluation.csv file contains the same information as the validation data, along with the true values for the 28-day forecast period. The sample_submission.csv file shows the format needed to submit the 28-day forecasts. The sell_prices.csv file contains the prices at which the items were sold for every Wal-Mart week.

In [1]:
# imports for notebook
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, MultiLabelBinarizer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor

In [329]:
# read in the dates and their recorded unique characteristics
dates = pd.read_csv('C:/Users/TWood/Downloads/m5-forecasting-accuracy/calendar.csv', parse_dates=[0])

In [330]:
# take a look at the dates df
dates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1969 entries, 0 to 1968
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   date          1969 non-null   datetime64[ns]
 1   wm_yr_wk      1969 non-null   int64         
 2   weekday       1969 non-null   object        
 3   wday          1969 non-null   int64         
 4   month         1969 non-null   int64         
 5   year          1969 non-null   int64         
 6   d             1969 non-null   object        
 7   event_name_1  162 non-null    object        
 8   event_type_1  162 non-null    object        
 9   event_name_2  5 non-null      object        
 10  event_type_2  5 non-null      object        
 11  snap_CA       1969 non-null   int64         
 12  snap_TX       1969 non-null   int64         
 13  snap_WI       1969 non-null   int64         
dtypes: datetime64[ns](1), int64(7), object(6)
memory usage: 215.5+ KB


In [331]:
# most days have no event, replace NaN with "None"
dates.replace(np.nan, 'None', inplace=True)

Some days have multiple events. A single one-hot encoded column cannot represent a day with more than one event, so the information needs to be in a format that the MultiLabelBinarizer can use. I'll make a new column that contains a list of all events on a given day.

In [332]:
# remove spaces from all the events names
dates['event_name_1'] = dates['event_name_1'].str.replace(' ', '')
dates['event_name_2'] = dates['event_name_2'].str.replace(' ', '')

In [333]:
# create event column that contains a string of both events with a space between
dates['event'] = dates['event_name_1'] + ' ' + dates['event_name_2']

In [334]:
# split will turn the string into a list of both events
dates['event'] = dates['event'].str.split()

In [335]:
# removes the second element from the list when it is None
dates['event'] = dates['event'].apply(lambda x: [x[0]] if x[1] == 'None' else x)

In [336]:
# instantiate the MultiLabelBinarizer and fit it to the event column
mlb = MultiLabelBinarizer()
mlb.fit(dates['event'])
values = pd.DataFrame(mlb.transform(dates['event']), columns=mlb.classes_)

In [337]:
# adds the encoded columns to the dates dataframe
dates = pd.concat([dates, values], axis=1)

In [338]:
# drops the redundant event columns from dates
dates.drop(columns=['event_name_1', 'event_type_1', 'event_name_2', 'event_type_2', 'event'], inplace=True)
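
The event lists can also be built without the string concatenation and split used above. This is a minimal sketch on a toy frame, assuming the same `event_name_1` and `event_name_2` columns as calendar.csv; the sample rows are illustrative, not taken from the data.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for calendar.csv (illustrative rows, not the real data)
toy = pd.DataFrame({
    'event_name_1': ['SuperBowl', np.nan, 'Easter'],
    'event_name_2': [np.nan, np.nan, 'OrthodoxEaster'],
})

# Collect the non-missing event names for each day into a list
toy['event'] = [
    [e for e in pair if pd.notna(e)]
    for pair in zip(toy['event_name_1'], toy['event_name_2'])
]

# One indicator column per distinct event name
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(toy['event']),
                       columns=mlb.classes_, index=toy.index)
print(encoded)
```

With this variant a day without events becomes an all-zero row, so no separate `None` column is produced.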

In [339]:
# instantiate the OneHotEncoder and fit it to weekday and month columns
# (newer scikit-learn releases use sparse_output=False and get_feature_names_out() instead)
ohe = OneHotEncoder(sparse=False)
ohe.fit(dates[['weekday', 'month']])
values = ohe.transform(dates[['weekday', 'month']])
values = pd.DataFrame(values, columns=ohe.get_feature_names())

In [340]:
# add the one hot encoded columns to the dates dataframe
dates = pd.concat([dates, values], axis=1)

In [341]:
# drop the unnecessary columns from the dates dataframe
dates.drop(columns=['wday', 'year', 'month', 'weekday'], inplace=True)

In [342]:
# read in the data for units sold
val = pd.read_csv('C:/Users/TWood/Downloads/m5-forecasting-accuracy/sales_train_validation.csv')

In [354]:
# select data only from store CA_1
CA1 = val[val['store_id'] == 'CA_1']

In [343]:
# select only the data from department FOODS_1 in store CA_1 
CA1_F1 = val[(val['store_id'] == 'CA_1') & (val['dept_id'] == 'FOODS_1')]

In [355]:
# reducing the unnecessary columns to make the melt faster
# (reassigning instead of using inplace=True avoids the SettingWithCopyWarning on the CA1 slice)
CA1 = CA1.drop(columns=['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id'])


In [356]:
# make an iterable array of all the unique ids 
items = CA1['id'].unique()

In [357]:
# convert the dataframe into long format
CA1_ts = CA1.melt(id_vars=['id'], var_name='d', value_name='sales')

In [358]:
# read in the data on the prices for the products
prices = pd.read_csv('C:/Users/TWood/Downloads/m5-forecasting-accuracy/sell_prices.csv')

In [359]:
# make a new id column that matches the format of the id column in CA1_ts
prices['id'] = prices['item_id'] + '_' + prices['store_id'] + '_validation'

In [360]:
# the sell prices are present for all 30490 items for the final week, matches length of val dataframe
(prices['wm_yr_wk'] == 11621).sum()

30490

In [361]:
# create a dataframe with the information from all dataframes
CA1_price = CA1_ts.merge(dates, on='d').merge(prices.drop(columns=['store_id', 'item_id']), on=['id', 'wm_yr_wk'], how='left')
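
Later cells work with a frame called `CA1_F1_price`, whose construction is not among the cells shown. One plausible way to build it, assuming it follows the same melt-and-merge steps applied to the FOODS_1 subset, is sketched here on toy data. The function name `build_long_frame` and all toy values are hypothetical.

```python
import pandas as pd

def build_long_frame(wide, dates, prices):
    """Melt a wide sales frame to long format and attach calendar and price data."""
    id_cols = ['item_id', 'dept_id', 'cat_id', 'store_id', 'state_id']
    long = (wide.drop(columns=id_cols)
                .melt(id_vars=['id'], var_name='d', value_name='sales'))
    # the left join on prices keeps days on which the item had no listed price
    return (long.merge(dates, on='d')
                .merge(prices.drop(columns=['store_id', 'item_id']),
                       on=['id', 'wm_yr_wk'], how='left'))

# Toy inputs shaped like the M5 files (values are made up)
wide = pd.DataFrame({
    'id': ['FOODS_1_001_CA_1_validation', 'FOODS_1_002_CA_1_validation'],
    'item_id': ['FOODS_1_001', 'FOODS_1_002'],
    'dept_id': ['FOODS_1', 'FOODS_1'],
    'cat_id': ['FOODS', 'FOODS'],
    'store_id': ['CA_1', 'CA_1'],
    'state_id': ['CA', 'CA'],
    'd_1': [3, 0],
    'd_2': [1, 2],
})
dates = pd.DataFrame({'d': ['d_1', 'd_2'], 'wm_yr_wk': [11101, 11101]})
prices = pd.DataFrame({
    'store_id': ['CA_1'], 'item_id': ['FOODS_1_001'],
    'wm_yr_wk': [11101], 'sell_price': [2.0],
})
prices['id'] = prices['item_id'] + '_' + prices['store_id'] + '_validation'

toy_price = build_long_frame(wide, dates, prices)
print(toy_price)
```

In the notebook's terms this would be called as `build_long_frame(CA1_F1, dates, prices)`, with `items` taken from the unique ids of that subset.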

## Exploratory Data Analysis

In [362]:
# summary statistics of the numerical variables
CA1_price.describe()

| | sales | wm_yr_wk | snap_CA | snap_TX | snap_WI | ChanukahEnd | Christmas | CincoDeMayo | ColumbusDay | Easter | ... | x1_4 | x1_5 | x1_6 | x1_7 | x1_8 | x1_9 | x1_10 | x1_11 | x1_12 | sell_price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5832737.0 | 5832737.0 | 5832737.0 | 5832737.0 | 5832737.0 | 5832737.0 | 5832737.0 | 5832737.0 | 5832737.0 | 5832737.0 | ... | 5832737.0 | 5832737.0 | 5832737.0 | 5832737.0 | 5832737.0 | 5832737.0 | 5832737.0 | 5832737.0 | 5832737.0 | 4702895.0 |
| mean | 1.319829 | 11339.19 | 0.3293257 | 0.3293257 | 0.3293257 | 0.002613696 | 0.002613696 | 0.002613696 | 0.002613696 | 0.003136435 | ... | 0.09095661 | 0.08102457 | 0.07841087 | 0.08102457 | 0.08102457 | 0.07841087 | 0.08102457 | 0.07841087 | 0.08102457 | 4.411276 |
| std | 4.058652 | 150.3742 | 0.4699684 | 0.4699684 | 0.4699684 | 0.05105747 | 0.05105747 | 0.05105747 | 0.05105747 | 0.05591599 | ... | 0.2875474 | 0.2728729 | 0.2688171 | 0.2728729 | 0.2728729 | 0.2688171 | 0.2728729 | 0.2688171 | 0.2728729 | 3.395051 |
| min | 0.0 | 11101.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.01 |
| 25% | 0.0 | 11217.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.22 |
| 50% | 0.0 | 11333.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.47 |
| 75% | 1.0 | 11448.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.77 |
| max | 648.0 | 11613.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 30.98 |


In [363]:
# summary statistics for non numerical values
CA1_F1_price.describe(include='object')

| | id | d |
|---|---|---|
| count | 413208 | 413208 |
| unique | 216 | 1913 |
| top | FOODS_1_215_CA_1_validation | d_41 |
| freq | 1913 | 216 |


In [156]:
# number of rows on the first day, one per FOODS_1 item in CA_1
(CA1_F1_price['date'] == '2011-01-29').sum()

216

In [158]:
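# assuming the rows are ordered by day with 216 items per day, shifting by 216 rows
# pairs each row with the same item's sales on the previous day, so despite its name
# this column holds a 1-day lag (a 28-day lag would need a shift of 216 * 28 rows)
CA1_F1_price['lag_28'] = CA1_F1_price['sales'].shift(periods=216)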
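
A positional shift only lines up with the same item when every day has the same number of rows in the same order. A per-item lag that doesn't depend on row layout can be sketched with `groupby`; the toy frame and the column names `lag_1`, `lag_2`, and `lag_1_positional` below are illustrative only.

```python
import pandas as pd

# Toy long-format frame: 2 items x 4 days, rows ordered by day then item
toy = pd.DataFrame({
    'id':    ['a', 'b'] * 4,
    'day':   [1, 1, 2, 2, 3, 3, 4, 4],
    'sales': [5, 0, 3, 1, 4, 2, 6, 0],
})

# Positional shift: 2 rows back equals one day back only because
# every day has exactly 2 rows in the same item order
toy['lag_1_positional'] = toy['sales'].shift(2)

# Group-wise shift: k days back for each item, independent of row layout
toy = toy.sort_values(['id', 'day'])
toy['lag_1'] = toy.groupby('id')['sales'].shift(1)
toy['lag_2'] = toy.groupby('id')['sales'].shift(2)
print(toy.sort_values(['day', 'id']))
```

For the 28-day horizon, `shift(28)` on the grouped series would give a feature whose value is already known for every day in the forecast window.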

In [219]:
# naive baseline: use the lagged sales as the forecast for the last 28 days
preds = []
trues = []
for item in items:
    ts = CA1_F1_price[CA1_F1_price['id'] == item]
    test = ts['2016-03-28':]
    trues.append(test['sales'])
    preds.append(test['lag_28'])

In [220]:
RMSEs = []
for i in range(216):
    RMSEs.append(mean_squared_error(trues[i], preds[i], squared=False))

In [221]:
np.mean(RMSEs)

1.8183114470006307

In [312]:
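# rolling mean over the whole column; consecutive rows belong to different items
# on the same day, so each window mixes items
CA1_F1_price['lag_28'].rolling(8).mean()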

date
2011-01-30      NaN
2011-01-30      NaN
2011-01-30      NaN
2011-01-30      NaN
2011-01-30      NaN
              ...  
2016-04-24    0.000
2016-04-24    0.375
2016-04-24    0.875
2016-04-24    1.875
2016-04-24    2.125
Name: lag_28, Length: 351815, dtype: float64

In [319]:
CA1_F1_price['lag_28'].tail(28)

date
2016-04-24    0.0
2016-04-24    0.0
2016-04-24    3.0
2016-04-24    1.0
2016-04-24    0.0
2016-04-24    0.0
2016-04-24    2.0
2016-04-24    0.0
2016-04-24    6.0
2016-04-24    6.0
2016-04-24    0.0
2016-04-24    4.0
2016-04-24    3.0
2016-04-24    2.0
2016-04-24    8.0
2016-04-24    3.0
2016-04-24    0.0
2016-04-24    0.0
2016-04-24    0.0
2016-04-24    0.0
2016-04-24    0.0
2016-04-24    0.0
2016-04-24    0.0
2016-04-24    0.0
2016-04-24    3.0
2016-04-24    4.0
2016-04-24    8.0
2016-04-24    2.0
Name: lag_28, dtype: float64

In [318]:
CA1_F1_price[CA1_F1_price['id'] == 'FOODS_1_219_CA_1_validation']['lag_28'].tail(28)

date
2016-03-28    3.0
2016-03-29    6.0
2016-03-30    5.0
2016-03-31    3.0
2016-04-01    0.0
2016-04-02    3.0
2016-04-03    8.0
2016-04-04    5.0
2016-04-05    2.0
2016-04-06    4.0
2016-04-07    1.0
2016-04-08    3.0
2016-04-09    1.0
2016-04-10    4.0
2016-04-11    6.0
2016-04-12    4.0
2016-04-13    2.0
2016-04-14    1.0
2016-04-15    4.0
2016-04-16    4.0
2016-04-17    3.0
2016-04-18    6.0
2016-04-19    6.0
2016-04-20    1.0
2016-04-21    0.0
2016-04-22    1.0
2016-04-23    4.0
2016-04-24    2.0
Name: lag_28, dtype: float64

In [28]:
# 61292 missing price values
CA1_F1_price['sell_price'].isna().sum()

61292

In [29]:
# Every single time the price is missing, there are no sales
((CA1_F1_price['sell_price'].isna())&(CA1_F1_price['sales'] == 0)).sum()

61292
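
The two counts match, which suggests that a missing price marks a day on which the item was not on sale. A compact way to check the same implication, sketched here on made-up rows, is to test that sales are zero wherever the price is missing; dropping on `sell_price` alone then removes only those rows.

```python
import numpy as np
import pandas as pd

# Made-up rows: the price is missing on the first two days
toy = pd.DataFrame({
    'sales':      [0, 0, 2, 0, 1],
    'sell_price': [np.nan, np.nan, 1.5, 1.5, 1.5],
})

missing = toy['sell_price'].isna()

# True when every row with a missing price also has zero sales
all_missing_are_zero = bool((toy.loc[missing, 'sales'] == 0).all())
print(all_missing_are_zero)

# Drop only on the price column so NaNs in other columns are left alone
priced = toy.dropna(subset=['sell_price'])
print(len(priced))
```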

In [163]:
# drop rows with any missing value (days without a listed price and rows without a lag value)
CA1_F1_price.dropna(inplace=True)

In [164]:
# convert the day label (e.g. 'd_41') into an integer day number
CA1_F1_price['d'] = CA1_F1_price['d'].str.replace('d_', '').astype(int)

In [171]:
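# drop calendar columns that are not used as features; errors='ignore' skips any already removed
CA1_F1_price.drop(columns=['wm_yr_wk', 'weekday', 'wday', 'month', 'year', 'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2', 'event'], inplace=True, errors='ignore')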

In [176]:
# index by date so each series can be sliced by date range
CA1_F1_price.set_index('date', inplace=True)

In [209]:
# random forest with default settings, one model per item
preds = []
trues = []
train_scores = []
test_scores = []
importances = []
for item in items:
    ts = CA1_F1_price[CA1_F1_price['id'] == item].drop(columns=['id'])
    # label-based slicing is inclusive at both ends, so 2016-03-28 falls in
    # both train and test; ending train at '2016-03-27' would avoid the overlap
    train = ts[:'2016-03-28']
    test = ts['2016-03-28':]
    X_train = train.drop('sales', axis=1)
    X_test = test.drop('sales', axis=1)
    y_train = train['sales']
    y_test = test['sales']
    rf = RandomForestRegressor().fit(X_train, y_train)
    preds.append(rf.predict(X_test))
    trues.append(y_test)
    # score() returns the R^2 of the fit
    train_scores.append(rf.score(X_train, y_train))
    test_scores.append(rf.score(X_test, y_test))
    importances.append(rf.feature_importances_)

In [215]:
np.mean(train_scores)

0.5524568116848505

In [216]:
np.mean(test_scores)

-0.22585515756620791

In [217]:
RMSEs = []
for i in range(216):
    RMSEs.append(mean_squared_error(trues[i], preds[i], squared=False))

In [218]:
np.mean(RMSEs)

1.4917644039246414
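
The fit-and-score loop above is repeated for each model that follows. One way to factor it into a helper is sketched below. The function name `evaluate_per_item`, the `make_model` argument, and the synthetic data are assumptions for illustration, the split uses strict date comparisons so train and test don't overlap, and RMSE is computed as the square root of the MSE so the sketch does not rely on the `squared` argument.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def evaluate_per_item(df, items, make_model, split_date):
    """Fit one model per item and return the mean test RMSE across items.

    Rows strictly before split_date are used for training, the rest for testing.
    """
    rmses = []
    for item in items:
        ts = df[df['id'] == item].drop(columns=['id'])
        train = ts[ts.index < split_date]
        test = ts[ts.index >= split_date]
        model = make_model().fit(train.drop(columns='sales'), train['sales'])
        # clip negative predictions to zero, since unit sales cannot be negative
        preds = np.clip(model.predict(test.drop(columns='sales')), 0, None)
        rmses.append(np.sqrt(mean_squared_error(test['sales'], preds)))
    return float(np.mean(rmses))

# Synthetic data: 2 items, 60 days each (random values, not M5 data)
rng = np.random.default_rng(0)
days = pd.date_range('2016-02-25', periods=60)
frames = []
for item in ['a', 'b']:
    frames.append(pd.DataFrame({
        'id': item,
        'sell_price': rng.uniform(1, 5, size=60),
        'sales': rng.poisson(1.0, size=60),
    }, index=days))
toy = pd.concat(frames)

score = evaluate_per_item(
    toy, ['a', 'b'],
    make_model=lambda: RandomForestRegressor(n_estimators=20, random_state=0),
    split_date=pd.Timestamp('2016-03-28'),
)
print(score)
```

Passing a different `make_model`, for example `lambda: LGBMRegressor(objective='tweedie')`, would reuse the same split and metric for each configuration.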

In [227]:
# XGBoost with default settings, one model per item
preds = []
trues = []
train_scores = []
test_scores = []
importances = []
for item in items:
    ts = CA1_F1_price[CA1_F1_price['id'] == item].drop(columns=['id'])
    train = ts[:'2016-03-28']
    test = ts['2016-03-28':]
    X_train = train.drop('sales', axis=1)
    X_test = test.drop('sales', axis=1)
    y_train = train['sales']
    y_test = test['sales']
    xgb = XGBRegressor().fit(X_train, y_train)
    preds.append(xgb.predict(X_test))
    trues.append(y_test)
    train_scores.append(xgb.score(X_train, y_train))
    test_scores.append(xgb.score(X_test, y_test))
    importances.append(xgb.feature_importances_)

In [228]:
np.mean(train_scores)

0.8949888703594413

In [229]:
np.mean(test_scores)

-1.26556087486873

In [230]:
RMSEs = []
for i in range(216):
    RMSEs.append(mean_squared_error(trues[i], preds[i], squared=False))

In [231]:
np.mean(RMSEs)

1.9247628696121823

In [214]:
# LightGBM with default settings, one model per item
preds = []
trues = []
train_scores = []
test_scores = []
importances = []
for item in items:
    ts = CA1_F1_price[CA1_F1_price['id'] == item].drop(columns=['id'])
    train = ts[:'2016-03-28']
    test = ts['2016-03-28':]
    X_train = train.drop('sales', axis=1)
    X_test = test.drop('sales', axis=1)
    y_train = train['sales']
    y_test = test['sales']
    lgbm = LGBMRegressor().fit(X_train, y_train)
    preds.append(lgbm.predict(X_test))
    trues.append(y_test)
    train_scores.append(lgbm.score(X_train, y_train))
    test_scores.append(lgbm.score(X_test, y_test))
    importances.append(lgbm.feature_importances_)

In [193]:
# unit sales cannot be negative, so set negative predictions to zero
for pred in preds:
    pred[pred < 0] = 0

In [189]:
preds

[array([0.43800372, 0.56157198, 0.84040732, 0.64755176, 0.62397097,
        0.96548823, 0.32786325, 1.16319308, 0.47925759, 1.35713543,
        0.43927371, 0.84690325, 1.03046754, 0.98449991, 0.76559798,
        0.76465471, 0.77963582, 0.35304895, 0.72303992, 1.0299443 ,
        0.82713113, 0.47765905, 0.65423307, 1.15164699, 0.33448149,
        0.45613931, 0.83469697, 0.74291507]),
 array([ 0.20622213,  0.3360104 ,  0.53901561,  0.2896914 ,  0.47699873,
         0.00872392,  0.46698587,  0.36267727,  0.12215601,  0.09284693,
         0.19834826,  0.21799892,  0.29491186,  0.26114442,  0.22544538,
        -0.07998771,  0.50354172,  0.57701413,  0.34521969,  0.44153373,
         0.05676968,  0.09683433,  0.43524591,  0.50660791,  0.30980568,
         0.31155182,  0.46280622,  0.41406952]),
 array([1.44042803, 0.34040306, 0.90979297, 0.39985395, 0.33229313,
        0.46246364, 1.12813185, 1.40480432, 0.09306718, 0.68117959,
        0.28083839, 0.69567872, 1.08456977, 0.74986547, 0.902208

In [190]:
trues

[date
 2016-03-28    2
 2016-03-29    1
 2016-03-30    1
 2016-03-31    0
 2016-04-01    4
 2016-04-02    0
 2016-04-03    0
 2016-04-04    4
 2016-04-05    1
 2016-04-06    3
 2016-04-07    0
 2016-04-08    1
 2016-04-09    0
 2016-04-10    2
 2016-04-11    2
 2016-04-12    0
 2016-04-13    1
 2016-04-14    1
 2016-04-15    0
 2016-04-16    2
 2016-04-17    0
 2016-04-18    4
 2016-04-19    1
 2016-04-20    1
 2016-04-21    0
 2016-04-22    1
 2016-04-23    1
 2016-04-24    0
 Name: sales, dtype: int64,
 date
 2016-03-28    0
 2016-03-29    1
 2016-03-30    0
 2016-03-31    0
 2016-04-01    0
 2016-04-02    0
 2016-04-03    0
 2016-04-04    0
 2016-04-05    0
 2016-04-06    1
 2016-04-07    0
 2016-04-08    0
 2016-04-09    0
 2016-04-10    0
 2016-04-11    1
 2016-04-12    0
 2016-04-13    0
 2016-04-14    1
 2016-04-15    1
 2016-04-16    3
 2016-04-17    1
 2016-04-18    0
 2016-04-19    0
 2016-04-20    1
 2016-04-21    2
 2016-04-22    0
 2016-04-23    0
 2016-04-24    0
 Name: s

In [194]:
RMSEs = []
for i in range(216):
    RMSEs.append(mean_squared_error(trues[i], preds[i], squared=False))

In [195]:
np.mean(RMSEs)

1.635301116269365

In [196]:
importances

[array([280, 230, 220,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,  58,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0, 109, 105,  93, 103,  60,
         72, 103,  36,  53,  73,  46,  67,  81,  54, 103,  40,  45,  64,
         73, 218, 614]),
 array([292, 221, 244,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,  68,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,  86, 101,  89, 117, 117,
         73,  64,  49,  85,  63,  61,  80,  90,  70,  72,  57,  49,  66,
         46, 360, 380]),
 array([267, 249, 229,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,   0,  87,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0,  63, 105,  88,  77,  81,
         62,  60,  59,  68,  55,  41, 101,  54,  46,  67,  58,  71,  30,
         63, 230, 689]),
 array([ 163,  118,  123,    0,    0,    0,    0,

In [222]:
# LightGBM with a Tweedie objective, often used for non-negative targets with many zeros
preds = []
trues = []
train_scores = []
test_scores = []
importances = []
for item in items:
    ts = CA1_F1_price[CA1_F1_price['id'] == item].drop(columns=['id'])
    train = ts[:'2016-03-28']
    test = ts['2016-03-28':]
    X_train = train.drop('sales', axis=1)
    X_test = test.drop('sales', axis=1)
    y_train = train['sales']
    y_test = test['sales']
    lgbm2 = LGBMRegressor(objective='tweedie').fit(X_train, y_train)
    preds.append(lgbm2.predict(X_test))
    trues.append(y_test)
    train_scores.append(lgbm2.score(X_train, y_train))
    test_scores.append(lgbm2.score(X_test, y_test))
    importances.append(lgbm2.feature_importances_)

In [223]:
np.mean(train_scores)

0.6183382619086748

In [224]:
np.mean(test_scores)

-0.23935959329311823

In [225]:
RMSEs = []
for i in range(216):
    RMSEs.append(mean_squared_error(trues[i], preds[i], squared=False))

In [226]:
np.mean(RMSEs)

1.481867889453839

In [240]:
# Tweedie objective with tree depth limited to 5
preds = []
trues = []
train_scores = []
test_scores = []
importances = []
for item in items:
    ts = CA1_F1_price[CA1_F1_price['id'] == item].drop(columns=['id'])
    train = ts[:'2016-03-28']
    test = ts['2016-03-28':]
    X_train = train.drop('sales', axis=1)
    X_test = test.drop('sales', axis=1)
    y_train = train['sales']
    y_test = test['sales']
    lgbm3 = LGBMRegressor(objective='tweedie', max_depth=5).fit(X_train, y_train)
    preds.append(lgbm3.predict(X_test))
    trues.append(y_test)
    train_scores.append(lgbm3.score(X_train, y_train))
    test_scores.append(lgbm3.score(X_test, y_test))
    importances.append(lgbm3.feature_importances_)

In [241]:
np.mean(train_scores)

0.4231629876044877

In [242]:
np.mean(test_scores)

-0.2049734069759267

In [243]:
RMSEs = []
for i in range(216):
    RMSEs.append(mean_squared_error(trues[i], preds[i], squared=False))

In [244]:
np.mean(RMSEs)

1.4627363396686461

In [268]:
# add L1 and L2 regularization (reg_alpha and reg_lambda)
preds = []
trues = []
train_scores = []
test_scores = []
importances = []
for item in items:
    ts = CA1_F1_price[CA1_F1_price['id'] == item].drop(columns=['id'])
    train = ts[:'2016-03-28']
    test = ts['2016-03-28':]
    X_train = train.drop('sales', axis=1)
    X_test = test.drop('sales', axis=1)
    y_train = train['sales']
    y_test = test['sales']
    lgbm4 = LGBMRegressor(objective='tweedie', max_depth=5, reg_alpha=5, reg_lambda=5).fit(X_train, y_train)
    preds.append(lgbm4.predict(X_test))
    trues.append(y_test)
    train_scores.append(lgbm4.score(X_train, y_train))
    test_scores.append(lgbm4.score(X_test, y_test))
    importances.append(lgbm4.feature_importances_)

In [269]:
np.mean(train_scores)

0.3374336477148698

In [270]:
np.mean(test_scores)

-0.12552314049765806

In [272]:
RMSEs = []
for i in range(216):
    RMSEs.append(mean_squared_error(trues[i], preds[i], squared=False))

In [273]:
np.mean(RMSEs)

1.4216511358671393

In [302]:
# deeper trees (max_depth=8) with the same regularization
preds = []
trues = []
train_scores = []
test_scores = []
importances = []
for item in items:
    ts = CA1_F1_price[CA1_F1_price['id'] == item].drop(columns=['id'])
    train = ts[:'2016-03-28']
    test = ts['2016-03-28':]
    X_train = train.drop('sales', axis=1)
    X_test = test.drop('sales', axis=1)
    y_train = train['sales']
    y_test = test['sales']
    lgbm5 = LGBMRegressor(objective='tweedie', max_depth=8, reg_alpha=5, reg_lambda=5).fit(X_train, y_train)
    preds.append(lgbm5.predict(X_test))
    trues.append(y_test)
    train_scores.append(lgbm5.score(X_train, y_train))
    test_scores.append(lgbm5.score(X_test, y_test))
    importances.append(lgbm5.feature_importances_)

In [303]:
np.mean(train_scores)

0.3878222547786668

In [304]:
np.mean(test_scores)

-0.1262605434239212

In [305]:
RMSEs = []
for i in range(216):
    RMSEs.append(mean_squared_error(trues[i], preds[i], squared=False))

In [306]:
np.mean(RMSEs)

1.4234435435576729

In [307]:
# more leaves per tree (num_leaves=63) and stronger regularization
preds = []
trues = []
train_scores = []
test_scores = []
importances = []
for item in items:
    ts = CA1_F1_price[CA1_F1_price['id'] == item].drop(columns=['id'])
    train = ts[:'2016-03-28']
    test = ts['2016-03-28':]
    X_train = train.drop('sales', axis=1)
    X_test = test.drop('sales', axis=1)
    y_train = train['sales']
    y_test = test['sales']
    lgbm6 = LGBMRegressor(objective='tweedie', max_depth=8, num_leaves=63, reg_alpha=10, reg_lambda=10).fit(X_train, y_train)
    preds.append(lgbm6.predict(X_test))
    trues.append(y_test)
    train_scores.append(lgbm6.score(X_train, y_train))
    test_scores.append(lgbm6.score(X_test, y_test))
    importances.append(lgbm6.feature_importances_)

In [308]:
np.mean(train_scores)

0.30332831727692455

In [309]:
np.mean(test_scores)

-0.11649224969706638

In [310]:
RMSEs = []
for i in range(216):
    RMSEs.append(mean_squared_error(trues[i], preds[i], squared=False))

In [311]:
np.mean(RMSEs)

1.409228119188317

In [320]:
# same settings as the previous model, but with 500 trees (n_estimators=500)
preds = []
trues = []
train_scores = []
test_scores = []
importances = []
for item in items:
    ts = CA1_F1_price[CA1_F1_price['id'] == item].drop(columns=['id'])
    train = ts[:'2016-03-28']
    test = ts['2016-03-28':]
    X_train = train.drop('sales', axis=1)
    X_test = test.drop('sales', axis=1)
    y_train = train['sales']
    y_test = test['sales']
    lgbm6 = LGBMRegressor(objective='tweedie', n_estimators=500, max_depth=8, num_leaves=63, reg_alpha=10, reg_lambda=10).fit(X_train, y_train)
    preds.append(lgbm6.predict(X_test))
    trues.append(y_test)
    train_scores.append(lgbm6.score(X_train, y_train))
    test_scores.append(lgbm6.score(X_test, y_test))
    importances.append(lgbm6.feature_importances_)

In [321]:
np.mean(train_scores)

0.3073240245852358

In [322]:
np.mean(test_scores)

-0.11645447882040985

In [323]:
RMSEs = []
for i in range(216):
    RMSEs.append(mean_squared_error(trues[i], preds[i], squared=False))

In [324]:
np.mean(RMSEs)

1.4104239391420565

## Thoughts

- Need to take into account different types of items: if Christmas items are sold at a reduced price after Christmas, do those sales actually reflect the demand for the product?
- Need to take into account marketing, end-caps, and how close to eye level the products are placed: how do these affect demand?
- Need to take into account inventory: sometimes a product has high demand, but a lack of inventory means sales fail to reflect the full demand, as with Oatly milk.