### Statistical Tests - Regression

In this workbook, we validate the predictive power of selected features, which will then be used in classification/regression models.
<br>

Based on the exploratory analysis, the following features were identified:
- month
- dayofweek
- holiday vs normal day
- product category
- shop id

On top of that, we add **average monthly price** as a feature to test (assuming item_cnt is inversely related to item price).

Given that the test set for the submission asks for predictions at the monthly level, the dayofweek and holiday features are not applicable. We are left with month, product category, shop id, and average monthly price.

In [1]:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from datetime import datetime as dt
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import linear_model
sns.set()
%matplotlib inline 



In [2]:
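raw_df = pd.read_csv('data/train_df.csv', parse_dates=['date'])
# outlier removal steps as identified in exploratory analysis
id_out = [11, 14, 15, 12]
# blank out the outlier day for the affected categories
raw_df.loc[(raw_df.date == '2013-11-29') & (raw_df.item_category_id.isin(id_out)), 'sales'] = np.nan
# interpolate() returns a new object, so assign the result back for the fill to take effect
block_mask = (raw_df.date_block_num == 10) & (raw_df.item_category_id.isin(id_out))
raw_df.loc[block_mask, 'sales'] = raw_df.loc[block_mask, 'sales'].interpolate(method='linear')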
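
The outlier step relies on `interpolate` actually changing the stored values. pandas returns a new object from `interpolate` rather than modifying the data in place, so the result has to be assigned back. A minimal sketch on toy data (the frame `toy` and its values are hypothetical, not taken from `train_df.csv`):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) standing in for a short run of daily sales.
toy = pd.DataFrame({
    "day": [1, 2, 3, 4],
    "sales": [10.0, np.nan, 30.0, 40.0],
})

# Calling interpolate() without assignment leaves the frame unchanged.
toy["sales"].interpolate(method="linear")
still_missing = int(toy["sales"].isna().sum())

# Assigning the result back fills the gap with the linear midpoint.
toy["sales"] = toy["sales"].interpolate(method="linear")
filled_value = float(toy.loc[1, "sales"])

print(still_missing, filled_value)
```

Linear interpolation here works on row order, so the rows should be sorted in a meaningful order (for example by date) before filling.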

#### Convert daily dataframe into monthly

In [3]:
raw_df['month'] = raw_df['date'].dt.month

In [4]:
month_df = raw_df.groupby(['date_block_num', 'shop_id', 'item_category_id','month']).agg({'item_cnt_day':'sum','item_price':'mean'}).reset_index()
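
To make the aggregation concrete, here is a small sketch of the same `groupby`/`agg` pattern on a hypothetical four-row frame (`daily` and its values are invented for illustration). Daily counts are summed per month, shop and category, while prices are averaged:

```python
import pandas as pd

# Hypothetical daily rows: three in month block 0 and one in month block 1.
daily = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [5, 5, 5, 5],
    "item_category_id": [40, 40, 40, 40],
    "month": [1, 1, 1, 2],
    "item_cnt_day": [2.0, 1.0, 3.0, 4.0],
    "item_price": [100.0, 200.0, 300.0, 150.0],
})

# Sum the counts and average the prices within each monthly group.
monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_category_id", "month"])
    .agg({"item_cnt_day": "sum", "item_price": "mean"})
    .reset_index()
)

print(monthly)
```

Note that `mean` here is an unweighted average over the daily rows, not a sales-weighted price.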

In [5]:
month_df_flat = pd.get_dummies(columns=['shop_id', 'item_category_id', 'month'], drop_first=True, data=month_df)
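
As a small illustration of what `drop_first=True` does, the sketch below encodes a hypothetical `shop_id` column with three levels (the frame `toy` is invented for this example). The first level gets no column of its own and becomes the baseline against which the other coefficients are read:

```python
import pandas as pd

# Hypothetical frame with three shop levels: 0, 1 and 2.
toy = pd.DataFrame({
    "shop_id": [0, 1, 2, 1],
    "item_price": [10.0, 20.0, 30.0, 40.0],
})

# drop_first=True omits the dummy for the first level (shop_id 0).
encoded = pd.get_dummies(toy, columns=["shop_id"], drop_first=True)
cols = list(encoded.columns)

print(cols)
```

A row belonging to the baseline level is then represented by zeros in all remaining dummy columns.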

In [6]:
month_df_flat.head()

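| | date_block_num | item_cnt_day | item_price | shop_id_1 | shop_id_2 | shop_id_3 | shop_id_4 | shop_id_5 | shop_id_6 | shop_id_7 | ... | month_3 | month_4 | month_5 | month_6 | month_7 | month_8 | month_9 | month_10 | month_11 | month_12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 53.0 | 1938.688889 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 28.0 | 242.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 16.0 | 671.357143 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 28.0 | 855.653846 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 65.0 | 1573.490909 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |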


In [7]:
X = month_df_flat.iloc[:,2:]
y = month_df_flat['item_cnt_day']

In [8]:
m = sm.OLS(y, X).fit()
m.summary()

| | | | |
|---|---|---|---|
| Dep. Variable: | item_cnt_day | R-squared (uncentered): | 0.554 |
| Model: | OLS | Adj. R-squared (uncentered): | 0.553 |
| Method: | Least Squares | F-statistic: | 502.6 |
| Date: | Fri, 10 Apr 2020 | Prob (F-statistic): | 0.0 |
| Time: | 20:25:51 | Log-Likelihood: | -377290.0 |
| No. Observations: | 62345 | AIC: | 754900.0 |
| Df Residuals: | 62191 | BIC: | 756300.0 |
| Df Model: | 154 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| item_price | -0.0017 | 0.001 | -2.043 | 0.041 | -0.003 | -6.72e-05 |
| shop_id_1 | -50.5064 | 15.899 | -3.177 | 0.001 | -81.668 | -19.344 |
| shop_id_2 | -111.3798 | 11.086 | -10.047 | 0.000 | -133.108 | -89.652 |
| shop_id_3 | -113.5816 | 11.089 | -10.243 | 0.000 | -135.315 | -91.848 |
| shop_id_4 | -99.6824 | 11.054 | -9.018 | 0.000 | -121.349 | -78.016 |
| shop_id_5 | -99.9047 | 11.076 | -9.020 | 0.000 | -121.613 | -78.196 |
| shop_id_6 | -61.6136 | 10.999 | -5.602 | 0.000 | -83.172 | -40.055 |
| shop_id_7 | -82.5435 | 11.048 | -7.471 | 0.000 | -104.198 | -60.889 |
| shop_id_8 | -103.0438 | 14.347 | -7.182 | 0.000 | -131.165 | -74.923 |

| | | | |
|---|---|---|---|
| Omnibus: | 100399.808 | Durbin-Watson: | 2.074 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 147649332.484 |
| Skew: | 10.289 | Prob(JB): | 0.0 |
| Kurtosis: | 240.518 | Cond. No. | 764000.0 |


**Observations**: Item price and the majority of the categorical levels under shop, item_category, and month have low p-values and are statistically significant. Because the model was fitted without an intercept, the reported R-squared is uncentered (0.554): the features account for more than 50% of the uncentered variation in item_cnt, a figure that is not directly comparable to the usual centered R-squared. Given the seasonality of the data identified [before](https://github.com/sittingman/sales_forecast/blob/master/model_ts.ipynb), we can create lag features to improve the predictive power of the regression/classification models.
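
One possible way to build such lag features is a grouped `shift`. This is a sketch only, under the assumption that each shop/category pair has one row per consecutive month; the frame `monthly` and the column name `item_cnt_lag_1` are hypothetical:

```python
import pandas as pd

# Hypothetical monthly frame: two shops, one category, three months each.
monthly = pd.DataFrame({
    "date_block_num": [0, 1, 2, 0, 1, 2],
    "shop_id": [5, 5, 5, 6, 6, 6],
    "item_category_id": [40] * 6,
    "item_cnt_day": [10.0, 12.0, 9.0, 3.0, 4.0, 5.0],
})

# Sort so that shifting within each group moves back exactly one month.
monthly = monthly.sort_values(["shop_id", "item_category_id", "date_block_num"])

# Previous month's count for the same shop and category (NaN for the first month).
monthly["item_cnt_lag_1"] = (
    monthly.groupby(["shop_id", "item_category_id"])["item_cnt_day"].shift(1)
)

lags = monthly["item_cnt_lag_1"].tolist()
print(monthly)
```

If some months are missing for a group, a plain `shift` would pair a row with the previous available month rather than the previous calendar month, so gaps would need to be filled first.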