# ARIMA Forecasting for Total Monthly Sales 


Citation: https://machinelearningmastery.com/introduction-to-time-series-forecasting-with-python/

I used the above book by Jason Brownlee, Ph.D. as a reference for building the model below. Some of the code snippets are taken from the book.

In [None]:
# Import libraries

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# Write the csv files to dataframes
items=pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')
item_categories=pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv')
shops=pd.read_csv('../input/competitive-data-science-predict-future-sales/shops.csv')
test=pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv')
# Dates in sales_train.csv are written as dd.mm.yyyy, so parse them day-first
sales_train=pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv', parse_dates=['date'], dayfirst=True)

Understanding the Data


In [None]:
#print first 5 rows from each file
print(items.head())
print(item_categories.head())
print(shops.head())
print(test.head())
print(sales_train.head())

In [None]:
#print the total number of rows
print(items.count())
print(item_categories.count())
print(shops.count())


There are 22,170 items in 84 categories, and 60 shops in total.
The shops file maps each shop name to its shop id, and the items file maps each item name to its item_id. Neither adds much value here, since we can refer to shops and items by their ids.
The item_categories file gives the category names. The category of each item could be a useful feature, but we ignore it for now in our prediction. The goal is to predict next month's item count for each shop id/item id pair.


In [None]:
#Compare shop id and item id values in the train and test data
print(test.head())
print(sales_train.head())
print(sales_train.count(axis=0))

sales_train.describe()

In [None]:

print('Test shop ids unique: ',test.shop_id.nunique())

print('Train shop ids unique: ',sales_train.shop_id.nunique())

print('Test unique item_ids: ',test.item_id.nunique())
print('Train unique item_ids: ',sales_train.item_id.nunique())


Not all the shop id/item id combinations that exist in the sales file exist in the test file, so we only need forecasts for the shop id/item id pairs in the test file. First let's get the overall trend across all the stores, and then we can model individual shop ids and item ids.
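
One way to check this overlap is a merge with `indicator=True`. The sketch below uses small made-up frames (`toy_train`, `toy_test`) in place of `sales_train` and `test`; the column names are assumed to match the competition files.

```python
import pandas as pd

# Toy stand-ins for sales_train and test (values are illustrative only)
toy_train = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 3],
    "item_id": [10, 11, 10, 10, 12],
})
toy_test = pd.DataFrame({
    "shop_id": [1, 2, 3, 3],
    "item_id": [10, 11, 12, 13],
})

# Unique shop/item pairs that have at least one training sale
train_pairs = toy_train[["shop_id", "item_id"]].drop_duplicates()

# A left merge with indicator=True flags test pairs that have no training history
coverage = toy_test.merge(train_pairs, on=["shop_id", "item_id"], how="left", indicator=True)
coverage["has_history"] = coverage["_merge"] == "both"
n_with_history = int(coverage["has_history"].sum())
n_without_history = int((~coverage["has_history"]).sum())

# An inner merge keeps only the training rows whose pair appears in the test file
test_pairs = toy_test[["shop_id", "item_id"]].drop_duplicates()
train_in_test = toy_train.merge(test_pairs, on=["shop_id", "item_id"], how="inner")

print(n_with_history, n_without_history, len(train_in_test))
```

Test pairs without any history cannot be modelled from their own series, so they need a fallback value such as 0.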

In [None]:
#Check for any null values

print(sales_train.isnull().sum())

#Check for duplicates 

df_items_dup=items[items.duplicated(subset='item_name')]
df_shops_dup=shops[shops.duplicated(subset='shop_name')]
print(df_items_dup)
print(df_shops_dup)



Checking for overall storewide trend

In [None]:
# Total monthly sales: sum item_cnt_day within each date_block_num
sales_monthly=sales_train.groupby(["date_block_num"])["item_cnt_day"].sum()
# Label the 34 blocks with month-end dates, starting from January 2013
sales_monthly.index = pd.date_range('1/1/2013', periods=34, freq='M')
sales_monthly


The data is now aggregated by month, giving 34 records. The date block number runs from 0 to 33, covering 34 months (Jan 2013 to Oct 2015).
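
The block-to-month mapping can be checked with a short sketch. It assumes block 0 is January 2013, as stated above; `block_to_month` is a name introduced here for illustration.

```python
import pandas as pd

# date_block_num runs from 0 to 33; block 0 is assumed to be January 2013
months = pd.period_range("2013-01", periods=34, freq="M")
block_to_month = dict(enumerate(months))

first_month = str(block_to_month[0])
last_month = str(block_to_month[33])
print(first_month, last_month)
```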

In [None]:
sales_monthly.describe()

In [None]:
# Line plot of the time series
from matplotlib import pyplot
sales_monthly.plot()
pyplot.show()


Plotting the monthly sales against date shows a decreasing trend. The plot also shows seasonality.
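
Trend and seasonality can also be looked at numerically. The sketch below runs on a synthetic series (a linear decline with a December spike, not the competition data): a 12-month rolling mean smooths out the yearly cycle, and a per-calendar-month average shows the seasonal profile.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear decline plus a December spike (illustrative only)
idx = pd.period_range("2013-01", periods=34, freq="M")
values = 150_000 - 2_000 * np.arange(34) + np.where(idx.month == 12, 40_000, 0)
toy_sales = pd.Series(values, index=idx, dtype="float64")

# A 12-month rolling mean averages out the yearly cycle and exposes the trend
trend = toy_sales.rolling(window=12).mean().dropna()

# Averaging by calendar month exposes the seasonal pattern
month_profile = toy_sales.groupby(toy_sales.index.month).mean()
peak_month = int(month_profile.idxmax())

print(trend.iloc[0], trend.iloc[-1])
print(peak_month)
```

With fewer than three full years, the per-month averages rest on only two or three values each, so they are a rough indication rather than a firm estimate.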

In [None]:
#Density plot for the time series
pyplot.subplot(211)
sales_monthly.hist()
pyplot.subplot(212)
sales_monthly.plot(kind='kde')
pyplot.show()

The distribution is close to Gaussian but not exactly: it has a long tail, so data transformations are worth exploring.
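
As one possible transformation, a log transform often shortens a long right tail. The sketch below compares the sample skewness before and after taking logs on made-up values; `sample_skewness` is a helper written for this example, and whether a log transform helps the real series would still need to be checked.

```python
import numpy as np

def sample_skewness(x):
    """Biased sample skewness: the mean of the cubed z-scores."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

# Illustrative right-skewed values (not the competition data)
toy = np.array([60, 65, 70, 72, 75, 80, 85, 90, 120, 180], dtype=float) * 1000

skew_raw = sample_skewness(toy)
skew_log = sample_skewness(np.log(toy))
print(round(skew_raw, 2), round(skew_log, 2))
```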

In [None]:
#ADF test 
from statsmodels.tsa.stattools import adfuller
X = sales_monthly.values
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

The ADF statistic is greater than the 5% critical value, so we cannot reject the null hypothesis of a unit root: the data is not stationary. Let's difference the data and check the ADF statistic again.
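
The comparison behind this conclusion can be written as a small helper. This is a sketch: `adf_rejects_unit_root` is a hypothetical name, and the numbers are illustrative, not output from this notebook. The dictionary has the same shape as `result[4]` returned by `adfuller`.

```python
def adf_rejects_unit_root(adf_statistic, critical_values, level="5%"):
    """Return True when the ADF statistic is below the critical value.

    The null hypothesis of the ADF test is that the series has a unit root
    (is non-stationary). A statistic more negative than the critical value
    rejects that hypothesis at the chosen level.
    """
    return adf_statistic < critical_values[level]

# Illustrative numbers only
example_critical = {"1%": -3.65, "5%": -2.96, "10%": -2.62}
not_rejected = adf_rejects_unit_root(-2.40, example_critical)  # statistic above the 5% value
rejected = adf_rejects_unit_root(-3.50, example_critical)      # statistic below the 5% value
print(not_rejected, rejected)
```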

In [None]:
#ADF test for stationarity after first-order differencing

from statsmodels.tsa.stattools import adfuller
from pandas import Series
# create a differenced time series
def difference(dataset):
    diff = list()
    for i in range(1, len(dataset)):
        value = dataset[i] - dataset[i - 1]
        diff.append(value)
    return Series(diff)

X = sales_monthly.values
# difference data
stationary = difference(X)
stationary.index = sales_monthly.index[1:]
# check if stationary
result = adfuller(stationary)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

The ADF statistic is now less than the 5% critical value, so the null hypothesis can be rejected: the differenced data is stationary. In the ARIMA model, the d value could therefore be 1.
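
The hand-written `difference` function computes a first-order difference, which NumPy and pandas also provide. A small sketch on made-up values shows that the three approaches agree:

```python
import numpy as np
import pandas as pd

toy = pd.Series([10.0, 12.0, 9.0, 15.0, 14.0])

def difference(dataset):
    """First-order difference, written out as in the cell above."""
    return [dataset[i] - dataset[i - 1] for i in range(1, len(dataset))]

manual = difference(toy.values)
with_numpy = np.diff(toy.values)
# Series.diff() leaves a NaN in the first position, which dropna() removes
with_pandas = toy.diff().dropna()

print(manual)
print(list(with_numpy))
print(list(with_pandas))
```

`Series.diff()` has the advantage of keeping the original date index, so the index does not need to be reattached afterwards.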

In [None]:
#ACF and PACF plots
from pandas import read_csv
from statsmodels.graphics.tsaplots import plot_acf
from statsmodels.graphics.tsaplots import plot_pacf
from matplotlib import pyplot
pyplot.figure()
pyplot.subplot(211)
plot_acf(sales_monthly, lags=20, ax=pyplot.gca())
pyplot.subplot(212)
plot_pacf(sales_monthly, lags=20, ax=pyplot.gca())
pyplot.show()

From the PACF plot, the AR order p seems to be close to 1, and from the ACF plot, the MA order q also seems to be close to 1.
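
To make the plot easier to read, here is a sketch of what the ACF bars represent. `sample_acf` is a helper written for this example, using the usual sample autocorrelation estimate, and the input is a synthetic declining series. Lags whose value lies outside roughly ±1.96/√n are the ones usually treated as significant.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    lags = [np.sum(centered[k:] * centered[:-k]) / denom for k in range(1, nlags + 1)]
    return np.array([1.0] + lags)

# Synthetic steadily declining series with 34 points (illustrative only)
toy = np.arange(34, 0, -1, dtype=float)
acf = sample_acf(toy, nlags=5)

# Approximate 95% band for a white-noise series of the same length
band = 1.96 / np.sqrt(len(toy))
significant_lags = [k for k in range(1, len(acf)) if abs(acf[k]) > band]
print(np.round(acf, 3), round(band, 3), significant_lags)
```

A trending series like this one shows slowly decaying autocorrelations, which is why the order is usually read from the plots of the differenced series.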

# Hyper Parameter Tuning

In [None]:
def evaluate_arima_model(X, arima_order):
    # prepare training dataset
    X = X.astype('float32')
    train_size = int(len(X) * 0.80)
    train, test = X[0:train_size], X[train_size:]
    history = [x for x in train]
    # make predictions
    predictions = list()
    for t in range(len(test)):
        model = ARIMA(history, order=arima_order)
        model_fit = model.fit(disp=0)
        yhat = model_fit.forecast()[0]
        predictions.append(yhat)
        history.append(test[t])
    # calculate out of sample error
    rmse = sqrt(mean_squared_error(test, predictions))
    return rmse
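
The function above uses walk-forward validation: it forecasts one step ahead, then adds the true value to the history before forecasting the next step. A persistence baseline (predict the last observed value) evaluated the same way gives a reference RMSE to compare the ARIMA scores against. The sketch below uses made-up numbers; `walk_forward_persistence_rmse` is a name introduced here.

```python
import numpy as np

def walk_forward_persistence_rmse(series, train_fraction=0.80):
    """Walk-forward RMSE of a forecast that repeats the last observed value."""
    series = np.asarray(series, dtype=float)
    train_size = int(len(series) * train_fraction)
    history = list(series[:train_size])
    test = series[train_size:]
    predictions = []
    for actual in test:
        # One-step forecast: the most recent observation
        predictions.append(history[-1])
        # Reveal the true value before the next step
        history.append(actual)
    errors = test - np.array(predictions)
    return float(np.sqrt(np.mean(errors ** 2)))

toy = [10, 12, 11, 13, 12, 14, 13, 15, 14, 16]
rmse = walk_forward_persistence_rmse(toy)
print(round(rmse, 3))
```

With only 34 monthly points, the 20% hold-out contains just seven months, so small RMSE differences between orders should be read with caution.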

In [None]:

def evaluate_models(dataset, p_values, d_values, q_values):
    dataset = dataset.astype('float32')
    best_score, best_cfg = float("inf"), None
    for p in p_values:
        for d in d_values:
            for q in q_values:
                order = (p,d,q)
                try:
                    rmse = evaluate_arima_model(dataset, order)
                    if rmse < best_score:
                        best_score, best_cfg = rmse, order
                    print('ARIMA%s RMSE=%.3f' % (order,rmse))
                except Exception:
                    # skip orders that fail to fit
                    continue
    print('Best ARIMA%s RMSE=%.3f' % (best_cfg, best_score))

In [None]:
import warnings
from pandas import read_csv
from statsmodels.tsa.arima_model import ARIMA
from sklearn.metrics import mean_squared_error
from math import sqrt
# evaluate parameters
p_values = range(0,5)
d_values = range(0, 2)
q_values = range(0, 5)
warnings.filterwarnings("ignore")
evaluate_models(sales_monthly.values, p_values, d_values, q_values)




The best ARIMA order is (0,1,0), which has no AR or MA terms and only first-order differencing, i.e. a random walk. Let's compare two orders: (0,1,0) and the next best, (1,1,0).

In [None]:
model = ARIMA(sales_monthly, order=(0,1,0))
model_fit = model.fit(disp=0)
yhat = model_fit.forecast()[0]
print(yhat)

In [None]:
model = ARIMA(sales_monthly, order=(1,1,0))
model_fit = model.fit(disp=0)
yhat = model_fit.forecast()[0]
print(yhat)

Both orders give similar predictions. Let us see what Auto ARIMA gives.

# Auto ARIMA implementation

In [None]:
#install auto arima packages
! pip install pmdarima

In [None]:
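#Run auto ARIMA with seasonal = True
# Note: pmdarima's seasonal period m defaults to 1, which makes the search
# non-seasonal even with seasonal=True; m=12 would be needed for a yearly cycle
from pmdarima import auto_arima
model = auto_arima(sales_monthly, seasonal = True, trace=True, error_action='ignore', suppress_warnings=True)
# auto_arima already returns a fitted model; this refits it on the same data
model.fit(sales_monthly)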
forecast = model.predict(n_periods=1)
print(forecast)

Auto ARIMA predicted next month's sales to be 71,056 across all stores.

# Auto ARIMA implemented for each shop_id/item_id combination

In [None]:
from pmdarima import auto_arima

#For the test file, add a new column to hold the projected value for item_count
test['item_cnt_month'] = 0.0

#for i in range(0,214200):
for i in range(0,5000):
    shop_id=test.iloc[i,1]
    item_id=test.iloc[i,2]
    # Sales history for this shop id/item id pair
    sales_shop=sales_train[sales_train['shop_id']==shop_id]
    sales_shop_item=sales_shop[sales_shop['item_id']==item_id]
    if sales_shop_item.empty:
        # No history for this pair: predict zero sales
        test.iloc[i,3]=0
        continue
    # Monthly totals by date_block_num (0 to 33), filling months without sales with 0
    sales_monthly_shop_item = (
        sales_shop_item.groupby('date_block_num')['item_cnt_day']
        .sum()
        .reindex(range(34), fill_value=0)
    )

    # auto_arima returns an already fitted model
    model = auto_arima(sales_monthly_shop_item.values, trace=True, error_action='ignore', suppress_warnings=True)
    forecast = model.predict(n_periods=1)
    # predict returns an array of length n_periods; store the single value
    test.iloc[i,3]=forecast[0]



##### Running for only the first 5000 records in the file, as all 214,200 records take a lot of time.
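
Part of the run time comes from filtering the full sales table once per test row. One possible speed-up, sketched below on made-up data, is to aggregate once into a table with one row per shop id/item id pair and one column per month, and then look each pair up. The model fitting itself would still dominate the run time. `toy_train` and `n_months` are illustrative stand-ins for `sales_train` and the 34 training months.

```python
import pandas as pd

# Toy stand-in for sales_train (values are illustrative only)
toy_train = pd.DataFrame({
    "date_block_num": [0, 0, 1, 3, 0, 2],
    "shop_id":        [1, 1, 1, 1, 2, 2],
    "item_id":        [10, 10, 10, 10, 11, 11],
    "item_cnt_day":   [1.0, 2.0, 1.0, 4.0, 5.0, 1.0],
})
n_months = 4  # would be 34 for the competition data

# One pass over the data: rows are (shop_id, item_id), columns are months
monthly = (
    toy_train.groupby(["shop_id", "item_id", "date_block_num"])["item_cnt_day"]
    .sum()
    .unstack("date_block_num", fill_value=0.0)
    .reindex(columns=list(range(n_months)), fill_value=0.0)
)

# Looking up one pair is now an index lookup instead of a full-table filter
series_1_10 = monthly.loc[(1, 10)].tolist()
series_2_11 = monthly.loc[(2, 11)].tolist()
has_unknown_pair = (3, 12) in monthly.index
print(series_1_10, series_2_11, has_unknown_pair)
```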

In [None]:

# Write the predictions without the extra pandas index column
test.to_csv('test_pred.csv', index=False)
test.head()

Limitations of the model:
1) We ignore some features such as item categories and item price. We need to check whether they affect the forecast.

2) The model is computationally expensive, as we loop over every shop id/item id pair and fit a separate model for each.

3) We need to check whether daily resampling gives better results than monthly resampling.

To implement in my next notebook

1) XGBoost to see if regression can be applied and improve the model

2) Prophet and hierarchical time series implementation, since this is a 4 level hierarchy (corp->shops->item categories->items)

3) Implement LSTM deep learning model



