

This is my first ever Kaggle project. As I am a novice in time-series forecasting and Python programming, I have relied on a variety of sources for this project (listed under References at the end). Please let me know of any mistakes I have made, and of any good resources I can refer to in order to improve.

Task: Given 5 years of store-item sales data, predict 3 months of sales for 50 different items at 10 different stores.

This Kaggle kernel is arranged as follows:

1) Preliminary Data Analysis

2) ARIMA Model

3) SARIMA Model  

4) Final thoughts and conclusion 

In [None]:
import pandas as pd   #data analysis
import numpy as np
import matplotlib.pyplot as plt #graph visualisation
from datetime import datetime
from pandas import Series 
import seaborn as sns  #graph visualisation 
%matplotlib inline

sns.set_style("darkgrid")
sns.axes_style("darkgrid")

import warnings
warnings.filterwarnings("ignore")

# for accuracy and error calculation
from sklearn.metrics import mean_squared_error
from math import sqrt
from statsmodels.tsa.seasonal import seasonal_decompose #for conducting stationarity analysis 
# for conducting ARIMA and SARIMA models 
import statsmodels.api as sm
from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
import time 


## Loading the datasets

In [None]:
train=pd.read_csv("../input/demand-forecasting-kernels-only/train.csv",parse_dates=True,index_col=['date'])
test=pd.read_csv("../input/demand-forecasting-kernels-only/test.csv",parse_dates=True,index_col=['date'])


# train['date'] = pd.to_datetime(train['date'], format="%Y-%m-%d")

# train.head()

In [None]:
train.head()

In [None]:
train.shape,test.shape

In [None]:
# train_df = train_df.set_index('date')
train['year'] = train.index.year
train['month'] = train.index.month
train['day'] = train.index.day
train['day_of_week'] = train.index.dayofweek

train.head()
# Monday=0 and Sunday= 6

In [None]:
test['year'] = test.index.year
test['month'] = test.index.month
test['day'] = test.index.day
test['day_of_week'] = test.index.dayofweek

test.head()

## PRELIMINARY DATA ANALYSIS 

In [None]:
# sns.lineplot(x=train.index, y="sales",legend = 'full' , data=train[:28])
train['sales'].plot(figsize=(10,8))

The graph above shows a generally increasing trend in sales, with a pattern that repeats every year.

Sales usually peak in the middle of the year, around June and July, before decreasing in the second half of the year.

## Breakdown of sales for each store

In [None]:
store_count= len(train['store'].unique())
# 10 stores in dataset
fig,axes = plt.subplots(store_count,figsize=(12,13))
# use a for loop to iterate through all 10 stores and plot the graph of resampled 
# total weekly sales data for each store 
for i in train['store'].unique():
    g= train.loc[train['store']==i,'sales'].resample('W').sum()
    ax= g.plot(ax=axes[i-1])
    ax.set_ylabel('sales')
    ax.set_xlabel('year')
fig.tight_layout()

The series of graphs above further supports the observation that sales follow a seasonal pattern. All stores show similar trends over the duration of the data, albeit with minor differences. Since all stores share a seasonal pattern, we can consider pooling them together for our analysis.

In [None]:
sns.boxplot(x="day_of_week", y="sales", data=train)

Sunday (day_of_week = 6) appears to have the highest median sales of all the days of the week.

## Sales by Store

In [None]:
# plot graph of sales over the 5 years
graph_sales= sns.FacetGrid(train,col='store',col_order=[1,2,3,4,5,6,7,8,9,10],col_wrap=2)
graph_sales.map_dataframe(sns.barplot,"year","sales")

From the graph, we can observe that there is a general increasing trend in sales for each store from 2013 to 2017 

## Mean Sales by Store

In [None]:
# plot mean sales by store  
overall_sales_by_store= train[['sales','store']].groupby(['store']).mean().plot.bar(figsize=(10,8))

Stores 2 and 8 have the highest mean sales, which could be due to a variety of reasons, such as being located in an area with heavy customer traffic or providing better customer service.

## Mean Sales by Store by month

In [None]:
# Plot average Sales by month for all stores 
train[['sales','month']].groupby(['month']).mean().plot.bar(figsize=(10,8))

From the graph, we can observe that mean sales generally increase through the first half of the year, peak in July, and then decrease through the second half. The top 3 months for sales are July, June and August.

The summer period may be a period of major discounts, such as summer sales, or it may be the peak tourist season; either could account for the highest sales figures falling in this period.

It is also worth noting that sales increase for all stores in November, which could be due to the stores running promotions and campaigns such as Black Friday.


## Plotting relative sales per year for both stores and items 

In [None]:
store_sales_trend_year=pd.pivot_table(train,index="year",columns='store',values='sales',aggfunc='mean').values
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.plot(store_sales_trend_year/store_sales_trend_year.mean(0)[np.newaxis])
plt.xlabel("Year")
plt.ylabel("relative sales")
plt.title(" store ")
plt.subplot(1,2,2)

item_sales_trend_year=pd.pivot_table(train,index="year",columns='item',values='sales',aggfunc='mean').values
plt.plot(item_sales_trend_year/item_sales_trend_year.mean(0)[np.newaxis])
plt.xlabel("Year")
plt.ylabel("relative sales")
plt.title("items")

Both stores and items experienced a similar growth trend over the years
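To make the "relative sales" normalisation in the cell above easier to follow, here is a minimal sketch on a tiny made-up array (the numbers are assumptions, not values from the dataset). Each column is divided by its own mean over the years, so every store is rescaled to average 1 and the growth shapes become comparable.

```python
import numpy as np

# Hypothetical yearly mean sales: rows are years, columns are two stores.
yearly_mean_sales = np.array([
    [10.0, 40.0],
    [20.0, 50.0],
    [30.0, 60.0],
])

# Mean over the year axis gives one average per store (shape (2,)).
store_means = yearly_mean_sales.mean(0)

# [np.newaxis] reshapes the means to (1, 2) so each column is divided by its own mean.
relative_sales = yearly_mean_sales / store_means[np.newaxis]

print(relative_sales)
```

After this rescaling every column averages to 1, which is why lines for stores with very different sales levels can overlap in the plot.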

## Relative sales per month for both stores and items

In [None]:
store_sales_trend_month=pd.pivot_table(train,index="month",columns='store',values='sales',aggfunc='mean').values
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.plot(store_sales_trend_month/store_sales_trend_month.mean(0)[np.newaxis])
plt.xlabel("month")
plt.ylabel("relative sales")
plt.title(" store ")
plt.subplot(1,2,2)

item_sales_trend_month=pd.pivot_table(train,index="month",columns='item',values='sales',aggfunc='mean').values
plt.plot(item_sales_trend_month/item_sales_trend_month.mean(0)[np.newaxis])
plt.xlabel("month")
plt.ylabel("relative sales")
plt.title("items")

As with relative sales per year, both store and item sales follow similar trends over the months, with sales increasing during the first half of the year, peaking around mid-year, before decreasing in the second half.

In [None]:
item_sales_trend_day = pd.pivot_table(train, index='day_of_week', columns='item',
                              values='sales', aggfunc='mean').values
store_sales_trend_day = pd.pivot_table(train, index='day_of_week', columns='store',
                               values='sales', aggfunc='mean').values

plt.figure(figsize=(12, 5))
plt.subplot(1,2,1)
# left panel: one line per store
plt.plot(store_sales_trend_day / store_sales_trend_day.mean(0)[np.newaxis])
plt.title("Stores")
plt.xlabel("Day of Week")
plt.ylabel("Relative Sales")


plt.subplot(1,2,2)
# right panel: one line per item
plt.plot(item_sales_trend_day /item_sales_trend_day.mean(0)[np.newaxis])
plt.title("Items")
plt.xlabel("Day of Week")
plt.ylabel("Relative Sales")
plt.show()

From the graph above, we can see that items and stores share a common pattern over the days of the week.

From the preliminary analysis of the train data set, we can observe that all stores show similar trends and seasonality over the years, albeit with some differences in sales levels.

The data looks to be additive in nature, with sales volume increasing progressively over time. For our time-series forecasting, we will focus our analysis on 1 store-item pair.

# MODEL BUILDING

Before conducting time-series forecasting with these models, we first need to ensure that our data is stationary. Stationary data is data whose mean, variance and autocovariance do not vary with time. Checking this is important in time series analysis, because fitting such models to non-stationary data can lead to erroneous and misleading results.
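As a rough illustration of what "mean does not vary with time" looks like in practice, the sketch below compares a synthetic white-noise series with the same noise on top of a linear trend. Both series, the seed and the helper `half_mean_gap` are assumptions made up for this example; it is an informal check, not a formal stationarity test.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Stationary example: white noise with constant mean and variance.
noise = rng.normal(loc=0.0, scale=1.0, size=n)

# Non-stationary example: the same noise on top of a linear upward trend.
trending = 0.1 * np.arange(n) + noise

def half_mean_gap(series):
    """Absolute difference between the means of the first and second half."""
    half = len(series) // 2
    return abs(series[:half].mean() - series[half:].mean())

noise_gap = half_mean_gap(noise)
trend_gap = half_mean_gap(trending)
print(f"white noise gap: {noise_gap:.2f}, trending gap: {trend_gap:.2f}")
```

The gap stays close to zero for the white noise but is large for the trending series, which is the kind of behaviour the rolling-mean plot and the Dickey-Fuller test later in this kernel look for.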



There are 50 items and 10 stores in our dataset, which gives us 500 store-item pairs. ARIMA and SARIMA forecasting methods will be used for our prediction. For simplicity, we will use one store-item pair (Store 1 and Item 1) to construct our prediction models.

In [None]:
# First start with item 1-store 1 pair 
S1_I1=train.loc[(train['store']==1) & (train['item']==1)]
S1_I1.head()

In [None]:
S1_I1['sales'].plot()
# seasonal trend where it peaks at mid-year with general increasing sales over time 

We will now conduct a time series decomposition of the S1-I1 pair, breaking the series down into trend, seasonal and residual components to assess its stationarity.
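To show the idea behind an additive decomposition, here is a hand-rolled sketch on a synthetic series with a weekly cycle. The series, the 7-day period and the variable names are assumptions for illustration; this follows the spirit of a classical moving-average decomposition and is not the statsmodels implementation.

```python
import numpy as np

period = 7
n_weeks = 10
t = np.arange(period * n_weeks)

# Synthetic additive series: linear trend plus a weekly pattern that sums to zero.
weekly_pattern = np.array([3.0, -1.0, -2.0, 0.0, 1.0, -3.0, 2.0])
series = 0.5 * t + np.tile(weekly_pattern, n_weeks)

# Trend: centred moving average over one full period (loses 3 points at each end).
half = period // 2
trend_est = np.convolve(series, np.ones(period) / period, mode="valid")

# Seasonal: average the detrended values for each position within the week.
detrended = series[half:-half] - trend_est
positions = t[half:-half] % period
seasonal_est = np.array([detrended[positions == k].mean() for k in range(period)])

print(np.round(seasonal_est, 2))
```

Because the moving average spans exactly one period, the weekly pattern averages out of the trend estimate, and what remains after subtracting the trend is the seasonal component (plus a residual, which is zero in this noise-free example).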

In [None]:
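# use freq=365 to capture yearly seasonality in the daily data
# (newer statsmodels versions name this argument `period`)
decomposition=seasonal_decompose(S1_I1['sales'],model='additive',freq=365)
fig=decomposition.plot()
# set the figure size on the returned figure; assigning plt.figsize has no effect
fig.set_size_inches(12,8)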

From the seasonal decomposition, we can clearly observe an increasing trend and a yearly seasonality in the dataset. This is an indication that the data is not stationary.

In [None]:
# A better look at the data trend 
trend=decomposition.trend
trend.plot()

In [None]:
seasonal = decomposition.seasonal 
# pass figsize to plot(); assigning plt.figsize has no effect
seasonal.plot(figsize=(12,6))


Evidence of yearly seasonality with increasing trend

In [None]:
residual = decomposition.resid
residual.plot()

Another way to check for stationarity in the data is to plot moving average and moving standard deviation.

In [None]:
# Plot moving average and moving standard deviation to see if they vary over time 
rolmean = S1_I1['sales'].rolling(window=12).mean()
rolstd = S1_I1['sales'].rolling(window=12).std()

fig = plt.figure(figsize=(12, 8))
orig = plt.plot(S1_I1['sales'], color='blue',label='Original')
mean = plt.plot(rolmean, color='red', label='Rolling Mean')
std = plt.plot(rolstd, color='black', label = 'Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show()

As seen in the graph above, the rolling mean follows the sales trend and is not constant. The rolling standard deviation also fluctuates together with the data.

We will now conduct the Augmented Dickey-Fuller (ADF) test to check the stationarity of the pair. The null hypothesis of the test is that the series has a unit root, i.e. that it is not stationary.

If the test statistic is less than the critical value, we can reject the null hypothesis and treat the series as stationary. When the test statistic is greater than the critical value, we fail to reject the null hypothesis, which means we cannot conclude that the series is stationary.

In the test run below, the test statistic is greater than the critical value, which implies that the series is not stationary. This agrees with what we saw in the visual checks.
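The decision rule described above can be written as a small helper. This is only a sketch: the function name `adf_decision` and the numbers passed to it are made up for illustration and are not results from this dataset. The dictionary keys mirror the `'1%'`, `'5%'` and `'10%'` keys that `adfuller` returns for its critical values.

```python
def adf_decision(test_statistic, critical_values, level="5%"):
    """Apply the ADF decision rule at the chosen significance level.

    The null hypothesis is that the series has a unit root (is non-stationary).
    """
    if test_statistic < critical_values[level]:
        return "reject null: evidence of stationarity"
    return "fail to reject null: treat as non-stationary"

# Made-up numbers for illustration only, not results from this dataset.
example_critical_values = {"1%": -3.43, "5%": -2.86, "10%": -2.57}

print(adf_decision(-1.20, example_critical_values))
print(adf_decision(-5.00, example_critical_values))
```

Note that the comparison is on signed values: a more negative test statistic is stronger evidence against the unit-root null.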

In [None]:
from statsmodels.tsa.stattools import adfuller

def adf_test(dataset):
    dftest=adfuller(dataset,autolag='AIC')
    print("1. Test statistic:", dftest[0])
    print("2. P-value:", dftest[1])
    print("3. No of lags:", dftest[2])
    print("4. No of observations used for ADF Regression and critical values calculation:", dftest[3])
    print("5. critical values: ")
    for key,val in dftest[4].items():
        print("\t",key,": ", val)

In [None]:
adf_test(S1_I1['sales'])


The Dickey-Fuller test results show a test statistic higher than the 1% critical value, in addition to the upward trend and seasonality observed earlier. The series is therefore considered non-stationary.

### Use of differencing method to remove any trends in series 

In [None]:
# Difference the series once (x[t] - x[t-1]) to remove the trend
# and move towards a stationary pattern
first_diff= S1_I1.sales-S1_I1.sales.shift(1)
# the first value has no predecessor, so drop the resulting NaN
first_diff=first_diff.dropna()
first_diff.head()

In [None]:
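# Perform seasonal decompose on the first-differenced series
# use freq=365 to capture yearly seasonality in the daily data
# (newer statsmodels versions name this argument `period`)
decomposition=seasonal_decompose(first_diff,model='additive',freq=365)
fig=decomposition.plot()
# set the figure size on the returned figure; assigning plt.figsize has no effect
fig.set_size_inches(12,8)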


The increasing trend has now been removed, and the values have a roughly constant mean and standard deviation.
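To see why first-order differencing removes a trend, consider a made-up series that rises by a fixed amount each step (the values are an assumption for illustration). Differencing turns the steadily rising level into a constant.

```python
import numpy as np

# Hypothetical series with a steady upward trend of 2 units per step.
trend_series = 5.0 + 2.0 * np.arange(10)

# First-order differencing: x[t] - x[t-1], equivalent to subtracting the shifted series.
differenced = np.diff(trend_series)

print(differenced)  # every value is 2.0, so the trend has become a constant level
```

Real sales data also contains noise and seasonality, so the differenced series will not be perfectly flat, but its mean no longer drifts upwards.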

In [None]:
# Replot the rolling mean and standard deviation graphs 
rolmean1 = first_diff.rolling(window=12).mean()
rolstd1 = first_diff.rolling(window=12).std()

fig = plt.figure(figsize=(12, 8))
orig = plt.plot(first_diff, color='blue',label='Original')
mean = plt.plot(rolmean1, color='red', label='Rolling Mean')
std = plt.plot(rolstd1, color='black', label = 'Rolling Std')
plt.legend(loc='best')
plt.title('Rolling Mean & Standard Deviation')
plt.show()

In [None]:
# Re-run the Dickey-Fuller test on the differenced series
adf_test(first_diff)

The p-value is now very small and the test statistic is less than the 1% critical value. We can conclude that the differenced series is stationary.

### Plot ACF and PACF function

We now proceed to plot both the Autocorrelation Function (ACF) and the Partial Autocorrelation Function (PACF). The 2 graphs summarise the strength of the relationship between an observation in a time series and observations at prior time steps.
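To make the ACF less abstract, the sketch below computes the sample autocorrelation by hand for a synthetic series with an exact 7-day cycle. The helper `sample_acf` and the sine-wave series are assumptions for illustration; the plots in this kernel use the statsmodels functions instead.

```python
import numpy as np

def sample_acf(series, lag):
    """Sample autocorrelation of a 1-D array at a given positive lag."""
    centred = series - series.mean()
    return np.sum(centred[lag:] * centred[:-lag]) / np.sum(centred ** 2)

# Synthetic daily series with an exact 7-day cycle (10 weeks).
t = np.arange(70)
weekly_series = np.sin(2 * np.pi * t / 7)

acf_lag_3 = sample_acf(weekly_series, 3)
acf_lag_7 = sample_acf(weekly_series, 7)
print(f"lag 3: {acf_lag_3:.2f}, lag 7: {acf_lag_7:.2f}")
```

The autocorrelation is strongly positive at lag 7 and negative at lag 3, which is the same kind of repeating spike every 7 lags that signals weekly seasonality in an ACF plot.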

In [None]:
# Initial data before first-order differencing  
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(S1_I1.sales, lags=40, ax=ax1) # 
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(S1_I1.sales, lags=40, ax=ax2)# 



In the initial series, there is evidence of a pattern that recurs every 7 lags (days). The PACF also shows a consistent pattern among the partial autocorrelations, which points to a seasonal (weekly) pattern.


In [None]:
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(first_diff, lags=40, ax=ax1) # 
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(first_diff, lags=40, ax=ax2)# 

There are 2 ways to determine the p,d,q combination for our ARIMA model:

1) Plot the ACF and PACF graphs as seen above and read off the most appropriate orders 

2) Fit the trainset with the auto_arima() function (from the pmdarima package) and let it search for the p,d,q combination with the lowest AIC 

We will use option 1) for this kernel. 

We know that d=1 since the train data achieved stationarity after 1st order differencing 

For p, we choose p=6 based on the PACF plot, where the AR terms appear significant up to about 6 time lags 

For q, the value is usually read from the ACF plot, but we will set it to 0 to avoid the risk of a wrong selection that could affect our prediction model 

Thus the ARIMA combination used will be **ARIMA(6,1,0)**
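For reference, option 2) ranks candidate models by the Akaike Information Criterion, AIC = 2k - 2 ln(L), where k is the number of estimated parameters and L is the maximised likelihood. The sketch below applies that formula to hypothetical log-likelihoods; the candidate orders, parameter counts and values are invented for illustration and do not come from fitting this dataset.

```python
def aic(num_params, log_likelihood):
    """Akaike Information Criterion: lower values indicate a better trade-off."""
    return 2 * num_params - 2 * log_likelihood

# Hypothetical (order, parameter count, log-likelihood) triples, for illustration only.
candidates = [
    ((1, 1, 0), 2, -520.0),
    ((2, 1, 0), 3, -510.0),
    ((2, 1, 1), 4, -509.8),
]

scores = {order: aic(k, loglik) for order, k, loglik in candidates}
best_order = min(scores, key=scores.get)
print(scores)
print("lowest AIC:", best_order)
```

In this made-up case the extra MA term improves the likelihood only slightly, so the penalty for the additional parameter makes the simpler model the one with the lowest AIC.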

## ARIMA MODEL BUILDING 

ARIMA (Auto-Regressive Integrated Moving Average) is a generalization of the simpler Auto-Regressive Moving Average (ARMA) model that adds the notion of integration.

It is made up of the following elements: 

**AR: Autoregression**. A model that uses the dependent relationship between an observation and some number of lagged observations.

**I: Integrated.** The use of differencing of raw observations (e.g. subtracting an observation from an observation at the previous time step) in order to make the time series stationary.

**MA: Moving Average.** A model that uses the dependency between an observation and a residual error from a moving average model applied to lagged observations.

Each of these components is explicitly specified in the model as a parameter. The standard notation is ARIMA(p,d,q), where the parameters are substituted with integer values to indicate the specific ARIMA model being used.

The parameters of the ARIMA model are defined as follows:

p: The number of lag observations included in the model, also called the lag order. This is the number of autoregressive (AR) terms. We can find the required number of AR terms by inspecting the PACF plot

d: The number of times that the raw observations are differenced, also called the degree of differencing

q: The order of the moving average, i.e. the number of moving average (MA) terms that should go into the ARIMA model. 
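To illustrate what an autoregressive term with p = 1 means, the sketch below simulates an AR(1) process and recovers its coefficient by regressing each value on the previous one. The coefficient 0.7, the seed and the least-squares fit are assumptions for this example; statsmodels estimates ARIMA models with its own, more involved procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
true_phi = 0.7

# Simulate an AR(1) process: x[t] = phi * x[t-1] + noise[t].
noise = rng.normal(size=n)
x = np.zeros(n)
for i in range(1, n):
    x[i] = true_phi * x[i - 1] + noise[i]

# Regress each observation on its single lagged value (p = 1) by least squares.
lagged = x[:-1].reshape(-1, 1)
target = x[1:]
phi_hat = np.linalg.lstsq(lagged, target, rcond=None)[0][0]

print(f"estimated AR coefficient: {phi_hat:.2f}")
```

A model with p = 6 works the same way, except that each observation is explained by its six previous values instead of one.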

 

To build the model, a train-test split approach will be adopted. The S1_I1 data is split into a trainset (all data up to the last 3 months in the dataset) and a validation set (the last 3 months of S1_I1 data) on which the model will be tested. This lets us evaluate the models on data that was not used for training, which avoids bias and gives a fairer evaluation.
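The key point of a time-series split is that it is chronological: the validation block must come strictly after the training block, with no shuffling. Here is a small sketch on a stand-in series over the same date range; the placeholder values and the names `train_part` and `valid_part` are assumptions for illustration.

```python
import pandas as pd

# Stand-in daily series covering 2013-01-01 to 2017-12-31 (values are placeholders).
dates = pd.date_range("2013-01-01", "2017-12-31", freq="D")
sales = pd.Series(range(len(dates)), index=dates, name="sales")

# Chronological split: no shuffling, the validation block is strictly later in time.
train_part = sales.loc[:"2017-09-30"]
valid_part = sales.loc["2017-10-01":]

print(len(train_part), len(valid_part))
```

Label-based slicing on a DatetimeIndex includes both end points, so the two blocks do not overlap and together cover every day in the range.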








In [None]:
# .loc makes the date-range row selection and the column selection explicit
Train=S1_I1.loc['2013-01-01':'2017-09-30','sales']
valid=S1_I1.loc['2017-10-01':'2017-12-31','sales']

In [None]:
Train.shape,valid.shape

In [None]:
valid.head(),valid.tail()

In [None]:
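# Note: statsmodels.tsa.arima_model was removed in statsmodels 0.13;
# newer versions provide ARIMA in statsmodels.tsa.arima.model instead
from statsmodels.tsa.arima_model import ARIMA
model=ARIMA(Train,freq='D',order=(6,1,0))
model=model.fit()
model.summary()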

In [None]:
# Make predictions on the validation set
start= len(Train)
end=len(Train)+len(valid)-1
pred=model.predict(start=start,end=end,typ='levels')
print(pred)

In [None]:
pred.plot(legend=True)
valid.plot(legend=True)

In [None]:
# Calculate the errors of the model 
rmse = sqrt(mean_squared_error(valid,pred))
print("ARIMA model RMSE: {}".format(rmse)) 
# get rmse of 7.2432

In [None]:
# SMAPE calculations 
def smape(A, F):
    return 100/len(A) * np.sum(2 * np.abs(F - A) / (np.abs(A) + np.abs(F)))
print("ARIMA model SMAPE: {:.4}".format(smape(valid,pred))+"%")

## MAKE PREDICTIONS ON THE SUBSEQUENT 90 DAYS (testset)

In [None]:
model2=ARIMA(S1_I1['sales'],freq='D',order=(6,1,0))
model2=model2.fit()

In [None]:
index_future_dates=pd.date_range(start='2018-01-01',end='2018-03-31')
# start and end are both inclusive, so subtract 1 to get exactly one prediction per future date
pred=model2.predict(start=len(S1_I1),end=len(S1_I1)+len(index_future_dates)-1,typ='levels').rename('ARIMA Predictions')
# pred.index=index_future_dates
print(pred)
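When predicting by integer position, both `start` and `end` are inclusive, so it helps to count the forecast horizon explicitly. The sketch below does this with the standard library only; `n_train` is an assumed count of daily observations from 2013-01-01 to 2017-12-31.

```python
from datetime import date

forecast_start = date(2018, 1, 1)
forecast_end = date(2018, 3, 31)

# Both end points are included, hence the +1.
horizon = (forecast_end - forecast_start).days + 1

# With a zero-based integer position n_train for the first forecast,
# an inclusive range of `horizon` steps ends at n_train + horizon - 1.
n_train = 1826  # assumed number of daily observations, 2013-01-01 to 2017-12-31
start, end = n_train, n_train + horizon - 1

print(horizon, start, end)
```

This gives a 90-day horizon, matching one prediction for each day of the test period.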

In [None]:
# test.head()
# test.tail()
# test.index
# index_future_dates=test.index
# testset runs from 1 Jan 2018 to 31 Mar 2018


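# .copy() avoids pandas' SettingWithCopyWarning when the sales column is added
test_S1_I1=test.loc[(test['store']==1) & (test['item']==1)].copy()
# assigning a Series aligns on the index, so pred must carry the forecast dates
test_S1_I1['sales']=pred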

In [None]:
test_S1_I1

In [None]:
arima_submission=pd.DataFrame(data=test_S1_I1,columns=['id','sales']).reset_index(drop=True)
arima_submission

In [None]:
# Option to save arima model predictions to csv
# arima_submission.to_csv('arima_submission.csv',index=False)

## SARIMA MODEL 

The seasonal orders (P,D,Q) of SARIMA play the same roles as the (p,d,q) orders of ARIMA, so we will reuse the values chosen for our ARIMA model in the SARIMA model, with a seasonal period of 7 to represent a weekly cycle.
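To show what the seasonal differencing part (D = 1 with a period of 7) does, here is a sketch on a made-up series that repeats the same week over and over. The weekly values are an assumption for illustration.

```python
import numpy as np

period = 7
weekly_pattern = np.array([20.0, 12.0, 14.0, 15.0, 17.0, 19.0, 22.0])  # made-up values

# Repeat the same week six times to mimic a purely weekly series.
series = np.tile(weekly_pattern, 6)

# Seasonal differencing (D = 1, s = 7): subtract the value from the same weekday one week earlier.
seasonal_diff = series[period:] - series[:-period]

print(seasonal_diff)
```

The weekly pattern cancels out completely here because each day is compared with the same weekday of the previous week, which is how the seasonal terms let SARIMA handle the 7-day cycle that plain ARIMA does not model explicitly.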

In [None]:
sarima_model= sm.tsa.statespace.SARIMAX(Train,
                                        order= (6,1,0),
                                        seasonal_order=(6,1,0,7),
                                        enforce_stationarity=False,
                                        enforce_invertibility=False)
sarima_model=sarima_model.fit()
sarima_model.summary()
# print(model_aic.summary().tables[1])


In [None]:
start= len(Train)
end=len(Train)+len(valid)-1
# SARIMAX predicts on the original (levels) scale by default, so no typ argument is needed
predict=sarima_model.predict(start=start,end=end)
print(predict)

In [None]:

predict.plot(legend=True)
valid.plot(legend=True)

We can see that the predicted values generated by SARIMA follow the actual sales values in the validation set more closely.

In [None]:
rmse=sqrt(mean_squared_error(valid,predict))
rmse

In [None]:
def smape(A, F):
    return 100/len(A) * np.sum(2 * np.abs(F - A) / (np.abs(A) + np.abs(F)))
print("SARIMA model SMAPE: {:.4}".format(smape(valid,predict))+"%")

The RMSE and SMAPE for SARIMA are 5.839 and 24.49% respectively. This is an improvement over ARIMA, indicating that SARIMA is the better of the two models for forecasting in this context.
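For readers new to these two error measures, here is a worked example on two made-up data points (the values are assumptions, not taken from the validation set). RMSE is expressed in sales units, while SMAPE is a percentage, which makes it easier to compare across items with different sales levels.

```python
import numpy as np
from math import sqrt

actual = np.array([10.0, 20.0])     # made-up actual sales
forecast = np.array([12.0, 18.0])   # made-up forecasts

# RMSE: square root of the mean squared error, in the same units as sales.
rmse_example = sqrt(np.mean((forecast - actual) ** 2))

# SMAPE: absolute error scaled by the average magnitude of actual and forecast, in percent.
smape_example = 100 / len(actual) * np.sum(
    2 * np.abs(forecast - actual) / (np.abs(actual) + np.abs(forecast))
)

print(f"RMSE: {rmse_example:.3f}, SMAPE: {smape_example:.2f}%")
```

Both forecasts are off by 2 units, so the RMSE is 2, but the miss on the smaller value contributes more to SMAPE because it is larger relative to the sales level.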

In [None]:
model2=SARIMAX(S1_I1.sales,order= (6,1,0),seasonal_order=(6,1,0,7), enforce_stationarity=False, enforce_invertibility=False).fit()

In [None]:
index_future_dates=pd.date_range(start='2018-01-01',end='2018-03-31')
# start and end are both inclusive, so subtract 1 to get exactly one prediction per future date
pred=model2.predict(start=len(S1_I1),end=len(S1_I1)+len(index_future_dates)-1).rename('SARIMA Predictions')
# pred.index=index_future_dates
print(pred)

In [None]:
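# .copy() avoids pandas' SettingWithCopyWarning when the sales column is added
test_S1_I1=test.loc[(test['store']==1) & (test['item']==1)].copy()
# assigning a Series aligns on the index, so pred must carry the forecast dates
test_S1_I1['sales']=pred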

In [None]:
test_S1_I1

In [None]:
sarima_submission=pd.DataFrame(data=test_S1_I1,columns=['id','sales']).reset_index(drop=True)
sarima_submission

In [None]:
sarima_submission.to_csv('sarima_submission.csv',index=False)

# FINAL THOUGHTS 

This kernel has demonstrated the use of both ARIMA and SARIMA for time-series forecasting. Both are powerful techniques. It is important for analysts to conduct a thorough exploratory analysis of the data to derive useful insights before moving on to forecasting.

Besides the 2 techniques shown in this kernel, analysts can also consider alternative methods, such as XGBoost and Facebook Prophet, which may produce more accurate forecasts and train faster. However, one needs to be mindful of the business needs of the organisation/project before deciding how to approach the case.

Thank you for taking the time to read through this kernel. I appreciate any feedback and advice as I seek to deepen my knowledge of data analytics techniques.







References:

https://www.kaggle.com/sumi25/understand-arima-and-tune-p-d-q/comments 

https://www.kaggle.com/alexdance/store-item-combination-part-3-month-and-arima 

https://www.kaggle.com/hmoritajp718/intro-to-time-series-forecast#ARIMA 

https://www.kaggle.com/thexyzt/keeping-it-simple-by-xyzt#Conclusion

https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/

https://www.youtube.com/watch?v=z-uSBE8Pxwg

https://realpython.com/train-test-split-python-data/#the-importance-of-data-splitting

https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/

https://github.com/nachi-hebbar/ARIMA-Temperature_Forecasting/blob/master/Temperature_Forecast_ARIMA.ipynb

https://www.youtube.com/watch?v=8FCDpFhd1zk&list=PLqYFiz7NM_SMC4ZgXplbreXlRY4Jf4zBP&index=6