![](https://miro.medium.com/max/2774/1*2s3HRji-pT6flEiYsIZXEw.png) 

# Motivation: Given 5 years of store-item sales data, predict 3 months of sales for 50 different items at 10 different stores.

What information do we have right now? 

The dataset is provided with 4 columns:

1. date: The date of sale
2. store: This is the store number
3. item: This is the item number
4. sales: Sales made on that particular day

Approach: Here I will use a basic ARIMA model. The solution follows these steps:

1. Loading and Handling Time Series in Pandas
2. How to Check Stationarity of a Time Series?
3. How to make a Time Series Stationary?
4. Forecasting a Time Series

In [None]:
#import dependencies

import warnings 
warnings.filterwarnings('ignore')

#Basic packages
import os
import pandas as pd
import numpy as np
import datetime

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns


# Time Series
from statsmodels.tsa.arima_model import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from pandas.plotting import autocorrelation_plot
from statsmodels.tsa.stattools import adfuller, acf, pacf,arma_order_select_ic
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs

color = sns.color_palette()
sns.set_style('darkgrid')

In [None]:
#Read files in the directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
train_df = pd.read_csv("/kaggle/input/demand-forecasting-kernels-only/train.csv")
train_df['date'] = pd.to_datetime(train_df['date'], format="%Y-%m-%d")
train_df.head()

In [None]:
train_df.info()
# Here we can see that there are no missing values in our dataframe

In [None]:
# Expand dataframe with more useful columns
def expand_df(df):
    data = df.copy()
    
    data['day'] = data.date.dt.day
    data['month'] = data.date.dt.month
    data['year'] = data.date.dt.year
    data['dayofweek'] = data.date.dt.dayofweek
    return data

train_df = expand_df(train_df)
display(train_df)

grand_avg = train_df.sales.mean()
print(f"The grand average of sales in this dataset is {grand_avg:.4f}")

In [None]:
train_df = train_df.set_index('date')
train_df.head()

In [None]:
train_df.tail()

# EDA

pivot_table: https://www.dataquest.io/blog/pandas-pivot-table/

In [None]:
#Sales by year

agg_year_item = pd.pivot_table(train_df, index='year', columns='item',
                               values='sales', aggfunc=np.mean).values
agg_year_store = pd.pivot_table(train_df, index='year', columns='store',
                                values='sales', aggfunc=np.mean).values

plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.plot(agg_year_item / agg_year_item.mean(0)[np.newaxis])
plt.title("Items")
plt.xlabel("Year")
plt.ylabel("Relative Sales")
plt.subplot(122)
plt.plot(agg_year_store / agg_year_store.mean(0)[np.newaxis])
plt.title("Stores")
plt.xlabel("Year")
plt.ylabel("Relative Sales")
plt.show()

**We can see that sales for items and stores have grown similarly over time.**
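To make the `agg / agg.mean(0)[np.newaxis]` normalization used in these plots concrete, here is a minimal sketch on a made-up two-item table. The `toy` frame and its numbers are purely illustrative and do not come from the dataset. Dividing each column by its own mean puts items with very different sales volumes on the same relative scale, which is why curves that grow at the same rate overlap.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: item 2 sells 10x more than item 1, both grow 20% per year
toy = pd.DataFrame({
    "year":  [2013, 2013, 2014, 2014],
    "item":  [1, 2, 1, 2],
    "sales": [10.0, 100.0, 12.0, 120.0],
})

# Rows = years, columns = items, cells = mean sales
agg = pd.pivot_table(toy, index="year", columns="item",
                     values="sales", aggfunc="mean").values

# Divide every column by its own mean over the years
relative = agg / agg.mean(0)[np.newaxis]
print(relative)
```

After the division both columns are identical, even though the raw sales differ by a factor of ten.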

In [None]:
# Now let's check the sales w.r.t. the month

agg_month_item = pd.pivot_table(train_df, index='month', columns='item',
                               values='sales', aggfunc=np.mean).values
agg_month_store = pd.pivot_table(train_df, index='month', columns='store',
                                values='sales', aggfunc=np.mean).values

plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.plot(agg_month_item / agg_month_item.mean(0)[np.newaxis])
plt.title("Items")
plt.xlabel("Month")
plt.ylabel("Relative Sales")
plt.subplot(122)
plt.plot(agg_month_store / agg_month_store.mean(0)[np.newaxis])
plt.title("Stores")
plt.xlabel("Month")
plt.ylabel("Relative Sales")
plt.show()

**Sales by month also show a similar pattern for both items and stores.**

In [None]:
#Sales by days of week

agg_weekly_item = pd.pivot_table(train_df, index='dayofweek', columns='item',
                               values='sales', aggfunc=np.mean).values
agg_weekly_store = pd.pivot_table(train_df, index='dayofweek', columns='store',
                                values='sales', aggfunc=np.mean).values

plt.figure(figsize=(12, 5))
plt.subplot(121)
plt.plot(agg_weekly_item / agg_weekly_item.mean(0)[np.newaxis])
plt.title("Items")
plt.xlabel("Days of week")
plt.ylabel("Relative Sales")
plt.subplot(122)
plt.plot(agg_weekly_store / agg_weekly_store.mean(0)[np.newaxis])
plt.title("Stores")
plt.xlabel("Days of week")
plt.ylabel("Relative Sales")
plt.show()

In [None]:
# Now let's also check the item-store relationship

agg_store_item = pd.pivot_table(train_df, index='store', columns='item',
                                values='sales', aggfunc=np.mean).values

plt.figure(figsize=(14, 5))
plt.subplot(121)
plt.plot(agg_store_item / agg_store_item.mean(0)[np.newaxis])
plt.title("Items")
plt.xlabel("Store")
plt.ylabel("Relative Sales")
plt.subplot(122)
plt.plot(agg_store_item.T / agg_store_item.T.mean(0)[np.newaxis])
plt.title("Stores")
plt.xlabel("Item")
plt.ylabel("Relative Sales")
plt.show()

Now that we have checked the relationships between the different variables, it's time to decompose the time series.

In [None]:
sns.lineplot(x="date",y="sales",legend="full",data=train_df)

# **Let's decompose the time series:**

Time series data can exhibit a variety of patterns, and it is often helpful to split a time series into several components, each representing an underlying pattern category. When we decompose a time series into components, we usually combine the trend and cycle into a single trend-cycle component (sometimes called the trend for simplicity). Thus we think of a time series comprising three components: a trend-cycle component, a seasonal component, and a remainder component (containing anything else in the time series).

**The Seasonal component:** A seasonal pattern occurs when a time series is affected by seasonal factors such as the time of the year or the day of the week. Seasonality is always of a fixed and known frequency. A time series can contain multiple superimposed seasonal periods.

**The Trend component:** A trend exists when there is a long-term increase or decrease in the data. It does not have to be linear. Sometimes a trend is referred to as “changing direction” when it might go from an increasing trend to a decreasing trend.

**The Cyclical component:** The cyclical component represents phenomena that happen across seasonal periods. Cyclical patterns do not have a fixed period like seasonal patterns do. The cyclical component is hard to isolate and it's often ‘left alone’ by combining it with the trend component.

**The Noise component:** The noise or the random component is what remains behind when you separate out seasonality and trend from the time series. Noise is the effect of factors that you do not know, or which you cannot measure. It is the effect of the known unknowns, or the unknown unknowns.

https://towardsdatascience.com/what-is-time-series-decomposition-and-how-does-it-work-9b67e007ae90
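As a rough illustration of how the trend and seasonal components can be separated, the sketch below decomposes a synthetic additive series by hand with NumPy: a centered moving average estimates the trend, and averaging the detrended values by position in the season estimates the seasonal component. The series, the period of 4 and all variable names are assumptions made for this example; it follows the spirit of classical moving-average decomposition and is not a reimplementation of any library routine.

```python
import numpy as np

period = 4
n_periods = 6
t = np.arange(period * n_periods)

# Synthetic additive series: linear trend + a seasonal pattern that sums to zero
seasonal_true = np.tile([2.0, -1.0, -2.0, 1.0], n_periods)
series = 0.5 * t + seasonal_true

# Centered moving average for an even period (half weights at both ends)
kernel = np.r_[0.5, np.ones(period - 1), 0.5] / period
trend = np.convolve(series, kernel, mode="valid")

# The moving average loses period/2 points at each end
half = period // 2
detrended = series[half:-half] - trend

# Average the detrended values by their position within the season
positions = t[half:-half] % period
seasonal_est = np.array([detrended[positions == k].mean() for k in range(period)])
seasonal_est -= seasonal_est.mean()  # force the seasonal component to sum to zero

print("estimated seasonal pattern:", seasonal_est)
```

Because the synthetic series has no noise, the estimated pattern matches the one used to build it. On real data the remainder after removing trend and seasonality is the noise component.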

In [None]:
# Let's decompose a smaller slice of the data: item 1 at store 1.

train_item1 = train_df[train_df['item']==1]
train_final = train_item1[train_item1['store']==1]

from statsmodels.tsa.seasonal import seasonal_decompose
# `period` replaces the `freq` argument, which newer statsmodels releases no longer accept;
# 365 models a yearly cycle on daily data
result = seasonal_decompose(train_final['sales'], model='additive', period=365)

# result.plot() creates its own figure, so no separate plt.figure() is needed
fig = result.plot()
fig.set_size_inches(14, 12)

# How to determine additive or multiplicative model for decomposition?

We use a multiplicative model when the magnitude of the seasonal pattern depends on the magnitude of the data, i.e. the seasonal swings grow as the level of the series grows. In an additive model, on the other hand, the magnitude of the seasonality does not change with the level of the series.

Depending on whether the decomposition is multiplicative or additive, we divide the original time series by the trend component or subtract the trend component from it to retrieve the seasonal and noise components.
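The difference between the two models can be sketched with a few lines of NumPy. The trend and seasonal values below are made up for illustration. In the additive series the seasonal swing stays the same size as the level rises, while in the multiplicative series it grows with the level. Removing the trend is a subtraction in the first case and a division in the second.

```python
import numpy as np

# Hypothetical components, chosen only to illustrate the two models
trend = np.linspace(10.0, 50.0, 24)
season_add = np.tile([3.0, -3.0], 12)   # fixed +/- 3 units
season_mul = np.tile([1.2, 0.8], 12)    # fixed +/- 20 percent

additive = trend + season_add
multiplicative = trend * season_mul

# Removing the trend: subtract for additive, divide for multiplicative
recovered_add = additive - trend
recovered_mul = multiplicative / trend

# Size of the seasonal swing around the trend
swing_add = np.abs(additive - trend)
swing_mul = np.abs(multiplicative - trend)

print("additive swing, first vs last point:", swing_add[0], swing_add[-1])
print("multiplicative swing, first vs last point:", swing_mul[0], swing_mul[-1])
```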

References:

https://sigmundojr.medium.com/seasonality-in-python-additive-or-multiplicative-model-d4b9cf1f48a7

https://towardsdatascience.com/what-is-time-series-decomposition-and-how-does-it-work-9b67e007ae90 

# Stationarity

Before fitting a classical time series model such as ARIMA, the series should be stationary, meaning that its statistical properties do not change over time: **it should have a constant mean, a constant variance, and an autocovariance that depends only on the lag between two observations, not on when they occur.** In practice, the series should fluctuate around a fixed level with a consistent spread and without trend or seasonal patterns. If the mean, variance or autocovariance changes with time, the series is non-stationary and we have to make it stationary before modeling. This is necessary because these models estimate a fixed set of parameters from past data and assume that the same behavior carries over to other intervals; if the behavior drifts, the accuracy of the model suffers. The mathematics of stationary data is also easier than that of non-stationary data.

![](https://miro.medium.com/max/1050/1*xdblkZyg6YmmReAkZHUksw.png)
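A quick way to build intuition for "constant mean and variance" is to split a series in two and compare the halves. The sketch below does this for two synthetic series: one flat and one with a linear trend. The helper `half_stats` and the alternating `wiggle` (a deterministic stand-in for noise) are assumptions made for this example only.

```python
import numpy as np

def half_stats(x):
    """Return (mean, variance) for the first and the second half of a series."""
    x = np.asarray(x, dtype=float)
    mid = len(x) // 2
    first, second = x[:mid], x[mid:]
    return (first.mean(), first.var()), (second.mean(), second.var())

t = np.arange(200)
wiggle = np.where(t % 2 == 0, 1.0, -1.0)  # deterministic stand-in for noise
flat = wiggle                              # no trend: same mean in both halves
trending = 0.1 * t + wiggle                # same wiggle on top of a linear trend

print("flat     :", half_stats(flat))
print("trending :", half_stats(trending))
```

The flat series has the same mean and variance in both halves, while the mean of the trending series shifts between halves, which is the kind of behavior that makes a series non-stationary.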

# Let's check the stationarity

Here we are going to check the stationarity using 2 methods:

**1. Rolling statistics:** Plot the moving average and moving standard deviation to see whether they vary with time.

**2. ADF Test (Augmented Dickey–Fuller test):** This test returns several values that help in identifying stationarity. *The null hypothesis is that the time series is non-stationary.* The output consists of a test statistic and critical values at several significance levels. If the test statistic is less (more negative) than a critical value, we can reject the null hypothesis at that level and say that the series is stationary. The ADF test also gives us a p-value: the smaller it is, the stronger the evidence against the null hypothesis.
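The decision rule for the test can be written down in a few lines. The helper below, `interpret_adf`, is a hypothetical function for this notebook, not part of statsmodels, and the numbers passed to it are illustrative only; they are not results from this dataset.

```python
def interpret_adf(test_statistic, p_value, critical_values, alpha=0.05):
    """Apply the usual ADF decision rule to already-computed test output."""
    return {
        # Small p-value: reject the null hypothesis of non-stationarity
        "reject_null": p_value < alpha,
        # Test statistic more negative than the critical value: reject at that level
        "below_critical": {level: test_statistic < cv
                           for level, cv in critical_values.items()},
    }

# Illustrative numbers only, not results from this dataset
example = interpret_adf(
    test_statistic=-3.2,
    p_value=0.02,
    critical_values={"1%": -3.43, "5%": -2.86, "10%": -2.57},
)
print(example)
```

In this made-up case the null hypothesis is rejected at the 5% and 10% levels but not at the 1% level, which shows why the chosen cutoff matters.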

In [None]:
# Rolling Mean Analysis

def roll_stats(timeseries, window = 12, cutoff = 0.01):
    
    rolmean = timeseries.rolling(window).mean()
    rolstd = timeseries.rolling(window).std()

    #Plot rolling statistics:
    fig = plt.figure(figsize=(16, 4))
    orig = plt.plot(timeseries, color='blue',label='Original')
    mean = plt.plot(rolmean, color='red', label='Rolling Mean')
    std = plt.plot(rolstd, color='black', label = 'Rolling Std')
    plt.legend(loc='best')
   # plt.rcParams['agg.path.chunksize'] = 50000
    plt.title('Rolling Mean & Standard Deviation')
    plt.show()
    
roll_stats(train_final['sales'])    

In [None]:
# Perform Dickey-Fuller test
def dickey_fuller_test(timeseries, window = 12, cutoff = 0.01):
    dftest = adfuller(timeseries, autolag='AIC', maxlag = 20 )
    dfoutput = pd.Series(dftest[0:4], index=['Test Statistic','p-value','#Lags Used','Number of Observations Used'])
    for key,value in dftest[4].items():
        dfoutput['Critical Value (%s)'%key] = value
    pvalue = dftest[1]
    if pvalue < cutoff:
        print('p-value = %.4f. The series is likely stationary.' % pvalue)
    else:
        print('p-value = %.4f. The series is likely non-stationary.' % pvalue)
    
    print(dfoutput)
    
dickey_fuller_test(train_final['sales'])    

In [None]:
# Make the time series stationary

# First-order differencing: y[t] - y[t-1]; the first value has no predecessor and is dropped
first_diff = train_final.sales.diff().dropna()
print(first_diff.head())
roll_stats(first_diff,window = 12, cutoff = 0.01)
dickey_fuller_test(first_diff, window = 12)

As we can see, after first-order differencing the time series is now stationary.
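Why differencing helps can be seen on two tiny synthetic series (made up for this sketch): one difference turns a linear trend into a constant, and a quadratic trend needs two differences. The sketch also checks that `Series.diff()` gives the same result as subtracting the shifted series.

```python
import numpy as np
import pandas as pd

t = np.arange(10)
linear = pd.Series(3.0 + 2.0 * t)      # linear trend
quadratic = pd.Series(1.0 * t ** 2)    # quadratic trend

# One difference removes a linear trend
first_diff_linear = linear.diff().dropna()

# diff() is the same as subtracting the series shifted by one step
same_as_shift = (linear - linear.shift(1)).dropna()

# A quadratic trend needs two differences
second_diff_quadratic = quadratic.diff().diff().dropna()

print(first_diff_linear.unique(), second_diff_quadratic.unique())
```

The number of differences needed to reach stationarity is what the d term of an ARIMA model refers to.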

**Let's start with ACF and PACF:**

**ACF** is the (complete) auto-correlation function, which gives the correlation of a series with its own lagged values. Plotting these values along with a confidence band gives the ACF plot. In simple terms, it describes how strongly the present value of the series is related to its past values. A time series can have trend, seasonal, cyclical and residual components. The ACF includes all of these components when measuring correlation, hence it is a ‘complete auto-correlation plot’.
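For intuition, the sample ACF can be computed directly with NumPy. The function `sample_acf` below is a hypothetical helper written for this sketch, using the common estimator that divides the lagged cross-products by the total sum of squares. The alternating test series is made up, and the ±1.96/√n band is the usual large-sample approximation for white noise.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation for lags 0..nlags."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    denom = np.sum(centered ** 2)
    n = len(x)
    return np.array([np.sum(centered[k:] * centered[:n - k]) / denom
                     for k in range(nlags + 1)])

# Made-up series that flips sign every step: strongly negative lag-1 correlation
series = np.tile([1.0, -1.0], 50)
acf_values = sample_acf(series, nlags=3)

# Approximate 95% band for a white-noise series of this length
band = 1.96 / np.sqrt(len(series))
print(acf_values, band)
```

Lag 0 is always 1, and values outside the band are the ones the ACF plot marks as significant.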

**PACF** is the partial auto-correlation function. Instead of correlating the present value with each lag directly, as the ACF does, it measures the correlation between the series and a given lag after removing the effects already explained by the earlier lag(s), hence ‘partial’ and not ‘complete’. If the remainder still contains information that the next lag can model, we get a large partial correlation at that lag and keep it as a feature while modeling. While modeling we don’t want to keep too many correlated features, as that can create multicollinearity issues, so we retain only the relevant lags.
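One way to compute the PACF that mirrors this description is by regression: for lag k, regress the series on its first k lags and keep the coefficient of the last one. The sketch below does that with NumPy least squares on a simulated AR(1) series. The helper `pacf_by_regression`, the coefficient 0.7, the seed and the series length are all assumptions for illustration; library implementations may use other estimators.

```python
import numpy as np

def pacf_by_regression(x, nlags):
    """PACF at lag k = last coefficient when regressing x[t] on x[t-1..t-k]."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    n = len(x)
    out = [1.0]
    for k in range(1, nlags + 1):
        # Columns hold lag 1 .. lag k, aligned with the target x[k:]
        lagged = np.column_stack([x[k - j:n - j] for j in range(1, k + 1)])
        coefs, *_ = np.linalg.lstsq(lagged, x[k:], rcond=None)
        out.append(coefs[-1])
    return np.array(out)

# Simulate an AR(1) series: only lag 1 carries direct information
rng = np.random.default_rng(0)
n = 20000
noise = rng.standard_normal(n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + noise[t]

pacf_values = pacf_by_regression(y, nlags=3)
print(pacf_values)
```

For this simulated series the partial correlation at lag 1 is close to 0.7, and the higher lags are close to zero because lag 1 already explains them.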

**Now let’s see what is an AR and MA time series process:**

**Auto regressive (AR) process:** a time series is said to be AR when its present value can be obtained from previous values of the same series, i.e. the present value is a linear combination of its past values plus a noise term. Stock prices and global temperature rise can be thought of as AR processes.
The AR process of order p can be written as,

![](https://miro.medium.com/max/463/1*rPyGlYZdKKJ-Ozc3NkEa9g.jpeg)

Where ϵt is white noise and y’t-₁ and y’t-₂ are the lags. The order p is the last lag at which the PACF plot lies significantly outside the confidence band, after which it cuts off. These p lags will act as our features while forecasting the AR time series. We cannot use the ACF plot here because it shows strong correlations even for lags that are far in the past. If we used that many features, we would have multicollinearity issues. This is not a problem with the PACF plot, as it removes the components already explained by earlier lags, so we only get the lags that correlate with the residual, i.e. the component not explained by earlier lags.
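To illustrate the idea of lags acting as features, the sketch below simulates an AR(2) series and recovers its coefficients with ordinary least squares on the two lagged columns, then uses them for a one-step-ahead forecast. The coefficients 0.5 and 0.3, the seed and the series length are assumptions chosen for this example, and plain least squares is used here for simplicity rather than the estimator statsmodels applies.

```python
import numpy as np

# Simulate y[t] = 0.5*y[t-1] + 0.3*y[t-2] + noise
rng = np.random.default_rng(1)
n = 20000
phi1, phi2 = 0.5, 0.3
noise = rng.standard_normal(n)
y = np.zeros(n)
for t in range(2, n):
    y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + noise[t]

# Lag 1 and lag 2 as feature columns, aligned with the target y[2:]
features = np.column_stack([y[1:-1], y[:-2]])
target = y[2:]
coefs, *_ = np.linalg.lstsq(features, target, rcond=None)

# One-step-ahead forecast from the last two observations
next_value = coefs[0] * y[-1] + coefs[1] * y[-2]
print("estimated coefficients:", coefs, "next value:", next_value)
```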

**Moving average (MA) process:** a process where the present value of the series is defined as a linear combination of past errors. We assume the errors to be independent and normally distributed. The MA process of order q is defined as,

![](https://miro.medium.com/max/485/1*Kj9au6g7FEh4rczVsO5MrA.jpeg)

Here ϵt is white noise. To get some intuition for the MA process, let's consider an MA process of order 1, which looks like,

![](https://miro.medium.com/max/215/1*dfjgyPr40013a_4eFEJaLg.jpeg)

The order q of the MA process is obtained from the ACF plot: it is the last lag at which the ACF lies significantly outside the confidence band, after which it cuts off.
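The cut-off behavior of the ACF can be checked on a simulated MA(1) series. In theory, an MA(1) process with coefficient θ has a lag-1 autocorrelation of θ / (1 + θ²) and zero autocorrelation at higher lags. The value θ = 0.6, the seed and the series length below are assumptions for this sketch.

```python
import numpy as np

# Simulate y[t] = eps[t] + theta * eps[t-1]
rng = np.random.default_rng(2)
n = 20000
theta = 0.6
eps = rng.standard_normal(n + 1)
y = eps[1:] + theta * eps[:-1]

# Sample ACF for lags 0..3
centered = y - y.mean()
denom = np.sum(centered ** 2)
acf_values = np.array([np.sum(centered[k:] * centered[:n - k]) / denom
                       for k in range(4)])

theoretical_lag1 = theta / (1 + theta ** 2)
print("sample ACF:", acf_values, "theoretical lag 1:", theoretical_lag1)
```

Only lag 1 stands out; from lag 2 onward the sample ACF stays near zero, which is the pattern that suggests q = 1.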

Reference: https://towardsdatascience.com/significance-of-acf-and-pacf-plots-in-time-series-analysis-2fa11a5d10a8

In [None]:
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(train_final.sales, lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(train_final.sales, lags=40, ax=ax2)

In [None]:
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(first_diff, lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(first_diff, lags=40, ax=ax2)

**ARIMA,** short for ‘Auto Regressive Integrated Moving Average’ is actually a class of models that ‘explains’ a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so that equation can be used to forecast future values.

Any ‘non-seasonal’ time series that exhibits patterns and is not a random white noise can be modeled with ARIMA models.

An ARIMA model is characterized by 3 terms: p, d, q

where,

p: number of autoregressive terms (AR order)

d: number of nonseasonal differences (differencing order)

q: number of moving-average terms (MA order)

Reference: https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/
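The "Integrated" part (the d term) means the model works on differences and the results are summed back up to the original scale. A minimal NumPy sketch with made-up numbers: `levels` stands in for a short sales history and `diff_forecast` for hypothetical predicted differences, neither of which comes from this dataset.

```python
import numpy as np

# Hypothetical sales levels
levels = np.array([20.0, 23.0, 21.0, 26.0, 30.0, 29.0])

# d = 1: work with first differences
diffs = np.diff(levels)

# Integrating (cumulative sum) from the first level restores the original series
restored = np.concatenate([[levels[0]], levels[0] + np.cumsum(diffs)])

# Forecast differences (made up here) are turned into levels the same way,
# starting from the last observed level
diff_forecast = np.array([1.0, -0.5, 2.0])
level_forecast = levels[-1] + np.cumsum(diff_forecast)

print(restored, level_forecast)
```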

**How to find the order of the AR term (p)**

You can find the required number of AR terms by inspecting the Partial Autocorrelation (PACF) plot. Any autocorrelation in a stationarized series can be rectified by adding enough AR terms. So, we initially take the order of the AR term to be equal to the number of lags that cross the significance limit in the PACF plot.
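Reading the significant lags off a plot can also be done numerically. The sketch below flags the lags whose partial autocorrelation lies outside the approximate 95% band of ±1.96/√n. The helper `significant_lags` and the PACF values are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

def significant_lags(pacf_values, n_obs, z=1.96):
    """Return the lags (excluding lag 0) outside the approximate band, and the band."""
    pacf_values = np.asarray(pacf_values, dtype=float)
    bound = z / np.sqrt(n_obs)
    lags = np.arange(len(pacf_values))
    mask = (np.abs(pacf_values) > bound) & (lags > 0)
    return lags[mask], bound

# Hypothetical PACF values for lags 0..5 of a series with 100 observations
pacf_values = np.array([1.0, 0.5, 0.3, 0.1, -0.25, 0.05])
lags, bound = significant_lags(pacf_values, n_obs=100)
print("significant lags:", lags, "band:", bound)
```

The resulting lags are only a starting point for p; the fitted model summary and residual diagnostics decide whether the chosen order is adequate.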

In [None]:
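# Legacy statsmodels ARIMA API (statsmodels < 0.13): the order (p, d, q) is passed
# positionally and disp=False silences the optimizer output
arima_model = sm.tsa.ARIMA(train_final.sales, (6,1,0)).fit(disp=False)
print(arima_model.summary())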

In [None]:
from scipy import stats
from scipy.stats import normaltest

resid = arima_model.resid
print(normaltest(resid))
# normaltest returns a 2-tuple: the chi-squared test statistic and the associated p-value.
# Here the p-value is very small, meaning the residuals are not normally distributed.

fig = plt.figure(figsize=(12,8))
ax0 = fig.add_subplot(111)

sns.distplot(resid ,fit = stats.norm, ax = ax0) # need to import scipy.stats

# Get the fitted parameters used by the function
(mu, sigma) = stats.norm.fit(resid)

# Label the fitted normal curve with its parameters (raw string keeps the LaTeX backslashes intact)
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Density')
plt.title('Residual distribution')


# ACF and PACF
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(arima_model.resid, lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(arima_model.resid, lags=40, ax=ax2)

# Prediction using ARIMA Model

In [None]:
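start_index = 1726
end_index = 1826

# Keep the forecast next to the store 1 / item 1 series the model was fitted on,
# instead of writing it into the full multi-store train_df
forecast_df = train_final[['sales']].copy()
forecast_df['forecast'] = arima_model.predict(start=start_index, end=end_index, dynamic=True)
forecast_df.iloc[start_index:end_index][['sales', 'forecast']].plot(figsize=(12, 8))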

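The forecast clearly does not track the actual sales, so there is a problem with this model. One likely cause is that the legacy `ARIMA.predict` returns predictions for the differenced series by default (`typ='linear'`) rather than for sales in levels. Let's try a different model.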

In [None]:
# Now we will use SARIMAX

sarima_model = sm.tsa.statespace.SARIMAX(train_final.sales, trend='n', order=(6,1,0)).fit()
print(sarima_model.summary())

In [None]:
resid = sarima_model.resid
print(normaltest(resid))

fig = plt.figure(figsize=(12,8))
ax0 = fig.add_subplot(111)

sns.distplot(resid ,fit = stats.norm, ax = ax0) # need to import scipy.stats

# Get the fitted parameters used by the function
(mu, sigma) = stats.norm.fit(resid)

# Label the fitted normal curve with its parameters (raw string keeps the LaTeX backslashes intact)
plt.legend([r'Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Density')
plt.title('Residual distribution')

# ACF and PACF
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(sarima_model.resid, lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(sarima_model.resid, lags=40, ax=ax2)

# Prediction using SARIMAX

In [None]:
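start_index = 1726
end_index = 1826

# Keep the forecast next to the store 1 / item 1 series the model was fitted on,
# instead of writing it into the full multi-store train_df
forecast_df = train_final[['sales']].copy()
forecast_df['forecast'] = sarima_model.predict(start=start_index, end=end_index, dynamic=True)
forecast_df.iloc[start_index:end_index][['sales', 'forecast']].plot(figsize=(12, 8))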

We can see that this forecast tracks the actual sales better than the one from the simple ARIMA model.
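Comparing forecasts by eye only goes so far, so it helps to put a number on the error. The sketch below defines two common metrics, MAE and SMAPE, and applies them to made-up values. The helper names and the sample numbers are assumptions for illustration, and SMAPE is used here in one of its common forms (mean of 2·|F − A| / (|A| + |F|), in percent).

```python
import numpy as np

def mae(actual, forecast):
    """Mean absolute error."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs(forecast - actual))

def smape(actual, forecast):
    """Symmetric mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    denom = np.abs(actual) + np.abs(forecast)
    # Terms where both actual and forecast are zero count as zero error
    ratio = np.divide(2.0 * np.abs(forecast - actual), denom,
                      out=np.zeros_like(denom), where=denom != 0)
    return 100.0 * np.mean(ratio)

# Made-up values: forecast_a is close to the actuals, forecast_b is far off
actual = [100.0, 100.0]
forecast_a = [110.0, 90.0]
forecast_b = [150.0, 50.0]

print("MAE  :", mae(actual, forecast_a), mae(actual, forecast_b))
print("SMAPE:", smape(actual, forecast_a), smape(actual, forecast_b))
```

The same functions could be applied to the `sales` and `forecast` columns over the plotted window to compare the two models numerically.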

**To be continued....**

This is just a rough prediction. I will update it in upcoming versions by trying different values for p, d and q, as well as different models.

# References:

https://www.kaggle.com/ashishpatel26/keeping-it-simple-by-xyzt

https://www.kaggle.com/sumi25/understand-arima-and-tune-p-d-q

https://medium.com/analytics-vidhya/an-introduction-to-time-series-analysis-2a12d3702299 