# Problem Statement

Rossmann is a European drug distributor which operates over 3,000 drug stores across seven European countries. Because many drugs have a short shelf life, it is imperative for Rossmann to accurately forecast sales at its individual stores. Currently, forecasting is handled by the store managers, who are tasked with forecasting daily sales for the next six weeks.

 

As expected, store sales are influenced by many factors, including promotional campaigns, competition, state holidays, seasonality, and locality.

 

With thousands of individual managers predicting sales based on their unique circumstances and intuitions, the accuracy of the forecasts varies widely. To overcome this problem, the company has hired you as a data scientist to build a model that forecasts daily sales for the next six weeks. To support this, you have been provided with historical sales data for 1,115 Rossmann stores.

Since the company is just embarking on this project, the scope has been kept to nine key stores across Europe.

In [None]:
%matplotlib inline

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import warnings
warnings.simplefilter('ignore')

from statsmodels.tsa.stattools import adfuller
from statsmodels.tsa.stattools import kpss
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.vector_ar.vecm import coint_johansen #Johansen Cointegration test
from scipy.stats import boxcox
from pylab import rcParams
from statsmodels.tsa.statespace.sarimax import SARIMAX
# The old statsmodels.tsa.arima_model.ARIMA class is not available in recent statsmodels releases
from statsmodels.tsa.arima.model import ARIMA

from statsmodels.graphics.tsaplots import plot_pacf

from statsmodels.graphics.tsaplots import plot_acf

import sklearn.preprocessing

pd.set_option('display.max_rows', 100)



In [None]:
input_file_path= '../input/rossmann-store-sales'

file_path = './'

train = pd.read_csv(input_file_path+"/train.csv")
stores = pd.read_csv(input_file_path+"/store.csv")

In [None]:
train.head(12)

In [None]:
stores.head()

In [None]:
train.describe()

In [None]:
stores.describe()

# Analyzing Train data
- Converting Date feature to Datetime

In [None]:
train['Date'] =  pd.to_datetime(train['Date'], format="%Y-%m-%d")

In [None]:
train['StateHoliday'].unique()

- Some rows store the value as the string '0' instead of the integer 0, so these are replaced with 0 in the train data set
- All other categories ('a', 'b', 'c') are replaced with 1

In [None]:
# Map the string '0' to the integer 0 and every holiday category to 1
train['StateHoliday'] = train['StateHoliday'].replace({'0':0, 'a':1, 'b':1, 'c':1})

In [None]:
train.info()

# Analyzing the features
- The training data set has more than 10 lakh (1 million) records and 8 features

In [None]:
train.shape

In [None]:
train.hist(bins=30, figsize=(20,20))
plt.legend(loc='best')
plt.title('Sales vs Customers')
plt.show(block=False)

## Analysis on Stores that were Open on State Holiday and School Holidays

In [None]:
print('Total number of unique stores ', len(train['Store'].unique()))

print('Stores that were closed when Schools were declared holiday ',len(train[((train['Open']==0) & (train['SchoolHoliday']==1))]['Store'].unique()))
print('Stores that were closed during State holiday ',len(train[((train['Open']==0) & (train['StateHoliday']==1))]['Store'].unique()))
print('Stores that were closed on Saturday ',len(train[((train['Open']==0) & (train['DayOfWeek']==6))]['Store'].unique()))
print('Stores that were closed on Sunday ',len(train[((train['Open']==0) & (train['DayOfWeek']==7))]['Store'].unique()))

print('Stores that were closed on Weekdays and no state or school holiday ',
                  len(train[((train['Open']==0) & (~train['DayOfWeek'].isin([6,7])) 
                  & (train['SchoolHoliday']==0) 
                  & (train['StateHoliday']==0))]['Store'].unique()))

From the above we can clearly see that most of the shops were closed when a school or state holiday was declared.

**On Saturdays 443 shops were closed, while on Sundays 1105 shops were closed.**

Some shops were also closed on weekdays that were neither a state nor a school holiday.

In [None]:
train.isnull().sum()

**Since the company is just embarking on this project**, the scope has been kept to nine key stores across Europe. These stores are key for the company because of the revenue and historical prestige associated with them. Only the following stores are considered for further analysis: **1, 3, 8, 9, 13, 25, 29, 31 and 46.**

Hence, all further analysis considers only these 9 stores.

In [None]:
stores_list = [1,3,8,9,13,25,29,31,46]


train = train[train['Store'].isin(stores_list)]

In [None]:
train.info()

# Outlier Handling on the data set

In [None]:
fig,axs = plt.subplots(2,1, figsize=(20,8))
sns.boxplot(x='Store',y='Sales', data = train, whis=[0,99],ax=axs[0])
sns.boxplot(x='Store',y='Customers' ,data = train, whis=[0,99],ax=axs[1])

We can see outliers above the **99th percentile** for both Customers and Sales. **When training the models, outliers are removed store by store** before processing.
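
As a minimal sketch of store-wise outlier filtering, the toy frame below (made-up values, not the Rossmann data) drops rows above each store's own 99th percentile of `Sales`. The helper name `drop_store_outliers` is hypothetical; the notebook's own `remove_outliers` function, defined later, applies the same idea to one store at a time.

```python
import numpy as np
import pandas as pd

def drop_store_outliers(df, column="Sales", q=0.99):
    # Hypothetical helper: compare each row with the quantile of its own store
    cutoff = df.groupby("Store")[column].transform(lambda s: s.quantile(q))
    return df[df[column] <= cutoff]

# Toy data: 200 ordinary days per store plus one extreme day each
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Store": np.repeat([1, 3], 201),
    "Sales": np.concatenate([
        rng.integers(4000, 6000, 200), [50000],
        rng.integers(8000, 9000, 200), [90000],
    ]),
})

filtered = drop_store_outliers(toy)
print(len(toy), len(filtered))
print(filtered.groupby("Store")["Sales"].max())
```

Computing the cutoff per store matters because a single global percentile would mostly trim the busiest stores and leave the quieter ones untouched.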

**Since we are only considering stores 1, 3, 8, 9, 13, 25, 29, 31 and 46 for our analysis, let's filter the stores data set accordingly.**

In [None]:
stores = stores[stores['Store'].isin(stores_list)]

In [None]:
stores.isnull().sum()

##### The following columns have null values 
- CompetitionOpenSinceMonth    
- CompetitionOpenSinceYear  
- Promo2SinceWeek             
- Promo2SinceYear             
- PromoInterval   




> Analyzing CompetitionOpenSinceMonth and CompetitionOpenSinceYear

In [None]:
stores[(stores['CompetitionOpenSinceMonth'].isnull() & stores['CompetitionOpenSinceYear'].isnull())]

In [None]:
stores[(stores['Promo2SinceWeek'].isnull() & stores['Promo2SinceYear'].isnull() & stores['PromoInterval'].isnull()) ]

In [None]:
categorical_cols = ['CompetitionOpenSinceYear', 'CompetitionOpenSinceMonth']
for i in categorical_cols:
  stores[i]= stores[i].fillna(stores[i].median())

## Replacing all other columns with 0 
stores= stores.fillna(0)

In [None]:
stores.isnull().sum()

*In both the stores and train data sets, null values have now been handled appropriately.*

* Let's now merge the data sets on the Store id

In [None]:
train_stores_data = pd.merge(train, stores, how='inner', on='Store')

In [None]:
train_stores_data.info()

Assortment, StoreType and PromoInterval have the object data type. Let's analyze them and convert them to integer codes.

In [None]:
print(train_stores_data['StoreType'].unique())
print(train_stores_data['Assortment'].unique())
print(train_stores_data['PromoInterval'].unique())

In [None]:
train_stores_data['StoreType'] = train_stores_data['StoreType'].map({'a':1,'c':2,'d':3})
train_stores_data['Assortment'] = train_stores_data['Assortment'].map({'a':1,'c':2})
train_stores_data['PromoInterval'] = train_stores_data['PromoInterval'].map({0:0,'Jan,Apr,Jul,Oct':1,'Feb,May,Aug,Nov':2})

In [None]:
train_stores_data.info()

# EDA on the data set


In [None]:
eda_train_data = train_stores_data.copy()

eda_train_data['Year'] = pd.DatetimeIndex(eda_train_data['Date']).year
eda_train_data['Month'] = pd.DatetimeIndex(eda_train_data['Date']).month
eda_train_data['Day'] = pd.DatetimeIndex(eda_train_data['Date']).day
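# DatetimeIndex.week was deprecated and later removed from pandas; isocalendar().week is its replacement
eda_train_data['Week'] = eda_train_data['Date'].dt.isocalendar().week.astype(int)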

#### Defining common functions for plotting the charts for EDA

In [None]:
def plot_bar_chart(groupby,data_mapping=None ):
  # Average Sales per category of the grouping column, shown as bars
  eda_grp = eda_train_data.groupby([groupby],as_index = False)
  eda_grp = eda_grp.agg({'Sales':'mean'})
  if data_mapping is not None:
    eda_grp[groupby] =eda_grp[groupby].map(data_mapping).astype(str)
  fig, ax= plt.subplots( figsize=(15,5))
  sns.barplot(x=groupby,y='Sales',data = eda_grp, ax=ax)
  plt.show()

def plot_line_chart(groupby,data_mapping=None ):
  # Average Sales per value of the grouping column, shown as a line
  eda_grp = eda_train_data.groupby([groupby],as_index = False)
  eda_grp = eda_grp.agg({'Sales':'mean'})
  if data_mapping is not None:
    eda_grp[groupby] =eda_grp[groupby].map(data_mapping).astype(str)
  fig, ax= plt.subplots( figsize=(15,5))
  sns.lineplot(x=groupby,y='Sales',data = eda_grp, ax=ax)
  plt.show()

def plot_factor_chart(col=None, hue=None, x='Month', y='Sales', row='Year', data=eda_train_data):
  # sns.factorplot was renamed to sns.catplot and removed from recent seaborn releases;
  # kind='point' keeps factorplot's default point plot.
  # catplot skips col, hue and row when they are None.
  # Uses the data argument so callers can pass a filtered frame.
  sns.catplot(data=data, x=x, y=y, col=col, hue=hue, row=row, kind='point')

def plot_scatter_plot(hue, col, values):
   fig, axs = plt.subplots(3,3,figsize=(15,10))
   for i, ax in enumerate(axs.flatten()):
     sns.scatterplot(x='Sales',y='Customers',hue=hue,data=eda_train_data[eda_train_data['Store']==values[i]],ax=ax)
     ax.set_title('Store '+ str(values[i]))
   fig.tight_layout()
   fig.subplots_adjust(top=0.90)
   plt.suptitle('Sales vs Customers with hue as ' + hue)

# Weekly Sales Analysis

In [None]:
plot_bar_chart('DayOfWeek',{1:'Monday',2:'Tuesday',3:'Wednesday',4:'Thursday',5:'Friday',6:'Saturday',7:'Sunday'})

**Observation** Monday clearly has the highest sales; sales then dip through the week and spike again on Friday. On Sunday there are no sales. Maybe these shops are closed on Sundays, or people prefer not to shop over the weekend. Let's look into Sundays further.

In [None]:
eda_train_data[eda_train_data['DayOfWeek']==7]['Open'].describe()

We can see that the shops remain closed on Sundays.
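
A compact way to check this for every weekday at once is to average the `Open` flag per `DayOfWeek`. The sketch below uses a small made-up frame in place of `eda_train_data`; the column names follow the notebook, while the values are illustrative only.

```python
import pandas as pd

# Two illustrative weeks for one store: open Monday to Saturday, closed on Sunday (7)
toy = pd.DataFrame({
    "DayOfWeek": list(range(1, 8)) * 2,
    "Open": [1, 1, 1, 1, 1, 1, 0] * 2,
})

# The mean of a 0/1 flag is the share of days on which the store was open
open_share = toy.groupby("DayOfWeek")["Open"].mean()
print(open_share)
```

A share of 0 for a weekday means the store was never open on that day in the sample.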

# Monthly Analysis of Sales

In [None]:
plot_bar_chart('Month',{1:'January',2:'February',3:'March',4:'April',5:'May',6:'June',7:'July',8:'August',9:'September',10:'October',11:'November',12:'December'})

**Observation** We can clearly see that sales in November and December are much higher than in all other months. New Year, Christmas and Thanksgiving are probably playing a big role in the high sales volume.

## Yearly Sales

In [None]:
plot_bar_chart('Year')

**Observation** We can see only a gradual increase in Sales from 2013 to 2015.

# Analysis on Store Type and Assortment 

In [None]:
plot_bar_chart('StoreType',{1:'a',2:'c',3:'d'})

plot_bar_chart('Assortment',{1:'a',2:'c'})

**Observation** 
- Store type **'c'** has a higher average Sales than **'a' and 'd'**
- Assortment **'c'** (extended type) seems to have a higher average Sales than **'a' (basic type)**

In [None]:
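# Restrict to numeric columns; recent pandas versions no longer drop non-numeric columns (such as Date) automatically
correlation = train_stores_data.select_dtypes('number').corr()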
plt.figure(figsize=(20,20))

sns.heatmap(correlation, annot=True, cmap='YlGnBu_r')

**Observation** Sales seems to have a high positive correlation with Customers, Promo and Open, and a negative correlation with DayOfWeek.
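
To read a heatmap like this more easily, the correlations with `Sales` can be pulled out and sorted into a single column. The sketch below assumes a synthetic frame in which `Sales` is driven by `Customers` and `Promo`; the `Noise` column and all values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data: Promo lifts Customers, and Customers drive Sales
rng = np.random.default_rng(42)
n = 500
promo = rng.integers(0, 2, n)
customers = 500 + 200 * promo + rng.normal(0, 50, n)
toy = pd.DataFrame({
    "Sales": 8 * customers + rng.normal(0, 300, n),
    "Customers": customers,
    "Promo": promo,
    "Noise": rng.normal(0, 1, n),
})

# Correlation of every numeric column with Sales, strongest first
sales_corr = (
    toy.select_dtypes("number").corr()["Sales"]
    .drop("Sales")
    .sort_values(ascending=False)
)
print(sales_corr)
```

Sorting makes it easier to decide which columns are worth keeping as exogenous variables in the models that follow.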

# Analyzing Promotional offers and their impacts on the Sales

In [None]:
plot_factor_chart(col='Promo', hue='Promo2')

**Observation** We can clearly see from the above that whenever promotional offers were given, there is an upward trend in Sales.

- Similarly, there is a sharp increase in Sales during November and December. The data contains seasonal factors that play a major role in sales.

- While <mark>Promo has an impact on sales, Promo2 doesn't seem to have much of an impact on the Sales increase</mark>

In [None]:
# Sales trend over days
plot_factor_chart(hue='Promo',x='DayOfWeek',y='Sales',row=None)
plot_factor_chart(hue='Promo',x='Month',y='Sales',row=None)
plot_factor_chart(hue='Promo',x='Year',y='Sales',row=None)

**Observation** 
- No promotional offers were made over the weekend
- Large promotional offers were made during November and December
- No major upward trend can be observed between 2013 and 2015 in terms of the promotional offers made

## Analysis on Holidays on the overall Sales

In [None]:
plot_factor_chart(col='StateHoliday', hue='Open')

**Observation** We can clearly see that when State Holidays were declared, the shops remained closed and there were no sales.

In [None]:
plot_factor_chart(col='SchoolHoliday', hue='Open',data=eda_train_data[eda_train_data['StateHoliday'] == 0])

**Observation** When schools are closed, not many shops are closed. Not much can be inferred from school holidays.

# Analysis on Customers vs Sales

In [None]:
plot_scatter_plot(hue='Promo', col='Sale', values=stores_list)

**Observation** With promotional offers, both Sales and the number of customers visiting the shops seem to increase.

# Analysis of Competition on Sales

In [None]:
plot_line_chart('CompetitionOpenSinceYear')
plot_line_chart('CompetitionOpenSinceMonth')
plot_line_chart('CompetitionDistance')

**Observation**
- Competitors that have been open for a very long time seem to have less impact on Sales than those that opened more recently.

In [None]:
fig, ((axis1,axis2),(axis3,axis4),(axis5,axis6),(axis7,axis8)) = plt.subplots(4,2,figsize=(12,12))
sns.boxplot(y='Sales',x='Open',data=eda_train_data, ax = axis1)
sns.boxplot(y='Sales',x='DayOfWeek',data=eda_train_data, ax = axis2)
sns.boxplot(y='Sales',x='StateHoliday',data=eda_train_data, ax = axis3)
sns.boxplot(y='Sales',x='SchoolHoliday',data=eda_train_data, ax = axis4)
sns.boxplot(y='Sales',x='Promo',data=eda_train_data, ax = axis5)
sns.boxplot(y='Sales',x='Promo2',data=eda_train_data, ax = axis6)
sns.boxplot(y='Sales',x='StoreType',data=eda_train_data, ax = axis7)
sns.boxplot(y='Sales',x='Assortment',data=eda_train_data, ax = axis8)
fig.tight_layout(pad=1.0)

**Observation**
- Sales are made only when the shops are open. Similarly, on a state holiday the shops remain closed.
- DayOfWeek seems to have a significant impact on Sales.
- When a state holiday is observed, shops remain closed and no sales are made. However, a school holiday doesn't seem to have any major impact on Sales.
- Promo has a significant impact on Sales, while Promo2 doesn't have any visible impact.
- For StoreType 'c' and Assortment 'c' the upper range of Sales is significantly higher, but we don't see a significant impact on the mean of Sales.


**EDA Overall Analysis**

  - Overall, Mondays have the highest Sales
  - Promotional offers weren't made during weekends
  - Whenever a StateHoliday is declared, the shops remain closed and no sales are made
  - Promo2 doesn't seem to have any impact on the Sales increase
  - Store Type 'c' and Assortment 'c' seem to have higher sales compared to other store types and assortments
  - We see an upward trend in sales and customers when more Promo offers are available
  - Customers and Sales are highly correlated


# Time Series Analysis and Predictions Store Wise


#### Defining common methods for time series analysis

In [None]:
def plot_time_series(data,store_number):
  fig, ax = plt.subplots(figsize = (15,5))
  # Resample the daily values to weekly totals before plotting
  data.resample('W').sum().plot(ax=ax)
  fig.suptitle('Rossmann Store Sales for Store '+ str(store_number))
  fig.tight_layout()
  fig.subplots_adjust(top=0.90)
  plt.show(block=False)

In [None]:
def stationarity_test(data, store_number):
  result = adfuller(data)
  print('ADF Statistic: %f' % result[0])
  print('p-value: %f' % result[1])
  print('Critical Values:')
  for key, value in result[4].items():
    if key=='5%':
      print('\t%s: %.3f' % (key, value))

  if result[1] <= 0.05:
    print('Time Series for Store '+str(store_number)+' is Stationary')
  else:
    print('Time Series for Store '+str(store_number)+' is not Stationary')
  

def plot_seasonal_decompose(data):
  rcParams['figure.figsize'] = 12, 8
  # additive seasonal index; 'period' replaces the deprecated 'freq' argument
  decomposition = sm.tsa.seasonal_decompose(data, model='additive', period=365)
  fig = decomposition.plot()
  plt.show()

def plot_auto_corr(data, title):
    plt.figure(figsize=(12,4))

    plt.subplot(121)
    plot_acf(data,lags=30, ax=plt.gca())
    plt.title('ACF for '+title)

    plt.subplot(122)
    plot_pacf(data,lags=30, ax=plt.gca())
    plt.title('PACF for'+title)

    plt.show()

def percentage_error(actual, predicted):
    res = np.empty(actual.shape)
    for j in range(actual.shape[0]):
        if actual[j] != 0:
            res[j] = (actual[j] - predicted[j]) / actual[j]
        else:
            res[j] = predicted[j] / np.mean(actual)
    return res

def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs(percentage_error(np.asarray(y_true), np.asarray(y_pred)))) * 100

"""
    Johansen cointegration test of the cointegration rank of a VECM

    Parameters
    ----------
    endog : array_like (nobs_tot x neqs)
        Data to test
    det_order : int
        * -1 - no deterministic terms - model1
        * 0 - constant term - model3
        * 1 - linear trend
    k_ar_diff : int, nonnegative
        Number of lagged differences in the model.
"""

def johansen_output(data, variables):
    res = coint_johansen(data[variables],-1,1)

    output = pd.DataFrame([res.lr2,res.lr1],
                          index=['max_eig_stat',"trace_stat"])
    traces = res.lr1
    cvts = res.cvt  ## 0: 90%  1:95% 2: 99%

    N, l = np.asarray(data[variables]).shape
    r =0
    for i in range(l):
        if traces[i] > cvts[i, 1]:
            r = i + 1

    print(output.T,'\n')
    print("Critical values(90%, 95%, 99%) of max_eig_stat\n",res.cvm,'\n')
    print("Critical values(90%, 95%, 99%) of trace_stat\n",res.cvt,'\n')
    if r == 2:
      print("The Rank is 2. There is no cointegration and the series is stationary. Since they are stationary, you can build VAR/VARMAX\n")
    elif r == 1:
      print("The Rank is 1. Variable 2 can be expressed in terms of Variable 1. Cointegration exists. Since they are cointegrated, you can build VAR/VARMAX\n")
    elif r == 0:
      print("The Rank is 0. Unable to find a non-zero value for ω1 and ω2 ⇒ No cointegrating vector exists ⇒ y1t and y2t are not cointegrated ⇒ You will not be able to build a VAR/VARMAX model as it is.\n")

In [None]:
def remove_outliers(data):
    print('Removing 99 percentile outlier from dataset')
    sales_upper_limit=data['Sales'].quantile(0.99)
    customers_upper_limit=data['Customers'].quantile(0.99)
    data = data[data['Sales']<= sales_upper_limit]
    data = data[data['Customers']<= customers_upper_limit]
    return data

In [None]:
def impute_missing_data(data, store_number):
    print('Using Linear imputation to impute missing data in dataset')
    temp = pd.date_range(start=data.index.min(), end= data.index.max())

    print(len(temp))

    data= data.reindex(temp, fill_value=np.nan)
    
    data = data.assign(Sales_Linear_Interpolation=data.Sales.interpolate(method='linear'))
    data[['Sales_Linear_Interpolation']].plot(figsize=(12, 4))
    plt.legend(loc='best')
    plt.title('Store {}: Linear interpolation'.format(store_number))
    plt.show(block=False)


    data = data.assign(Customers_Linear_Interpolation=data.Customers.interpolate(method='linear'))
    data[['Customers_Linear_Interpolation']].plot(figsize=(12, 4))
    plt.legend(loc='best')
    plt.title('Store {}: Linear interpolation'.format(store_number))
    plt.show(block=False)


    data['Sales'] = data['Sales_Linear_Interpolation']
    data['Customers'] = data['Customers_Linear_Interpolation']
    data.drop(columns=['Sales_Linear_Interpolation','Customers_Linear_Interpolation'],inplace=True)
    
    data= data.fillna(0)
    
    # Stores are closed on Sundays (DayOfWeek == 7), so force Sales and Customers to 0 there
    data.loc[data['DayOfWeek']==7, ['Sales','Customers']] = 0
    
    print(data[data['DayOfWeek']==7].head())

    data.sort_index(inplace=True)

    data.isnull().sum()
    
    return data

#### Performing Preprocessing Steps
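
Because `Sales` and `Customers` are min-max scaled in this step, error values computed on them afterwards are in scaled units rather than in sales units. The sketch below uses plain pandas arithmetic in place of `MinMaxScaler`, on made-up values, to show how a scaled value or error maps back to the original units.

```python
import pandas as pd

# Made-up daily sales for illustration
sales = pd.Series([0.0, 2500.0, 5000.0, 10000.0], name="Sales")

# Min-max scaling: (x - min) / (max - min), the formula MinMaxScaler applies per column by default
lo, hi = sales.min(), sales.max()
scaled = (sales - lo) / (hi - lo)

# The inverse transform recovers the original units
restored = scaled * (hi - lo) + lo

# An error of 0.05 in scaled units corresponds to this many sales units
error_in_units = 0.05 * (hi - lo)
print(scaled.tolist(), error_in_units)
```

With a fitted `MinMaxScaler`, the same mapping back is available through its `inverse_transform` method.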

In [None]:
## Scaling of Sales and Customers features before applying any model on them
data = train_stores_data.copy()
# Customers, Promo, DayOfWeek and Open were already seen to have a high correlation with Sales,
# hence only the columns below are considered for further analysis
data = data[['Store','Date','Sales','Customers','Promo', 'DayOfWeek', 'Open']]

## Creating Dummies for DayOfWeek
#data= pd.get_dummies(data, columns=['DayOfWeek'], drop_first=True  )

# Scaling the Sales and Customers with MinmaxScaler
mms=sklearn.preprocessing.MinMaxScaler()

data[["Sales", "Customers"]]=mms.fit_transform(data[["Sales", "Customers"]])

# Setting Date as Index 
data = data.set_index('Date')

# Sorting the index
data.sort_index(inplace=True)

In [None]:
data.head()

In [None]:
exog_univariate = ['Customers','Promo','Open', 'DayOfWeek']

exog_multivariate = ['Promo','Open', 'DayOfWeek']

## Classes for Univariate and Multivariate model building

- For univariate modelling we consider **ARIMA, ARIMAX and SARIMAX**
- For multivariate modelling we consider **VAR and VARMAX**
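
Every model in these classes is evaluated in the same way: the last 42 daily observations (six weeks) are held out for testing and the rest are used for fitting. A minimal sketch of that split on a synthetic daily series follows; `HORIZON` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

HORIZON = 42  # six weeks of daily observations

# Synthetic daily series standing in for one store's Sales
idx = pd.date_range("2015-01-01", periods=200, freq="D")
series = pd.DataFrame({"Sales": np.arange(200, dtype=float)}, index=idx)

# Negative slicing keeps the split chronological: no shuffling for time series
train = series.iloc[:-HORIZON]
test = series.iloc[-HORIZON:]

print(len(train), len(test))
print(train.index.max(), test.index.min())
```

Keeping the hold-out window at the end of the series mimics the real task, where the next six weeks are always unseen.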

In [None]:
class UniVariateTimeSeries:
  def __init__(self, data, store_number, endog, order, seasonal_order=None, exog=None):
    self.endog = endog
    self.exog= exog
    self.data = data
    self.order= order
    self.seasonal_order= seasonal_order
    self.store_number=store_number
    
  def build_arima(self):
    
    ## Predict for 6 weeks
    train_length = -42

    endog_data=self.data[self.endog].astype(float)

    endog_data[self.endog]=endog_data[self.endog].astype(float)

    train = endog_data[:train_length]
    test=endog_data[train_length:]

    arima_model = ARIMA(train, order=self.order)
    arima_model_fit = arima_model.fit()

  
    test['ArimaForecastedSales'] = arima_model_fit.predict(start=test.index.min(),end=test.index.max()).round(2)
    

    print(test.head(5))

    print('\n\n ------------ Model Fit -----------------------')
    # Plotting actual vs forecasted Sales for the test window
    plt.figure(figsize=(16,2)) 
    plt.plot(test['Sales'], label='Test')
    plt.plot(test['ArimaForecastedSales'], label='ARIMA')
    plt.legend(loc='best')
    plt.title('ARIMA Model')
    plt.show()

    rmse_sales = np.sqrt(mean_squared_error(test['Sales'], test['ArimaForecastedSales'])).round(2)
    mape_sales = mean_absolute_percentage_error(test['Sales'], test['ArimaForecastedSales']).round(2)

    tempResults = pd.DataFrame({'Store':self.store_number,'Time Series Model':'ARIMA', 'RMSE': [rmse_sales],'MAPE': [mape_sales], 'Variable':['Sales'] })
    return tempResults


  def build_arimax(self):
    
    ## Predict for 6 weeks
    train_length = -42

    endog_data=self.data[self.endog].astype(float)

    endog_data[self.endog]=endog_data[self.endog].astype(float)

    train_endog = endog_data[:train_length]
    test_endog=endog_data[train_length:]

    train_exog = self.data[:train_length][self.exog].astype(float)
    test_exog = self.data[train_length:][self.exog].astype(float)

    
    arima_model = ARIMA(train_endog, order=self.order, exog=train_exog)
    arima_model_fit = arima_model.fit()

  
    test_endog['ArimaxForecastedSales'] = arima_model_fit.predict(start=test_endog.index.min(),end=test_endog.index.max(), exog=test_exog).round(2)
   
    print(test_endog.head(5))

    print('\n\n ------------ Model Fit -----------------------')
    # Plotting actual vs forecasted Sales for the test window
    plt.figure(figsize=(16,2)) 
    plt.plot(test_endog['Sales'], label='Test')
    plt.plot(test_endog['ArimaxForecastedSales'], label='ARIMAX')
    plt.legend(loc='best')
    plt.title('ARIMAX Model')
    plt.show()

    rmse_sales = np.sqrt(mean_squared_error(test_endog['Sales'], test_endog['ArimaxForecastedSales'])).round(2)
    mape_sales = mean_absolute_percentage_error(test_endog['Sales'], test_endog['ArimaxForecastedSales']).round(2)
    smape_sales = np.round(np.mean(np.abs(test_endog['Sales'] - test_endog['ArimaxForecastedSales'])/((abs(test_endog['Sales'])+abs(test_endog['ArimaxForecastedSales']))/2))*100,2)

    tempResults = pd.DataFrame({'Store':self.store_number,'Time Series Model':'ARIMAX', 'RMSE': [rmse_sales],'MAPE': [mape_sales], 'Variable':['Sales'] })
    return tempResults


  def build_sarima(self):
    
    ## Predict for 6 weeks
    train_length = -42

    endog_data=self.data[self.endog].astype(float)

    endog_data[self.endog]=endog_data[self.endog].astype(float)


    train = endog_data[:train_length]
    test=endog_data[train_length:]

    sarima_model = SARIMAX(train['Sales'], order=self.order,
                                           seasonal_order=self.seasonal_order,  
                                           enforce_stationarity=False,
                                            enforce_invertibility=False,
                           simple_differencing =True)
    sarima_model_fit = sarima_model.fit()

    sarima_model_fit.plot_diagnostics(figsize=(10, 10))
    plt.show()

    test['SarimaForecastedSales'] = sarima_model_fit.predict(start=test.index.min(),end=test.index.max()).round(2)
    
    print(test.head(5))

    print('\n\n ------------ Model Fit -----------------------')
    # Plotting actual vs forecasted Sales for the test window
    plt.figure(figsize=(16,2)) 
    plt.plot(test['Sales'], label='Test')
    plt.plot(test['SarimaForecastedSales'], label='SARIMA')
    plt.legend(loc='best')
    plt.title('SARIMA Model')
    plt.show()

    rmse_sales = np.sqrt(mean_squared_error(test['Sales'], test['SarimaForecastedSales'])).round(2)
    mape_sales = mean_absolute_percentage_error(test['Sales'], test['SarimaForecastedSales']).round(2)
    smape_sales = np.round(np.mean(np.abs(test['Sales'] - test['SarimaForecastedSales'])/((abs(test['Sales'])+abs(test['SarimaForecastedSales']))/2))*100,2)

    tempResults = pd.DataFrame({'Store':self.store_number,'Time Series Model':'SARIMA', 'RMSE': [rmse_sales],'MAPE': [mape_sales], 'Variable':['Sales'] })
    return tempResults


  def build_sarimax(self):
    
    ## Predict for 6 weeks
    train_length = -42

    endog_data=self.data[self.endog].astype(float)

    endog_data[self.endog]=endog_data[self.endog].astype(float)

    train_endog = endog_data[:train_length]
    test_endog=endog_data[train_length:]

    train_exog = self.data[:train_length][self.exog].astype(float)
    test_exog = self.data[train_length:][self.exog].astype(float)

    
    sarimax_model = SARIMAX(train_endog, order=self.order, seasonal_order=self.seasonal_order, exog=train_exog,  
                                           enforce_stationarity=False,
                                            enforce_invertibility=False,
                           simple_differencing =True)
    sarimax_model_fit = sarimax_model.fit()

  
    test_endog['SarimaxForecastedSales'] = sarimax_model_fit.predict(start=test_endog.index.min(),end=test_endog.index.max(), exog=test_exog).round(2)
    print(test_endog.head(5))
    
    print('\n\n ------------ Model Fit -----------------------')
    # Plotting actual vs forecasted Sales for the test window
    plt.figure(figsize=(16,2)) 
    plt.plot(test_endog['Sales'], label='Test')
    plt.plot(test_endog['SarimaxForecastedSales'], label='SARIMAX')
    plt.legend(loc='best')
    plt.title('SARIMAX Model')
    plt.show()

    rmse_sales = np.sqrt(mean_squared_error(test_endog['Sales'], test_endog['SarimaxForecastedSales'])).round(2)
    mape_sales = mean_absolute_percentage_error(test_endog['Sales'], test_endog['SarimaxForecastedSales']).round(2)
    smape_sales = np.round(np.mean(np.abs(test_endog['Sales'] - test_endog['SarimaxForecastedSales'])/((abs(test_endog['Sales'])+abs(test_endog['SarimaxForecastedSales']))/2))*100,2)

    tempResults = pd.DataFrame({'Store':self.store_number,'Time Series Model':'SARIMAX', 'RMSE': [rmse_sales],'MAPE': [mape_sales], 'Variable':['Sales'] })
    return tempResults




In [None]:
class MultiVariateTimeSeries:
  def __init__(self, data, store_number, endog, order, seasonal_order=None, exog=None):
    self.endog = endog
    self.exog= exog
    self.data = data
    self.order= order
    self.seasonal_order= seasonal_order
    self.store_number=store_number

  def build_var_model(self):
    
    ## Predict for 6 weeks
    train_length = -42

    endog_data=self.data[self.endog].astype(float)

    endog_data[self.endog]=endog_data[self.endog].astype(float)


    train = endog_data[:train_length]
    test=endog_data[train_length:]


    mod = sm.tsa.VARMAX(train[self.endog], order=self.order, trend='n')
    result = mod.fit(maxiter=1000, disp=False)
    print(result.summary())

    predictions = result.predict(start=test.index.min(),end=test.index.max())

    # Plot impulse response
    plt.figure(figsize=(12,6)) 
    result.impulse_responses(10, orthogonalized=True).plot(figsize=(13,3))
    plt.legend(loc='best')
    plt.title('Responses to a shock')
    plt.show()
 
    results = pd.DataFrame(columns=['Store','Time Series Model', 'RMSE','MAPE', 'Variable'])
    for var in self.endog:
      plt.figure(figsize=(12,6)) 
      plt.plot(test[var], label='Test')
      plt.plot(predictions[var], label='VAR')
      plt.legend(loc='best')
      plt.title('VAR Model - '+ var)
      plt.show()

      rmse = np.sqrt(mean_squared_error(test[var], predictions[var])).round(2)
      mape = mean_absolute_percentage_error(test[var], predictions[var]).round(2)

      tempResults = pd.DataFrame({'Store':[self.store_number],'Time Series Model':['VAR'], 'RMSE': [rmse],'MAPE': [mape], 'Variable':[var] })
      results = pd.concat([results, tempResults])

    return results


  def build_varmax(self):
    
    ## Predict for 6 weeks
    train_length = -42

    endog_data=self.data[self.endog]

    endog_data[self.endog]=endog_data[self.endog].astype(float)

    train_endog = endog_data[:train_length]
    test_endog=endog_data[train_length:]

    train_exog = self.data[:train_length][self.exog].astype(float)
    test_exog = self.data[train_length:][self.exog].astype(float)

    
    mod = sm.tsa.VARMAX(train_endog[self.endog], order=self.order, trend='n',exog=train_exog)
    result = mod.fit(maxiter=1000, disp=False)
    print(result.summary())

    # Plot impulse response
    plt.figure(figsize=(12,6)) 
    result.impulse_responses(10, orthogonalized=True).plot(figsize=(13,3))
    plt.legend(loc='best')
    plt.title('Responses to a shock')
    plt.show()

    predictions = result.predict(start=test_endog.index.min(),end=test_endog.index.max(),exog= test_exog)

    results = pd.DataFrame(columns=['Store','Time Series Model', 'RMSE','MAPE', 'Variable'])
    for var in self.endog:
      plt.figure(figsize=(12,6)) 
      plt.plot(test_endog[var], label='Test')
      plt.plot(predictions[var], label='VARMAX')
      plt.legend(loc='best')
      plt.title('VARMAX Model - '+ var)
      plt.show()

      rmse = np.sqrt(mean_squared_error(test_endog[var], predictions[var])).round(2)
      mape = mean_absolute_percentage_error(test_endog[var], predictions[var]).round(2)

      tempResults = pd.DataFrame({'Store':self.store_number,'Time Series Model':'VARMAX', 'RMSE': [rmse],'MAPE': [mape], 'Variable':[var] })
      results = pd.concat([results, tempResults])

    return results

# Store 1 Model Building

In [None]:
stores_1 = data[data.Store == 1]

stores_1 = remove_outliers(stores_1)

# Dropping the Store column (Open is kept as an exogenous variable)
stores_1.drop(['Store'], axis=1, inplace=True)

len(stores_1)

stores_1= impute_missing_data(stores_1, 1)

len(stores_1)

In [None]:
plot_time_series(stores_1['Sales'], '1')

In [None]:
stationarity_test(stores_1['Sales'], 1)

In [None]:
plot_seasonal_decompose(stores_1['Sales'])

In [None]:
plot_auto_corr(stores_1['Sales'], ' Store '+str(1))

In [None]:
len(stores_1)

**Observation** 
- The time series plot looks roughly stationary.
- The ADF (Augmented Dickey-Fuller) test confirms that the time series is stationary.
- Store 1 seems to have a downward trend.
- Sales spike during November and December and decline afterwards, which indicates seasonality in the data set.
- The PACF plot shows strong correlation at lags 7 and 14. The ACF plot shows some trend and does not cut off.
- Hence **p=7 and q=0**, i.e. `order=(7,0,0)`, is used for further modeling.
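
To illustrate why a weekly cycle produces spikes at lags 7 and 14, here is a small sketch that computes the sample autocorrelation on a synthetic daily series. It uses plain NumPy rather than the notebook's `plot_auto_corr` helper, it shows the ACF rather than the PACF, and the weekday effect and noise level are made-up values.

```python
import numpy as np

def sample_acf(x, lag):
    """Sample autocorrelation of x at a positive integer lag."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return float(np.sum(x[lag:] * x[:-lag]) / np.sum(x * x))

rng = np.random.default_rng(0)
days = np.arange(7 * 52)                       # one year of daily data

# Hypothetical weekday effect (last day closed), plus Gaussian noise
weekly_pattern = np.array([5.0, 4.0, 4.0, 4.5, 6.0, 8.0, 0.0])
sales = 1000 * weekly_pattern[days % 7] + rng.normal(0, 200, size=days.size)

acf_by_lag = {lag: round(sample_acf(sales, lag), 2) for lag in (1, 3, 7, 14)}
print(acf_by_lag)
```

On such a series the autocorrelation at lags 7 and 14 is far larger than at the lags in between, which is the pattern the observation above relies on when choosing a lag order of 7.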


## Building ARIMA Model for Store 1

In [None]:
# Empty Data Frame to Store results obtained

results = pd.DataFrame(columns= ['Time Series Model','Store','Variable','MAPE','RMSE'])

results

In [None]:
stores_1.head()

In [None]:
endog=['Sales']

univariate_arima = UniVariateTimeSeries(data=stores_1,store_number=1,endog=endog, order=(7,0,0))

tempResults = univariate_arima.build_arima()

tempResults

results = pd.concat([results, tempResults])

results


## Building ARIMAX Model for Store 1

In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_1,store_number=1,endog=endog,exog=exog_univariate, order=(7,0,0))

tempResults = univariate_arima.build_arimax()

results = pd.concat([results, tempResults])

results

## Building SARIMAX Model for Store 1

In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_1,store_number=1,endog=endog,order=(7,0,0),seasonal_order=(7,0,1,12),exog=exog_univariate)

tempResults = univariate_arima.build_sarimax()

results = pd.concat([results, tempResults])

results

# MultiVariate Model(s)

### Johansen cointegration Test

In [None]:
# Checking for Cointegration between Sales and Customers
johansen_output(stores_1, ['Sales','Customers'])

## Building VAR Model for Store 1

In [None]:
endog_multi_variate = ['Sales','Customers']

multiVariate_varmax = MultiVariateTimeSeries(data=stores_1,store_number=1,endog=endog_multi_variate,order=(7,0))

tempResults = multiVariate_varmax.build_var_model()

#results = pd.concat([results,tempResults])

tempResults

In [None]:
results = pd.concat([results,tempResults])

results


## Building VARMAX Model for Store 1

In [None]:
multiVariate_varmax = MultiVariateTimeSeries(data=stores_1,store_number=1,endog=endog_multi_variate,order=(7,0), exog= exog_multivariate)

tempResults = multiVariate_varmax.build_varmax()

tempResults

In [None]:
results = pd.concat([results,tempResults])

results


**Observation** 
The **SARIMAX** model gives the best results for **Store 1**.

### Ranking the results and storing the best model in the final result for Store 1

In [None]:
final_result_stores = pd.DataFrame(columns=['Store','Time Series Model', 'Variable','MAPE','RMSE'])
final_result = pd.DataFrame(columns=['Store','Time Series Model', 'Variable','MAPE','RMSE'])

In [None]:
results_cols = ['MAPE', 'RMSE']

rank_data= results.copy()

rank_data["Rank"] = rank_data[results_cols].apply(tuple,axis=1).rank(method='dense',ascending=True).astype(int)

rank_data.sort_values("Rank", inplace=True)

rank_data= rank_data[rank_data.Rank ==1 ]

# Dropping Rank Column
rank_data=rank_data.drop('Rank',axis=1)
final_result = pd.concat([final_result, rank_data])


final_result_stores = pd.concat([final_result_stores, results])


print('Comprehensive Result')
print('==================================')
print(final_result_stores.head(20))

print('\n\n Best Model for Store 1')
print('==================================')
print(final_result.head())
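
The ranking cell above orders models lexicographically by (MAPE, RMSE). Because the VAR and VARMAX runs also add rows for `Customers`, the top-ranked row is not guaranteed to be a `Sales` forecast. Below is a sketch of a similar selection restricted to `Sales`. The numbers are invented for illustration and are not the notebook's results, and `results_demo` is a hypothetical stand-in for `results`.

```python
import pandas as pd

# Made-up scores standing in for the notebook's results frame
results_demo = pd.DataFrame({
    'Store': [1, 1, 1, 1],
    'Time Series Model': ['ARIMA', 'SARIMAX', 'VAR', 'VAR'],
    'Variable': ['Sales', 'Sales', 'Sales', 'Customers'],
    'MAPE': [12.0, 9.5, 11.0, 8.0],
    'RMSE': [600.0, 480.0, 550.0, 60.0],
})

# Keep only the forecast target, then sort by MAPE with RMSE as tie-breaker
sales_only = results_demo[results_demo['Variable'] == 'Sales']
best = sales_only.sort_values(['MAPE', 'RMSE'], ascending=True).head(1)
print(best)
```

Unlike the dense-rank filter, `head(1)` keeps a single row even when two models tie exactly on both metrics.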

In [None]:
final_result_stores.to_csv(file_path+'/final_results.csv', index = False)

## Store **3** Model Building

In [None]:
stores_3 = data[data.Store == 3]
# Dropping Store Column
stores_3.drop(['Store'], axis=1, inplace=True)

# Removing outliers from data set
stores_3 = remove_outliers(stores_3)

# Imputing data if any missing
stores_3= impute_missing_data(stores_3, 3)

stores_3.sort_index(inplace=True)

print(len(stores_3))

In [None]:
stores_3.head()

In [None]:
plot_time_series(stores_3['Sales'], '3')

In [None]:
stationarity_test(stores_3['Sales'], 3)

In [None]:
plot_seasonal_decompose(stores_3['Sales'])

In [None]:
plot_auto_corr(stores_3['Sales'], ' Store '+str(3))

**Observation** 
- The time series plot looks roughly stationary.
- The ADF (Augmented Dickey-Fuller) test confirms that the time series is stationary.
- Store 3 seems to have a downward trend, followed by an upward trend from Jan 2015.
- Sales spike during November and December and decline afterwards, which indicates seasonality in the data set.
- The PACF plot shows strong correlation at lags 7 and 14. The ACF plot shows some trend and does not cut off.
- Hence **p=7 and q=0**, i.e. `order=(7,0,0)`, is used for further modeling.

In [None]:
results_3 = pd.DataFrame(columns= ['Time Series Model','Store','Variable','MAPE','RMSE'])

results_3

## Building ARIMA Model for Store 3

In [None]:
endog=['Sales']

univariate_arima = UniVariateTimeSeries(data=stores_3,store_number=3,endog=endog, order=(7,0,0))

tempResults = univariate_arima.build_arima()

results_3 = pd.concat([results_3, tempResults])

results_3

# Building ARIMAX for Store 3

In [None]:
endog=['Sales']

univariate_arima = UniVariateTimeSeries(data=stores_3,store_number=3,endog=endog, order=(7,0,0), exog=exog_univariate)

tempResults = univariate_arima.build_arimax()

results_3 = pd.concat([results_3, tempResults])

results_3

# Building SARIMAX Model for Store 3

In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_3,store_number=3,endog=endog,order=(7,0,0),seasonal_order=(7,0,0,12),exog=exog_univariate)

tempResults = univariate_arima.build_sarimax()

results_3 = pd.concat([results_3, tempResults])

results_3

# MultiVariate Model(s)

### Johansen cointegration Test

In [None]:
# Checking for Cointegration between Sales and Customers
johansen_output(stores_3, ['Sales','Customers'])

# Building VAR Model For Store 3

In [None]:
multiVariate_varmax = MultiVariateTimeSeries(data=stores_3,store_number=3,endog=endog_multi_variate,order=(7,0))

tempResults = multiVariate_varmax.build_var_model()


In [None]:
tempResults

In [None]:
results_3 = pd.concat([results_3, tempResults])

results_3

# Building VARMAX Model for Store 3

In [None]:
exog_multivariate

In [None]:
multiVariate_varmax = MultiVariateTimeSeries(data=stores_3,store_number=3,endog=endog_multi_variate,order=(7,0), exog= exog_multivariate)

tempResults = multiVariate_varmax.build_varmax()

tempResults


In [None]:
results_3 = pd.concat([results_3, tempResults])

results_3

In [None]:
final_result_stores = pd.read_csv(file_path+'/final_results.csv')

In [None]:
results_cols = ['MAPE', 'RMSE']

rank_data= results_3.copy()

rank_data["Rank"] = rank_data[results_cols].apply(tuple,axis=1).rank(method='dense',ascending=True).astype(int)

rank_data.sort_values("Rank", inplace=True)

rank_data= rank_data[rank_data.Rank ==1 ]

# Dropping Rank Column
rank_data=rank_data.drop('Rank',axis=1)
final_result = pd.DataFrame(columns=['Store','Time Series Model', 'Variable','MAPE','RMSE'])
final_result = pd.concat([final_result, rank_data])


final_result_stores = pd.concat([final_result_stores, results_3])


print('Comprehensive Result')
print('==================================')
print(final_result_stores.head(20))

print('\n\n Best Model for Store 3')
print('==================================')
print(final_result.head())

In [None]:
final_result_stores.to_csv(file_path+'/final_results.csv', index = False)
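
The read, concatenate, and write steps around `final_results.csv` are repeated for every store. One way to fold them into a single call is sketched below. `append_results` is a hypothetical helper, not part of the notebook, and the demo rows are invented.

```python
import os
import tempfile

import pandas as pd

def append_results(csv_path, new_results):
    """Append new_results to the CSV at csv_path, creating it if missing."""
    if os.path.exists(csv_path):
        combined = pd.concat([pd.read_csv(csv_path), new_results],
                             ignore_index=True)
    else:
        combined = new_results.reset_index(drop=True)
    combined.to_csv(csv_path, index=False)
    return combined

# Demo in a temporary directory with made-up scores
demo_path = os.path.join(tempfile.mkdtemp(), 'final_results.csv')
cols = ['Store', 'Time Series Model', 'Variable', 'MAPE', 'RMSE']
store_1 = pd.DataFrame([[1, 'SARIMAX', 'Sales', 9.5, 480.0]], columns=cols)
store_3 = pd.DataFrame([[3, 'ARIMAX', 'Sales', 10.0, 500.0]], columns=cols)

append_results(demo_path, store_1)
combined = append_results(demo_path, store_3)
print(combined)
```

Re-running a store's cells would append its rows a second time, both with this helper and with the cells above, so duplicates may need to be dropped before the final ranking.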

# Store 8 Model Building

In [None]:
stores_8 = data[data.Store == 8]
# Dropping Store Column
stores_8.drop(['Store'], axis=1, inplace=True)

# removing outliers
stores_8 = remove_outliers(stores_8)

# impute missing data 
stores_8= impute_missing_data(stores_8, 8)

stores_8.sort_index(inplace=True)

print(len(stores_8))

In [None]:
plot_time_series(stores_8['Sales'], '8')

In [None]:
stationarity_test(stores_8['Sales'], 8)

In [None]:
plot_seasonal_decompose(stores_8['Sales'])

In [None]:
plot_auto_corr(stores_8['Sales'], ' Store '+str(8))

**Observation** 
- The time series plot looks roughly stationary.
- The ADF (Augmented Dickey-Fuller) test confirms that the time series is stationary.
- Store 8 seems to have a clear upward trend.
- Sales spike during November and December and decline afterwards, which indicates seasonality in the data set.
- The PACF plot shows strong correlation at lags 7 and 14. The ACF plot shows some trend and does not cut off.
- Hence **p=7 and q=0**, i.e. `order=(7,0,0)`, is used for further modeling.

In [None]:
results_8 = pd.DataFrame(columns= ['Time Series Model','Store','Variable','MAPE','RMSE'])

results_8

## Building ARIMA Model for Store 8

In [None]:
endog=['Sales']

univariate_arima = UniVariateTimeSeries(data=stores_8,store_number=8,endog=endog, order=(7,0,0))

tempResults = univariate_arima.build_arima()

results_8 = pd.concat([results_8, tempResults])

results_8

## Building ARIMAX Model for Store 8

In [None]:
endog=['Sales']

univariate_arima = UniVariateTimeSeries(data=stores_8,store_number=8,endog=endog, order=(7,0,0), exog=exog_univariate)

tempResults = univariate_arima.build_arimax()

results_8 = pd.concat([results_8, tempResults])

results_8

## Building SARIMAX Model for Store 8

In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_8,store_number=8,endog=endog,order=(7,0,0),seasonal_order=(7,0,0,12),exog=exog_univariate)

tempResults = univariate_arima.build_sarimax()

results_8 = pd.concat([results_8, tempResults])

results_8

# MultiVariate Model(s)
### Johansen cointegration Test

In [None]:
johansen_output(stores_8, ['Sales','Customers'])

## Building VAR Model For Store 8

In [None]:
multiVariate_varmax = MultiVariateTimeSeries(data=stores_8,store_number=8,endog=endog_multi_variate,order=(7,0))

tempResults = multiVariate_varmax.build_var_model()


results_8 = pd.concat([results_8, tempResults])

results_8

## Building VARMAX Model for Store 8

In [None]:
multiVariate_varmax = MultiVariateTimeSeries(data=stores_8,store_number=8,endog=endog_multi_variate,order=(7,0), exog= exog_multivariate)

tempResults = multiVariate_varmax.build_varmax()


results_8 = pd.concat([results_8, tempResults])

results_8

In [None]:
final_result_stores = pd.read_csv(file_path+'/final_results.csv')

In [None]:
results_cols = ['MAPE', 'RMSE']

rank_data= results_8.copy()

rank_data["Rank"] = rank_data[results_cols].apply(tuple,axis=1).rank(method='dense',ascending=True).astype(int)

rank_data.sort_values("Rank", inplace=True)

rank_data= rank_data[rank_data.Rank ==1 ]

# Dropping Rank Column
rank_data=rank_data.drop('Rank',axis=1)
final_result = pd.DataFrame(columns=['Store','Time Series Model', 'Variable','MAPE','RMSE'])
final_result = pd.concat([final_result, rank_data])


final_result_stores = pd.concat([final_result_stores, results_8])


print('Comprehensive Result')
print('==================================')
print(final_result_stores.head(10))

print('\n\n Best Model for Store 8')
print('==================================')
print(final_result.head())

In [None]:
final_result_stores.to_csv(file_path+'/final_results.csv', index = False)

# Store 9 Model Building

In [None]:
stores_9 = data[data.Store == 9]
# Dropping Store Column
stores_9.drop(['Store'], axis=1, inplace=True)

# removing outliers
stores_9 = remove_outliers(stores_9)


# impute missing data
stores_9= impute_missing_data(stores_9, 9)

stores_9.sort_index(inplace=True)

print(len(stores_9))

In [None]:
plot_time_series(stores_9['Sales'], '9')

In [None]:
plot_seasonal_decompose(stores_9['Sales'])

In [None]:
stationarity_test(stores_9['Sales'], 9)

In [None]:
plot_auto_corr(stores_9['Sales'], ' Store '+str(9))

**Observation** 
- The time series plot looks roughly stationary.
- The ADF (Augmented Dickey-Fuller) test confirms that the time series is stationary.
- Store 9 seems to have a clear upward trend.
- Sales spike during August, November and December and decline afterwards, which indicates seasonality in the data set.
- The PACF plot shows strong correlation at lags 7 and 14. The ACF plot shows some trend and does not cut off.
- Hence **p=7 and q=0**, i.e. `order=(7,0,0)`, is used for further modeling.

In [None]:
# Empty Data Frame to Store results obtained

results_9 = pd.DataFrame(columns= ['Time Series Model','Store','Variable','MAPE','RMSE'])

results_9

# Building ARIMA Model for Store 9


In [None]:
endog=['Sales']

univariate_arima = UniVariateTimeSeries(data=stores_9,store_number=9,endog=endog, order=(7,0,0))

tempResults = univariate_arima.build_arima()

tempResults

results_9 = pd.concat([results_9, tempResults])

results_9

# Building ARIMAX Model for Store 9


In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_9,store_number=9,endog=endog,exog=exog_univariate, order=(7,0,0))

tempResults = univariate_arima.build_arimax()

results_9 = pd.concat([results_9, tempResults])

results_9

# Building SARIMAX Model for Store 9


In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_9,store_number=9,endog=endog,order=(7,0,0),seasonal_order=(7,0,0,12),exog=exog_univariate)

tempResults = univariate_arima.build_sarimax()

results_9 = pd.concat([results_9, tempResults])

results_9

# MultiVariate Model(s)
### Johansen cointegration Test

In [None]:
johansen_output(stores_9, ['Sales','Customers'])

# Building VAR Model for Store 9

In [None]:
endog_multi_variate = ['Sales','Customers']

multiVariate_varmax = MultiVariateTimeSeries(data=stores_9,store_number=9,endog=endog_multi_variate,order=(7,0))

tempResults = multiVariate_varmax.build_var_model()

results_9 = pd.concat([results_9, tempResults])

results_9

# Building VARMAX Model for Store 9

In [None]:
multiVariate_varmax = MultiVariateTimeSeries(data=stores_9,store_number=9,endog=endog_multi_variate,order=(7,0), exog= exog_multivariate)

tempResults = multiVariate_varmax.build_varmax()

results_9 = pd.concat([results_9, tempResults])

results_9

In [None]:
final_result_stores = pd.read_csv(file_path+'/final_results.csv')

In [None]:
results_cols = ['MAPE', 'RMSE']

rank_data= results_9.copy()

rank_data["Rank"] = rank_data[results_cols].apply(tuple,axis=1).rank(method='dense',ascending=True).astype(int)

rank_data.sort_values("Rank", inplace=True)

rank_data= rank_data[rank_data.Rank ==1 ]

# Dropping Rank Column
rank_data=rank_data.drop('Rank',axis=1)
final_result = pd.DataFrame(columns=['Store','Time Series Model', 'Variable','MAPE','RMSE'])
final_result = pd.concat([final_result, rank_data])


final_result_stores = pd.concat([final_result_stores, results_9])


print('Comprehensive Result')
print('==================================')
print(final_result_stores.head(10))

print('\n\n Best Model for Store 9')
print('==================================')
print(final_result.head())

In [None]:
final_result_stores.to_csv(file_path+'/final_results.csv', index = False)


# Store 13 Model Building

In [None]:
stores_13 = data[data.Store == 13]

# Dropping Store Column
stores_13.drop(['Store'], axis=1, inplace=True)

print(len(stores_13))

# Interpolate 
stores_13= impute_missing_data(stores_13, 13)

print(len(stores_13))
stores_13 = remove_outliers(stores_13)

print(len(stores_13))


# Interpolate  again for missing data 
stores_13= impute_missing_data(stores_13, 13)


print(len(stores_13))



In [None]:
plot_time_series(stores_13['Sales'], '13')

In [None]:
plot_seasonal_decompose(stores_13['Sales'])

In [None]:
stationarity_test(stores_13['Sales'], 13)

**Observation** The data is not stationary. Hence, let us apply first-order differencing to the data set to check whether the series becomes stationary.
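
As a reminder of what first-order differencing does, here is a short pandas sketch on a synthetic series; the values are made up and `sales_demo` is a hypothetical stand-in for `stores_13['Sales']`. Note that the models in this notebook are fitted with `order=(7,0,0)`, that is with d=0; differencing inside the model would correspond to d=1.

```python
import pandas as pd

# Synthetic daily series standing in for a store's Sales column
sales_demo = pd.Series(
    [100.0, 110.0, 125.0, 120.0, 140.0],
    index=pd.date_range('2015-01-01', periods=5, freq='D'),
)

# First-order difference: y[t] - y[t-1]; the first value has no predecessor
diffed = sales_demo.diff().dropna()

# The original levels can be recovered by cumulative summation
restored = diffed.cumsum() + sales_demo.iloc[0]

print(diffed)
print(restored)
```

The differenced series can then be passed to the same stationarity test to check whether the trend has been removed.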

In [None]:
plot_auto_corr(stores_13['Sales'], ' Store '+str(13))

**Observation** 
- The time series plot looks roughly stationary.
- The ADF (Augmented Dickey-Fuller) test confirms that the time series is stationary.
- Sales spike during November and December and decline afterwards, which indicates seasonality in the data set.
- The PACF plot shows strong correlation at lags 7 and 14. The ACF plot shows some trend and does not cut off.
- Hence **p=7 and q=0**, i.e. `order=(7,0,0)`, is used for further modeling.

In [None]:
# Empty Data Frame to Store results obtained

results_13 = pd.DataFrame(columns= ['Time Series Model','Store','Variable','MAPE','RMSE'])

results_13

# Building ARIMA Model for Store 13


In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_13,store_number=13,endog=endog,order=(7,0,0))

tempResults = univariate_arima.build_arima()

results_13 = pd.concat([results_13, tempResults])

results_13

# Building ARIMAX Model for Store 13


In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_13,store_number=13,endog=endog,exog=exog_univariate, order=(7,0,0))

tempResults = univariate_arima.build_arimax()

results_13 = pd.concat([results_13, tempResults])

results_13

# Building SARIMAX Model for Store 13


In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_13,store_number=13,endog=endog,order=(7,0,0),seasonal_order=(7,0,0,12),exog=exog_univariate)

tempResults = univariate_arima.build_sarimax()

results_13 = pd.concat([results_13, tempResults])

results_13

# MultiVariate Model(s)
### Johansen cointegration Test

In [None]:
johansen_output(stores_13, ['Sales','Customers'])

# Building VAR Model for Store 13

In [None]:
endog_multi_variate = ['Sales','Customers']

multiVariate_varmax = MultiVariateTimeSeries(data=stores_13,store_number=13,endog=endog_multi_variate,order=(7,0))

tempResults = multiVariate_varmax.build_var_model()

results_13 = pd.concat([results_13, tempResults])

results_13

# Building VARMAX Model for Store 13

In [None]:
multiVariate_varmax = MultiVariateTimeSeries(data=stores_13,store_number=13,endog=endog_multi_variate,order=(7,0), exog= exog_multivariate)

tempResults = multiVariate_varmax.build_varmax()

results_13 = pd.concat([results_13, tempResults])

results_13

In [None]:
final_result_stores = pd.read_csv(file_path+'/final_results.csv')

In [None]:
results_cols = ['MAPE', 'RMSE']

rank_data= results_13.copy()

rank_data["Rank"] = rank_data[results_cols].apply(tuple,axis=1).rank(method='dense',ascending=True).astype(int)

rank_data.sort_values("Rank", inplace=True)

rank_data= rank_data[rank_data.Rank ==1 ]

# Dropping Rank Column
rank_data=rank_data.drop('Rank',axis=1)
final_result = pd.DataFrame(columns=['Store','Time Series Model', 'Variable','MAPE','RMSE'])
final_result = pd.concat([final_result, rank_data])


final_result_stores = pd.concat([final_result_stores, results_13])


print('Comprehensive Result')
print('==================================')
print(final_result_stores.head(10))

print('\n\n Best Model for Store 13')
print('==================================')
print(final_result.head())

In [None]:
final_result_stores.to_csv(file_path+'/final_results.csv', index = False)


# Store 25 Model Building

In [None]:
stores_25 = data[data.Store == 25]
# Dropping Store Column
stores_25.drop(['Store'], axis=1, inplace=True)

print(len(stores_25))

#remove outliers
stores_25 = remove_outliers(stores_25)

# Interpolate 
stores_25= impute_missing_data(stores_25, 25)

print(len(stores_25))


In [None]:
plot_time_series(stores_25['Sales'], '25')

In [None]:
plot_seasonal_decompose(stores_25['Sales'])

In [None]:
stationarity_test(stores_25['Sales'], 25)

In [None]:
plot_auto_corr(stores_25['Sales'], ' Store '+str(25))

**Observation** 
- Store 25 has no values between 15-Jan-2014 and 13-Feb-2014, so no sales were recorded during this period.
- The ADF (Augmented Dickey-Fuller) test confirms that the time series is stationary.
- Between 10/2013 and 07/2014 sales are at a low level, after which they seem to pick up.
- The PACF plot shows strong correlation at lags 7 and 14. The ACF plot shows some trend and does not cut off.
- Hence **p=7 and q=0**, i.e. `order=(7,0,0)`, is used for further modeling.
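
A gap such as the one noted for Store 25 can be located programmatically. The sketch below assumes the frame is indexed by a daily `DatetimeIndex`, which this section does not show explicitly, and uses a synthetic series whose gap mirrors the dates mentioned above.

```python
import pandas as pd

# Synthetic index with a hole from 15-Jan-2014 to 13-Feb-2014
dates = pd.date_range('2014-01-10', '2014-01-14', freq='D').append(
    pd.date_range('2014-02-14', '2014-02-16', freq='D'))
sales_demo = pd.Series(range(len(dates)), index=dates, dtype=float)

# Every calendar day between the first and last observation
full_range = pd.date_range(sales_demo.index.min(),
                           sales_demo.index.max(), freq='D')

# Days that are expected but absent from the data
missing = full_range.difference(sales_demo.index)
print(len(missing), missing.min().date(), missing.max().date())
```

Knowing the length of the gap helps judge whether interpolation over it is reasonable, since a month of imputed values can dominate what the model learns for that period.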

In [None]:
# Empty Data Frame to Store results obtained

results_25 = pd.DataFrame(columns= ['Time Series Model','Store','Variable','MAPE','RMSE'])

results_25

# Building ARIMA Model for Store 25


In [None]:
endog=['Sales']

univariate_arima = UniVariateTimeSeries(data=stores_25,store_number=25,endog=endog, order=(7,0,0))

tempResults = univariate_arima.build_arima()

tempResults

results_25 = pd.concat([results_25, tempResults])

results_25

# Building ARIMAX Model for Store 25


In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_25,store_number=25,endog=endog,exog=exog_univariate, order=(7,0,0))

tempResults = univariate_arima.build_arimax()

results_25 = pd.concat([results_25, tempResults])

results_25

# Building SARIMAX Model for Store 25


In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_25,store_number=25,endog=endog,order=(7,0,0),seasonal_order=(7,0,0,12),exog=exog_univariate)

tempResults = univariate_arima.build_sarimax()

results_25 = pd.concat([results_25, tempResults])

results_25

# MultiVariate Model(s)
### Johansen cointegration Test

In [None]:
johansen_output(stores_25, ['Sales','Customers'])

# Building VAR Model for Store 25

In [None]:
endog_multi_variate = ['Sales','Customers']

multiVariate_varmax = MultiVariateTimeSeries(data=stores_25,store_number=25,endog=endog_multi_variate,order=(7,0))

tempResults = multiVariate_varmax.build_var_model()

results_25 = pd.concat([results_25, tempResults])

results_25

# Building VARMAX Model for Store 25

In [None]:
multiVariate_varmax = MultiVariateTimeSeries(data=stores_25,store_number=25,endog=endog_multi_variate,order=(7,0), exog= exog_multivariate)

tempResults = multiVariate_varmax.build_varmax()

results_25 = pd.concat([results_25, tempResults])

results_25

In [None]:
final_result_stores = pd.read_csv(file_path+'/final_results.csv')

In [None]:
results_cols = ['MAPE', 'RMSE']

rank_data= results_25.copy()

rank_data["Rank"] = rank_data[results_cols].apply(tuple,axis=1).rank(method='dense',ascending=True).astype(int)

rank_data.sort_values("Rank", inplace=True)

rank_data= rank_data[rank_data.Rank ==1 ]

# Dropping Rank Column
rank_data=rank_data.drop('Rank',axis=1)
final_result = pd.DataFrame(columns=['Store','Time Series Model', 'Variable','MAPE','RMSE'])
final_result = pd.concat([final_result, rank_data])


final_result_stores = pd.concat([final_result_stores, results_25])


print('Comprehensive Result')
print('==================================')
print(final_result_stores.head(10))

print('\n\n Best Model for Store 25')
print('==================================')
print(final_result.head())

In [None]:
final_result_stores.to_csv(file_path+'/final_results.csv', index = False)


# Store 29 Model Building

In [None]:
stores_29 = data[data.Store == 29]
# Dropping Store Column
stores_29.drop(['Store'], axis=1, inplace=True)

print(len(stores_29))

#remove outliers
stores_29 = remove_outliers(stores_29)

# Interpolate 
stores_29= impute_missing_data(stores_29, 29)

print(len(stores_29))

In [None]:
plot_time_series(stores_29['Sales'], '29')

In [None]:
plot_seasonal_decompose(stores_29['Sales'])

In [None]:
stationarity_test(stores_29['Sales'], 29)

In [None]:
plot_auto_corr(stores_29['Sales'], ' Store '+str(29))

**Observation** 
- For Store 29, the time series plot seems to be stationary.
- The ADF (Augmented Dickey-Fuller) test confirms that the time series is stationary.
- Sales show an upward trend, with spikes after the 7th month and in November and December, which indicates a seasonal pattern.
- The PACF plot shows strong correlation at lags 7 and 14. The ACF plot shows some trend and does not cut off.
- Hence **p=7 and q=0**, i.e. `order=(7,0,0)`, is used for further modeling.

In [None]:
# Empty Data Frame to Store results obtained

results_29 = pd.DataFrame(columns= ['Time Series Model','Store','Variable','MAPE','RMSE'])

results_29

# Building ARIMA Model for Store 29


In [None]:
endog=['Sales']

univariate_arima = UniVariateTimeSeries(data=stores_29,store_number=29,endog=endog, order=(7,0,0))

tempResults = univariate_arima.build_arima()

tempResults

results_29 = pd.concat([results_29, tempResults])

results_29

# Building ARIMAX Model for Store 29


In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_29,store_number=29,endog=endog,exog=exog_univariate, order=(7,0,0))

tempResults = univariate_arima.build_arimax()

results_29 = pd.concat([results_29, tempResults])

results_29

# Building SARIMAX Model for Store 29


In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_29,store_number=29,endog=endog,order=(7,0,0),seasonal_order=(7,0,0,12),exog=exog_univariate)

tempResults = univariate_arima.build_sarimax()

results_29 = pd.concat([results_29, tempResults])

results_29

# MultiVariate Model(s)
### Johansen cointegration Test

In [None]:
johansen_output(stores_29, ['Sales','Customers'])

# Building VAR Model for Store 29

In [None]:
endog_multi_variate = ['Sales','Customers']

multiVariate_varmax = MultiVariateTimeSeries(data=stores_29,store_number=29,endog=endog_multi_variate,order=(7,0))

tempResults = multiVariate_varmax.build_var_model()

results_29 = pd.concat([results_29, tempResults])

results_29

# Building VARMAX Model for Store 29

In [None]:
multiVariate_varmax = MultiVariateTimeSeries(data=stores_29,store_number=29,endog=endog_multi_variate,order=(7,0), exog= exog_multivariate)

tempResults = multiVariate_varmax.build_varmax()

results_29 = pd.concat([results_29, tempResults])

results_29

In [None]:
final_result_stores = pd.read_csv(file_path+'/final_results.csv')

In [None]:
results_cols = ['MAPE', 'RMSE']

rank_data= results_29.copy()

rank_data["Rank"] = rank_data[results_cols].apply(tuple,axis=1).rank(method='dense',ascending=True).astype(int)

rank_data.sort_values("Rank", inplace=True)

rank_data= rank_data[rank_data.Rank ==1 ]

# Dropping Rank Column
rank_data=rank_data.drop('Rank',axis=1)
final_result = pd.DataFrame(columns=['Store','Time Series Model', 'Variable','MAPE','RMSE'])
final_result = pd.concat([final_result, rank_data])


final_result_stores = pd.concat([final_result_stores, results_29])


print('Comprehensive Result')
print('==================================')
print(final_result_stores.head(10))

print('\n\n Best Model for Store 29')
print('==================================')
print(final_result.head())

In [None]:
final_result_stores.to_csv(file_path+'/final_results.csv', index = False)


# Store 31 Model Building

In [None]:
stores_31 = data[data.Store == 31]
# Dropping Store Column
stores_31.drop(['Store'], axis=1, inplace=True)

print(len(stores_31))

#remove outliers
stores_31 = remove_outliers(stores_31)

# Interpolate 
stores_31= impute_missing_data(stores_31, 31)

print(len(stores_31))


In [None]:
plot_time_series(stores_31['Sales'], '31')

In [None]:
plot_seasonal_decompose(stores_31['Sales'])

In [None]:
stationarity_test(stores_31['Sales'], 31)

In [None]:
plot_auto_corr(stores_31['Sales'], ' Store '+str(31))

**Observation** 
- For Store 31, the time series plot seems to be stationary.
- The ADF (Augmented Dickey-Fuller) test confirms that the time series is stationary.
- Sales show an upward trend after 03/2014. Spikes after the 7th month and in November and December indicate a seasonal pattern.
- The PACF plot shows strong correlation at lags 7 and 14. The ACF plot shows some trend and does not cut off.
- Hence **p=7 and q=0**, i.e. `order=(7,0,0)`, is used for further modeling.

In [None]:
# Empty Data Frame to Store results obtained

results_31 = pd.DataFrame(columns= ['Time Series Model','Store','Variable','MAPE','RMSE'])

results_31

# Building ARIMA Model for Store 31


In [None]:
endog=['Sales']

univariate_arima = UniVariateTimeSeries(data=stores_31,store_number=31,endog=endog, order=(7,0,0))

tempResults = univariate_arima.build_arima()

results_31 = pd.concat([results_31, tempResults])

results_31

# Building ARIMAX Model for Store 31


In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_31,store_number=31,endog=endog,exog=exog_univariate, order=(7,0,0))

tempResults = univariate_arima.build_arimax()

results_31 = pd.concat([results_31, tempResults])

results_31

# Building SARIMAX Model for Store 31


In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_31,store_number=31,endog=endog,order=(7,0,0),seasonal_order=(7,0,0,12),exog=exog_univariate)

tempResults = univariate_arima.build_sarimax()

results_31 = pd.concat([results_31, tempResults])

results_31

# MultiVariate Model(s)
### Johansen cointegration Test

In [None]:
johansen_output(stores_31, ['Sales','Customers'])

# Building VAR Model for Store 31

In [None]:
endog_multi_variate = ['Sales','Customers']

multiVariate_varmax = MultiVariateTimeSeries(data=stores_31,store_number=31,endog=endog_multi_variate,order=(7,0))

tempResults = multiVariate_varmax.build_var_model()

results_31 = pd.concat([results_31, tempResults])

results_31

# Building VARMAX Model for Store 31

In [None]:
multiVariate_varmax = MultiVariateTimeSeries(data=stores_31,store_number=31,endog=endog_multi_variate,order=(7,0), exog= exog_multivariate)

tempResults = multiVariate_varmax.build_varmax()

results_31 = pd.concat([results_31, tempResults])

results_31

In [None]:
final_result_stores = pd.read_csv(file_path+'/final_results.csv')

In [None]:
results_cols = ['MAPE', 'RMSE']

rank_data= results_31.copy()

rank_data["Rank"] = rank_data[results_cols].apply(tuple,axis=1).rank(method='dense',ascending=True).astype(int)

rank_data.sort_values("Rank", inplace=True)

rank_data= rank_data[rank_data.Rank ==1 ]

# Dropping Rank Column
rank_data=rank_data.drop('Rank',axis=1)
final_result = pd.DataFrame(columns=['Store','Time Series Model', 'Variable','MAPE','RMSE'])
final_result = pd.concat([final_result, rank_data])


final_result_stores = pd.concat([final_result_stores, results_31])


print('Comprehensive Result')
print('==================================')
print(final_result_stores.head(10))

print('\n\n Best Model for Store 31')
print('==================================')
print(final_result.head())

In [None]:
final_result_stores.to_csv(file_path+'/final_results.csv', index = False)

# Store 46 Model Building

In [None]:
stores_46 = data[data.Store == 46]
# Dropping Store Column
stores_46.drop(['Store'], axis=1, inplace=True)

print(len(stores_46))

# Interpolate 
stores_46= impute_missing_data(stores_46, 46)

print(len(stores_46))
stores_46 = remove_outliers(stores_46)

print(len(stores_46))


# Interpolate  again for missing data 
stores_46= impute_missing_data(stores_46, 46)


print(len(stores_46))

In [None]:
plot_time_series(stores_46['Sales'], '46')

In [None]:
plot_seasonal_decompose(stores_46['Sales'])

In [None]:
stationarity_test(stores_46['Sales'], 46)

In [None]:
plot_auto_corr(stores_46['Sales'], ' Store '+str(46))

**Observation** 
- For Store 46, the time series plot seems to be stationary.
- The ADF (Augmented Dickey-Fuller) test confirms that the time series is stationary.
- Sales show some upward trend, but there is a dip after 07/2014. Spikes in November and December indicate a seasonal pattern.
- The PACF plot shows strong correlation at lags 7 and 14. The ACF plot shows some trend and does not cut off.
- Hence **p=7 and q=0**, i.e. `order=(7,0,0)`, is used for further modeling.

In [None]:
# Empty Data Frame to Store results obtained

results_46 = pd.DataFrame(columns= ['Time Series Model','Store','Variable','MAPE','RMSE'])

results_46

# Building ARIMA Model for Store 46


In [None]:
endog=['Sales']

univariate_arima = UniVariateTimeSeries(data=stores_46,store_number=46,endog=endog, order=(7,0,0))

tempResults = univariate_arima.build_arima()

tempResults

results_46 = pd.concat([results_46, tempResults])

results_46

# Building ARIMAX Model for Store 46


In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_46,store_number=46,endog=endog,exog=exog_univariate, order=(7,0,0))

tempResults = univariate_arima.build_arimax()

results_46 = pd.concat([results_46, tempResults])

results_46

# Building SARIMAX Model for Store 46


In [None]:
univariate_arima = UniVariateTimeSeries(data=stores_46,store_number=46,endog=endog,order=(7,0,0),seasonal_order=(7,0,0,12),exog=exog_univariate)

tempResults = univariate_arima.build_sarimax()

results_46 = pd.concat([results_46, tempResults])

results_46

# Multivariate Model(s)
### Johansen Cointegration Test

In [None]:
johansen_output(stores_46, ['Sales','Customers'])

# Building VAR Model for Store 46

In [None]:
endog_multi_variate = ['Sales','Customers']

multiVariate_varmax = MultiVariateTimeSeries(data=stores_46,store_number=46,endog=endog_multi_variate,order=(7,0))

tempResults = multiVariate_varmax.build_var_model()

results_46 = pd.concat([results_46, tempResults])

results_46

# Building VARMAX Model for Store 46

In [None]:
multiVariate_varmax = MultiVariateTimeSeries(data=stores_46,store_number=46,endog=endog_multi_variate,order=(7,0), exog= exog_multivariate)

tempResults = multiVariate_varmax.build_varmax()

results_46 = pd.concat([results_46, tempResults])

results_46

In [None]:
final_result_stores = pd.read_csv(file_path+'/final_results.csv')

In [None]:
results_cols = ['MAPE', 'RMSE']

rank_data = results_46.copy()

# Rank models by the (MAPE, RMSE) pair; rank 1 is the lowest error
rank_data["Rank"] = rank_data[results_cols].apply(tuple, axis=1).rank(method='dense', ascending=True).astype(int)

rank_data.sort_values("Rank", inplace=True)

# Keep only the best-ranked model(s)
rank_data = rank_data[rank_data.Rank == 1]

# Dropping Rank Column
rank_data = rank_data.drop('Rank', axis=1)
final_result = pd.DataFrame(columns=['Store', 'Time Series Model', 'Variable', 'MAPE', 'RMSE'])
final_result = pd.concat([final_result, rank_data])


final_result_stores = pd.concat([final_result_stores, results_46])


print('Comprehensive Result')
print('==================================')
print(final_result_stores.head(10))

print('\n\n Best Model for Store 46')
print('==================================')
print(final_result.head())

In [None]:
final_result_stores.to_csv(file_path+'/final_results.csv', index = False)

In [None]:
final_result_stores.head(100)

## Final Best Models for All 9 Stores (**1, 3, 8, 9, 13, 25, 29, 31 and 46**)

In [None]:
results_cols = ['MAPE', 'RMSE']

final_rank_data= final_result_stores.copy()

ranks = final_rank_data.sort_values(results_cols, ascending = True).groupby('Store').first().reset_index()

ranks
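
The selection step above sorts by MAPE, breaks ties on RMSE, and keeps the first row per store. A toy illustration with made-up metrics (not the notebook's results) is sketched below. It uses `drop_duplicates` rather than `groupby(...).first()`, because `first()` takes the first non-null value of each column separately, which matters only if the results contain missing values.

```python
import pandas as pd

# Hypothetical error metrics, for illustration only
toy = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2],
    'Time Series Model': ['ARIMA', 'ARIMAX', 'SARIMAX', 'ARIMAX', 'SARIMAX'],
    'MAPE': [12.0, 9.5, 9.5, 7.0, 8.2],
    'RMSE': [0.30, 0.25, 0.21, 0.18, 0.20],
})

# Sort by MAPE, then RMSE, and keep the best row for each store
best = (toy.sort_values(['MAPE', 'RMSE'], ascending=True)
           .drop_duplicates('Store', keep='first')
           .sort_values('Store')
           .reset_index(drop=True))

print(best)
```

In this toy frame, Store 1 has two models tied on MAPE, so the lower RMSE decides in favour of SARIMAX, while Store 2 selects ARIMAX on MAPE alone.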

**Observation** We had seen that the data contains some seasonality, and the models built for all 9 stores bear this out: SARIMAX performed better than the other models (ARIMA, VAR and VARMAX). Only for Stores 8, 29 and 46 did ARIMAX give better results.


**Final EDA Overall Analysis**

  - Overall, Mondays have higher sales
  - Promotional offers were not run during weekends
  - Whenever a StateHoliday is declared, the shops remain closed and no sales are made
  - Promo2 doesn't seem to have any impact on sales
  - Store Type 'C' and Assortment 'C' seem to have higher sales than the other store types and assortments
  - Sales and customers trend upward when more Promo offers are running
  - Customers and Sales are highly correlated
  
**Time Series Model Prediction Analysis**

  - We saw that <mark>SARIMAX performed well for Stores **1, 3, 8, 9, 13, 25, 29, 31 and 46**, and ARIMAX gave better results for Stores **8, 29 and 46**</mark>
  - We used the Augmented Dickey-Fuller (ADF) test to check whether each series was stationary
  - Outlier removal dropped some of the time series data. Linear interpolation was used to impute the missing values
  - Both the Sales and Customers variables were scaled with MinMaxScaler before modelling
  - The Johansen cointegration test was carried out to check for cointegration. Since the data was stationary, we used the <mark>Sales and Customers</mark> variables for multivariate model building
  - Based on the EDA, Promo2 didn't have much impact on Sales, whereas Sales increased with Promo
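
The imputation and scaling steps listed above can be sketched in a few lines of pandas. This is a minimal illustration on made-up values, not the notebook's `impute_missing_data` helper; the scaling line writes out the formula that MinMaxScaler applies with its default `[0, 1]` range.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sales with two gaps, e.g. left behind by outlier removal
idx = pd.date_range('2015-01-01', periods=6, freq='D')
sales = pd.Series([100.0, np.nan, 140.0, 160.0, np.nan, 200.0], index=idx, name='Sales')

# Linear interpolation fills each gap from its neighbours
filled = sales.interpolate(method='linear')

# Min-max scaling to the [0, 1] range
scaled = (filled - filled.min()) / (filled.max() - filled.min())

print(filled.tolist())
print(scaled.round(2).tolist())
```

Forecasts made on the scaled series would need to be mapped back with the inverse formula, `scaled * (max - min) + min`, before being read as sales figures.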
