# M5 Forecasting - Accuracy

The M5 Forecasting (Accuracy) challenge is hosted by the University of Nicosia. It comprises sales data for 3,049 items sold in 10 Walmart stores across 3 states. The objective, as given by the host, is as follows:
The objective of the M5 forecasting competition is to advance the theory and practice of forecasting by identifying the method(s) that provide the most accurate point forecasts for each of the 42,840 time series of the competition.
The value 42,840 results from the hierarchical nature of the problem: a forecast is required at every level of aggregation.
The different hierarchies are:

| Level | Aggregation | Number of series |
|---|---|---|
| 1 | Unit sales of all products, aggregated for all stores/states | 1 |
| 2 | Unit sales of all products, aggregated for each state | 3 |
| 3 | Unit sales of all products, aggregated for each store | 10 |
| 4 | Unit sales of all products, aggregated for each category | 3 |
| 5 | Unit sales of all products, aggregated for each department | 7 |
| 6 | Unit sales of all products, aggregated for each state and category | 9 |
| 7 | Unit sales of all products, aggregated for each state and department | 21 |
| 8 | Unit sales of all products, aggregated for each store and category | 30 |
| 9 | Unit sales of all products, aggregated for each store and department | 70 |
| 10 | Unit sales of product x, aggregated for all stores/states | 3,049 |
| 11 | Unit sales of product x, aggregated for each state | 9,147 |
| 12 | Unit sales of product x, aggregated for each store | 30,490 |
| | Total | 42,840 |
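
As a quick sanity check, the total can be reproduced from the basic counts given in the introduction (3 states, 10 stores, 3 categories, 7 departments, 3,049 items). This is only an illustrative sketch; the variable names are made up for this example.

```python
# Basic counts taken from the competition description.
n_states, n_stores, n_cats, n_depts, n_items = 3, 10, 3, 7, 3049

# Number of series at each of the 12 aggregation levels.
level_counts = {
    1: 1,                    # everything aggregated together
    2: n_states,
    3: n_stores,
    4: n_cats,
    5: n_depts,
    6: n_states * n_cats,
    7: n_states * n_depts,
    8: n_stores * n_cats,
    9: n_stores * n_depts,
    10: n_items,
    11: n_items * n_states,
    12: n_items * n_stores,
}

total_series = sum(level_counts.values())
print(total_series)  # 42840
```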

For further details on the competition visit: https://www.kaggle.com/c/m5-forecasting-accuracy/data

In this kernel, we will take a look at this huge dataset: doing EDA, optimizing for lower-end workstations, and applying computationally inexpensive models (Prophet, SARIMAX) using Colab and Kaggle Notebooks.

This kernel is for those who are starting out with time series and Prophet, a very straightforward library for time series, and who have very limited resources. You may also find it helpful for getting your data into the proper format, and the code is written in very straightforward terms. Hope you enjoy this kernel!





## Importing the data and installing the required libraries:


Note that we install an older version of numpy because the library pmdarima (used for finding the orders for SARIMA) was not compatible with the numpy version shipped with Colab as of 3/6/2020.

In [None]:
#importing libraries for the analysis:
import pandas as pd
from datetime import datetime
import matplotlib.pyplot as plt
import seaborn as sns
from statsmodels.tsa.seasonal import seasonal_decompose
from matplotlib import dates
# Load specific forecasting tools
from statsmodels.tsa.statespace.sarimax import SARIMAX
!pip install --upgrade numpy==1.18.3 
import numpy as np
!pip install pmdarima
from pmdarima import auto_arima                              # for determining ARIMA orders

In [None]:
# Converting the csv files to pandas dataframe:
calender=pd.read_csv('/kaggle/input/m5-forecasting-accuracy/calendar.csv',parse_dates=True)
sales=pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sales_train_evaluation.csv',parse_dates=True)
prices=pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sell_prices.csv')
calender['date']=pd.to_datetime(calender['date'])
sample_submission=pd.read_csv('/kaggle/input/m5-forecasting-accuracy/sample_submission.csv')

# A brief look at the various dataframes given:
Now that we have assigned the csv files to dataframes, let us take a look at the data given to us.

### Calendar:
The `calender` dataframe gives us details on the dates, events, holidays, SNAP (explained in the EDA section) etc. for the roughly 5 years over which the sales data is provided.

In [None]:
# get the dimensions of the first dataframe 'calender' and show the first five rows:
print(calender.shape)
calender.head()

In [None]:
# Information on the different columns of the dataframe:
calender.info()

### Sales:
The sales data gives us the number of units sold every day for each of the 3049 items in each of the 10 stores.


In [None]:
# get the dimensions of the second dataframe 'sales' and show the first five rows:
print(sales.shape)
sales.head(5)

In [None]:
# Information on the different columns of the dataframe:
sales.info()

### Prices:
The prices dataframe gives us the prices of all the items in the stores on a weekly basis.

In [None]:
# get the dimensions of the third dataframe 'prices' and show the first five rows:
print(prices.shape)
prices.head()

In [None]:
# Information on the different columns of the dataframe:
prices.info()

# Optimizing the dataframes:
We start by building a final dataset which contains all the necessary values in one table so we can do our analysis with ease.

A major roadblock while doing this analysis is the size of the final dataset I intended to create. With more than 40 million rows and quite a few columns, I realized that my code platform (Colab) could not support the RAM usage of such a large unoptimized dataset (Colab provides 12 GB of RAM, a little less than my personal computer). Hence, this section tries to reduce the size of the dataset so that the notebook can run.

First, we start off by changing most of the columns with string data to the categorical dtype, as almost all of these columns are categories in this dataset.
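
To see why the categorical dtype helps, here is a small sketch on a toy column (synthetic values, not the competition data). A column with only a few distinct strings repeated many times is stored far more compactly as a category, since each row then holds only a small integer code.

```python
import pandas as pd

# Toy column: three distinct store labels repeated many times.
stores = pd.Series(["CA_1", "TX_1", "WI_1"] * 100_000)

# deep=True includes the memory taken by the string values themselves.
as_strings = stores.memory_usage(deep=True)
as_category = stores.astype("category").memory_usage(deep=True)

print(f"strings:  {as_strings / 1024**2:.2f} MB")
print(f"category: {as_category / 1024**2:.2f} MB")
```

The exact numbers depend on the pandas version, but the categorical version should be several times smaller.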

In [None]:
#Converting dtypes to 'categorical' and filling nans with -1( we use -1 to further reduce the dataset size as compared to nans) for all the 3 dataframes:
calender=calender.fillna(-1)
calender[['event_type_1','event_type_2','event_name_1','event_name_2']]=calender[['event_type_1','event_type_2','event_name_1','event_name_2']].astype(('category'))

sales[['id','item_id','cat_id','store_id','state_id']]=sales[['id','item_id','cat_id','store_id','state_id']].astype('category')

# For prices column we combine the item_id and store_id to form the id of the data which can later be joined with sales dataframe:
prices['id']=prices['item_id']+'_'+prices['store_id']+'_evaluation'

prices[['id','store_id','item_id']]=prices[['id','store_id','item_id']].astype('category')
# We also drop store_id and item_id as they no longer play any role in the dataset and all the information is stored in 'id'.
prices.drop(['store_id','item_id'],axis=1,inplace=True)
# We also drop dept_id from sales as we will not be using the column:
sales.drop('dept_id',axis=1,inplace=True)


Next, to further reduce the size of the dataframes, we downcast the integer and float types of the various columns. For example, if a column's values fit within the range of int8 but the column is stored as int32, the code below converts it to int8, saving a lot of memory.
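
The idea can be illustrated on a single toy column before applying it to whole dataframes. This is a minimal sketch with made-up data: it checks whether the values fit in int8 using `np.iinfo` and compares the memory used before and after the cast.

```python
import numpy as np
import pandas as pd

# Toy column: small non-negative counts stored as 64-bit integers.
counts = pd.Series(np.arange(100, dtype=np.int64))

# Downcast only if every value fits in the int8 range.
int8_info = np.iinfo(np.int8)
fits_int8 = int8_info.min <= counts.min() and counts.max() <= int8_info.max
small = counts.astype(np.int8) if fits_int8 else counts

print(counts.dtype, counts.memory_usage(index=False))  # 8 bytes per value
print(small.dtype, small.memory_usage(index=False))    # 1 byte per value
```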

In [None]:
# This very convenient piece of code is commonly found in Kaggle competitions and performs the above task for all the columns:
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
    

In [None]:
# Applying the above function to the prices and calender dataframes:
# We apply the function to sales dataframe after applying melt to sales:
prices=reduce_mem_usage(prices)
calender=reduce_mem_usage(calender)

Next, we reshape the sales dataframe into a more usable format where each day is a row rather than a column. This essentially makes the table vertical (long format).
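
For readers new to `pd.melt`, here is a toy illustration of the reshaping, using a made-up table with the same layout as `sales` (one column per day). The ids and values are invented for the example.

```python
import pandas as pd

# Toy wide table: one row per id, one column per day.
wide = pd.DataFrame({
    "id": ["A_CA_1", "B_CA_1"],
    "d_1": [0, 2],
    "d_2": [1, 0],
    "d_3": [3, 1],
})

# After melting, each (id, day) pair becomes its own row.
# The day label goes into 'variable' and the units sold into 'value'.
long = pd.melt(wide, id_vars=["id"])
print(long)
```

Two ids times three days gives six rows, which is why the real melted table grows to tens of millions of rows.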

In [None]:
# We use pd.melt to do the task above, which essentially bring the table to the format given below:
sales=pd.melt(sales,id_vars=['id','item_id','cat_id','store_id','state_id'])
sales.head()

We apply our memory reduction function to this dataframe only now, after melting, as pandas/numpy processes data much more slowly across many columns than across many rows.

In [None]:
sales=reduce_mem_usage(sales)

We can see that there is a huge reduction in the size of all the dataframes, which will prevent Colab from crashing due to excessive RAM usage.
With this, we come to the end of data optimization.

## Forming the final dataframe:
Now that we have optimized our individual dataframes, we will combine the three of them into one convenient dataframe with all the information needed for our analysis in one place.

In [None]:
# Here we merge all three dataframes:
df=pd.merge(pd.merge(calender[['date','d','wm_yr_wk',
       'event_name_1', 'event_type_1', 'event_name_2', 'event_type_2',
       'snap_CA', 'snap_TX', 'snap_WI']],sales,left_on='d',right_on='variable',how='inner'),prices,left_on=['id','wm_yr_wk'],right_on=['id','wm_yr_wk'],how='inner')

# We get rid of the columns on which the dataframe was joined on as we already have the date column instead:
df.drop(['d','variable','wm_yr_wk'],axis=1,inplace=True)
# Rearranging the columns to our convenience:
cols=['date','id', 'item_id', 'cat_id', 'store_id', 'state_id','sell_price','event_name_1', 'event_type_1', 'event_name_2','event_type_2', 'snap_CA', 'snap_TX', 'snap_WI','value']
df=df[cols]


Given below is a brief look at our final dataframe:

In [None]:
df.tail()

In [None]:
# Given below is information on the various columns of the dataframe:
df.info()

With this we have come to an end of this section. Now we can proceed with our Exploratory Data Analysis for our final dataset.

# Overview of the project:
Before we start with the EDA, we need to figure out what to look for. Considering the size of the dataframe and the limited computational power I have, the main aim of the EDA is to select a suitable model for our final analysis. Given the complex hierarchy and the many variables, we have to be conservative in how we perform our EDA: it will only work on subsets of the main dataset (trying otherwise inevitably leads to Colab crashing). So we will decide now what the EDA section needs to cover to reach our goal of choosing an appropriate model.

Before we even start our EDA we can start by eliminating models that will not be suitable for our use. 

Some basic concepts:

Top-down model: In this kind of model, we forecast each of the 3,049 items for all stores together rather than for individual stores. Since sales across the stores are correlated, we can then distribute each item's forecast among the stores according to each store's share of sales.
The primary advantage of this method is that we have to fit a much smaller number of time series models, at the cost of some approximation.

Bottom-up model: This is the exact opposite of the model above. We make a forecast for every item in every store individually, which means running 30,490 models, quite an incredible number. The advantage is that we do not have to approximate how sales are split between stores. The disadvantage, besides the computational cost, is that trends visible in the aggregate can go missing for an individual store; if that is prevalent here, we can find out.

Global model: These models consider all the items together, as the sales of the items are not independent of one another.
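
The top-down split described above can be sketched in a few lines. This is only an illustration with made-up numbers: it assumes each store's share is its fraction of the item's historical sales, which is one simple choice among several.

```python
import pandas as pd

# Toy history for one item: units sold per store (hypothetical numbers).
history = pd.DataFrame({
    "store_id": ["CA_1", "CA_1", "TX_1", "TX_1"],
    "value": [6, 6, 3, 5],
})

# Each store's share of the item's total historical sales.
store_share = history.groupby("store_id")["value"].sum() / history["value"].sum()

# Suppose a single model forecast 10 units for the item across all stores.
item_forecast = 10.0

# Distribute the aggregate forecast according to the historical shares.
store_forecast = store_share * item_forecast
print(store_forecast)
```

By construction the store-level forecasts add up to the item-level forecast, which is the approximation the top-down approach accepts in exchange for fitting fewer models.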

### Elimination of models: 
We will start by eliminating the most complex models and work down from there.
1. RNNs (Recurrent Neural Networks) with LSTM as a global model. Here we would feed the entire final dataset to the RNN. The advantage is that the RNN can potentially learn how the variables depend on each other. However, it is an extremely complex model that we simply cannot run: feeding the entire dataset into any model is out of scope in terms of computational power, RAM and the complexity of hyperparameter tuning.
2. Other global bottom-up methods such as VAR, VARIMA and seasonal VARIMA. Again, due to their sheer complexity, tuning difficulties and size, we will not consider these methods.
3. RNNs in general: Even the simplest two-layer RNN fitted for 3,049 items is out of our scope, and even if it were computationally viable, it would be better to use a less complex method made specifically for our use case.
4. Combinations/ensembles: We will run only a single model on the entire dataset to reduce complexity and computational needs.
5. Models such as simple exponential smoothing and MA models. These are not considered because in most cases SARIMA or Prophet performs considerably better, especially given the dataset's large number of exogenous variables and multiple seasonalities.

## Models under consideration:

There are still a fair number of models not considered, but with the current domain knowledge the final options from which we will choose are:


1. SARIMAX: A traditional model which is still quite widely used and very powerful. It can include exogenous variables as well as seasonality. If we have the computational capacity and there is a potential gain in accuracy, we will implement SARIMAX bottom-up.
2. Facebook's Prophet: This is a library released by Facebook which could potentially be extremely useful in this use case. It is easy to implement yet powerful, it can take external variables, and it is known to handle various seasonal patterns well, as it is modelled on Facebook's forecasting algorithms.

This is the final list of models under consideration. We will start with a general EDA of the table, checking how strongly the states, stores and sales values are related, what effect holidays and the other exogenous variables have, and how we can reduce the dimensions. We mostly do this to check whether a top-down model would give satisfactory results, to reduce dimensionality, and to gain insights into the data. Once the EDA is complete, we will, based on its results, choose subsets of the total data and try the models on them, which gives us a better understanding of which model to choose without high computational cost.


With this we have come to the end of this section and now will proceed with the EDA.


















# Exploratory Data analysis:

For our EDA, we will initially take a look at how each of the explanatory variables influences the target variable.

### Total sales over time:
First we will check how the total sales of all items vary over time to get an idea of the trends and seasonality. We will do this in two steps: first we plot the data for the entire duration of the dataset, and then we look at subsets of the data to check for weekly/monthly seasonality.

In [None]:
df.tail()

In [None]:
df[['date','value']].groupby('date').agg({'value':'sum'}).plot(figsize=(20,8),grid=True);

Given above is the plot of the total sales of all items over the full duration. There are a few notable features. One is the general upward trend over the years. There also seems to be a visible seasonality that is consistent through the years. Another point is the huge dip around the Christmas season, as well as smaller dips at other points, which indicates the importance of including holiday details.

Next we will take a look at one year's worth of data.

In [None]:
df.loc[(df.date>'2015-01-01')&(df.date<'2016-01-01')][['date','value']].groupby('date').agg({'value':'sum'}).plot(figsize=(20,8),grid=True);

Similar to the graph above, this too shows seasonality: a slight seasonality over the year as well as monthly/weekly seasonality. We can see a small spike at the start of almost every month as well as weekly fluctuations. It could be that people purchase more during the weekend than on weekdays. Let us have a closer look at an ordinary month picked at random and check how the weekly seasonality varies.


In [None]:
ax=df.loc[(df.date>'2014-08-01')&(df.date<'2014-09-01')][['date','value']].groupby('date').agg({'value':'sum'}).plot(figsize=(20,8))
ax.xaxis.set_minor_locator(dates.DayLocator())
ax.xaxis.set_minor_formatter(dates.DateFormatter("%a-%B-%d"))
ax.tick_params(which='minor', rotation=45)
ax.grid(True, which='minor')  # positional flag works across matplotlib versions

Looking through a few individual months' worth of data, we notice that the shopping activity is lowest during the weekends and increases over the week, as shown above. Also, there seems to be a slight downward trend across the month, maybe because people tend to spend more when their salary comes in, which is often at the start of a month.
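
Reading the weekday pattern off a plot can be error-prone, so it may help to compute it directly. The sketch below uses synthetic daily totals (not the competition data) and averages them by day name; the column names `date` and `value` mirror the final dataframe, everything else is made up.

```python
import numpy as np
import pandas as pd

# Toy daily totals for four full weeks, starting on a Monday.
dates = pd.date_range("2014-08-04", periods=28, freq="D")
daily = pd.DataFrame({"date": dates, "value": np.arange(28) % 7 + 10})

# Average sales per weekday, ordered Monday to Sunday.
order = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]
by_weekday = (
    daily.groupby(daily["date"].dt.day_name())["value"]
    .mean()
    .reindex(order)
)
print(by_weekday)
```

On the real data the same grouping could be applied to the daily totals used for the plots above, giving one number per weekday to compare.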

### Sum of sales by store:

Next, I would like to see how the sales of all products vary by store. This gives us an idea of the volume passing through each store, their respective market shares, and trends in general. We will use a rolling 90-day window to understand the trends and seasonality a little better. Note, however, that the weekly seasonality gets lost because of the 90-day rolling window.

In [None]:
storewise=df[['date','store_id','value']].groupby(['date','store_id']).agg({'value':'sum'})
storewise.reset_index(inplace=True)
storewise.pivot(index="date", columns="store_id", values="value").rolling(window=90).mean().plot(figsize=(20,8),grid=True,title='Sum of sales by store');

Straight away we can notice a few important points. For one, we see an initial upward trend followed by stability over the past few years, and a very gradual rise overall. Most prominent is the yearly seasonality, which is very pronounced and consistent through the years. There are a few anomalies: the WI_2 store shows a great surge towards the end, and a similar surge is noticed in CA_2. We can see that there are distinct differences between the stores.

### Sum of sales by category:
Similar to the previous section, we will try to assess the effect of the category on the sum of sales.

In [None]:
category_wise=df[['date','cat_id','value']].groupby(['date','cat_id']).agg({'value':'sum'})
category_wise.reset_index(inplace=True)
category_wise.pivot(index="date", columns="cat_id", values="value").rolling(window=90).mean().plot(figsize=(20,8),grid=True,title='Sum of sales by category');

Here we can see a few interesting trends. The food category's sales are fairly seasonal, have increased steadily over the past few years, and are much higher than those of household items and hobbies, which is to be expected. The other two categories also show some extent of growth and seasonality, but nowhere near as pronounced as food.

### Sum of sales by state:

Now we will take a look at how sales varies by state.

In [None]:
statewise=df[['date','state_id','value']].groupby(['date','state_id']).agg({'value':'sum'})
statewise.reset_index(inplace=True)
statewise.pivot(index="date", columns="state_id", values="value").rolling(window=90).mean().plot(figsize=(20,8),grid=True,title='Sum of sales by state');

Again, as expected, we see similar trends and seasonal components as in the graphs above. The main points to note are the difference in sales volume between the states and how high WI's growth rate is in comparison to TX.

### Sales by price:
This is potentially one of the most important exogenous variables given to us. The price often greatly influences how much of an item people purchase. For this section we will not use all of the data to plot; instead we will look at a single item to check how price changes affect its sales, and then check the general trends by category.

In [None]:
sns.set(rc={'figure.figsize':(10,5)})  # set the figure size before plotting so that it takes effect
item1=df[['item_id','sell_price','value']].loc[df['item_id']=='HOBBIES_1_008']
sns.barplot(x='sell_price',y='value',data=item1).set_title('Item1')
plt.show()


Straight away we can see how price influences the sales of a particular item and its importance in the analysis. Let us take a look at how price affects the sales of items for a category as a whole.

In [None]:
sns.set(rc={'figure.figsize':(10,5)})  # set the figure size before plotting so that it takes effect
item1= df[['cat_id','sell_price','value']].loc[df['cat_id']=='FOODS']
sns.scatterplot(x='sell_price',y='value',data=item1).set_title('Effect of price on sales for food')
plt.show()

We observe that more expensive items are much less often bought and the graph is highly right skewed. A similar trend is observed in all three categories. 

We can see that prices play a big role in the sales as expected.
Next we will see how events affect the sales.

### Effect of events on sales:

In [None]:
sns.set(rc={'figure.figsize':(20,8)})  # set the figure size before plotting so that it takes effect
ax=sns.barplot(x='event_name_1',y='value',data=df[['event_name_1','value']].groupby('event_name_1').agg({'value':'mean'}).sort_values(['value']).reset_index())
ax.tick_params(which='both',rotation=90)

We can notice a few things right away. The major American holidays like Christmas, Thanksgiving and New Year have reduced sales. This is most likely because these days are spent with family, with purchases for the day made much earlier; perhaps the week before these dates would show a relative spike. There are almost no sales on Christmas as it is the only holiday for all the stores. Some other days, like the Super Bowl, another important American event, are marked by high sales. We can see how events play a role in the sales data.
In the final part of our EDA, we will take a look at how SNAP affects sales.

### Effect of SNAP(Supplemental Nutrition Assistance Program) on sales:
SNAP provides a monthly supplement for purchasing nutritious food. The columns for the three states tell us whether the store allows purchase using this program on any given date. Our goal is to see how much SNAP affects sales.

In [None]:
fig, ax =plt.subplots(1,3)
sns.barplot(x='snap_CA',y='value',data=df[['snap_CA','value']].groupby('snap_CA').agg({'value':'mean'}).sort_values(['value']).reset_index(),ax=ax[0])
sns.set(rc={'figure.figsize':(10,6)})
sns.barplot(x='snap_TX',y='value',data=df[['snap_TX','value']].groupby('snap_TX').agg({'value':'mean'}).sort_values(['value']).reset_index(),ax=ax[1])
sns.barplot(x='snap_WI',y='value',data=df[['snap_WI','value']].groupby('snap_WI').agg({'value':'mean'}).sort_values(['value']).reset_index(),ax=ax[2])
plt.show()

In all three states there is a notable difference in sales, showing that there are more sales when SNAP is allowed. This is intuitive, but the size of the effect is also quite notable, and hence SNAP will be important to include in our analysis.


## Choosing a model: 
Now we will decide which model to choose. To do so, we will run each model on a single item for a single store and check:
1. The accuracy
2. The run time

If the run time for one of the models far exceeds the other, we will use the faster model regardless of accuracy.
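
To compare run times in a consistent way, a small timing wrapper can be used. The sketch below is hypothetical: `timed` and `dummy_model` are names invented for this example, and `dummy_model` merely stands in for a real fit-and-predict call.

```python
import time

def timed(fit_and_predict, *args, **kwargs):
    """Run a callable and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fit_and_predict(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in for a model run; a real call would fit a model and return forecasts.
def dummy_model(n):
    return sum(range(n))

result, seconds = timed(dummy_model, 1_000_000)
print(f"result={result}, elapsed={seconds:.4f}s")
```

Wrapping each model's fitting code in the same helper keeps the comparison fair, since both are measured the same way on the same machine.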
First we will start off with forming a dataset with one item.

In [None]:
df['snap']=np.where(df['state_id']=='CA',df['snap_CA'] ,np.where( df['state_id']=='TX',df['snap_TX'],np.where(df['state_id']=='WI',df['snap_WI'],0 )))

In [None]:
item1=df[['date', 'id', 'cat_id', 'sell_price','event_name_1',  'event_name_2','snap_CA', 'snap_TX', 'snap_WI','snap', 'value']].loc[df['id']=='HOBBIES_1_001_CA_1_evaluation']

Given above is the dataframe for one item. Now we will run the SARIMAX model.

### SARIMAX:
SARIMAX has a few drawbacks. For one, it does not support multiple seasonalities: only one seasonal period is taken into consideration. There is a way to overcome this by adding Fourier terms to the exogenous variables; however, for ease we will avoid it and use a much smaller timeframe, the past 365 days. This greatly reduces the size of the dataset and removes the need to model yearly seasonality. The approach has its own drawbacks, but for simplicity and for lack of computational power we will stick to this highly simplified model.
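
For readers who do want to try the Fourier-term route mentioned above, here is a minimal sketch of how such terms could be built. The helper `fourier_terms` and its parameters are hypothetical names for this example; the notebook itself does not use them.

```python
import numpy as np
import pandas as pd

def fourier_terms(dates, period=365.25, K=2):
    """Return sin/cos columns for K harmonics of the given period (in days)."""
    # Days elapsed since the first date, used as the time index.
    t = (dates - dates[0]).days.to_numpy()
    cols = {}
    for k in range(1, K + 1):
        cols[f"sin_{k}"] = np.sin(2 * np.pi * k * t / period)
        cols[f"cos_{k}"] = np.cos(2 * np.pi * k * t / period)
    return pd.DataFrame(cols, index=dates)

# Two years of daily dates as a toy example.
dates = pd.date_range("2015-01-01", periods=730, freq="D")
yearly = fourier_terms(dates, period=365.25, K=2)
print(yearly.head())
```

These columns could be joined to the other exogenous columns and passed as `exog`, so that the seasonal part of the model handles the weekly period while the Fourier terms capture the yearly one. The forecast rows would need terms computed from the same starting date.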

To find the ARIMA orders and the seasonality order we will be using auto_arima.

In [None]:
train=item1[:-28]
test=item1.iloc[-28:]

In [None]:
len(test)

In [None]:
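# Note: set_index returns a new dataframe; the result is not assigned, so item1 is unchanged.
item1.set_index('date')
# Search for the ARIMA orders with a weekly seasonal period (m=7):
auto_arima(train['value'],seasonal=True,m=7,start_Q=0,start_P=0).summary()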

We obtain the orders (0,1,1)x(0,0,0,7). Now we will input these orders into the model and form our predictions.

In [None]:
model = SARIMAX(train['value'],exog=train[['sell_price','snap']].astype('float'),order=(0,1,1))
results = model.fit()
results.summary()

In [None]:
# The column order must match the exog used for fitting (sell_price, then snap):
exog=test[['sell_price','snap']].astype(float)
predictions=results.predict(start=len(train),end=len(train)+len(test)-1,exog=exog)

In [None]:
predictions=predictions.to_numpy()

In [None]:
test=test.copy()

test['predictions']=predictions

In [None]:
test.reset_index(drop=True,inplace=True)

We have run the model with the exogenous variables. Now we will plot the predicted values against the actual ones and then compute the RMSE.
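
For reference, RMSE is the square root of the mean squared difference between actual and predicted values. A minimal numpy version is sketched below; `rmse_manual` is a name made up for this example and is not part of any library.

```python
import numpy as np

def rmse_manual(actual, predicted):
    """Root mean squared error: sqrt(mean((actual - predicted)^2))."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

# Toy example: errors of 1 and 7 units give sqrt((1 + 49) / 2) = 5.
print(rmse_manual([1, 7], [0, 0]))  # 5.0
```

Because the errors are squared, a few large misses dominate the score, which matters for sparse series with occasional spikes.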

In [None]:
test[['value','predictions']].plot()

In [None]:
from statsmodels.tools.eval_measures import mse,rmse
RMSE= rmse(test['value'],test['predictions'])
print("RMSE:",RMSE)

Straight away we can see that the predictions are very inaccurate. This could be because there are so many 0 values, with a lot of noise and barely any signal. Note that even with seasonal terms allowed, the model did not detect any weekly seasonality. Such sparse data points can throw the predictions far off. Some modifications might help, such as introducing Fourier terms, denoising, or even using the top-down method so as to capture overall trends.
Next we will take a look at Prophet's predictions.

### Prophet:

A major advantage Prophet holds is that it can take multiple seasonalities into consideration: as part of fitting, it produces its own Fourier terms for each seasonality.

In [None]:
#Importing required libraries:
import pandas as pd
from fbprophet import Prophet  # the package is published as `prophet` from version 1.0 onwards
from tqdm.notebook import tqdm

In [None]:
# Forming the holidays dataframe (one row per event date; -1 marks days without an event):
holidays_df=df.loc[df['event_name_1']!=-1,['date','event_name_1']].drop_duplicates()

In [None]:
holidays_df.columns=['ds','holiday']
holidays_df

In [None]:
# Making the columns in the required format:
cols=['ds','y','sell_price','snap']
id=sample_submission['id']

In [None]:
# Training and fitting the data:
item=train[['date','value','sell_price','snap']]
item.columns=cols  
m = Prophet(weekly_seasonality=True,holidays=holidays_df)
m.add_regressor('sell_price')
m.add_regressor('snap')
m.fit(item[-365:])

In [None]:
#  Forming the dataframe for our forecast:
future = m.make_future_dataframe(periods=28)
future['sell_price']=item1['sell_price'][-393:].to_numpy()
future['snap']=item1['snap'][-393:].to_numpy()

In [None]:
# Forecasting the data:
forecast = m.predict(future)[-28:]

In [None]:
# Putting the forecasted data with the actual values to test:
preds=forecast['yhat']
preds=preds.to_numpy()
test['yhat']=preds

In [None]:
# Plotting the data:
m.plot(m.predict(future));

In [None]:
test[['yhat','value']].plot();

There is a marked difference in this case. The model seems to perform quite a lot better. This does not mean that our model is anywhere close to optimal. There are many points for improvement, and more sophisticated, better-tuned models are likely to perform a lot better. Let us take a look at the RMSE value.

In [None]:
RMSE = rmse(test['yhat'],test['value'])
print("RMSE:",RMSE) 

The RMSE is a lot lower than with SARIMAX, and the run time is considerably lower as well. Hence we will proceed with Facebook's Prophet as our final model, keeping the exogenous variables as extra regressors.

## Modelling using Facebook's Prophet:
For our final model, we will be using Facebook's Prophet. The implementation is as follows:
Running on only one VM/machine would take a long time. Hence we divide the work into three parts and, for convenience, split by state: Colab will predict for the items in the Texas stores, I will use my own machine for California, and Wisconsin will be run on a Kaggle notebook. The code is the same for all three with a slight modification. Afterwards, all the predictions are combined in this notebook, made in Google's Colaboratory.

In [None]:
# Forming the necessary dataframes for our final model:
cols=['ds','y','sell_price','snap_TX']
id=sample_submission['id']
# Forming the submission dataframe:
submission=pd.DataFrame(index=('F1', 'F2', 'F3', 'F4', 'F5', 'F6', 'F7', 'F8', 'F9', 'F10',
       'F11', 'F12', 'F13', 'F14', 'F15', 'F16', 'F17', 'F18', 'F19', 'F20',
       'F21', 'F22', 'F23', 'F24', 'F25', 'F26', 'F27', 'F28'))
future_data=pd.merge(prices[['id','sell_price','wm_yr_wk']],calender[['date','wm_yr_wk']],left_on='wm_yr_wk',right_on='wm_yr_wk')

In [None]:
# To find where 'ID' for California ends and Texas begins:
print(id[id=='HOBBIES_1_001_TX_1_evaluation'].index)
print(id[id=='HOBBIES_1_001_WI_1_evaluation'].index)
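
An alternative to looking up index boundaries is to read the state directly from each id. The sketch below assumes the ids follow the pattern item, state, store number, then the `evaluation` suffix, as in the ids used in this notebook; the sample ids and the name `ids_by_state` are made up for illustration.

```python
import pandas as pd

# Toy ids following the competition's naming pattern.
ids = pd.Series([
    "HOBBIES_1_001_CA_1_evaluation",
    "HOBBIES_1_001_TX_1_evaluation",
    "FOODS_3_090_WI_2_evaluation",
    "FOODS_3_090_TX_3_evaluation",
])

# The state code is the third token from the end when splitting on '_'.
state = ids.str.split("_").str[-3]

# One list of ids per state, e.g. one list per machine.
ids_by_state = {s: ids[state == s].tolist() for s in state.unique()}
print(ids_by_state["TX"])
```

This avoids relying on the row order of the sample submission when handing out work per state.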

Now we know how to divide the predictions between the various machines. We do our predictions below in a for loop. Note that we save the data every 100 iterations so as not to lose it in case the runtime gets disconnected.

In [None]:
def predict(i, n):
    # History for this id and its prices, in date order:
    item=df[['date','value','sell_price','snap_TX']].loc[df['id']==i]
    future_id=future_data[['sell_price','date']].loc[future_data['id']==i].sort_values('date')
    item.columns=cols
    m = Prophet(yearly_seasonality=False,daily_seasonality=False,holidays=holidays_df)
    m.add_regressor('sell_price')
    m.add_regressor('snap_TX')
    # Fit on the last 365 days only:
    m.fit(item[-365:])
    # Keep only the 28 forecast days and attach the regressors for them:
    future = m.make_future_dataframe(periods=28)[-28:].copy()
    future['sell_price']=future_id['sell_price'][-28:].to_numpy()
    future['snap_TX']=calender['snap_TX'][-28:].to_numpy()
    forecast = m.predict(future)['yhat']
    submission[i]=forecast.to_numpy()
    # Save a checkpoint every 100 ids:
    if n%100==0:
        submission.to_csv('submission_TX.csv')

# n is the loop counter used for checkpointing:
for n,i in enumerate(tqdm(id[47590:47592])):
    predict(i, n)

Given below is simply a sample of how the output will look. All the subsets are combined, put into the desired format and submitted. To get more values, change the id range in the code above as needed. The model above is not perfect either; feel free to tweak the parameters in the library for better results.

In [None]:
submission.tail()

### Combination of all the predictions into a single dataframe:
Having done the predictions in sections, there are quite a few to combine. Feel free to skip this section as it provides no further information. All the code below was run on Colab and will not run in this notebook, but it is included to give an idea of how to join the predictions from the various sources and format them into the output expected by the competition.

In [None]:
# Transposing them from being horizontal to vertical:
TX1=TX1.transpose()
TX2=TX2.transpose()
WI1=WI1.transpose()
WI2=WI2.transpose()
WI3=WI3.transpose()
CA1=CA1.transpose()
CA2=CA2.transpose()
CA3=CA3.transpose()
CA4=CA4.transpose()
CA5=CA5.transpose()
CA6=CA6.transpose()
CA7=CA7.transpose()
CA8=CA8.transpose()

In [None]:
#Combining all of them into one dataframe:
submission=pd.concat([CA1,CA2,CA3,CA4,CA5,CA6,CA7,CA8,TX1,TX2,WI1,WI2,WI3])  # DataFrame.append was removed in pandas 2.0

In [None]:
# Formatting to specifications for kaggle submission:
submission.reset_index(inplace=True)
submission.columns=sample_submission.columns
submission.to_csv('submission.csv')
submission=pd.read_csv('/content/submission.csv')
final=pd.concat([sample_submission[:30490],submission])  # DataFrame.append was removed in pandas 2.0
final.drop('Unnamed: 0',inplace=True,axis=1)
final.reset_index(drop=True,inplace=True)
final.to_csv('final_submission.csv')

Note that the first 30490 rows are filled with 0s as they are the validation data set predictions, which have not been made. Only the evaluation data set predictions count towards the final score.

With that, we have come to the end of the project! Hope you enjoyed it!