<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

## Motivation

Sales forecasting is commonly used to make rough revenue predictions in businesses of all sizes. Although it is a useful way of combining past data and current goals to estimate revenue potential, the process has a few major disadvantages. The longer a company has existed and recorded its sales, the more accurately it can forecast the future and plan for unexpected events. New companies, however, are largely guessing when they use sales forecasting strategies, because they do not yet have enough data.

As you saw in the workshop, many companies rely solely on technical analysis and forecasting techniques to estimate the future. Many CEOs and other decision makers wrongly believe that if a product revealed certain patterns or trends in the past, it will certainly behave the same way in the future. It is a mistake to assume that a product that generated certain revenues in the past will generate the same revenues in the future.

This does not mean that we should drop forecasting techniques. On the contrary, we can take advantage of them to identify internal strengths and weaknesses in our business, but we must be aware that we cannot rely 100% on these tools to make predictions. If we could, we would all be able to predict the stock market by now and we would all be rich, right?

## Task 1: Understanding Forecasting Techniques in Sales

In this lab session, we will analyse the behaviour of 10 stores that sell 50 different items. 

You are given 5 years of store-item sales data, and you are asked to predict 3 months of sales for 50 different items at 10 different stores.

What's the best way to deal with seasonality? Should stores be modeled separately, or can you pool them together? These are all very interesting questions, but in this lab we will keep it simple by analysing only one store and one item: a specific item that is sold in the store that has the highest sales. 


In [0]:
# Load the required libraries

# Data Manipulation
import numpy as np
import pandas as pd
# pandas.tools.plotting was removed from pandas; the function now lives in pandas.plotting
from pandas.plotting import autocorrelation_plot

# Data Visualization 
import matplotlib.pyplot as plt
from matplotlib import pyplot
import seaborn as sns

# Forecasting libraries
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose

from sklearn.metrics import mean_squared_error

# Ignore warning messages
import warnings
warnings.filterwarnings('ignore')

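# Prophet is published as 'prophet' since version 1.0 (formerly 'fbprophet')
try:
  from prophet import Prophet
except ImportError:
  from fbprophet import Prophet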

In [0]:
# load the dataset containing the sales of different stores with different items
df = pd.read_csv( 'data/sales.csv' )
print( df.head() )

How many stores do we have in this dataset?

In [0]:
### YOUR CODE HERE
# Store your result in variable num_stores

num_stores = 
print( 'Total number of stores is: ' + str(num_stores) )


Over what time period was this data collected?

In [0]:
### YOUR CODE HERE
# Store your result in variables max_period and min_period

max_period = 
min_period = 

print( 'This dataset has been collected from ' + str(min_period) + ' to ' + str(max_period)  )

For plotting purposes, we are going to index this dataset by the date column. We do it in the following way:

In [0]:
df['date'] = pd.to_datetime(df.date,format='%Y-%m-%d')
df.index = df['date']


Let's visualise the number of sales that each store obtained throughout time:

In [0]:
fig, axes = plt.subplots(num_stores, figsize=(12, 16))

# for each store...
for store in df['store'].unique():

  sales = df.loc[df['store'] == store, 'sales']
  
  # specifying figure options
  ax = sales.plot(ax=axes[store-1])
  ax.set_title('Store ' + str(store))
  ax.set_ylabel('sales')
  ax.grid()
fig.tight_layout();

Plotting the daily number of sales of each store is a little confusing and hard to read. Let's repeat the visualisation, but this time with the **weekly** number of sales that each store obtained throughout time:

In [0]:
fig, axes = plt.subplots(num_stores, figsize=(12, 16))

# for each store...
for store in df['store'].unique():
  
  # the resample function is a convenience method for frequency conversion of data
  # basically we are converting our daily item sales into a weekly frequency
  # this makes the data more clear for analysis
  week_sales = df.loc[df['store'] == store, 'sales'].resample('W').sum()
  
  # specifying figure options
  ax = week_sales.plot(ax=axes[store-1])
  ax.set_title('Store ' + str(store))
  ax.set_ylabel('sales')
  ax.grid()
fig.tight_layout();
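
As a small aside on `resample`, here is a minimal sketch of what the weekly conversion does. The data is made up for illustration (a hypothetical series with 10 sales every day) and is not part of the lab dataset:

```python
import pandas as pd

# Hypothetical toy data: 28 consecutive days, starting on a Monday,
# with exactly 10 sales per day
dates = pd.date_range("2021-01-04", periods=28, freq="D")
daily = pd.Series(10, index=dates, name="sales")

# 'W' groups the days into calendar weeks (each week ends on a Sunday)
# and sum() adds up the daily values inside each week
weekly = daily.resample("W").sum()

print(weekly)
```

Each weekly value is simply the sum of the seven daily values in that week, so the total number of sales is unchanged; only the granularity of the series is coarser.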

It seems that the performance of each store is similar. As you can see, every year the sales show an initial growth, reach maturity (the highest value) and then start declining. This is the usual cycle of any product. The only differences are how long the increasing and decreasing phases last and what the highest number of sales is.
We can see this better if we plot the sales of all stores together:

In [0]:
c = 0
# for each store...
for store in df['store'].unique():

  # select the weekly number of sales 
  week_sales = df.loc[df['store'] == store, 'sales'].resample('W').sum()
  # and plot the results with different colors for each store
  week_sales.plot( color='C'+str(c), label='store'+str(store), figsize=(12, 8))
  
  # specifying figure options
  plt.title('Weekly sales per store')
  plt.ylabel('sales')
  plt.legend()
  
  c = c + 1

plt.show()

Given this information, we can see that store number 2 has the highest sales. To confirm this, let's build a dataframe that contains the average sales of each store:

In [0]:
# get a dataframe that represents the average sales per store 
# YOUR CODE HERE
avg_sales_per_store = 

# rename the grouped column to 'avg sales'
avg_sales_per_store = avg_sales_per_store.rename(columns={'sales':'avg sales'}) 


We can now compare the average sales of all stores in a bar plot:

In [0]:
# identify the store with the largest amount of sales (this should be store 2)
# YOUR CODE HERE
store_with_max_sales = 


plt.figure(figsize=(12,6))

# plot the results on a bar plot
stores = avg_sales_per_store['store']
barplot = plt.bar( stores, avg_sales_per_store['avg sales'] )

# highlight the store with the highest number of sales
barplot[ store_with_max_sales - 1 ].set_color('r')

# figure options
plt.title('Average sales per store')
plt.show()
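
For reference, the grouping pattern used in the two cells above can be sketched on a tiny made-up table. The dataframe `toy` and its values are hypothetical; only the column names `store` and `sales` follow the lab dataset:

```python
import pandas as pd

# Hypothetical toy data: three stores with two sales records each
toy = pd.DataFrame({
    "store": [1, 1, 2, 2, 3, 3],
    "sales": [10, 20, 40, 50, 5, 15],
})

# Average sales per store; as_index=False keeps 'store' as a regular column
avg = toy.groupby("store", as_index=False)["sales"].mean()
avg = avg.rename(columns={"sales": "avg sales"})

# idxmax() returns the row label of the largest average,
# which we then use to look up the store number
best_store = avg.loc[avg["avg sales"].idxmax(), "store"]

print(avg)
print("Store with the highest average sales:", best_store)
```

Keeping `store` as a column (rather than as the index) is what allows expressions such as `avg['store']` to work afterwards.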


Let's focus on store number 2, since it has the highest number of sales. Let's pick an individual item to analyse: item 50.

In [0]:
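# select the sales records of item 50 in the store with the highest volume of sales:
item = 50  # the item we analyse (this variable is used in the plot title below)
# YOUR CODE HERE
# put your result in a dataframe max_sales (keep the date index and the 'sales' column)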



Let's visualise the weekly sales of item 50 in store 2:

In [0]:
plt.figure(figsize=(12,8))
plt.title('Number of sales of item ' + str(item) + ' in store ' + str(store_with_max_sales))
plt.ylabel('Number of sales')


weekly_sales = max_sales['sales'].resample('W').sum()
weekly_sales = weekly_sales.to_frame()

weekly_sales['sales'].plot()

What do you understand about the volume of sales from the above figure? We can notice two important things:
1. There seems to be a **trend**: every year the number of sales is increasing. You can see this if you imagine a straight line connecting the peaks of the graph.

2. There are also lots of fluctuations in the volume of sales. There is a pattern in which sales decline and reach their minimum at the end of the year. This seems to be an item that has its peak of sales during summer and starts to decline towards winter. Can you guess what kind of item it is?

## Task 2. The Forecasting Sales Technique

The above graph illustrates the volume of sales of a specific item in a given store. How do we predict the trends and the future sales in this store?

To determine this, we need to apply the general learning technique that was presented in the workshop:

0. Analyse your data
1. Separate our dataset into two samples:

  1.1. A training set, which will be used to fit our model
  
  1.2. A test set, which will be used to test how well our model predicts the data
2. Select a predictive algorithm (a model to learn from our data)
3. Fit our sample of data to this model
4. Make predictions
5. Evaluate how good the model is
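
The steps above can be sketched end to end on synthetic data. This is only an illustration of the workflow: the data is randomly generated, and the model is a plain straight line fitted with `numpy.polyfit`, used here as a simple stand-in and not the Prophet model used later in this lab.

```python
import numpy as np

# 0. Hypothetical data: 100 days of sales with a linear trend plus noise
rng = np.random.default_rng(0)
t = np.arange(100)
y = 50 + 0.5 * t + rng.normal(0, 2, size=100)

# 1. Split chronologically: the last 20 days are held out for testing
n_test = 20
t_train, t_test = t[:-n_test], t[-n_test:]
y_train, y_test = y[:-n_test], y[-n_test:]

# 2-3. Select a model (a straight line) and fit it to the training set
slope, intercept = np.polyfit(t_train, y_train, deg=1)

# 4. Predict the held-out period
y_pred = intercept + slope * t_test

# 5. Evaluate: root mean squared error on the test set
rmse = np.sqrt(np.mean((y_test - y_pred) ** 2))

print("Estimated slope:", round(slope, 3))
print("Test RMSE:", round(rmse, 3))
```

Note that the split is chronological rather than random: in forecasting, the test set must come after the training set in time, otherwise the model would be evaluated on a period it has effectively already seen.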

### Analyse your data

So far, we identified the store with the highest number of sales and selected one item within that store. For the CEO of a company, who constantly needs to be aware of market changes and conditions, what we did so far is not enough! It would be desirable to understand the sales trends of the products that the store is selling. This can be helpful for many reasons that were covered in the workshop but, as we saw, an analysis of trends alone is not sufficient: it gives us a general overview of how sales are going, but we cannot rely solely on this information.

Also, as you learned in the workshop, different periods of the year stimulate the purchase of different items: at Christmas we buy more food, sweets and gifts; in summer we buy more outdoor activity products, and so on. So it is also fundamental to take into consideration the market changes that occur due to season, weather, etc.

What we have seen so far is the time series representation of our sales. As you could see, it looked like a very noisy signal from which we could not extract much insight. A method to make it clearer, which we covered in the workshop, is time series decomposition. Time series decomposition involves thinking of a series as a **combination of level, trend, seasonality, and noise components**.

Decomposition provides a useful abstract model for thinking about time series generally and for better understanding problems during time series analysis and forecasting.

In this lab session, you will discover time series decomposition and how to automatically split a time series into its components with Python.

To help us identify product trends and seasonal changes, we can use the statsmodels library that we imported earlier, more precisely its function *seasonal_decompose*.
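
To build some intuition for what a decomposition produces, here is a hand-rolled sketch of a classical additive decomposition using only pandas. It is not the statsmodels implementation, only a simplified illustration of the same moving-average idea, and the series is synthetic (a hypothetical daily series with a weekly cycle):

```python
import numpy as np
import pandas as pd

# Hypothetical daily series: linear trend + weekly cycle + noise
period = 7
n = 56
t = np.arange(n)
rng = np.random.default_rng(1)
values = 100 + 2.0 * t + 10 * np.sin(2 * np.pi * t / period) + rng.normal(0, 1, n)
series = pd.Series(values, index=pd.date_range("2017-01-01", periods=n, freq="D"))

# Trend: centred moving average over one full cycle (smooths out the seasonality)
trend = series.rolling(window=period, center=True).mean()

# Seasonality: average detrended value at each position in the cycle,
# shifted so that the seasonal effects sum to zero over one cycle
detrended = series - trend
position = t % period
seasonal_means = detrended.groupby(position).mean()
seasonal_means = seasonal_means - seasonal_means.mean()
seasonal = pd.Series(seasonal_means.loc[position].to_numpy(), index=series.index)

# Noise: whatever is left after removing trend and seasonality
residual = series - trend - seasonal

print(seasonal_means.round(2))
```

The trend is undefined (NaN) for the first and last few points, because a centred moving average needs a full window of observations around each date.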


In [0]:
# Apply the seasonal decomposition method to the sales dataset for a yearly frequency
decomposed_sales = 


We can now observe the trends regarding the sales of our product. Note that the trend component is supposed to capture the slowly-moving overall level of the sales.


In [0]:
# plot the general trend of the sales



What interpretations do you take from this graph?

**YOUR ANSWER HERE:**

We can also analyse the seasonal sales variation of our item:

In [0]:
plt.figure(figsize=(15,6))

# plotting the seasonal changes of the sales
decomposed_sales.seasonal.plot()

Now that we have a general understanding of our data, we can start the forecast method!

### Separate our Dataset

We are going to forecast our data using our daily sales dataset. The first thing we need to know is the size of this dataset. Since this is a forecasting approach, we will use the majority of our data to train our model and we will reserve the last 3 months to test it.

In [0]:
# 1. SEPARATE OUR DATASET:
total_sales = len(max_sales)

# we will split our data by selecting the last 3 months for prediction and the 
# remaining data for training
data_split = 90 # 90 days corresponds to the three months

# Allocate the data for training
# (the end of an iloc slice is exclusive, so no datapoint is skipped or shared)
train = max_sales.iloc[:total_sales - data_split]

# Put the remaining data (the last 90 days) in the test set
expected = max_sales.iloc[total_sales - data_split:]

print('Total datapoints in the dataset: ' + str( total_sales )) 
print('Datapoints reserved for training: ' + str(len(train)))
print('Datapoints reserved for testing: ' + str(len(expected)))

In [1]:
# transform the train and test sets in the format that was presented in the studio session
# column 1: name 'ds' with the dates
# column 2: name 'y' with the values we want to predict

# YOUR CODE HERE
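
The expected layout can be illustrated on a tiny made-up table. The dataframe `toy` below is hypothetical; it only mirrors the lab data in having a `date` column that is also used as the index, and a `sales` column:

```python
import pandas as pd

# Hypothetical toy data, indexed by date like the lab dataframe
dates = pd.date_range("2017-01-01", periods=5, freq="D")
toy = pd.DataFrame({"date": dates, "sales": [12, 15, 11, 18, 20]})
toy.index = toy["date"]

# Keep only the two relevant columns, rename them to 'ds' and 'y',
# and drop the date index so that 'ds' is an ordinary column
prophet_df = (
    toy[["date", "sales"]]
    .rename(columns={"date": "ds", "sales": "y"})
    .reset_index(drop=True)
)

print(prophet_df)
```

The index is dropped with `drop=True` because the dates are already available in the `date` column; a plain `reset_index()` would try to insert a second column with the same name.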




### Define Learning Algorithm

Next, we define the type of learning algorithm that we want to apply. Python's statistical libraries offer a wide range of learning algorithms that you can explore. For the purposes of this unit, we will explore Facebook's Prophet algorithm, which fits an additive regression model that combines a trend with seasonal components.



In [None]:
# 2. DEFINE THE LEARNING ALGORITHM

model = 

### Fit the data to the model

After defining our learning model and plugging in the right learning parameters, we need to fit the model to our data (learn from the data).

In [0]:
# 3. FIT THE MODEL TO THE DATA
# YOUR CODE HERE




### Use our test set on the learned model to forecast our sales

Let's use Prophet to try to estimate the last 3 months of our dataset. This way we can see how well the algorithm performed, since we have the true sales of the store for that time period.


In [3]:
# 4. MAKE PREDICTIONS

# YOUR CODE HERE

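In order to have a numerical validation of the model, instead of just a graphical one, we can compute a forecast error, for example the mean squared error between the predicted and the true sales.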

In [4]:
# Compute the forecast error:
# YOUR CODE HERE:
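
As a worked illustration of common forecast error measures, here is a minimal sketch with made-up numbers (the arrays `actual` and `predicted` are hypothetical and unrelated to the lab data):

```python
import numpy as np

# Hypothetical true sales and forecasts for four days
actual = np.array([20.0, 22.0, 19.0, 25.0])
predicted = np.array([18.0, 23.0, 21.0, 24.0])

errors = actual - predicted           # 2, -1, -2, 1

mse = np.mean(errors ** 2)            # mean squared error: (4 + 1 + 4 + 1) / 4
rmse = np.sqrt(mse)                   # same unit as the sales themselves
mae = np.mean(np.abs(errors))         # mean absolute error: (2 + 1 + 2 + 1) / 4

print("MSE:", mse)
print("RMSE:", round(rmse, 4))
print("MAE:", mae)
```

The root mean squared error is often easier to interpret than the mean squared error, because it is expressed in the same unit as the quantity being forecast (here, number of sales).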



In [None]:
# Apply the seasonal decomposition method to our sales dataset for a monthly frequency
# to determine the trend
decomposed_sales = 



## Task 3: Make Predictions for Different Items and Different Years

Try repeating the exercise, but now focusing on specific items over a certain timeframe.
For instance, for store 10, predict the trends of item 7 using the last 2 years as historical data.


In [6]:
# Analyse your data
#    Make appropriate visualisations or data manipulations
#    The goal is to understand the kind of data you are dealing with
#    Check if there are seasonal trends
#    And other things that you might find relevant to augment your business concern


# Forecasting:
# Separate our dataset into two samples:
#    A training set, which will be used to fit our data
#    A test set, which will be used to test how well our model predicts the data
# 1. SEPARATE OUR DATASET:

# we will split our data by selecting the last 2 months for prediction and the 
# remaining data for training


# Allocate the data for training

# Put the remaining data of our dataset 


# Select a predictive algorithm (a model to learn from our data)



# Fit our sample of data to this model


# Make prediction

# Apply the seasonal decomposition method to our sales dataset for a monthly frequency
# to determine the trend


# Evaluate how good the model is


# After being happy with your parameters and your results, train the model with 
# the entire dataset and predict the next 60 days as asked in the task
# what are your findings? What do you recommend to the CEO?

