<div class="alert alert-block alert-info"><b>IAB303</b> - Data Analytics for Business Insight</div>

# STUDIO :: Forecasting

Forecasting is the process of making predictions about the future based on past and present data, most commonly by analysing trends. By identifying trends, decision-makers can take possible future outcomes into account, which, if the predictions prove correct, may give a business a strategic advantage.

**Predictive Analytics** has traditionally been viewed as one of three types of analytics (usually analytics based on numerical data):

1. **Descriptive analytics**
    - Interpretation of historical data to better understand changes that have happened in a business. 
    - Describes the past using a range of data to draw comparisons.
    - Usually consists of reports such as year-over-year pricing changes, month-over-month sales growth, the number of users, or total revenue.
    - Performance metrics can be used to flag areas of **strength** and **weakness** in order to inform management’s strategy.

2. **Predictive analytics**
    - Used to make **predictions** about unknown future events.
    - Describes the use of statistics and modeling to determine **future performance** based on current and historical data.
    - Looks at **patterns** in data to determine if those patterns are likely to emerge again, which allows businesses and investors to **adjust** where they use their **resources** in order to take **advantage** of possible future events.
    - Example: marketers look at how consumers have reacted to the overall economy when planning a new campaign, and can use shifts in demographics to judge whether the current mix of products will attract consumers to make a purchase.
   

3. **Prescriptive analytics**
    - Uses technology to help businesses make better decisions about how to handle specific situations by factoring in knowledge of possible situations, available resources, past performance and what is currently happening.
    - Builds on statistics and modeling of future performance, based on current and historical data, to improve business decisions despite uncertainty and changing conditions, and to help companies determine what action to take.
    - Can help prevent fraud, limit risk, increase efficiency, meet business goals and create more loyal customers.

<img src="https://upload.wikimedia.org/wikipedia/commons/c/c7/Three_Phases_of_Analytics.png" />
  

### Predictive Analytics: Forecasting

Forecasting is the process of making predictions about the future based on past and present data, most commonly by analysing trends.

<img src="https://i2.cdn.turner.com/money/2012/01/02/markets/stock_market_outlook_survey/chart-sp500-stock-outlook.top.gif" />

**Strengths**
* Forces a company to think about how it intends to monitor and track sales beyond the current period.
* Allows the business strategy to be adjusted based on predicted sales growth.
* If a seasonal pattern in sales is noticed, staff can be hired or reduced accordingly.
* Sales can be tracked per item, and this information used to focus on stronger-selling products and services.

**Weaknesses**
* Limited data reduces the effectiveness of a sales forecast.
* Past sales results are not always indicative of future sales results (very important)!
* Sales forecasting relies on some form of projection about future demand, interpreted through consumer preferences, opinions and attitudes.
* Consumer demand is a moving target, which makes future projections hard.

### Things to take into consideration in Forecasting:

* Variation - seasonal data (e.g. Winter, Summer, Autumn, Spring, Easter, Holidays, Christmas, etc.)
* Forecast period - the further ahead we try to forecast, the larger the errors tend to be (prefer short-term forecasting where possible)
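
The effect of the forecast period can be illustrated with a minimal simulation. It assumes, purely for illustration, that day-to-day changes accumulate like a random walk; real sales need not behave this way, and every name and number here is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Assumption: each simulated path is a random walk, i.e. a running sum of
# independent daily shocks with standard deviation 1.
n_paths, horizon = 2000, 30
shocks = rng.normal(loc=0.0, scale=1.0, size=(n_paths, horizon))
paths = shocks.cumsum(axis=1)

# Spread of the simulated outcomes at each forecast horizon (day 1 .. day 30)
spread_by_day = paths.std(axis=0)

print(f"Spread after 1 day:   {spread_by_day[0]:.2f}")
print(f"Spread after 30 days: {spread_by_day[-1]:.2f}")
```

Under this assumption the range of plausible outcomes widens as the horizon grows, which is why a forecast for next week is usually more trustworthy than one for next year.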

## Forecasting with Facebook Prophet

We will use the Facebook Prophet library to do some basic forecasting on example data.

#### Import libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime 

# Forecasting libraries
from statsmodels.tsa.seasonal import seasonal_decompose
from prophet import Prophet

In [None]:
# load dataset - example sales data from https://www.kaggle.com/kyanyoga/sample-sales-data/version/1
# file = 'data/sales_train.csv'
# df = pd.read_csv( file )
#
# This dataset is 2.9M rows long and is about 90MB.
# We've created a smaller version for this exercise:
# only 29k rows - 1.2MB.
# This is the code we used to do it.
#
# df_small = df.sample(frac=0.01, replace=False, random_state=1)
# df_small.to_csv("data/sales_train_small.csv")
# df_small
#

For the purposes of this exercise, we don't need the original full dataset. 

Also note that we are not doing this as a machine learning exercise, but as a business analytics exercise, so we will only use the training data. If you wish to check the accuracy of the model and try additional algorithms, you will need to download the test data as well.

In [None]:
# load small dataset - example sales data from https://www.kaggle.com/kyanyoga/sample-sales-data/version/1
file = 'data/sales_train_small.csv'
df = pd.read_csv( file )
df

In [None]:
# get the time period of this dataset

print( 'The data ranges from ' + str(df['date'].min()) + ' to ' + str(df['date'].max())  )

So this date range looks like a problem as the last row in the dataframe is `15.03.2103`. These values are probably not in date format, so `min()` and `max()` will be treating the dates as text!

In [None]:
# Fix dates
df['date'] = pd.to_datetime(df.date,format='%d.%m.%Y')
df['date']

In [None]:
print( 'The data ranges from ' + str(df['date'].min()) + ' to ' + str(df['date'].max())  )
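
To see why text dates give a misleading range, here is a small sketch using three made-up dates (they are not taken from the dataset). Text is compared character by character, whereas parsed dates are compared chronologically.

```python
import pandas as pd

# Hypothetical dates stored as text in day.month.year format
dates_as_text = pd.Series(['15.03.2013', '02.10.2015', '31.01.2014'])

# Text is compared character by character, so the value starting with '31' sorts last
text_max = dates_as_text.max()

# Parsed dates are compared chronologically
dates_parsed = pd.to_datetime(dates_as_text, format='%d.%m.%Y')
date_max = dates_parsed.max()

print(text_max)   # latest by text order
print(date_max)   # latest by calendar order
```

The two results differ: the text maximum is simply the string with the largest leading characters, not the most recent date.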

Better. Now we can use the dates to do a `groupby` and aggregate using `sum()`. This will give us both the total items sold in a day (`item_cnt_day`) and the total sales \$ amount (`item_price`).

In [None]:
sales_df = df.groupby('date').sum()
sales_df
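
As a small illustration of what this aggregation does, the sketch below applies the same `groupby('date').sum()` to three made-up transactions. The column names match the dataset, but the values are hypothetical.

```python
import pandas as pd

# Hypothetical transactions: two on 1 Jan, one on 2 Jan
toy = pd.DataFrame({
    'date': pd.to_datetime(['2015-01-01', '2015-01-01', '2015-01-02']),
    'item_price': [100.0, 250.0, 40.0],
    'item_cnt_day': [1, 2, 1],
})

# One row per date, with every numeric column summed
toy_daily = toy.groupby('date').sum()
print(toy_daily)
```

The result is indexed by date, which is what allows the time-series plots and resampling later on.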

Let's visualise the key data so that we can get a handle on what it might be telling us.

In [None]:
items_fig = sales_df['item_cnt_day'].plot(figsize=(15,5))

In [None]:
sales_fig = sales_df['item_price'].plot(figsize=(15,5))

There are visible patterns in the data. To make them clearer, we can use the `seasonal_decompose()` function to decompose the sales data into `Trend`, `Seasonal`, and `Residual` components.

In [None]:
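# Set the default figure size first; dsales.plot() creates its own figure
plt.rcParams['figure.figsize'] = [15, 12]
# Decompose daily sales assuming a yearly (365-day) seasonal cycle
dsales = seasonal_decompose(sales_df['item_price'], period=365)
fig = dsales.plot()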
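
To build intuition for the three components, the sketch below performs a rough, hand-rolled additive decomposition on synthetic data. It is a simplified illustration, not the statsmodels implementation, and it assumes a 7-day period and a made-up weekly pattern so that the numbers are easy to follow.

```python
import numpy as np
import pandas as pd

# Synthetic daily series: a straight-line trend plus a repeating weekly pattern
period = 7
n_weeks = 8
days = pd.date_range('2015-01-05', periods=n_weeks * period, freq='D')
weekly_pattern = np.array([0.0, 0.0, 0.0, 0.0, 5.0, 10.0, -15.0])  # sums to zero
t = np.arange(len(days))
toy_series = pd.Series(100 + 0.5 * t + np.tile(weekly_pattern, n_weeks), index=days)

# Trend: centred moving average over one full period
toy_trend = toy_series.rolling(window=period, center=True).mean()

# Seasonal: average detrended value at each position within the period,
# shifted so that the seasonal effects average to zero
detrended = toy_series - toy_trend
position = t % period
seasonal_means = detrended.groupby(position).mean()
seasonal_means = seasonal_means - seasonal_means.mean()
toy_seasonal = pd.Series(seasonal_means.to_numpy()[position], index=days)

# Residual: whatever the trend and seasonal components do not explain
toy_resid = toy_series - toy_trend - toy_seasonal

print(seasonal_means.round(2))
```

Because the synthetic series contains no noise, the recovered seasonal effects match the weekly pattern that was put in, and the residual is essentially zero. With real sales data the residual shows what the trend and seasonality leave unexplained.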

Sometimes, it can be easier to see what is happening by using less granular data. By re-sampling the data we can look at weeks instead of days. However, in this instance it is not really necessary.

In [None]:
# Resample daily sales into weekly totals and plot them once
week_sales = sales_df.item_price.resample('W').sum()
week_plot = week_sales.plot(figsize=(12, 5))
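
The resampling step can be checked on a tiny example. The sketch below uses made-up daily values (one unit per day for two full weeks) so the weekly totals are easy to verify by hand.

```python
import pandas as pd

# Hypothetical daily sales: 1 unit per day for two full Monday-to-Sunday weeks
toy_days = pd.date_range('2015-01-05', periods=14, freq='D')  # 5 Jan 2015 is a Monday
toy_daily_sales = pd.Series(1, index=toy_days)

# 'W' groups days into weeks ending on Sunday and labels each week by that Sunday
toy_weekly_sales = toy_daily_sales.resample('W').sum()
print(toy_weekly_sales)
```

Each weekly row is labelled with the Sunday that closes the week, so a partial first or last week in real data will show up as an unusually low total.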

### Forecasting

We can already get a sense of where the data may be going if the overall trend and seasonality persist. However, a forecasting algorithm like Facebook's `Prophet` can make this clearer.

For this example, we are going to assume that the business wants to predict possible sales through to the end of the financial year. For this data, that will be 30 June 2016.

#### Prep

First we need to prepare the data for the forecasting algorithm. It requires a dataframe that has a `ds` column for the date and a `y` column for the values that we want to predict. As the sales in this data are in millions, we're going to make them a bit easier to read by making 1 unit equal to \$1,000 (that is, \$1.5M will be shown as 1500).

In [None]:
# Create the forecast dataframe with y column (sales in units of $1,000)
forecast_df = sales_df.copy()  # copy so that sales_df is left unchanged
forecast_df['y'] = (forecast_df['item_price'] / 1000).round()
forecast_df

In [None]:
# Add the ds column, and just select ds and y columns for final dataframe
forecast_df['ds'] = forecast_df.index
forecast_df = forecast_df[['ds','y']]
forecast_df
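
The same preparation can be written as a single chain. The sketch below is one possible alternative, shown on two made-up days of sales; `toy_sales` and `toy_forecast` are hypothetical names.

```python
import pandas as pd

# Hypothetical daily totals, indexed by date like sales_df
toy_sales = pd.DataFrame(
    {'item_price': [1_500_000.0, 820_400.0]},
    index=pd.DatetimeIndex(['2015-01-01', '2015-01-02'], name='date'),
)

# Scale to units of $1,000, move the date index into a column, and rename it to ds
toy_forecast = (
    toy_sales
    .assign(y=(toy_sales['item_price'] / 1000).round())
    .reset_index()
    .rename(columns={'date': 'ds'})
    [['ds', 'y']]
)
print(toy_forecast)
```

Either approach gives the two-column `ds` and `y` layout that the forecasting step expects.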

In [None]:
# Use the prophet algorithm to create an empty model to be trained on existing data
model = Prophet()


In [None]:
# Train the model to fit our forecast dataframe
model.fit(forecast_df)

In [None]:
# Now use the model to do a prediction - starting with the existing data
# (with no argument, predict() uses the dates the model was trained on)
predict_df = model.predict()

In [None]:
predict_df[['ds','yhat','yhat_lower','yhat_upper']]

The prediction results are the forecasting model's statistical guess as to what the sales data should be for a given date (`yhat`) as well as the upper (`yhat_upper`) and lower (`yhat_lower`) bounds of each guess.

We can plot the output of the model to visualise how the statistical guesses (in blue) match up with the actual data (black dots).

In [None]:
fig = model.plot(predict_df)

We can also view the various components of the prediction (similar to the `seasonal_decompose()` output above).

In [None]:
figs = model.plot_components(predict_df)

### Predicting the future

However, what the business really wants is *NOT* predictions of what has already happened (which we already know) but predictions of what may happen in the future!

We can do this by asking the model to predict over dates in the future. To do this, we add future dates to our existing dates.

In [None]:
# Create dates for prediction

# 243 daily dates, starting from 1 November 2015
pred_date = pd.DataFrame({'ds': pd.date_range('2015-11-01', periods=243, freq='D')})
pred_date
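
The `periods=243` value is the number of days from 1 November 2015 to the end of the financial year on 30 June 2016, inclusive. A quick way to check this, rather than counting by hand, is to build the range from its start and end dates (the variable names here are hypothetical):

```python
import pandas as pd

first_forecast_day = pd.Timestamp('2015-11-01')
end_of_financial_year = pd.Timestamp('2016-06-30')

# Every calendar day from the first forecast day to the end of the financial year
future_days = pd.date_range(first_forecast_day, end_of_financial_year, freq='D')
print(len(future_days))
```

Specifying the end date directly also avoids recounting if the forecast window changes.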

In [None]:
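# Add the new dates to the dates we already have (Prophet only needs the ds column).
# DataFrame.append() was removed in pandas 2.0, so we use pd.concat() instead.
pred_df = pd.concat([predict_df[['ds']], pred_date], ignore_index=True)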

In [None]:
# Now call the predict function again on the revised dataframe
predict_df = model.predict(pred_df)

#### Results

When we visualise the results as before, we see that the predictions (guesses in blue) extend beyond our actual data (black).

In [None]:
fig = model.plot(predict_df)

In [None]:
figs = model.plot_components(predict_df)

Note that the trend continues, as there is nothing in the training data to suggest that it might turn around. However, yearly seasonality is still reflected in the peak in January.

If you wanted to try some advanced analysis, you could check these predictions and compare them against the actual data that occurred (which is downloadable from Kaggle with the original data).
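
A minimal sketch of such a comparison is shown below. It uses made-up numbers in place of real predictions and actuals, and it assumes the predictions have the `ds`, `yhat`, `yhat_lower` and `yhat_upper` columns seen earlier and that the actuals have `ds` and `y` columns. The two measures (mean absolute error and interval coverage) are one reasonable choice among many.

```python
import pandas as pd

# Hypothetical predictions and actuals for four future days (values are made up)
dates = pd.date_range('2015-11-01', periods=4, freq='D')
toy_predictions = pd.DataFrame({
    'ds': dates,
    'yhat': [100.0, 110.0, 120.0, 130.0],
    'yhat_lower': [80.0, 90.0, 100.0, 110.0],
    'yhat_upper': [120.0, 130.0, 140.0, 150.0],
})
toy_actuals = pd.DataFrame({'ds': dates, 'y': [105.0, 95.0, 150.0, 128.0]})

# Line up each prediction with the actual value for the same date
compared = toy_predictions.merge(toy_actuals, on='ds', how='inner')

# Mean absolute error: the average size of the miss, in the same units as y
mae = (compared['y'] - compared['yhat']).abs().mean()

# Coverage: the share of actual values that fall inside the predicted bounds
inside = compared['y'].between(compared['yhat_lower'], compared['yhat_upper'])
coverage = inside.mean()

print(f"MAE: {mae:.1f}, coverage: {coverage:.0%}")
```

A low error with coverage close to the model's stated interval width would suggest the forecast was useful; large misses or poor coverage would suggest that the trend or seasonality changed after the training period.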