## Tutorial :: Internal data (Strengths and weaknesses)

Recommended reading:

[A review and future direction of agile, business intelligence, analytics and data science](https://doi-org.ezp01.library.qut.edu.au/10.1016/j.ijinfomgt.2016.04.013)

 Criteria | Analytics Type | Analytics Objective | Data Type | Data Age
---|---|---|---|---
Traditional Business Intelligence | Descriptive, Predictive |Decision Support, Performance Management | Structured and Defined | >24 h
Fast Analytics with Big Data | Predictive, Prescriptive | Drive the Business | Unstructured, Undefined | <Min
  

### Kinds of Analytics

* **Descriptive analytics**
    - Interprets historical data to better understand changes that have happened in a business.
    - Describes the past using a range of data to draw comparisons.
    - Usually consists of reports such as year-over-year pricing changes, month-over-month sales growth, the number of users, or the total revenues.
    - Performance metrics can be used to flag areas of **strength** and **weakness** in order to inform management’s strategy.
    
    
    
    

* **Predictive analytics**
    - Used to make **predictions** about unknown future events.
    - Describes the use of statistics and modeling to determine **future performance** based on current and historical data.
    - Looks at **patterns** in data to determine if those patterns are likely to emerge again, which allows businesses and investors to **adjust** where they use their **resources** in order to take **advantage** of possible future events.
    - Example:  marketers look at how consumers have reacted to the overall economy when planning on a new campaign, and can use shifts in demographics to determine if the current mix of products will attract consumers to make a purchase.
   

* **Prescriptive analytics**
  - Uses technology to help businesses make better decisions about how to handle specific situations by factoring in knowledge of possible situations, available resources, past performance and what is currently happening. 
  - Builds on statistics and modeling of current and historical data to improve business decisions despite uncertainty and changing conditions, and to help companies determine what action to take.
  - Can help prevent fraud, limit risk, increase efficiency, meet business goals and create more loyal customers. 

<img src="https://upload.wikimedia.org/wikipedia/commons/c/c7/Three_Phases_of_Analytics.png" />
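
As a small illustration of the descriptive analytics described above, the sketch below computes month-over-month sales growth with pandas. The monthly figures and the variable names are made up for this example; they are not taken from the tutorial dataset.

```python
import pandas as pd

# Hypothetical monthly sales totals (illustrative numbers only)
monthly_sales = pd.Series(
    [100.0, 110.0, 99.0, 120.0],
    index=["Jan", "Feb", "Mar", "Apr"],
    name="sales",
)

# Month-over-month growth: (this month - last month) / last month.
# The first month has no predecessor, so its growth is NaN.
mom_growth = monthly_sales.pct_change()

print(mom_growth.round(3))
```

A report like this only describes what already happened; it makes no claim about the months that follow.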
  

### Predictive Analytics: Forecasting

Forecasting is the process of making predictions about the future based on past and present data, most commonly by analysing trends.

<img src="https://i2.cdn.turner.com/money/2012/01/02/markets/stock_market_outlook_survey/chart-sp500-stock-outlook.top.gif" />

* In general, the more past data that has been recorded, the more accurate the forecast model can be. Even so, unexpected events are always hard to forecast.

* New companies often rely on guesswork when forecasting sales, because they do not yet hold enough data.
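
To make "forecasting by analysis of trends" concrete, here is a minimal sketch that fits a straight line to past values and extrapolates it. The sales numbers are invented and perfectly linear for clarity, and a straight-line fit is only one simple way to capture a trend; it is not the method used later in this tutorial.

```python
import numpy as np

# Hypothetical past sales, one value per period (made-up, perfectly linear)
sales = np.array([10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
t = np.arange(len(sales))

# Fit a straight line: sales ~ slope * t + intercept
slope, intercept = np.polyfit(t, sales, deg=1)

# Extrapolate the fitted trend three periods beyond the last observation
future_t = np.arange(len(sales), len(sales) + 3)
forecast = slope * future_t + intercept

print(forecast)
```

With only a handful of points the fitted slope is very sensitive to each observation, which is one reason limited data leads to unreliable forecasts.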

**Strengths**
* Forces a company to think about how it intends to monitor and track sales beyond the current period.
* Lets the company adjust its business strategy based on its prediction for sales growth.
* If a seasonal pattern in sales is noticed, staff can be hired or reduced accordingly.
* Sales can be tracked per item, and this information used to focus on stronger-selling products and services.

**Weaknesses**
* Limited data reduces the effectiveness of a sales forecast.
* Past sales results are not always indicative of future sales results (very important)!
* Sales forecasting relies on some projection of future demand, interpreted through consumer preferences, opinions and attitudes, which are themselves uncertain.
* Consumer demand is a moving target, which makes future projections hard.

### Things to take into consideration in Forecasting:

* Seasons (Winter, Summer, Autumn, Spring, Easter, Holidays, Christmas, etc)
* Unpredictable Revenue
* Revenue based on sales forecasts is only moderately predictable.
* The further ahead we try to forecast, the larger the errors tend to be (prefer short-term forecasts where possible)
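
The last point can be illustrated with a small simulation. The sketch below assumes, purely for illustration, that sales follow a random walk (each day equals the previous day plus random noise) and that we forecast by repeating the last observed value. Neither assumption comes from the tutorial dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: daily changes are independent noise with standard deviation 1
n_series, horizon = 2000, 30
steps = rng.normal(loc=0.0, scale=1.0, size=(n_series, horizon))

# A naive forecast repeats the last observed value, so the error h days
# ahead is the sum of the next h random steps.
errors = np.cumsum(steps, axis=1)

# Root mean squared error at each horizon, averaged over all simulated series
rmse_by_horizon = np.sqrt((errors ** 2).mean(axis=0))

# Error 1 day, 7 days and 30 days ahead
print(rmse_by_horizon[[0, 6, 29]])
```

Under these assumptions the error keeps growing with the horizon, which is why short-term forecasts are usually more trustworthy than long-term ones.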

#### Analyse Data

In [None]:
# Data Manipulation
import numpy as np
import pandas as pd

# Data Visualization 
import matplotlib.pyplot as plt
from matplotlib import pyplot
import seaborn as sns

# Forecasting libraries
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA  # the old statsmodels.tsa.arima_model module is no longer supported
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_squared_error

# Ignore warning messages
import warnings
warnings.filterwarnings('ignore')

In [None]:
# load dataset
file = 'data/week-10/sales_ex.csv'
data = pd.read_csv( file )

data.head()

In [None]:
# get the time period of this dataset
max_period = data['date'].???
min_period = data['date'].???

print( 'This dataset has been collected from ' + str(min_period) + ' to ' + str(max_period)  )

In [None]:
# index data by date
data['date'] = pd.to_datetime(data.date,format=???)
data.set_index('date', inplace=True)
data

In [None]:
# Check sales throughout time
data['sales'].plot( figsize=(10, 5) )

The sales plot seems to indicate a pattern: it peaks in the middle of the year (summer) and reaches its minimum at the end of the year (December). Can you imagine an item that displays such behaviour?

Also, the sales graph looks like a signal composed of three elements: the underlying trend in sales, the seasonal pattern, and some noise (variation that fits neither the trend nor the seasonal pattern).

We can decompose this signal using the *seasonal_decompose* function from the statsmodels library.

In [None]:
# check the general trend and seasonality of the sales
# period=365 assumes daily data with a yearly seasonal cycle
decomposed_sales = sm.tsa.seasonal_decompose(data['sales'], period=365)
ax = decomposed_sales.plot( )

By analysing past data, we can do some descriptive analytics on the sales. We see a general upward trend in sales, but also large seasonal fluctuations and a considerable amount of noise.
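
To see what an additive decomposition does conceptually, the following hand-rolled sketch splits a synthetic series into trend, seasonal and residual parts using only pandas. It is a simplified illustration under stated assumptions (an odd period, a noise-free made-up series); the statsmodels implementation may differ in details such as how it treats even periods and the edges of the series.

```python
import numpy as np
import pandas as pd

# Synthetic daily series (assumption): linear trend + weekly pattern, no noise
n, period = 70, 7
t = np.arange(n)
weekly = np.array([3.0, 1.0, 0.0, -1.0, -2.0, -2.0, 1.0])  # sums to zero
series = pd.Series(0.5 * t + weekly[t % period])

# 1. Trend: centred moving average over one full cycle
#    (the first and last 3 points have no complete window, so they are NaN)
trend = series.rolling(window=period, center=True).mean()

# 2. Seasonal: average detrended value at each position in the cycle,
#    shifted so that the seasonal effects sum to zero
detrended = series - trend
seasonal_means = detrended.groupby(t % period).mean()
seasonal_means = seasonal_means - seasonal_means.mean()
seasonal = pd.Series(seasonal_means.loc[t % period].to_numpy(), index=series.index)

# 3. Residual: whatever the trend and seasonal parts do not explain
residual = series - trend - seasonal

print(seasonal.head(period).to_numpy())
```

Because the synthetic series contains no noise, the recovered seasonal values match the weekly pattern that was put in and the residual is zero wherever the trend is defined. On real sales data the residual would hold the noise.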

#### Separate Data into Training set and Test set

We are going to forecast using our daily sales dataset. The first thing we need to know is the size of this dataset. Since this is a forecasting task, we will use the majority of our data to train our model and reserve the last 3 months to test it.

In [None]:
# 1. SEPARATE OUR DATASET:
total_sales = len(data)

# we will split our data by selecting the last 3 months for prediction and the 
# remaining data for training
data_split = 90 # 90 days corresponds to roughly three months

# Allocate the data for training
train = data.iloc[0: ???]

# Put the remaining data of our dataset for testing
expected = data.iloc[???: ]

print('Total datapoints in the dataset: ' + str( total_sales )) 
print('Datapoints reserved for training: ' + str(len(train)))
print('Datapoints reserved for testing: ' + str(len(expected)))

#### Define the Learning Algorithm

Next, we define the type of learning algorithm that we want to apply. Python's statistical libraries offer a wide range of forecasting algorithms that you can explore. We will use Seasonal ARIMA, as implemented by the SARIMAX class in statsmodels. Like linear regression, it is a linear model: it predicts each value as a linear combination of past values and past forecast errors.

<img src="https://josef-pkt.github.io/pages/slides/images/airpassenger_forecast.png" />

The SARIMAX model is parametrized by two sets of integers: the non-seasonal order (p, d, q) and the seasonal order (P, D, Q, s).

* p (AR parameters) is the auto-regressive part of the model. It allows us to incorporate the effect of past values into our model. Intuitively, this would be similar to stating that it is likely to be warm tomorrow if it has been warm the past 3 days.

* d (differences) is the integrated part of the model. It is the number of times the series is differenced (each value replaced by its change from the previous value) to remove trends before fitting. Intuitively, this would be similar to stating that it is likely to be the same temperature tomorrow if the difference in temperature over the last three days has been very small.

* q (MA parameters) is the moving average part of the model. This allows us to set the error of our model as a linear combination of the error values observed at previous time points in the past.

* s is the periodicity of the time series, i.e. the number of observations in one seasonal cycle (4 for quarterly data, 12 for monthly data with a yearly cycle, etc.).

Here, (p, d, q) are the non-seasonal parameters described above, while (P, D, Q) follow the same definition but are applied to the seasonal component of the time series.
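
The differencing controlled by d and D can be tried out directly with pandas. The sketch below uses a short made-up series and a hypothetical seasonal period of 4; it only shows what differencing does to the data and does not fit any model.

```python
import pandas as pd

# Hypothetical series that rises by 2 at every step (made-up numbers)
y = pd.Series([10, 12, 14, 16, 18, 20, 22, 24])

# d = 1: difference once, y[t] - y[t-1]; a linear trend becomes a constant
first_diff = y.diff().dropna()

# d = 2: difference the differenced series again; the constant becomes zero
second_diff = y.diff().diff().dropna()

# Seasonal differencing (D = 1 with s = 4): y[t] - y[t-4]
seasonal_diff = y.diff(4).dropna()

print(first_diff.tolist())
print(second_diff.tolist())
print(seasonal_diff.tolist())
```

Each round of differencing shortens the series and removes one level of trend, so d and D are usually kept small.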

In [None]:
# 2. DEFINE THE LEARNING ALGORITHM
# In this case we will use the SARIMAX forecasting model from statsmodels
# If you are curious, you can find more information here (including information about params):
# http://people.duke.edu/~rnau/411home.htm
# We define the model by instantiating a new SARIMAX class with the following arguments:
# training set
# order: contains a set of 3 parameters that will help to fit the data (p, d, q). These parameters
#        need to be manually chosen and are usually found by conducting several trials
# seasonal_order: contains a set of 4 parameters that will help to fit the data (P, D, Q, s). These parameters
#                 need to be manually chosen and are usually found by conducting several trials
# enforce_stationarity: Whether or not to transform the AR parameters to enforce stationarity in the autoregressive component of the model
# enforce_invertibility: Whether or not to transform the MA parameters to enforce invertibility in the moving average component of the model 
model_sarimax = SARIMAX(train, order=(4,2,4), 
                                        seasonal_order=(5,1,5,5),
                                        enforce_stationarity=False,
                                        enforce_invertibility=True)
model_sarimax

#### Fit data to model
After defining our learning model and plugging in the right learning parameters, we need to fit the model to our data.

In [None]:
# 3. FIT THE DATA
model_fit = model_sarimax.fit()

# You can see the general results of this model, including:
# coef: the coefficients that were learned from the data
# std err: the standard error of each coefficient estimate
# plus some extra statistical evaluation and confidence intervals
print( model_fit.summary())

#### Predict Results (Forecast Data)
Let's use SARIMAX to try to estimate the last 3 months of our dataset. This way we can see how well SARIMAX performed, since we have the true sales of the store for that time period

In [None]:
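# 4. MAKE PREDICTIONS
# predict() treats both start and end as inclusive positions, so the last
# test point sits at position total_sales - 1
start = len(train)
end = total_sales - 1

# Work on a copy so that adding a column does not write into a slice of data
expected = expected.copy()
expected['forecast_SARIMAX'] = model_fit.predict(start = start, end = end, dynamic= True)
fig = expected[['sales','forecast_SARIMAX']].plot(figsize=(12, 8))

# Apply the seasonal decomposition method to our sales dataset for a monthly frequency
# to determine the trend (model must be 'additive' or 'multiplicative')
decomposed_sales_sarimax = sm.tsa.seasonal_decompose(expected['sales'], model='additive', period=30)
decomposed_sales_sarimax.trend.plot()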

#### Evaluate the Forecast
To validate the model numerically, rather than just graphically, we can use an error metric called *Mean Squared Error*.

In [None]:
# Compute the error:
error_sarimax = mean_squared_error( expected['sales'], expected['forecast_SARIMAX'])
print('SARIMAX model Mean Squared Error: ' + str(error_sarimax))
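
For reference, the Mean Squared Error is simply the average of the squared differences between actual and forecast values. The sketch below computes it by hand with NumPy on made-up numbers, together with its square root (RMSE), which is expressed in the same units as the sales.

```python
import numpy as np

# Hypothetical actual and forecast values (made-up numbers)
actual = np.array([100.0, 110.0, 120.0])
forecast = np.array([98.0, 113.0, 120.0])

# Squared errors are 4, 9 and 0, so the mean is 13 / 3
mse = np.mean((actual - forecast) ** 2)

# RMSE brings the error back to the original units
rmse = np.sqrt(mse)

print(mse, rmse)
```

Because errors are squared, a few large misses weigh much more heavily than many small ones.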


### Try it yourself

Try changing the parameters that we used to train the model. Can you obtain a forecast with an error smaller than 337?

### Forecasting
Assuming this was the best model we could obtain with a given set of parameters, we can now train it on the entire dataset and forecast future sales:

In [None]:
# Define Learning Model
model = SARIMAX(data['sales'], order=(4,2,4), 
                                        seasonal_order=(5,1,5,5),
                                        enforce_stationarity=False,
                                        enforce_invertibility=True)
model

In [None]:
# Fit Data
model_fit = model.fit()


In [None]:
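# Make predictions for 3 months (90 days)
# start and end are inclusive, so this yields exactly 90 forecast points
start = len(data)
end = total_sales + 89

predictions = model_fit.predict(start = start, end = end, dynamic= True)

# Series.append was removed in pandas 2.0; pd.concat is the replacement
full_data = pd.concat([data['sales'], predictions])
full_data.plot( figsize=(12, 8) )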

In [None]:
# check the last year for better visualisation
last_year_train = full_data.iloc[0: len(full_data) - 305]
last_year_pred = full_data.iloc[len(full_data) - 305:]
last_year_train.plot( figsize=(12, 8) )
last_year_pred.plot( figsize=(12, 8) )

## Further Information

[Top 10 Analytics and BI Software Vendors and Market Forecast 2016-2021](https://www.appsruntheworld.com/top-10-analytics-and-bi-software-vendors-and-market-forecast/)


* [SAP](https://www.sap.com/products/analytics/business-intelligence-bi.html)
* [SAS](https://www.sas.com/en_au/solutions/business-intelligence.html)
* [IBM](https://www.ibm.com/au-en/marketplace/business-intelligence)
* [Oracle](https://www.oracle.com/solutions/business-analytics/business-intelligence/index.html)
* [Tableau](https://www.tableau.com)
* [Microsoft](https://powerbi.microsoft.com/en-us/)
* [Qlik](https://www.qlik.com/us/)

### The Big Data Connection

* Hadoop - HDFS and MapReduce - [Big Data, Hadoop, and Spark: An Explanation for the Rest of Us (part 1)](https://community.alteryx.com/t5/Engine-Works-Blog/Big-Data-Hadoop-and-Spark-An-Explanation-for-the-Rest-of-Us-Part/ba-p/2796)
* Apache Spark - [Big Data, Hadoop, and Spark: An Explanation for the Rest of Us (part 2)](https://community.alteryx.com/t5/Engine-Works-Blog/Big-Data-Hadoop-and-Spark-An-Explanation-for-the-Rest-of-Us-Part/ba-p/16560)



