# Introduction

I'm sure you're well aware of the value of accurate forecasts, but producing them isn't easy. In this document I'll outline two basic univariate time series forecasting methods in plain language, assuming you have a basic knowledge of statistics and Python.

**Time series data definition**: Data collected on the same metrics or same objects at regular time intervals. It could be stock market records or sales records.

**Univariate Time Series Forecasting**: Only using the previous values in a time series to predict future values (not using any outside variables).

# Data Handling

### Importing Packages

In [None]:
import numpy as np, pandas as pd, seaborn as sns
import matplotlib.pyplot as plt
from statsmodels.tsa.stattools import adfuller
from pandas import Series
import datetime

### Reading in Data

In [None]:
item_cats = pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv')
items = pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')
sales_train = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
shops = pd.read_csv('../input/competitive-data-science-predict-future-sales/shops.csv')
test_df = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv')

### Inspecting the data

In [None]:
sales_train.head()

The date column is stored as text, so we need to convert it to a datetime type:

In [None]:
sales_train.dtypes

In [None]:
sales_train.date = sales_train.date.apply(lambda x: datetime.datetime.strptime(x, '%d.%m.%Y'))
sales_train.dtypes
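
The format string `'%d.%m.%Y'` tells the parser that each date is written day first, then month, then a four-digit year. Here is a minimal sketch on a couple of made-up date strings (not rows from the dataset), using `pd.to_datetime`, which parses a whole column in one call:

```python
import pandas as pd

# Made-up date strings in a day.month.year layout
raw_dates = pd.Series(['02.01.2013', '15.10.2015'])

# format='%d.%m.%Y' reads the first field as the day, so '02.01.2013' is 2 January
parsed = pd.to_datetime(raw_dates, format='%d.%m.%Y')

print(parsed.dt.month.tolist())
print(parsed.dtype)
```

Getting the day and month the wrong way round would silently move sales into the wrong month, so it is worth checking a few values by eye.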

Let's take a deeper look at our sales dataframe:

In [None]:
from IPython.display import display
display(sales_train.head()) # show first 5 rows
display(sales_train.shape) # number of rows, columns
display(sales_train.isnull().any()) # Whether each column contains any empty values
display(sales_train.describe()) # Summary statistics

## Data Exploration

In [None]:
"""
In this cell we are having a look at the total sales for the company 1C.
It appears as though there is a downward trend and seasonality.
"""
ts = sales_train.groupby(['date_block_num'])['item_cnt_day'].sum()
ts = ts.astype(float) # astype returns a copy, so assign it back

rolling_mean = ts.rolling(window = 12).mean() # rolling average of 12 months
rolling_std = ts.rolling(window = 12).std() # rolling std of 12 months

plt.figure(figsize=(16,8))
plt.title('Total Sales of 1C')
plt.xlabel('Month')
plt.ylabel('Units Sold')
plt.plot(ts, color = 'blue', label = 'Sales')
plt.plot(rolling_mean, color = 'red', label = 'Rolling Mean')
plt.plot(rolling_std, color = 'black', label = 'Rolling Std')
plt.legend(loc = 'best')
plt.show()
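
A rolling window of 12 only produces a value once 12 months of data are available, which is why the rolling mean and standard deviation lines start part-way into the plot. A small sketch of the same behaviour on toy numbers, with a window of 3 chosen purely for illustration:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # toy data, not the sales series

# A window of 3 needs 3 observations, so the first 2 results are NaN
rolling_mean_demo = values.rolling(window=3).mean()
print(rolling_mean_demo.tolist())
```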

# Time Series Analysis

## Stationarity

**definition**: The statistical properties of a stationary time series do not change over time, i.e. the relationship between 2 points in the series depends only on how far apart they are & not on where in time they sit.

Essentially, the mean, variance, and autocovariance should remain constant over time. If the data has a trend, it isn't stationary.

The reason it's important, without going into the math, is that many models assume the data they are fitted to is stationary.
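
To make the idea concrete, here is a minimal sketch on synthetic data (not the sales series). It compares the mean of the first and second half of a white-noise series with that of a series that has a trend. The series, the seed and the helper `half_means` are made up for illustration.

```python
import numpy as np

rng_demo = np.random.default_rng(0)
n = 2000
noise = rng_demo.normal(loc=0.0, scale=1.0, size=n)  # stationary: constant mean and variance
trended = 0.1 * np.arange(n) + noise                 # non-stationary: the mean drifts upwards

def half_means(x):
    """Return the mean of the first and second half of a series."""
    mid = len(x) // 2
    return x[:mid].mean(), x[mid:].mean()

noise_first, noise_second = half_means(noise)
trend_first, trend_second = half_means(trended)

print('White noise means: %.2f vs %.2f' % (noise_first, noise_second))
print('Trended means:     %.2f vs %.2f' % (trend_first, trend_second))
```

The two halves of the white-noise series have almost the same mean, while the halves of the trended series are far apart. That is the kind of change over time a stationarity test is looking for.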

You can test for stationarity with the following tests:
* Augmented Dickey-Fuller (ADF)
* KPSS
* Phillips-Perron (PP)

For our data I will be performing an ADF test.

In [None]:
"""
In this cell we perform the ADF test to check for stationarity. The
ADF tests the null hypothesis that a unit root is present in the
time series (i.e. that it is non-stationary). If the p-value is less
than 5%, you can reject the null hypothesis and treat the data as stationary.
"""

def adf_test(ts):
    print('ADF test results:')
    adf = adfuller(ts, autolag  = 'AIC')
    adf_out = pd.Series(adf[0:4], index=['Test Statistic',
                                        'p-value','#Lags Used',
                                        'Number of Observations Used'])
    for key, val in adf[4].items():
        adf_out['Critical Value (%s)' %key] = val
    print(adf_out)
    
adf_test(ts)

The p-value is 14.3%, which is above 5%, so we can't reject the null hypothesis and therefore can't assume stationarity.

## Differencing

**definition**: Differencing is a transformation of a time series, taking the difference between consecutive terms in a series. It can be used to remove time dependency and stabilise the mean, reducing trends and seasonality.
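
As a small illustration on made-up numbers, pandas' built-in `Series.diff` performs the same subtraction. The interval of 2 is chosen only to keep the example short; a 12-month seasonal difference works the same way with `diff(12)`.

```python
import pandas as pd

toy = pd.Series([3, 5, 8, 12, 17])  # made-up values, not the sales data

# First-order differencing: each value minus the one before it
first_diff = toy.diff().dropna()

# Differencing at a longer interval (here 2) compares each value with the one 2 steps back
lag2_diff = toy.diff(2).dropna()

print(first_diff.tolist())
print(lag2_diff.tolist())
```

Each differenced series is shorter than the original by the size of the interval, because the earliest values have nothing to be compared with.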



In [None]:
def difference(df, interval=1):
    diff = [] # Create empty list
    for i in range(interval, len(df)): # Iterate over every position that has a value `interval` steps before it
        val = df.iloc[i] - df.iloc[i - interval] # Take the difference between terms `interval` steps apart
        diff.append(val) # Add the new values to the end of the list
    return Series(diff) # Return the differenced values as a time series

In [None]:
"""
Below the original time series is plotted, the same as the plot above.
"""
ts = ts.astype(float) # astype returns a copy, so assign it back
plt.figure(figsize=(16,16))
plt.subplot(311)
plt.title('Original')
plt.xlabel('Month')
plt.ylabel('Units Sold')
plt.plot(ts) # Plot the original time series

"""
Below the new differenced time series is plotted.
"""
new_ts = difference(ts) # difference the time series
plt.subplot(312)
plt.title('Post-differencing')
plt.xlabel('Month')
plt.ylabel('Units Sold')
plt.plot(new_ts)

"""
Below the time series is de-seasonalised (assuming the seasonality
is 12 months long)
"""
ds_ts = difference(ts, interval = 12)
plt.subplot(313)
plt.title('After De-seasonalising')
plt.xlabel('Month')
plt.ylabel('Units Sold')
plt.plot(ds_ts)
plt.tight_layout() # Stop the subplot titles and axis labels overlapping
plt.show()

Let's test the differenced and deseasonalised series:

In [None]:
print('Differenced')
adf_test(new_ts)

print('\nDeseasonalised')
adf_test(ds_ts)

The p-value of the ADF test on the deseasonalised data is below 5%, so we can reject the null hypothesis and treat the deseasonalised series as stationary.

### Considerations
You have to be careful not to over-difference the time series. An over-differenced series may still be stationary, but the unnecessary differencing will affect the model parameters (settings).

You should aim to use the minimum necessary differences to achieve stationarity.

**How do you know if a time series is over-differenced?** Optimally, the Autocorrelation Function (ACF) plot should reach 0 quickly, as the deseasonalised ACF plot further down does. If the first lag (the second pole on the ACF plot, since the first pole is lag 0) is too far in the negative, then the series is probably over-differenced.

Ok so let's have a look at an over-differenced series:

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(difference(difference(ts)));
plt.title('2nd Order Differencing ACF')
plt.show()

As previously described, the first lag goes far into the negative, suggesting that it is over-differenced.
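
The same effect can be reproduced on synthetic data. In this minimal sketch, white noise (which needs no differencing) is differenced anyway and the lag-1 autocorrelation is compared before and after. The helper `lag1_autocorr` and the data are made up for illustration. A commonly quoted rule of thumb treats a lag-1 autocorrelation of around -0.5 or lower as a warning sign.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return np.sum(centred[1:] * centred[:-1]) / np.sum(centred ** 2)

rng_demo = np.random.default_rng(1)
white_noise = rng_demo.normal(size=5000)  # already stationary, needs no differencing
over_diffed = np.diff(white_noise)        # differencing it anyway

print('Lag-1 autocorrelation before: %.2f' % lag1_autocorr(white_noise))
print('Lag-1 autocorrelation after:  %.2f' % lag1_autocorr(over_diffed))
```

Before differencing the value sits near 0; afterwards it sits near -0.5, which is the theoretical value for differenced white noise.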

The deseasonalised series is a much better series to work on:

In [None]:
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(ds_ts);
plt.title('Deseasonalised ACF')
plt.show()

**Autocorrelation:** autocorrelation summarises the strength of the relationship between an observation in a time series and observations at previous time steps.

In simpler terms: correlation is the strength of a relationship between 2 variables (-1 -> 1). Because the correlation of the time series observations is calculated with values of the same series at prior time steps, it is called a serial correlation or *autocorrelation*.

**How to read the above graph:** The ACF plot shows the lag value along the x-axis & the correlation on the y-axis (between -1 and 1). By default the plot_acf function draws a 95% confidence interval cone in light blue, suggesting that values outside of this cone are likely a real correlation and not a statistical fluke.
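
To show what goes into each pole of the plot, here is a minimal hand-rolled sketch on a synthetic series with a 12-step season. The helper `autocorr` and the simple band of 1.96/sqrt(n) are assumptions made for illustration; the cone drawn by `plot_acf` may be computed differently.

```python
import numpy as np

def autocorr(x, lag):
    """Sample autocorrelation of x at the given lag (lag >= 1)."""
    x = np.asarray(x, dtype=float)
    centred = x - x.mean()
    return np.sum(centred[lag:] * centred[:-lag]) / np.sum(centred ** 2)

months = np.arange(120)
seasonal = np.sin(2 * np.pi * months / 12)  # synthetic series with a 12-step season

# Approximate 95% band for a series with no autocorrelation
band = 1.96 / np.sqrt(len(seasonal))

for lag in (6, 12):
    value = autocorr(seasonal, lag)
    print('lag %2d: %+.2f (outside band: %s)' % (lag, value, abs(value) > band))
```

Half a season apart the values move in opposite directions (strongly negative correlation), and a full season apart they move together (strongly positive correlation).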

# SARIMA Modeling

Now that the time series is differenced, we can move on to building our models.

Seasonal AutoRegressive Integrated Moving Average (SARIMA) modeling is a long-established statistical approach that combines an autoregressive (AR) model, a moving average (MA) model, differencing and a seasonal component.
* AR: Assumes that the next value in the series is a function of the previous n values.
* MA: Assumes that the next value in the series is a function of the forecast errors made at the previous n time steps.

Pros:
* Very effective; remains close to cutting edge performance
* Simple to implement and not computationally intensive

Cons:
* Not very intuitive
* No way to build in our understanding about how our data works:
    * random walk element
    * external regressors

## How does the SARIMA model work?
There are 3 important terms in ARIMA models (p, d & q), and SARIMA adds a seasonal component (s):
* **p** is the order of the AR term
* **q** is the order of the MA term
* **d** is the number of times differencing is required to make the time series stationary
* **s**, the seasonal component, is made up of:
    * P - The seasonal autoregressive order
    * D - The seasonal difference order
    * Q - The Seasonal moving average order
    * m - The number of time steps in a single seasonal period

**What do these terms mean?**
The AR part in ARIMA is a linear regression model that uses its own lags (previous time steps) as predictors. Linear regression works best when the predictors are not correlated with each other, which is why the time series needs to be made stationary first.

A common and effective way to make a time series stationary is to difference it (subtract the previous value from the current value). Depending on how complex the series is, you may need to difference more than once. **d** is the minimum number of differences needed for the data to be stationary, so if the series is already stationary, d = 0.

**p** is the order of the AR term and refers to the number of lags (time steps) of Y, the variable you're trying to forecast, to be used as predictors.

**q** is the order of the MA term and refers to the number of lagged forecast errors that should go into the ARIMA model.

An ARIMA model is one where the time series is differenced at least once to make it stationary, and the AR and MA terms are then combined.

predicted Yt = Constant + linear combination of lags of Y (up to p lags) + linear combination of lagged forecast errors (up to q lags)

[source: https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/ ]
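
The equation above can be written out as a few lines of plain Python. This is only a sketch of the arithmetic for a single prediction: the function name `arma_one_step`, the coefficients and the input values are all hypothetical, and a real model estimates the coefficients from the data.

```python
def arma_one_step(constant, ar_coefs, ma_coefs, past_values, past_errors):
    """One-step-ahead prediction from lagged values and lagged forecast errors.

    past_values and past_errors are ordered most recent first, so
    past_values[0] is Y(t-1) and past_errors[0] is the error at t-1.
    """
    ar_part = sum(c * y for c, y in zip(ar_coefs, past_values))
    ma_part = sum(c * e for c, e in zip(ma_coefs, past_errors))
    return constant + ar_part + ma_part

# Hypothetical p = 1, q = 1 model: the last value was 100 and the last forecast missed by 5
prediction = arma_one_step(constant=10, ar_coefs=[0.5], ma_coefs=[0.2],
                           past_values=[100], past_errors=[5])
print(prediction)
```

With p = 1 and q = 1 the prediction is the constant, plus half of the last value, plus a fifth of the last error.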

### Estimating the differencing term (d)

It is possible to use packages to estimate the number of differences required. The pmdarima function `ndiffs()` runs a stationarity test (ADF, KPSS or PP) at increasing levels of d and estimates the number of differences required to make the time series stationary. As seen by the results below, it doesn't always work: the three tests disagree (1, 2 and 0 respectively), and `ndiffs()` only considers ordinary differencing, whereas the tests above pointed to a seasonal difference.

In [None]:
'''
from pmdarima.arima.utils import ndiffs, nsdiffs

# Normal Differencing:

# ADF test
d_adf = ndiffs(ts, test='adf') # = 1

# KPSS test
d_kpss = ndiffs(ts, test='kpss') # = 2

# PP test
d_pp = ndiffs(ts, test='pp') # = 0

print('Difference Estimations:\nADF:%s KPSS:%s PP:%s' % (d_adf,d_kpss,d_pp))
'''

## Finding the AR term (p)
We find p by analysing the Partial AutoCorrelation Function (PACF) plot.

**PACF explanation:** Autocorrelation between an observation & another observation at a prior time step is made up of both the direct correlation & indirect correlations. The indirect correlations are a linear function of the correlation of the observation with observations at intervening time steps.

It is these indirect correlations that the PACF seeks to remove. The correlation between point Y0 and Y1 will have some inertia and affect points later on.

In short, the PACF conveys the pure correlation between an observation and a given lag of the series, with the effect of the intervening lags removed. That way you will know if that lag is needed in the AR term or not.

**How do we find p?** Any autocorrelation in a stationary time series can be corrected by adding enough AR terms. So we initially take the order of the AR term to be equal to the number of lags that cross the significance limit in the PACF plot.
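
As a rough sketch of that rule, the partial autocorrelation at lag k can be estimated as the last coefficient when the series is regressed on its previous k values. The code below does this with NumPy on a simulated AR(1) series. The helper `pacf_at_lag`, the simulated data and the simple limit of 1.96/sqrt(n) are assumptions for illustration, not the method used by `plot_pacf`.

```python
import numpy as np

def pacf_at_lag(x, k):
    """Partial autocorrelation at lag k: the last coefficient when the
    series is regressed on its previous k values."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    target = x[k:]
    # Column j holds the series shifted back by j steps (j = 1..k)
    lags = np.column_stack([x[k - j:len(x) - j] for j in range(1, k + 1)])
    coefs, *_ = np.linalg.lstsq(lags, target, rcond=None)
    return coefs[-1]

# Simulate an AR(1) series: each value is 0.7 times the previous one plus noise
rng_demo = np.random.default_rng(2)
n = 3000
noise = rng_demo.normal(size=n)
series = np.zeros(n)
for t in range(1, n):
    series[t] = 0.7 * series[t - 1] + noise[t]

limit = 1.96 / np.sqrt(n)  # approximate 95% significance limit
significant = [k for k in range(1, 6) if abs(pacf_at_lag(series, k)) > limit]
print('Lags crossing the limit:', significant)
```

For an AR(1) series only lag 1 should stand out clearly, pointing to p = 1. A higher lag can still cross the limit by chance now and then, which is part of why reading these plots takes some judgement.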

Time series analysis is a bit of an art, and there isn't a set methodology that you have to follow. Many people analyse the ACF and PACF plots to find certain patterns that may give away the right order. It is also possible to search for the order systematically, although that is rather computationally intensive. If you'd like to read more about that, there's a fantastic article on the topic [here](https://www.machinelearningplus.com/time-series/arima-model-time-series-forecasting-python/)

In [None]:
'''
Looping over possible values of p and q and measuring their AIC.

AIC (Akaike Information Criterion) scores how well a model fits the
data while penalising the number of parameters. Lower is better.
'''
import statsmodels.api as sm
import warnings

rng = range(5)
best_aic = np.inf
best_model = None
best_order = None

warnings.filterwarnings('ignore')

for p in rng:
    for q in rng:
        temp_model = sm.tsa.statespace.SARIMAX(ds_ts, order = (p, 0, q)) # Try out different vals of p & q
        results = temp_model.fit()
        temp_aic = results.aic
        if temp_aic < best_aic: # If model outperforms prev attempts, save the order
            best_aic = temp_aic
            best_order = (p, 0, q)
            best_model = temp_model

print('Best AIC: %s | Best order: %s' % (best_aic, best_order))

warnings.resetwarnings() # Undo the 'ignore' filter set above
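
To show how AIC trades fit against complexity, here is a small NumPy-only sketch. It fits AR models of increasing order by least squares to a simulated AR(2) series and scores each with AIC = 2k - 2ln(L), which for Gaussian errors reduces to n·ln(RSS/n) + 2k plus a constant. The helpers `gaussian_aic` and `ar_design`, the coefficients and the data are assumptions for illustration; this is not how statsmodels fits SARIMAX.

```python
import numpy as np

def gaussian_aic(rss, n_obs, n_params):
    """AIC for a least-squares fit with Gaussian errors, up to a constant
    that is the same for every model fitted to the same data."""
    return n_obs * np.log(rss / n_obs) + 2 * n_params

def ar_design(x, p, max_p):
    """Target and lag matrix for an AR(p) regression, always dropping the
    first max_p points so every order is scored on the same observations."""
    target = x[max_p:]
    lags = np.column_stack([x[max_p - j:len(x) - j] for j in range(1, p + 1)])
    return target, lags

# Simulate an AR(2) series (coefficients chosen arbitrarily for the demo)
rng_demo = np.random.default_rng(3)
n = 2000
noise = rng_demo.normal(size=n)
series = np.zeros(n)
for t in range(2, n):
    series[t] = 0.6 * series[t - 1] - 0.5 * series[t - 2] + noise[t]

max_p = 4
aic_by_order = {}
for p in range(1, max_p + 1):
    target, lags = ar_design(series, p, max_p)
    coefs, *_ = np.linalg.lstsq(lags, target, rcond=None)
    rss = np.sum((target - lags @ coefs) ** 2)
    aic_by_order[p] = gaussian_aic(rss, len(target), p)

for p, aic in aic_by_order.items():
    print('AR(%d): AIC = %.1f' % (p, aic))
```

Moving from order 1 to order 2 should lower the AIC sharply because the second lag is real. Beyond that the fit barely improves, while the penalty grows by 2 for each extra parameter.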

In [None]:
"""
So in the above code cell we determined that p & q were best set at 1.
Earlier on with the ADF test we found that we needed to perform a seasonal difference with the interval set to 12.

We supplied the SARIMAX function with 3 parameters here: order, trend and seasonal order.
* The order parameter is just a copy of the results above.
* I chose the trend through trial and error; setting it to 't' (a linear time trend) gave me the best results.
* The seasonal order is (P,D,Q,m) where m is the number of time steps in a season, 12 in our case. We set D to 1 because
  we only need 1 seasonal difference. We could also supply seasonal P & Q, but it's important not to make the model
  too complex and cause overfitting.
"""
sarima_model = sm.tsa.statespace.SARIMAX(ts, order = (1,0,1),trend = 't', seasonal_order=(0,1,0,12))
results = sarima_model.fit()
print(results.aic)

The best practice is to split the data into a training and testing set prior to fitting the model to validate its accuracy; however, I do want to keep this brief.
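
For reference, a chronological hold-out split could look roughly like the sketch below. It uses a synthetic 34-month series and a seasonal naive forecast (repeat the value from 12 months earlier) as a stand-in for the fitted model, so it runs without statsmodels. The series, the 12-month hold-out and the choice of mean absolute error are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic 34-month series: a downward trend plus a 12-month season (not the real sales data)
months = np.arange(34)
series = pd.Series(100.0 - months + 10.0 * np.sin(2 * np.pi * months / 12))

# Split by time, never at random: the last 12 months are held out
train, test = series.iloc[:-12], series.iloc[-12:]

# Stand-in forecast: repeat the value from 12 months earlier (seasonal naive)
forecast = train.iloc[-12:].to_numpy()

mae = np.mean(np.abs(test.to_numpy() - forecast))
print('Hold-out MAE: %.1f' % mae)
```

In practice the model would be fitted on `train` only and its predictions for the held-out months compared with `test` in the same way.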

## Forecasting Sales for 1C

In [None]:
'''
We'll predict from the last observed month (index 33) up to month 46,
just over a year into the future.
'''
from statsmodels.tsa.statespace.sarimax import SARIMAXResults


preds = results.predict(start = 33, end = 46)


ax = ts.plot(label = 'Observed')
preds.plot(ax = ax, label = 'SARIMA forecast')
plt.legend()
plt.title('1C Sales')
ax.set_xlabel('Month')
ax.set_ylabel('Units Sold')
plt.show()

# Prophet Forecasting

In February 2017 Facebook's Data Science team open sourced their forecasting library "Prophet". It's a highly optimised package to quickly perform forecasting on non-stationary data.

In [None]:
'''
Before forecasting we need to add the dates back into the time-series
'''
ts.index = pd.date_range(start = '2013-01-01', 
                         end = '2015-10-01', 
                         freq = 'MS')
ts = ts.reset_index()
ts.head()
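
Assigning a new index like this only works if the date range has exactly as many entries as the series has rows. A quick sketch for checking what the range above contains:

```python
import pandas as pd

# 'MS' = month start, so every entry is the first day of a month
month_starts = pd.date_range(start='2013-01-01', end='2015-10-01', freq='MS')

print(len(month_starts))
print(month_starts[0].date(), month_starts[-1].date())
```

If the lengths did not match, pandas would raise an error when setting the index.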

In [None]:
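try:
    from prophet import Prophet # Package name from version 1.0 onwards
except ImportError:
    from fbprophet import Prophet # Package name in older releases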

# Prophet requires you to name your columns the following:
ts.columns = ['ds','y']
prophet_model = Prophet(yearly_seasonality = True) # As determined in stationarity testing
prophet_model.fit(ts)

# We'll predict 12 months into the future
# 'MS' = month start
future = prophet_model.make_future_dataframe(periods = 12, freq = 'MS')
forecast = prophet_model.predict(future)
forecast.head()

In [None]:
prophet_model.plot(forecast);
plt.title('1C Sales - Prophet Forecast')
plt.xlabel('Date')
plt.ylabel('Units Sold')
plt.show()

In [None]:
prophet_model.plot_components(forecast)

In [None]:
ts = sales_train.groupby(['date_block_num'])['item_cnt_day'].sum()
ax = ts.plot(label = 'Observed')
preds.plot(ax = ax, label = 'SARIMA forecast', alpha = 0.9, linestyle = '-')
forecast.yhat[33:46].plot(ax = ax, label = 'Prophet forecast', alpha = 0.9, linestyle = '--')

plt.legend()
plt.title('1C Sales')
ax.set_xlabel('Month')
ax.set_ylabel('Units Sold')
plt.show()

# Finishing up

Judging by eye from the plot above, SARIMA seems to do a better job of generalising and appears to be the simpler model, although Prophet is much easier to implement.

Let's use both models to predict revenue rather than units sold, as it's far more useful information.
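
Revenue here means units sold multiplied by the unit price, summed per month. A minimal sketch of that aggregation on a made-up four-row table with the same column names as `sales_train`:

```python
import pandas as pd

# Toy rows with the same column names as sales_train (values are made up)
toy_sales = pd.DataFrame({
    'date_block_num': [0, 0, 1, 1],
    'item_price':     [100.0, 250.0, 100.0, 50.0],
    'item_cnt_day':   [2, 1, 3, 4],
})

# Revenue per row is units sold times unit price
toy_sales['Rev'] = toy_sales['item_cnt_day'] * toy_sales['item_price']

# Summing per month block gives one revenue figure per month
monthly_rev = toy_sales.groupby('date_block_num')['Rev'].sum()
print(monthly_rev.tolist())
```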

In [None]:
"""
Sorry for the lack of annotations in this cell, almost all of it is copied and pasted from previous cells, aside from some 
pretty formatting.
"""
import matplotlib

fig=plt.figure(figsize=(12,8), dpi= 60, facecolor='w', edgecolor='k')

sales_train['Rev'] = sales_train['item_cnt_day'] * sales_train['item_price']
ts = sales_train.groupby(['date_block_num'])['Rev'].sum()
ax = ts.plot(label = 'Observed')

sarima_rev_model = sm.tsa.statespace.SARIMAX(ts, order = (1,0,1),trend = 't', seasonal_order=(0,1,0,12))
rev_results = sarima_rev_model.fit()
sarima_rev_preds = rev_results.predict(start = 33, end = 46)


ts.index = pd.date_range(start = '2013-01-01', 
                         end = '2015-10-01', 
                         freq = 'MS')
ts = ts.reset_index()
ts.columns = ['ds','y']
prophet_model = Prophet(yearly_seasonality = True) # As determined in stationarity testing
prophet_model.fit(ts)
future = prophet_model.make_future_dataframe(periods = 12, freq = 'MS')
forecast = prophet_model.predict(future)

forecast.yhat[33:46].plot(ax = ax, label = 'Prophet forecast', alpha = 0.9, linestyle = '--')
sarima_rev_preds.plot(ax = ax, label = 'SARIMA forecast', alpha = 0.9, linestyle = '-')


plt.ticklabel_format(style = 'plain')
ax.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))

plt.legend()
plt.title('1C Revenue')
ax.set_xlabel('Month')
ax.set_ylabel('Revenue ($)')
plt.show()

In [None]:
# Verifying the length of the dataset
print(len(sales_train))


# Roughly 3 million records
