# 1. Introduction

When working with time series, the most common question is what will happen to our indicators over the next day, week, month, and so on: how many users will be online, how many actions they will perform, etc. The forecasting problem can be approached in different ways, depending on the forecast quality required, the horizon we want to forecast over, and, of course, how long we can spend selecting and tuning model parameters to obtain it.

> A time series is a sequence of values describing a process proceeding in time, measured at consecutive moments of time, usually at regular intervals.

Thus, the data are ordered relative to nonrandom points in time, and, therefore, unlike random samples, they may contain additional information that we will try to extract.

In general, tasks related to time series can be divided into several groups:
- Forecasting - when we want to know what will happen next
- Anomaly detection - when we want to understand where there were problems in the past
- Clustering and classification - when the time series themselves are features of objects

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

from dateutil.relativedelta import relativedelta
from scipy.optimize import minimize

import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs

from itertools import product
from tqdm import tqdm_notebook

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

## Forecast Quality Metrics

Let us consider the main and most widespread metrics for the quality of forecasts, which, by and large, are metrics for the regression problem and are used not only in time series.

- [R squared](http://scikit-learn.org/stable/modules/model_evaluation.html#r2-score-the-coefficient-of-determination), $(-\infty, 1]$

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$ 

```python
sklearn.metrics.r2_score
```
---
- [Mean Absolute Error](http://scikit-learn.org/stable/modules/model_evaluation.html#mean-absolute-error), $[0, +\infty)$

$MAE = \frac{\sum\limits_{i=1}^{n} |y_i - \hat{y}_i|}{n}$ 

```python
sklearn.metrics.mean_absolute_error
```
---
- [Median Absolute Error](http://scikit-learn.org/stable/modules/model_evaluation.html#median-absolute-error), $[0, +\infty)$

$MedAE = median(|y_1 - \hat{y}_1|, ... , |y_n - \hat{y}_n|)$

```python
sklearn.metrics.median_absolute_error
```
---
- [Mean Squared Error](http://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-error),  $[0, +\infty)$

$MSE = \frac{1}{n}\sum\limits_{i=1}^{n} (y_i - \hat{y}_i)^2$

```python
sklearn.metrics.mean_squared_error
```
---
- [Mean Squared Logarithmic Error](http://scikit-learn.org/stable/modules/model_evaluation.html#mean-squared-logarithmic-error),  $[0, +\infty)$

$MSLE = \frac{1}{n}\sum\limits_{i=1}^{n} (log(1+y_i) - log(1+\hat{y}_i))^2$

```python
sklearn.metrics.mean_squared_log_error
```
---
- Mean Absolute Percentage Error, $[0, +\infty)$

$MAPE = \frac{100}{n}\sum\limits_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|$ 

```python
def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100
```
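
To make the formulas above concrete, here is a minimal sketch that evaluates each metric by hand with NumPy on a toy pair of arrays. The values in `y_true` and `y_pred` are made up for illustration; in practice the `sklearn.metrics` functions listed above would normally be used instead.

```python
import numpy as np

# Toy data, invented purely for illustration
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 360.0])

abs_err = np.abs(y_true - y_pred)  # absolute errors: 10, 10, 30, 40

mae = abs_err.mean()                                         # 22.5
medae = np.median(abs_err)                                   # 20.0
mse = np.mean((y_true - y_pred) ** 2)                        # 675.0
msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)   # log(1 + y) form
mape = np.mean(abs_err / np.abs(y_true)) * 100               # about 8.75 (percent)

ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2 = 1 - ss_res / ss_tot                        # about 0.946

print(mae, medae, mse, msle, mape, r2)
```

Note how the single large miss (40) pulls MSE up much more than MAE, while the median absolute error ignores it entirely.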

In [None]:
from sklearn.metrics import r2_score, median_absolute_error, mean_absolute_error
from sklearn.metrics import mean_squared_error, mean_squared_log_error

def mean_absolute_percentage_error(y_true, y_pred): 
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

Well, now we know how to measure the quality of our forecast, which metrics to use, and how to explain the results to the customer. All that is left is to build the forecast.

In [None]:
full_df = pd.read_csv('/kaggle/input/time-series-starter-dataset/Month_Value_1.csv', sep=',')

In [None]:
full_df.head()

In [None]:
df = full_df[['Revenue']].dropna()

# 2. We move, smooth and evaluate

We begin modeling with a naive assumption, “tomorrow will be like yesterday”, but instead of a model of the form $\hat{y}_{t} = y_{t-1}$ we assume that the future value of the variable depends on the average of its $k$ previous values, which means we will use the moving average.

$$\hat{y}_{t} = \frac{1}{k} \displaystyle\sum^{k-1}_{n=0} y_{t-n}$$

In [None]:
plt.figure(figsize=(18, 6))
plt.plot(df.Revenue)
plt.title('Revenue (month data)')
plt.grid(True)
plt.show()

In [None]:
def moving_average(series, n):
    """
        Calculate average of last n observations
    """
    return np.average(series[-n:])

moving_average(df, 12) # prediction from the last 12 observed months

Unfortunately, such a forecast cannot be extended far ahead: to obtain the prediction for the next step, the previous value must be an actually observed one. But the moving average has another application: smoothing the original series to reveal trends. pandas has a ready-made implementation, [`DataFrame.rolling(window).mean()`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.rolling.html). The wider the window, the smoother the trend. When the data are very noisy, which is especially common in financial indicators, for example, this procedure can help reveal common patterns.
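
As a small illustration of the rolling mean, the sketch below smooths a toy series with a window of 3. The numbers are invented; the point is only to show that the first `window - 1` positions have no full window and therefore come out as `NaN`.

```python
import pandas as pd

# Toy series, invented for illustration
series = pd.Series([3.0, 5.0, 7.0, 9.0, 11.0])

# Each value is the mean of the current and the two preceding observations
smoothed = series.rolling(window=3).mean()

print(smoothed)
```

This is why the plotting helper in this notebook skips the first `window` points when it compares the smoothed series with the actual values.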

In [None]:
def plotMovingAverage(series, window, plot_intervals=False, scale=1.96, plot_anomalies=False):

    """
        series - dataframe with timeseries
        window - rolling window size 
        plot_intervals - show confidence intervals
        plot_anomalies - show anomalies 

    """
    rolling_mean = series.rolling(window=window).mean()

    plt.figure(figsize=(15,5))
    plt.title("Moving average\n window size = {}".format(window))
    plt.plot(rolling_mean, "g", label="Rolling mean trend")

    # Plot confidence intervals for smoothed values
    if plot_intervals:
        mae = mean_absolute_error(series[window:], rolling_mean[window:])
        deviation = np.std(series[window:] - rolling_mean[window:])
        lower_bond = rolling_mean - (mae + scale * deviation)
        upper_bond = rolling_mean + (mae + scale * deviation)
        plt.plot(upper_bond, "r--", label="Upper Bound / Lower Bound")
        plt.plot(lower_bond, "r--")
        
        # Having the intervals, find abnormal values
        if plot_anomalies:
            anomalies = pd.DataFrame(index=series.index, columns=series.columns)
            anomalies[series<lower_bond] = series[series<lower_bond]
            anomalies[series>upper_bond] = series[series>upper_bond]
            plt.plot(anomalies, "ro", markersize=10)
        
    plt.plot(series[window:], label="Actual values")
    plt.legend(loc="upper left")
    plt.grid(True)

In [None]:
plotMovingAverage(df, 4) 

In [None]:
plotMovingAverage(df, 12) 

You can also display confidence intervals for our averages.

In [None]:
plotMovingAverage(df, 4, plot_intervals=True)

In [None]:
plotMovingAverage(df, 12, plot_intervals=True)

A modification of the simple moving average is the weighted average, in which observations are assigned different weights that sum to one, with more recent observations usually given more weight.


$$\hat{y}_{t} = \displaystyle\sum^{k}_{n=1} \omega_n y_{t+1-n}$$
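
The formula can be checked on a tiny example. The sketch below uses a hypothetical helper, `weighted_average_forecast`, that follows the formula literally: the first weight $\omega_1$ multiplies the most recent observation, the second weight the one before it, and so on. The input numbers are invented.

```python
import numpy as np

def weighted_average_forecast(values, weights):
    """Weighted average where weights[0] applies to the newest observation."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Reverse the series so the newest value comes first, then keep k values
    recent_first = values[::-1][:len(weights)]
    return float(np.dot(weights, recent_first))

# 0.6 * 30 + 0.3 * 20 + 0.1 * 10, which is 25 up to floating-point rounding
forecast = weighted_average_forecast([10, 20, 30], [0.6, 0.3, 0.1])
print(forecast)
```

With equal weights of $1/k$ this reduces to the simple moving average.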

In [None]:
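def weighted_average(series, weights):
    """
        Calculate weighted average on series;
        weights[0] is applied to the most recent observation
    """
    result = 0.0
    # weights are listed newest-first, matching omega_1 * y_t in the formula
    for n in range(len(weights)):
        result += series.iloc[-n-1] * weights[n]
    return float(result)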

In [None]:
weighted_average(df.Revenue, [0.6, 0.3, 0.1])

## Exponential smoothing

Now let's see what happens if, instead of weighting the last $n$ values of the series, we weight all available observations, exponentially decreasing the weights as we go deeper into the history. The formula of simple [exponential smoothing](http://www.machinelearning.ru/wiki/index.php?title=Экспоненциальное_сглаживание) will help us with this:

$$\hat{y}_{t} = \alpha \cdot y_t + (1-\alpha) \cdot \hat y_{t-1} $$


Here, the model value is a weighted average of the current true value and the previous model value. The weight $\alpha$ is called the smoothing factor. It determines how quickly we “forget” the last available true observation. The smaller $\alpha$ is, the more influence the previous model values have, and the smoother the series is.

The exponentiality lies in the recursiveness of the function - each time we multiply $(1-\alpha)$ by the previous model value, which, in turn, also contained $(1-\alpha)$, and so on to the very beginning.
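
To see the exponential weights explicitly, the recursion can be unrolled: $\hat{y}_t = \alpha \sum_{i=0}^{t-1} (1-\alpha)^i y_{t-i} + (1-\alpha)^t \hat{y}_0$. The sketch below checks numerically that the recursive and the unrolled forms agree. The helper names and the toy data are assumptions made for illustration, and $\hat{y}_0 = y_0$ is taken as the starting value.

```python
import numpy as np

def smooth_recursive(values, alpha):
    """Simple exponential smoothing, started from the first observation."""
    smoothed = [values[0]]
    for y in values[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return np.array(smoothed)

def smooth_unrolled(values, alpha, t):
    """Same quantity at step t, written as an explicit weighted sum."""
    i = np.arange(t)
    weights = alpha * (1 - alpha) ** i   # weight of y_t, y_{t-1}, ..., y_1
    recent_first = values[t:0:-1]        # y_t, y_{t-1}, ..., y_1
    return np.dot(weights, recent_first) + (1 - alpha) ** t * values[0]

values = np.array([3.0, 5.0, 4.0, 6.0, 8.0])  # toy data
alpha = 0.3
t = len(values) - 1

recursive_last = smooth_recursive(values, alpha)[t]
unrolled_last = smooth_unrolled(values, alpha, t)
weights = alpha * (1 - alpha) ** np.arange(t)

print(recursive_last, unrolled_last)
print(weights)  # each weight is (1 - alpha) times the previous one
```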

In [None]:
def exponential_smoothing(series, alpha):
    """
        series - dataset with timestamps
        alpha - float [0.0, 1.0], smoothing parameter
    """
    result = [series[0]] # first value is same as series
    for n in range(1, len(series)):
        result.append(alpha * series[n] + (1 - alpha) * result[n-1])
    return result

In [None]:
def plotExponentialSmoothing(series, alphas):
    """
        Plots exponential smoothing with different alphas
        
        series - dataset with timestamps
        alphas - list of floats, smoothing parameters
        
    """
    with plt.style.context('seaborn-white'):    
        plt.figure(figsize=(15, 7))
        for alpha in alphas:
            plt.plot(exponential_smoothing(series, alpha), label="Alpha {}".format(alpha))
        plt.plot(series.values, "c", label = "Actual")
        plt.legend(loc="best")
        plt.axis('tight')
        plt.title("Exponential Smoothing")
        plt.grid(True);

In [None]:
plotExponentialSmoothing(df.Revenue, [0.3, 0.05])

## Double exponential smoothing

Until now, our methods could at best produce a forecast one point ahead (and smooth the series nicely). That is great, but not enough, so let's move on to an extension of exponential smoothing that lets us forecast two points ahead (and smooth the series nicely too).

Splitting the series into two components, the level (intercept) $\ell$ and the trend (slope) $b$, will help us here. We predicted the level, or expected value of the series, using the previous methods; now we apply the same exponential smoothing to the trend, naively or not so naively assuming that the future direction of change depends on the weighted previous changes.

$$\ell_x = \alpha y_x + (1-\alpha)(\ell_{x-1} + b_{x-1})$$

$$b_x = \beta(\ell_x - \ell_{x-1}) + (1-\beta)b_{x-1}$$

$$\hat{y}_{x+1} = \ell_x + b_x$$

As a result, we get a set of functions. The first describes the level: as before, it depends on the current value of the series, while the second term is now split into the previous values of the level and the trend. The second is responsible for the trend: it depends on the change in level at the current step and on the previous trend value. Here the coefficient $\beta$ acts as the weight in exponential smoothing. The final prediction is the sum of the model values of the level and the trend.
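
The three equations can be traced by hand on a very short series. The sketch below uses a hypothetical helper, `holt_step`, that performs one update of the level and trend. The initial values (level equal to the first observation, trend equal to the first difference) are one common choice and are assumed here; the data are invented.

```python
def holt_step(y, level, trend, alpha, beta):
    """One update of the level and trend equations."""
    new_level = alpha * y + (1 - alpha) * (level + trend)
    new_trend = beta * (new_level - level) + (1 - beta) * trend
    return new_level, new_trend

series = [10.0, 12.0, 15.0]  # toy data
alpha, beta = 0.5, 0.5

# Assumed initialisation: level = first value, trend = first difference
level, trend = series[0], series[1] - series[0]

forecasts = []
for y in series[1:]:
    level, trend = holt_step(y, level, trend, alpha, beta)
    forecasts.append(level + trend)  # one-step-ahead forecast

print(level, trend)  # 14.5 2.25
print(forecasts)     # [14.0, 16.75]
```

After seeing 12 the model expects 14 next; the actual value 15 is higher, so both the level and the trend are nudged upward and the next forecast becomes 16.75.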

In [None]:
def double_exponential_smoothing(series, alpha, beta):
    """
        series - dataset with timeseries
        alpha - float [0.0, 1.0], smoothing parameter for level
        beta - float [0.0, 1.0], smoothing parameter for trend
    """
    # first value is same as series
    result = [series[0]]
    for n in range(1, len(series)+1):
        if n == 1:
            level, trend = series[0], series[1] - series[0]
        if n >= len(series): # forecasting
            value = result[-1]
        else:
            value = series[n]
        last_level, level = level, alpha*value + (1-alpha)*(level+trend)
        trend = beta*(level-last_level) + (1-beta)*trend
        result.append(level+trend)
    return result

def plotDoubleExponentialSmoothing(series, alphas, betas):
    """
        Plots double exponential smoothing with different alphas and betas
        
        series - dataset with timestamps
        alphas - list of floats, smoothing parameters for level
        betas - list of floats, smoothing parameters for trend
    """
    
    with plt.style.context('seaborn-white'):    
        plt.figure(figsize=(20, 8))
        for alpha in alphas:
            for beta in betas:
                plt.plot(double_exponential_smoothing(series, alpha, beta), label="Alpha {}, beta {}".format(alpha, beta))
        plt.plot(series.values, label = "Actual")
        plt.legend(loc="best")
        plt.axis('tight')
        plt.title("Double Exponential Smoothing")
        plt.grid(True)

In [None]:
plotDoubleExponentialSmoothing(df.Revenue, alphas=[0.9, 0.02], betas=[0.9, 0.02])

## Cross-validation on time series

Before building a model, let's finally talk about estimating model parameters automatically rather than by hand.

There is nothing unusual here: as before, we first choose a loss function suitable for the task, which measures how well the model fits the source data. Then we evaluate the loss on cross-validation for the given model parameters, compute the gradient, update the parameters accordingly, and descend boldly towards the global minimum of the error.

Because a time series has a temporal structure, we cannot randomly shuffle the values of the whole series into folds: doing so would destroy the relationships between observations. We therefore need a slightly trickier way to optimize the parameters. I could not find an official name for it, but on [CrossValidated](https://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection) the name "cross-validation on a rolling basis" is suggested, that is, cross-validation on a sliding window.

The idea is quite simple: we train the model on a small segment of the time series, from the beginning up to some $t$, forecast the next $n$ steps, and compute the error. Next, we expand the training set to $t+n$ values and forecast from $t+n$ to $t+2n$, and we keep moving the test segment until we reach the last available observation. As a result, we get as many folds as the number of times $n$ fits into the gap between the initial training segment and the full length of the series.

<img src="https://habrastorage.org/files/f5c/7cd/b39/f5c7cdb39ccd4ba68378ca232d20d864.png"/>
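
The splitting scheme described above can be sketched in a few lines of plain Python. The generator `expanding_window_splits` and its parameter names are hypothetical and written only to mirror the description; `sklearn.model_selection.TimeSeriesSplit` provides a ready-made variant of the same idea.

```python
def expanding_window_splits(n_samples, initial_train, horizon):
    """Yield (train_indices, test_indices) for rolling-origin cross-validation."""
    train_end = initial_train
    # Keep going while a full test block of `horizon` points still fits
    while train_end + horizon <= n_samples:
        train = list(range(train_end))
        test = list(range(train_end, train_end + horizon))
        yield train, test
        train_end += horizon  # the test block joins the training set

folds = list(expanding_window_splits(n_samples=10, initial_train=4, horizon=2))
for train, test in folds:
    print("train:", train, "test:", test)
```

With 10 observations, an initial training segment of 4 and a horizon of 2, this produces (10 - 4) / 2 = 3 folds, and every test index is always later than every training index.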

In [None]:
from sklearn.model_selection import TimeSeriesSplit # you have everything done for you

def timeseriesCVscore(params, series, loss_function=mean_squared_error, slen=24):
    """
        Returns error on CV  
        
        params - vector of parameters for optimization
        series - dataset with timeseries
        slen - season length for Holt-Winters model
    """
    # errors array
    errors = []
    
    values = series.values
    alpha, beta, gamma = params
    
    # set the number of folds for cross-validation
    tscv = TimeSeriesSplit(n_splits=3) 
    
    # iterating over folds, train model on each, forecast and calculate error
    for train, test in tscv.split(values):

        model = HoltWinters(series=values[train], slen=slen, 
                            alpha=alpha, beta=beta, gamma=gamma, n_preds=len(test))
        model.triple_exponential_smoothing()
        
        predictions = model.result[-len(test):]
        actual = values[test]
        error = loss_function(actual, predictions)  # sklearn metrics take (y_true, y_pred)
        errors.append(error)
        
    return np.mean(np.array(errors))

In [None]:
class HoltWinters:
    
    """
    Holt-Winters model with the anomalies detection using Brutlag method
    
    # series - initial time series
    # slen - length of a season
    # alpha, beta, gamma - Holt-Winters model coefficients
    # n_preds - predictions horizon
    # scaling_factor - sets the width of the confidence interval by Brutlag (usually takes values from 2 to 3)
    
    """
    
    
    def __init__(self, series, slen, alpha, beta, gamma, n_preds, scaling_factor=1.96):
        self.series = series
        self.slen = slen
        self.alpha = alpha
        self.beta = beta
        self.gamma = gamma
        self.n_preds = n_preds
        self.scaling_factor = scaling_factor
        
        
    def initial_trend(self):
        sum = 0.0
        for i in range(self.slen):
            sum += float(self.series[i+self.slen] - self.series[i]) / self.slen
        return sum / self.slen  
    
    def initial_seasonal_components(self):
        seasonals = {}
        season_averages = []
        n_seasons = int(len(self.series)/self.slen)
        # let's calculate season averages
        for j in range(n_seasons):
            season_averages.append(sum(self.series[self.slen*j:self.slen*j+self.slen])/float(self.slen))
        # let's calculate initial values
        for i in range(self.slen):
            sum_of_vals_over_avg = 0.0
            for j in range(n_seasons):
                sum_of_vals_over_avg += self.series[self.slen*j+i]-season_averages[j]
            seasonals[i] = sum_of_vals_over_avg/n_seasons
        return seasonals   

          
    def triple_exponential_smoothing(self):
        self.result = []
        self.Smooth = []
        self.Season = []
        self.Trend = []
        self.PredictedDeviation = []
        self.UpperBond = []
        self.LowerBond = []
        
        seasonals = self.initial_seasonal_components()
        
        for i in range(len(self.series)+self.n_preds):
            if i == 0: # components initialization
                smooth = self.series[0]
                trend = self.initial_trend()
                self.result.append(self.series[0])
                self.Smooth.append(smooth)
                self.Trend.append(trend)
                self.Season.append(seasonals[i%self.slen])
                
                self.PredictedDeviation.append(0)
                
                self.UpperBond.append(self.result[0] + 
                                      self.scaling_factor * 
                                      self.PredictedDeviation[0])
                
                self.LowerBond.append(self.result[0] - 
                                      self.scaling_factor * 
                                      self.PredictedDeviation[0])
                continue
                
            if i >= len(self.series): # predicting
                m = i - len(self.series) + 1
                self.result.append((smooth + m*trend) + seasonals[i%self.slen])
                
                # when predicting we increase uncertainty on each step
                self.PredictedDeviation.append(self.PredictedDeviation[-1]*1.01) 
                
            else:
                val = self.series[i]
                last_smooth, smooth = smooth, self.alpha*(val-seasonals[i%self.slen]) + (1-self.alpha)*(smooth+trend)
                trend = self.beta * (smooth-last_smooth) + (1-self.beta)*trend
                seasonals[i%self.slen] = self.gamma*(val-smooth) + (1-self.gamma)*seasonals[i%self.slen]
                self.result.append(smooth+trend+seasonals[i%self.slen])
                
                # Deviation is calculated according to Brutlag algorithm.
                self.PredictedDeviation.append(self.gamma * np.abs(self.series[i] - self.result[i]) 
                                               + (1-self.gamma)*self.PredictedDeviation[-1])
                     
            self.UpperBond.append(self.result[-1] + 
                                  self.scaling_factor * 
                                  self.PredictedDeviation[-1])

            self.LowerBond.append(self.result[-1] - 
                                  self.scaling_factor * 
                                  self.PredictedDeviation[-1])

            self.Smooth.append(smooth)
            self.Trend.append(trend)
            self.Season.append(seasonals[i%self.slen])

### Stationarity

Before moving on to modeling, it is worth mentioning an important property of time series: [**stationarity**](https://ru.wikipedia.org/wiki/Stationarity).
Stationarity means that a process does not change its statistical characteristics over time: the mean is constant, the variance is constant (this is [homoskedasticity](https://ru.wikipedia.org/wiki/Gomoskedasticity)), and the covariance function does not depend on time (it should depend only on the distance between observations). These properties are illustrated by the pictures below, taken from a post by [Sean Abu](http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/):

- The time series on the right is not stationary, since its expectation increases with time

<img src = "https://habrastorage.org/files/20c/9d8/a63/20c9d8a633ec436f91dccd4aedcc6940.png" />

- Here the variance is the problem: the spread of the values of the series varies significantly depending on the period

<img src = "https://habrastorage.org/files/b88/eec/a67/b88eeca676d642449cab135273fd5a95.png" />

- Finally, the last graph shows that the values of the series suddenly become closer to each other, forming a certain cluster, and as a result the covariance is not constant

<img src = "https://habrastorage.org/files/2f6/1ee/cb2/2f61eecb20714352840748b826e38680.png" />

Why is stationarity so important? For a stationary series, it is easy to make a forecast, since we assume that its future statistical characteristics will not differ from the ones currently observed. Most time series models, in one way or another, model and predict these characteristics (for example, the mean or the variance), so if the original series is not stationary, the predictions may turn out to be wrong. Unfortunately, most of the time series you have to deal with outside of training materials are not stationary, but you can (and should) deal with this.
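
A rough way to see a non-constant mean numerically is to compare the averages of the first and second halves of a series. The sketch below does this for a synthetic series with a linear trend and then for its first differences. Both the half-mean check and the use of first differencing are illustrative assumptions here, not a formal stationarity test.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)
trended = 0.5 * t + rng.normal(size=t.size)  # synthetic: linear trend plus noise

def half_mean_gap(values):
    """Absolute difference between the means of the two halves of a series."""
    half = len(values) // 2
    return abs(values[half:].mean() - values[:half].mean())

gap_raw = half_mean_gap(trended)            # large: the mean drifts upward
gap_diff = half_mean_gap(np.diff(trended))  # small: differencing removed the trend

print(gap_raw, gap_diff)
```

For a real decision, a proper test such as the Dickey-Fuller test used in `tsplot` is preferable to this eyeball check.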

In [None]:
def tsplot(y, lags=None, figsize=(12, 7), style='bmh'):
    """
        Plot time series, its ACF and PACF, calculate Dickey–Fuller test
        
        y - timeseries
        lags - how many lags to include in ACF, PACF calculation
    """
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
        
    with plt.style.context(style):    
        fig = plt.figure(figsize=figsize)
        layout = (2, 2)
        ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
        acf_ax = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))
        
        y.plot(ax=ts_ax)
        p_value = sm.tsa.stattools.adfuller(y)[1]
        ts_ax.set_title('Time Series Analysis Plots\n Dickey-Fuller: p={0:.5f}'.format(p_value))
        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax)
        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax)
        plt.tight_layout()

## Get rid of non-stationarity and build SARIMA

Now let's try to build a SARIMA model, having gone through all the ~~circles of hell~~ stages of bringing the series to a stationary form. Details about the model itself can be found in [Building a SARIMA model using Python + R](https://habrahabr.ru/post/210530/) and [Time series analysis using python](https://habrahabr.ru/post/207160/).

In [None]:
tsplot(df.Revenue, lags=12)

In [None]:
df_diff = df.Revenue - df.Revenue.shift(12)
tsplot(df_diff[12:], lags=6)

Well, this series is stationary according to the Dickey-Fuller test. The plot also shows no trend as such, i.e. the mean is constant, and the spread around the mean is roughly the same throughout, which means the variance is constant too. What remains is the seasonality, which has to be dealt with before building the model.
To do this, we apply a transformation with the fancy name "seasonal differencing", which is simply subtracting from the series a copy of itself lagged by the length of the season.
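
Seasonal differencing is easy to verify on a synthetic series. The sketch below builds a toy series from a repeating pattern of length 4 plus a linear trend (both invented for illustration) and subtracts the value one season earlier. It is the NumPy analogue of `series - series.shift(season)` in pandas with the leading `NaN` values dropped.

```python
import numpy as np

season = 4
pattern = np.array([10.0, 20.0, 15.0, 5.0])  # toy seasonal pattern
t = np.arange(3 * season)
values = np.tile(pattern, 3) + 1.0 * t       # three seasons plus a trend of 1 per step

# y_t - y_{t - season}: the repeating pattern cancels out exactly
seasonal_diff = values[season:] - values[:-season]

print(seasonal_diff)  # every element equals 4.0, the trend accumulated over one season
```

The first `season` observations are lost because they have nothing to be compared with.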

# To be continued