# Forecasting

- Statistical and machine learning models on time-series data
- Credits: DSCI 574 Spatial & Temporal Models, Quan Nguyen, Feb 2022
- Detailed local version: [Note 1](http://localhost:8888/doc/tree/mds/Block5/simon-block5/574-Note-1.ipynb) and [Note 2](http://localhost:8888/doc/tree/mds/Block5/simon-block5/574-Note-2.ipynb)

## Intro to Time-Series

### What is time series?

- A time series is a collection of observations recorded sequentially in time
    - For the statistically inclined, we define a time series as a collection of random variables indexed by time
        - For example, consider the sequence of random variables, $y_1, y_2, y_3$ etc., where the random variable $y_i$ denotes the value of the time series at the $i$th time point.
    - In general, a collection of random variables, $y_t$ indexed by $t$ is referred to as a *stochastic process*.
- Observations in a series may be evenly spaced (**regular time series**) or unevenly spaced (**irregular time series**)
    - We will be focusing on regular time series in this course
    - If you encounter an irregular time series, you can typically aggregate it to a regular interval and/or impute missing values
- Generally there are two main things we want to do with a time series:
    1. Explanatory modelling to **understand the past**
    2. Predictive modelling to **forecast the future**
    - (In this section we will be focusing on the explanatory modelling to understand the past)

### Time Series Features

Visualization and temporal dependency:

- The key difference between other data and time series data: temporal dependency!
    - We can quantify this dependency by looking at the correlation of a time series with "lagged" values of itself. 
        - We call this **autocorrelation**
    - We can easily "lag" a time series in Pandas with the `.shift()` method: 
        - `df["time (lag=1)"] = df["time"].shift(1)`
        - Correlation between the two: `df["time"].corr(df["time (lag=1)"])`
- A **correlogram** plots the autocorrelation function (ACF) on the y-axis and lags on the x-axis 
    - (we call it the autocorrelation *function* because it is a function of lag)
    - useful package and functions: `from statsmodels.graphics.tsaplots import plot_acf`
- We'll explore this notion of trends more later, but for now, some key observations about the correlograms:
    - The ACF will almost always decay with the lag (observations farther apart in time are less correlated)
    - If a series alternates (i.e., consecutive values tend to be on the opposite sides of the mean, like our sunspots data), then the ACF alternates too.
    - If a series has seasonal or cyclical fluctuations, the ACF will oscillate at the same frequency.
    - If the series has a trend, the ACF will have a very slow decay due to high correlation of the consecutive values (which tend to lie on the same side of the mean)
    - In general, experience is required to glean much from an ACF plot. We will use the correlogram as a model selection tool later in the course.
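
As a minimal sketch of the lagging idea above, the snippet below builds a small made-up series (the sine-wave data and the column name `y` are illustrative assumptions, not course data), lags it with `.shift()`, and checks that the manual correlation agrees with pandas' built-in `Series.autocorr`. The dictionary at the end holds the values a correlogram would plot.

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: a sine wave with period 12 (illustrative data only)
t = np.arange(48)
df = pd.DataFrame({"y": np.sin(2 * np.pi * t / 12)})

# Lag the series by one step; the first row of the lagged column becomes NaN
df["y (lag=1)"] = df["y"].shift(1)

# Autocorrelation at lag 1: Pearson correlation of the series with its lag
manual_r1 = df["y"].corr(df["y (lag=1)"])

# pandas offers the same computation directly
builtin_r1 = df["y"].autocorr(lag=1)

# Autocorrelation as a function of lag (what a correlogram displays)
acf_values = {lag: df["y"].autocorr(lag=lag) for lag in range(1, 13)}
print(f"lag 1: {manual_r1:.3f}, lag 6: {acf_values[6]:.3f}, lag 12: {acf_values[12]:.3f}")
```

Because the toy series repeats every 12 steps, the autocorrelation oscillates with the same period: strongly negative at lag 6 and strongly positive at lag 12. Note that `plot_acf` from `statsmodels` uses a slightly different estimator (full-sample mean and variance), so its values can differ a little from the pandas ones on short series.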
    
Time Series Pattern:
- There are 3 main patterns of a time series you should be aware of:
    1. **Trend**: long term increases or decreases in the series.
    2. **Seasonality**: regular variation in the series at some fixed interval, e.g., month, day of week, time of day, etc.
    3. **Cyclicity**: variations in the series that repeat with some regularity but of unknown and changing period.

White noise:
- Time series that show no autocorrelation with zero mean and constant variance are called **white noise**
    - often we assume that the white noise is iid and Gaussian distributed denoted as $w_t\sim\mathcal{N}(0,\sigma^2)$
- Think of white noise as completely uninteresting with no predictable patterns
    - if our series is white noise, it means it is a series of random numbers and cannot be predicted
    - white noise can also help us to check whether there is information/dependency in our time series that we can model
        - (as we will see in the next section)
- As a result, we expect the ACF of a white noise series to be close to 0 (no correlation) for all lags
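
To make the last point concrete, here is a small simulation sketch: it draws Gaussian white noise and computes the sample ACF by hand. The seed, the sample size, and the helper name `sample_acf` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(574)  # arbitrary seed for reproducibility
n = 5000
w = rng.normal(loc=0.0, scale=1.0, size=n)  # w_t ~ N(0, sigma^2) with sigma = 1


def sample_acf(x, max_lag):
    """Sample autocorrelations r_0..r_max_lag (standard estimator)."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    return np.array(
        [np.sum(xc[k:] * xc[: len(x) - k]) / denom for k in range(max_lag + 1)]
    )


r = sample_acf(w, max_lag=20)

# Approximate 95% band for white noise: +/- 1.96 / sqrt(n)
band = 1.96 / np.sqrt(n)
n_outside = int(np.sum(np.abs(r[1:]) > band))
print(f"band = +/-{band:.3f}, lags outside the band: {n_outside} of 20")
```

Even for true white noise, roughly 1 in 20 lags is expected to fall outside a 95% band by chance, so a single small spike is not evidence of structure.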

### Time Series Decomposition

- When we decompose a time series, we usually split it into 3 components:
    1. **Trend-cycle** ($T$)
        - Compared with the seasonal component, the trend component is a long-term change in the mean of the series, whereas seasonality is a regular variation that repeats at known periods
        - **Curve-fitting and moving average**
    2. **Seasonal** ($S$)
        - The difference between "seasonality" and "cyclicity" in a time series is that
            - Seasonality is a regular, repeating variation that repeats at known periods
            - Cyclicity is a repeating variation of varying period and magnitude
        - Estimated by first removing the trend-cycle (i.e. detrending, by subtracting or dividing), then averaging the detrended values for each season
    3. **Remainder** ($R$) (also called the "residual")
        - Calculated from the formulae below: subtract $S$ and $T$ from the observation (additive) or divide the observation by them (multiplicative)
- There are two main ways we can combine/decompose these components to make up a time series:
    1. **Additive**: $y_t = S_t + T_t + R_t$. 
        - Appropriate if the magnitude of the seasonal fluctuations, or the variation around the trend-cycle, does not vary with the value of the series.
    2. **Multiplicative**: $y_t = S_t \times T_t \times R_t$. 
        - Appropriate if the variation in the seasonal pattern, or the variation around the trend-cycle, appears to be proportional to the level of the series (i.e. fluctuations grow as the values of the series grow).
    - Usually we would use the additive method, since the multiplicative one makes a relatively strong assumption that the seasonal variation is proportional to the level of the series.
- `statsmodels.tsa` has many great tools that handle decomposition for us
- Probably the most common is an STL decomposition which uses Loess (kind of like kNN and OLS regression had a baby)
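
The additive recipe above (moving-average trend, seasonal averages of the detrended series, then the remainder) can be sketched by hand with pandas. The data here are synthetic and noise-free, and the weekly period of 7 is an assumption chosen so that a simple centered moving average spans exactly one cycle; in practice `seasonal_decompose` or `STL` from `statsmodels.tsa.seasonal` would do this work.

```python
import numpy as np
import pandas as pd

period = 7  # hypothetical weekly seasonality in daily data
n = 8 * period

# Synthetic series: linear trend + fixed weekly pattern (sums to zero), no noise
pattern = np.array([3.0, 1.0, -2.0, -4.0, 0.0, 1.0, 1.0])
trend_true = 10 + 0.5 * np.arange(n)
y = pd.Series(trend_true + np.tile(pattern, n // period))

# 1. Trend-cycle: centered moving average over one full period
trend = y.rolling(window=period, center=True).mean()

# 2. Seasonal: average the detrended values at each position in the cycle
detrended = y - trend
position = np.arange(n) % period
seasonal = detrended.groupby(position).transform("mean")
seasonal = seasonal - seasonal.iloc[:period].mean()  # make the effects sum to zero

# 3. Remainder: what is left after removing trend and seasonality
remainder = y - trend - seasonal

print(seasonal.iloc[:period].round(2).tolist())
print(remainder.dropna().abs().max())
```

The moving average leaves missing values at both ends of the series (three points on each side here), which is one of the known limitations of classical decomposition.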

## Intro to Forecasting

- Forecasting is prediction applied to the future: we predict values of the series at time points that have not yet been observed.

### Baseline methods

Before we get into each method, some notations:
- $y_t$: value of a time series at time $t$
- $h$: a forecast horizon
    - i.e. $h=1$ means we want to predict one time step ahead
- $T$: length of a time series
- $\hat{y}$: forecast value
- $y$: observed value
- $\hat{y}_{t|t-1}$: the forecast of $y_t$ given the observations up to time $t-1$

Now the baseline methods:
- **Average**
    - Use the average of the series for all future forecasts
    - i.e. mathematically $\hat{y}_{T+h}=\bar{y}$ for all $h$
- **Naive**
    - Use the last observation for all forecasts
    - i.e. $\hat{y}_{T+h|T}=y_T$ for all $h$
- **Seasonally-adjusted naive**
    - Similar to naive method, but the data are seasonally adjusted by applying a classical decomposition
    - i.e. $\hat{y}_{T+h|T}=y'_T$ for all $h$ where $y'$ stands for the transformed seasonally adjusted data
        - e.g. if we are using multiplicative decomposition, then $y' = \frac{y}{\text{model.seasonal}}$
- **Seasonal naive**
    - Set each forecast as the last observed value from the same season of the year (e.g. the same month of the previous year)
    - More specifically, for monthly data, the forecast for every future January is the last observed January value
- **Drift**
    - Forecasts equal to last value in the series plus the average (global, not step-wise) change of the series
    - i.e. $\hat{y}_{T+h|T}=y_T+h\left(\frac{y_T-y_1}{T-1}\right)$
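
The formulas above translate almost directly into NumPy. The following is a minimal sketch with hypothetical function names and a made-up series; the seasonally-adjusted naive method is omitted because it needs a decomposition step first.

```python
import numpy as np


def average_forecast(y, h):
    """Average method: every forecast is the mean of the observed series."""
    return np.full(h, np.mean(y))


def naive_forecast(y, h):
    """Naive method: every forecast is the last observation y_T."""
    return np.full(h, y[-1])


def seasonal_naive_forecast(y, h, m):
    """Seasonal naive: repeat the last observed season of length m."""
    last_season = np.asarray(y[-m:], dtype=float)
    return np.array([last_season[i % m] for i in range(h)])


def drift_forecast(y, h):
    """Drift: last value plus h times the average change (y_T - y_1) / (T - 1)."""
    slope = (y[-1] - y[0]) / (len(y) - 1)
    return y[-1] + slope * np.arange(1, h + 1)


y = np.array([10.0, 12.0, 11.0, 13.0, 14.0, 16.0, 15.0, 17.0])
print(average_forecast(y, 3))              # [13.5 13.5 13.5]
print(naive_forecast(y, 3))                # [17. 17. 17.]
print(seasonal_naive_forecast(y, 6, m=4))  # [14. 16. 15. 17. 14. 16.]
print(drift_forecast(y, 3))                # [18. 19. 20.]
```

The drift forecast is simply the straight line through the first and last observations, extended into the future.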

### Exponential models

- In Simple Exponential Smoothing (**SES**), our forecast is an exponentially weighted average of past values:
$$\hat{y}_{t+1} = \alpha{}y_t + \alpha{}(1 - \alpha{})y_{t-1} + \alpha{}(1 - \alpha{})^2y_{t-2} + \cdots$$
    - where $0\le\alpha{}\le1$ and $\hat{y}$ refers to a forecasted value
    - We can re-write that in the recursive form:
$$\hat{y}_{t+1|t} = \alpha{}y_t + (1-\alpha{})\hat{y}_{t|t-1}$$
    - (TODO)
- In **Holt's method**, we extend smoothing based on the SES:
$$\hat{y}_{t+h|t}=\ell_t+hb_t$$

$$\ell_t=\alpha{}y_t+(1-\alpha{})(\ell_{t-1}+b_{t-1})$$

$$b_t=\beta(\ell_t-\ell_{t-1})+(1-\beta)b_{t-1}$$
    - We now have two key parameters $\alpha$ (to control smoothness of the level) and $\beta$ (control smoothness of trend). Read more about it [here](https://otexts.com/fpp3/holt.html) if you wish.
    - All we're doing here is forecasting the next values as an exponentially weighted average of past values and past trend
- **Holt-Winters' method** extends this even further by adding a seasonal component:
$$\hat{y}_{t+h|t}=\ell_t+hb_t+s_{t+h-m(k+1)}$$

$$\ell_t=\alpha{}(y_t-s_{t-m})+(1-\alpha{})(\ell_{t-1}+b_{t-1})$$

$$b_t=\beta(\ell_t-\ell_{t-1})+(1-\beta)b_{t-1}$$

$$\text{additive: }s_t=\gamma(y_t-\ell_{t-1}-b_{t-1})+(1-\gamma)s_{t-m}$$ 
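$$\text{multiplicative: }s_t=\gamma\frac{y_t}{\ell_{t-1}+b_{t-1}}+(1-\gamma)s_{t-m}$$
    - Here $m$ is the seasonal period (e.g. $m=12$ for monthly data), $k$ is the integer part of $(h-1)/m$ (so the forecast always uses seasonal indices from the last observed season), and $\gamma$ controls the smoothness of the seasonal component. The forecast and level equations above are written for the additive form.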
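
The SES and Holt recursions above can be sketched in a few lines of NumPy. This is an illustration only: the function names are hypothetical, and the initial values (first observation for the level, first difference for the trend) are simple assumptions, whereas libraries typically estimate them along with $\alpha$ and $\beta$.

```python
import numpy as np


def ses_forecast(y, alpha, init_level=None):
    """One-step-ahead SES forecast using the recursive form.

    Assumption: the initial forecast equals the first observation
    unless init_level is supplied.
    """
    forecast = y[0] if init_level is None else init_level
    for obs in y:
        # y_hat_{t+1|t} = alpha * y_t + (1 - alpha) * y_hat_{t|t-1}
        forecast = alpha * obs + (1 - alpha) * forecast
    return forecast


def holt_forecast(y, alpha, beta, h):
    """Forecasts for horizons 1..h from Holt's linear trend method.

    Assumption: the level starts at y[0] and the trend at y[1] - y[0].
    """
    level, trend = y[0], y[1] - y[0]
    for obs in y[1:]:
        prev_level = level
        level = alpha * obs + (1 - alpha) * (prev_level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level + trend * np.arange(1, h + 1)


y = np.array([10.0, 12.0, 14.0])
print(ses_forecast(y, alpha=0.5))                  # 12.5
print(holt_forecast(y, alpha=0.5, beta=0.5, h=3))  # [16. 18. 20.]
```

On this perfectly linear toy series SES lags behind the trend (it forecasts 12.5 after observing 14), while Holt's method extrapolates the line exactly. That gap is the motivation for adding the trend equation.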

### ETS models
- The methods above provide point forecasts, but what if we want prediction intervals?
- The generalization of exponential smoothing algorithms to statistical models that model distributions, include an error term, and can generate prediction intervals are known as **ETS** models (**E**rror, **T**rend, **S**easonal), where:
    - E = {additive, multiplicative}
    - T = {none, additive, additive damped}
    - S = {none, additive, multiplicative}
- You can read more about ETS models and their derivation [here](https://otexts.com/fpp3/ets.html) and in [Appendix B](https://pages.github.ubc.ca/MDS-2021-22/DSCI_574_spat-temp-mod_students/lectures/appendixB_state-space-models.html) but their derivation is not really important to know and beyond the scope of this course.
    - However do note that an ETS model will usually give the same/similar results to the algorithms above, but with the added bonus of being able to generate prediction intervals
    - In Rob Hyndman's [own words](https://robjhyndman.com/hyndsight/estimation2/) *"the results from ETS are usually more reliable (than the algorithmic exponential models)"*
- Rather than optimizing based on minimizing the SSE, ETS models optimize by maximizing the likelihood (we assume errors are normally distributed).
    - For more details on that also see [Appendix B](https://pages.github.ubc.ca/MDS-2021-22/DSCI_574_spat-temp-mod_students/lectures/appendixB_state-space-models.html)

|Trend Component|Seasonal Component|
|---|---|
|None `(N)`|None `(N)`|
|Additive `(A)`|Additive `(A)`|
|Additive damped `(Ad)`|Multiplicative `(M)`|

|Notation|Method|
|---|---|
|`(N,N)`|Simple exponential smoothing|
|`(A,N)`|Holt's method|
|`(A,A)`|Additive Holt-Winters' method|

### Selecting a model

#### In-sample methods
- Metrics
    - The most common metrics are:
        - Akaike information criterion (AIC)
        - Bayesian information criterion (BIC)
        - ![](../images/aic_bic.png)
        - Sum of Squared Errors (SSE)/Mean Squared Error (MSE)/Root Mean Squared Error (RMSE)
    - We can view them for most models in `statsmodels` with `model.summary()`, or extract them individually with `model.aic`/`model.bic`/`model.mse` (here `model` is the fitted results object; which attributes exist depends on the model class)
- Residuals
    - We can use residuals to reflect how well our model captures information in the data by
        1. Visual inspection (residuals are uncorrelated, have zero mean, and ideally normally distributed)
            - We can use `plot_acf` on `model.resid` and see if residuals are significantly different from white noise
            - If the residuals still show structure (i.e. they differ from white noise), the model has not captured all the information in the data
        2. Running diagnostic tests (e.g. portmanteau tests such as Ljung-Box or Box-Pierce)
            - The *[Ljung–Box test](https://en.wikipedia.org/wiki/Ljung%E2%80%93Box_test)* tests whether a group of autocorrelations is significantly different from white noise
            - The *[Jarque-Bera test](https://en.wikipedia.org/wiki/Jarque%E2%80%93Bera_test)* tests whether residuals are significantly different from a normal distribution (based on skewness and kurtosis)
            - `statsmodels` provides the test statistics for us with `model.test_serial_correlation('ljungbox', lags=24)` and `model.test_normality('jarquebera')`
    - However, these tests tend not to be that useful in practice, but they are worth a look since most packages include them anyway
- See examples with code details on the [course note section](https://pages.github.ubc.ca/MDS-2021-22/DSCI_574_spat-temp-mod_students/lectures/lecture2_intro-to-forecasting.html#in-sample-methods)
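
To show what the Ljung-Box test measures, here is a hand-rolled sketch of its statistic, $Q = n(n+2)\sum_{k=1}^{K} r_k^2/(n-k)$, compared against a chi-square critical value. The helper name, the simulated series standing in for residuals, and the choice of 10 lags are assumptions for illustration; when testing residuals from a fitted model the degrees of freedom are usually reduced by the number of estimated parameters, which this sketch ignores.

```python
import numpy as np


def ljung_box_q(resid, lags):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..lags} r_k^2 / (n - k)."""
    x = np.asarray(resid, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    q = 0.0
    for k in range(1, lags + 1):
        r_k = np.sum(xc[k:] * xc[:-k]) / denom  # sample autocorrelation at lag k
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q


rng = np.random.default_rng(1)
white = rng.normal(size=500)   # stand-in for well-behaved residuals
structured = np.cumsum(white)  # random walk: strongly autocorrelated

CHI2_95_DF10 = 18.307  # 95% critical value of chi-square with 10 degrees of freedom
for name, series in [("white noise", white), ("random walk", structured)]:
    q = ljung_box_q(series, lags=10)
    print(f"{name}: Q = {q:.1f}, exceeds critical value: {q > CHI2_95_DF10}")
```

A large $Q$ means the autocorrelations are jointly too big to be white noise, which for model residuals suggests leftover structure. `statsmodels.stats.diagnostic.acorr_ljungbox` provides a library implementation with p-values.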

#### Out-of-sample methods
- Usually we are interested in using models to forecast, hence we care about their performance on unseen data
- The typical approach is to split the series chronologically into a training set and a validation set
- To measure the performance of model forecasts, the most common regression metrics are:
    1. Mean Absolute Error (MAE): $\frac{1}{n}\sum^{n}_{i=1}|y_i - \hat{y}_i|$
    2. Root Mean Squared Error (RMSE): $\sqrt{\frac{1}{n}\sum^{n}_{i=1}(y_i - \hat{y}_i)^2}$
    3. Mean Absolute Percentage Error (MAPE): $\frac{1}{n}\sum^{n}_{i=1}|\frac{y_i - \hat{y}_i}{y_i}|$
    4. Mean Absolute Scaled Error (MASE): $\frac{MAE}{\frac{1}{T-1}\sum^{T}_{t=2}|y_t-y_{t-1}|}$
- Some notes on these metrics:
    - MAE and RMSE are popular in practice because they are easier to interpret. Closer to 0 is better.
    - MAPE is scale-free and aims to proportionalize errors, such that the error for $\hat{y}=12$ and $y=10$ (MAPE = 20%, MSE = 4) is the same as $\hat{y}=120$ and $y=100$ (MAPE = 20%, MSE = 400).  Closer to 0 is better. 
        - MAPE is problematic if 0 values are expected (divide by 0 error) and it is also not symmetrical, i.e., $\hat{y}=150$ and $y=100$ gives $MAPE=\frac{|100-150|}{100}=50\%$, but $\hat{y}=100$ and $y=150$ gives $MAPE=\frac{|150-100|}{150}=33.33\%$.
        - There is a version, `sMAPE` available that is symmetrical, but MASE is often preferred.
    - MASE scales the MAE by the MAE of a naive forecast on the training data. I think of this as the "r-squared" of the forecasting world. It is scale-free without MAPE's division-by-zero and asymmetry problems. Values < 1 indicate forecasts are better than in-sample naive forecasts.
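
The four metrics can be written out directly from their formulas. This is a minimal sketch with hypothetical function names and made-up numbers; the single-point MAPE calls reproduce the asymmetry discussed above.

```python
import numpy as np


def mae(y, y_hat):
    """Mean Absolute Error."""
    return np.mean(np.abs(y - y_hat))


def rmse(y, y_hat):
    """Root Mean Squared Error."""
    return np.sqrt(np.mean((y - y_hat) ** 2))


def mape(y, y_hat):
    """Mean Absolute Percentage Error (undefined if any observed y is 0)."""
    return np.mean(np.abs((y - y_hat) / y))


def mase(y, y_hat, y_train):
    """MAE scaled by the in-sample MAE of the one-step naive forecast."""
    naive_mae = np.mean(np.abs(np.diff(y_train)))
    return mae(y, y_hat) / naive_mae


y_train = np.array([100.0, 110.0, 105.0, 115.0])
y_valid = np.array([120.0, 125.0])
y_pred = np.array([118.0, 130.0])

print(mae(y_valid, y_pred))                        # 3.5
print(rmse(y_valid, y_pred))
print(mape(np.array([100.0]), np.array([150.0])))  # 0.5
print(mape(np.array([150.0]), np.array([100.0])))
print(mase(y_valid, y_pred, y_train))
```

Here the MASE is below 1, so these made-up forecasts beat the in-sample naive benchmark on average.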

## ARIMA Models

(TODO)

### Stationarity
- A stationary time series is one whose properties do not depend on time
    - Is roughly horizontal
    - Has a constant mean & variance
    - Does not show predictable patterns (e.g., seasonality)
    - Note that a time series can be non-stationary even if it has no trend
        - The mean of a series may depend on time because of seasonality, or the variance may change over time
    - On the other hand, a time series can be stationary even if it has non-zero autocorrelation for some lags higher than 0
        - A series can be stationary, yet autocorrelation can still arise when observations are influenced by previous observations; consider a stationary AR(1) process as an example

AR: AutoRegressive model

MA: Moving Average

ARMA: Autoregression + moving average

ARIMA: Autoregressive Integrated Moving Average

Seasonal ARIMA (SARIMA), and SARIMA with added explanatory (exogenous) variables (SARIMAX)

Choosing order:
- Picking `p` for the AR component
    - Recall that a correlogram (ACF plot) shows autocorrelations at different lags
        - however, the ACF only tells us the total correlation between $y_t$ and $y_{t-h}$
            - The problem is that if $y_t$ and $y_{t-1}$ are correlated, then $y_{t-1}$ and $y_{t-2}$ are also correlated, and therefore, $y_t$ and $y_{t-2}$ have indirect correlation (via $y_{t-1}$).
            - This makes it hard to isolate exactly which lags are important in our series. We can clearly see this in the model above which should only show one significant lag.
    - The solution to the problem with ACF is **partial autocorrelation** (PACF) which removes intermediate effects between $y_t$ and $y_{t-h}$. 
        - The partial autocorrelation is estimated as the last coefficient in an autoregressive model. So $PACF(k)$ is the $k$th estimated coefficient in an `AR(k)` model
    - By looking at the PACF plot, we can pick `p` (the order of the AR model) as the number of leading lags whose partial autocorrelation lies outside the confidence band, i.e. the lag after which the PACF drops into the band
        - we want to see how many lags are heavily correlated
- Picking `q` for the MA component
    - Recall that `MA(q)` models are based on white noise. 
    - As we saw before, the value of an `MA(q)` model at time $t$ is a weighted sum of the last `q` values of the white noise process. 
    - Since there is no dependence structure on values at lags higher than `q`, it's just white noise, so we would expect the ACF to "cut off" at lag `q`, and pick that `q`
- See [coded example](https://pages.github.ubc.ca/MDS-2021-22/DSCI_574_spat-temp-mod_students/lectures/lecture3_arima-models.html#choosing-orders) in the course notes.
    - Notice that in the slider example for $\phi$ (the AR(1) coefficient), a high $\phi$ gives larger autocorrelations: we see a gradual decrease in the ACF and a quick drop in the PACF, with some bouncing around the zero line

|Model|ACF|PACF|
|---|---|---|
|MA(q)|Cuts off after lag `q`|Tails off, no pattern|
|AR(p)|Tails off (exponentially or like a "damped" sine wave)|Cuts off after lag `p`|
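
The table can be checked on simulated data. The sketch below generates an AR(1) and an MA(1) process (the coefficients of 0.7, the seed, and the helper names are arbitrary assumptions) and estimates the PACF exactly as described above, as the last coefficient of an AR($k$) model fitted by least squares.

```python
import numpy as np


def sample_acf(x, lag):
    """Sample autocorrelation at a single lag."""
    xc = x - x.mean()
    return np.sum(xc[lag:] * xc[: len(x) - lag]) / np.sum(xc ** 2)


def pacf_ols(x, k):
    """PACF(k): the last coefficient of an AR(k) model fitted by least squares."""
    n = len(x)
    # Columns: intercept, y_{t-1}, ..., y_{t-k}; rows correspond to t = k..n-1
    lagged = [x[k - j : n - j] for j in range(1, k + 1)]
    X = np.column_stack([np.ones(n - k)] + lagged)
    coefs, *_ = np.linalg.lstsq(X, x[k:], rcond=None)
    return coefs[-1]


rng = np.random.default_rng(42)
n, phi, theta = 5000, 0.7, 0.7
w = rng.normal(size=n + 1)  # white noise driving both processes

# AR(1): y_t = phi * y_{t-1} + w_t
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = phi * ar1[t - 1] + w[t]

# MA(1): y_t = w_t + theta * w_{t-1}
ma1 = w[1:] + theta * w[:-1]

lags = (1, 2, 3)
print("AR(1) ACF :", np.round([sample_acf(ar1, k) for k in lags], 2))
print("AR(1) PACF:", np.round([pacf_ols(ar1, k) for k in lags], 2))
print("MA(1) ACF :", np.round([sample_acf(ma1, k) for k in lags], 2))
```

For the AR(1) series the ACF should decay roughly like $\phi^k$ while the PACF is near zero beyond lag 1; for the MA(1) series the ACF should be near zero beyond lag 1, with a lag-1 value close to $\theta/(1+\theta^2)$.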

- Auto-arima
- Box-Jenkins
![](../images/box_jenkins.png)

## Forecasting with ML

## Other Forecasting Techniques

## Advanced Forecasting Modelling