# [Deep Learning Timeseries](https://medium.com/towards-data-science/the-best-deep-learning-models-for-time-series-forecasting-690767bc63f0)

## 1. [N-BEATS](https://arxiv.org/pdf/1905.10437.pdf) (Neural Basis Expansion Analysis Time Series)

### Contribution
- show that pure DL using no time-series-specific components outperforms well-established statistical approaches
- Interpretable DL for Time series

### Problem Statement 
Univariate point forecasting problem in discrete time
- $H$: forecast horizon
- $T$: length of the observed series history $[y_1, \ldots, y_T]$
- $t$: length of the lookback window $x \in \mathbb{R}^t = [y_{T-t+1},\ldots,y_T]$
- task: predict the vector of future values $\textbf{y} \in \mathbb{R}^H = [y_{T+1},\ldots, y_{T+H}]$
- $\hat y$: forecast of $\textbf{y}$
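
The notation above can be made concrete with a small slicing sketch. It assumes a backtesting setup in which the last $H$ points of a stored series play the role of the unseen future; the function name `split_window` is hypothetical.

```python
import numpy as np

def split_window(series, t, H):
    """Return the lookback window x and the target vector y for one forecast origin."""
    series = np.asarray(series, dtype=float)
    T = len(series) - H              # assumption: the last H points are the held-out future
    if T < t:
        raise ValueError("series too short for the requested lookback and horizon")
    x = series[T - t:T]              # [y_{T-t+1}, ..., y_T]
    y = series[T:T + H]              # [y_{T+1}, ..., y_{T+H}]
    return x, y

series = np.arange(1, 11)            # stands in for y_1, ..., y_10
x, y = split_window(series, t=4, H=3)
```

Here `x` holds the four most recent observed values and `y` the three values to be forecast.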

### N-BEATS
![alt_text](https://miro.medium.com/max/700/1*qWe4P1BDLlDw79bjuhGXXQ.png)

#### Basic Block
- backcast: produces an estimate $\hat x_l$ of the block input $x_l$, with the ultimate goal of helping the downstream blocks by removing components of their input that are not helpful for forecasting
- forecast: ultimate goal of optimizing the accuracy of the partial forecast $\hat y_l$ by properly mixing the basis vectors supplied by $g_l^f$

#### Doubly Residual Stacking
$$ x_l = x_{l-1} - \hat x_{l-1}$$
$$ \hat y = \sum_l \hat y_l$$
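
A minimal sketch of the two recursions, assuming each block is any callable that returns a `(backcast, forecast)` pair. The toy block below (one ReLU layer with random, untrained weights) is a hypothetical stand-in for the paper's fully connected stack.

```python
import numpy as np

def make_toy_block(t, H, hidden, rng):
    """Toy block: one ReLU layer followed by linear backcast and forecast heads."""
    W = rng.normal(scale=0.1, size=(t, hidden))
    W_back = rng.normal(scale=0.1, size=(hidden, t))
    W_fore = rng.normal(scale=0.1, size=(hidden, H))

    def block(x):
        h = np.maximum(x @ W, 0.0)
        return h @ W_back, h @ W_fore    # (x_hat_l, y_hat_l)
    return block

def doubly_residual_forward(x, blocks):
    residual = np.asarray(x, dtype=float)
    forecast = 0.0
    for block in blocks:
        x_hat, y_hat = block(residual)
        residual = residual - x_hat      # x_l = x_{l-1} - x_hat_{l-1}
        forecast = forecast + y_hat      # y_hat = sum_l y_hat_l
    return forecast, residual

rng = np.random.default_rng(0)
t, H = 6, 3
blocks = [make_toy_block(t, H, hidden=16, rng=rng) for _ in range(4)]
forecast, residual = doubly_residual_forward(np.linspace(0.0, 1.0, t), blocks)
```

Each block only sees what earlier blocks failed to explain, while the final forecast is the sum of all partial forecasts.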

#### Interpretability
The interpretable architecture can be constructed by adding structure to the basis layers at the stack level. Practitioners often decompose a time series into trend and seasonality, so the interpretable configuration uses a trend stack and a seasonality stack.
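
As a rough sketch of what "adding structure to the basis layers" can look like, the snippet below builds a low-degree polynomial basis for trend and a Fourier basis for seasonality on the forecast horizon. The function names, the normalized time grid `t = [0, ..., H-1] / H`, and the number of harmonics are assumptions made for illustration.

```python
import numpy as np

def trend_basis(H, degree):
    """Rows are 1, t, t^2, ... evaluated on a normalized time grid."""
    t = np.arange(H) / H
    return np.stack([t ** p for p in range(degree + 1)])        # (degree + 1, H)

def seasonality_basis(H):
    """Rows are cos(2*pi*i*t) and sin(2*pi*i*t) for i = 0, ..., H//2 - 1."""
    t = np.arange(H) / H
    freqs = np.arange(H // 2)
    angles = 2 * np.pi * freqs[:, None] * t[None, :]
    return np.concatenate([np.cos(angles), np.sin(angles)])      # (2 * (H // 2), H)

H = 8
theta_trend = np.array([1.0, 2.0, 0.0])     # hypothetical coefficients predicted by a block
trend_forecast = theta_trend @ trend_basis(H, degree=2)
```

Because the block only predicts the coefficients, its partial forecast is constrained to be a smooth polynomial (trend) or a periodic signal (seasonality), which is what makes each stack's output readable.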

## 2. [DeepAR](https://arxiv.org/pdf/1704.04110.pdf)


### Contribution 

- RNN for probabilistic forecasting, which incorporates a negative binomial likelihood for count data as well as special treatment for the case where the magnitudes of the time series vary widely
- demonstrate that this model produces accurate probabilistic forecasts across a range of input characteristics
- Key advantages
    - learns seasonal behavior and dependencies on given covariates across time series, so minimal manual feature engineering is needed to capture complex, group-dependent behavior
    - makes probabilistic forecasts in the form of Monte Carlo samples that can be used to compute consistent quantile estimates for all sub-ranges in the prediction horizon
    - provides forecasts for items with little or no history by learning from similar items
    - does not assume Gaussian noise, but can incorporate a wide range of likelihood functions

### Model 
- $z_{i,t}$: time series i at time t
- $x_{i,1:T}$: covariates that are assumed to be known for all time points

An autoregressive recurrent network architecture models the conditional distribution
$$P(z_{i,t_0:T}|z_{i,1:t_0-1},x_{i,1:T})$$
assuming the model distribution $Q_\Theta$ consists of a product of likelihood factors
$$Q_\Theta(z_{i,t_0:T}|z_{i,1:t_0-1},x_{i,1:T}) = \prod_{t=t_0}^T Q_\Theta(z_{i,t}|z_{i,1:t-1},x_{i,1:T}) = \prod_{t=t_0}^T \ell(z_{i,t}|\theta(\mathbf{h}_{i,t},\Theta))$$

parameterized by the output of an autoregressive recurrent network
>$$\mathbf{h}_{i,t} = h(\mathbf{h}_{i,t-1}, z_{i,t-1}, x_{i,t},\Theta)$$

![alt_text](https://miro.medium.com/max/700/1*RJV1g4pH5TuFH9VXXpRJUg.png)
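
The factorization above suggests ancestral sampling: unroll the network on observed values in the conditioning range, then feed each sampled value back in as the next input. The sketch below assumes a toy tanh recurrent cell with random, untrained weights and a Gaussian likelihood; it is not the paper's LSTM, and all names are hypothetical.

```python
import numpy as np

def softplus(a):
    return np.logaddexp(0.0, a)

def sample_paths(z_hist, x, params, n_samples, rng):
    """z_hist: (t0 - 1,) observed values; x: (T, d) covariates known for all time points."""
    Wh, wz, Wx, b, w_mu, w_sigma = params
    T, t_cond = len(x), len(z_hist)
    # Conditioning range: h_t = h(h_{t-1}, z_{t-1}, x_t), driven by observed values.
    h, z_prev = np.zeros(Wh.shape[0]), 0.0
    for t in range(t_cond):
        h = np.tanh(Wh @ h + wz * z_prev + Wx @ x[t] + b)
        z_prev = z_hist[t]
    # Prediction range: each sampled value becomes the next autoregressive input.
    paths = np.empty((n_samples, T - t_cond))
    for s in range(n_samples):
        h_s, z_s = h.copy(), z_prev
        for k, t in enumerate(range(t_cond, T)):
            h_s = np.tanh(Wh @ h_s + wz * z_s + Wx @ x[t] + b)
            mu, sigma = w_mu @ h_s, softplus(w_sigma @ h_s)
            z_s = rng.normal(mu, sigma)
            paths[s, k] = z_s
    return paths

rng = np.random.default_rng(0)
hidden, d, T, t_cond = 8, 2, 12, 8
params = (rng.normal(scale=0.3, size=(hidden, hidden)), rng.normal(scale=0.3, size=hidden),
          rng.normal(scale=0.3, size=(hidden, d)), np.zeros(hidden),
          rng.normal(size=hidden), rng.normal(size=hidden))
x = rng.normal(size=(T, d))
z_hist = rng.normal(size=t_cond)
paths = sample_paths(z_hist, x, params, n_samples=200, rng=rng)
q10, q50, q90 = np.quantile(paths, [0.1, 0.5, 0.9], axis=0)   # per-step quantiles
total_q90 = np.quantile(paths.sum(axis=1), 0.9)                # quantile of a sub-range sum
```

Because every quantity is computed from the same set of sample paths, quantiles of any sub-range (such as the sum over the horizon) remain mutually consistent.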

#### Likelihood model 
The likelihood $\ell(z|\theta)$ should be chosen to match the statistical properties of the data. In this approach, the network directly predicts all parameters $\theta$ (e.g. mean and variance) of the probability distribution for the next time point.

Gaussian likelihood for real-valued data

negative-binomial likelihood for positive count data
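
A sketch of the two log-likelihoods, assuming the Gaussian is parameterized by mean $\mu$ and standard deviation $\sigma$, the negative binomial by mean $\mu$ and shape $\alpha$, and that positive parameters are obtained from raw network outputs through a softplus. The raw output values used below are hypothetical.

```python
import math

def softplus(a):
    """Numerically stable log(1 + exp(a)); maps any real number to a positive one."""
    return math.log1p(math.exp(-abs(a))) + max(a, 0.0)

def gaussian_loglik(z, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (z - mu) ** 2 / (2 * sigma ** 2)

def neg_binomial_loglik(z, mu, alpha):
    """Negative binomial with mean mu and variance mu + alpha * mu**2, for integer z >= 0."""
    r = 1.0 / alpha
    return (math.lgamma(z + r) - math.lgamma(z + 1) - math.lgamma(r)
            - r * math.log1p(alpha * mu)
            + z * (math.log(alpha * mu) - math.log1p(alpha * mu)))

# Hypothetical raw network outputs for a single time step.
raw_mu, raw_alpha = 1.2, -0.3
mu, alpha = softplus(raw_mu), softplus(raw_alpha)
loss = -neg_binomial_loglik(4, mu, alpha)    # negative log-likelihood of observing z = 4
```

Training then amounts to minimizing the summed negative log-likelihood over all series and time steps.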

#### Scaling handling 

## 3. [Spacetimeformer](https://arxiv.org/pdf/2109.12218.pdf)
Key pieces
- Long-range Transformers
    - Performer for long-range sequence
- Spatial-Temporal Forecasting
![](.\img\spatio-temporal_attention.png)

![](https://miro.medium.com/max/700/1*zNsA32-eJHeP1Am9ipLB9A.png)


## 4. [Temporal Fusion Transformer](https://arxiv.org/pdf/1912.09363.pdf)
Autoregressive methods such as ARIMA and DeepAR lack the capability to incorporate covariates that are known only up to the present time

https://towardsdatascience.com/temporal-fusion-transformer-googles-model-for-interpretable-time-series-forecasting-5aa17beb621

- rich number of features
    - time-dependent data with known inputs into the future
    - time-dependent data known only up to the present
    - static (time-invariant) features


### Contribution 
- **rich features**: supports 3 types of features: 
    - temporal data with known inputs into the future
    - temporal data known only up to the present 
    - exogenous categorical/static variables, known as time-invariant features
- **Heterogeneous time series**: supports training on multiple time series coming from different distributions. TFT splits processing into 2 parts: local processing, which focuses on the characteristics of specific events, and global processing, which captures the collective characteristics of all time series
- **multi-horizon forecasting**
- **interpretability**: the architecture is Transformer-based; building on self-attention, it introduces a novel multi-head attention mechanism which, when analyzed, provides extra insight into feature importances (in contrast to MQRNN)


### Model 
1. $\textbf{Gating Mechanism}$ to skip over any unused components of the architecture, providing adaptive depth and network complexity to accommodate a wide range of datasets and scenarios
2. $\textbf{Variable selection network}$ to select relevant input variables at each time step
3. $\textbf{Static covariate encoders}$ to integrate static features into the network, through the encoding of context vectors to condition temporal dynamics
4. $\textbf{Temporal processing}$ to learn both long- and short-term temporal relationships from both observed and known time-varying inputs. A sequence-to-sequence layer is employed for local processing, whereas long-term dependencies are captured using a novel interpretable multi-head attention block
5. $\textbf{Prediction intervals}$ via quantile forecasts to determine the range of likely target values at each prediction horizon
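
Quantile forecasts of the kind mentioned in item 5 are typically trained with the quantile (pinball) loss. A minimal sketch, with the function name and the toy numbers chosen for illustration:

```python
import numpy as np

def quantile_loss(y, y_hat, q):
    """Pinball loss: q * (y - y_hat) when under-predicting, (1 - q) * (y_hat - y) otherwise."""
    diff = np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

y_true = np.array([10.0, 12.0, 9.0])
under = quantile_loss(y_true, y_true - 1.0, q=0.9)   # forecast too low by 1 everywhere
over = quantile_loss(y_true, y_true + 1.0, q=0.9)    # forecast too high by 1 everywhere
```

For `q = 0.9`, under-prediction is penalized nine times more heavily than over-prediction, which pushes the forecast toward the 90th percentile of the target.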

![](https://miro.medium.com/max/700/1*7rXe_MVn5QI9oLP2vrMdvQ.png)

**Model inputs**:

- $k$: length of the lookback window
- $\tau_{max}$ step ahead window
- Observed past input $x$ in the time period $[t-k,\ldots,t]$
- Future known inputs $x$ in the time period $[t+1,\ldots,t+\tau_{max}]$
- a set of static variables $s$
- Target variable $y$ also spans the time window $[t+1,\ldots,t+\tau_{max}]$


#### Gated Residual Network
![](https://miro.medium.com/max/342/1*9eIgK7rVAwnXyHje2YKjkA.png)
- The GRN uses two types of activation functions: ELU (Exponential Linear Unit) and GLU (Gated Linear Unit). GLU was first used in the [Gated Convolutional Networks](https://arxiv.org/pdf/1612.08083v3.pdf) architecture for selecting the most important features for predicting the next word. Both activation functions help the network decide which input transformations are simple and which require more complex modeling
- The final output passes through standard Layer Normalization. The GRN also contains a residual connection, so the network can learn, if necessary, to skip the non-linear transformation entirely and pass its input through. In some cases, depending on where the GRN is situated, the network can also make use of static variables through a context vector
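
A NumPy sketch of one GRN forward pass, assuming the form `LayerNorm(a + GLU(W1 * ELU(W2 a + W3 c + b2) + b1))`. The weight names, the random initialization, and the layer norm without learned scale and shift are simplifications made for illustration.

```python
import numpy as np

def elu(a):
    return np.where(a > 0, a, np.expm1(np.minimum(a, 0.0)))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def layer_norm(a, eps=1e-5):
    mean = a.mean(axis=-1, keepdims=True)
    var = a.var(axis=-1, keepdims=True)
    return (a - mean) / np.sqrt(var + eps)

def init_grn(d, rng, d_context=None):
    p = {name: rng.normal(scale=0.1, size=(d, d)) for name in ("W1", "W2", "W4", "W5")}
    p.update({name: np.zeros(d) for name in ("b1", "b2", "b4", "b5")})
    if d_context is not None:
        p["W3"] = rng.normal(scale=0.1, size=(d_context, d))   # projects the static context
    return p

def grn(a, p, c=None):
    pre = a @ p["W2"] + p["b2"]
    if c is not None:
        pre = pre + c @ p["W3"]                      # optional static context vector
    eta2 = elu(pre)
    eta1 = eta2 @ p["W1"] + p["b1"]
    gate = sigmoid(eta1 @ p["W4"] + p["b4"])         # GLU gate in (0, 1)
    glu = gate * (eta1 @ p["W5"] + p["b5"])
    return layer_norm(a + glu)                       # residual connection, then LayerNorm

rng = np.random.default_rng(0)
p = init_grn(4, rng, d_context=3)
a, c = rng.normal(size=(5, 4)), rng.normal(size=3)
out = grn(a, p, c)
```

When the gate saturates near zero, the output reduces to `layer_norm(a)`, which is the sense in which the block can suppress its own non-linear branch.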

#### Variable Selection Network (VSN)

The VSN uses GRNs under the hood for their filtering capability
- At time $t$, the flattened vector of all past inputs (called $\Xi_t$) for the corresponding lookback period is fed through a GRN unit (in blue) and then a softmax function, producing a normalized vector of weights $u$
- each feature passes through its own GRN, producing a processed vector $\xi_t$, one for every variable
- the output is a linear combination of the $\xi_t$ vectors weighted by $u$
- The GRN for each feature is shared across all time steps in the same lookback period
- The VSN for static variables does not take the context vector $c$ into account

![](https://miro.medium.com/max/344/1*mcf8w_N_kT6Jln94TXFC7w.png)
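
The selection step can be sketched as follows for a single time step. To keep the example short, `toy_grn` is a hypothetical single tanh layer standing in for a full GRN, and the static context vector is omitted.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def toy_grn(W):
    """Stand-in for a GRN: one tanh layer (a real GRN adds ELU, GLU gating, residual, LayerNorm)."""
    return lambda a: np.tanh(a @ W)

def variable_selection(xi, weight_grn, feature_grns):
    """xi: (m, d) embeddings of m variables at one time step."""
    u = softmax(weight_grn(xi.reshape(-1)))                        # (m,) selection weights
    processed = np.stack([g(xi[j]) for j, g in enumerate(feature_grns)])   # (m, d)
    return u @ processed, u                                        # weighted combination, weights

rng = np.random.default_rng(0)
m, d = 3, 4
xi = rng.normal(size=(m, d))
weight_grn = toy_grn(rng.normal(size=(m * d, m)))
feature_grns = [toy_grn(rng.normal(size=(d, d))) for _ in range(m)]
selected, u = variable_selection(xi, weight_grn, feature_grns)
```

The weights `u` are what gets inspected for feature importance: a value close to zero means the corresponding variable contributes little at that time step.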

#### Interpretable Multi-Head Attention
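Each head has its own query and key weights, while the value weights $W_V$ are shared across all heads and the head outputs are averaged: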
>$$ \tilde{H} = \frac{1}{m_H} \sum^{m_H}_{h=1} A(Q W^{(h)}_Q, K W^{(h)}_K) V W_V$$
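
A NumPy sketch of the formula above, assuming scaled dot-product attention for $A(\cdot,\cdot)$ and omitting both the causal mask and any final output projection. Shapes and names are illustrative.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def interpretable_multi_head(Q, K, V, WQ, WK, WV):
    """WQ, WK: (m_H, d_model, d_attn), one pair per head; WV: (d_model, d_v), shared."""
    d_attn = WQ.shape[-1]
    values = V @ WV                                            # shared value projection
    attn = np.stack([
        softmax((Q @ WQ[h]) @ (K @ WK[h]).T / np.sqrt(d_attn))
        for h in range(len(WQ))
    ])                                                         # (m_H, n_q, n_k)
    mean_attn = attn.mean(axis=0)                              # single averaged attention map
    return mean_attn @ values, mean_attn

rng = np.random.default_rng(0)
n, d_model, d_attn, d_v, m_H = 5, 8, 4, 6, 3
X = rng.normal(size=(n, d_model))
WQ = rng.normal(size=(m_H, d_model, d_attn))
WK = rng.normal(size=(m_H, d_model, d_attn))
WV = rng.normal(size=(d_model, d_v))
H_tilde, A_mean = interpretable_multi_head(X, X, X, WQ, WK, WV)
```

Because the values are shared, averaging the head outputs is the same as applying one averaged attention matrix, so `A_mean` can be read directly as a single set of attention weights.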

#### LSTM Encoder Decoder Layer
1. produce context-aware embeddings by feeding past inputs into the encoder and known future inputs into the decoder. This plays a role similar to the positional encoding used in the Transformer
2. merge the context-aware embeddings produced by the LSTM with the context vectors $c$ of the static variables, by initializing the hidden state and cell state with the $c_h$ and $c_c$ vectors from the static covariate encoder

##### 1. locality enhancement with seq2seq layer
This plays a role similar to positional encoding in the classic Transformer, while accounting for all types of inputs. The past inputs are fed into the encoder, while the known future inputs are fed into the decoder. 
##### 2. static enrichment layer

##### 3. Temporal self-attention layer
all static-enriched temporal features are first grouped into a single matrix $\Theta(t) = [\theta(t,-k),\ldots,\theta(t,\tau_{max})]^T$ and interpretable multi-head attention is applied 
>$$ B(t) = \text{InterpretableMultiHead}(\Theta(t),\Theta(t),\Theta(t))$$

##### 4. position-wise feed-forward layer
apply additional non-linear processing to the outputs of the self-attention layer, where the weights of the GRN are shared across the entire layer. A gated residual connection also skips over the entire transformer block, providing a direct path to the seq2seq layer

## [Multi-horizon Quantile R(C)NN](https://arxiv.org/abs/1711.11053)
a Seq2Seq framework that generates multi-horizon quantile forecasts. It is designed to solve the large-scale time series regression problem:
>$$p(y_{t+k},\dots, y_{t+1} \mid y_{:t}, x^h_{:t}, x^f_{t:}, x^s)$$
- $x^h_{:t}$ temporal covariates available in history
- $x^f_{t:}$ temporal covariates in the future
- $x^s$ static time-invariant features 

### contribution
- efficient training scheme for combining sequential neural networks with multi-horizon forecasting
- network sub-structure to accommodate known future information, including the alignment of shifting seasonality and known events that cause large spikes and dips

### Method
![](img\MQRNN.jpg)


#### Architecture
- LSTM encoder: encode all history into hidden states $h_t$

- global MLP: summarize the encoder output plus all future inputs into 2 contexts: a series of horizon-specific contexts $c_{t+k}$ for each of the $K$ future points, and a horizon-agnostic context $c_a$ which captures common information

- local MLP: combines the corresponding future input with the two contexts from the global MLP, and outputs the required quantiles for that specific future time step.
    - carries network-structural awareness of the temporal distance between a forecast creation time point and a specific horizon
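
A shape-level sketch of the decoder described above, assuming single-layer "MLPs" with random, untrained weights. The function name, weight layout, and dimensions are hypothetical, and nothing here prevents quantile crossing.

```python
import numpy as np

def mqrnn_decoder(h_t, x_future, W_global, W_local, d_ctx):
    """h_t: (d_h,) encoder state; x_future: (K, d_f) future covariates for K horizons."""
    K = x_future.shape[0]
    # Global MLP: encoder state plus all future inputs -> K horizon-specific contexts + 1 agnostic.
    g_in = np.concatenate([h_t, x_future.reshape(-1)])
    contexts = np.maximum(g_in @ W_global, 0.0).reshape(K + 1, d_ctx)
    c_k, c_a = contexts[:K], contexts[K]
    # Local MLP: same weights at every horizon, fed (c_{t+k}, c_a, x^f_{t+k}).
    local_in = np.concatenate([c_k, np.tile(c_a, (K, 1)), x_future], axis=1)
    return local_in @ W_local                        # (K, n_quantiles)

rng = np.random.default_rng(0)
d_h, K, d_f, d_ctx, n_q = 5, 4, 2, 3, 3
W_global = rng.normal(size=(d_h + K * d_f, (K + 1) * d_ctx))
W_local = rng.normal(size=(2 * d_ctx + d_f, n_q))
quantiles = mqrnn_decoder(rng.normal(size=d_h), rng.normal(size=(K, d_f)),
                          W_global, W_local, d_ctx)
```

Sharing `W_local` across horizons keeps the parameter count independent of $K$, while the horizon-specific context `c_k` is what tells the local MLP which horizon it is forecasting.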

#### Training Scheme
The framework creates multi-horizon forecasts by placing a series of decoders, with shared parameters, at each recurrent layer (time point) in the encoder, and computes the loss against the corresponding targets (the future series relative to that time point).

A single back-propagation-through-time pass can then gather the multi-horizon error gradients of different forecast creation times (FCTs) in one pass over a sample, with little additional cost.

## [Deep State Space Models for Time Series Forecasting](https://proceedings.neurips.cc/paper/2018/file/5cf68969fb67aa6082363a6d4e6468e2-Paper.pdf)