# Paper: An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

## Summary

In the deep learning community, there is a general tendency to focus on recurrent and recursive neural networks for sequence models. However, recent papers show that CNN-based architectures can perform as well as RNNs for sequence modelling. The authors introduce Temporal Convolutional Networks (TCNs), a family of CNN-based architectures that use 1D dilated causal convolutions and residual blocks to build deeper networks. They show that TCNs outperform recurrent architectures such as LSTMs and GRUs on many sequence modelling tasks, and that TCNs exhibit longer memory than RNNs with the same capacity. They note that the TCN is a relatively new architecture whose performance may improve further as more researchers look into it. Because of the simplicity and performance of the TCN, the authors argue that it may be a more appropriate starting point when applying deep learning methods to sequences.


## Sequence Modeling Task

The sequence modelling task is given as:

<div style="background-color:#E0FFFF;border:solid 1px #ddd; padding:10px">
Given an input sequence $[x_0, x_1, \cdots, x_T]$, we want to predict some corresponding output sequence $[y_0, y_1, \cdots, y_T]$ at each time step $t$.
<br><br>

The key constraint is that the prediction of the output value $y_t$ at time $t$ must only use the input values observed up to time $t$, i.e., we can only predict $y_t$ using the "past" input values $x_0, x_1, \cdots, x_t$ and not the "future" values $x_{t+1}, x_{t+2}, \cdots, x_T$.
</div>

This is known as __causal__ sequence modelling.
 - It captures many settings such as autoregressive prediction (where we try to predict some signal given its past) by setting the target output to be simply the input shifted by one time step.
 - It does not capture machine translation or sequence-to-sequence prediction in general because these use the entire input sequence (including "future" states) to predict each output.
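
The autoregressive case from the first bullet can be sketched in plain Python. This is only an illustration: the helper name `autoregressive_pairs` and the toy signal are assumptions, not something taken from the paper.

```python
def autoregressive_pairs(signal):
    """Return (inputs, targets) where the targets are the inputs shifted by one step.

    Hypothetical helper for illustration: targets[t] == signal[t + 1].
    """
    inputs = signal[:-1]   # x_0 ... x_{T-1}
    targets = signal[1:]   # x_1 ... x_T
    return inputs, targets


signal = [3, 1, 4, 1, 5, 9]
inputs, targets = autoregressive_pairs(signal)
for t, y in enumerate(targets):
    # The prediction for targets[t] may only look at inputs[0..t].
    print(f"t={t}: visible inputs={inputs[:t + 1]} -> target={y}")
```

At every step the model sees a prefix of the input and is asked for the next value, which is exactly the causal constraint stated above.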

## Related Work

Recurrent Neural Networks
- Often used for dedicated sequence models
- Used for language modeling and machine translation
- Maintain a vector of hidden activations that are propagated through time
- The hidden state can act as a representation of everything that has been seen so far in the sequence
- Basic RNN architectures are notoriously difficult to train. Instead, more elaborate architectures such as the LSTM and GRU are used
- Architectures that are much better than the LSTM are not trivial to find

Combinations of RNNs and CNNs to extract the desirable aspects of both types of architecture
- Convolutional LSTM: Replaces the fully-connected layers in an LSTM with convolutional layers to allow for additional structure in the recurrent layers
- Quasi-RNN: interleaves convolutional layers with simple recurrent layers
- Dilated RNN: adds dilations to recurrent architectures

## Temporal Convolutional Networks (TCN)

TCN is used as a simple descriptive term for a family of architectures. It is meant to be a ConvNet design that distills the best practices for modelling sequences.

A TCN consists of two main building blocks:
- 1D Fully-Convolutional Network (FCN), because we want the network to produce an output of the same length as the input
- Causal convolutions to only allow predictions to be based on past values

TCN = 1D FCN + causal convolutions

A network that consists of the above-mentioned two building blocks is essentially a Time Delay Neural Network invented in 1989.

To improve on it, the TCN uses the following modern techniques:
- __Dilated Convolutions__ allow the network to have an exponentially large receptive field, giving the TCN a very long effective memory.
- __Residual Blocks__ allow the network to be very deep.

### Fully-Convolutional Network (FCN)

Each hidden layer in a 1D fully-convolutional network (FCN) architecture has the same length as the input layer.

<img src="figures/p3/1d-fcn-no-connections.png" width="500" />

















### Causal Convolutions

- A causal convolution means that the output at time $t$ is computed only from elements at time $t$ and earlier in the previous layer
- A causal convolution can be implemented by padding the input and shifting the output of a normal convolution by a few time steps

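```python
from torch.nn import Conv1d  # assumes PyTorch

def CausalConv1d(in_channels, out_channels, kernel_size, **kwargs):
    # Conv1d pads both ends, so the last `pad` outputs must be trimmed to keep the result causal.
    pad = (kernel_size - 1) * kwargs.get("dilation", 1)
    return Conv1d(in_channels, out_channels, kernel_size, padding=pad, **kwargs)
```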


<img src="figures/p3/causal-convolutional-layers.png" width="800" />


Source: https://jeddy92.github.io/JEddy92.github.io/ts_seq2seq_conv/
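
To make the causality property concrete, here is a dependency-free sketch of a single-channel causal convolution. It is an illustration only, not the authors' implementation: the function name `causal_conv1d` and the toy filter are assumptions, and the missing "past" is treated as zeros (left padding only).

```python
def causal_conv1d(x, weights):
    """Single-channel causal convolution over a list x.

    output[t] = sum_i weights[i] * x[t - i], with x[s] treated as 0 for s < 0,
    so the output has the same length as the input and never looks ahead.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(weights):
            s = t - i
            if s >= 0:  # implicit zero padding on the left
                acc += w * x[s]
        out.append(acc)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = [0.5, 0.5]
y = causal_conv1d(x, w)

# Changing "future" inputs (positions 3 and 4) must not change outputs 0..2.
x_future = x[:3] + [100.0, -100.0]
y_future = causal_conv1d(x_future, w)
print(y[:3] == y_future[:3])
```

The check at the end is a handy test for any causal layer: perturb the inputs after time $t$ and confirm that the outputs up to time $t$ stay the same.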

### Dilated Causal Convolutions

Dilated convolutions allow the network to have an exponentially large receptive field.

The figure below shows an architecture with dilated causal convolutions with dilation factors $d = 1, 2, 4, 8$ and filter size $k = 2$. The receptive field is able to cover all values from the input sequence.

<img src="figures/p3/dialated-causal-convolutional-layers.png" width="800" />





The dilation factor $d$ is increased exponentially with the depth of the network, i.e., $d = O(2^i)$ at level $i$.
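
The effect of the dilation schedule on the receptive field can be checked with a little arithmetic. The sketch below assumes one convolution per layer and stride 1; the helper name `receptive_field` is hypothetical.

```python
def receptive_field(kernel_size, dilations):
    """Number of input steps that one output step can see.

    Assumes a stack of dilated causal convolutions with one convolution per
    layer; a layer with dilation d extends the history by (kernel_size - 1) * d.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# The configuration described above: k = 2 and d = 1, 2, 4, 8.
print(receptive_field(2, [1, 2, 4, 8]))

# Exponential dilations versus no dilation, for increasing depth.
for depth in range(1, 9):
    dilated = receptive_field(2, [2 ** i for i in range(depth)])
    undilated = receptive_field(2, [1] * depth)
    print(depth, dilated, undilated)
```

With $k = 2$ and $d = 2^i$ the receptive field doubles with every added layer, whereas without dilation it grows by only one step per layer.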

### Residual Blocks

A residual block consists of two branches of computation:
- One branch performs a series of transformations to the input of the block. 
- The second branch is an identity mapping, i.e., it does not perform any transformation.

The output of the block is the sum of the two branches.

Suppose $\mathbf{z}^{(i-1)}$ is the input to the block, $\mathcal{F}$ describes the set of transformations performed by the residual block and $g$ is the activation function. The output of the block is then:

$$
o = g \left[ \mathbf{z}^{(i-1)} + \mathcal{F} \left(\mathbf{z}^{(i-1)} \right) \right]
$$
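
The formula can be written out as a tiny, framework-free sketch. The names `residual_block`, `relu` and the toy transformations are assumptions made for illustration; in a real TCN, $\mathcal{F}$ would be the stack of convolutional layers.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]


def residual_block(z, transform, activation=relu):
    """Compute o = g(z + F(z)): identity branch plus transformed branch."""
    fz = transform(z)
    if len(fz) != len(z):
        raise ValueError("both branches must have the same shape to be added")
    return activation([a + b for a, b in zip(z, fz)])


z = [1.0, -2.0, 3.0]

# If F outputs zeros, the block reduces to g(z): the identity path dominates.
print(residual_block(z, lambda v: [0.0 for _ in v]))

# Otherwise F only has to learn a correction on top of the identity.
print(residual_block(z, lambda v: [-0.5 * a for a in v]))
```

The shape check mirrors a practical constraint: the two branches can only be added when they produce outputs of the same shape.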

__Why use residual blocks?__
- They allow layers to learn modifications to the identity mapping rather than the entire transformation
- They have repeatedly been shown to benefit very deep networks
  - We want a deep network because the TCN's receptive field depends on the network depth


The figure illustrates what a residual block can look like.

<img src="figures/p3/residual-block-illustration.png" width="300" />








The generic TCN model uses a residual module in place of a single convolutional layer. Each module contains:
- two layers of dilated causal convolution
- a ReLU non-linearity
- weight normalization applied to the convolutional filters
- a spatial dropout after each dilated convolution for regularisation
  - at each training step, a whole channel is zeroed out
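
Spatial dropout differs from ordinary dropout in that it drops whole channels rather than individual values. A minimal sketch of the idea follows; the function name `spatial_dropout` is hypothetical, and rescaling the surviving channels by $1 / (1 - p)$ is the usual inverted-dropout convention, assumed here rather than taken from the paper.

```python
import random


def spatial_dropout(x, p, rng):
    """Zero out entire channels of x with probability p (training-time behaviour).

    x is a list of channels, each channel a list of values over time.
    Surviving channels are rescaled by 1 / (1 - p) (assumed convention).
    """
    scale = 1.0 / (1.0 - p)
    out = []
    for channel in x:
        if rng.random() < p:
            out.append([0.0] * len(channel))   # the whole channel is dropped
        else:
            out.append([v * scale for v in channel])
    return out


rng = random.Random(0)
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [10.0, 11.0, 12.0]]
for channel in spatial_dropout(x, p=0.5, rng=rng):
    print(channel)
```

Every printed channel is either all zeros or a uniformly rescaled copy of the input channel; no channel is ever partially dropped.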

### Discussion

Advantages:
- Unlike in RNNs, where predictions for later time steps must wait for their predecessors, convolutions can be done in parallel since the same filter is used across each layer
- TCNs allow for better control of the model's memory size, and are easy to adapt to different domains
- TCNs avoid the problem of exploding/vanishing gradients
- Since the filters are shared across a layer, less memory is required during training
- TCNs can handle sequential input data of arbitrary length

Disadvantages:
- TCNs require more memory during evaluation because they take in the raw sequence up to the effective history length
- A TCN may perform poorly when a network trained in one domain is used in another, for example if the new domain needs a longer history than the receptive field covers

## Sequence Modeling Tasks

The authors use a variety of sequence modeling tasks that are commonly used to benchmark the performance of recurrent architectures.

__Sequence addition:__ a toy RNN problem used to stress test sequence models.


<img src="figures/p3/problem-of-sequence-addition.png" width="250" />
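
A data generator for this task is easy to sketch. The version below follows the usual formulation of the adding problem (random values in $[0, 1]$, two marked positions, target equal to the sum of the two marked values); the function name `adding_problem_example` and the sequence length are illustrative assumptions.

```python
import random


def adding_problem_example(length, rng):
    """Generate one example of the adding problem.

    Returns (values, markers, target): values are uniform in [0, 1], markers
    are 0 everywhere except at two positions, and target is the sum of the
    two marked values.
    """
    values = [rng.random() for _ in range(length)]
    i, j = rng.sample(range(length), 2)   # two distinct positions
    markers = [0] * length
    markers[i] = markers[j] = 1
    target = values[i] + values[j]
    return values, markers, target


rng = random.Random(0)
values, markers, target = adding_problem_example(8, rng)
print(markers)
print(round(target, 3))
```

The model receives `values` and `markers` as a two-channel input sequence and must output `target`, so it has to remember the first marked value until it reaches the second.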

__Sequential MNIST__: frequently used to test a recurrent network’s ability to retain information from the distant past 

<img src="figures/p3/sequential-mnist.png" width="400" />

__P-MNIST__: a more challenging variant in which the order of the sequence is permuted at random
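
Both MNIST variants amount to a simple preprocessing step, sketched here on a tiny stand-in image. The helper names are hypothetical, and using one fixed permutation for the whole dataset is the usual convention for P-MNIST, stated here as an assumption.

```python
import random


def to_sequence(image):
    """Flatten a 2D image (list of rows) into a 1D pixel sequence, row by row."""
    return [pixel for row in image for pixel in row]


def make_permutation(length, seed):
    """Build a fixed random ordering to be reused for every example (assumption)."""
    order = list(range(length))
    random.Random(seed).shuffle(order)
    return order


def permute(sequence, order):
    """Reorder a sequence according to a fixed permutation."""
    return [sequence[i] for i in order]


# 3x3 stand-in for a 28x28 MNIST digit, which would give a 784-step sequence.
image = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
seq = to_sequence(image)
order = make_permutation(len(seq), seed=0)
print(seq)                  # sequential variant: pixels in raster order
print(permute(seq, order))  # permuted variant: same pixels, shuffled order
```

The permutation removes the local spatial structure of the image, so the model can rely even less on neighbouring time steps and must integrate information across the whole sequence.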

## Results

<img src="figures/p3/results.png" width="600" />










The results strongly suggest that the generic TCN architecture with minimal tuning outperforms canonical recurrent architectures.

Word-level language modeling: 
- On the smaller PTB corpus, an optimized LSTM architecture outperforms the TCN
- On the larger datasets (Wikitext-103 and LAMBADA), the TCN outperforms the LSTM

Character-level language modelling:
- TCN outperforms the others
- __Specialized architectures exist that outperform all of these!__