# Deep Recurrent Neural Networks

:label:`sec_deep_rnn`

Up to now, we only discussed RNNs with a single unidirectional hidden layer.
In it the specific functional form of how latent variables and observations interact is rather arbitrary.
This is not a big problem as long as we have enough flexibility to model different types of interactions.
With a single layer, however, this can be quite challenging.
In the case of the linear models,
we fixed this problem by adding more layers.
Within RNNs this is a bit trickier, since we first need to decide how and where to add extra nonlinearity.

In fact,
we could stack multiple layers of RNNs on top of each other. This results in a flexible mechanism,
due to the combination of several simple layers. In particular, data might be relevant at different levels of the stack. For instance, we might want to keep high-level data about financial market conditions (bear or bull market) available, whereas at a lower level we only record shorter-term temporal dynamics.


Beyond all the above abstract discussion 
it is probably easiest to understand the family of models we are interested in by reviewing :numref:`fig_deep_rnn`. It describes a deep RNN with $L$ hidden layers.
Each hidden state is continuously passed to both the next time step of the current layer and the current time step of the next layer.

![Architecture of a deep RNN.](https://raw.githubusercontent.com/d2l-ai/d2l-en/master/img/deep-rnn.svg)
:label:`fig_deep_rnn`

## Functional Dependencies

We can formalize the 
functional dependencies 
within the  deep architecture
of $L$ hidden layers
depicted in :numref:`fig_deep_rnn`.
Our following discussion focuses primarily on
the vanilla RNN model,
but it applies to other sequence models, too.

Suppose that we have a minibatch input
$\mathbf{X}_t \in \mathbb{R}^{n \times d}$ (number of examples: $n$, number of inputs in each example: $d$) at time step $t$.
At the same time step,
let
the hidden state of the $l^\mathrm{th}$ hidden layer  ($l=1,\ldots,L$) be $\mathbf{H}_t^{(l)}  \in \mathbb{R}^{n \times h}$ (number of hidden units: $h$)
and 
the output layer variable be $\mathbf{O}_t \in \mathbb{R}^{n \times q}$ (number of outputs: $q$).
Setting $\mathbf{H}_t^{(0)} = \mathbf{X}_t$,
the hidden state of 
the $l^\mathrm{th}$ hidden layer
that uses the activation function $\phi_l$
is expressed as follows:

$$\mathbf{H}_t^{(l)} = \phi_l(\mathbf{H}_t^{(l-1)} \mathbf{W}_{xh}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{hh}^{(l)}  + \mathbf{b}_h^{(l)}),$$
:eqlabel:`eq_deep_rnn_H`

where the weights $\mathbf{W}_{xh}^{(l)} \in \mathbb{R}^{h \times h}$ and $\mathbf{W}_{hh}^{(l)} \in \mathbb{R}^{h \times h}$, together with 
the bias $\mathbf{b}_h^{(l)} \in \mathbb{R}^{1 \times h}$, are the model parameters of
the $l^\mathrm{th}$ hidden layer.

In the end,
the calculation of the output layer is only based on the hidden state of the final $L^\mathrm{th}$ hidden layer:

$$\mathbf{O}_t = \mathbf{H}_t^{(L)} \mathbf{W}_{hq} + \mathbf{b}_q,$$

where the weight $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ are the model parameters of the output layer.

Just as with MLPs, the number of hidden layers $L$ and the number of hidden units $h$ are hyperparameters.
In other words, they can be tuned or specified by us.
In addition, we can easily
get a deep gated RNN
by replacing 
the hidden state computation in 
:eqref:`eq_deep_rnn_H`
with that from a GRU or an LSTM.


## Concise Implementation

Fortunately many of the logistical details required to implement multiple layers of an RNN are readily available in high-level APIs.
To keep things simple we only illustrate the implementation using such built-in functionalities.
Let us take an LSTM model as an example.
The code is very similar to the one we used previously in :numref:`sec_lstm`.
In fact, the only difference is that we specify the number of layers explicitly rather than picking the default of a single layer. 
As usual, we begin by loading the dataset.


In [6]:
%use @file[../djl.json]
%use lets-plot
@file:DependsOn("../D2J-1.0-SNAPSHOT.jar")
import jp.live.ugai.d2j.timemachine.RNNModelScratch
import jp.live.ugai.d2j.timemachine.TimeMachine.trainCh8
import jp.live.ugai.d2j.timemachine.TimeMachineDataset
import jp.live.ugai.d2j.timemachine.Vocab
import jp.live.ugai.d2j.RNNModel
import jp.live.ugai.d2j.util.StopWatch
import jp.live.ugai.d2j.util.Accumulator
import jp.live.ugai.d2j.util.Training

// %load ../utils/djl-imports
// %load ../utils/plot-utils
// %load ../utils/Functions.java
// %load ../utils/Functions.java
// %load ../utils/PlotUtils.java

// %load ../utils/StopWatch.java
// %load ../utils/Accumulator.java
// %load ../utils/Animator.java
// %load ../utils/Training.java
// %load ../utils/timemachine/Vocab.java
// %load ../utils/timemachine/RNNModel.java
// %load ../utils/timemachine/RNNModelScratch.java
// %load ../utils/timemachine/TimeMachine.java
// %load ../utils/timemachine/TimeMachineDataset.java
import kotlin.random.Random
import kotlin.collections.List
import kotlin.collections.Map
import kotlin.Pair

In [7]:
val manager = NDManager.newBaseManager()

In [8]:
    val batchSize = 32
    val numSteps = 35

    val dataset = TimeMachineDataset.Builder()
        .setManager(manager)
        .setMaxTokens(10000)
        .setSampling(batchSize, false)
        .setSteps(numSteps)
        .build()
    dataset.prepare()
    val vocab = dataset.vocab

The architectural decisions such as choosing hyperparameters are very similar to those of :numref:`sec_lstm`. 
We pick the same number of inputs and outputs as we have distinct tokens, i.e., `vocabSize`.
The number of hidden units is still 256.
The only difference is that we now select a nontrivial number of hidden layers by specifying the value of `numLayers`.


In [9]:
    val vocabSize = vocab!!.length()
    val numHiddens = 256
    val numLayers = 1
    val device = manager.device
    val lstmLayer = LSTM.builder()
        .setNumLayers(numLayers)
        .setStateSize(numHiddens)
        .optReturnState(true)
        .optBatchFirst(false)
        .build()

    val model = RNNModel(lstmLayer, vocabSize)

## Training and Prediction

Since now we instantiate two layers with the LSTM model, this rather more complex architecture slows down training considerably.


In [10]:
    val numEpochs = Integer.getInteger("MAX_EPOCH", 500)

    val lr = 2
    trainCh8(model, dataset, vocab, lr, numEpochs, device, false, manager)

10 : 17.39932625027811
20 : 16.089348554931536
30 : 13.4281528019319
40 : 11.66945778841635
50 : 10.820766413583
60 : 10.16574856716526
70 : 9.65346173287775
80 : 9.052177295131559
90 : 8.410114279418833
100 : 7.983969604001039
110 : 7.418158213134253
120 : 6.936154754194083
130 : 6.508763594862006
140 : 6.173950845168075
150 : 5.478487334927586
160 : 4.927928552617288
170 : 4.614484113842275
180 : 4.025784766242529
190 : 3.714780666095821
200 : 3.114169075003432
210 : 2.781974519631217
220 : 2.2522220462448153
230 : 2.01033895436346
240 : 1.5883644473157814
250 : 1.2865786346456112
260 : 1.1733677296070035
270 : 1.1509330544985235
280 : 1.0821561664757422
290 : 1.0480600673032296
300 : 1.0367754337800466
310 : 1.5776907241173883
320 : 1.0368373867540075
330 : 1.025505407940473
340 : 1.0205873428359824
350 : 1.017684478135957
360 : 1.0154552159779402
370 : 1.0141483683234964
380 : 1.0130217856532309
390 : 1.4367042499586051
400 : 1.0248978175363357
410 : 1.0167143879727454
420 : 1.0136

## Summary

* In deep RNNs, the hidden state information is passed to the next time step of the current layer and the current time step of the next layer.
* There exist many different flavors of deep RNNs, such as LSTMs, GRUs, or vanilla RNNs. Conveniently these models are all available as parts of the high-level APIs of deep learning frameworks.
* Initialization of models requires care. Overall, deep RNNs require considerable amount of work (such as learning rate and clipping) to ensure proper convergence.

## Exercises

1. Try to implement a two-layer RNN from scratch using the single layer implementation we discussed in :numref:`sec_rnn_scratch`.
2. Replace the LSTM by a GRU and compare the accuracy and training speed.
3. Increase the training data to include multiple books. How low can you go on the perplexity scale?
4. Would you want to combine sources of different authors when modeling text? Why is this a good idea? What could go wrong?
