# 7. Deep RNNs

In [1]:
import torch 
import torch.nn as nn
from torch.utils import data
from torch.nn import functional as F

import re
import collections

import math
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt

In [2]:
from d2l import torch as d2l

## Multiple Layers

The standard method for building a **deep RNN** is strikingly simple: we **stack RNN layers on top of each other**.

Each **RNN cell** depends on both the **same layer’s value at the previous time step** and the **previous layer’s value at the same time step**:

![](http://d2l.ai/_images/deep-rnn.svg)

Consider a **mini-batch** $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ at time step $t$ (batch size $n$, $d$ inputs per example), **hidden states** at the $l^\mathrm{th}$ hidden layer ($l = 1, \ldots, L$) $\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}$ ($h$ hidden units), and the corresponding **output** $\mathbf{O}_t \in \mathbb{R}^{n \times q}$ ($q$ outputs).

We set $\mathbf{H}_t^{(0)} = \mathbf{X}_t$ and let the **activation** for the $l^\mathrm{th}$ hidden layer be $\phi_l$. Then:

$$\mathbf{H}_t^{(l)} = \phi_l(\mathbf{H}_t^{(l-1)} \mathbf{W}_{xh}^{(l)} + \mathbf{H}_{t-1}^{(l)} \mathbf{W}_{hh}^{(l)}  + \mathbf{b}_h^{(l)})$$

where $\mathbf{W}_{xh}^{(l)} \in \mathbb{R}^{h \times h}$ (or $\mathbb{R}^{d \times h}$ for the first layer, $l = 1$, whose input is $\mathbf{X}_t$) and $\mathbf{W}_{hh}^{(l)} \in \mathbb{R}^{h \times h}$ are the **weights** and $\mathbf{b}_h^{(l)} \in \mathbb{R}^{1 \times h}$ is the **bias**.
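
The recurrence above can be sketched directly in PyTorch. This is a minimal illustration, not the implementation used by `nn.RNN`: the helper name `deep_rnn_step`, the sizes, the random initialization, and the choice of `tanh` for every $\phi_l$ are all assumptions made for the example. The first layer's input weight is taken as $d \times h$ because its input is $\mathbf{X}_t$.

```python
import torch

def deep_rnn_step(X_t, H_prev, params, phi=torch.tanh):
    """Advance every layer by one time step (hypothetical helper).

    X_t:    (n, d) mini-batch at time t
    H_prev: list of L tensors of shape (n, h), hidden states at time t-1
    params: list of L tuples (W_xh, W_hh, b_h), one per layer
    """
    H_new = []
    layer_input = X_t  # H_t^{(0)} = X_t
    for (W_xh, W_hh, b_h), H_prev_l in zip(params, H_prev):
        # H_t^{(l)} = phi(H_t^{(l-1)} W_xh + H_{t-1}^{(l)} W_hh + b_h)
        H_l = phi(layer_input @ W_xh + H_prev_l @ W_hh + b_h)
        H_new.append(H_l)
        layer_input = H_l  # feeds the next layer at the same time step
    return H_new

n, d, h, L = 4, 5, 8, 3  # illustrative sizes (assumed)
torch.manual_seed(0)

params = []
for l in range(L):
    in_dim = d if l == 0 else h  # only the first layer sees the raw input
    params.append((torch.randn(in_dim, h) * 0.1,
                   torch.randn(h, h) * 0.1,
                   torch.zeros(1, h)))

H = [torch.zeros(n, h) for _ in range(L)]  # zero initial hidden states
for X_t in torch.randn(6, n, d):           # 6 time steps of random input
    H = deep_rnn_step(X_t, H, params)

print([tuple(H_l.shape) for H_l in H])  # [(4, 8), (4, 8), (4, 8)]
```

Each layer keeps its own hidden state across time, while within a time step information flows upward from layer $l-1$ to layer $l$.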

The **final output** $\mathbf{O}_t$ of the network depends only on the hidden state $\mathbf{H}_t^{(L)}$ of the $L^\mathrm{th}$ layer:

$$\mathbf{O}_t = \mathbf{H}_t^{(L)} \mathbf{W}_{hq} + \mathbf{b}_q$$

where $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ is the **weight** and $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ is the **bias** of the output layer.
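
As a rough shape check for the output layer, the sketch below stacks $L$ layers with `nn.RNN` and applies an `nn.Linear` layer to the top-layer hidden states. The sizes are illustrative assumptions. Note that `nn.Linear` stores its weight as a $q \times h$ matrix and multiplies by its transpose, which matches $\mathbf{H}_t^{(L)} \mathbf{W}_{hq} + \mathbf{b}_q$ up to that convention.

```python
import torch
from torch import nn

T, n, d, h, q, L = 6, 4, 5, 8, 3, 2  # illustrative sizes (assumed)

rnn = nn.RNN(input_size=d, hidden_size=h, num_layers=L)  # tanh by default
output_layer = nn.Linear(h, q)  # plays the role of W_hq and b_q

X = torch.randn(T, n, d)  # (time steps, batch, features)

# H_top:  top-layer hidden state H_t^{(L)} for every time step t
# H_last: hidden state of every layer at the final time step
H_top, H_last = rnn(X)

# The output layer is applied independently at each time step
O = output_layer(H_top)

print(H_top.shape)   # torch.Size([6, 4, 8])
print(H_last.shape)  # torch.Size([2, 4, 8])
print(O.shape)       # torch.Size([6, 4, 3])
```

Only `H_top` enters the output computation; the lower layers influence $\mathbf{O}_t$ indirectly through the layers above them.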

## Implementation

In [None]:
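# Assumption: `vocab` comes from the d2l time-machine loader; this call and the
# batch_size/num_steps values are not part of the original cell.
batch_size, num_steps = 32, 35
train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)

vocab_size, num_hiddens, num_layers = len(vocab), 256, 2
num_inputs = vocab_size  # inputs are one-hot vectors over the vocabulary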

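# Use Apple's MPS backend when it is available, otherwise fall back to the CPU
device = torch.device('mps' if torch.backends.mps.is_available() else 'cpu')
# Stack `num_layers` LSTM layers, each with `num_hiddens` hidden units
lstm_layer = nn.LSTM(num_inputs, num_hiddens, num_layers)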

model = d2l.RNNModel(lstm_layer, vocab_size)
model = model.to(device)
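
To see what the stacked LSTM layer returns, the following sketch runs a dummy batch through an `nn.LSTM` with the same hyperparameters. It is a standalone shape check under assumptions: `vocab_size = 28`, `batch_size`, and `num_steps` are placeholder values, and random inputs stand in for one-hot encoded tokens.

```python
import torch
from torch import nn

vocab_size, num_hiddens, num_layers = 28, 256, 2  # vocab_size is a placeholder
batch_size, num_steps = 4, 10                     # hypothetical values

lstm_layer = nn.LSTM(vocab_size, num_hiddens, num_layers)

# Random stand-in for one-hot inputs: (time steps, batch, features)
X = torch.randn(num_steps, batch_size, vocab_size)

# With no initial state passed, the hidden and memory cell states start at zero
Y, (H, C) = lstm_layer(X)

print(Y.shape)  # torch.Size([10, 4, 256])  top-layer hidden state at every step
print(H.shape)  # torch.Size([2, 4, 256])   final hidden state of each layer
print(C.shape)  # torch.Size([2, 4, 256])   final memory cell state of each layer
```

The first dimension of `H` and `C` equals `num_layers`, which reflects that every layer in the stack carries its own recurrent state.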