# Recurrent Neural Networks

Simple **RNNs** have a key flaw: capturing relationships that span more than 8 or 10 time steps is practically impossible. The flaw stems from the **vanishing gradient** problem, in which the contribution of information decays geometrically over time. The opposite failure, the **exploding gradient** problem, occurs when that contribution grows geometrically instead.

**LSTM** (Long Short-Term Memory) is one option to overcome the vanishing gradient problem in RNNs.
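To get a feel for what geometric decay means here, the following is a minimal sketch. It assumes, purely for illustration, that every step back in time scales the gradient by the same constant factor (0.5 for the vanishing case, 1.5 for the exploding case). In a real network the factor differs from step to step, and the helper name `gradient_scale` is hypothetical.

```python
def gradient_scale(factor, steps):
    """Scale of a gradient after flowing back through `steps` time steps,
    assuming (hypothetically) the same multiplicative factor at every step."""
    return factor ** steps


for steps in (1, 5, 10, 20):
    vanishing = gradient_scale(0.5, steps)   # factor < 1: shrinks geometrically
    exploding = gradient_scale(1.5, steps)   # factor > 1: grows geometrically
    print(f"{steps:2d} steps back: vanishing={vanishing:.6f}  exploding={exploding:.1f}")
```

Under this assumption, a signal from 10 steps back is already scaled by less than one thousandth, which matches the rough 8 to 10 step limit mentioned above.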

### RNN vs. FFNN

There are two main differences between feedforward neural networks (FFNNs) and RNNs. Unlike an FFNN, a recurrent neural network uses:

1. **sequences** as inputs in the training phase, and
2. **memory** elements

Memory is defined as the output of the hidden layer neurons, which serves as an additional input to the network at the next time step.

### Simple RNN (aka Elman Network)

In RNNs, the output at time $t$ depends not only on the current input and the weights, but also on previous inputs. In this case the output at time $t$ is defined as:

$$\bar{y}_t = F(\bar{x}_t, \bar{x}_{t-1}, \ldots, \bar{x}_{t-t_0}, W)$$

![](images/rnn.svg)

with the **input vector** $\bar{x}$, the **output vector** $\bar{y}$ and the **state vector** $\bar{s}$.

$W_x$ is the weight matrix connecting the inputs to the **state layer**, $W_y$ is the weight matrix connecting the state layer to the output layer and $W_s$ represents the weight matrix connecting the state from the previous timestep to the state in the current timestep.

In **FFNNs** the hidden layer depends only on the current inputs and weights, as well as on an **activation function** $\Phi$:

$$\bar{h} = \Phi(\bar{x} W)$$

In **RNNs** the state layer depends on the current inputs, their corresponding weights, the activation function and **also on the previous state**:

$$\bar{h} = \bar{s}_t = \Phi(\bar{x}_t W_x + \bar{s}_{t-1} W_s)$$
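The state update can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not a reference implementation: `tanh` stands in for $\Phi$, the initial state is taken to be zero, the sizes are arbitrary, and the function name `rnn_step` is hypothetical. Vectors are rows, matching the $\bar{x}_t W_x$ ordering used above.

```python
import numpy as np


def rnn_step(x_t, s_prev, W_x, W_s, phi=np.tanh):
    """One state update: s_t = phi(x_t W_x + s_{t-1} W_s)."""
    return phi(x_t @ W_x + s_prev @ W_s)


rng = np.random.default_rng(0)
input_size, state_size = 3, 4                      # illustrative sizes
W_x = rng.normal(size=(input_size, state_size))    # input -> state
W_s = rng.normal(size=(state_size, state_size))    # previous state -> state

s = np.zeros(state_size)                           # assumed initial state s_0
for x_t in rng.normal(size=(5, input_size)):       # a sequence of 5 inputs
    s = rnn_step(x_t, s, W_x, W_s)

print(s.shape)  # (4,)
```

Setting `W_s` to zero removes the dependence on the previous state, and the update reduces to the FFNN hidden layer $\Phi(\bar{x} W)$.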

The output vector is calculated exactly as in FFNNs. It can be a linear combination of the inputs to each output node with the corresponding weight matrix $W_y$, or a **softmax function** $\sigma$ of the same linear combination:

$$\bar{y}_t = \bar{s}_t W_y \qquad \text{or} \qquad \bar{y}_t = \sigma(\bar{s}_t W_y)$$
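Both output variants can be sketched as follows. The state vector and weight values are made up for illustration, and the helper names `softmax` and `rnn_output` are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=-1, keepdims=True)


def rnn_output(s_t, W_y, use_softmax=False):
    """y_t = s_t W_y, optionally passed through softmax."""
    linear = s_t @ W_y
    return softmax(linear) if use_softmax else linear


s_t = np.array([0.5, -0.2, 0.1])       # example state vector
W_y = np.array([[1.0, 0.0],
                [0.0, 1.0],
                [1.0, 1.0]])           # example state-to-output weights

y_linear = rnn_output(s_t, W_y)                    # plain linear combination
y_prob = rnn_output(s_t, W_y, use_softmax=True)    # entries sum to 1
print(y_linear, y_prob)
```

The softmax variant is the usual choice when the output should be read as a probability distribution over classes, while the linear variant suits regression targets.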

### Backpropagation Through Time (BPTT)

For the error calculations we use a **loss function**, where $E_t$ represents the **output error** at time $t$, $\bar{d}_t$ represents the **desired output** at time $t$ and $\bar{y}_t$ represents the calculated **output** at time $t$:

$$E_t = (\bar{d}_t - \bar{y}_t)^2$$




Suppose we calculate the output with $\bar{y}_t = \bar{s}_t W_y$ and calculate the gradients at time $t = N$.

**Gradient Calculations to adjust Output Weight Matrix $W_y$**:

![](images/rnn-BPTT_Wy.svg)

$$\frac{\partial{E_N}}{\partial{W_y}} = \frac{\partial{E_N}}{\partial{\bar{y}_N}} \cdot \frac{\partial{\bar{y}_N}}{\partial{W_y}}$$
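The chain rule above can be checked numerically. The sketch below assumes the squared error is summed over the output components and that vectors are rows, so $\partial \bar{y}_N / \partial W_y$ contributes the state $\bar{s}_N$. The values are random and the function names are hypothetical.

```python
import numpy as np


def loss(s_N, W_y, d_N):
    """E_N = sum((d_N - y_N)^2) with y_N = s_N W_y."""
    y_N = s_N @ W_y
    return np.sum((d_N - y_N) ** 2)


def grad_W_y(s_N, W_y, d_N):
    """Chain rule: dE_N/dW_y = dE_N/dy_N * dy_N/dW_y."""
    y_N = s_N @ W_y
    dE_dy = -2.0 * (d_N - y_N)       # derivative of the squared error
    return np.outer(s_N, dE_dy)      # dy_N/dW_y contributes s_N


rng = np.random.default_rng(0)
s_N = rng.normal(size=3)
W_y = rng.normal(size=(3, 2))
d_N = rng.normal(size=2)

analytic = grad_W_y(s_N, W_y, d_N)

# Central finite differences as an independent check.
eps = 1e-6
numeric = np.zeros_like(W_y)
for idx in np.ndindex(W_y.shape):
    step = np.zeros_like(W_y)
    step[idx] = eps
    numeric[idx] = (loss(s_N, W_y + step, d_N) - loss(s_N, W_y - step, d_N)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))
```

Because $W_y$ only affects the output at time $N$, no accumulation over earlier time steps is needed for this gradient.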

**Gradient Calculations to adjust State Weight Matrix $W_s$**:

To find the final gradient, we **accumulate** the contributions from **all states** $\bar{s}_N, \bar{s}_{N-1}, \ldots, \bar{s}_1$:

![](images/rnn-BPTT_Ws.svg)

$$\frac{\partial{E_N}}{\partial{W_s}} = \sum_{i=1}^{N} \frac{\partial{E_N}}{\partial{\bar{y}_N}} \cdot \frac{\partial{\bar{y}_N}}{\partial{\bar{s}_i}} \cdot \frac{\partial{\bar{s}_i}}{\partial{W_s}}$$
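As a worked instance of the sum, take $N = 3$. Writing out the three terms gives:

$$\frac{\partial{E_3}}{\partial{W_s}} = \frac{\partial{E_3}}{\partial{\bar{y}_3}} \cdot \frac{\partial{\bar{y}_3}}{\partial{\bar{s}_3}} \cdot \frac{\partial{\bar{s}_3}}{\partial{W_s}} + \frac{\partial{E_3}}{\partial{\bar{y}_3}} \cdot \frac{\partial{\bar{y}_3}}{\partial{\bar{s}_2}} \cdot \frac{\partial{\bar{s}_2}}{\partial{W_s}} + \frac{\partial{E_3}}{\partial{\bar{y}_3}} \cdot \frac{\partial{\bar{y}_3}}{\partial{\bar{s}_1}} \cdot \frac{\partial{\bar{s}_1}}{\partial{W_s}}$$

Here each $\frac{\partial{\bar{s}_i}}{\partial{W_s}}$ is the direct dependence of $\bar{s}_i$ on $W_s$, with $\bar{s}_{i-1}$ held fixed. The output at time 3 depends on the earlier states only through the later ones, so the middle factors expand by the chain rule:

$$\frac{\partial{\bar{y}_3}}{\partial{\bar{s}_2}} = \frac{\partial{\bar{y}_3}}{\partial{\bar{s}_3}} \cdot \frac{\partial{\bar{s}_3}}{\partial{\bar{s}_2}} \qquad \text{and} \qquad \frac{\partial{\bar{y}_3}}{\partial{\bar{s}_1}} = \frac{\partial{\bar{y}_3}}{\partial{\bar{s}_3}} \cdot \frac{\partial{\bar{s}_3}}{\partial{\bar{s}_2}} \cdot \frac{\partial{\bar{s}_2}}{\partial{\bar{s}_1}}$$

Every step further back adds one more factor $\frac{\partial{\bar{s}_k}}{\partial{\bar{s}_{k-1}}}$ to the product. When these factors are consistently small, the contribution of early states shrinks geometrically, which is the vanishing gradient problem.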

**Gradient Calculations to adjust Input Weight Matrix $W_x$**:

To find the final gradient, we **accumulate** the contributions from **all inputs** $\bar{x}_N, \bar{x}_{N-1}, \ldots, \bar{x}_1$:

![](images/rnn-BPTT_Wx.svg)

$$\frac{\partial{E_N}}{\partial{W_x}} = \sum_{i=1}^{N} \frac{\partial{E_N}}{\partial{\bar{y}_N}} \cdot \frac{\partial{\bar{y}_N}}{\partial{\bar{s}_i}} \cdot \frac{\partial{\bar{s}_i}}{\partial{W_x}}$$
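The accumulation for $W_s$ and $W_x$ can be sketched as one backward loop and checked against finite differences. This is a minimal illustration, not a definitive BPTT implementation. It assumes `tanh` for $\Phi$, a zero initial state, a linear output $\bar{y}_N = \bar{s}_N W_y$, and a squared error summed over output components. All function names and sizes are hypothetical.

```python
import numpy as np


def forward(xs, W_x, W_s, W_y):
    """Run the RNN over xs; return the states s_0..s_N and the output y_N."""
    states = [np.zeros(W_s.shape[0])]            # s_0 = 0 (assumed)
    for x_t in xs:
        states.append(np.tanh(x_t @ W_x + states[-1] @ W_s))
    return states, states[-1] @ W_y


def loss(xs, d_N, W_x, W_s, W_y):
    """E_N = sum((d_N - y_N)^2)."""
    _, y_N = forward(xs, W_x, W_s, W_y)
    return np.sum((d_N - y_N) ** 2)


def bptt(xs, d_N, W_x, W_s, W_y):
    """Accumulate dE_N/dW_s and dE_N/dW_x over all time steps."""
    states, y_N = forward(xs, W_x, W_s, W_y)
    dW_s, dW_x = np.zeros_like(W_s), np.zeros_like(W_x)
    dE_ds = (-2.0 * (d_N - y_N)) @ W_y.T         # dE_N/ds_N
    for i in range(len(xs), 0, -1):              # i = N, N-1, ..., 1
        dE_da = dE_ds * (1.0 - states[i] ** 2)   # back through tanh
        dW_s += np.outer(states[i - 1], dE_da)   # contribution of state i
        dW_x += np.outer(xs[i - 1], dE_da)       # contribution of input i
        dE_ds = dE_da @ W_s.T                    # pass back to s_{i-1}
    return dW_s, dW_x


def numeric_grad(f, W, eps=1e-6):
    """Central finite-difference gradient of scalar f with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        step = np.zeros_like(W)
        step[idx] = eps
        grad[idx] = (f(W + step) - f(W - step)) / (2 * eps)
    return grad


rng = np.random.default_rng(0)
xs = rng.normal(size=(4, 2))                     # N = 4 inputs of size 2
d_N = rng.normal(size=3)
W_x = 0.5 * rng.normal(size=(2, 5))
W_s = 0.5 * rng.normal(size=(5, 5))
W_y = 0.5 * rng.normal(size=(5, 3))

dW_s, dW_x = bptt(xs, d_N, W_x, W_s, W_y)
num_W_s = numeric_grad(lambda W: loss(xs, d_N, W_x, W, W_y), W_s)
num_W_x = numeric_grad(lambda W: loss(xs, d_N, W, W_s, W_y), W_x)
print(np.allclose(dW_s, num_W_s, atol=1e-5), np.allclose(dW_x, num_W_x, atol=1e-5))
```

The loop body adds one term of each sum per iteration, and the last line of the loop applies the factor that carries the error one step further back in time.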

# Example using nn.RNN

In [1]:
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

In [2]:
data = torch.Tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])
print("Data: ", data.shape, "\n\n", data)

Data:  torch.Size([20]) 

 tensor([ 1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9., 10., 11., 12., 13., 14.,
        15., 16., 17., 18., 19., 20.])


In [4]:
# Number of features used as input. (Number of columns)
INPUT_SIZE = 1

# Number of previous time steps taken into account.
SEQ_LENGTH = 5

# Number of features in the last hidden state, i.e. the number of output
# time steps to predict. See image below for more clarity.
HIDDEN_SIZE = 2

# Number of stacked RNN layers.
NUM_LAYERS = 1

# We have a total of 20 rows in our input.
# We divide the input into 4 batch entries of 1 row each.
# Each row corresponds to a sequence of length 5 (4 x 5 = 20).
BATCH_SIZE = 4
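How these constants carve up the 20 values can be sketched with a plain reshape. The snippet assumes a batch-first layout of `(BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE)`, which is the layout `nn.RNN` accepts when constructed with `batch_first=True`. NumPy is used here only to show the shapes, and the variable name `batch` is hypothetical.

```python
import numpy as np

INPUT_SIZE, SEQ_LENGTH, BATCH_SIZE = 1, 5, 4

data = np.arange(1, 21, dtype=np.float32)          # the same 20 values as above
batch = data.reshape(BATCH_SIZE, SEQ_LENGTH, INPUT_SIZE)

print(batch.shape)        # (4, 5, 1)
print(batch[0].ravel())   # first sequence: the values 1 to 5
```

Each of the 4 batch entries holds one sequence of 5 time steps, and every time step carries a single feature.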