# Recurrent Neural Networks

**RNNs** have a key flaw: capturing relationships that span more than 8 or 10 steps back is practically impossible. This flaw stems from the **vanishing gradient** problem, in which the contribution of information decays geometrically over time. Its counterpart, the **exploding gradient** problem, arises when that contribution grows geometrically instead.

**LSTM** is one option to overcome the vanishing gradient problem in RNNs.
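
The geometric decay (or growth) of a gradient contribution can be illustrated with a small sketch. It assumes, purely for illustration, that every backward step multiplies the gradient by the same scalar factor; in a real network this factor varies per step and per weight.

```python
# Illustrative only: the per-step gradient factor is assumed to be a constant scalar.
def gradient_scale(factor: float, steps: int) -> float:
    """Scale of a gradient after flowing back through `steps` time steps."""
    return factor ** steps

for steps in (1, 5, 10, 20):
    vanishing = gradient_scale(0.5, steps)   # factor < 1: contribution shrinks
    exploding = gradient_scale(1.5, steps)   # factor > 1: contribution grows
    print(f"{steps:>2} steps: 0.5^t = {vanishing:.6f}, 1.5^t = {exploding:.2f}")
```

With a factor of 0.5, the contribution drops below one thousandth of its original size after 10 steps, which matches the 8 to 10 step horizon mentioned above.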

### RNN vs. FFNN

There are two main differences between FFNNs and RNNs. The Recurrent Neural Network uses:

1. **sequences** as inputs in the training phase, and
2. **memory** elements

Memory is defined as the output of the hidden layer neurons, which serves as additional input to the network during the next training step.

### Simple RNN (aka Elman Network)

![](images/rnn-FFNN-RNN.svg)

In RNNs, the output at time $t$ depends not only on the current input and the weights, but also on previous inputs. In this case the output at time $t$ is defined as:

$$\bar{y}_t = F(\bar{x}_t, \bar{x}_{t-1}, \ldots, \bar{x}_{t-t_0}, W)$$

![](images/rnn.svg)

with the **input vector** $\bar{x}$, the **output vector** $\bar{y}$ and the **state vector** $\bar{s}$.

$W_x$ is the weight matrix connecting the inputs to the **state layer**, $W_y$ is the weight matrix connecting the state layer to the output layer and $W_s$ represents the weight matrix connecting the state from the previous timestep to the state in the current timestep.

In **FFNNs** the hidden layer depends only on the current inputs and weights, as well as on a nonlinear **activation function** $\Phi$ like $\text{tanh}$ or $\text{ReLU}$:

$$\bar{h} = \Phi(\bar{x} W)$$

In **RNNs** the state layer depends on the current inputs, their corresponding weights, the activation function and **also on the previous state**:

$$\bar{h} = \bar{s}_t = \Phi(\bar{x}_t W_x + \bar{s}_{t-1} W_s)$$
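
The state update above can be sketched in a few lines of PyTorch. The sizes, the random weights and the helper name `rnn_step` are assumptions made for illustration; vectors are treated as row vectors so that the code matches the $\bar{x}_t W_x$ ordering of the formula.

```python
import torch

torch.manual_seed(0)
input_size, state_size = 3, 4                      # hypothetical sizes

W_x = torch.randn(input_size, state_size) * 0.1    # inputs -> state
W_s = torch.randn(state_size, state_size) * 0.1    # previous state -> state

def rnn_step(x_t, s_prev):
    """One Elman update: s_t = tanh(x_t W_x + s_{t-1} W_s)."""
    return torch.tanh(x_t @ W_x + s_prev @ W_s)

xs = torch.randn(5, input_size)    # a sequence of 5 input vectors
s = torch.zeros(state_size)        # initial state s_0
for x_t in xs:
    s = rnn_step(x_t, s)           # the new state feeds into the next step

print(s.shape)
```

The same `W_x` and `W_s` are reused at every time step; only the state changes.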

The output vector is calculated exactly the same as in FFNNs. It can be a linear combination of the inputs to each output node with the corresponding weight matrix $W_y$ or a **softmax function** $\sigma$ of the same linear combination:

$$\bar{y}_t = \bar{s}_t W_y \qquad \text{or} \qquad \bar{y}_t = \sigma(\bar{s}_t W_y)$$
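
Both output variants can be sketched as follows. The state vector and $W_y$ are filled with random values here, and the sizes are hypothetical.

```python
import torch

torch.manual_seed(0)
state_size, output_size = 4, 3                 # hypothetical sizes

s_t = torch.tanh(torch.randn(state_size))      # stand-in for a state vector
W_y = torch.randn(state_size, output_size)     # state -> output

y_linear = s_t @ W_y                           # linear combination
y_softmax = torch.softmax(y_linear, dim=0)     # softmax of the same combination

print(y_linear)
print(y_softmax, y_softmax.sum())
```

The softmax variant yields non-negative values that sum to 1, which is why it is the usual choice when the outputs are interpreted as class probabilities.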

### Backpropagation Through Time (BPTT)

For the error calculations we use a **loss function**, where $E_t$ represents the **output error** at time $t$, $\bar{d}_t$ represents the **desired output** at time $t$ and $\bar{y}_t$ represents the calculated **output** at time $t$:

$$E_t = (\bar{d}_t - \bar{y}_t)^2$$




Suppose we calculate the output with $\bar{y}_t = \bar{s}_t W_y$ and calculate the gradients at time $t = N$.

**Gradient Calculations to adjust Output Weight Matrix $W_y$**:

![](images/rnn-BPTT_Wy.svg)

$$\frac{\partial{E_N}}{\partial{W_y}} = \frac{\partial{E_N}}{\partial{\bar{y}_N}} \cdot \frac{\partial{\bar{y}_N}}{\partial{W_y}}$$
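
As a sanity check, the chain rule for $W_y$ can be compared against PyTorch's autograd. This sketch assumes the squared error is summed over the output components and uses random, hypothetical values for $\bar{s}_N$, $\bar{d}_N$ and $W_y$.

```python
import torch

torch.manual_seed(0)
s_N = torch.randn(4)                          # state at time N
d_N = torch.randn(2)                          # desired output at time N
W_y = torch.randn(4, 2, requires_grad=True)   # state -> output

y_N = s_N @ W_y
E_N = ((d_N - y_N) ** 2).sum()
E_N.backward()                                # autograd result lands in W_y.grad

# chain rule by hand: dE/dy = -2 (d - y), and dy_j/dW_ij = s_i
dE_dy = -2 * (d_N - y_N.detach())
analytic = torch.outer(s_N, dE_dy)

print(torch.allclose(W_y.grad, analytic, atol=1e-6))
```

Only the state at time $N$ appears here, because $W_y$ acts after the recurrence.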

**Gradient Calculations to adjust State Weight Matrix $W_s$**:

To find the final gradient, we **accumulate** the contributions from **all states**: $\bar{s}_N, \bar{s}_{N-1}, \ldots, \bar{s}_1$:

![](images/rnn-BPTT_Ws.svg)

$$\frac{\partial{E_N}}{\partial{W_s}} = \sum_{i=1}^{N} \frac{\partial{E_N}}{\partial{\bar{y}_N}} \cdot \frac{\partial{\bar{y}_N}}{\partial{\bar{s}_i}} \cdot \frac{\partial{\bar{s}_i}}{\partial{W_s}}$$
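
The factor $\frac{\partial{\bar{y}_N}}{\partial{\bar{s}_i}}$ in this sum hides a chain of state-to-state derivatives. As a sketch, assuming $\frac{\partial{\bar{s}_i}}{\partial{W_s}}$ denotes only the direct dependence of $\bar{s}_i$ on $W_s$ (with $\bar{s}_{i-1}$ held fixed), it can be expanded as:

$$\frac{\partial{\bar{y}_N}}{\partial{\bar{s}_i}} = \frac{\partial{\bar{y}_N}}{\partial{\bar{s}_N}} \cdot \prod_{j=i+1}^{N} \frac{\partial{\bar{s}_j}}{\partial{\bar{s}_{j-1}}}$$

For $N = 3$ the sum then reads:

$$\frac{\partial{E_3}}{\partial{W_s}} = \frac{\partial{E_3}}{\partial{\bar{y}_3}} \cdot \frac{\partial{\bar{y}_3}}{\partial{\bar{s}_3}} \cdot \left( \frac{\partial{\bar{s}_3}}{\partial{W_s}} + \frac{\partial{\bar{s}_3}}{\partial{\bar{s}_2}} \cdot \frac{\partial{\bar{s}_2}}{\partial{W_s}} + \frac{\partial{\bar{s}_3}}{\partial{\bar{s}_2}} \cdot \frac{\partial{\bar{s}_2}}{\partial{\bar{s}_1}} \cdot \frac{\partial{\bar{s}_1}}{\partial{W_s}} \right)$$

Each additional factor $\frac{\partial{\bar{s}_j}}{\partial{\bar{s}_{j-1}}}$ combines the derivative of the activation function with $W_s$, so contributions from early states are multiplied by many such factors. This is where the geometric decay or growth of the gradient comes from.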

**Gradient Calculations to adjust Input Weight Matrix $W_x$**:

To find the final gradient, we **accumulate** the contributions from **all inputs**: $\bar{x}_N, \bar{x}_{N-1}, \ldots, \bar{x}_1$:

![](images/rnn-BPTT_Wx.svg)

$$\frac{\partial{E_N}}{\partial{W_x}} = \sum_{i=1}^{N} \frac{\partial{E_N}}{\partial{\bar{y}_N}} \cdot \frac{\partial{\bar{y}_N}}{\partial{\bar{s}_i}} \cdot \frac{\partial{\bar{s}_i}}{\partial{W_x}}$$
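
The accumulation for $W_s$ and $W_x$ can be checked numerically. The sketch below assumes a scalar RNN with $\tanh$ activation (scalar weights `w_x`, `w_s`, `w_y` with arbitrary values), accumulates the contributions step by step from $t = N$ back to $t = 1$, and compares the result with autograd.

```python
import torch

torch.manual_seed(0)
N = 6
dt = torch.float64
x = torch.randn(N, dtype=dt)                       # x_1 ... x_N (x_i is x[i-1])
d_N = torch.tensor(0.5, dtype=dt)                  # desired output at t = N
w_x = torch.tensor(0.7, dtype=dt, requires_grad=True)
w_s = torch.tensor(0.9, dtype=dt, requires_grad=True)
w_y = torch.tensor(1.2, dtype=dt, requires_grad=True)

# forward pass, keeping every state s_0 ... s_N
states = [torch.tensor(0.0, dtype=dt)]             # s_0
for t in range(N):
    states.append(torch.tanh(w_x * x[t] + w_s * states[-1]))
y_N = w_y * states[-1]
E_N = (d_N - y_N) ** 2
E_N.backward()                                     # reference gradients

# manual backpropagation through time
with torch.no_grad():
    grad_wx = torch.tensor(0.0, dtype=dt)
    grad_ws = torch.tensor(0.0, dtype=dt)
    carry = -2 * (d_N - y_N) * w_y                 # dE_N/ds_N
    for i in range(N, 0, -1):
        local = 1 - states[i] ** 2                 # derivative of tanh at step i
        grad_wx += carry * local * x[i - 1]        # contribution via input x_i
        grad_ws += carry * local * states[i - 1]   # contribution via state s_{i-1}
        carry = carry * local * w_s                # dE_N/ds_{i-1}

print(torch.allclose(grad_wx, w_x.grad), torch.allclose(grad_ws, w_s.grad))
```

The `carry` variable is multiplied by `local * w_s` once per step, which is the repeated factor responsible for vanishing or exploding gradients.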

### Example torch.nn.RNN

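```python
import torch
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers):
        super().__init__()
        self.hidden_size = hidden_size
        # batch_first=True: inputs and outputs have shape (batch, seq_len, features)
        self.rnn = nn.RNN(input_size, hidden_size, num_layers, batch_first=True)
        # map every hidden state to a single output value
        self.output = nn.Linear(hidden_size, 1)

    def forward(self, x, hidden):
        out, hidden = self.rnn(x, hidden)
        # flatten to (batch * seq_len, hidden_size);
        # reshape also works when `out` is not contiguous
        out = out.reshape(-1, self.hidden_size)
        out = self.output(out)
        return out, hidden
```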

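```python
# sizes are illustrative; they match the hyperparameters used later in this section
model = RNN(input_size=1, hidden_size=32, num_layers=1)
# the model has a single output unit, so a regression loss is assumed here
# (nn.CrossEntropyLoss expects one output unit per class)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

epochs = 1
for epoch in range(1, epochs + 1):
    hidden = None
    # `batches` is assumed to yield (inputs, targets) tensor pairs;
    # the body below is an assumed standard training step
    for inputs, targets in batches:
        optimizer.zero_grad()
        outputs, hidden = model(inputs, hidden)
        hidden = hidden.detach()  # do not backpropagate into earlier batches
        loss = criterion(outputs, targets.reshape(-1, 1))
        loss.backward()
        optimizer.step()
```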

### Example using torch.nn.RNN

As an example, we have a **sequence** with **input size = 1**, since each element has only one dimension:
```
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
```

We split it into **4 sequences** of **sequence length = 5**, which together form a single batch of size 4:

```
[[1, 2, 3, 4, 5],
 [6, 7, 8, 9, 10],
 [11, 12, 13, 14, 15],
 [16, 17, 18, 19, 20]]
```

Our aim is to look at the 5 (`seq_len`) previous values to **predict the next 2 values**.

```python
import numpy as np

import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
```

```python
data = torch.Tensor([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20])

# input shape = (batch, seq_len, input_size)
inputs = data.view(4, 5, 1)

# initialize the RNN
rnn = nn.RNN(input_size=1, hidden_size=2, num_layers=1, batch_first=True)

# out shape = (batch, seq_len, hidden_size)
# h_n shape = (num_layers, batch, hidden_size)
out, h_n = rnn(inputs)
```

```python
print(inputs.shape)  # torch.Size([4, 5, 1])
print(out.shape)     # torch.Size([4, 5, 2])
print(h_n.shape)     # torch.Size([1, 4, 2])
```
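
The two return values overlap: for a single-layer, unidirectional `nn.RNN`, `h_n` holds the hidden state of the last time step, which is also the last slice of `out` along the sequence dimension. A minimal, self-contained sketch that rebuilds the same inputs and checks this:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
inputs = torch.arange(1.0, 21.0).view(4, 5, 1)   # (batch, seq_len, input_size)
rnn = nn.RNN(input_size=1, hidden_size=2, num_layers=1, batch_first=True)
out, h_n = rnn(inputs)

last_step = out[:, -1, :]     # output at the final time step: (batch, hidden_size)
final_state = h_n[0]          # final state of layer 0:        (batch, hidden_size)

print(torch.allclose(last_step, final_state))
```

With several layers, `out` only contains the states of the top layer, while `h_n` contains the final state of every layer.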

```python
class SimpleRNN(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super().__init__()

        self.hidden_dim = hidden_dim

        # define an RNN with specified parameters
        # batch_first means that the first dim of the input and output will be the batch_size
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)

        # last, fully-connected layer
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x, hidden):
        # x (batch_size, seq_length, input_size)
        # hidden (n_layers, batch_size, hidden_dim)
        # r_out (batch_size, seq_length, hidden_dim)

        # get RNN outputs
        r_out, hidden = self.rnn(x, hidden)
        # shape output to be (batch_size*seq_length, hidden_dim);
        # reshape also works when r_out is not contiguous
        r_out = r_out.reshape(-1, self.hidden_dim)

        # get final output
        output = self.fc(r_out)

        return output, hidden
```


```python
# decide on hyperparameters
input_size = 1
output_size = 1
hidden_dim = 32
n_layers = 1

# instantiate an RNN
rnn = SimpleRNN(input_size, output_size, hidden_dim, n_layers)
print(rnn)
```
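
To see how the shapes change inside such a model, the sketch below runs one forward pass through the same two layers (`nn.RNN` followed by `nn.Linear`) without redefining the class. It assumes the example batch of 4 sequences of length 5 from above and passes `None` as the initial hidden state, which `nn.RNN` treats as zeros.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
input_size, output_size, hidden_dim, n_layers = 1, 1, 32, 1

rnn_layer = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)
fc = nn.Linear(hidden_dim, output_size)

x = torch.arange(1.0, 21.0).view(4, 5, 1)     # (batch, seq_len, input_size)
r_out, hidden = rnn_layer(x, None)            # None -> zero initial state
flat = r_out.reshape(-1, hidden_dim)          # (batch * seq_len, hidden_dim)
output = fc(flat)                             # (batch * seq_len, output_size)

print(r_out.shape, flat.shape, output.shape, hidden.shape)
```

The linear layer is applied to every time step of every sequence, so the output has one row per (sequence, time step) pair.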