![](pics/header.png)

# Deep Learning: Recurrent Neural Network (RNN)

Kevin Walchko

---

RNNs look at sequences of data and incorporate memory into a neural network so that it can capture temporal dependencies. RNNs are typically associated with text processing, but the input can be any sequence of data, even the drawing of a line!

- Feedforward networks (MLPs and CNNs) cannot capture temporal dependencies
- Modeling temporal data is important, since things like speech and video have dependencies across time
    - [Vanishing gradient problem](https://en.wikipedia.org/wiki/Vanishing_gradient_problem) made it difficult to capture relationships 8 or 10 time steps back
    - [LSTMs](http://www.bioinf.jku.at/publications/older/2604.pdf) overcome vanishing gradients
- Voice assistants such as Siri, Alexa, and Google Assistant are examples of applications that have used RNNs
- OpenAI's [Dota 2 bot](https://openai.com/blog/dota-2/)


You can even combine CNNs and RNNs together:

![](pics/rnn/cnn-rnn.png)

## Feedforward Summary

1. Training Phase
    1. Feedforward: 
        - $h_m = \phi \left( \sum_{i=1}^n x_i W_{im} \right)$
            - $i$ is the ith input neuron from previous layer
            - $n$ total input neurons in the previous layer
            - $m$ is the mth output neuron in the current layer
            - computing all $m$ outputs from $n$ inputs takes $n \times m$ multiplications
        - $\text{Error} = \text{LossFunction} = \frac{1}{2}(\text{desired} - \text{output})^2$
            - MSE typically used in regression problems
            - cross entropy typically used in classification problems
    1. Backpropagation
        - $W_{new} = W_{previous} + \alpha \left(- \frac {\partial E}{\partial W} \right)$
1. Evaluation
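
The feedforward and backpropagation steps above can be sketched for a single layer in NumPy. This is a minimal illustration rather than a full training loop: the layer sizes, the `tanh` activation standing in for $\phi$, the learning rate `alpha`, and the random data are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 3, 2                      # assumed sizes: n inputs, m outputs
x = rng.normal(size=n)           # one input sample
W = rng.normal(size=(n, m))      # W[i, j] connects input i to output j
desired = np.array([0.5, -0.5])  # assumed target values
alpha = 0.1                      # assumed learning rate


def forward(W):
    # h_m = phi(sum_i x_i W_im), with tanh standing in for phi
    return np.tanh(x @ W)


def loss(W):
    # E = 1/2 * sum over outputs of (desired - output)^2
    return 0.5 * np.sum((desired - forward(W)) ** 2)


# Backpropagation for this single layer via the chain rule:
# dE/dW_im = -(desired_m - h_m) * (1 - h_m^2) * x_i
h = forward(W)
grad = np.outer(x, -(desired - h) * (1.0 - h ** 2))

# Gradient descent update: W_new = W_previous + alpha * (-dE/dW)
W_new = W - alpha * grad

print(loss(W), loss(W_new))
```

The matrix product `x @ W` performs the $n \times m$ multiplications of the layer in one call.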

# RNN

<table> 
    <tr>
        <td><img src="pics/rnn/rnn-neuron.png" width="60%"></td>
        <td><img src="pics/rnn/rnn-neurons.png"></td>
    </tr>
</table>

Two main differences between RNN and feedforward networks (MLP or CNN):

- sequences as inputs in the training phase *instead* of random samples like we did for feedforward networks
- memory elements that remember previous inputs
    - These are generally denoted as $S$ for state, since they have memory
    - $W_s$ is typically the weight matrix for the state information that feeds back on itself
    - This is manifested as output from another hidden layer using data from a previous time step
    - $y_t = \sigma (s_t W_y)$ where $s_t = \Phi (x_t W_x + s_{t-1} W_s)$, $\sigma$ is the softmax function (if used) and $x_t$ is the input at time t
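
The state update and output equations above can be sketched in NumPy as a loop over time steps. This is a minimal illustration, not a trained model: the layer sizes, the random weights, and the use of `tanh` for $\Phi$ are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

input_size, state_size, output_size = 2, 4, 3   # assumed sizes
W_x = rng.normal(scale=0.5, size=(input_size, state_size))
W_s = rng.normal(scale=0.5, size=(state_size, state_size))
W_y = rng.normal(scale=0.5, size=(state_size, output_size))


def softmax(z):
    e = np.exp(z - z.max())      # subtract the max for numerical stability
    return e / e.sum()


def rnn_forward(xs, s0):
    """Run the recurrence over a sequence xs of shape (T, input_size)."""
    s, states, outputs = s0, [], []
    for x_t in xs:
        # s_t = Phi(x_t W_x + s_{t-1} W_s), with tanh standing in for Phi
        s = np.tanh(x_t @ W_x + s @ W_s)
        # y_t = sigma(s_t W_y), with sigma the softmax
        outputs.append(softmax(s @ W_y))
        states.append(s)
    return np.array(states), np.array(outputs)


xs = rng.normal(size=(5, input_size))            # a sequence of 5 time steps
states, outputs = rnn_forward(xs, np.zeros(state_size))
print(states.shape, outputs.shape)   # (5, 4) (5, 3)
```

Because `s` is carried from one iteration to the next, the output at step $t$ depends on all earlier inputs, not just on `x_t`.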

# Training

- Backpropagation through time (BPTT)
- Use mini-batches to train RNNs to reduce complexity and remove noise from weight updates
- $\text{Error}_t = \text{LossFunction} = (\text{desired}_t - \text{output}_t)^2$, i.e. use MSE as the loss
    - $\delta_{ij} = \frac{1}{M} \sum_{k=1}^M \delta_{ij_k}$, where $\delta_{ij_k}$ is the gradient calculated for the $k$th input sample and $M$ is the number of samples in each batch
    - If we backpropagate through too many time steps (more than ~10), the gradient contributions become too small and we get the **vanishing gradient problem**
- $W_x$ is weight matrix connecting inputs to state
- $W_s$ is the weight matrix connecting state from one time step to another
- $W_y$ is the weight matrix connecting the state to the output

$$
\frac {\partial E_N}{\partial W_y} =
\frac {\partial E_N}{\partial y_N}
\frac {\partial y_N}{\partial W_y} \\
\frac {\partial E_N}{\partial W_s} = \sum_{i=1}^N
\frac {\partial E_N}{\partial y_N}
\frac {\partial y_N}{\partial s_i}
\frac {\partial s_i}{\partial W_s} \\
\frac {\partial E_N}{\partial W_x} = \sum_{i=1}^N
\frac {\partial E_N}{\partial y_N}
\frac {\partial y_N}{\partial s_i}
\frac {\partial s_i}{\partial W_x} \\
\frac {\partial y}{\partial W} = \sum_{i=t+1}^{t+N} 
\frac {\partial y}{\partial s_{t+N}}
\frac {\partial s_{t+N}}{\partial s_i}
\frac {\partial s_{i}}{\partial W}
$$
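
The sum over time steps in $\frac{\partial E_N}{\partial W_s}$ can be checked numerically on a tiny scalar RNN. This is a sketch under stated assumptions: the scalar weights `w_x`, `w_s`, `w_y`, the input sequence, the target `d`, a `tanh` state activation, and a linear output $y_N = w_y s_N$ are all chosen only for illustration.

```python
import numpy as np

w_x, w_s, w_y = 0.7, -0.5, 1.2          # assumed scalar weights
xs = np.array([0.5, -1.0, 0.3, 0.8])    # assumed input sequence, N = 4
d = 0.25                                # assumed target for the last step


def run(w_s):
    """Forward pass; returns states s_0..s_N (s_0 = 0) and the error E_N."""
    s = [0.0]
    for x_t in xs:
        s.append(np.tanh(w_x * x_t + w_s * s[-1]))
    y_N = w_y * s[-1]
    return s, (d - y_N) ** 2


s, E_N = run(w_s)
N = len(xs)

dE_dy = -2.0 * (d - w_y * s[N])      # dE_N/dy_N
dy_ds = w_y                          # dy_N/ds_N; updated to dy_N/ds_i below
grad_w_s = 0.0
for i in range(N, 0, -1):
    # immediate partial ds_i/dW_s, holding s_{i-1} fixed
    ds_dw = (1.0 - s[i] ** 2) * s[i - 1]
    grad_w_s += dE_dy * dy_ds * ds_dw
    # move one time step back: dy_N/ds_{i-1} = dy_N/ds_i * ds_i/ds_{i-1}
    dy_ds *= (1.0 - s[i] ** 2) * w_s

# central finite difference as an independent check
eps = 1e-6
numeric = (run(w_s + eps)[1] - run(w_s - eps)[1]) / (2 * eps)
print(grad_w_s, numeric)
```

Each pass through the loop multiplies `dy_ds` by another factor of $(1 - s_i^2) w_s$, which has magnitude below 1 here. This repeated shrinking is the mechanism behind the vanishing gradient problem.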

### Example BPTT

Given the RNN in the first image, we want to find $\frac {\partial E_{t+1}}{\partial U}$, the gradient used to update the weight matrix $U$ at time $t+1$.

<table> 
    <tr>
        <td><img src="pics/rnn/example.png"></td>
        <td><img src="pics/rnn/1.png"></td>
        <td><img src="pics/rnn/2.png"></td>
        <td><img src="pics/rnn/3.png"></td>
    </tr>
</table>

Tracing back through the network, there are three paths we need to take to calculate the update for the weights. First, looking at the second picture:

$$
A =
\frac {\partial E_{t+1}}{\partial y_{t+1}}
\frac {\partial y_{t+1}}{\partial z_{t+1}}
\frac {\partial z_{t+1}}{\partial s_{t+1}}
\frac {\partial s_{t+1}}{\partial U}
$$

Next, looking at the third picture:

$$
B =
\frac {\partial E_{t+1}}{\partial y_{t+1}}
\frac {\partial y_{t+1}}{\partial z_{t+1}}
\frac {\partial z_{t+1}}{\partial s_{t+1}}
\frac {\partial s_{t+1}}{\partial s_{t}}
\frac {\partial s_{t}}{\partial U}
$$

Finally, looking at the fourth picture:

$$
C =
\frac {\partial E_{t+1}}{\partial y_{t+1}}
\frac {\partial y_{t+1}}{\partial z_{t+1}}
\frac {\partial z_{t+1}}{\partial z_{t}}
\frac {\partial z_{t}}{\partial s_{t}}
\frac {\partial s_{t}}{\partial U}
$$

Total solution is:

$$
\frac {\partial E_{t+1}}{\partial U} = A + B + C
$$
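
The three-path sum can be checked on a scalar version of the network. The sketch below assumes an architecture inferred from the partial derivatives above: two stacked recurrent layers $s$ and $z$, `tanh` activations, a linear output, and a squared error. Only $U$ is named in the text; the weights `W`, `V`, `R`, `Q` and all numeric values are hypothetical, and the states at $t-1$ are treated as constants so that the trace stops after one step back.

```python
import math

# Hypothetical scalar weights; only U is named in the text above.
U, W, V, R, Q = 0.6, -0.4, 0.9, 0.3, 1.1
x_t, x_t1 = 0.5, -0.8        # inputs at times t and t+1
s_prev, z_prev = 0.1, -0.2   # states at t-1, treated as constants
d = 0.4                      # desired output at t+1


def forward(U):
    s_t = math.tanh(U * x_t + W * s_prev)
    z_t = math.tanh(V * s_t + R * z_prev)
    s_t1 = math.tanh(U * x_t1 + W * s_t)
    z_t1 = math.tanh(V * s_t1 + R * z_t)
    y_t1 = Q * z_t1
    return s_t, z_t, s_t1, z_t1, y_t1, (d - y_t1) ** 2


s_t, z_t, s_t1, z_t1, y_t1, E = forward(U)

# Local derivatives, using tanh'(a) = 1 - tanh(a)^2
dE_dy = -2.0 * (d - y_t1)
dy_dz1 = Q
dz1_ds1 = (1 - z_t1 ** 2) * V
dz1_dz0 = (1 - z_t1 ** 2) * R
ds1_dU = (1 - s_t1 ** 2) * x_t1     # direct dependence of s_{t+1} on U
ds1_ds0 = (1 - s_t1 ** 2) * W
dz0_ds0 = (1 - z_t ** 2) * V
ds0_dU = (1 - s_t ** 2) * x_t

A = dE_dy * dy_dz1 * dz1_ds1 * ds1_dU
B = dE_dy * dy_dz1 * dz1_ds1 * ds1_ds0 * ds0_dU
C = dE_dy * dy_dz1 * dz1_dz0 * dz0_ds0 * ds0_dU
total = A + B + C

# central finite difference on U as an independent check
eps = 1e-6
numeric = (forward(U + eps)[-1] - forward(U - eps)[-1]) / (2 * eps)
print(total, numeric)
```

The finite-difference value matches `A + B + C` only when all three paths are included, which is why no single path gives the full gradient.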

## PyTorch

A typical loss function and optimizer for an RNN:
    
```python
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(rnn.parameters(), lr=0.01) 
```

```python
import torch
from torch import nn
import numpy as np
from helper import summary
```

```python
class RNN(nn.Module):
    def __init__(self, input_size, output_size, hidden_dim, n_layers):
        super(RNN, self).__init__()

        self.hidden_dim=hidden_dim

        # define an RNN with specified parameters
        # batch_first means that the first dim of the input and output will be the batch_size
        self.rnn = nn.RNN(input_size, hidden_dim, n_layers, batch_first=True)

        # last, fully-connected layer
        self.fc = nn.Linear(hidden_dim, output_size)

    def forward(self, x, hidden):
        # x (batch_size, seq_length, input_size)
        # hidden (n_layers, batch_size, hidden_dim)
        # r_out (batch_size, time_step, hidden_size)
        batch_size = x.size(0)

        # get RNN outputs
        r_out, hidden = self.rnn(x, hidden)
        # shape output to be (batch_size*seq_length, hidden_dim)
        r_out = r_out.view(-1, self.hidden_dim)

        # get final output
        output = self.fc(r_out)

        return output, hidden

rnn = RNN(input_size=1, output_size=1, hidden_dim=10, n_layers=2)
summary(rnn)
```

```text
Layer (type (var_name))                  Kernel Shape              Param #
RNN                                      --                        --
├─RNN (rnn)                              --                        350
├─Linear (fc)                            [10, 1]                   11
Total params: 361
Trainable params: 361
Non-trainable params: 0
```
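
The parameter count and tensor shapes for the recurrent layer can be reproduced with `nn.RNN` alone. This is a small sketch using the same sizes as the model above. The name `rnn_layer` and the random input of 20 time steps are assumptions made for illustration.

```python
import torch
from torch import nn

# Same sizes as the model above: input_size=1, hidden_dim=10, n_layers=2
rnn_layer = nn.RNN(input_size=1, hidden_size=10, num_layers=2, batch_first=True)

# Parameter count: layer 1 has 10*1 + 10*10 + 10 + 10 = 130,
# layer 2 has 10*10 + 10*10 + 10 + 10 = 220
n_params = sum(p.numel() for p in rnn_layer.parameters())
print(n_params)   # 350

# Assumed toy input: a batch of 1 sequence with 20 time steps
x = torch.randn(1, 20, 1)
r_out, hidden = rnn_layer(x, None)   # None means an initial hidden state of zeros
print(r_out.shape)    # torch.Size([1, 20, 10])
print(hidden.shape)   # torch.Size([2, 1, 10])
```

With `batch_first=True`, `r_out` is ordered as (batch, sequence, hidden), while `hidden` keeps the (layers, batch, hidden) layout.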


## RNN Training

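```python
# assumes rnn, criterion and optimizer from the PyTorch section above
hidden = None
n_steps = 75     # assumed number of training steps
seq_length = 20  # assumed number of time steps per sequence

for step in range(n_steps):
    # get data ... create x and y
    # assumed example data: predict the next point of a sine wave
    time_steps = np.linspace(step * np.pi, (step + 1) * np.pi, seq_length + 1)
    data = np.sin(time_steps).reshape(seq_length + 1, 1)
    x, y = data[:-1], data[1:]

    # convert data into Tensors
    # unsqueeze adds the batch dimension: x_tensor is (1, seq_length, input_size)
    x_tensor = torch.tensor(x, dtype=torch.float32).unsqueeze(0)
    y_tensor = torch.tensor(y, dtype=torch.float32)

    # outputs from the rnn
    # hidden comes from the previous time through the loop ...
    # this is the memory of the RNN
    prediction, hidden = rnn(x_tensor, hidden)

    ## Representing Memory ##
    # detach the hidden state from its history
    # this way, we don't backpropagate through the entire history
    hidden = hidden.detach()

    # calculate the loss
    loss = criterion(prediction, y_tensor)
    # zero gradients
    optimizer.zero_grad()
    # perform backprop and update weights
    loss.backward()
    optimizer.step()
```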
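
The effect of detaching the hidden state can be seen in isolation. This is a minimal sketch with assumed toy sizes and a hypothetical `rnn_layer`, separate from the model above. It only shows that `detach()` keeps the values and drops the autograd history.

```python
import torch
from torch import nn

torch.manual_seed(0)
rnn_layer = nn.RNN(input_size=1, hidden_size=4, batch_first=True)  # assumed toy sizes

x = torch.randn(1, 5, 1)
_, hidden = rnn_layer(x, None)
print(hidden.grad_fn is not None)   # True: still attached to the graph

values_before = hidden.clone()
hidden = hidden.detach()
print(hidden.grad_fn is None)       # True: history dropped, values kept

# The next forward/backward pass now stops at the detached state.
out, new_hidden = rnn_layer(x, hidden)
out.sum().backward()
```

Without the detach step, each call to `backward()` would try to reach back through the graphs of earlier iterations, which PyTorch has normally already freed.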