$$
\newcommand{\mat}[1]{\boldsymbol {#1}}
\newcommand{\mattr}[1]{\boldsymbol {#1}^\top}
\newcommand{\matinv}[1]{\boldsymbol {#1}^{-1}}
\newcommand{\vec}[1]{\boldsymbol {#1}}
\newcommand{\vectr}[1]{\boldsymbol {#1}^\top}
\newcommand{\rvar}[1]{\mathrm {#1}}
\newcommand{\rvec}[1]{\boldsymbol{\mathrm{#1}}}
\newcommand{\diag}{\mathop{\mathrm {diag}}}
\newcommand{\set}[1]{\mathbb {#1}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\pderiv}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\bb}[1]{\boldsymbol{#1}}
$$

# CS236605: Deep Learning
# Tutorial 5: Recurrent Neural Networks

## Introduction

In this tutorial, we will cover:

TODO

In [1]:
# Setup
%matplotlib inline
import os
import sys
import torch
import matplotlib.pyplot as plt

plt.rcParams['font.size'] = 20
data_dir = os.path.expanduser('~/.pytorch-datasets')

## Theory Reminders

Thus far, our models have been composed of fully connected (linear) layers or convolutional layers.

- Fully connected layers
    - Each layer $l$ operates on the output of the previous layer ($\vec{y}_{l-1}$) and calculates,
        $$
        \vec{y}_l = \varphi\left( \mat{W}_l \vec{y}_{l-1} + \vec{b}_l \right),~
        \mat{W}_l\in\set{R}^{n_{l}\times n_{l-1}},~ \vec{b}_l\in\set{R}^{n_l}.
        $$
    - FC's have completely pre-fixed input and output dimensions.
    
    <img src="img/mlp.png" />

- Convolutional layers
    - Each layer operates on an input tensor $\vec{x}$ containing $M$ feature maps. The $k$-th feature map of the output tensor $\vec{y}$ is:
        $$
        \vec{y}^k = \sum_{m=1}^{M} \vec{w}^{km}\ast\vec{x}^m+b^k,\ k\in[1,K]
        $$
      Where $\ast$ denotes convolution, and $K$ is the number of output feature maps.
      
      <img src="img/cnn_filters.png" width="500"/>
    - This time the weight dimensions are not dependent on the input dimensions.
    - Weights are shared across the spatial dimensions of the input.
    - Output dimension changes based on input dimension.


However,
- Models based on these types of layers lack **persistent state**. 
- The current output is not affected by **previous inputs** (or outputs).

How can we model a dynamical system?
E.g., a linear system such as
$$\vec{y}_t = a_0 + a_1 \vec{y}_{t-1}+\dots+a_P \vec{y}_{t-P} + b_0 \vec{x}_t+\dots+b_{t-Q}\vec{x}_{t-Q}$$

Many use cases and examples: text translation, sentiment classification, scene analysis in video, etc.

## Recurrent layers

An RNN layer is similar to a regular FC layer, but it has two inputs:
- Current sample, $\vec{x}_t \in\set{R}^{d_{i}}$.
- Previous **state**, $\vec{h}_{t-1}\in\set{r}^{d_{h}}$.

and it produces two outputs which depend on both:
- Current layer output, $\vec{y}_t\in\set{R}^{d_o}$.
- Current **state**, $\vec{h}_{t}\in\set{r}^{d_{h}}$.

<img src="img/rnn_cell.png" width="300"/>

Crucially,
- The function $\varphi(\cdot)$ itself is not time-dependent (but is parametrized).
- The same layer (function) is applied at successive time steps, propagating the hidden state.

A basic RNN can be defined as follows.

$$
\begin{align}
\forall t \geq 0:\\
\vec{h}_t &= \varphi_h\left( \mat{W}_{hh} \vec{h}_{t-1} + \mat{W}_{xh} \vec{x}_t + \vec{b}_h\right) \\
\vec{y}_t &= \varphi_y\left(\mat{W}_{hy}\vec{h}_t + \vec{b}_y \right)
\end{align}
$$

where,
- $\vec{x}_t \in\set{R}^{d_{i}}$ is the input at time $t$.
- $\vec{h}_{t-1}\in\set{R}^{d_{h}}$ is the **hidden state** of a fixed dimension.
- $\vec{y}_t\in\set{R}^{d_o}$ is the output at time $t$.
- $\mat{W}_{hh}\in\set{R}^{d_h\times d_h}$, $\mat{W}_{xh}\in\set{R}^{d_h\times d_i}$, $\mat{W}_{hy}\in\set{R}^{d_o\times d_h}$, $\vec{b}_h\in\set{R}^{d_h}$ 
- $\varphi_h$ and $\varphi_y$ are some non-linear functions. In many cases $\varphi_y$ is not used.

and $\vec{b}_y\in\set{R}^{d_o}$ are the model weights and biases.

### Modeling time-dependence

If we imagine **unrolling** a single RNN layer through time,
<img src="img/rnn_unrolled.png" width="800" />

We can see how late outputs can now be influenced by early inputs, through the hidden state.

How would **backpropagation** work, though?

### Layered RNN

RNNs layers can be stacked to build a deep RNN model.

<img src="img/rnn_layered.png" width="800"/>

- As with MLPs, adding depth allows us to model intricate hierarchical features.
- However, now we also have a time dimension which makes the representation time-dependent.

## RNN Implementation

Based on the above equaitions, let's create a simple layer RNN with PyTorch.

In [80]:
import torch.nn as nn

class RNNLayer(nn.Module):
    def __init__(self, in_dim, h_dim, out_dim, phi_h=torch.tanh, phi_y=torch.sigmoid):
        super().__init__()
        self.phi_h, self.phi_y = phi_h, phi_y
        
        self.fc_xh = nn.Linear(in_dim, h_dim, bias=False)
        self.fc_hh = nn.Linear(h_dim, h_dim, bias=True)
        self.fc_hy = nn.Linear(h_dim, out_dim, bias=True)
        
    def forward(self, xt, h_prev=None):
        if h_prev is None:
            h_prev = torch.zeros(xt.shape[0], self.fc_hh.in_features)
        
        ht = self.phi_h(self.fc_xh(xt) + self.fc_hh(h_prev))
        
        yt = self.fc_hy(ht)
        
        if self.phi_y is not None:
            yt = self.phi_y(yt)
        
        return yt, ht
        

In [85]:
# Instantiate our model

N = 3 # batch size
in_dim, h_dim, out_dim = 1024, 128, 1

rnn = RNNLayer(in_dim, h_dim, out_dim)
rnn

RNNLayer(
  (fc_xh): Linear(in_features=1024, out_features=128, bias=False)
  (fc_hh): Linear(in_features=128, out_features=128, bias=True)
  (fc_hy): Linear(in_features=128, out_features=1, bias=True)
)

In [86]:
# Manually "run" a few time steps

# t=1
x1 = torch.randn(N, in_dim)
y1, h1 = rnn(x1)
print(f'y1: {y1}')

# t=2
x2 = torch.randn(N, in_dim)
y2, h2 = rnn(x2, h1)
print(f'y2: {y2}')

# t=3
x3 = torch.randn(N, in_dim)
y3, h3 = rnn(x3, h2)
print(f'y3: {y3}')

y1: tensor([[0.4339],
        [0.4022],
        [0.5431]], grad_fn=<SigmoidBackward>)
y2: tensor([[0.4161],
        [0.4258],
        [0.4816]], grad_fn=<SigmoidBackward>)
y3: tensor([[0.5188],
        [0.5181],
        [0.5371]], grad_fn=<SigmoidBackward>)


In [5]:
print(y3.shape, h3.shape)

torch.Size([3, 1]) torch.Size([3, 128])


## Application: Sentiment analysis

In [168]:
from torchtext import data
from torchtext import datasets


from torch.utils.data.dataset import Subset
import numpy as np

#train_data, test_data = datasets.IMDB.splits(TEXT, LABEL, root=data_dir)


In [169]:
TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField()#(dtype=torch.float)

train_data, valid_data, test_data = datasets.SST.splits(TEXT, LABEL, root=data_dir)


In [170]:
print(f'Number of training examples: {len(train_data)}')
print(f'Number of validation examples: {len(valid_data)}')
print(f'Number of testing examples: {len(test_data)}')

Number of training examples: 8544
Number of validation examples: 1101
Number of testing examples: 2210


In [171]:
print(vars(train_data[111]))
print(vars(train_data[7777]))

{'text': ['The', 'film', 'aims', 'to', 'be', 'funny', ',', 'uplifting', 'and', 'moving', ',', 'sometimes', 'all', 'at', 'once', '.'], 'label': 'positive'}
{'text': ['An', 'ugly', ',', 'revolting', 'movie', '.'], 'label': 'negative'}


In [172]:
# Vocab simply gives each unique work (token) an index an maps it to a 1-hot vector
MAX_VOCAB_SIZE = 25_000
TEXT.build_vocab(train_data, max_size=MAX_VOCAB_SIZE)
LABEL.build_vocab(train_data)

In [173]:
print(f"Unique tokens in TEXT vocabulary: {len(TEXT.vocab)}")
print(f"Unique tokens in LABEL vocabulary: {len(LABEL.vocab)}")

print(TEXT.vocab.freqs.most_common(20))
print(TEXT.vocab.itos[:10])

print(LABEL.vocab.stoi)

Unique tokens in TEXT vocabulary: 17200
Unique tokens in LABEL vocabulary: 3
[('.', 8041), (',', 7131), ('the', 6087), ('and', 4474), ('of', 4446), ('a', 4423), ('to', 3024), ('-', 2739), ("'s", 2544), ('is', 2540), ('that', 1916), ('in', 1817), ('it', 1781), ('The', 1265), ('as', 1203), ('film', 1156), ('but', 1076), ('with', 1071), ('movie', 999), ('for', 977)]
['<unk>', '<pad>', '.', ',', 'the', 'and', 'of', 'a', 'to', '-']
defaultdict(<function _default_unk_index at 0x127e27ea0>, {'positive': 0, 'negative': 1, 'neutral': 2})


In [177]:
# data iterators

BATCH_SIZE = 4

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data), 
    batch_size = BATCH_SIZE,
    shuffle=True,
    device = device)


batch = next(iter(train_iterator))

# batch.text is a sentence_length by batch_size tensor
print(batch.text)
print(batch.label)

tensor([[ 6455,    69,  1805,    15],
        [  496,  1277,    38,    81],
        [   12,  3896,   670,  1562],
        [   85,    35,    24,    25],
        [   68,    33,   566,   202],
        [ 1099,     4,    53,   148],
        [16380,   140,     6,    16],
        [   19,     3,  7845,     4],
        [    4,    34,  3431,  1513],
        [  108,     4,     5,  1947],
        [    6,    83,  4319,    11],
        [  149,    35,  1639,    87],
        [    5,    11,     6,  1453],
        [ 7598,     7,  1494, 13573],
        [    7,   325,     2,    21],
        [   39,    60,     1,    80],
        [ 8624,   657,     1,  2207],
        [ 2486,     3,     1, 13568],
        [  974,    71,     1,     2],
        [  192,    14,     1,     1],
        [  205,    10,     1,     1],
        [16345,  1277,     1,     1],
        [   82,  3896,     1,     1],
        [    6,    12,     1,     1],
        [   47,    25,     1,     1],
        [ 1687,   261,     1,     1],
        [ 82

In [178]:
class SentimentRNN(nn.Module):
    def __init__(self, in_dim, embedding_dim, h_dim, out_dim):
        super().__init__()
        
        # Embedding converts from token index to dense tensor
        self.embedding = nn.Embedding(in_dim, embedding_dim)
        
        # Our custom RNN layer without phi_y outputs a class score
        self.rnn = RNNLayer(embedding_dim, h_dim, out_dim, phi_y=None)
        
        # To convert class scores to log-probability
        self.log_softmax = nn.LogSoftmax(dim=0)
        
    def forward(self, text):

        #text = [sent len, batch size]
        
        embedded = self.embedding(text)
        
        #embedded = [sent len, batch size, emb dim]
        
        # Loop over tokens in the sentence
        ht = None
        for xt in embedded:
            yt, ht = self.rnn(xt, ht)
        
        #output = [sent len, batch size, hid dim]
        #hidden = [1, batch size, hid dim]
        #assert torch.equal(output[-1,:,:], hidden.squeeze(0))
        #return self.fc(hidden.squeeze(0))
        yt_log_proba = self.log_softmax(yt)
        return yt_log_proba

In [179]:
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 3

model = SentimentRNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)
model

SentimentRNN(
  (embedding): Embedding(17200, 100)
  (rnn): RNNLayer(
    (fc_xh): Linear(in_features=100, out_features=256, bias=False)
    (fc_hh): Linear(in_features=256, out_features=256, bias=True)
    (fc_hy): Linear(in_features=256, out_features=3, bias=True)
  )
  (log_softmax): LogSoftmax()
)

In [180]:
print(model(batch.text))
print(batch.label)

tensor([[-1.4530, -1.2225, -1.5000],
        [-1.3820, -1.4393, -1.3328],
        [-1.3566, -1.4515, -1.3604],
        [-1.3566, -1.4516, -1.3604]], grad_fn=<LogSoftmaxBackward>)
tensor([0, 0, 0, 1])


In [181]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 1,812,163 trainable parameters


In [188]:
import torch.optim as optim

# Training
optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.9, nesterov=True)
loss_fn = nn.NLLLoss()
model = SentimentRNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM).to(device)


In [198]:
for epoch_idx in range(10):
    
    epoch_loss = 0
    num_correct = 0
    max_batches = 200
    
    for batch_idx, batch in enumerate(train_iterator):
        x, y = batch.text, batch.label
        
        optimizer.zero_grad()
                
        y_pred_log_proba = model(x)
        
        loss = loss_fn(y_pred_log_proba, y)
        
        loss.backward()
        
        optimizer.step()
        
        epoch_loss += loss.item()
        
        y_pred = torch.argmax(y_pred_log_proba, dim=1)
        num_correct += torch.sum(y_pred == y).float().item()
        
        if batch_idx == max_batches-1:
            break
        
    print(f"Epoch #{epoch_idx}, loss={epoch_loss /(max_batches)}, accuracy={num_correct /(max_batches)}")

Epoch #0, loss=1.4017422533035278, accuracy=1.59
Epoch #1, loss=1.4019185101985931, accuracy=1.54
Epoch #2, loss=1.40732672393322, accuracy=1.495
Epoch #3, loss=1.403628532886505, accuracy=1.575
Epoch #4, loss=1.3962826710939407, accuracy=1.635
Epoch #5, loss=1.398765323162079, accuracy=1.63
Epoch #6, loss=1.391654960513115, accuracy=1.53
Epoch #7, loss=1.3938999819755553, accuracy=1.635
Epoch #8, loss=1.40216306746006, accuracy=1.52
Epoch #9, loss=1.404263315796852, accuracy=1.515


In [197]:
torch.sum(y_pred==y).float()/4

tensor(0.2500)

**Image credits**

Images in this tutorial were taken and/or adapted from:

- Fundamentals of Deep Learning, Nikhil Buduma, Oreilly 2017