# 4 Recurrent Neural Networks


## 4.1 Vanilla RNN

<img src="images/rnn.png" style="height:300px">

In general can be described by 2 equations:  
$$ h^{(t)} = f(h^{(t-1)}, x^{(t)}; \theta)$$
$$ o^{(t)} = g(h^{(t-1)}, \theta)$$

where $x^{(t)}$ - input at time $t$,  
$h^{(t)}$ - hidden state at time $t$,  
$o^{(t)}$ - output at time $t$,  
$f,g$ - some functions, parametrized by learnable weights $\theta$  

For Vanilla RNN usually:
$$h^{(t)} = tanh( b + W_{hh} h^{(t-1)}  + W_{xh} x^{(t)})$$
$$o^{(t)} = softmax( c + W_{ho} h^{(t)})$$

If not specified, $h^{(0)}$ is initialized with zeros.

## 4.2 Training RNN

Backpropagation Through Time (BPTT)

Because every next step of RNN becomes dependent on the previous step, it means that gradient estimation becomes dependent on the length of the sequence. Weights of RNN are shared across steps and gradients from every time step will be summed.

For example, for $T=3$:
$$\frac {\partial L} {\partial W_{hh}} = \frac {\partial L} {\partial h^{(3)}} \frac {\partial h^{(3)}} {\partial W_{hh}} + \frac {\partial L} {\partial h^{(2)}} \frac {\partial h^{(2)}} {\partial W_{hh}} + \frac {\partial L} {\partial h^{(1)}} \frac {\partial h^{(1)}} {\partial W_{hh}}$$

To make gradient estimation more managable, in BPTT we introduce a sliding window, where we compute gradients and make weight updates.

<img src="images/bptt1.png" style="height:300px">

<img src="images/bptt2.png" style="height:300px">

## 4.4 Problems with Vanilla RNN

Vanilla RNNs are subject to exploding and vanishing gradients, which makes training unstable.

Look closely on 
$$h^{(t)} = tanh( b + W_{hh} h^{(t-1)}  + W_{xh} x^{(t)})$$

Let's try to compute $\frac {\partial L} {\partial h}$.  
For the sake of simplicity, lets change activation function $tanh \rightarrow 1$, which make sense, if we make the approximationa around $0$, where $tanh$ has linear behavior. Also, suppose that only the last time step is used for loss computation, so $L$ only depends on $h^{(N)}$.

So, 
$$h^{(t)} = b + W_{hh} h^{(t-1)}  + W_{xh} x^{(t)}$$

$$\frac {\partial L} {\partial h} = \frac {\partial L} {\partial h^{(N)}} ( W_{hh} \frac {\partial h^{(N)}} {\partial h^{(N-1)}} )  ( W_{hh} \frac {\partial h^{(N-1)}} {\partial h^{(N-2)}} ) ... ( W_{hh} \frac {\partial h^{(1)}} {\partial h^{(0)}} ) = \frac {\partial L} {\partial h^{(N)}} W_{hh}^T $$

Because $W_{hh}^T$ is a power of square matrix, we can calculate it by eigendecomposition: 

$$W_{hh}^N  = U \Lambda^N U^T$$
, 
where once again, $U$ - unitary matrix, and $\Lambda$ - diagonal matrix

Now, 
$$\frac {\partial L} {\partial h} =  \frac {\partial L} {\partial h^{(N)}} U \Lambda^N U^T $$

Our gradient becomes highly dependent on the conditional number of $cond(W_{hh}) = \frac {\max(\lambda_ii)} {\min(\lambda_ii)}$ (relation between maximumal and minimal eigenvalue of $W_{hh}$).

For example, if we have $cond(W_{hh}) > 1$, then we get exploding gradientds. Because 
$$\lim_{N \rightarrow \infty} {1.001}^N = \infty$$

<img src="images/exploding.png" style="height:300px">

Problem with exploding gradient is that we can move too far from our optimal solution. 
Exploding gradients can be managed by **gradient clipping**.
See `torch.nn.utils.clip_grad`

If we have $cond(W_{hh}) < 1$, then we have vanishing gradients. Because 
$$\lim_{N \rightarrow \infty} {0.999}^N = 0$$
Problem with vanishing gradient is that we can think, that the optimal solution is found. Remember, that in a local minimum gradient is also zero.

To address the problem of vanishing gradients the following architectures were developed. (There are more of them, but theese are the most commonly used)

## 4.2 Long Short-Term Memory (LSTM)

<img src="images/lstm.png" style="height:300px">


<img src="images/lstm2.png" style="height:300px">

### Description

<img src="images/lstm_desc.svg" style="height:300px">

$f_t$ - forget gate,  
$i_t$ - input gate,  
$o_t$ - output gate,  
$c_t$ - cell memory,  
$h_t$ - hidden state

Compared to Vanilla RNN, cell state in LSTM allows "free" gradient propagation without reccurent relations.

There are **4 times more parameters** in LSTM cell, then in RNN cell.


## 4.3 GRU RNN

<img src="images/gru.png" style="height:300px">

### Description

<img src="images/gru_desc.png" style="height:300px">

$r_t$ - record gate,  
$z_t$ - forget gate,    
$h_t$ - hidden state

* Combines input gate and forget gate
* merges cell state and hidden state

There are **3 times more parameters** in GRU cell, then in RNN cell.

## 4.4 Bidirectional RNN

<img src="images/bi.png" style="height:300px">

`torch.nn.LSTM(.., bidirectional=True, ...)`


Stacking RNNs:

<img src="images/stack.jpeg" style="height:300px">

`torch.nn.LSTM(.., num_layers=2, ...)`

# Notes

* There are 2 implementations of LSTM in pytorch: `torch.nn.LSTM` and `torch.nn.LSTMCell`.
`torch.nn.LSTM` is CuDNN optimized version, where all training and BPTT are handled by the library. It also allows you to make bidirectional, stack layers,  or add dropout between layers. But you won't have access to intermidiate hidden and output states.

`torch.nn.LSTMCell` is a basic version of LSTM, all training and BPTT must be managed by you.
Same holds for GRU.

* Though, in theory LSTM(GRU) can manage very long sequences, in practice it is not so good. Direction of RNN matters, in general last time steps will have more influence on the result. Consider Bidirectional if possible.

* By default, sequence padding is handled by you. The library will not recognize padded symbols, and will process them as if they are part of the data. It may affect yout model perfomance and quality.
Consider `torch.nn.utils.rnn.pack_padded_sequence` to specify length of each sequence.

* Usually, RNN is more computationally expensive then CNN, because it cannot be parallelized.

In [1]:
# RNN for classification

In [79]:
import pandas as pd
import numpy as np
from sklearn.externals import joblib
import nltk
import gensim
import spacy
from tqdm import tqdm_notebook

from sklearn import metrics

import torch as tt
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

from torchtext.data import Field, LabelField, BucketIterator, ReversibleField



SEED = 42
np.random.seed(SEED)

In [31]:
!head Tweets.csv

tweet_id,airline_sentiment,airline,retweet_count,text
570306133677760513,neutral,Virgin America,0,@VirginAmerica What @dhepburn said.
570301130888122368,positive,Virgin America,0,@VirginAmerica plus you've added commercials to the experience... tacky.
570301083672813571,neutral,Virgin America,0,@VirginAmerica I didn't today... Must mean I need to take another trip!
570301031407624196,negative,Virgin America,0,"@VirginAmerica it's really aggressive to blast obnoxious ""entertainment"" in your guests' faces &amp; they have little recourse"
570300817074462722,negative,Virgin America,0,@VirginAmerica and it's a really big bad thing about it
570300767074181121,negative,Virgin America,0,"@VirginAmerica seriously would pay $30 a flight for seats that didn't have this playing.
it's really the only bad thing about flying VA"
570300616901320704,positive,Virgin America,0,"@VirginAmerica yes, nearly every time I fly VX this “ear worm” won’t go away :)"
570300248553349120,neutral,Virgin Am

In [32]:
import spacy


spacy_en = spacy.load('en')
spacy_en.remove_pipe('tagger')
spacy_en.remove_pipe('ner')

def tokenizer(text): # create a tokenizer function
    return [tok.lemma_ for tok in spacy_en.tokenizer(text) if tok.text.isalpha()]            

In [33]:
classes={
    'negative':0,
    'neutral':1,
    'positive':2
}

TEXT = Field(include_lengths=True, batch_first=True, 
             tokenize=tokenizer,
             eos_token='<eos>',
             lower=True,
             stop_words=nltk.corpus.stopwords.words('english')
            )
LABEL = LabelField(dtype=tt.int64, use_vocab=True, preprocessing=lambda x: classes[x])

dataset = TabularDataset('Tweets.csv', format='csv', 
                         fields=[(None, None),('label', LABEL), (None, None),(None, None),('text', TEXT)], 
                         skip_header=True)

In [34]:
# TEXT.build_vocab(dataset, min_freq=10, vectors="glove.6B.100d")
TEXT.build_vocab(dataset, min_freq=5)
len(TEXT.vocab.itos)

2329

In [35]:
TEXT.vocab.itos[:10]

['<unk>',
 '<pad>',
 '<eos>',
 'flight',
 '-pron-',
 'get',
 'thank',
 'hour',
 'cancelled',
 'service']

In [36]:
LABEL.build_vocab(dataset)

In [37]:
train, test = dataset.split(0.7, stratified=True)
train, valid = train.split(0.7, stratified=True)

In [38]:
np.unique([x.label for x in train.examples], return_counts=True)

(array([0, 1, 2]), array([4498, 1518, 1158]))

In [39]:
np.unique([x.label for x in valid.examples], return_counts=True)

(array([0, 1, 2]), array([1927,  651,  496]))

In [40]:
np.unique([x.label for x in test.examples], return_counts=True)

(array([0, 1, 2]), array([2753,  930,  709]))

In [47]:
class MyModel(nn.Module):
    
    def __init__(self, vocab_size, embed_size, hidden_size):
        super(MyModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        
        self.rnn = nn.LSTM(input_size=embed_size,
                           hidden_size=hidden_size,
                           bidirectional=True,
                           batch_first=True,
                          )
        
        self.fc = nn.Linear(hidden_size * 2 *2, 3)
        
    def forward(self, batch):
        
        x, x_lengths = batch.text
        
        x = self.embedding(x)

        if x_lengths is not None:
            x_lengths = x_lengths.view(-1).tolist()
            x = nn.utils.rnn.pack_padded_sequence(x, x_lengths, batch_first=True)
            
        _, (hidden, cell) = self.rnn(x)
        
        hidden = hidden.transpose(0,1)
        cell = cell.transpose(0,1)
        hidden = hidden.contiguous().view(hidden.size(0),-1)
        cell = cell.contiguous().view(cell.size(0),-1)
        x = tt.cat([hidden, cell], dim=1).squeeze(1)
        x = self.fc(x)
        return x

In [57]:
# tt.cuda.empty_cache()

batch_size = 32

model = MyModel(len(TEXT.vocab.itos),
                embed_size=100,
                hidden_size=128,
               )

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(batch_size, batch_size, batch_size),
    shuffle=True,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
)

optimizer = optim.Adam(model.parameters())
# scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, verbose=True, cooldown=5)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)
criterion = nn.CrossEntropyLoss()

In [58]:
def _train_epoch(model, iterator, optimizer, criterion, curr_epoch):

    model.train()

    running_loss = 0

    n_batches = len(iterator)
    iterator = tqdm_notebook(iterator, total=n_batches, desc='epoch %d' % (curr_epoch), leave=True)

    for i, batch in enumerate(iterator):
        optimizer.zero_grad()

        pred = model(batch)
        loss = criterion(pred, batch.label)
        loss.backward()
        optimizer.step()

        curr_loss = loss.data.cpu().detach().item()
        
        loss_smoothing = i / (i+1)
        running_loss = loss_smoothing * running_loss + (1 - loss_smoothing) * curr_loss

        iterator.set_postfix(loss='%.5f' % running_loss)

    return running_loss

def _test_epoch(model, iterator, criterion):
    model.eval()
    epoch_loss = 0

    n_batches = len(iterator)
    with tt.no_grad():
        for batch in iterator:
            pred = model(batch)
            loss = criterion(pred, batch.label)
            epoch_loss += loss.data.item()

    return epoch_loss / n_batches


def nn_train(model, train_iterator, valid_iterator, criterion, optimizer, n_epochs=100,
          scheduler=None, early_stopping=0):

    prev_loss = 100500
    es_epochs = 0
    best_epoch = None
    history = pd.DataFrame()

    for epoch in range(n_epochs):
        train_loss = _train_epoch(model, train_iterator, optimizer, criterion, epoch)
        valid_loss = _test_epoch(model, valid_iterator, criterion)

        valid_loss = valid_loss
        print('validation loss %.5f' % valid_loss)

        record = {'epoch': epoch, 'train_loss': train_loss, 'valid_loss': valid_loss}
        history = history.append(record, ignore_index=True)

        if early_stopping > 0:
            if valid_loss > prev_loss:
                es_epochs += 1
            else:
                es_epochs = 0

            if es_epochs >= early_stopping:
                best_epoch = history[history.valid_loss == history.valid_loss.min()].iloc[0]
                print('Early stopping! best epoch: %d val %.5f' % (best_epoch['epoch'], best_epoch['valid_loss']))
                break

            prev_loss = min(prev_loss, valid_loss)

In [59]:
nn_train(model, train_iterator, valid_iterator, criterion, optimizer, scheduler=scheduler, 
        n_epochs=10, early_stopping=2)

HBox(children=(IntProgress(value=0, description='epoch 0', max=225, style=ProgressStyle(description_width='ini…

validation loss 0.66278


HBox(children=(IntProgress(value=0, description='epoch 1', max=225, style=ProgressStyle(description_width='ini…

validation loss 0.64014


HBox(children=(IntProgress(value=0, description='epoch 2', max=225, style=ProgressStyle(description_width='ini…

validation loss 0.65757


HBox(children=(IntProgress(value=0, description='epoch 3', max=225, style=ProgressStyle(description_width='ini…

validation loss 0.75362
Early stopping! best epoch: 1 val 0.64014


# RNN for autoregression

In [194]:
from torchtext import datasets


TEXT = ReversibleField(use_vocab=True, 
             include_lengths=True, 
             batch_first=True,
             init_token='<start>', eos_token='<end>',
             lower=True,
            )

TAG = ReversibleField(use_vocab=True, 
             include_lengths=False, 
             batch_first=True,
             init_token='<start>', eos_token='<end>',
            )


train, valid, test = datasets.CoNLL2000Chunking.splits([('text', TEXT), ('label', TAG)])

In [195]:
TEXT.build_vocab(train, valid, test, min_freq=5)
TAG.build_vocab(train, valid, test, min_freq=5)

In [196]:
TEXT.vocab.itos[:10]

[' UNK ', '<pad>', '<start>', '<end>', ',', 'the', '.', 'of', 'to', 'a']

In [197]:
TAG.vocab.itos[:10]

[' UNK ', '<pad>', '<start>', '<end>', 'NN', 'IN', 'NNP', 'DT', 'NNS', 'JJ']

In [198]:
class MyModel(nn.Module):
    
    def __init__(self, vocab_size, target_vocab_size, embed_size, hidden_size):
        super(MyModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        
        self.rnn = nn.LSTM(input_size=embed_size,
                           hidden_size=hidden_size,
                           bidirectional=True,
                           batch_first=True,
                          )
        
        self.fc = nn.Linear(hidden_size * 2, target_vocab_size)
        
        self.init_weights()
        
    def init_weights(self):
        nn.init.uniform_(self.embedding.weight)
        nn.init.xavier_uniform_(self.fc.weight)
        nn.init.zeros_(self.fc.bias)
        
    def forward(self, batch):
        
        x, x_lengths = batch.text
        batch_size = x.size(0)
        total_length = x.size(1)
        
        x = self.embedding(x)

        if x_lengths is not None:
            x_lengths = x_lengths.view(-1).tolist()
            x = nn.utils.rnn.pack_padded_sequence(x, x_lengths, batch_first=True)
            
        x, _ = self.rnn(x)
        
        x, _ = nn.utils.rnn.pad_packed_sequence(x, total_length=total_length, batch_first=True)
        
        x = x.contiguous().view(batch_size * total_length, -1)
        x = self.fc(x)
        x = x.contiguous().view(batch_size , total_length, -1)
        return x.transpose(1,2)

In [199]:
# tt.cuda.empty_cache()

batch_size = 32

model = MyModel(vocab_size=len(TEXT.vocab.itos),
                target_vocab_size=len(TAG.vocab.itos),
                embed_size=100,
                hidden_size=128,
               )

train_iterator, valid_iterator, test_iterator = BucketIterator.splits(
    (train, valid, test),
    batch_sizes=(batch_size, batch_size, batch_size),
    shuffle=True,
    sort_key=lambda x: len(x.text),
    sort_within_batch=True,
)

optimizer = optim.Adam(model.parameters())
# scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=5, verbose=True, cooldown=5)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=5)

# padding does not count into loss
criterion = nn.CrossEntropyLoss(ignore_index=1)

In [200]:
nn_train(model, train_iterator, valid_iterator, criterion, optimizer, scheduler=scheduler, 
        n_epochs=10, early_stopping=2)

HBox(children=(IntProgress(value=0, description='epoch 0', max=252, style=ProgressStyle(description_width='ini…

validation loss 0.60068


HBox(children=(IntProgress(value=0, description='epoch 1', max=252, style=ProgressStyle(description_width='ini…

validation loss 0.29925


HBox(children=(IntProgress(value=0, description='epoch 2', max=252, style=ProgressStyle(description_width='ini…

validation loss 0.23225


HBox(children=(IntProgress(value=0, description='epoch 3', max=252, style=ProgressStyle(description_width='ini…

validation loss 0.20152


HBox(children=(IntProgress(value=0, description='epoch 4', max=252, style=ProgressStyle(description_width='ini…

validation loss 0.19936


HBox(children=(IntProgress(value=0, description='epoch 5', max=252, style=ProgressStyle(description_width='ini…

validation loss 0.18026


HBox(children=(IntProgress(value=0, description='epoch 6', max=252, style=ProgressStyle(description_width='ini…

validation loss 0.17999


HBox(children=(IntProgress(value=0, description='epoch 7', max=252, style=ProgressStyle(description_width='ini…

validation loss 0.17590


HBox(children=(IntProgress(value=0, description='epoch 8', max=252, style=ProgressStyle(description_width='ini…

validation loss 0.17829


HBox(children=(IntProgress(value=0, description='epoch 9', max=252, style=ProgressStyle(description_width='ini…

validation loss 0.17607
Early stopping! best epoch: 7 val 0.17590


In [201]:
# dirty hack
def reverse(self, batch):
#     if self.use_revtok:
#         try:
#             import revtok
#         except ImportError:
#             print("Please install revtok.")
#             raise
    if not self.batch_first:
        batch = batch.t()
    with tt.cuda.device_of(batch):
        batch = batch.tolist()
    batch = [[self.vocab.itos[ind] for ind in ex] for ex in batch]  # denumericalize

    def trim(s, t):
        sentence = []
        for w in s:
            if w == t:
                break
            sentence.append(w)
        return sentence

    batch = [trim(ex, self.eos_token) for ex in batch]  # trim past frst eos

    def filter_special(tok):
        return tok not in (self.init_token, self.pad_token)

    batch = [filter(filter_special, ex) for ex in batch]
#     if self.use_revtok:
#         return [revtok.detokenize(ex) for ex in batch]
    return [' '.join(ex) for ex in batch]

TAG.reverse = reverse
TEXT.reverse = reverse

In [202]:
for batch in test_iterator:
    pred = model(batch)
    pred = tt.softmax(pred, dim=1)
    pred = tt.argmax(pred, dim=1)
    pred_tags = TAG.reverse(TAG, pred)
    true_tags = TAG.reverse(TAG, batch.label)
    true_text = TEXT.reverse(TEXT, batch.text[0])
    
    for i in range(len(pred_tags)):
        print(i)
        print('text: ', true_text[i])
        print('pred tags: ', pred_tags[i])
        print('true tags: ', true_tags[i])
        print()
        
    break

0
text:  corporate , other issues
pred tags:  JJ , JJ NNS
true tags:  JJ , JJ NNS

1
text:  short-term rates increased .
pred tags:  JJ NNS VBN .
true tags:  JJ NNS VBN .

2
text:  oct.  UNK  1989 :
pred tags:  NNP NNP CD :
true tags:  NNP CD CD :

3
text:  --  UNK  brown .
pred tags:  : NNP NNP .
true tags:  : NNP NNP .

4
text:  --  UNK   UNK  .
pred tags:  : NNP NNP .
true tags:  : NNP NNP .

5
text:  brown 's story :
pred tags:  NNP POS NN :
true tags:  NNP POS NN :

6
text:  he was  UNK  .
pred tags:  PRP VBD JJ .
true tags:  PRP VBD JJ .

7
text:  merck & co .
pred tags:  NNP CC NNP .
true tags:  NNP CC NNP .

8
text:  small talk :
pred tags:  JJ NN :
true tags:  NNP NNP :

9
text:   UNK   UNK  :
pred tags:  NNP NNP :
true tags:  NNP NNPS :

10
text:  clearly not .
pred tags:  RB RB .
true tags:  RB RB .

11
text:  warner-lambert co .
pred tags:  NNP NNP .
true tags:  NNP NNP .

12
text:  markets --
pred tags:  NNS :
true tags:  NNS :

13
text:  newspapers :
pred tags:  NNS :
tru

In [191]:
pred[24]

tensor([ 4, 27, 33, 33])

In [192]:
TAG.vocab.itos

[' UNK ',
 '<pad>',
 'NN',
 'IN',
 'NNP',
 'DT',
 'NNS',
 'JJ',
 ',',
 '.',
 'CD',
 'VBD',
 'RB',
 'VB',
 'CC',
 'TO',
 'VBN',
 'VBZ',
 'PRP',
 'VBG',
 'VBP',
 'MD',
 'PRP$',
 'POS',
 '$',
 '``',
 "''",
 ':',
 'WDT',
 'JJR',
 'WP',
 'WRB',
 'NNPS',
 'JJS',
 'RBR',
 ')',
 '(',
 'EX',
 'RBS',
 'RP',
 'PDT',
 '#',
 'FW',
 'WP$',
 'UH',
 'SYM']