# Lecture 11: Introduction to RNNs
## 0. Overview
- All about RNNs
- LSTM, GRU
- Applications of RNNs
- RNNs in TensorFlow: implementation tricks & treats
- Live demo of Language Modeling


## 1. RNNs
### 1.1 From feed-forward to RNNs
- feed-forward network
- [Recurrent neural network (RNN)](https://en.wikipedia.org/wiki/Recurrent_neural_network)
    - Take advantage of sequential information in data (text, genomes, spoken words, etc.)
    - Contain directed cycles, so the hidden state from one step feeds into the next
    - Share weights across all time steps, which reduces the total number of parameters
    - Form the backbone of NLP
    - Can also be used for images
![rnn_structure_from_nature](figures/11_01.png)

### 1.2 Simple Recurrent Neural Network (SRNN)
![SRNN](figures/11_02.png)
- Introduced by Jeffrey Elman in 1990 (*Elman, Jeffrey L. "Finding structure in time." Cognitive science 14.2 (1990): 179-211*) 
- aka Elman Network 
![SRNNwiki](figures/11_03.png)
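
The recurrence behind a simple RNN can be sketched in plain Python. This is a minimal illustration, assuming the common formulation $h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1} + b_h)$; the function names, the toy weights, and the choice of `tanh` are assumptions made for this example and are not taken from the lecture.

```python
import math

def elman_step(x, h_prev, W_xh, W_hh, b_h):
    """One step of a simple RNN: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)."""
    h = []
    for i in range(len(b_h)):
        total = b_h[i]
        total += sum(W_xh[i][j] * x[j] for j in range(len(x)))
        total += sum(W_hh[i][k] * h_prev[k] for k in range(len(h_prev)))
        h.append(math.tanh(total))
    return h

def run_rnn(inputs, h0, W_xh, W_hh, b_h):
    """Apply the same weights at every time step and collect the hidden states."""
    states, h = [], h0
    for x in inputs:
        h = elman_step(x, h, W_xh, W_hh, b_h)
        states.append(h)
    return states

# toy weights (hypothetical values): 2 input features, 2 hidden units
W_xh = [[0.5, -0.3], [0.1, 0.8]]
W_hh = [[0.2, 0.0], [0.0, 0.2]]
b_h = [0.0, 0.0]
states = run_rnn([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0], W_xh, W_hh, b_h)
```

Note that `run_rnn` reuses `W_xh`, `W_hh`, and `b_h` at every step, which is the weight sharing mentioned in section 1.1.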

### 1.3 RNNs in the context of NLP
![RNNdiagram](figures/11_04.png)

### 1.4 The problem with RNNs
Simple RNNs are not very good at capturing long-term dependencies.  
e.g. “I grew up in France… I speak fluent ???”  
-> Predicting the missing word needs information from far back in the sequence

## 2. LSTM & GRU
### 2.1 Long Short-Term Memory (LSTM)
- Controls how much of the new input to take in and how much of the previous hidden state to forget
- Closer to how humans process information
- Not a new idea (*Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9.8 (1997): 1735-1780.*)  
![LSTM1](figures/11_05.png)
![LSTM2](figures/11_06.png)
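
The gating idea can be sketched with a scalar LSTM step. This assumes the standard formulation with input, forget, and output gates plus a candidate value; the `params` layout and the gate names are hypothetical, chosen only to keep the example short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step for a scalar input and scalar state.

    params maps a gate name to (w_x, w_h, b); this layout is illustrative only.
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("input"))         # how much new information to write
    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate value to write
    c = f * c_prev + i * g            # new cell state
    h = o * math.tanh(c)              # new hidden state
    return h, c
```

With a strongly positive forget bias and a strongly negative input bias, the cell state passes through a step almost unchanged, which is how an LSTM can carry information across many steps.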

LSTMs work well in practice but are often considered unnecessarily complicated, which motivated the introduction of GRUs.

### 2.2 GRUs (Gated Recurrent Units)
- LSTMs and GRUs are the two most widely used gated recurrent units
![GRUcs224d](figures/11_07.png)
- Computationally less expensive
- Performance on par with LSTMs (*Chung, Junyoung, et al. "Empirical evaluation of gated recurrent neural networks on sequence modeling." arXiv preprint arXiv:1412.3555 (2014).*)
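
One rough way to see why a GRU is cheaper is to count parameters. The sketch below assumes the textbook formulations (four affine blocks per LSTM cell, three per GRU cell, no peepholes or projections); real implementations may differ in detail, and the function names are made up for this example.

```python
def gated_cell_params(input_size, hidden_size, num_blocks):
    """Weights and biases for num_blocks affine maps of [x_t, h_{t-1}] -> hidden."""
    per_block = hidden_size * (input_size + hidden_size) + hidden_size
    return num_blocks * per_block

def lstm_params(input_size, hidden_size):
    # input gate, forget gate, output gate, candidate
    return gated_cell_params(input_size, hidden_size, 4)

def gru_params(input_size, hidden_size):
    # update gate, reset gate, candidate
    return gated_cell_params(input_size, hidden_size, 3)
```

Under these assumptions a GRU has three quarters of the parameters of an LSTM with the same input and hidden sizes.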

## 3. Applications of RNNs
### 3.1 Language Modeling
- Allows us to measure how likely a sentence is
- Important input for Machine Translation (since high-probability sentences are typically correct)
- Can generate new text
- e.g., Character-level Language Modeling: [Shakespeare Generator & Linux Source Code Generator](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) with [code](https://github.com/karpathy/char-rnn)
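
To make "measuring how likely a sentence is" concrete, here is a small sketch that scores sentences with the chain rule, $\log P(w_1 \dots w_n) = \sum_t \log P(w_t \mid w_{t-1})$. It deliberately swaps in a count-based bigram model with add-one smoothing instead of an RNN, so that it runs without any library; the toy corpus and all names are assumptions.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    return unigrams, bigrams

def sentence_log_prob(sentence, unigrams, bigrams, vocab_size):
    """Sum of log P(w_t | w_{t-1}) with add-one smoothing."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    logp = 0.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        logp += math.log((bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size))
    return logp

corpus = ["i speak french", "i speak english", "i grew up in france"]
unigrams, bigrams = train_bigram(corpus)
vocab_size = len(set(unigrams) | {"</s>"})
likely = sentence_log_prob("i speak french", unigrams, bigrams, vocab_size)
unlikely = sentence_log_prob("french speak i", unigrams, bigrams, vocab_size)
```

The well-formed sentence receives the higher log-probability; an RNN language model plays the same role but conditions on the whole history rather than on one previous word.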


### 3.2 Machine Translation

- Google Neural Machine Translation ([Google Research’s blog](https://research.google.com/pubs/pub45610.html))

### 3.3 Text Summarization
- *Nallapati, Ramesh, et al. "Abstractive text summarization using sequence-to-sequence rnns and beyond." arXiv preprint arXiv:1602.06023 (2016).*

### 3.4 Image Captioning
- *Karpathy, Andrej, and Li Fei-Fei. "Deep visual-semantic alignments for generating image descriptions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015.*

## 4. RNNs in TensorFlow

### 4.1 Cell support
- `BasicRNNCell`: The most basic RNN cell
- `RNNCell`: Abstract object representing an RNN cell
- `BasicLSTMCell`: Basic LSTM recurrent network cell
- `LSTMCell`: LSTM recurrent network cell
- `GRUCell`: Gated Recurrent Unit cell

### 4.2 Construct Cells

In [None]:
cell = tf.nn.rnn_cell.GRUCell(hidden_size)

### 4.3 Stack multiple cells

In [None]:
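# build one cell object per layer: reusing a single object ([cell] * num_layers)
# shares weights across layers or raises an error in later TensorFlow 1.x releases
cells = [tf.nn.rnn_cell.GRUCell(hidden_size) for _ in range(num_layers)]
rnn_cells = tf.nn.rnn_cell.MultiRNNCell(cells)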

### 4.4 Construct Recurrent Neural Network
- `tf.nn.dynamic_rnn`: uses a `tf.while_loop` to unroll the network dynamically when the graph is executed
    - Graph creation is faster
    - Can feed batches of variable size
- `tf.nn.bidirectional_dynamic_rnn`: the bidirectional version of `dynamic_rnn`

In [None]:
cells = [tf.nn.rnn_cell.GRUCell(hidden_size) for _ in range(num_layers)]
rnn_cells = tf.nn.rnn_cell.MultiRNNCell(cells)
# pass the stacked cells (not a single cell) to dynamic_rnn
output, out_state = tf.nn.dynamic_rnn(rnn_cells, seq, length, initial_state)
# problem: most sequences are not of the same LENGTH!

### 4.5 Dealing with variable sequence length
#### Padded/truncated sequence length
- Pad all sequences with zero vectors and all labels with a zero label so that they all have the same length
- Most current models can’t deal with sequences longer than 120 tokens, so there is usually a fixed `max_length`, and longer sequences are truncated to it
- **Problem:** padded labels change the total loss, which affects the gradients
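
The padding and truncation step can be sketched as follows. This is a minimal version for sequences of integer token ids; `pad_or_truncate` and its parameters are hypothetical names, not part of TensorFlow.

```python
def pad_or_truncate(sequences, max_length, pad_value=0):
    """Return copies of the sequences, each exactly max_length long."""
    out = []
    for seq in sequences:
        clipped = list(seq[:max_length])                      # truncate long ones
        padding = [pad_value] * (max_length - len(clipped))   # pad short ones
        out.append(clipped + padding)
    return out

batch = pad_or_truncate([[1, 2, 3], [4], [5, 6, 7, 8, 9]], max_length=4)
```

Using `0` as the padding id assumes that real tokens never use id `0`, which is why vocabularies are often encoded starting from 1.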

#### Approach 1
- Maintain a mask (True for real, False for padded tokens)
- Run your model on both the real/padded tokens (model will predict labels for the padded tokens as well)
- Only take into account the loss caused by the real elements

In [None]:
# TensorFlow 1.x requires named arguments here
full_loss = tf.nn.softmax_cross_entropy_with_logits(logits=preds, labels=labels)
loss = tf.reduce_mean(tf.boolean_mask(full_loss, mask))
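
What the masked mean computes can be checked by hand with a plain-Python sketch. It assumes the model outputs have already been turned into per-token probability lists and that the batch is flattened to one list of tokens; the helper names are made up for this example.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability assigned to the correct label."""
    return -math.log(probs[label])

def masked_mean_loss(token_probs, token_labels, mask):
    """Average the loss over real tokens only (mask is True for real tokens)."""
    losses = [cross_entropy(p, y) for p, y in zip(token_probs, token_labels)]
    real = [loss for loss, keep in zip(losses, mask) if keep]
    if not real:
        return 0.0
    return sum(real) / len(real)

token_probs = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]
token_labels = [0, 0, 0]
mask = [True, True, False]   # the third token is padding
loss = masked_mean_loss(token_probs, token_labels, mask)
```

The padded token has a large loss here, but it does not affect the result because the mask removes it before averaging.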

#### Approach 2
- Let your model know the real sequence length so that it only predicts the labels for the real tokens

In [None]:
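cells = [tf.nn.rnn_cell.GRUCell(hidden_size) for _ in range(num_layers)]
rnn_cells = tf.nn.rnn_cell.MultiRNNCell(cells)
# real length of each sequence: count the time steps whose vector is not all zeros
length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1)
output, out_state = tf.nn.dynamic_rnn(rnn_cells, seq, length, initial_state)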
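
The length computation is easier to follow in plain Python. The sketch below mirrors the `sign`, `reduce_max`, `reduce_sum` chain on nested lists; it assumes non-negative features (such as one-hot vectors) with all-zero vectors as padding, and the function names are hypothetical.

```python
def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def sequence_lengths(batch):
    """batch[b][t] is a feature vector; all-zero vectors count as padding."""
    lengths = []
    for seq in batch:
        # 1 for a step with any positive feature, 0 for an all-zero step
        used = [max(sign(v) for v in step) for step in seq]
        lengths.append(sum(used))
    return lengths

batch = [
    [[0, 1, 0], [1, 0, 0], [0, 0, 0]],   # two real steps, one padded
    [[0, 0, 1], [0, 0, 0], [0, 0, 0]],   # one real step, two padded
]
lengths = sequence_lengths(batch)
```

If features can be negative, taking the absolute value before the maximum would be needed; that variation is not shown in the lecture.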

### 4.6 Dealing with common problems when training RNNs
#### Vanishing Gradients
Use different activation units
- `tf.nn.relu`
- `tf.nn.relu6`
- `tf.nn.crelu`
- `tf.nn.elu`  
  
In addition to:
- `tf.nn.softplus`
- `tf.nn.softsign`
- `tf.nn.bias_add` (adds a bias term; listed alongside the activation ops but not an activation itself)
- `tf.sigmoid`
- `tf.tanh`
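
A toy calculation shows why the activation matters. The sketch below tracks $|\partial h_T / \partial h_0|$ for the scalar recurrence $h_t = \mathrm{act}(w \cdot h_{t-1})$; it is an illustration with made-up names and a single weight, not a description of how TensorFlow computes gradients.

```python
import math

def backprop_factor(weight, steps, activation="tanh", h0=0.5):
    """Return |d h_T / d h_0| for h_t = act(weight * h_{t-1})."""
    h, grad = h0, 1.0
    for _ in range(steps):
        z = weight * h
        if activation == "tanh":
            h = math.tanh(z)
            local = 1.0 - h * h            # derivative of tanh
        elif activation == "sigmoid":
            h = 1.0 / (1.0 + math.exp(-z))
            local = h * (1.0 - h)          # derivative of sigmoid, at most 0.25
        elif activation == "relu":
            h = max(0.0, z)
            local = 1.0 if z > 0 else 0.0  # derivative of relu
        else:
            raise ValueError(activation)
        grad *= weight * local             # chain rule through one step
    return abs(grad)
```

With `weight=1.0`, the sigmoid factor shrinks by at least 4x per step while the ReLU factor stays at 1 as long as the unit is active. The flip side is that a ReLU with a weight larger than 1 lets the factor grow without bound, which is the exploding-gradient case.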

#### Exploding Gradients
Clip gradients with `tf.clip_by_global_norm`

In [None]:
# take gradients of cost w.r.t. all trainable variables
trainables = tf.trainable_variables()
gradients = tf.gradients(cost, trainables)

# clip the gradients by a pre-defined max norm
clipped_gradients, _ = tf.clip_by_global_norm(gradients, max_grad_norm)

# add the clipped gradients to the optimizer
optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.apply_gradients(zip(clipped_gradients, trainables))
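
Global-norm clipping rescales all gradients by one common factor, so their direction is preserved. A plain-Python re-implementation of that rule, assuming each gradient is a flat list of floats (the function name is chosen to avoid confusion with the TensorFlow op), might look like this:

```python
import math

def clip_global_norm(grads, max_norm):
    """Scale all gradients so that their joint L2 norm is at most max_norm.

    grads is a list of gradients, one flat list of floats per variable.
    Returns the clipped gradients and the norm before clipping.
    """
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = max_norm / max(global_norm, max_norm)   # equals 1.0 when no clipping is needed
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm

clipped, norm = clip_global_norm([[3.0], [4.0]], max_norm=1.0)
```

Here the joint norm is 5, so both gradients are multiplied by 0.2; gradients whose joint norm is already below `max_norm` are returned unchanged.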

#### Anneal the learning rate
Optimizers accept both scalars and tensors as the learning rate, so a decaying learning-rate tensor can be passed in directly.

In [None]:
learning_rate = tf.train.exponential_decay(init_lr,
                                           global_step,
                                           decay_steps,
                                           decay_rate,
                                           staircase=True)
optimizer = tf.train.AdamOptimizer(learning_rate)
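
The schedule itself is a one-line formula, `init_lr * decay_rate ** (global_step / decay_steps)`, with integer division when `staircase=True`. A plain-Python sketch of that formula (not the TensorFlow op, and with hypothetical example values) is:

```python
def exponential_decay(init_lr, global_step, decay_steps, decay_rate, staircase=False):
    """Learning rate after global_step updates."""
    if staircase:
        exponent = global_step // decay_steps   # drops in discrete steps
    else:
        exponent = global_step / decay_steps    # decays smoothly
    return init_lr * decay_rate ** exponent

stepped = exponential_decay(0.1, 250, 100, 0.5, staircase=True)
smooth = exponential_decay(0.1, 250, 100, 0.5, staircase=False)
```

With these values the staircase schedule has halved the rate twice by step 250, while the smooth schedule is already partway toward the third halving.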

#### Overfitting
- dropout through `tf.nn.dropout`

In [None]:
hidden_layer = tf.nn.dropout(hidden_layer, keep_prob)

-  `DropoutWrapper` for cells

In [None]:
cell = tf.nn.rnn_cell.GRUCell(hidden_size)
cell = tf.nn.rnn_cell.DropoutWrapper(cell,
                                     output_keep_prob=keep_prob)
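
What dropout does to a layer's values can be sketched without TensorFlow. The version below is "inverted" dropout: each value is kept with probability `keep_prob` and scaled by `1 / keep_prob`, so the expected value is unchanged. The function name and the `rng` parameter are assumptions for this example.

```python
import random

def dropout(values, keep_prob, rng=random):
    """Zero each value with probability 1 - keep_prob; rescale the survivors."""
    return [v / keep_prob if rng.random() < keep_prob else 0.0 for v in values]

rng = random.Random(0)                    # fixed seed so the run is repeatable
dropped = dropout([1.0] * 10000, keep_prob=0.8, rng=rng)
mean_after = sum(dropped) / len(dropped)
```

Dropout should only be active during training; at inference time `keep_prob` is typically set to 1.0 so that no units are dropped.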

## 5. Language Modeling
### 5.1 Neural Language Modeling
- Allows us to measure how likely a sentence is
- Important input for Machine Translation (since high-probability sentences are typically correct)
- Can generate new text

### 5.2 Main approaches
- Word-level: n-grams
    - The traditional approach up until very recently
    - Train a model to predict the next word based on previous n-grams
    - problems:
        - Huge vocabulary
        - Can’t generalize to OOV (out of vocabulary)
        - Requires a lot of memory
- Character-level
    - Introduced in the early 2010s
    - Both input and output are characters
    - Pros:
        - very small vocabulary
        - Doesn’t require word embeddings
        - faster to train
    - Cons:
        - Low fluency (many words can be gibberish)
- Subword-level: somewhere in between the two above
    - hybrid: 
        - word-level by default
        - switch to character-level for unknown tokens
    - Input and output are subwords
    - Keep $W$ most frequent words
    - Keep $S$ most frequent syllables
    - Split the rest into characters
    - Seems to perform better than both word-level and character-level models (*Mikolov, Tomáš, et al. "Subword language modeling with neural networks." [Preprint](http://www.fit.vutbr.cz/imikolov/rnnlm/char.pdf) (2012).*)
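
The hybrid idea (word-level by default, character-level for unknown tokens) can be sketched as below. This simplified version keeps only the most frequent words and falls back to characters for everything else; it leaves out the syllable tier, and the function names and toy corpus are assumptions.

```python
from collections import Counter

def build_word_vocab(corpus_tokens, num_words):
    """Keep the num_words most frequent words."""
    return {w for w, _ in Counter(corpus_tokens).most_common(num_words)}

def hybrid_tokenize(tokens, word_vocab):
    """Emit known words whole and split unknown words into characters."""
    out = []
    for tok in tokens:
        if tok in word_vocab:
            out.append(tok)
        else:
            out.extend(list(tok))
    return out

corpus = "the cat sat on the mat the cat".split()
word_vocab = build_word_vocab(corpus, num_words=2)
tokens = hybrid_tokenize(["the", "dog", "sat"], word_vocab)
```

Because any word can be spelled out in characters, this scheme has no out-of-vocabulary tokens while keeping frequent words as single units.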

### 5.3 Demo: Character-level Language Modeling
Generate fake arXiv abstracts

[data/arvix_abstracts.txt](https://github.com/chiphuyen/tf-stanford-tutorials/blob/master/data/arvix_abstracts.txt)  

[examples/11_char_rnn_gist.py](https://github.com/chiphuyen/tf-stanford-tutorials/blob/master/examples/11_char_rnn_gist.py)

In [None]:
""" A clean, no_frills character-level generative language model.
Created by Danijar Hafner (danijar.com), edited by Chip Huyen
for the class CS 20SI: "TensorFlow for Deep Learning Research"
Based on Andrej Karpathy's blog: 
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
"""
import os
os.environ['TF_CPP_MIN_LOG_LEVEL']='2'
import sys
sys.path.append('..')

import time

import tensorflow as tf

import utils

DATA_PATH = 'data/arvix_abstracts.txt'
HIDDEN_SIZE = 200
BATCH_SIZE = 64
NUM_STEPS = 50
SKIP_STEP = 40
TEMPRATURE = 0.7
LR = 0.003
LEN_GENERATED = 300

def vocab_encode(text, vocab):
    return [vocab.index(x) + 1 for x in text if x in vocab]

def vocab_decode(array, vocab):
    return ''.join([vocab[x - 1] for x in array])

def read_data(filename, vocab, window=NUM_STEPS, overlap=NUM_STEPS//2):
    # yield fixed-size windows of encoded characters, moving forward by `overlap`
    with open(filename) as f:
        for text in f:
            text = vocab_encode(text, vocab)
            for start in range(0, len(text) - window, overlap):
                chunk = text[start: start + window]
                chunk += [0] * (window - len(chunk))
                yield chunk

def read_batch(stream, batch_size=BATCH_SIZE):
    batch = []
    for element in stream:
        batch.append(element)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # yield the last partial batch, but never an empty one
    if batch:
        yield batch

def create_rnn(seq, hidden_size=HIDDEN_SIZE):
    cell = tf.contrib.rnn.GRUCell(hidden_size)
    in_state = tf.placeholder_with_default(
            cell.zero_state(tf.shape(seq)[0], tf.float32), [None, hidden_size])
    # this line to calculate the real length of seq
    # all seq are padded to be of the same length which is NUM_STEPS
    length = tf.reduce_sum(tf.reduce_max(tf.sign(seq), 2), 1)
    output, out_state = tf.nn.dynamic_rnn(cell, seq, length, in_state)
    return output, in_state, out_state

def create_model(seq, temp, vocab, hidden=HIDDEN_SIZE):
    seq = tf.one_hot(seq, len(vocab))
    output, in_state, out_state = create_rnn(seq, hidden)
    # fully_connected is syntactic sugar for tf.matmul(output, w) + b
    # it creates w and b for us; passing None as the activation keeps the output linear
    logits = tf.contrib.layers.fully_connected(output, len(vocab), None)
    loss = tf.reduce_sum(tf.nn.softmax_cross_entropy_with_logits(logits=logits[:, :-1], labels=seq[:, 1:]))
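    # sample the next character from the Maxwell-Boltzmann distribution with temperature temp
    # tf.multinomial expects unnormalized log-probabilities, so the scaled logits are
    # passed directly; wrapping them in tf.exp would sample from a different distribution
    sample = tf.multinomial(logits[:, -1] / temp, 1)[:, 0]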
    return loss, sample, in_state, out_state

def training(vocab, seq, loss, optimizer, global_step, temp, sample, in_state, out_state):
    saver = tf.train.Saver()
    start = time.time()
    with tf.Session() as sess:
        writer = tf.summary.FileWriter('graphs/gist', sess.graph)
        sess.run(tf.global_variables_initializer())
        
        ckpt = tf.train.get_checkpoint_state(os.path.dirname('checkpoints/arvix/checkpoint'))
        if ckpt and ckpt.model_checkpoint_path:
            saver.restore(sess, ckpt.model_checkpoint_path)
        
        iteration = global_step.eval()
        for batch in read_batch(read_data(DATA_PATH, vocab)):
            batch_loss, _ = sess.run([loss, optimizer], {seq: batch})
            if (iteration + 1) % SKIP_STEP == 0:
                print('Iter {}. \n    Loss {}. Time {}'.format(iteration, batch_loss, time.time() - start))
                online_inference(sess, vocab, seq, sample, temp, in_state, out_state)
                start = time.time()
                saver.save(sess, 'checkpoints/arvix/char-rnn', iteration)
            iteration += 1

def online_inference(sess, vocab, seq, sample, temp, in_state, out_state, seed='T'):
    """ Generate sequence one character at a time, based on the previous character
    """
    sentence = seed
    state = None
    for _ in range(LEN_GENERATED):
        batch = [vocab_encode(sentence[-1], vocab)]
        feed = {seq: batch, temp: TEMPRATURE}
        # for the first decoder step, the state is None
        if state is not None:
            feed.update({in_state: state})
        index, state = sess.run([sample, out_state], feed)
        sentence += vocab_decode(index, vocab)
    print(sentence)

def main():
    vocab = (
            " $%'()+,-./0123456789:;=?ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "\\^_abcdefghijklmnopqrstuvwxyz{|}")
    seq = tf.placeholder(tf.int32, [None, None])
    temp = tf.placeholder(tf.float32)
    loss, sample, in_state, out_state = create_model(seq, temp, vocab)
    global_step = tf.Variable(0, dtype=tf.int32, trainable=False, name='global_step')
    optimizer = tf.train.AdamOptimizer(LR).minimize(loss, global_step=global_step)
    utils.make_dir('checkpoints')
    utils.make_dir('checkpoints/arvix')
    training(vocab, seq, loss, optimizer, global_step, temp, sample, in_state, out_state)
    
if __name__ == '__main__':
    main()
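
The effect of the temperature used when sampling can be seen in isolation with a plain-Python sketch. It assumes the usual definition $p_i \propto \exp(\mathrm{logit}_i / T)$; the function names and the example logits are made up for illustration and are separate from the script above.

```python
import math
import random

def softmax_with_temperature(logits, temp):
    """Turn logits into probabilities; a lower temp sharpens the distribution."""
    scaled = [l / temp for l in logits]
    m = max(scaled)                               # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temp, rng=random):
    """Draw one index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temp)
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1                         # guard against rounding error

logits = [1.0, 2.0, 3.0]
sharp = softmax_with_temperature(logits, 0.5)
plain = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)
```

A temperature below 1 concentrates probability on the most likely characters, giving more conservative text, and a temperature above 1 flattens the distribution, giving more varied but more error-prone text.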