<a href="https://colab.research.google.com/github/zpgeng/Machine-Perception/blob/master/5_tensorflow_rnn.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# TensorFlow Tutorial - RNNs

In this tutorial, you will learn how to build a recurrent neural network (RNN) to model time series data. More specifically, we will implement a model that generates text by predicting one character at a time. This model is largely based on this well-known [blog post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) by Andrej Karpathy. We highly recommend reading the post in full. Additionally, we borrow some code from a TensorFlow implementation of that post, available [here](https://github.com/sherjilozair/char-rnn-tensorflow).

In the following, we assume familiarity with the material presented in the previous TensorFlow tutorials, i.e., you should be aware of TensorFlow's core concepts, such as graphs, sessions, and input placeholders. Furthermore, we will not demonstrate any usage of TensorBoard here - for this, refer to the tutorial on CNNs.

This tutorial consists of:
  1. Introduction to RNNs
  2. A Look at the Data
  3. Building the Model
  4. Generating Text
  5. Concluding Remarks and Exercises
  6. _[Appendix]_

## Introduction to RNNs
### Learning from Sequences
A recurrent neural network (RNN) is a specialized architecture designed to model time series and sequence data. A question often associated with sequences is: Given a number of time steps in the sequence, what is the most likely next outcome? In other words, we want the model to output the following probability
$$
p(\mathbf{x}_t | \mathbf{x}_0, \dotsc, \mathbf{x}_{t-1})
$$

where $t$ is the index over time. This is also the task we want to solve in this tutorial, namely, given a sequence of characters, what is the most likely next character?

To answer the above question, it is a good idea to keep some sort of memory as we walk along the sequence, because earlier observations influence the desired outcome in the future. As an example, consider predicting the last character in the following sentence (indicated by the question mark):

<center><img src="https://i.imgur.com/TKfgH7f.png" align="middle" hspace="20px" vspace="5px"></center>

For us, it is obvious that we should end it with double quotes. However, the model must _remember_ that there was an opening quotation mark that hasn't been closed yet. It gets even trickier in the following example:

<center><img src="https://i.imgur.com/iQPGehZ.png" align="middle" hspace="20px" vspace="5px"></center>

To complete this sentence, the model must not only realize that the missing word is a noun, but it must also remember that we were talking about Italy at the beginning of the sentence and that the capital of Italy is Rome. To achieve this, it is not enough that the model only knows what characters are. It must have some notion of more abstract concepts of the underlying language - it has to learn how characters are combined to form words, how words are structured to build sentences, and so on. Learning to understand text on all these different levels is difficult, and being able to capture long-term dependencies is essential for such a task.

### Vanilla RNNs
In neural networks, we model time dependencies by introducing connections that loop back to the same node (hence the name _recurrent_). The recurrent connection typically updates the internal state/memory of the cell. Such an architecture can be drawn as follows ([source](http://www.deeplearningbook.org/contents/rnn.html)).

<center><img src="https://i.imgur.com/Uo6PCfN.png" align="middle" hspace="20px" vspace="5px"></center>

On the left you see the compact version of the graph. The model takes as input $\mathbf{x}$ and processes it over time while the recurrent connection updates the internal state $\mathbf{h}$ located in the memory cell. On the right side you see the unfolded version of the same graph, where we basically discretize the time dimension and make every time step explicit. This "unfolding" or "unrolling" of an RNN is what we have to do in practice when training. This step makes the model a finite, computational graph that we can actually work with.

The model shown above is a bit too simplistic - it just processes the input over time, but does not produce an output. A more realistic RNN looks like this ([source](http://www.deeplearningbook.org/contents/rnn.html)):

<center><img src="https://i.imgur.com/nLKmj8v.png" align="middle" hspace="20px" vspace="5px"></center>

Let's have a look at that in more detail. $\mathbf{x}$ is the input to the model, $\mathbf{o}$ is the output, and $L$ is a loss function that measures how much the output deviates from the target $\mathbf{y}$. In our case, $\mathbf{x}$ is a sequence of characters and the model produces character $\mathbf{o}_t$ at every time step $t$. We compare it to the target character $\mathbf{y}_t$ which is equivalent to the next character $\mathbf{x}_{t+1}$ in our sequence. Note that instead of outputting one specific character, we rather produce a probability distribution over all characters in the vocabulary (more on this later). This is similar to what we did with the CNN for image classification where we produce a probability of belonging to a class, instead of making a hard assignment.

The real magic of RNNs happens in the recurrent cell, which we sometimes also call _memory cell_ because it tracks the memory, or hidden state, $\mathbf{h}$. The question is now, how do we update the state $\mathbf{h}$? In the vanilla formulation we model it as follows:

$$
\begin{align}
\mathbf{a}^{(t)} &= \mathbf{W} \cdot \mathbf{h}^{(t-1)} + \mathbf{U} \cdot \mathbf{x}^{(t)} + \mathbf{b}\\
\mathbf{h}^{(t)} &= \tanh(\mathbf{a}^{(t)})\\
\mathbf{o}^{(t)} &= \mathbf{V} \cdot \mathbf{h}^{(t)} + \mathbf{c}
\end{align}
$$

Here the matrices $\mathbf{W}$, $\mathbf{U}$, $\mathbf{V}$, and biases $\mathbf{b}$ and $\mathbf{c}$ represent the learnable parameters of this model. Importantly, those parameters are _shared_ between time steps, i.e. every time step gets exactly the same copy of weights to work with.
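
To make the update equations concrete, here is a minimal NumPy sketch of a vanilla RNN applied to a short random sequence. The function name `rnn_step`, the toy sizes, and the random initialization are illustrative assumptions and not part of the model we build later; the point is only that the same parameters are reused at every time step.

```python
import numpy as np

def rnn_step(x_t, h_prev, W, U, V, b, c):
    """One vanilla RNN time step, following the three update equations above."""
    a_t = W @ h_prev + U @ x_t + b   # pre-activation
    h_t = np.tanh(a_t)               # new hidden state
    o_t = V @ h_t + c                # output (unnormalized scores)
    return h_t, o_t

# Toy sizes, chosen only for illustration.
input_size, hidden_size, output_size, num_steps = 4, 3, 4, 5
rng = np.random.RandomState(0)
W = rng.randn(hidden_size, hidden_size) * 0.1
U = rng.randn(hidden_size, input_size) * 0.1
V = rng.randn(output_size, hidden_size) * 0.1
b = np.zeros(hidden_size)
c = np.zeros(output_size)

xs = rng.randn(num_steps, input_size)  # a random input sequence
h = np.zeros(hidden_size)              # initial hidden state
outputs = []
for x_t in xs:
    # the same W, U, V, b and c are reused at every time step
    h, o = rnn_step(x_t, h, W, U, V, b, c)
    outputs.append(o)
outputs = np.stack(outputs)            # shape [num_steps, output_size]
```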




### LSTM Cells
Despite its apparent simplicity, the vanilla RNN is already a powerful model. The problem is that such an RNN turned out to be quite difficult to train in practice. The reason for this is known as the _vanishing or exploding gradients problem_, which was introduced in the lecture. In short, when we optimize the weights of an RNN, we end up backpropagating gradients through time. Because of the chain rule, the gradient that arrives at layer $t$ is the product of the gradients of the layers from $t+1$ to $\tau$ (assuming we unfold the RNN for $\tau$ time steps in total). If each factor in this long product is smaller than one in magnitude, the result shrinks exponentially towards zero; if each factor is larger than one, it grows exponentially. This is especially a problem for "early" layers and if $\tau$ is large, i.e., if we want to capture long-term dependencies. If you would like to read more about this, you can find a great article [here](http://neuralnetworksanddeeplearning.com/chap5.html).
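
The effect of repeated multiplication can be illustrated with plain arithmetic. The sketch below is a deliberate simplification: it treats every per-step gradient as the same scalar factor (0.9 or 1.1, both chosen arbitrarily), whereas real gradients are Jacobian matrices that differ from step to step.

```python
def product_of_factors(factor, num_steps):
    """Multiply `num_steps` copies of the same per-step gradient factor."""
    result = 1.0
    for _ in range(num_steps):
        result *= factor
    return result

for tau in (10, 50, 100):
    small = product_of_factors(0.9, tau)   # factors below one: vanishing
    large = product_of_factors(1.1, tau)   # factors above one: exploding
    print("tau={:3d}  0.9^tau={:.2e}  1.1^tau={:.2e}".format(tau, small, large))
```

With a factor of 0.9 the product drops to roughly $2.7 \cdot 10^{-5}$ after 100 steps, while with 1.1 it grows to roughly $1.4 \cdot 10^{4}$.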

So, what can we do to alleviate the problem of unstable gradients in RNNs? One answer was proposed in the seminal work by [Hochreiter and Schmidhuber, 1997](http://www.bioinf.jku.at/publications/older/2604.pdf), where they introduced Long Short-Term Memory (LSTM) cells. These cells were designed to remember information for long periods of time and have thus made training of RNNs considerably easier. The following shows a schematic overview of the inner workings of an LSTM cell ([source](https://codeburst.io/generating-text-using-an-lstm-network-no-libraries-2dff88a3968)):

<center><img src="https://i.imgur.com/FtcD5eR.png" align="middle" hspace="20px" vspace="5px"></center>



If you would like to read more about LSTM cells, we highly recommend reading [this excellent post](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) from colah's blog. In a nutshell, the most important differences between a vanilla and an LSTM cell are:
  - The LSTM cell has two hidden variables instead of just one, the hidden state $\mathbf{h}$ and the cell state $\mathbf{c}$. 
  - Updates to the cell state are carefully protected by three gating functions consisting of a sigmoid layer and an element-wise multiplication (denoted by $\otimes$ or $\circ$).
  - Notice that the cell state $\mathbf{c}_{t-1}$ can more or less easily flow through the cell (top line in the above diagram) and thus propagate the information further into the next time step.
  
LSTMs have made training of recurrent structures much easier and have thus become the de-facto standard in RNNs. Of course, there are more cell types available (which you can find out about for example in colah's blog), but LSTMs are usually a good initial choice for a recurrent architecture.
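
For readers who prefer code to diagrams, the following NumPy sketch implements one step of a commonly used LSTM formulation (input, forget and output gates plus a candidate update). It is an illustration under assumptions, not TensorFlow's implementation: the helper names, the layout of the stacked weight matrix `W`, and the order of the gates within it are our own choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape [4*hidden, hidden+input], b has shape [4*hidden]."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    i = sigmoid(z[0 * hidden:1 * hidden])   # input gate
    f = sigmoid(z[1 * hidden:2 * hidden])   # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])   # output gate
    g = np.tanh(z[3 * hidden:4 * hidden])   # candidate cell update
    c_t = f * c_prev + i * g                # gated update of the cell state
    h_t = o * np.tanh(c_t)                  # gated view of the cell state
    return h_t, c_t

# Toy sizes and random weights, chosen only for illustration.
input_size, hidden_size = 4, 3
rng = np.random.RandomState(0)
W = rng.randn(4 * hidden_size, hidden_size + input_size) * 0.1
b = np.zeros(4 * hidden_size)
h, c = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.randn(5, input_size):
    h, c = lstm_step(x_t, h, c, W, b)
```

Note how the cell state is only touched by element-wise operations: if the forget gate is close to one and the input gate is close to zero, `c_prev` passes through the step almost unchanged.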

## A Look at the Data
Let's now turn to our actual problem of training a character-level language model. Following Andrej Karpathy's [post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), we are using the Shakespeare dataset which is just a text file containing some of Shakespeare's work. Here is an excerpt:

<center><img src="https://i.imgur.com/wPFXbqO.png" align="middle" hspace="20px" vspace="5px"></center>

What we want to achieve with the model can be summarized in the following diagram ([source](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)).

<center><img src="https://i.imgur.com/yYLPfEz.png" align="middle" hspace="20px" vspace="5px"></center>

For the sake of simplicity, this diagram assumes a limited vocabulary of only 4 characters `[h, e, l, o]`. The input to the model is the word "hello". On the bottom you can see the input at each time step and a one-hot encoding of each character shown in red. Similarly at the top you find the target characters for each input character and a predicted confidence score shown in blue units. For example, in the first time step the confidence attributed to the letter `e` is 2.2, while the confidence for `o` is 4.1. Ideally, the confidence for the bold numbers in the blue boxes should be high, while for the red numbers it should be low. In the middle part of the diagram, shown in green, are the hidden states of the recurrent cells. Through this layer and the associated hidden state vectors, the information is propagated to future time steps, so that hopefully in the last time step the letter "o" will have a high confidence score.
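
The encoding in the diagram can be reproduced in a few lines of Python. This is only a sketch for the toy vocabulary; the helper `one_hot` and the variable names are our own, and the index order `[h, e, l, o]` is assumed to match the diagram.

```python
vocab = ['h', 'e', 'l', 'o']
char_to_idx = {ch: i for i, ch in enumerate(vocab)}

def one_hot(index, size):
    """Return a list of zeros with a single one at position `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec

text = "hello"
inputs = text[:-1]    # "hell": what the model sees at each time step
targets = text[1:]    # "ello": the next character it should predict
for x, y in zip(inputs, targets):
    print(x, one_hot(char_to_idx[x], len(vocab)), "->", y)
```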

We are now going to implement this architecture, but with a larger vocabulary and more training data. To use the Shakespeare data set, we need to preprocess the data, i.e., tokenize it, extract the vocabulary, and create batches of a certain sequence length (depending on how many time steps $\tau$ we want to unfold the RNN for). To do this, we shamelessly copy the code from [this implementation](https://github.com/sherjilozair/char-rnn-tensorflow).

In [0]:
import codecs
import os
import collections
import pickle
import numpy as np
%tensorflow_version 1.x
import tensorflow as tf
from tensorflow.python.util import deprecation
deprecation._PRINT_DEPRECATION_WARNINGS = False

In [0]:
class TextLoader():
    def __init__(self, data_dir, batch_size, seq_length, encoding='utf-8'):
        self.data_dir = data_dir
        self.batch_size = batch_size
        self.seq_length = seq_length
        self.encoding = encoding
        self.rng = np.random.RandomState(42)

        input_file = os.path.join(data_dir, "shakespeare.txt")
        vocab_file = os.path.join(data_dir, "vocab.pkl")
        tensor_file = os.path.join(data_dir, "data.npy")

        if not (os.path.exists(vocab_file) and os.path.exists(tensor_file)):
            print("reading text file")
            self.preprocess(input_file, vocab_file, tensor_file)
        else:
            print("loading preprocessed files")
            self.load_preprocessed(vocab_file, tensor_file)
        self.create_batches()
        self.reset_batch_pointer()

    def preprocess(self, input_file, vocab_file, tensor_file):
        with codecs.open(input_file, "r", encoding=self.encoding) as f:
            data = f.read()
        counter = collections.Counter(data)
        count_pairs = sorted(counter.items(), key=lambda x: -x[1])
        self.chars, _ = zip(*count_pairs)
        self.vocab_size = len(self.chars)
        self.vocab = dict(zip(self.chars, range(len(self.chars))))
        with open(vocab_file, 'wb') as f:
            pickle.dump(self.chars, f)
        self.tensor = np.array(list(map(self.vocab.get, data)))
        np.save(tensor_file, self.tensor)

    def load_preprocessed(self, vocab_file, tensor_file):
        with open(vocab_file, 'rb') as f:
            self.chars = pickle.load(f)
        self.vocab_size = len(self.chars)
        self.vocab = dict(zip(self.chars, range(len(self.chars))))
        self.tensor = np.load(tensor_file)
        self.num_batches = int(self.tensor.size / (self.batch_size *
                                                   self.seq_length))

    def create_batches(self):
        self.num_batches = int(self.tensor.size / (self.batch_size *
                                                   self.seq_length))

        # When the data (tensor) is too small,
        # let's give them a better error message
        if self.num_batches == 0:
            raise ValueError("Not enough data. Reduce seq_length or batch_size.")

        self.tensor = self.tensor[:self.num_batches * self.batch_size * self.seq_length]
        xdata = self.tensor
        ydata = np.copy(self.tensor)
        ydata[:-1] = xdata[1:]
        ydata[-1] = xdata[0]
        self.x_batches = np.split(xdata.reshape(self.batch_size, -1),
                                  self.num_batches, 1)
        self.y_batches = np.split(ydata.reshape(self.batch_size, -1),
                                  self.num_batches, 1)

    def next_batch(self):
        x, y = self.x_batches[self.pointer], self.y_batches[self.pointer]
        self.pointer += 1
        return x, y

    def reset_batch_pointer(self):
        self.pointer = 0
    
    def maybe_new_epoch(self):
        if self.pointer >= self.num_batches:
            # this is a new epoch
            self.reset_batch_pointer()
            self.reshuffle()
    
    def reshuffle(self):
        idx = self.rng.permutation(len(self.x_batches))
        self.x_batches = [self.x_batches[i] for i in idx]
        self.y_batches = [self.y_batches[i] for i in idx]

Let's also define some configuration parameters, as we did in the CNN tutorial.

In [0]:
def del_all_flags(FLAGS):
    flags_dict = FLAGS._flags()    
    keys_list = [keys for keys in flags_dict]    
    for keys in keys_list:
        FLAGS.__delattr__(keys)

del_all_flags(tf.flags.FLAGS)

tf.app.flags.DEFINE_string("data_dir", "./", "Where the training data is stored")
tf.app.flags.DEFINE_string("log_dir", "/tmp/tensorflow/shakespeare_rnn/logs", "Where to store summaries and checkpoints")
tf.app.flags.DEFINE_float("learning_rate", 1e-3, "Learning rate (default: 1e-3)")
tf.app.flags.DEFINE_integer("batch_size", 128, "Batch size (default: 128)")
tf.app.flags.DEFINE_integer("seq_length", 50, "Number of time steps to unroll the RNN for")
tf.app.flags.DEFINE_integer("hidden_size", 256, "Size of one LSTM hidden layer")
tf.app.flags.DEFINE_integer("num_layers", 2, "How many LSTM layers to use")
tf.app.flags.DEFINE_integer("print_every_steps", 20, "How often to print progress to the console")
tf.app.flags.DEFINE_string('f', '', 'kernel')  # Dummy flag: the Jupyter/Colab kernel is launched with a `-f` argument that flag parsing would otherwise reject.

In [0]:
FLAGS = tf.app.flags.FLAGS
print("\nCommand-line Arguments:")
for key in FLAGS.flag_values_dict():
  print("{:<22}: {}".format(key.upper(), FLAGS[key].value))
print(" ")


Command-line Arguments:
DATA_DIR              : ./
LOG_DIR               : /tmp/tensorflow/shakespeare_rnn/logs
LEARNING_RATE         : 0.001
BATCH_SIZE            : 128
SEQ_LENGTH            : 50
HIDDEN_SIZE           : 256
NUM_LAYERS            : 2
PRINT_EVERY_STEPS     : 20
F                     : 
 


In [0]:
!if [ ! -f shakespeare.txt ]; then wget -nv https://ait.ethz.ch/projects/shakespeare.txt?raw=true -O shakespeare.txt; fi

2020-03-20 17:12:05 URL:https://ait.ethz.ch/projects/shakespeare.txt?raw=true [1155394/1155394] -> "shakespeare.txt" [1]


In [0]:
# load the data
data_loader = TextLoader(FLAGS.data_dir, FLAGS.batch_size, FLAGS.seq_length)
print("loaded vocabulary with {} letters".format(data_loader.vocab_size))

reading text file
loaded vocabulary with 66 letters


In [0]:
# visualize some data
b = data_loader.x_batches[0]
t = data_loader.y_batches[0]
print('total of {} batches of shape: {}'.format(len(data_loader.x_batches), b.shape))
print('content of batch 0, entry 0, time steps 0 to 20')
print('input : {}'.format(b[0, :20]))
print('target: {}'.format(t[0, :20]))

total of 180 batches of shape: (128, 50)
content of batch 0, entry 0, time steps 0 to 20
input : [50  9  7  6  2  0 38  9  2  9 58  1  8 25 10 11 44  1 19  3]
target: [ 9  7  6  2  0 38  9  2  9 58  1  8 25 10 11 44  1 19  3  7]


In [0]:
# print characters instead of integers
# invert the vocabulary (this works because every character maps to a unique index)
vocab_inv = {v: k for k, v in data_loader.vocab.items()}
print('input : {}'.format([vocab_inv[i] for i in b[0, :20]]))
print('target: {}'.format([vocab_inv[i] for i in t[0, :20]]))

input : ['F', 'i', 'r', 's', 't', ' ', 'C', 'i', 't', 'i', 'z', 'e', 'n', ':', '\r', '\n', 'B', 'e', 'f', 'o']
target: ['i', 'r', 's', 't', ' ', 'C', 'i', 't', 'i', 'z', 'e', 'n', ':', '\r', '\n', 'B', 'e', 'f', 'o', 'r']


Great - we now have a way of tokenizing text and organizing it into batches of a given size and sequence length. Next, we'll look into how to build the actual RNN.
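
To see what the loader does without reading a file, the following toy sketch repeats the two key steps of `preprocess` and `create_batches` on a short string: building a vocabulary sorted by character frequency and creating targets by shifting the inputs by one position. The function names `build_vocab` and `make_inputs_and_targets` are hypothetical and exist only for this illustration.

```python
import collections

def build_vocab(text):
    """Map each character to an index; more frequent characters get smaller indices."""
    counter = collections.Counter(text)
    # sorted() is stable, so characters with equal counts keep first-seen order
    chars = [ch for ch, _ in sorted(counter.items(), key=lambda kv: -kv[1])]
    return {ch: i for i, ch in enumerate(chars)}

def make_inputs_and_targets(ids):
    """Targets are the inputs shifted left by one; the last target wraps around."""
    xs = list(ids)
    ys = xs[1:] + xs[:1]
    return xs, ys

text = "hello world"
vocab = build_vocab(text)
ids = [vocab[ch] for ch in text]
xs, ys = make_inputs_and_targets(ids)
print(vocab)
print("input :", xs)
print("target:", ys)
```

The wrap-around of the last target mirrors the line `ydata[-1] = xdata[0]` in `create_batches`.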

## Building the Model
We start by building the core of our model, the RNN with LSTM cells.

In [0]:
def rnn_lstm(inputs, hidden_size, num_layers, seq_lengths):
    """
    Builds an RNN with LSTM cells.
    :param inputs: The input tensor to the RNN in shape `[batch_size, seq_length]`.
    :param hidden_size: The number of units for each LSTM cell.
    :param num_layers: The number of stacked LSTM layers.
    :param seq_lengths: Tensor of shape `[batch_size]` specifying the total number
      of time steps per sequence.
    :return: The initial state, final state, predicted logits and probabilities.
    """
    # we first create a one-hot encoding of the inputs
    # the resulting shape is `[batch_size, seq_length, vocab_size]`
    vocab_size = data_loader.vocab_size
    input_one_hot = tf.one_hot(inputs, vocab_size, axis=-1)
    
    # create a list of all LSTM cells we want
    cells = [tf.contrib.rnn.LSTMCell(hidden_size) for _ in range(num_layers)]
    
    # we stack the cells together and create one big RNN cell
    cell = tf.contrib.rnn.MultiRNNCell(cells)
    
    # we need to set an initial state for the cells
    batch_size = tf.shape(inputs)[0]
    initial_state = cell.zero_state(batch_size, dtype=tf.float32)
    
    # now we are ready to unroll the graph
    outputs, final_state = tf.nn.dynamic_rnn(cell=cell,
                                             initial_state=initial_state,
                                             inputs=input_one_hot,
                                             sequence_length=seq_lengths)
    
    # The `outputs` tensor has shape `[batch_size, seq_length, hidden_size]`,
    # i.e. it contains the outputs of the last cell for every time step.
    # We want to map the output back to the "vocabulary space", so we add a dense layer.
    # Importantly, the dense layer should share its parameters across time steps.
    # To do this, we first flatten the outputs to `[batch_size*seq_length, hidden_size]`
    # and then add the dense layer.
    max_seq_length = tf.shape(inputs)[1]
    outputs_flat = tf.reshape(outputs, [-1, hidden_size])
    
    # dense layer
    weights = tf.Variable(tf.truncated_normal([hidden_size, vocab_size], stddev=0.1))
    bias = tf.Variable(tf.constant(0.1, shape=[vocab_size]))
    logits_flat = tf.matmul(outputs_flat, weights) + bias
    
    # reshape back
    logits = tf.reshape(logits_flat, [batch_size, max_seq_length, vocab_size])
    
    # activate to turn logits into probabilities
    probs = tf.nn.softmax(logits)
    
    # we return the initial and final states because this will be useful later
    return initial_state, final_state, logits, probs

A note about the use of `tf.nn.dynamic_rnn`. When dealing with sequences it is often the case that not all have the same length. Hence, we need to ask ourselves two questions:
  1. How do we handle sequences of different lengths in the same batch?
  2. Do we want to use different sequence lengths at inference time than at training time?
  
To answer the first question, we can simply pad all sequences in a batch with dummy values up to the maximum length occurring in that batch. Of course, we should tell TensorFlow that it does not make sense to unroll the RNN further for these dummy values. This is why we supply the `seq_lengths` tensor, which is important to guarantee correctness during back-propagation. Note that in our example, we do not need to pad the data, because our data loader already ensures that we have sequences of equal length. However, you should generally be aware of this caveat.
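
As a sketch of the padding scheme described above, the following pure-Python helper pads a batch of variable-length sequences and returns the lengths one would feed into `seq_lengths`, together with a 0/1 mask. The helper `pad_batch` and the padding value 0 are assumptions made for illustration; our data loader does not need them.

```python
def pad_batch(sequences, pad_value=0):
    """Pad all sequences to the longest one; also return true lengths and a mask."""
    seq_lengths = [len(s) for s in sequences]
    max_len = max(seq_lengths)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in sequences]
    # mask is 1.0 for real time steps and 0.0 for padded ones
    mask = [[1.0 if t < n else 0.0 for t in range(max_len)] for n in seq_lengths]
    return padded, seq_lengths, mask

padded, seq_lengths, mask = pad_batch([[5, 2, 9], [7], [3, 3]])
print(padded)
print(seq_lengths)
print(mask)
```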

To address the second question, TensorFlow provides two functions: `tf.nn.dynamic_rnn` and `tf.nn.static_rnn`. In the static RNN, TensorFlow creates an unrolled graph for a fixed length (say 100). It is still possible to use this graph for sequences of length `< 100` (by supplying the `seq_lengths` tensor as mentioned above), but at inference time, we cannot use it for more than 100 time steps. The dynamic RNN, on the other hand, can handle variable sequence lengths - it unrolls the graph in a `tf.while_loop` at run time. In other words, the time dimension in the input placeholder can be `None`, like for the batch size. It is recommended to always pass a `seq_lengths` tensor into the dynamic RNN function.

<center><img src="https://i.imgur.com/rTJuTka.png" align="middle" hspace="20px" vspace="5px"></center>

In the following we have to take care of the remaining tasks: build input placeholders, add a loss function, define the optimizer and define the training loop.

In [0]:
# create input placeholders
with tf.name_scope("input"):
    # shape is `[batch_size, seq_length]`, both are dynamic
    text_input = tf.placeholder(tf.int32, [None, None], name='x-input')
    # shape of target is same as shape of input
    text_target = tf.placeholder(tf.int32, [None, None], name='y-input')
    # sequence length placeholder
    seq_lengths = tf.placeholder(tf.int32, [None], name='seq-lengths')

In [0]:
# build the model
initial_state, final_state, logits, probs = rnn_lstm(text_input,
                                                     FLAGS.hidden_size,
                                                     FLAGS.num_layers,
                                                     seq_lengths)





In [0]:
# We use the same loss function as we did for the CNN tutorial, i.e. Cross-Entropy.
# This time, we have to compute it for each time step, so we use a TensorFlow function
# that takes care of this for us.
with tf.name_scope("cross-entropy"):
    # Again, we should supply the logits, not the softmax-activated probabilities
    cross_entropy_loss = tf.contrib.seq2seq.sequence_loss(
        logits, text_target,
        weights=tf.ones_like(text_input, dtype=tf.float32))
    tf.summary.scalar('cross_entropy_loss', cross_entropy_loss)
    
    # The weights allow weighing the contribution of each batch entry and time step (here we don't use it).
    # This parameter can be useful if we have padded values (which we don't here). In this case, `weights`
    # serves like a mask that you should set to 0 for padded values, e.g. like this:
    #   weights = tf.sequence_mask(seq_lengths, max_seq_length, dtype=tf.float32)
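
To illustrate what the loss computes, here is a NumPy sketch of a weighted cross-entropy over a batch of sequences. It assumes that the loss is averaged over all batch entries and time steps with non-zero weight, which we believe matches the default settings of `tf.contrib.seq2seq.sequence_loss`; the function name `sequence_cross_entropy` is our own.

```python
import numpy as np

def sequence_cross_entropy(logits, targets, weights):
    """
    logits:  float array of shape [batch, time, vocab]
    targets: int array of shape [batch, time]
    weights: float array of shape [batch, time], 0 for positions to ignore
    Returns the weighted average cross-entropy over all positions.
    """
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # pick the log-probability of the target character at each position
    b_idx, t_idx = np.indices(targets.shape)
    nll = -log_probs[b_idx, t_idx, targets]
    return (nll * weights).sum() / weights.sum()

vocab_size = 66
logits = np.zeros((2, 3, vocab_size))       # all-zero logits: uniform prediction
targets = np.array([[1, 2, 3], [4, 5, 6]])
weights = np.ones((2, 3))
uniform_loss = sequence_cross_entropy(logits, targets, weights)
print(uniform_loss)
```

For all-zero logits the prediction is uniform over the 66 characters, so the loss equals $\ln 66 \approx 4.19$. This is a useful reference value for the loss of an untrained model.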

In [0]:
# check number of trainable parameters
def count_trainable_parameters():
    """Counts the number of trainable parameters in the current default graph."""
    tot_count = 0
    for v in tf.trainable_variables():
        v_count = 1
        for d in v.get_shape():
            v_count *= d.value
        tot_count += v_count
    return tot_count
print("Number of trainable parameters: {}".format(count_trainable_parameters()))

Number of trainable parameters: 873026
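
The reported number can be checked by hand. The sketch below assumes a standard LSTM cell without peephole connections or projection layers, so that each layer has four weight blocks acting on the concatenated input and hidden state plus a bias; the helper names are our own.

```python
def lstm_layer_params(input_size, hidden_size):
    # four gates/transforms, each with weights over [input, hidden] plus a bias
    return 4 * hidden_size * (input_size + hidden_size + 1)

def model_params(vocab_size, hidden_size, num_layers):
    total = 0
    layer_input = vocab_size          # the first layer sees one-hot characters
    for _ in range(num_layers):
        total += lstm_layer_params(layer_input, hidden_size)
        layer_input = hidden_size     # later layers see the previous layer's output
    total += hidden_size * vocab_size + vocab_size   # dense output layer
    return total

print(model_params(vocab_size=66, hidden_size=256, num_layers=2))
```

With a vocabulary of 66 characters, 256 hidden units and 2 layers this yields 330752 + 525312 + 16962 = 873026 parameters, in agreement with the count printed above.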


In [0]:
# create the optimizer
global_step = tf.Variable(1, name='global_step', trainable=False)
with tf.name_scope("train"):
    # choose Adam optimizer
    optim = tf.train.AdamOptimizer(FLAGS.learning_rate)
    
    # get the gradients
    params = tf.trainable_variables()
    gradients = tf.gradients(cross_entropy_loss, params)
    
    # clip the gradients to counteract exploding gradients
    clipped_gradients, _ = tf.clip_by_global_norm(gradients, 5)
    
    # backprop
    train_step = optim.apply_gradients(zip(clipped_gradients, params), global_step=global_step)
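
Clipping by global norm rescales all gradients jointly so that their combined L2 norm does not exceed the threshold, which preserves the direction of the update. The following pure-Python sketch mirrors this idea on nested lists; it is an illustration of the technique, not TensorFlow's implementation, and the toy gradient values are made up.

```python
import math

def clip_by_global_norm(gradients, clip_norm):
    """gradients: list of flat lists of floats. Returns (clipped gradients, global norm)."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    # the scale is 1 if the norm is already below the threshold
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in gradients]
    return clipped, global_norm

grads = [[3.0, 4.0], [12.0]]   # global norm = sqrt(9 + 16 + 144) = 13
clipped, norm = clip_by_global_norm(grads, clip_norm=5.0)
print(norm, clipped)
```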

In [0]:
def do_train_step(num_steps, summary_op):
    """Perform as many training steps as specified."""
    for i in range(num_steps):
        step = tf.train.global_step(sess, global_step)
        
        # reset batch pointer and shuffle the data if necessary
        data_loader.maybe_new_epoch()
        
        # get next batch
        x, y = data_loader.next_batch()
        
        # prepare feed_dict
        feed_dict = {text_input: x, text_target: y, seq_lengths: [x.shape[1]]*x.shape[0]}
        
        summary, train_loss, _ = sess.run([summary_op, cross_entropy_loss, train_step],
                                          feed_dict=feed_dict)
        
        writer_train.add_summary(summary, step)
        
        if step % FLAGS.print_every_steps == 0:
            print('[{}] Cross-Entropy Loss Training [{:.3f}]'.format(step, train_loss)) 

In [0]:
# Create the session
sess = tf.InteractiveSession()

# Initialize all variables
sess.run(tf.global_variables_initializer())

# To be able to see something in tensorboard, we must merge summaries to one common operation.
# Whenever we want to write summaries, we must request this operation from the graph.
# Note: creating the file writers should happen after the session was launched.
summaries_merged = tf.summary.merge_all()
writer_train = tf.summary.FileWriter(FLAGS.log_dir + '/train', sess.graph)

Now let's train this model for a couple of steps. Make sure you selected the GPU under Runtime > Change runtime type > Hardware Accelerator. Otherwise training will be quite slow.

In [0]:
do_train_step(1001, summaries_merged)

[20] Cross-Entropy Loss Training [3.443]
[40] Cross-Entropy Loss Training [3.356]
[60] Cross-Entropy Loss Training [3.315]
[80] Cross-Entropy Loss Training [3.206]
[100] Cross-Entropy Loss Training [3.081]
[120] Cross-Entropy Loss Training [2.922]
[140] Cross-Entropy Loss Training [2.787]
[160] Cross-Entropy Loss Training [2.628]
[180] Cross-Entropy Loss Training [2.485]
[200] Cross-Entropy Loss Training [2.415]
[220] Cross-Entropy Loss Training [2.368]
[240] Cross-Entropy Loss Training [2.320]
[260] Cross-Entropy Loss Training [2.328]
[280] Cross-Entropy Loss Training [2.263]
[300] Cross-Entropy Loss Training [2.242]
[320] Cross-Entropy Loss Training [2.172]
[340] Cross-Entropy Loss Training [2.184]
[360] Cross-Entropy Loss Training [2.166]
[380] Cross-Entropy Loss Training [2.114]
[400] Cross-Entropy Loss Training [2.106]
[420] Cross-Entropy Loss Training [2.095]
[440] Cross-Entropy Loss Training [2.083]
[460] Cross-Entropy Loss Training [2.058]
[480] Cross-Entropy Loss Training [2.0

The loss is consistently decreasing, so that looks promising. Let's now look at another feature of TensorFlow: storing checkpoints. During training, it is a good idea to regularly save the model that you have trained up to this point. Doing this with TensorFlow is pretty straightforward.

In [0]:
# Create a saver object. We must specify which variables it should save to disk (here all trainable
# variables; note that this excludes `global_step` and the optimizer's internal state) and optionally
# how many checkpoints should be retained (in this case 2).
saver = tf.train.Saver(var_list=tf.trainable_variables(), max_to_keep=2)

# now save the current session
saver.save(sess, os.path.join(FLAGS.log_dir, 'checkpoints', 'model_name'), global_step)

'/tmp/tensorflow/shakespeare_rnn/logs/checkpoints/model_name-1002'

And that's it! Of course, saving a model is only useful if we can load it again (e.g. to do inference with it or to continue training). This is also quite easy to do. We just call a `restore` function on the saver object.

In [0]:
# get a handle to the latest checkpoint that was stored
ckpt_path = tf.train.latest_checkpoint(os.path.join(FLAGS.log_dir, 'checkpoints'))

# now restore
saver.restore(sess, ckpt_path)

Again, pretty easy. Note however that `saver.restore` only loads the saved weights into the graph, i.e. it assumes a suitable graph exists already. If it does not, `restore` will most likely fail.

## Generating Text
We have seen how we can train a model to predict a single character given an input sequence. But how can we use this model to generate text? This is what we will discuss in the following.

One way to do this is to generate text character-by-character and feed the output of each time step back as input to the model. In other words, we get the output character for a given sequence, append that character to the sequence and repeat the whole process. This is illustrated in the following, where the black text is the input sequence and the blue character is the output character.

<center><img src="https://i.imgur.com/NnToWQe.png" align="middle" hspace="20px" vspace="5px"></center>

Let's implement this for our model.

In [0]:
def sample(prime_text, num_steps):
    """
    Sample `num_steps` characters from the model and initialize it with `prime_text`.
    :param prime_text: A string that we want to initialize the RNN with.
    :param num_steps: Integer specifying how many characters we want to predict after `prime_text`.
    :return: `prime_text` plus prediction.
    """
    # First we need to look up the initial text in the vocabulary
    input_prime = [data_loader.vocab[c] for c in prime_text]
    
    # Feed the prime sequence into the model. Note that we do not have to supply any targets
    # because we are not doing any backpropagation.
    feed_dict = {text_input: [input_prime],
                 seq_lengths: [len(input_prime)]}
    state, out_probs = sess.run([final_state, probs], feed_dict=feed_dict)
    
    # Now we have initialized the RNN with the given prime text. Let's see what it predicts
    # as the next character after the last from `prime_text`.
    # `out_probs` is of shape `[1, len(prime_text), vocab_size]`
    next_char_probs = out_probs[0, -1]
    
    # `next_char_probs` is a probability distribution over all characters in the vocabulary.
    # How do we determine which character is next? We could just take the one that is most
    # probable of course. But let's implement something a bit different: actually sample
    # from the probability distribution.
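    def weighted_pick(p_dist):
        # inverse-CDF sampling: draw u uniformly and count how many cumulative sums lie below it
        cs = np.cumsum(p_dist)
        # scale by cs[-1] so that float rounding in the sum can never yield an out-of-range index
        idx = int(np.sum(cs < np.random.rand() * cs[-1]))
        return idx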
    
    next_char = weighted_pick(next_char_probs)
    
    # save all predicted chars in a string
    predicted_text = vocab_inv[next_char]
    
    # now we can sample for `num_steps`
    for _ in range(num_steps):
        # Construct the feed dict. Note how we manually carry over the previous final state
        # and use it as the next initial state. If we don't do that, then TensorFlow will
        # automatically initialize the cells to the zero state and hence we lose all the
        # memory that we've built up to this point.
        feed_dict = {text_input: [[next_char]],
                     seq_lengths: [1],
                     initial_state: state}
        
        # get the prediction
        state, out_probs = sess.run([final_state, probs], feed_dict=feed_dict)
        
        # sample from the distribution
        next_char = weighted_pick(out_probs[0, -1])
        
        # append to already predicted text
        predicted_text += vocab_inv[next_char]   
    
    return prime_text + predicted_text
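
To see what `weighted_pick` does in isolation, the following sketch applies the same cumulative-sum trick to a made-up four-character distribution and compares it with always taking the most probable entry. The toy probabilities `toy_probs` and the seeded random generator are assumptions for illustration only; they are not taken from the trained model.

```python
import numpy as np

def weighted_pick(p_dist, rng):
    """Draw an index from `p_dist` by inverting its cumulative distribution."""
    cs = np.cumsum(p_dist)
    # Count how many cumulative sums lie below a uniform random number in [0, 1).
    idx = int(np.sum(cs < rng.rand()))
    # Guard against rounding error leaving the last cumulative sum slightly below 1.
    return min(idx, len(p_dist) - 1)

# Hypothetical distribution over a 4-character vocabulary (illustration only).
toy_probs = np.array([0.1, 0.6, 0.2, 0.1])
rng = np.random.RandomState(0)

# Greedy decoding returns the same index every time.
greedy_choice = int(np.argmax(toy_probs))

# Sampling visits every index roughly in proportion to its probability.
draws = [weighted_pick(toy_probs, rng) for _ in range(10000)]
empirical = np.bincount(draws, minlength=len(toy_probs)) / float(len(draws))

print("greedy choice:", greedy_choice)
print("empirical frequencies:", np.round(empirical, 2))
```

Sampling rather than taking the arg-max keeps the generated text from falling into the same most-likely continuation every time, at the price of occasionally picking an unlikely character.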

In [0]:
print(sample('The ', 500))

The wither of our Gloved
To well pricones or he' Elike.

All Sore:
I at the kins gill os is yet,
And priens; is you with lought hou han, and not me a beill's sreital.

KING EREWAll allow,
What I wolewn would bruck deathan hall: serist our now,, do.

LOrY CERLENR:

GEONTES:
Cormond the for.

CApol:
Nose but the' if your couther wafe,
Ale, whyos in the itls to the, now how it wors, lit; allen the dreat,
I dight iols; bue hou the prutcine poor go cance; nece ain weland,
In thut crust 


Depending on how long you trained the model, you should now be able to see some nice outputs. As an example, here is the output of a model that was trained for 5000 steps.

<center><img src="https://i.imgur.com/JhMzdEv.png" align="middle" hspace="20px" vspace="5px"></center>

We make a few observations:
  - The model successfully learned to create English words! Even if some might be purely imagined, they do sound at least like English words.
  - It also learned a great deal about the structure of the input data: mostly nice use of punctuation, it produces paragraphs that start with names in capital letters, etc.
  
Given the simplicity of our model, this is quite a nice result! Refer to Andrej Karpathy's [article](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) to see some more results of the same model (applied to different datasets) and more visualizations of what's going on inside the RNN.
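
The reason `sample` feeds the previous `final_state` back in as `initial_state` can be checked on a toy example. The sketch below is a plain NumPy vanilla RNN with random weights, not the LSTM model from this notebook; all names in it (`run_rnn`, `W_xh`, `W_hh`, ...) are assumptions made for illustration. It shows that processing a sequence one step at a time gives the same final state as processing it in one go, but only if the state is carried over.

```python
import numpy as np

rng = np.random.RandomState(0)
input_dim, hidden_dim, seq_len = 3, 4, 6

# Randomly initialized weights of a toy vanilla RNN (illustration only).
W_xh = rng.randn(input_dim, hidden_dim) * 0.5
W_hh = rng.randn(hidden_dim, hidden_dim) * 0.5
b_h = np.zeros(hidden_dim)

def run_rnn(inputs, initial_state):
    """Run the toy RNN over `inputs` of shape [time, input_dim]; return the final state."""
    h = initial_state
    for x_t in inputs:
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
    return h

inputs = rng.randn(seq_len, input_dim)
zero_state = np.zeros(hidden_dim)

# Reference: process the whole sequence in one go.
full = run_rnn(inputs, zero_state)

# Process a "prime" prefix first, then one step at a time while carrying the state.
state = run_rnn(inputs[:3], zero_state)
for x_t in inputs[3:]:
    state = run_rnn(x_t[None, :], state)

# Same stepping, but starting from the zero state at every step: the memory is lost.
for x_t in inputs[3:]:
    forgetful = run_rnn(x_t[None, :], zero_state)

print("carried state matches full run:", np.allclose(state, full))
print("reset state matches full run:", np.allclose(forgetful, full))
```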

In [0]:
# cleanup
sess.close()

## Concluding Remarks and Exercises
In this tutorial, you learned how to build an RNN that predicts the next character given a sequence of characters, and how to turn it into a generative model that produces sample text of arbitrary length. You should now be aware of the most important implementation details needed to train an RNN in TensorFlow: the difference between `tf.nn.static_rnn` and `tf.nn.dynamic_rnn`, the potential need for padding, sharing weights when mapping RNN hidden states back to the output space, using the initial and final state of the RNN to control the generation of sequences, etc.

In our example, we used an RNN to predict an output at each time step of the sequence. RNNs are, however, much more versatile than this and can be used in many more scenarios, as shown here ([source](http://karpathy.github.io/2015/05/21/rnn-effectiveness/)).

<center><img src="https://i.imgur.com/6L3Pbdc.png" align="middle" hspace="20px" vspace="5px"></center>

TensorFlow provides [functions](https://www.tensorflow.org/tutorials/seq2seq) to implement such models, which are sometimes also called _sequence-to-sequence_ (seq2seq) models.

To gain a deeper understanding of RNNs, we encourage you to make a copy of this notebook and play around with it. Below are a couple of (optional) exercises that you might want to look at. You will also find information about some more advanced topics in the appendix, which we provide for self-study.

  1. Read [Andrej Karpathy's](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) and [colah's](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) blog posts.
  2. We did not use a validation set in this tutorial. Implement this and evaluate the validation set every few epochs (as we did in the CNN tutorial).
  3. Compute the training accuracy, print it to the console and visualize it in TensorBoard. Check the [CNN notebook](https://colab.research.google.com/drive/1bx2dlJYutNitK-hlhp98OHO-5WZnrNyV) to see how you can use TensorBoard on Colab.
  4. Try out different cell types other than LSTMs, e.g. GRU.
  5. How could you add some regularization to this model? 
  6. Play around with the hyper-parameters. What happens if you omit the gradient clipping? How susceptible is the training to changes in the learning rate? Can you find a model that has fewer parameters but performs equally well?
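
As a starting point for the gradient clipping question in exercise 6, it can help to look at what clipping does to a set of gradients. The NumPy sketch below follows the rescaling rule documented for `tf.clip_by_global_norm`, under the assumption that clipping is done by global norm. The function name `clip_by_global_norm_np`, the toy gradients and the threshold of 5 are made up for illustration.

```python
import numpy as np

def clip_by_global_norm_np(grads, clip_norm):
    """Rescale all gradients jointly so that their global norm is at most `clip_norm`."""
    # The global norm treats all gradient tensors as one long vector.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # The scale is 1 if the norm is already small enough, otherwise clip_norm / global_norm.
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Toy gradients of two "variables" with different shapes (illustration only).
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]

clipped, norm_before = clip_by_global_norm_np(grads, clip_norm=5.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print("global norm before clipping:", norm_before)
print("global norm after clipping:", norm_after)
```

Because all gradients are scaled by the same factor, the direction of the update is preserved and only its length is limited, which is what protects training against occasional exploding gradients.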

## Appendix
### Writing Your Own RNN Cell
Sometimes the standard LSTM cell provided by TensorFlow is just not enough, or you would like to do fancier stuff within a cell. Knowing how to write your own RNN cell that you can feed into `tf.nn.dynamic_rnn` can therefore be very useful. For example, we usually want to add a decoder on the outputs of the RNN, i.e., a dense layer that maps back to the output space. In the above model, we did this by reshaping the outputs of the RNN and then adding the dense layer on top of that. It would be more elegant, however, if we could do this directly in a cell, because the dense layer operates independently on the output of each time step anyway. To do this, we can write a custom RNN cell.

In [0]:
from tensorflow.python.ops.rnn_cell_impl import RNNCell
from tensorflow.contrib.rnn import LSTMStateTuple

class DenseDecoderCell(RNNCell):
    """
    Wraps an existing RNNCell by decoding the outputs of the cell before returning them.
    """
    def __init__(self, cell, output_dim):
        if not isinstance(cell, RNNCell):
            raise TypeError("The parameter cell is not an RNNCell.")
            
        super(DenseDecoderCell, self).__init__()
        self._cell = cell
        self._output_size = output_dim
        
        if hasattr(cell.state_size, '__getitem__'):
            # `cell` is a MultiRNNCell, so get the size from the last layer
            hidden_size = cell.state_size[-1]
        else:
            hidden_size = cell.state_size
            
        if isinstance(hidden_size, LSTMStateTuple):
            # LSTM cells have special state
            hidden_size = hidden_size.h
        
        # Create the weights of the decoder. However, we must be cautious:
        # the same decoder must be applied at every time step we unroll the
        # model for, so its parameters have to be shared between time steps
        # rather than instantiated anew for each of them. To achieve this,
        # we use a nice feature of TensorFlow, namely `tf.get_variable()`.
        # This function creates a variable if it does not exist yet or else
        # just retrieves it from the graph and returns a handle to it.
        self._decoder_w = tf.get_variable("decoder_cell_w",
                                          shape=[hidden_size, output_dim],
                                          dtype=tf.float32,
                                          initializer=tf.truncated_normal_initializer(stddev=0.1))
        self._decoder_b = tf.get_variable("decoder_cell_b",
                                          shape=[output_dim],
                                          dtype=tf.float32,
                                          initializer=tf.constant_initializer(0.1))
    
    @property
    def state_size(self):
        # just return the state size of the cell we are wrapping
        return self._cell.state_size
    
    @property
    def output_size(self):
        # must return the dimensionality of the tensor returned by this cell
        return self._output_size
    
    def __call__(self, inputs, state, scope=None):
        """
        This is the function that is called at runtime. `inputs` is a tensor of shape
        `[batch_size, input_dimension]`, i.e., only one time step for the sequence is given.
        `state` is the cell's state from the previous time step and `scope` is just the
        current scope (can be ignored most of the time).
        """
        # just forward our call to the cell
        output, next_state = self._cell(inputs, state, scope)
        
        # decode the output
        output_dec = tf.matmul(output, self._decoder_w) + self._decoder_b
        
        return output_dec, next_state

This way, our original code for creating an RNN simplifies a bit. You can use the following function to replace `rnn_lstm` from above.

In [0]:
def rnn_lstm_neat(inputs, hidden_size, num_layers, seq_lengths):
    """
    Builds an RNN with LSTM cells.
    :param inputs: The input tensor to the RNN in shape `[batch_size, seq_length]`.
    :param hidden_size: The number of units for each LSTM cell.
    :param num_layers: The number of LSTM cells we want to use.
    :param seq_lengths: Tensor of shape `[batch_size]` specifying the total number
      of time steps per sequence.
    :return: The initial state, final state, predicted logits and probabilities.
    """
    # we first create a one-hot encoding of the inputs
    # the resulting shape is `[batch_size, seq_length, vocab_size]`
    vocab_size = data_loader.vocab_size
    input_one_hot = tf.one_hot(inputs, vocab_size, axis=-1)
    
    # create a list of all LSTM cells we want
    cells = [tf.nn.rnn_cell.BasicLSTMCell(num_units=hidden_size) for _ in range(num_layers)]
    
    # we stack the cells together and create one big RNN cell
    cell = tf.nn.rnn_cell.MultiRNNCell(cells)
    
    # decode the outputs
    cell = DenseDecoderCell(cell, output_dim=vocab_size)
    
    # we need to set an initial state for the cells
    batch_size = tf.shape(inputs)[0]
    initial_state = cell.zero_state(batch_size, dtype=tf.float32)
    
    # now we are ready to unroll the graph
    outputs, final_state = tf.nn.dynamic_rnn(cell=cell,
                                             initial_state=initial_state,
                                             inputs=input_one_hot,
                                             sequence_length=seq_lengths)
    
    # The `outputs` tensor now has shape `[batch_size, seq_length, vocab_size]`,
    # so we can directly think of the output as the logits.
    logits = outputs
    
    # activate to turn logits into probabilities
    probs = tf.nn.softmax(logits)
    
    # we return the initial and final states because this will be useful later
    return initial_state, final_state, logits, probs
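
The decoder cell relies on the fact that the dense layer acts on every time step independently, so decoding inside the cell and decoding after a reshape give the same logits. The following NumPy sketch checks this on random stand-in tensors; the shapes and the names `rnn_outputs`, `decoder_w` and `decoder_b` are assumptions for illustration and no TensorFlow graph is involved.

```python
import numpy as np

rng = np.random.RandomState(0)
batch_size, seq_len, hidden_size, vocab_size = 2, 5, 8, 11

# Stand-ins for the RNN outputs and the decoder parameters (random, illustration only).
rnn_outputs = rng.randn(batch_size, seq_len, hidden_size)
decoder_w = rng.randn(hidden_size, vocab_size)
decoder_b = rng.randn(vocab_size)

# Variant 1: flatten batch and time, apply the dense layer once, reshape back.
flat = rnn_outputs.reshape(-1, hidden_size)
logits_reshaped = (flat @ decoder_w + decoder_b).reshape(batch_size, seq_len, vocab_size)

# Variant 2: apply the same dense layer separately at every time step, as a cell would.
logits_per_step = np.stack(
    [rnn_outputs[:, t] @ decoder_w + decoder_b for t in range(seq_len)], axis=1)

print("shape:", logits_per_step.shape)
print("both variants agree:", np.allclose(logits_reshaped, logits_per_step))
```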

### Sharing Weights
Sharing weights between different structures of the graph is a very useful feature (as seen for example in the previous section about custom RNN cells). TensorFlow easily allows this via the function `tf.get_variable`. Essentially, `tf.get_variable` creates a variable if it does not exist yet or otherwise retrieves it by name from the current graph. Note that the current variable scope influences the name of the variable, so it is important to get the scope right. Here is a simple example.

In [0]:
tf.reset_default_graph()
with tf.variable_scope("great_scope"):
        # Variables created here will be named "great_scope/whatever_name"
        w = tf.get_variable("weights", shape=[256, 10], initializer=tf.constant_initializer(0.1))
        
print('all variable names', [v.name for v in tf.global_variables()])

# create another variable but outside the scope, this will create a new variable because the variable scope
# is not set, i.e. `tf.get_variable` searches for a variable named "weights" which does not yet exist, so
# it creates it
w2 = tf.get_variable("weights", shape=[256, 10], initializer=tf.constant_initializer(0.1))
print('all variable names (one created)', [v.name for v in tf.global_variables()])

# let's retrieve `w`
# Note: if we do not set `reuse=True`, TensorFlow will throw an error. This is just to make
# sure that you don't unintentionally share variables.
with tf.variable_scope("great_scope", reuse=True):
        # Because `reuse=True`, this retrieves the existing "great_scope/weights" instead of creating it
        w = tf.get_variable("weights", shape=[256, 10], initializer=tf.constant_initializer(0.1))
print('all variable names (none created)', [v.name for v in tf.global_variables()])

# If we set `reuse=True` but the variable does not exist, TensorFlow will also throw an error

all variable names ['great_scope/weights:0']
all variable names (one created) ['great_scope/weights:0', 'weights:0']
all variable names (none created) ['great_scope/weights:0', 'weights:0']


This can get tricky. Newer versions of the API take care of some of this for you. If you'd like to know more about how the sharing of variables works, we recommend reading [this post](https://chromium.googlesource.com/external/github.com/tensorflow/tensorflow/+/r1.0/tensorflow/g3doc/how_tos/variable_scope/index.md).
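
If the rules feel opaque, it may help to see them spelled out in plain Python. The class below is a hypothetical toy model of the create-or-retrieve behaviour described in this section. It is not how TensorFlow implements variable scopes, and `ToyVariableStore` and its methods are invented for illustration.

```python
class ToyVariableStore:
    """Hypothetical stand-in for a variable store; not TensorFlow's implementation."""

    def __init__(self):
        self._variables = {}

    def get_variable(self, scope, name, initial_value=None, reuse=False):
        # The scope becomes a prefix of the variable name.
        full_name = scope + "/" + name if scope else name
        exists = full_name in self._variables
        if reuse and not exists:
            raise ValueError("%s does not exist, so it cannot be reused" % full_name)
        if not reuse and exists:
            raise ValueError("%s already exists, set reuse=True to share it" % full_name)
        if not exists:
            self._variables[full_name] = {"name": full_name, "value": initial_value}
        return self._variables[full_name]

    def names(self):
        return sorted(self._variables)

store = ToyVariableStore()
w = store.get_variable("great_scope", "weights", initial_value=0.1)
w2 = store.get_variable("", "weights", initial_value=0.1)        # different name, new variable
w_shared = store.get_variable("great_scope", "weights", reuse=True)

# Both misuse cases described above end in an error.
errors = []
for kwargs in ({"reuse": False}, {"reuse": True, "name": "missing"}):
    try:
        store.get_variable("great_scope", kwargs.get("name", "weights"),
                           reuse=kwargs["reuse"])
    except ValueError as err:
        errors.append(str(err))

print(store.names())
print("shared handle is the same object:", w_shared is w)
print("number of rejected calls:", len(errors))
```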

### Bi-directional RNNs
Bi-directional RNNs (BiRNNs) are a powerful variant of recurrent networks. Instead of having only a hidden layer that connects states forward in time, BiRNNs have an additional layer that connects states backward in time. This means that every time step can draw information from the past as well as the future to produce its prediction. The computational graph of a BiRNN looks something like this ([source](http://www.deeplearningbook.org/contents/rnn.html)):

<center><img src="https://i.imgur.com/Z1FggLN.png" align="middle" hspace="20px" vspace="5px"></center>

You can see that both recurrent layers are connected to the output, but they are not connected amongst themselves. For the task of building a character-level language model, we could technically use a BiRNN to predict single characters in gaps (or even more characters). However, it is then not straightforward to turn this BiRNN into a generative model, because in order to generate predictions, we always need data from the future, too. Hence, the BiRNN, while powerful, is not always suitable for the problem at hand. TensorFlow's API supports the creation of BiRNNs, and it is not much different than creating a uni-directional RNN. In the older API, the function is [`tf.compat.v1.nn.bidirectional_dynamic_rnn`](https://www.tensorflow.org/api_docs/python/tf/compat/v1/nn/bidirectional_dynamic_rnn). Under this link you will find a link to the new API as well.
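
The dependence on future inputs can be made concrete with a small NumPy sketch. It is a toy tanh BiRNN with random weights, not the TensorFlow function linked above, and all names in it (`run_direction`, `birnn`, ...) are assumptions for illustration. We change only the last input of a sequence and check which half of the output at the first time step reacts.

```python
import numpy as np

rng = np.random.RandomState(0)
input_dim, hidden_dim, seq_len = 3, 4, 5

def make_params():
    # Random input-to-hidden and hidden-to-hidden weights (illustration only).
    return (rng.randn(input_dim, hidden_dim) * 0.5,
            rng.randn(hidden_dim, hidden_dim) * 0.5)

def run_direction(inputs, params):
    """Run a toy tanh RNN over `inputs` [time, input_dim]; return all hidden states."""
    W_xh, W_hh = params
    h = np.zeros(hidden_dim)
    states = []
    for x_t in inputs:
        h = np.tanh(x_t @ W_xh + h @ W_hh)
        states.append(h)
    return np.stack(states)

def birnn(inputs, fw_params, bw_params):
    fw = run_direction(inputs, fw_params)
    # The backward layer reads the sequence in reverse; flip its states back into time order.
    bw = run_direction(inputs[::-1], bw_params)[::-1]
    # Each time step sees the concatenation of both directions.
    return np.concatenate([fw, bw], axis=-1)

fw_params, bw_params = make_params(), make_params()
inputs = rng.randn(seq_len, input_dim)
outputs = birnn(inputs, fw_params, bw_params)

# Change only the last time step and look at the output of the first time step.
changed = inputs.copy()
changed[-1] += 1.0
outputs_changed = birnn(changed, fw_params, bw_params)

fw_shift = np.abs(outputs[0, :hidden_dim] - outputs_changed[0, :hidden_dim]).max()
bw_shift = np.abs(outputs[0, hidden_dim:] - outputs_changed[0, hidden_dim:]).max()
print("change in forward half at step 0:", fw_shift)
print("change in backward half at step 0:", bw_shift)
```

The forward half at step 0 is untouched, while the backward half changes. This is exactly why a BiRNN cannot be sampled from one character at a time: the inputs it would need to read in the backward direction have not been generated yet.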