# Text Generation with Recurrent Neural Networks

In this notebook, we will see how to define and fit a recurrent neural network (RNN) using TensorFlow. We will fit the model on two ebooks, downloaded in plain text format from the [Project Gutenberg website](http://www.gutenberg.org/ebooks/). In particular, we will use "Alice in Wonderland" and "Hamlet".

RNNs are computationally expensive to fit. Real-world applications are typically distributed or parallelized across multiple CPUs and GPUs. However, we only have a laptop with no GPU acceleration. Thus, we will make a set of simplifications:

- We will train the RNN to predict *one character at a time*, rather than one word at a time. This has the advantage that we can reduce the input and output dimensionality (there are tens or hundreds of thousands of words versus around one hundred characters).

- To further reduce the complexity, we will consider *lower case* characters only, i.e., we will convert all characters in the original text to lower case.

- We will consider only reduced versions of the data (Chapters I, II, and III of "Alice in Wonderland"; and Act I of "Hamlet").

Even with these simplifications, fitting the model on a laptop still takes a few hours, so we will provide the fitted model results for you to load directly. However, let's leave that for later, and focus now on how to leverage TensorFlow to fit an RNN for us.

First, import the packages.

In [1]:
import tensorflow as tf
import numpy as np
import random

## Data

First, load the (reduced) data for "Hamlet", and convert it to lower case. Note that we will keep special characters, such as punctuation marks (`,` `.` `"`) or the newline character (`\n`).

In [2]:
# Load (the reduced version of) "Hamlet" in plain text format
filename = "dat/hamlet-reduced.txt"
with open(filename) as file:
    raw_text = file.read()

**[Task]** Use the cell below to convert the text in variable `raw_text` to lower case. (Overwrite the variable `raw_text`.)

In [3]:
raw_text = raw_text.lower()

**[Task]** Create the variable `n_chars_total` with the total number of characters. Print this number in the cell below.

In [4]:
n_chars_total = len(raw_text)
print('There are {:d} characters in the text'.format(n_chars_total))

There are 36502 characters in the text


**[Task]** Use the cell below to print the first 1000 characters of the text and confirm that it is lower case.

In [5]:
print(raw_text[0:1000])

actus primus. scoena prima.

enter barnardo and francisco two centinels.

  barnardo. who's there?
  fran. nay answer me: stand & vnfold
your selfe

   bar. long liue the king

   fran. barnardo?
  bar. he

   fran. you come most carefully vpon your houre

   bar. 'tis now strook twelue, get thee to bed francisco

   fran. for this releefe much thankes: 'tis bitter cold,
and i am sicke at heart

   barn. haue you had quiet guard?
  fran. not a mouse stirring

   barn. well, goodnight. if you do meet horatio and
marcellus, the riuals of my watch, bid them make hast.
enter horatio and marcellus.

  fran. i thinke i heare them. stand: who's there?
  hor. friends to this ground

   mar. and leige-men to the dane

   fran. giue you good night

   mar. o farwel honest soldier, who hath relieu'd you?
  fra. barnardo ha's my place: giue you goodnight.

exit fran.

  mar. holla barnardo

   bar. say, what is horatio there?
  hor. a peece of him

   bar. welcome horatio, welcome good marcellus



We now define the "vocabulary" (i.e., the set of unique characters).

In [6]:
# List of unique chars in the text
chars = sorted(list(set(raw_text)))
print(chars)

['\n', ' ', '!', '&', "'", '(', ')', ',', '-', '.', ':', ';', '?', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


**[Task]** Create the variable `vocab_size` with the number of unique characters. Print the number of unique characters.

In [7]:
vocab_size = len(chars)
print("There are {} unique characters".format(vocab_size))

There are 40 unique characters


Now, we will create a dictionary that maps characters to integers (similarly to tokenization for words), and a reverse dictionary that maps integers to characters. We will use the reverse dictionary after fitting the model, at the text generation step.

In [8]:
# Create dictionary
dictionary_char2int = dict((c, i) for i, c in enumerate(chars))

**[Task]** Create the reverse dictionary `dictionary_int2char` (use the cell below).

In [9]:
dictionary_int2char = dict(zip(dictionary_char2int.values(), dictionary_char2int.keys()))
# Alternatively: 
# dictionary_int2char = dict((i, c) for i, c in enumerate(chars))
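
As a quick sanity check, the two dictionaries should be inverses of each other. The sketch below illustrates this on a toy string rather than on the notebook's `raw_text`; the names `toy_text`, `encoded`, and `decoded` are illustrative assumptions and do not appear elsewhere in the notebook.

```python
# Toy text standing in for raw_text (illustrative assumption)
toy_text = "who's there?"

# Build the vocabulary and both dictionaries, as in the cells above
chars = sorted(set(toy_text))
dictionary_char2int = {c: i for i, c in enumerate(chars)}
dictionary_int2char = {i: c for c, i in dictionary_char2int.items()}

# Encode the text as integers, then decode it back
encoded = [dictionary_char2int[c] for c in toy_text]
decoded = ''.join(dictionary_int2char[i] for i in encoded)

print(encoded)
print(decoded == toy_text)  # the round trip should recover the original text
```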

### Data as Sequences of Characters

Now we will create data sequences based on the raw text. Although RNNs have the potential to keep track of long-term memory (especially GRUs and LSTMs), in practice that turns out to be expensive, so it is common to truncate the sequences to some fixed, reduced length.

In our case, we will use the variable `seq_length` and fit the RNN to predict the next character given a sequence of length `seq_length`. For instance, consider the phrase
```
alice was [...]
```
and consider a hypothetical example in which `seq_length=4`. Then, we would form the following sequences:
```
alic --> Predict "e"
lice --> Predict " "
ice  --> Predict "w"
ce w --> Predict "a"
e wa --> Predict "s"
 was --> Predict " "
...
```
(Of course, we need a value greater than 4 for the sequence length; below we set `seq_length=100`.)

We define the function below to create those sequences, together with the labels that need to be predicted.

In [10]:
def create_sequences(seq_length):
    # List of sequences
    X_seq = [raw_text[i:i+seq_length] for i in range(n_chars_total-seq_length)]
    # List of targets
    Y_seq = [raw_text[i+seq_length] for i in range(n_chars_total-seq_length)]
    # Return
    return X_seq, Y_seq
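
To see the sliding window in action, here is a minimal sketch that applies the same logic to a toy phrase with `seq_length=4`. The function `create_sequences_from` is a hypothetical variant that takes the text as an argument instead of reading the global `raw_text`.

```python
def create_sequences_from(text, seq_length):
    """Variant of create_sequences that takes the text as an argument."""
    n = len(text) - seq_length
    # Each input is a window of seq_length characters...
    X = [text[i:i + seq_length] for i in range(n)]
    # ...and its target is the character right after the window
    Y = [text[i + seq_length] for i in range(n)]
    return X, Y

toy_X, toy_Y = create_sequences_from("alice was", 4)
for x, y in zip(toy_X, toy_Y):
    print(repr(x), '-->', repr(y))
```

A text with `n` characters yields `n - seq_length` pairs, because the last window must still have a character after it to serve as the target.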

In the cell below, we specify the value of `seq_length` and build the sequences.

In [11]:
seq_length = 100
X_seq, Y_seq = create_sequences(seq_length)

**[Task]** Create variable `N` containing the total number of sequences. Print `N`.

In [12]:
N = len(X_seq)
print("There are {} sequences in total, each of length {}".format(N, seq_length))

There are 36402 sequences in total, each of length 100


**[Task]** In the cell below, print one of the character sequences and its associated label to be predicted.

In [13]:
print("SEQUENCE:")
print(X_seq[14])
print("")
print("TARGET:")
print(Y_seq[14])

SEQUENCE:
scoena prima.

enter barnardo and francisco two centinels.

  barnardo. who's there?
  fran. nay ans

TARGET:
w


### Batches of Data

We cannot use all the `N` sequences at each iteration of the gradient descent algorithm; that would be computationally very expensive. In neural network applications, it is typically preferred to use stochastic gradient descent (SGD). In SGD, at each iteration of the algorithm we choose a subset of the training data and proceed as if this were the actual dataset.

In the code below, we write the function `get_minibatch`, which at each iteration returns the next minibatch of `batch_size` sequences. It also converts the sequences to integer sequences (instead of char) according to the dictionary, and returns the results as numpy matrices.

For simplicity, we don't choose the minibatch at random (as we should), but instead iterate over the sequences in order.

In [14]:
def get_minibatch(N, batch_size):
    for i in range(N // batch_size):
        # Define the size of the numpy matrices
        Xbatch = np.zeros((batch_size, seq_length)).astype(np.int32) # batch_size x seq_length
        Ybatch = np.zeros((batch_size)).astype(np.int32)             # batch_size
        # For each sequence in minibatch
        for j in range(batch_size):
            # Convert characters to int (for the sequences)
            Xaux = [dictionary_char2int[c] for c in X_seq[i*batch_size+j]]
            Xbatch[j,:] = Xaux
            # Convert characters to int (for the targets)
            Yaux = dictionary_char2int[Y_seq[i*batch_size+j]]
            Ybatch[j] = Yaux
        # Return (iterable object)
        yield Xbatch, Ybatch
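
If you want to visit the sequences in random order, one option is to shuffle the sequence indices once per epoch and then slice them into minibatches. The sketch below only produces the indices; `shuffled_minibatch_indices` and its `seed` argument are hypothetical and not part of the notebook.

```python
import random

def shuffled_minibatch_indices(N, batch_size, seed=None):
    """Yield lists of sequence indices, visiting the sequences in random order."""
    rng = random.Random(seed)
    indices = list(range(N))
    rng.shuffle(indices)
    # As in get_minibatch, the incomplete last minibatch is dropped
    for i in range(N // batch_size):
        yield indices[i * batch_size:(i + 1) * batch_size]

# Values of N and batch_size taken from this notebook
batches = list(shuffled_minibatch_indices(N=36402, batch_size=30, seed=0))
print(len(batches))   # number of minibatches per epoch
print(36402 % 30)     # sequences left out of each epoch
```

With `N = 36402` and `batch_size = 30`, each epoch consists of 1213 minibatches, and 12 sequences are skipped because they do not fill a complete minibatch.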

### Placeholders

Now we declare placeholders for the data. (This is already part of TensorFlow's *computational graph*.) **Each placeholder will contain a minibatch of data**, and thus `X_minibatch` has size `batch_size x seq_length` and `Y_minibatch` has length `batch_size`. We will use a batch size of 30.

In [15]:
# Declare the minibatch size
batch_size = 30
# Declare placeholders for minibatches of data
X_minibatch = tf.placeholder(tf.int32, shape=(batch_size, seq_length))
Y_minibatch = tf.placeholder(tf.int32, shape=(batch_size))

**Convert to one-hot representation.** As input features for the RNN, we will use the one-hot representation of each character, which avoids the need to learn feature vectors. In the cell below, we convert the sequences to their one-hot representation.

In [16]:
# Now convert the sequences to one-hot representation
X_minibatch_one_hot = tf.one_hot(X_minibatch, vocab_size)

**[Task]** Use the cell below to find out the size (or shape) of `X_minibatch_one_hot`. What does each dimension of the tensor represent?

In [17]:
print(X_minibatch_one_hot.shape)  # batch_size x seq_length x vocab_size

(30, 100, 40)
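
To make the three dimensions concrete, here is a small pure-Python sketch of the one-hot conversion on a toy minibatch. It is only an illustration of what `tf.one_hot` produces; the helper `one_hot` and the `toy_*` names are assumptions.

```python
def one_hot(index, depth):
    """Return a list of length depth with a 1.0 at position index."""
    vec = [0.0] * depth
    vec[index] = 1.0
    return vec

# Toy minibatch: 2 sequences of 3 integer-coded characters, vocabulary of 5
toy_batch = [[0, 3, 1],
             [4, 4, 2]]
toy_vocab_size = 5
toy_one_hot = [[one_hot(c, toy_vocab_size) for c in seq] for seq in toy_batch]

# Shape: batch_size x seq_length x vocab_size
print(len(toy_one_hot), len(toy_one_hot[0]), len(toy_one_hot[0][0]))
# One-hot vector for the second character of the first sequence
print(toy_one_hot[0][1])
```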


## The Recurrent Neural Network

**1. Define the neural network**

Now we build the neural network. TensorFlow has built-in support for standard RNNs, as well as GRUs and LSTMs. We will use GRUs because they can capture long-term dependencies better than standard RNNs.

RNNs have two main hyperparameters: the number of layers (`num_layers`) and the number of hidden units in each layer (`hidden_units`). In the cell below, we create a two-layer GRU with 256 hidden units per layer.

In [18]:
# Number of hidden layers
num_layers = 2
# Number of hidden units per layer
hidden_units = 256

In [19]:
# Form a list of GRU cells (one per layer)
cell_list = []
for h in range(num_layers):
    cell_list.append(tf.contrib.rnn.GRUCell(hidden_units)) # h-th layer (GRU cell)
# Stack all GRU cells in the list
cell = tf.contrib.rnn.MultiRNNCell(cell_list)
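
To get a feeling for the size of this model, the sketch below counts its trainable parameters. It assumes the standard GRU parameterization (an update gate, a reset gate, and a candidate state, each with a weight matrix over the concatenated input and previous state, plus a bias vector); the helper `gru_param_count` is hypothetical.

```python
def gru_param_count(input_size, hidden_units):
    """Parameters of one GRU layer: 3 blocks (2 gates + candidate state),
    each with an (input_size + hidden_units) x hidden_units matrix and a bias."""
    return 3 * ((input_size + hidden_units) * hidden_units + hidden_units)

vocab_size, hidden_units = 40, 256
layer1 = gru_param_count(vocab_size, hidden_units)    # input: one-hot characters
layer2 = gru_param_count(hidden_units, hidden_units)  # input: output of layer 1
print(layer1, layer2, layer1 + layer2)
```

Under these assumptions the two layers have 228096 and 393984 parameters, respectively, for a total of 622080.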

**2. Apply the NN to the data**

We now apply the 2-layer GRU to all the sequences in the minibatch. Specifically, this means computing all the inner products and non-linearities, up to the end of each sequence. TensorFlow's function `dynamic_rnn` does that for us:

In [20]:
# Apply the RNN to an entire minibatch of data
output, _ = tf.nn.dynamic_rnn(cell, X_minibatch_one_hot, dtype=tf.float32)

**[Task]** Use the cell below to print the size of the variable `output`. What do these numbers correspond to? Do they make sense?

In [21]:
print(output.shape)   # batch_size x seq_length x hidden_units

(30, 100, 256)


**3. Predict the next character**

Now we want to predict the next character, based on the output of the RNN. Recall that, for each sequence in the minibatch, the output is a `seq_length x hidden_units` tensor. We will not use the outputs for the intermediate time steps in the sequence, but only the last one, which is a vector of length `hidden_units` (for each sequence). The line below gathers the output for the last time step and reshapes it as a `batch_size x hidden_units` matrix.

In [22]:
# Keep only the output for the last time step
output_last = tf.squeeze(output[:, seq_length-1, :])

**[Task]** Print the shape of `output_last` and check whether it has the expected size.

In [23]:
print(output_last.shape)   # batch_size x hidden_units

(30, 256)


We will use the output from the last time step to predict the next character. For that, we will include a softmax layer, which converts the output dimension (`hidden_units`) to the logits (of length `vocab_size`). Recall that a softmax layer has weights and intercepts.

**[Task]** Declare the parameters for the softmax layer as TensorFlow variables. Use the name `weights_softmax` for the weights and `intercept_softmax` for the intercepts. Be careful to choose the appropriate size. Initialize the intercept to zero, and initialize the weights randomly following a truncated normal distribution (`tf.truncated_normal`) with standard deviation $0.01$.

In [24]:
# Declare and initialize the softmax parameters: weights and intercepts
weights_softmax = tf.Variable( tf.truncated_normal((hidden_units, vocab_size), stddev=0.01) )
intercept_softmax = tf.Variable( tf.zeros(vocab_size), dtype=tf.float32 )

Let's now compute the logits using these variables.

**[Task]** In the cell below, compute the variable `logits`. Print the size of `logits` to check that it has the appropriate size.

In [25]:
# Compute the logits as usual (weight*output_of_previous_layer + intercept)
logits = tf.matmul(output_last, weights_softmax) + intercept_softmax
print(logits.shape)  # batch_size x vocab_size

(30, 40)


**4. Compute the loss**

Now we compute the loss, which is the standard average negative log-likelihood of the observed labels (i.e., the cross-entropy), as in multinomial logistic regression:
$$
\mathcal{L} = -\frac{1}{B} \sum_{b=1}^B \log (\widehat{y}_b),
$$
where $B$ is the minibatch size, $b$ indexes the sequences in the minibatch, and $\widehat{y}_b$ denotes the predicted probability for the observed class (label) of sequence $b$.

TensorFlow has the function `sparse_softmax_cross_entropy_with_logits`, which computes the individual terms $-\log (\widehat{y}_b)$ for us; only the mean remains to be taken.

In [26]:
# Compute the loss for each sequence in the minibatch
losses = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=logits, labels=Y_minibatch)

**[Task]** Compute the average of the loss into variable `loss`.

In [27]:
# Take the average
loss = tf.reduce_mean(losses)
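
The following pure-Python sketch reproduces the same computation on a toy minibatch: for each sequence, the loss is the negative log of the softmax probability assigned to the observed label, and the final loss is the average over the minibatch. The function name and the `toy_*` values are illustrative assumptions, not TensorFlow's implementation.

```python
import math

def sparse_softmax_cross_entropy(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]

# Toy minibatch: B = 2 sequences, vocabulary of 3 characters
toy_logits = [[2.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]]
toy_labels = [0, 2]

# Per-sequence losses, then their average
toy_losses = [sparse_softmax_cross_entropy(z, y)
              for z, y in zip(toy_logits, toy_labels)]
toy_loss = sum(toy_losses) / len(toy_losses)
print(toy_losses, toy_loss)
```

The second sequence has uniform logits, so its loss equals $\log 3$; the first one assigns a higher probability to the correct label and therefore has a smaller loss.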

**5. Define the optimizer**

As in most TensorFlow applications, we ultimately want to solve an optimization problem. In this case, we want to minimize the loss. For that, we use the Adadelta optimizer (other optimizers would also work), with learning rate 1.0.

In [28]:
learning_rate = 1.0
optimizer = tf.train.AdadeltaOptimizer(learning_rate).minimize(loss)
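
For intuition, here is a minimal sketch of the Adadelta update rule on a scalar toy problem. It follows the commonly published form of the algorithm and is not TensorFlow's implementation; the function `adadelta_minimize` and the values of `rho`, `epsilon`, and `steps` are assumptions chosen for illustration.

```python
import math

def adadelta_minimize(grad_fn, x0, learning_rate=1.0, rho=0.95,
                      epsilon=1e-8, steps=50):
    """Run `steps` Adadelta updates on a scalar parameter."""
    x = x0
    avg_sq_grad = 0.0    # running average of squared gradients
    avg_sq_delta = 0.0   # running average of squared updates
    for _ in range(steps):
        g = grad_fn(x)
        avg_sq_grad = rho * avg_sq_grad + (1 - rho) * g * g
        # The step adapts to the ratio of past update sizes to gradient sizes
        delta = -math.sqrt(avg_sq_delta + epsilon) / math.sqrt(avg_sq_grad + epsilon) * g
        avg_sq_delta = rho * avg_sq_delta + (1 - rho) * delta * delta
        x += learning_rate * delta
    return x

# Toy objective f(x) = x^2, whose gradient is 2x and whose minimum is at x = 0
x_final = adadelta_minimize(lambda x: 2.0 * x, x0=1.0)
print(x_final)
```

On this toy problem the iterate moves toward the minimum at 0, slowly at first because the initial step size is on the order of the square root of `epsilon`.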

## Run the Optimization: TensorFlow Session

Here, we will create a session to optimize the model parameters. But first, we use a `Saver` object to save the model parameters once they have been fit. The reason is that we want to reuse the model parameters to make predictions, but fitting the model is computationally expensive, so we want to fit the model first, and then save the parameters to disk.

In [29]:
# Object to allow saving/restoring the value of TensorFlow variables within a session
saver = tf.train.Saver()

We now run 100 epochs of the optimization algorithm.

**[Task]** Run the cell below. Wait for some time. Then interrupt the kernel (from the menu above, choose Kernel --> Interrupt).

In [None]:
# Number of passes over the data
num_epochs = 100
# Open a TensorFlow session
with tf.Session() as sess:
    # Initialize the variables
    tf.global_variables_initializer().run()
    # For each epoch
    for epoch in range(num_epochs):
        n_iter = 0
        print('Epoch ' + str(epoch) + '...')
        # For each minibatch in the dataset
        for X, Y in get_minibatch(N, batch_size):
            # Feed the placeholder variables
            feed_dict = {X_minibatch:  X,
                         Y_minibatch:  Y}
            # Run a gradient descent step and evaluate the loss
            _, numpy_loss = sess.run([optimizer, loss], feed_dict=feed_dict)
            # Print progress to screen
            if(n_iter%100==0):
                print('   iter ' + str(n_iter) + ': loss=' + str(numpy_loss))
            n_iter += 1
    # After convergence, save the session to disk
    save_path = saver.save(sess, "trained_models/trained_model_hamlet_h2x256_seq100.ckpt")
    print('Model saved')

As you can see, the code above runs slowly on a laptop. It may take hours to converge (you may leave the computer running overnight). For that reason, **we will provide you with the files that are saved after convergence**.

## Text Generation

**1. Preparing the test sequence**

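Below, we use an example of a short sentence that we will feed to the fitted RNN. You may replace the text with anything you like; just keep in mind that you *must* use lower case characters, and only characters that appear in the vocabulary (any other character is missing from `dictionary_char2int`).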

In [31]:
# Initial string (you may replace with yours)
test_string = 'King. We doubt it nothing, heartily farewell'.lower()
# Number of new characters to generate
n_new_characters = 500
# Number of time steps of the initial string
time_test = len(test_string)
# Convert the characters to int
X_test = [dictionary_char2int[c] for c in list(test_string)]
# Reshape as a numpy matrix
X_test = np.reshape(X_test, (1, time_test)).astype(np.int32)

# Declare a placeholder for the test sequence
Xtest_placeholder = tf.placeholder(tf.int32, (1, time_test))
# Convert the placeholder variable to one-hot representation
Xtest_one_hot = tf.one_hot(Xtest_placeholder, vocab_size)
# Print its size as a sanity check
print(Xtest_one_hot.shape)  # 1 x len(test_string) x vocab_size

(1, 44, 40)
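
Since the conversion to integers looks up every character in `dictionary_char2int`, a character that never appears in the training text raises a `KeyError`. A small check like the one sketched below can catch this before building the graph; `find_unknown_chars` and `toy_vocab` are hypothetical names, and the toy vocabulary stands in for the keys of `dictionary_char2int`.

```python
def find_unknown_chars(text, known_chars):
    """Return the sorted characters of `text` that are not in `known_chars`."""
    return sorted(set(text) - set(known_chars))

# Toy vocabulary standing in for the keys of dictionary_char2int
toy_vocab = set("abcdefghijklmnopqrstuvwxyz .,")

ok_chars = find_unknown_chars('King. We doubt it'.lower(), toy_vocab)
bad_chars = find_unknown_chars('Hello, World #1'.lower(), toy_vocab)
print(ok_chars)   # empty list: every character is in the vocabulary
print(bad_chars)  # characters that would need to be removed or replaced
```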


**2. Running the RNN**

We will generate text as follows. We will run the RNN through the test sequence, `test_string`. We will pass the output through a softmax layer to obtain the probabilities of the next character. Then we will sample one character according to the softmax probabilities. We will use this new character as a new input for the RNN, generating another character, and so on. The pseudocode would look like this:

```
1. output, state = apply_rnn(test_string)
2. sampled_char ~ softmax(output * weights + intercept)
3. output, state = apply_rnn(sampled_char, initial_state=state)
4. Go to 2.
```

The following code does that.

**[Warning]** It may take 2-10 minutes to create the computational graph.

In [32]:
# Reuse the same RNN as above
tf.get_variable_scope().reuse_variables()
# 1. Apply the RNN to the test string
test_output, test_state = tf.nn.dynamic_rnn(cell, Xtest_one_hot, dtype=tf.float32)
test_output_last = tf.squeeze(test_output[:,time_test-1,:])
test_output_last = tf.expand_dims(test_output_last, 0)
# 2. Compute the softmax logits and sample a new character
test_logits = tf.matmul(test_output_last, weights_softmax) + intercept_softmax
sampled_char = tf.multinomial(test_logits, 1)
# Convert the sampled character to one-hot representation
sampled_value = tf.one_hot(sampled_char, vocab_size)
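# Collect the sampled characters, starting with the one sampled in step 2
char_list = [tf.squeeze(sampled_char)]
# For each remaining character to be sampled:
for t in range(n_new_characters - 1):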
    # 3. Run 1 step of the RNN
    test_output, test_state = tf.nn.dynamic_rnn(cell, sampled_value, initial_state=test_state, dtype=tf.float32)
    test_output_reshaped = tf.reshape(test_output, (1, hidden_units))
    # 4. Compute the logits and sample a new character
    test_logits = tf.matmul(test_output_reshaped, weights_softmax) + intercept_softmax
    sampled_char = tf.multinomial(tf.expand_dims(test_logits[0,:], 0), 1)
    # Convert the sampled character to one-hot representation
    sampled_value = tf.one_hot(sampled_char, vocab_size)
    # Append the sampled character to the list of characters
    char_list.append(tf.squeeze(sampled_char))
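
The sample-and-feed-back loop can be easier to follow without the graph-building details. The sketch below mirrors the four steps of the pseudocode in pure Python, with a hypothetical `toy_step` function standing in for one step of the fitted RNN followed by the softmax layer; nothing here uses the trained model.

```python
import math
import random

def sample_from_logits(logits, rng):
    """Sample an index with probability softmax(logits)."""
    m = max(logits)
    weights = [math.exp(z - m) for z in logits]  # unnormalized softmax
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def toy_step(char_index, state, vocab_size):
    """Hypothetical stand-in for one RNN step: returns (logits, new_state).
    It simply favours the character after the current one in the vocabulary."""
    logits = [0.0] * vocab_size
    logits[(char_index + 1) % vocab_size] = 3.0
    return logits, state + 1

rng = random.Random(0)
vocab = ['a', 'b', 'c', 'd']
seed_string = 'ab'

# 1. Run the model through the seed string, keeping the last output and state
state = 0
for c in seed_string:
    logits, state = toy_step(vocab.index(c), state, len(vocab))

# 2-4. Sample a character, feed it back as the next input, and repeat
generated = []
for t in range(10):
    sampled = sample_from_logits(logits, rng)
    generated.append(vocab[sampled])
    logits, state = toy_step(sampled, state, len(vocab))

print(seed_string + ''.join(generated))
```

The key point is that the state is carried over from one step to the next, so each new character is processed in a single step instead of re-running the model over the whole text.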

**3. Running a session**

Now, we create a TensorFlow session to recover `char_list`.

In [33]:
# In a TensorFlow session
with tf.Session() as sess:
    # Load the model
    saver.restore(sess, 'trained_models/trained_model_hamlet_h2x256_seq100.ckpt')
    # Evaluate char_list (the predicted list of characters)
    char_list_numpy = sess.run([char_list], feed_dict={Xtest_placeholder: X_test})
    
# Convert char_list_numpy to a list of characters
char_list_str = [dictionary_int2char[c] for c in char_list_numpy[0]]

INFO:tensorflow:Restoring parameters from ./trained_model_hamlet_h2x256_seq100.ckpt


**[Task]** Use the cell below to print the test string, followed by the list of predicted characters. You may use `join` to concatenate the characters in the list into a single string.

In [34]:
# Print the test string, followed by the list of predicted characters
print(test_string + ''.join(char_list_str))

king. we doubt it nothing, heartily farewell

exit voltemand and cornelius.

and now laertes, hamlous drease stand

   aarnllls, and he houle and for in,
the blayder sire dis porpelyus this wirlin,
the hands speake of this that you haue heard:
sweare by my sword

   gho. sweare

   ham. well said old mole, can'st worke i'th' ground so fast?
a worthy pioner, once more remoue good friends

   hor. oh day and night: but this is wondrous strange

   ham. and therefore as a stranger giue it welcome.
there are more things in heauen and earth, h


## Alice in Wonderland

**[Task]** Repeat the process above, but using "Alice in Wonderland" instead of "Hamlet". You will need to restart the Python kernel before that (menu Kernel --> Restart & Clear output). You will need to change:
- The name of the data file to be loaded.
- The test sequence.
- The name of the model to be restored (and maybe the name of the model to be saved, if you plan to leave it running overnight).

## Conclusion

As you can see, the generated text recovers some of the most common words in the text, and it looks like English. However, it is still far from perfect: it may produce words that are not proper English words, and it may also overfit to the data. Here are some suggestions that can help improve its quality:
- Use more data instead of the reduced text. We could go even further, and fit an RNN to all of Shakespeare's plays.
- Apply regularization techniques, such as dropout.
- Use a more complex neural network. We are using a GRU with 2 layers and 256 hidden units, but that may not be enough.
- Explore several initializations and/or values for the batch size and the step-size schedule, which may affect the solution that the optimizer converges to.