##### Copyright 2018 The TensorFlow Authors.

Licensed under the Apache License, Version 2.0 (the "License").



# Text generation using an RNN with eager execution


<table class="tfo-notebook-buttons" align="left">
<td>
<a target="_blank" href="https://www.tensorflow.org/tutorials/sequences/text_generation"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
</td><td>
<a target="_blank"  href="https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/sequences/text_generation.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>  
</td><td>
<a target="_blank"  href="https://github.com/tensorflow/docs/blob/master/site/en/tutorials/sequences/text_generation.ipynb"><img width=32px src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a></td></table>

This tutorial demonstrates how to generate text using a character-based RNN. We will work with a dataset of Shakespeare's writing borrowed from Andrej Karpathy's [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/). Given a sequence of characters from this data ("Shakespear"), we'll train a model to predict the next character in the sequence ("e"). We can generate longer sequences of text by calling the model repeatedly.

This tutorial includes complete, runnable code implemented using [tf.keras](https://www.tensorflow.org/programmers_guide/keras) and [eager execution](https://www.tensorflow.org/programmers_guide/eager). Below is sample output when the tutorial is run with the default settings:

```
QUEENE:
I had thought thou hadst a Roman; for the oracle,
Thus by All bids the man against the word,
Which are so weak of care, by old care done;
Your children were in your holy love,
And the precipitation through the bleeding throne.

BISHOP OF ELY:
Marry, and will, my lord, to weep in such a one were prettiest;
Yet now I was adopted heir
Of the world's lamentable day,
To watch the next way with his father with his face?

ESCALUS:
The cause why then we are all resolved more sons.

VOLUMNIA:
O, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, no, it is no sin it should be dead,
And love and pale as any will to that word.

QUEEN ELIZABETH:
But how long have I heard the soul for this world,
And show his hands of life be proved to stand.

PETRUCHIO:
I say he look'd on, if I must be content
To stay him from the fatal of our country's bliss.
His lordship pluck'd from this sentence then for prey,
And then let us twain, being the moon,
were she such a case as fills m
```

Is this output any good? Yes! Of course, while some of the sentences are grammatical, most do not make sense. Our model has not learned the meaning of words, but consider what it has learned:

* Our model is character-based (when we began training, it did not yet know how to spell an English word, or that words were even a unit of text!).

* The structure of the output resembles a play (blocks of text generally begin with a speaker name in all capital letters, as in the dataset).

* As we'll show below, our model is trained on small batches of text (100 characters each), and is still able to generate a longer sequence of text with coherent structure.

## Setup

### Import TensorFlow and other libraries

In [0]:
import tensorflow as tf
tf.enable_eager_execution()

import numpy as np
import os
import time

### Download the Shakespeare dataset

To run this code on your own data, you'll need to change only this line.

In [0]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

### Read the data

First, we'll take a look at the text.

In [0]:
text = open(path_to_file).read()
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))

In [0]:
# Take a look at the first 1000 characters in text
print(text[:1000])

In [0]:
# The unique characters in the file
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

## Process the text

### Vectorize the text

Before training, we need to map strings to a numerical representation. First, we'll create two lookup tables: one mapping characters to numbers, and another mapping numbers to characters.

In [0]:
# Creating a mapping from unique characters to indices
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

text_as_int = np.array([char2idx[c] for c in text])

Now we have an integer representation for each character. Notice that we mapped each character to an index from 0 to `len(vocab) - 1`.

In [0]:
for char in char2idx:
    print('{:6s} ---> {:4d}'.format(repr(char), char2idx[char]))

In [0]:
# Show how the first 13 characters from the text are mapped to integers
print ('{} ---- characters mapped to int ---- > {}'.format(text[:13], text_as_int[:13]))
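
As a quick sanity check of the two lookup tables, the sketch below rebuilds them for a short toy string and confirms that encoding followed by decoding returns the original text. The toy string and the `encode`/`decode` helper names are illustrative assumptions, not part of the tutorial code, and a plain list stands in for the NumPy array.

```python
# Toy text standing in for the Shakespeare corpus (illustrative only).
toy_text = "hello world"

# Same construction as above: sorted unique characters form the vocabulary.
toy_vocab = sorted(set(toy_text))
toy_char2idx = {u: i for i, u in enumerate(toy_vocab)}
toy_idx2char = list(toy_vocab)  # a plain list instead of np.array


def encode(s):
    """Map each character of s to its integer index."""
    return [toy_char2idx[c] for c in s]


def decode(ids):
    """Map integer indices back to a string."""
    return ''.join(toy_idx2char[i] for i in ids)


ids = encode(toy_text)
print(ids)
print(decode(ids))
```

Because the vocabulary is sorted, the mapping is deterministic: running the same code on the same text always yields the same indices.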

### The prediction task

Given a character, or a sequence of characters, what is the most probable next character? This is the actual task we'll train the model to perform. The input to the model will be a sequence of characters, and the output we will train the model to predict will be the following character at each time step.

In practice, since RNNs maintain an internal state that depends on the previously seen elements, the question becomes: given all the characters seen until this moment, what is the next character?

Let's prepare this data and see what it looks like.

### Create training examples and targets

We will now divide the text into training examples and targets. Each training example will contain **seq_length** characters from the text. The corresponding targets will contain the same length of text, except shifted one character to the right. For example, say seq_length was 4, and our text was "Hello". We would create one training example "Hell", and one target "ello".
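
The "Hello" example can be reproduced in a few lines. This is a plain-Python sketch of the same slicing that the following cells apply to tensors; the variable names are illustrative and chosen so they do not clash with the tutorial's own.

```python
example_text = "Hello"
example_seq_length = 4

# One chunk holds seq_length + 1 characters so that both the input
# and the target can be cut from it.
example_chunk = example_text[:example_seq_length + 1]

example_input = example_chunk[:-1]   # every character except the last
example_target = example_chunk[1:]   # every character except the first

print(example_input, example_target)

# At each position, the target is the character that follows the input.
for inp_char, tgt_char in zip(example_input, example_target):
    print('{!r} -> {!r}'.format(inp_char, tgt_char))
```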

First break the text into chunks of `seq_length+1`:

In [0]:
# The maximum length sentence we want for a single input in characters
seq_length = 100

# Create training examples / targets
chunks = tf.data.Dataset.from_tensor_slices(text_as_int).batch(seq_length+1, drop_remainder=True)

for item in chunks.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

Next, create the input and target texts from this chunk:

In [0]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = chunks.map(split_input_target)

Let's print the input and target sequences of the first example.

In [0]:
for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Each index of these vectors will be processed as one time step. For this input at time step 0 we receive the character mapped to the number 18 and try to predict the character mapped to the number 47. At time step 1 we'll do the same thing but considering the previous step along with the current character.

In [0]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2char[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2char[target_idx])))

### Creating batches and shuffling them using tf.data

We used [tf.data](https://www.tensorflow.org/guide/datasets) to chop the text into usable chunks. Before feeding this data into the model, we still need to shuffle it and pack it into batches.

In [0]:
# Batch size 
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences, 
# so it doesn't attempt to shuffle the entire sequence in memory. Instead, 
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

# Shuffle the (input, target) pairs created by split_input_target above
dataset = dataset.shuffle(BUFFER_SIZE)
dataset = dataset.batch(BATCH_SIZE, drop_remainder=True)

## The Model

### Implement the model

We'll use the tf.keras [Model Subclassing API](https://www.tensorflow.org/guide/keras) which gives us full flexibility to create the model and change it however we like. We use 3 layers to define our model. These are:

* An [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer: a trainable lookup table that will map the numbers of each character to a high dimensional vector with **embedding_dim** dimensions;
* A [GRU](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU) layer: a type of RNN (you can also use an LSTM layer here) with layer size = **units**;
* A [Dense](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense) layer with **vocab_size** cells.

In [0]:
class Model(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, units):
    super(Model, self).__init__()
    self.units = units

    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)

    if tf.test.is_gpu_available():
      self.gru = tf.keras.layers.CuDNNGRU(self.units, 
                                          return_sequences=True, 
                                          return_state=True, 
                                          recurrent_initializer='glorot_uniform')
    else:
      self.gru = tf.keras.layers.GRU(self.units, 
                                     return_sequences=True, 
                                     return_state=True, 
                                     recurrent_activation='sigmoid', 
                                     recurrent_initializer='glorot_uniform')

    self.fc = tf.keras.layers.Dense(vocab_size)
        
  def call(self, x, hidden):
    embedding = self.embedding(x)
    
    # output at every time step
    # output shape == (batch_size, seq_length, hidden_size) 
    # states variable to preserve the state of the model
    # states shape == (batch_size, hidden_size)
    output, states = self.gru(embedding, initial_state=hidden)
    
    # reshaping the output so that we can pass it to the Dense layer
    # after reshaping the shape is (batch_size * seq_length, hidden_size)
    output = tf.reshape(output, (-1, output.shape[2]))
    
    # The dense layer will output predictions for every time_steps(seq_length)
    # output shape after the dense layer == (seq_length * batch_size, vocab_size)
    prediction = self.fc(output)
    
    # states will be used to pass at every step to the model while training
    return prediction, states

### Instantiate the model, optimizer, and the loss function

In [0]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension 
embedding_dim = 256

# Number of RNN units
units = 1024

model = Model(vocab_size, embedding_dim, units)

We'll use the [Adam optimizer](https://www.tensorflow.org/api_docs/python/tf/train/AdamOptimizer) with default arguments and [softmax cross entropy](https://www.tensorflow.org/api_docs/python/tf/losses/sparse_softmax_cross_entropy) as the loss function. This loss function is applicable because we're training the model to predict the next character from a fixed, discrete set of characters, which makes the task similar to a classification problem.

In [0]:
# Using adam optimizer with default arguments
optimizer = tf.train.AdamOptimizer()

# Using sparse_softmax_cross_entropy so that we don't have to create one-hot vectors
def loss_function(real, preds):
    return tf.losses.sparse_softmax_cross_entropy(labels=real, logits=preds)
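
For intuition about what this loss measures, the sketch below computes the same quantity by hand in plain Python: the negative log of the softmax probability assigned to the correct class, averaged over examples. The function name and the numbers are made up for illustration, and a plain mean over examples is assumed as the reduction.

```python
import math


def sparse_softmax_cross_entropy(labels, logits):
    """Mean cross entropy between integer labels and rows of raw logits."""
    losses = []
    for label, row in zip(labels, logits):
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        # This equals -log(softmax(row)[label]).
        losses.append(log_sum_exp - row[label])
    return sum(losses) / len(losses)


# Two time steps over a three-character vocabulary (made-up numbers).
example_logits = [[2.0, 0.5, 0.1],
                  [0.0, 0.0, 0.0]]
example_labels = [0, 2]

print(sparse_softmax_cross_entropy(example_labels, example_logits))
```

The labels are plain integer indices, which is what "sparse" refers to: no one-hot vectors are needed. A row of equal logits gives a loss of `log(vocab_size)`, a useful baseline for an untrained model.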

### Checkpoints (Object-based saving)

We'll use [tf.train.Checkpoint](https://www.tensorflow.org/api_docs/python/tf/train/Checkpoint) to save the weights of the model every few epochs.

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt")
# Checkpoint instance
checkpoint = tf.train.Checkpoint(optimizer=optimizer, model=model)

### Train the model

Here we will use a custom training loop with the help of [GradientTape()](https://www.tensorflow.org/api_docs/python/tf/GradientTape). You can learn more about this approach by reading the eager execution [guide](https://www.tensorflow.org/guide/eager).

* First, we initialize the hidden state of the model. Calling `model.reset_states()` returns `None`, and the GRU layer treats a `None` initial state as zeros with shape == (batch_size, number of rnn units).

* Next, we iterate over the dataset (batch by batch) and calculate the **predictions and the hidden states** associated with that input.

* There are a lot of interesting things happening during training:
  * The model gets the hidden state (initialized with 0), let's call that **H0**, and the first batch of input, let's call that **I0**.
  * The model then returns the predictions **P1** and the new hidden state **H1**.
  * For the next batch of input, the model receives **I1** and **H1**.
  * The interesting thing here is that we pass **H1** to the model together with **I1**, which is how context is carried forward. The context learned from batch to batch is contained in the **hidden state**.
  * We continue doing this until the dataset is exhausted, and then we start a new epoch and repeat the whole process.

* After calculating the predictions, we calculate the **loss** using the loss function defined above. Then we calculate the gradients of the loss with respect to the model variables.

* Finally, we use the optimizer's `apply_gradients` function to update the model variables with those gradients.

Below is a diagram representing the process described above.

![](https://drive.google.com/uc?export=view&id=1omAlyVeIgP4hk91pBwZTAGxxZ8TuTBqT)
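
To see why passing the hidden state along preserves context, consider a toy recurrent cell with a single scalar state. This plain-Python sketch uses made-up weights and is far simpler than a GRU, but it shows that processing a sequence in two pieces while carrying the state gives the same outputs as processing it in one pass. The function name `run_rnn` and its parameters are hypothetical.

```python
import math


def run_rnn(inputs, h, w_x=0.5, w_h=0.8):
    """Apply h = tanh(w_x * x + w_h * h) over inputs; return outputs and final h."""
    outputs = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)
        outputs.append(h)
    return outputs, h


sequence = [1.0, -2.0, 0.5, 3.0, -1.0, 0.25]

# One pass over the whole sequence, starting from a zero state (H0).
full_outputs, _ = run_rnn(sequence, h=0.0)

# Two passes: the state returned for the first piece (H1) seeds the second.
first_outputs, h1 = run_rnn(sequence[:3], h=0.0)
second_outputs, _ = run_rnn(sequence[3:], h=h1)
print(full_outputs == first_outputs + second_outputs)

# Resetting the state between the pieces loses the earlier context.
reset_outputs, _ = run_rnn(sequence[3:], h=0.0)
print(reset_outputs == second_outputs)
```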

In [0]:
# Training step
EPOCHS = 30

for epoch in range(EPOCHS):
    start = time.time()
    
    # initializing the hidden state at the start of every epoch
    # initially hidden is None
    hidden = model.reset_states()
    
    for (batch, (inp, target)) in enumerate(dataset):
        with tf.GradientTape() as tape:
            # feeding the hidden state back into the model
            # This is the interesting step
            predictions, hidden = model(inp, hidden)
            # reshaping the target because that's how the
            # loss function expects it
            target = tf.reshape(target, (-1,))
            loss = loss_function(target, predictions)

        grads = tape.gradient(loss, model.variables)
        optimizer.apply_gradients(zip(grads, model.variables))

        if batch % 100 == 0:
            print ('Epoch {} Batch {} Loss {:.4f}'.format(epoch+1,
                                                          batch,
                                                          loss))
    # saving (checkpoint) the model every 5 epochs
    if (epoch + 1) % 5 == 0:
        checkpoint.save(file_prefix = checkpoint_prefix)

    print ('Epoch {} Loss {:.4f}'.format(epoch+1, loss))
    print ('Time taken for 1 epoch {} sec\n'.format(time.time() - start))

### Restore the latest checkpoint

In [0]:
# Restoring the latest checkpoint in checkpoint_dir
checkpoint.restore(tf.train.latest_checkpoint(checkpoint_dir))

### Generate text using our trained model

The code block below generates the text:

* We start by choosing a start string, initializing the hidden state and setting the number of characters we want to generate.

* We get predictions using the start string and the hidden state.

* Then we sample the index of the predicted character from a multinomial distribution over the model's output. **We use this predicted character as our next input to the model**.

* **The hidden state returned by the model is fed back into the model so that it now has more context rather than just one character.** After we predict the next character, the modified hidden state is again fed back into the model, so each prediction draws on the context accumulated from the previously predicted characters.


![](https://drive.google.com/uc?export=view&id=1MeU-bnClsvknBWTCCPgbrCyGRoyaZf9k)

If you look at the generated text, you'll see the model knows when to capitalize and make paragraphs, and it imitates a Shakespeare-like writing style.

In [0]:
# Evaluation step (generating text using the learned model)

# Number of characters to generate
num_generate = 1000

# You can change the start string to experiment
start_string = 'Q'

# Converting our start string to numbers (vectorizing) 
input_eval = [char2idx[s] for s in start_string]
input_eval = tf.expand_dims(input_eval, 0)

# Empty string to store our results
text_generated = ''

# Low temperatures result in more predictable text.
# Higher temperatures result in more surprising text.
# Experiment to find the best setting.
temperature = 1.0

# hidden state shape == (batch_size, number of rnn units); here batch size == 1
hidden = [tf.zeros((1, units))]
for i in range(num_generate):
    predictions, hidden = model(input_eval, hidden)

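    # using a multinomial distribution to sample the next character.
    # tf.multinomial expects unnormalized log-probabilities (logits), so the
    # scaled predictions are passed directly; [-1, 0] picks the sample for
    # the last time step, which also works for longer start strings
    predictions = predictions / temperature
    predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()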
    
    # We pass the predicted character as the next input to the model
    # along with the previous hidden state
    input_eval = tf.expand_dims([predicted_id], 0)
    
    text_generated += idx2char[predicted_id]

print (start_string + text_generated)
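
To see what the temperature setting does, the following plain-Python sketch applies it to a made-up set of three logits. The function names and numbers are illustrative assumptions; the tutorial itself does this step with TensorFlow ops.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample(logits, temperature=1.0, rng=random):
    """Draw one index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


example_logits = [2.0, 1.0, 0.1]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])

rng = random.Random(0)
print([sample(example_logits, 0.5, rng) for _ in range(10)])
```

Dividing by a temperature below 1 widens the gaps between logits, so the most likely character dominates. A temperature above 1 narrows the gaps and flattens the distribution toward uniform.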

## Next steps

As a next step, consider training this model on your own data.
* [Project Gutenberg](http://www.gutenberg.org/ebooks/100) contains the text of many books.
* Or, you might try training a model on the source code (HTML and JavaScript) of your favorite webpage.

You can also experiment with a different start character, or try adding another RNN layer to improve the model's accuracy, or adjusting the temperature parameter to generate more or less random predictions.