<a href="https://colab.research.google.com/github/stefan-stein/NLP_with_TF_2.0/blob/master/NLPwith_TF_2_0_Text_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

These are my notes for the third part of the "Natural Language Processing with TensorFlow 2 - Beginner's course", which can be found [here](https://www.youtube.com/watch?v=B2q5cRJvqI8). The course consists of three parts, and the third one deals with text generation. It is based on [this](https://www.tensorflow.org/tutorials/text/text_generation) official TensorFlow tutorial. We will build an RNN that generates text one character at a time, using Shakespearean text as training data, which comes from one of Andrej Karpathy's [blog posts](http://karpathy.github.io/2015/05/21/rnn-effectiveness/).

# Text generation with RNNs

Make sure we are importing TensorFlow 2.x (the `%tensorflow_version` line is a Colab magic command).

In [1]:
import os
%tensorflow_version 2.x
import tensorflow as tf
import numpy as np

TensorFlow 2.x selected.


In [2]:
tf.__version__

'2.1.0'

In [0]:
# Download the data as txt file
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

In [4]:
# Load the data
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# length of text is the number of characters in it
print ('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters


In [5]:
# Check out what we just downloaded
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



We saw above that we have about 1.1 million characters. But how big is our alphabet actually, i.e. how many unique characters do we have?

In [6]:
vocab = sorted(set(text))
print ('{} unique characters'.format(len(vocab)))

65 unique characters


Our alphabet consists of 65 unique characters. Next, we create a numeric representation of our alphabet using two look-up tables. The first one, `char2idx`, is a dictionary in which each character in our alphabet is given a unique number. The second, `idx2char`, is a numpy array containing the characters of our alphabet, so that indexing it with a number returns the corresponding character.

In [0]:
char2idx = {u:i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)

Now we can create a numeric representation of the entire Shakespeare text by looking up the number associated with each character of `text` in `char2idx`.

In [8]:
text_as_int = np.array([char2idx[c] for c in text])
print(text_as_int[:250])

[18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 14 43 44 53 56 43  1 61 43
  1 54 56 53 41 43 43 42  1 39 52 63  1 44 59 56 58 46 43 56  6  1 46 43
 39 56  1 51 43  1 57 54 43 39 49  8  0  0 13 50 50 10  0 31 54 43 39 49
  6  1 57 54 43 39 49  8  0  0 18 47 56 57 58  1 15 47 58 47 64 43 52 10
  0 37 53 59  1 39 56 43  1 39 50 50  1 56 43 57 53 50 60 43 42  1 56 39
 58 46 43 56  1 58 53  1 42 47 43  1 58 46 39 52  1 58 53  1 44 39 51 47
 57 46 12  0  0 13 50 50 10  0 30 43 57 53 50 60 43 42  8  1 56 43 57 53
 50 60 43 42  8  0  0 18 47 56 57 58  1 15 47 58 47 64 43 52 10  0 18 47
 56 57 58  6  1 63 53 59  1 49 52 53 61  1 15 39 47 59 57  1 25 39 56 41
 47 59 57  1 47 57  1 41 46 47 43 44  1 43 52 43 51 63  1 58 53  1 58 46
 43  1 54 43 53 54 50 43  8  0]


Well, not quite as poetic as its character version, but easier to calculate with. Let's create a prettier representation of what is going on here.

In [9]:
print('{} ----> characters mapped to int ----> {}'.format(text[:13], text_as_int[:13]))

First Citizen ----> characters mapped to int ----> [18 47 56 57 58  1 15 47 58 47 64 43 52]
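
The two look-up tables are inverse to each other, so encoding followed by decoding should return the original text. Here is a minimal sketch of that round trip in plain Python. The toy string and the `sample_*` names are assumptions for illustration only, and the resulting indices differ from the Shakespeare ones because the alphabet is different.

```python
# Toy text standing in for the Shakespeare corpus (an assumption for illustration).
sample_text = "hello world"

# Same construction as in the notebook: the sorted unique characters form the alphabet.
sample_vocab = sorted(set(sample_text))
sample_char2idx = {ch: i for i, ch in enumerate(sample_vocab)}
sample_idx2char = list(sample_vocab)  # a plain list plays the role of the numpy array

# Encode: character -> integer id.
encoded = [sample_char2idx[ch] for ch in sample_text]

# Decode: integer id -> character, which should give back the original text.
decoded = "".join(sample_idx2char[i] for i in encoded)

print(sample_vocab)
print(encoded)
print(decoded)
```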


Our approach is as follows: Given an input sequence of characters, say 4 characters, we want to predict the next character. We then add this new predicted character to the sequence and drop the first character in the sequence to form our new input sequence. For example, if we input "Hell", a reasonable prediction for the next character would be "o". We add that character to the sequence and drop the "H", i.e. "ello" becomes our new input. We call "Hell" the _input_ and "ello" the _target_ for our network.

Therefore, when we create our training data, we first want to decide on a fixed sequence length (here: `seq_length = 100`) and then slice our data into sequences of `seq_length+1`. In the example above, we would want to create sequences of length 5, such that we can give the first four characters ("Hell") to our network and it can learn that the correct continuation of that sequence is "o".

To achieve this, we first turn the integer array into a dataset of single characters using the `from_tensor_slices()` function.

In [10]:
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in char_dataset.take(5):
  print(idx2char[i.numpy()])

F
i
r
s
t


From here, it is easy to group the characters into sequences of the desired length using `batch`. With `drop_remainder=True`, a final chunk shorter than `seq_length+1` is discarded.

In [11]:
sequences = char_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(''.join(idx2char[item.numpy()])))

'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be d"
'one: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citi'


We now create a function to split the data into input and target as described above.

In [12]:
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)

for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
  print ('Target data:', repr(''.join(idx2char[target_example.numpy()])))

Input data:  'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target data: 'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
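
To see the chunking and the input/target split on something small enough to read at a glance, here is a sketch in plain Python with a sequence length of 4, as in the "Hell"/"ello" example. The helper names `chunk_text` and `split_chunk` and the toy string are assumptions for illustration; they only mimic what `batch(..., drop_remainder=True)` and `split_input_target` do above.

```python
def chunk_text(text, seq_length):
    """Cut text into consecutive chunks of seq_length + 1 characters, dropping the remainder."""
    size = seq_length + 1
    n_chunks = len(text) // size
    return [text[i * size:(i + 1) * size] for i in range(n_chunks)]


def split_chunk(chunk):
    """Input is everything but the last character, target is everything but the first."""
    return chunk[:-1], chunk[1:]


toy_text = "Hello world!"
toy_chunks = chunk_text(toy_text, seq_length=4)
toy_pairs = [split_chunk(c) for c in toy_chunks]

for inp, tgt in toy_pairs:
    print(repr(inp), "->", repr(tgt))
```

Note that the target is the input shifted by one position, so at every time step the network is asked to predict the character that follows.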


Finally, we create training batches.

In [13]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>

## Build the model

We are now ready to build the model. We use a simple RNN with three layers: an embedding layer with embedding dimension 256 that maps each character in our alphabet to the embedding space, followed by a recurrent (GRU) layer and a dense output layer.

In [0]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [0]:
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              batch_input_shape=[batch_size, None]),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [16]:
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (64, None, 256)           16640     
_________________________________________________________________
gru (GRU)                    (64, None, 1024)          3938304   
_________________________________________________________________
dense (Dense)                (64, None, 65)            66625     
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________
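
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses two bias vectors per weight block (the `reset_after=True` variant, which I believe is the TF 2 default); under that assumption the numbers match the summary above.

```python
vocab_size, embedding_dim, rnn_units = 65, 256, 1024

# Embedding: one vector of length embedding_dim per character.
embedding_params = vocab_size * embedding_dim

# GRU: three weight blocks (update gate, reset gate, candidate state), each with
# an input kernel, a recurrent kernel and, assuming reset_after=True, two biases.
gru_params = 3 * (embedding_dim * rnn_units + rnn_units * rnn_units + 2 * rnn_units)

# Dense: one weight per (unit, character) pair plus one bias per character.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
```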


What does this model architecture actually mean? For each character, the model first looks up the character's embedding in the embedding layer. Then it runs the GRU for one time step with this embedding as input. Finally, the dense layer outputs one logit per character in our alphabet, from which the probability of each character being the next one in the sequence can be derived.

In [17]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 65) # (batch_size, sequence_length, vocab_size)


In [18]:
example_batch_predictions[0]

<tf.Tensor: shape=(100, 65), dtype=float32, numpy=
array([[-2.09618593e-04,  7.35762995e-03,  7.43970321e-03, ...,
        -5.30001894e-03, -1.23539604e-02, -1.27230510e-02],
       [ 1.23355035e-02, -8.58413428e-03,  1.74405088e-03, ...,
        -9.06605041e-04, -1.25144813e-02, -1.74659267e-02],
       [ 9.25910287e-03, -8.21050629e-03,  8.07187054e-03, ...,
         4.74830903e-03, -1.17486864e-02, -2.36782935e-02],
       ...,
       [ 1.08108269e-02,  7.79656693e-06, -1.18108317e-02, ...,
         2.21678731e-03,  1.64780219e-03, -4.15975507e-03],
       [ 8.11548345e-03,  1.33736013e-03,  1.77029450e-03, ...,
         6.75642211e-03, -2.33738450e-03, -1.56774353e-02],
       [ 1.45033021e-02, -1.72155583e-03,  1.08980108e-04, ...,
         4.89278510e-03, -4.40774299e-03,  1.22230286e-02]], dtype=float32)>

Above we passed the first input batch to our model (`input_example_batch`). It is of dimension (64, 100), i.e. it consists of 64 sequences of 100 characters each. When fed into the model, the model returns a tensor of dimension (64, 100, 65). That is, we again have 64 sequences of 100 positions each, but rather than having a fixed character in each position of each sequence, we have a _probability distribution_ over our entire alphabet. Note that the actual numbers in the array are the logits for each character, not the probabilities themselves, hence negative values are allowed. Each distribution can be represented as a vector of length 65, because our alphabet has 65 unique characters. To actually generate text from the model, we need to sample from this distribution.
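
To make the relation between logits and probabilities concrete, here is a small softmax sketch in plain Python. The logit values are made up for illustration; the point is only that arbitrary real numbers, including negative ones, turn into positive numbers that sum to one.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities that sum to one."""
    peak = max(logits)                       # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [-0.5, 0.0, 1.5, -2.0]          # hypothetical logits for a 4-character alphabet
toy_probs = softmax(toy_logits)

print(toy_probs)
print(sum(toy_probs))
```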

Let's try this out. Obviously we are only going to get gibberish for now, since we haven't trained our model properly yet.

In [19]:
# example_batch_predictions[0] holds the logits of 100 probability distributions over our
# discrete alphabet of size 65. This line draws one sample from each of these distributions.
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
# The above returned a tensor of shape (100,1). We have to get rid of the (_,1) dimension
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()
sampled_indices

array([56, 52, 18, 63, 40,  4, 51, 53, 34, 63, 62, 42, 25, 56, 29, 37, 42,
       56, 59, 24, 14, 62, 21, 20, 37, 56,  4,  4, 42, 64, 39, 32, 11, 33,
        7, 33, 26, 25,  6, 44, 50, 34, 54, 16, 10, 33, 15, 19, 35,  3, 26,
        2, 50,  8, 30, 23, 34, 56, 41, 54,  0, 50, 55, 11, 27, 46,  2, 64,
       38, 55, 44, 20, 20, 30, 43, 28, 24, 21, 34, 25, 37, 44, 33, 40, 63,
       52,  2, 36, 12,  1,  7, 28, 33, 40, 42, 39, 51, 45, 24, 35])

`sampled_indices` is _one realization_ of the probability distribution returned by our model for the first input sequence. Let's transform it to characters and see what we have.

In [20]:
print("Input: \n", repr("".join(idx2char[input_example_batch[0]])))
print()
print("Next Char Predictions: \n", repr("".join(idx2char[sampled_indices ])))

Input: 
 "our rude brawls doth lie a-bleeding;\nBut I'll amerce you with so strong a fine\nThat you shall all re"

Next Char Predictions: 
 'rnFyb&moVyxdMrQYdruLBxIHYr&&dzaT;U-UNM,flVpD:UCGW$N!l.RKVrcp\nlq;Oh!zZqfHHRePLIVMYfUbyn!X? -PUbdamgLW'
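
Sampling is different from always taking the most likely character. The following plain-Python sketch mimics what drawing from a categorical distribution given by logits does, using `random.choices` with softmax weights. The function name, the logits and the seed are assumptions for illustration.

```python
import math
import random


def sample_from_logits(logits, rng):
    """Draw one index with probability proportional to exp(logit)."""
    peak = max(logits)
    weights = [math.exp(x - peak) for x in logits]   # unnormalised softmax weights
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


rng = random.Random(0)                               # fixed seed for repeatability
toy_logits = [0.0, 0.0, 3.0]                         # index 2 is by far the most likely

draws = [sample_from_logits(toy_logits, rng) for _ in range(1000)]
counts = [draws.count(i) for i in range(len(toy_logits))]
print(counts)
```

Index 2 should dominate the counts, but the other indices still appear from time to time, whereas taking the argmax would return index 2 on every draw.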


This makes no sense at all. Good! This is what we would expect for an untrained network. So let's go ahead and train it.

We define our loss and set the `from_logits` flag to `True`, due to the form of the outputs of our network. We try it out using the first target batch (`target_example_batch`) and the batch predictions from above (`example_batch_predictions`).

In [21]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 100, 65)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       4.1745257
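
The value of the initial loss is not arbitrary. The logits of the untrained model are all close to zero, so the predicted distribution is close to uniform over the 65 characters, and the cross-entropy of a uniform prediction is the logarithm of the alphabet size. A quick check, with the observed loss copied from the output above:

```python
import math

vocab_size = 65

# Cross-entropy when every character is assigned probability 1 / vocab_size.
uniform_loss = -math.log(1.0 / vocab_size)

observed_loss = 4.1745257   # scalar_loss printed by the notebook
print(uniform_loss, observed_loss)
```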


We compile the model and set it up to use the `'adam'` optimizer for training.

In [0]:
model.compile(optimizer='adam', loss=loss)

Next, we configure checkpoints to save training progress after each epoch.

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
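
The `{epoch}` placeholder in the file name is filled in with the epoch number when a checkpoint is written. As a rough sketch of the resulting name, using ordinary string formatting rather than the callback itself:

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Plain str.format stands in here for what the callback does with the epoch number.
last_checkpoint_name = checkpoint_prefix.format(epoch=30)
print(last_checkpoint_name)
```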

And we're finally good to go!

In [24]:
EPOCHS=30
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Train for 172 steps
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
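
The 172 steps per epoch reported at the start of training follow directly from the sizes chosen earlier, since both the sequence chunking and the batching drop incomplete remainders. A short check in plain Python:

```python
text_length = 1115394   # characters, as printed when the file was loaded
seq_length = 100
batch_size = 64

# Each training example consumes seq_length + 1 characters; the remainder is dropped.
num_sequences = text_length // (seq_length + 1)

# Batches are formed with drop_remainder=True, so an incomplete last batch is discarded.
steps_per_epoch = num_sequences // batch_size

print(num_sequences, steps_per_epoch)
```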


Currently this model only accepts batch sizes of 64, because the batch size was fixed when we built the model. To get a prediction for a batch size of one, i.e. for a single sequence, we need to rebuild the model with `batch_size=1` and load the weights from the last checkpoint.

In [25]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_30'

In [0]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [27]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (1, None, 256)            16640     
_________________________________________________________________
gru_1 (GRU)                  (1, None, 1024)           3938304   
_________________________________________________________________
dense_1 (Dense)              (1, None, 65)             66625     
Total params: 4,021,569
Trainable params: 4,021,569
Non-trainable params: 0
_________________________________________________________________


See? Batch sizes are down to 1 now.

So let's generate some text. The following function takes a model and a string to start with. We first convert the `start_string` to its numeric representation. In the `for`-loop below we pass the encoded start string (`input_eval`) into the model. Recall that this returns a collection of probability distributions (as logits) from which we first need to sample to generate actual text. We do this by calling `tf.random.categorical` and store the sample for the last time step in `predicted_id`. This newly predicted character then becomes the next `input_eval`; since the GRU is stateful, its hidden state carries the context of everything the model has seen so far.

In [0]:
def generate_text(model, start_string):
  # Evaluation step (generating text using the learned model)

  # Number of characters to generate
  num_generate = 1000

  # Converting our start string to numbers (vectorizing)
  input_eval = [char2idx[s] for s in start_string]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store the generated characters
  text_generated = []

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the character returned by the model
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted character as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2char[predicted_id])

  return (start_string + ''.join(text_generated))
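
The loop structure (sample a character, feed it back in, repeat) does not depend on the model being an RNN. The sketch below runs the same loop with a much simpler stand-in model, a table of character bigram counts, so that it works without TensorFlow. The bigram model, the toy corpus and all names are assumptions for illustration; unlike the GRU, this stand-in only remembers the single previous character.

```python
import random


def build_bigram_counts(text):
    """Count how often each character follows each other character."""
    counts = {}
    for prev, nxt in zip(text, text[1:]):
        followers = counts.setdefault(prev, {})
        followers[nxt] = followers.get(nxt, 0) + 1
    return counts


def generate(counts, start_string, num_generate, rng):
    generated = []
    current = start_string[-1]            # a bigram model only needs the last character
    for _ in range(num_generate):
        followers = counts.get(current)
        if not followers:                 # character never seen with a successor
            break
        chars = list(followers)
        weights = [followers[c] for c in chars]
        # Sample the next character, then feed it back in as the next input.
        current = rng.choices(chars, weights=weights, k=1)[0]
        generated.append(current)
    return start_string + "".join(generated)


toy_corpus = "to be or not to be that is the question "
toy_counts = build_bigram_counts(toy_corpus)
toy_result = generate(toy_counts, "to", num_generate=40, rng=random.Random(0))
print(toy_result)
```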

In [29]:
print(generate_text(model, start_string=u"ROMEO: "))

ROMEO: command men laz.

DUKE VINCENTIO:
I know this change, since you can see
Be seen it nature to defore this a good nurse. Away with her! Pow Margaret?

Pedant:
Stir; and, with thy richer-in-law like a drunken sights of stone?

WARWICK:
Ay, where's my master where no longer speaks,
If thou mayst Pilate grapesom of mind.
Did I let love I might.

ESCALUS:
Why, so hath won.

LUCIO:
Gentle CLorcess
Of dambrike the present dreadful steel.
That Angelo's as I spice, and thus must I rush'd a word;
But, sirrah, or I'll be our kindred's land!
Or whether it be, that he is a wife's hate;
For in his net nead. Good night.
Court, Somerse, when my root with such procect talk'st, revenge furnished: if peachs I till now
Can she be full of win:
And yet I'll pray thee good, Warwick man or maral world.

Third Gentleman:
Well, lords, take up, your children yet unboank,
And other mother come in's time we see it.

GLOUCESTER:
He hath a pass in a three-enemies to be straint our hearts,
Or an assworn meast i