# Character-level Generation with Sequential Models

There are two kinds of text generation:

1.   Character-level
2.   Word-level

This assignment focuses on the first one. Given a sequence of characters, we will train a model to predict the next character in the sequence. We will walk you through the process step by step.
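As a rough illustration of the two granularities, the snippet below tokenizes a short phrase both ways in plain Python and pairs each character with the one that follows it, which is the prediction task used in this assignment. The sample phrase and the variable names are illustrative assumptions, not part of the assignment code.

```python
# Illustrative sample phrase (an assumption, not taken from the dataset).
sample = "To be, or not to be"

# Character-level: every character, including spaces and punctuation, is a token.
char_tokens = list(sample)

# Word-level: a naive whitespace split; punctuation stays attached to words.
word_tokens = sample.split()

# Next-character prediction pairs each character with the one that follows it.
next_char_pairs = list(zip(char_tokens, char_tokens[1:]))

print(len(char_tokens), "character tokens:", char_tokens[:5])
print(len(word_tokens), "word tokens:", word_tokens)
print("first pairs:", next_char_pairs[:3])
```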

First, you have to import some required packages by running the cell below.




In [None]:
#@title Import Required Packages

import tensorflow as tf
import numpy as np
import os
import time

Now it's time for our data. In this assignment we will be using the "Shakespeare" dataset.

In [None]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

# Read and decode
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


Let's take a look at the data (the first 250 characters).

In [None]:
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



We now need to build up the vocabulary by finding the unique characters:

In [None]:
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

65 unique characters


## Preprocessing the text

We now need to convert these strings to numerical representations so that our model can understand them. To do this you will need to use:

```
tf.keras.layers.StringLookup
```
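and pass it the vocabulary we created in the previous part. By default the layer reserves index 0 for unknown characters (`[UNK]`), so the ids of our 65 characters start at 1 and the layer's own vocabulary has 66 entries.

The text must be split into individual characters (tokenized) before it is passed to the layer.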





In [None]:
sample_text = ['keep', 'store']

chars = tf.strings.unicode_split(sample_text, input_encoding='UTF-8')

char2id = tf.keras.layers.StringLookup(vocabulary=vocab)

ids = char2id(chars)

print(ids)

<tf.RaggedTensor [[50, 44, 44, 55], [58, 59, 54, 57, 44]]>


Since our goal is to generate text and not just to understand it, we also need to convert the model's numeric output back into human-readable text. Therefore, we need a way to convert these ids back to strings.

Use the same layer you used for converting characters into ids, but this time with the option <font color='red'>invert=True</font>.

In [None]:
id2char = tf.keras.layers.StringLookup(vocabulary=vocab, invert=True)

chars = id2char(ids)
print("characters: ", chars)

# Now join the characters back into strings.
# Hint: use the `reduce_join` function from `tf.strings`.
def id2text(ids):
  # Reuse the inverse lookup layer defined above instead of rebuilding it on every call.
  chars = id2char(ids)
  return tf.strings.reduce_join(chars, axis=-1)

restored_text = id2text(ids)
print("text: ", restored_text)

characters:  <tf.RaggedTensor [[b'k', b'e', b'e', b'p'], [b's', b't', b'o', b'r', b'e']]>
text:  tf.Tensor([b'keep' b'store'], shape=(2,), dtype=string)
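To make the two lookup directions concrete, here is a minimal plain-Python sketch of the same idea. It is not how `StringLookup` is implemented; the helper names (`build_lookup`, `encode`, `decode`) and the toy vocabulary are assumptions made for illustration. The one behavior it copies is reserving index 0 for unknown characters.

```python
# Plain-Python stand-in for the two lookup layers; all names are illustrative.
UNK = "[UNK]"

def build_lookup(vocabulary):
    # Index 0 is reserved for unknown characters, so real ids start at 1.
    return {ch: i for i, ch in enumerate(vocabulary, start=1)}

def encode(word, lookup):
    # Characters missing from the vocabulary map to id 0.
    return [lookup.get(ch, 0) for ch in word]

def decode(ids, vocabulary):
    # Inverse mapping: position 0 holds the unknown token.
    table = [UNK] + list(vocabulary)
    return "".join(table[i] for i in ids)

toy_vocab = sorted(set("hello world"))
toy_lookup = build_lookup(toy_vocab)

hold_ids = encode("hold", toy_lookup)
hex_ids = encode("hex", toy_lookup)  # 'x' is not in the toy vocabulary

print(hold_ids, decode(hold_ids, toy_vocab))
print(hex_ids, decode(hex_ids, toy_vocab))
```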


## Creating the dataset

In this section we divide the text into sequences. The length of each sequence is set by <font color='red'>seq_length</font>, which we define.

The target sequence for each input sequence has the same length, but is shifted one character to the right. So you should break the text into chunks of <font color='red'>seq_length+1</font> characters. For instance, given a seq_length of 5 and "Python" as the text, the input sequence would be "Pytho" and the target would be "ython".

To do this you should use:
```
tf.data.Dataset.from_tensor_slices
```
to convert the vector of character ids into a stream of individual character indices.
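The chunking and shifting described above can be sketched in plain Python before moving to `tf.data`. This version works on raw characters rather than ids, and the helper names `make_chunks` and `to_input_target` are hypothetical, chosen only for this illustration.

```python
def make_chunks(text, seq_length):
    # Each chunk holds seq_length + 1 characters; a short final chunk is dropped.
    size = seq_length + 1
    return [text[i:i + size] for i in range(0, len(text) - size + 1, size)]

def to_input_target(chunk):
    # Input drops the last character, target drops the first one.
    return chunk[:-1], chunk[1:]

chunks = make_chunks("Python is fun", seq_length=5)
pairs = [to_input_target(c) for c in chunks]
print(pairs)
```

Both halves of each pair have exactly `seq_length` characters, and the leftover character at the end of the text is discarded.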


In [None]:
all_ids = char2id(tf.strings.unicode_split(text, input_encoding='UTF-8'))
print(all_ids)

ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)


for ids in ids_dataset.take(10):
    print(id2char(ids).numpy().decode('utf-8'))

tf.Tensor([19 48 57 ... 46  9  1], shape=(1115394,), dtype=int64)
F
i
r
s
t
 
C
i
t
i


In [None]:
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

Now use the <font color='red'>batch</font> method to convert these characters to sequences with the desired length.

In [None]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(1):
  print(id2char(seq))

print("----------------------------------------------------------------------------------------------------------------")
  
# Changing tokens back to text
for seq in sequences.take(5):
  print(id2text(seq).numpy())

tf.Tensor(
[b'F' b'i' b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':'
 b'\n' b'B' b'e' b'f' b'o' b'r' b'e' b' ' b'w' b'e' b' ' b'p' b'r' b'o'
 b'c' b'e' b'e' b'd' b' ' b'a' b'n' b'y' b' ' b'f' b'u' b'r' b't' b'h'
 b'e' b'r' b',' b' ' b'h' b'e' b'a' b'r' b' ' b'm' b'e' b' ' b's' b'p'
 b'e' b'a' b'k' b'.' b'\n' b'\n' b'A' b'l' b'l' b':' b'\n' b'S' b'p' b'e'
 b'a' b'k' b',' b' ' b's' b'p' b'e' b'a' b'k' b'.' b'\n' b'\n' b'F' b'i'
 b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':' b'\n' b'Y'
 b'o' b'u' b' '], shape=(101,), dtype=string)
----------------------------------------------------------------------------------------------------------------
b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
b'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
b"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet

To train your model you need a dataset consisting of (input, label) pairs, where input and label are both sequences. At each time step, the input is the current character and the label is the next character. Now write a function that takes a sequence, duplicates it, and shifts it to align the input and label for each time step:

In [None]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

For example:

In [None]:
print(split_input_target(list("Hello")))

# Should return:
# (['H', 'e', 'l', 'l'],
#  ['e', 'l', 'l', 'o'])

(['H', 'e', 'l', 'l'], ['e', 'l', 'l', 'o'])


In [None]:
dataset = sequences.map(split_input_target)

In [None]:
for input_example, target_example in dataset.take(1):
    print("Input :", id2text(input_example).numpy())
    print("Target:", id2text(target_example).numpy())

Input : b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target: b'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '


## Creating training batches

Now you should shuffle the data and pack it into batches.

In [None]:
BATCH_SIZE = 64

# Buffer size is used for shuffling the dataset
BUFFER_SIZE = 1000

dataset = (dataset
          .shuffle(BUFFER_SIZE)
          .batch(BATCH_SIZE, drop_remainder=True)
          .prefetch(tf.data.AUTOTUNE))

print(dataset)

<PrefetchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
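As a quick sanity check on the dataset size, the arithmetic below estimates how many examples and batches one epoch contains. It is a sketch that only assumes the text length printed earlier (1,115,394 characters) and the `seq_length` and `BATCH_SIZE` values used above.

```python
text_length = 1_115_394  # number of characters, from the shape printed earlier
seq_length = 100
batch_size = 64

# Each example consumes seq_length + 1 characters; the incomplete remainder is dropped.
examples_per_epoch = text_length // (seq_length + 1)

# Batching with drop_remainder=True also drops the incomplete final batch.
steps_per_epoch = examples_per_epoch // batch_size

print(examples_per_epoch, steps_per_epoch)
```

With these values, one epoch has 11,043 examples grouped into 172 batches.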


## Building the model


Now build your model. Use the following layers:


*   An embedding layer
*   An RNN layer (LSTM or GRU)
*   A dense layer



In [None]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [None]:
vocab_size

65

In [None]:
class GenModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

In [None]:
model = GenModel(
    # Be sure the vocabulary size matches the `StringLookup` layers.
    vocab_size=len(char2id.get_vocabulary()),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

## Check the model

Now let's check our model to see if it behaves as expected.

In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 66) # (batch_size, sequence_length, vocab_size)


In [None]:
model.summary()

Model: "gen_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  16896     
                                                                 
 gru (GRU)                   multiple                  3938304   
                                                                 
 dense (Dense)               multiple                  67650     
                                                                 
Total params: 4,022,850
Trainable params: 4,022,850
Non-trainable params: 0
_________________________________________________________________
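The parameter counts in the summary can be reproduced by hand. The sketch below assumes the Keras default GRU configuration (`reset_after=True`, which gives each of the three weight blocks two bias vectors); the helper function names are made up for this check.

```python
def embedding_params(vocab_size, embedding_dim):
    # One embedding vector per vocabulary entry.
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    # Three weight blocks (update gate, reset gate, candidate state), each with
    # an input kernel, a recurrent kernel and two bias vectors
    # (assumes the Keras default reset_after=True).
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, output_dim):
    # Weight matrix plus one bias per output unit.
    return input_dim * output_dim + output_dim

vocab_size, embedding_dim, rnn_units = 66, 256, 1024

counts = {
    "embedding": embedding_params(vocab_size, embedding_dim),
    "gru": gru_params(embedding_dim, rnn_units),
    "dense": dense_params(rnn_units, vocab_size),
}
total_params = sum(counts.values())

print(counts, total_params)
```

Note that `vocab_size` is 66 here, not 65, because the model is built from the lookup layer's vocabulary, which includes `[UNK]`.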


In [None]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

print(sampled_indices)

[17  4 17 39 55 44 28 11 31 55 10 50 42 47 12 17 44 15 13 12 37 52 62 14
 21 53 61 37 30 50 28 24 30  2 53 47 65 60 33 29 25 31 53 16 48 52 45 27
 15 45 57  5  4 39 33 48 30 16 57 32 54 53  1 56  2 41 44 13 47 20 22 10
 55 24 29 63  1 51 43 25 44 39 60 47 60 12 61 11 55 64 64 36 61 41 11 48
 59 22 55 59]


In [None]:
print("Input:\n", id2text(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", id2text(sampled_indices).numpy())

Input:
 b'l or lose his hire.\n\nVIRGILIA:\nHis bloody brow! O Jupiter, no blood!\n\nVOLUMNIA:\nAway, you fool! it m'

Next Char Predictions:
 b'D$DZpeO:Rp3kch;DeB?;XmwAHnvXQkOKQ nhzuTPLRnCimfNBfr&$ZTiQCrSon\nq be?hGI3pKPx\nldLeZuhu;v:pyyWvb:itIpt'


## Train the model

We have now built our model. The only part left is to train and then test the model.

We need a loss function. Choose the correct loss function from the following options:


*   Categorical Cross Entropy
*   Sparse Categorical Cross Entropy
*   Binary Cross Entropy
*   MSE

In [None]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

In [None]:
example_batch_loss = loss(target_example_batch, example_batch_predictions)
mean_loss = example_batch_loss.numpy().mean()
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", mean_loss)

Prediction shape:  (64, 100, 66)  # (batch_size, sequence_length, vocab_size)
Mean loss:         4.189194


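An untrained model should be close to guessing uniformly, so the exponential of its mean loss should be approximately equal to the vocabulary size (66 here, including `[UNK]`):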

In [None]:
tf.exp(mean_loss).numpy()

65.96961
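The connection between the initial loss and the vocabulary size can be checked with a small plain-Python sketch. It computes the cross-entropy from logits for a single integer target, which is the quantity a sparse categorical cross-entropy averages, and evaluates it for a perfectly uniform prediction over 66 classes. The function name and the target id are illustrative choices.

```python
import math

def sparse_categorical_crossentropy(logits, target_id):
    # Cross-entropy from raw logits for one integer target:
    # log(sum(exp(logits))) - logits[target_id], computed stably.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[target_id]

vocab_size = 66
uniform_logits = [0.0] * vocab_size  # every character equally likely

loss_value = sparse_categorical_crossentropy(uniform_logits, target_id=10)
print(loss_value, math.exp(loss_value))
```

A uniform guess gives a loss of ln(66), about 4.19, which is very close to the mean loss measured on the untrained model.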

Compile the model:

In [None]:
model.compile(optimizer='adam', loss=loss)

Configure Checkpoints:

In [None]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [None]:
EPOCHS = 20

history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/20
Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


## Generate Texts

Here we write a class to generate characters based on the model we trained.

In [None]:
class OneChar(tf.keras.Model):
  def __init__(self, model, id2char, char2id, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = id2char
    self.ids_from_chars = char2id

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape= [len(char2id.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_char(self, inputs, states=None):
    # Split the input strings into characters.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    # Convert characters to token ids.
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model and get the states and predicted logits
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states, return_state=True)

    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs (use random.categorical).
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states
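The effect of the temperature and of the `[UNK]` mask can be seen in isolation with a small plain-Python sketch. It is not the code used by the class above: the logits are made-up numbers, id 0 stands in for `[UNK]`, and `random.choices` stands in for `tf.random.categorical`.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax; exp(-inf) evaluates to 0.0.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_id_probs(logits, temperature=1.0, banned_ids=()):
    # Dividing by the temperature sharpens (T < 1) or flattens (T > 1) the distribution.
    scaled = [x / temperature for x in logits]
    # Setting a logit to -inf gives that id a probability of exactly zero.
    for i in banned_ids:
        scaled[i] = -math.inf
    return softmax(scaled)

def sample_id(probs, rng):
    # Draw one id in proportion to its probability.
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_logits = [2.0, 1.0, 0.5, 3.0]  # made-up values; id 0 plays the role of [UNK]
rng = random.Random(0)
for t in (0.5, 1.0, 2.0):
    probs = next_id_probs(toy_logits, temperature=t, banned_ids=(0,))
    print(t, [round(p, 3) for p in probs], sample_id(probs, rng))
```

Lower temperatures concentrate the probability on the most likely character, which tends to give more predictable text; higher temperatures give more varied but more error-prone text.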

In [None]:
one_char_model = OneChar(model, id2char, char2id)

Run it in a loop to generate some text. Looking at the generated text, you'll see that the model knows when to capitalize, makes paragraphs, and imitates a Shakespeare-like vocabulary. With the small number of training epochs, it has not yet learned to form coherent sentences.

In [None]:
start = time.time()
states = None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(1000):
  next_char, states = one_char_model.generate_one_char(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

ROMEO:
Who is't touchinous top? Allow it must needs?

Pedant:
She holds, it that there is a house! here, sir.

PETRUCHIO:
Marry, master, play, boy, good Kate, I am nor aimerce;
Therefore you scapp me in prison. Now, good sir,
You live, and that thou obstury'
Wift and a servant of chambet hours haves been
many a thorny with this friar by this?

ANTONIO:
No.

HORTENSIO:
Sir, your garters, call him for:
As Angelo for him; which is from the wings
And never than he goes and tears me!
Signior Baptista, of your course, I thank thee.

GRUMIO:
Who, if a sound of the nipel kind? If it,
and most delights, I dare not break that combany yields
To Bio over and lock-foot.

TRANIO:
I do: and welcome my brother and the frowers;
For he is chief thou shalt bow-like his religat?

PETRUCHIO:
Mast thou not, go: I say, this, blushes, mine,
With unchossess that hath brought my strengh
For her chaple. Thou kinsmen groans
As Pluboud us by wide commends about to speak.
Now well make it here is too shriur.
What '

Now play with the model and hyperparameters (epochs, ...) and run this again to see if the results have improved.

## Improve the model

To improve the model, I stack two GRU layers and increase the number of epochs.

In [None]:
class GenModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru1 = tf.keras.layers.GRU(rnn_units, return_sequences=True, return_state=True)
    self.gru2 = tf.keras.layers.GRU(rnn_units, return_sequences=True, return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states1=None, states2=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)

    if states1 is None:
      states1 = self.gru1.get_initial_state(x)
    x, states1 = self.gru1(x, initial_state=states1, training=training)


    if states2 is None:
      states2 = self.gru2.get_initial_state(x)
    x, states2 = self.gru2(x, initial_state=states2, training=training)

    x = self.dense(x, training=training)

    if return_state:
      return x, states1, states2
    else:
      return x

In [None]:
model = GenModel(
    # Be sure the vocabulary size matches the `StringLookup` layers.
    vocab_size=len(char2id.get_vocabulary()),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 66) # (batch_size, sequence_length, vocab_size)


In [None]:
model.summary()

Model: "gen_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  16896     
                                                                 
 gru (GRU)                   multiple                  3938304   
                                                                 
 gru_1 (GRU)                 multiple                  6297600   
                                                                 
 dense (Dense)               multiple                  67650     
                                                                 
Total params: 10,320,450
Trainable params: 10,320,450
Non-trainable params: 0
_________________________________________________________________


In [None]:
model.compile(optimizer='adam', loss=loss)

In [None]:
EPOCHS = 35

history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/35
Epoch 2/35
Epoch 3/35
Epoch 4/35
Epoch 5/35
Epoch 6/35
Epoch 7/35
Epoch 8/35
Epoch 9/35
Epoch 10/35
Epoch 11/35
Epoch 12/35
Epoch 13/35
Epoch 14/35
Epoch 15/35
Epoch 16/35
Epoch 17/35
Epoch 18/35
Epoch 19/35
Epoch 20/35
Epoch 21/35
Epoch 22/35
Epoch 23/35
Epoch 24/35
Epoch 25/35
Epoch 26/35
Epoch 27/35
Epoch 28/35
Epoch 29/35
Epoch 30/35
Epoch 31/35
Epoch 32/35
Epoch 33/35
Epoch 34/35
Epoch 35/35


In [None]:
class OneChar(tf.keras.Model):
  def __init__(self, model, id2char, char2id, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = id2char
    self.ids_from_chars = char2id

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape= [len(char2id.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_char(self, inputs, states1=None, states2=None):
    # Split the input strings into characters.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    # Convert characters to token ids.
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model and get the states and predicted logits
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states1, states2 = self.model(inputs=input_ids, states1=states1, states2=states2, return_state=True)

    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs (use random.categorical).
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)

    # Return the characters and model state.
    return predicted_chars, states1, states2

In [None]:
one_char_model = OneChar(model, id2char, char2id)

In [None]:
start = time.time()
states1 = None
states2 = None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(1000):
  next_char, states1, states2 = one_char_model.generate_one_char(next_char, states1=states1, states2=states2)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

ROMEO:
There is thy mother roar'd.

PROSPERO:
O, a cherubim
That which I should come to me than Tybalt's death,
There to the fruitor to me for her inclination
And bid good morrow; thy riches it at the heel,
And three renowned me and consent.

LUCENTIO:
I fly, Biondello: but they know me.

MENENIUS:
Do you hear, sir?

PETRUCHIO:
Very well; I see thee ill-spirit,
Stand and speak of some merry passion
And so of me, where we cracking the world?
This night's the time I will unto Venice,
To buy apparel 'gainst the Frederick was wrick'd upon
The duke is entering: then, masters, to whip her kind
Our trunken sleep; commanded likence's vagad,
To whit here an embraces have brought him.

HERMIONE:
No matter, gentlemen! ladies; but I know her with
Respected; every old friends,
More fiery lord will prove a second Griar.
I not been mean an argosy.

TRANIO:
If it be so, sir, to great her grave.

PETRUCHIO:
You lie, in fair of lustful applause and
Master's one and Saint George, I am a gentleman.

KATHA