In [1]:
# https://keras.io/api/layers/core_layers/embedding/
# https://stats.stackexchange.com/questions/326065/cross-entropy-vs-sparse-cross-entropy-when-to-use-one-over-the-other

# Character-level Generation with Sequential Models

We have two kinds of text generation:


1.   Character-level
2.   Word-level

In this assignment we focus on the first one. Using a sequence of characters, we are going to train a model to predict the next character in the sequence. We will walk you through the process step by step.
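
To make the difference between the two kinds concrete, here is a minimal sketch in plain Python. The sample sentence is made up for illustration and is not part of the assignment data; the same string is split once into characters and once into words.

```python
# Hypothetical sample sentence, used only for illustration.
sample = "to be or not to be"

# Character-level: every character (including spaces) is a token.
char_tokens = list(sample)

# Word-level: split on whitespace, so every word is a token.
word_tokens = sample.split()

# Number of tokens and the distinct tokens for each scheme.
print(len(char_tokens), sorted(set(char_tokens)))
print(len(word_tokens), sorted(set(word_tokens)))
```

A character-level model works with a much smaller vocabulary, but it needs longer sequences to cover the same amount of text.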

First, you have to import some required packages by running the cell below.




In [2]:
#@title Import Required Packages

import tensorflow as tf
import numpy as np
import os
import time

Now it's time for our data. In this assignment we will be using the "Shakespeare" dataset.

In [3]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
# Read and decode
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


Let's take a look at the data (the first 250 characters).

In [4]:
print(text[:250])

First Citizen:
Before we proceed any further, hear me speak.

All:
Speak, speak.

First Citizen:
You are all resolved rather to die than to famish?

All:
Resolved. resolved.

First Citizen:
First, you know Caius Marcius is chief enemy to the people.



We now need to build up the vocabulary by finding the unique characters:

In [5]:
# vocab

In [6]:
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

65 unique characters


## Preprocessing the text

We now need to convert these strings to numerical representations so that our model can understand them. To do this you will need to use:

```
tf.keras.layers.StringLookup
```
and then pass it the vocabulary we created in the previous part.

However, the text first needs to be tokenized, i.e. split into individual characters.





In [7]:
sample_text = ['keep', 'store']

chars = tf.strings.unicode_split(sample_text, input_encoding='UTF-8')

char2id = tf.keras.layers.StringLookup(vocabulary=vocab)


ids = char2id(chars)

print(ids)

<tf.RaggedTensor [[50, 44, 44, 55], [58, 59, 54, 57, 44]]>


Since our goal is to generate text and not just understand text, we need to convert machine understandable numeric data to human-understandable text. Therefore, we will need a method to convert these ids back to string.

Use the same layer you used to convert characters into ids, but this time pass the option <font color='red'>invert=True</font>.

In [8]:
id2char = tf.keras.layers.StringLookup(vocabulary=vocab,  invert=True)

chars = id2char(ids)
print(chars)

# now we want to use a code to join chars into strings
# Hint: You should use a method from tf.strings called reduce_join
def id2text(ids):
    chars = id2char(ids)
    text = tf.strings.reduce_join(chars, axis=-1)
    return text

print(id2text(ids))

<tf.RaggedTensor [[b'k', b'e', b'e', b'p'], [b's', b't', b'o', b'r', b'e']]>
tf.Tensor([b'keep' b'store'], shape=(2,), dtype=string)
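
The two lookups can also be pictured as a pair of plain Python tables. The sketch below is not how `StringLookup` is implemented; it only mimics its default behaviour of reserving index 0 for unknown characters. The names `assumed_vocab`, `tokens`, `encode` and `decode` are hypothetical, and the 65-character vocabulary is rebuilt by hand here as an assumption rather than read from the file.

```python
import string

# Assumption: the 65-character vocabulary, rebuilt by hand in sorted order.
assumed_vocab = sorted(
    set("\n !$&',-.3:;?" + string.ascii_uppercase + string.ascii_lowercase)
)

# Index 0 is reserved for unknown characters, so real characters start at 1.
tokens = ["[UNK]"] + assumed_vocab
char_to_id = {ch: i for i, ch in enumerate(tokens)}


def encode(s):
    """Map each character to its id; unseen characters map to 0."""
    return [char_to_id.get(ch, 0) for ch in s]


def decode(ids):
    """Map ids back to characters and join them into one string."""
    return "".join(tokens[i] for i in ids)


print(len(tokens))              # 66: 65 characters plus [UNK]
print(encode("keep"))           # [50, 44, 44, 55]
print(decode(encode("store")))  # store
```

The extra `[UNK]` slot is why the lookup layer's vocabulary has one more entry than the list of unique characters.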


## Creating the dataset

In this section we divide the text into sequences. The length of each input sequence is set by <font color='red'>seq_length</font>, which we define below.

Each input sequence has a target sequence of the same length, shifted one character to the right. So you should break the text into chunks of <font color='red'>seq_length+1</font> characters. For instance, given a seq_length of 5 and "Python" as the text, the input sequence would be "Pytho" and the target would be "ython".

To do this you should use:
```
tf.data.Dataset.from_tensor_slices
```
to convert the text vector into a sequence of character indices.
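
Before doing this with `tf.data`, the chunk-and-shift idea can be checked on ordinary strings. This is a minimal sketch with hypothetical helper names (`chunk`, `shift`); it works on characters rather than ids and is only meant to illustrate the "Python" example.

```python
def chunk(text, seq_length):
    """Split text into chunks of seq_length + 1 characters, dropping any remainder."""
    size = seq_length + 1
    n_chunks = len(text) // size
    return [text[i * size:(i + 1) * size] for i in range(n_chunks)]


def shift(chunk_text):
    """Input is everything but the last character; target is everything but the first."""
    return chunk_text[:-1], chunk_text[1:]


pairs = [shift(c) for c in chunk("Python", seq_length=5)]
print(pairs)  # [('Pytho', 'ython')]
```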


In [9]:
all_ids = char2id(tf.strings.unicode_split(text, input_encoding='UTF-8'))
print(all_ids)

ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)


for ids in ids_dataset.take(10):
    print(id2char(ids).numpy().decode('utf-8'))

tf.Tensor([19 48 57 ... 46  9  1], shape=(1115394,), dtype=int64)
F
i
r
s
t
 
C
i
t
i


In [10]:
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

Now use the <font color='red'>batch</font> method to convert these characters to sequences with the desired length.

In [11]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)


for seq in sequences.take(1):
  print(id2char(seq))
  
# Changing tokens back to text
for seq in sequences.take(5):
  print(id2text(seq).numpy())

tf.Tensor(
[b'F' b'i' b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':'
 b'\n' b'B' b'e' b'f' b'o' b'r' b'e' b' ' b'w' b'e' b' ' b'p' b'r' b'o'
 b'c' b'e' b'e' b'd' b' ' b'a' b'n' b'y' b' ' b'f' b'u' b'r' b't' b'h'
 b'e' b'r' b',' b' ' b'h' b'e' b'a' b'r' b' ' b'm' b'e' b' ' b's' b'p'
 b'e' b'a' b'k' b'.' b'\n' b'\n' b'A' b'l' b'l' b':' b'\n' b'S' b'p' b'e'
 b'a' b'k' b',' b' ' b's' b'p' b'e' b'a' b'k' b'.' b'\n' b'\n' b'F' b'i'
 b'r' b's' b't' b' ' b'C' b'i' b't' b'i' b'z' b'e' b'n' b':' b'\n' b'Y'
 b'o' b'u' b' '], shape=(101,), dtype=string)
b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '
b'are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you k'
b"now Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us ki"
b"ll him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be

To train your model you need a dataset consisting of (input, label) pairs, where both input and label are sequences. At each time step the input is the current character and the label is the next character. Now write a function that takes a sequence as input, duplicates it, and shifts it to align the input and label for each time step:

In [12]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

Example:

In [13]:
print(split_input_target(list("Hello")))

# Should return:
''' ([H, e, l, l],
    [e, l, l, o]) ''';

(['H', 'e', 'l', 'l'], ['e', 'l', 'l', 'o'])


In [14]:
dataset_ = sequences.map(split_input_target)

In [15]:
for input_example, target_example in dataset_.take(1):
    print("Input :", id2text(input_example).numpy())
    print("Target:", id2text(target_example).numpy())

Input : b'First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou'
Target: b'irst Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou '


## Creating training batches

Now you should shuffle the data and pack it into batches.

In [16]:
BATCH_SIZE = 64

# Buffer size is used for shuffling the dataset
BUFFER_SIZE = 1000

# Shuffle and batch the data, then call .prefetch(tf.data.experimental.AUTOTUNE) at the end
dataset = dataset_.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True).prefetch(tf.data.experimental.AUTOTUNE)

print(dataset)

<PrefetchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
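
As a sanity check on the sizes involved, the number of sequences and of batches per epoch can be worked out by hand. The sketch below only redoes the integer arithmetic with the values printed earlier in this notebook; the variable names are illustrative.

```python
text_length = 1115394   # number of characters, from the shape of all_ids
seq_length = 100
batch_size = 64

# Each training example consumes seq_length + 1 characters.
num_sequences = text_length // (seq_length + 1)

# drop_remainder=True discards the final, incomplete batch.
steps_per_epoch = num_sequences // batch_size

print(num_sequences, steps_per_epoch)  # 11043 172
```

So each epoch should run for 172 steps, with the 35 leftover sequences dropped.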


In [17]:
# for input_example, target_example in dataset_.take(1):
#     print("Input :", id2text(input_example).numpy())
#     print("Target:", id2text(target_example).numpy())

## Building the model


Here you should build your model. Please use the following for your model:


*   An embedding layer
*   An RNN layer (LSTM or GRU)
*   Dense Layer



In [18]:
# Length of the vocabulary in chars
vocab_size = len(vocab)
vocab_size_ = len(char2id.get_vocabulary())

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [19]:
class GenModel(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, rnn_units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.rnn = tf.keras.layers.GRU(rnn_units,return_sequences=True,return_state=True)
        self.rnn2 = tf.keras.layers.GRU(rnn_units,return_sequences=True,return_state=True) 
        self.dense = tf.keras.layers.Dense(vocab_size)
        # self.dropout = tf.keras.layers.Dropout(0.1)  # I also tried an LSTM, but it performed worse than the GRU


    def call(self, inputs, states_1=None, states_2=None, return_state=False, training=False):
        x = inputs
        '''pass the inputs through the embedding layer, the RNN layer, and then the dense layer. You should also check for initial states '''

        x = self.embedding(x, training=training)
        
        if states_1 is None:
            states_1 = self.rnn.get_initial_state(x)

        if states_2 is None:
            states_2 = self.rnn2.get_initial_state(x)

        # x, memory_state, carry_state = self.rnn(x, initial_state=states, training=training)
        x, states_1 = self.rnn(x, initial_state=states_1, training=training)
        x, states_2 = self.rnn2(x, initial_state=states_2, training=training)

        # x, states = self.rnn2(x, initial_state=states, training=training)
        # x = self.dropout(x, training=training)
        # states = [memory_state, carry_state]
        x = self.dense(x, training=training)

        if return_state:
            return x,  states_1, states_2
        else:
            return x

In [20]:
model = GenModel(
    # Be sure the vocabulary size matches the `StringLookup` layers.
    vocab_size=vocab_size_,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

## Check the model

Now let's check our model to see if it behaves as expected.

In [21]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")
    

(64, 100, 66) # (batch_size, sequence_length, vocab_size)


In [22]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

print(sampled_indices)

[42 23 35 56 56 61 21 40 64 42 63 17 30 53 57 25 55 21 21  2 41 12 28  5
 48 53 35 53 14 10  1 37 47 57  0 53 34 37  4  1 46 13  5 19 33  8 27 57
 34 40 24 34 38 53 51  7 28 46 48  5 63 44 37 39 10 40 10 13 51 11 52 40
 27  3 30 23 45 58 36 33 23  1 41 29 22 25 43 34  4 57 13  6  7 61 54 64
  0 64 29 63]


In [23]:
print("Input:\n", id2text(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", id2text(sampled_indices).numpy())

Input:
 b'r them.\n\nSICINIUS:\nThis, as you say, suggested\nAt some time when his soaring insolence\nShall touch t'

Next Char Predictions:
 b"cJVqqvHaycxDQnrLpHH b;O&inVnA3\nXhr[UNK]nUX$\ng?&FT-NrUaKUYnl,Ogi&xeXZ3a3?l:maN!QJfsWTJ\nbPILdU$r?',voy[UNK]yPx"


## Train the model

We have now built our model. The only part left is to train and then test the model.

We need a loss function. Please choose the correct loss function from the following:


*   Categorical Cross Entropy
*   Sparse Categorical Cross Entropy
*   Binary Cross Entropy
*   MSE
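
When comparing the first two options, it helps to remember that the targets in this notebook are integer character ids, not one-hot vectors. The plain-Python sketch below uses hypothetical function names and made-up logits; it is not the Keras implementation, only an illustration that the two cross-entropies give the same value and differ in the label format they expect.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_cross_entropy(label_id, logits):
    """The label is a single integer class id."""
    return -math.log(softmax(logits)[label_id])


def categorical_cross_entropy(one_hot, logits):
    """The label is a one-hot vector with the same length as the logits."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs))


logits = [2.0, 0.5, -1.0, 0.1]   # made-up scores for a 4-character vocabulary
label_id = 1
one_hot = [0.0, 1.0, 0.0, 0.0]

print(sparse_cross_entropy(label_id, logits))
print(categorical_cross_entropy(one_hot, logits))
```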

In [24]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

In [25]:
example_batch_loss = loss(target_example_batch, example_batch_predictions)
mean_loss = example_batch_loss.numpy().mean()
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", mean_loss)

Prediction shape:  (64, 100, 66)  # (batch_size, sequence_length, vocab_size)
Mean loss:         4.1906624


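A newly initialized model should be roughly uniform over the vocabulary, so the exponential of the mean loss should be approximately equal to the vocabulary size: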

In [26]:
tf.exp(mean_loss).numpy()

66.066536
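
The value is close to 66 for a reason: a model that assigns equal probability to every token has a cross-entropy of exactly ln(66). A short check, assuming 66 output classes (65 characters plus `[UNK]`); `n_classes` is an illustrative name.

```python
import math

n_classes = 66  # 65 characters plus the [UNK] token

# With uniform predictions every class has probability 1 / n_classes.
uniform_loss = -math.log(1.0 / n_classes)

print(uniform_loss)            # about 4.19
print(math.exp(uniform_loss))  # about 66
```

An untrained loss far above this would suggest badly scaled initial logits.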

Compile the model:

In [27]:
model.compile(optimizer='adam', loss=loss)

In [28]:
model.summary()

Model: "gen_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  16896     
                                                                 
 gru (GRU)                   multiple                  3938304   
                                                                 
 gru_1 (GRU)                 multiple                  6297600   
                                                                 
 dense (Dense)               multiple                  67650     
                                                                 
Total params: 10,320,450
Trainable params: 10,320,450
Non-trainable params: 0
_________________________________________________________________
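
The parameter counts in the summary can be reproduced by hand. The sketch below uses hypothetical helper functions and assumes the Keras GRU default `reset_after=True`, which gives each of the three gates two bias vectors.

```python
def embedding_params(n_tokens, dim):
    """One dim-sized vector per token."""
    return n_tokens * dim


def gru_params(input_dim, units):
    """Three gates, each with an input kernel, a recurrent kernel and two biases.

    Assumption: Keras' default reset_after=True.
    """
    return 3 * (units * (input_dim + units) + 2 * units)


def dense_params(input_dim, n_out):
    """Weight matrix plus one bias per output."""
    return input_dim * n_out + n_out


n_tokens, dim, units = 66, 256, 1024
counts = [
    embedding_params(n_tokens, dim),  # 16896
    gru_params(dim, units),           # 3938304
    gru_params(units, units),         # 6297600
    dense_params(units, n_tokens),    # 67650
]
print(counts, sum(counts))            # total: 10320450
```

The second GRU is larger than the first because its input is the 1024-dimensional output of the first GRU rather than the 256-dimensional embedding.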


Configure Checkpoints:

In [29]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [30]:
EPOCHS = 30
#  Increased number of epochs to improve the performance of the model.

history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


## Generate Texts

Here we write a class to generate characters based on the model we trained.

In [31]:
class OneChar(tf.keras.Model):
    def __init__(self, model, id2char, char2id, temperature=1.0):
        super().__init__()
        self.temperature = temperature
        self.model = model
        self.chars_from_ids = id2char
        self.ids_from_chars = char2id

        # Create a mask to prevent "[UNK]" from being generated.
        skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
        sparse_mask = tf.SparseTensor(
            # Put a -inf at each bad index.
            values=[-float('inf')]*len(skip_ids),
            indices=skip_ids,
            # Match the shape to the vocabulary
            dense_shape= [vocab_size_]
        )
        self.prediction_mask = tf.sparse.to_dense(sparse_mask)

    @tf.function
    def generate_one_char(self, inputs, states_1=None, states_2=None):
        # Convert strings to token IDs.
        input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
        # Convert tokens to ids
        input_ids = self.ids_from_chars(input_chars).to_tensor()

        # Run the model and get the states and predicted logits
        # predicted_logits.shape is [batch, char, next_char_logits]
        # predicted_logits, states  = self.model(inputs=input_ids, states=states, return_state=True)
        predicted_logits, states_1, states_2  = self.model(inputs=input_ids, states_1=states_1, states_2=states_2, return_state=True)
        # predicted_logits, memory_state, carry_state  = self.model(inputs=input_ids, states=states, return_state=True)
        # states =  [memory_state, carry_state]  

        # Only use the last prediction.
        predicted_logits = predicted_logits[:, -1, :]
        predicted_logits = predicted_logits/self.temperature
        # Apply the prediction mask: prevent "[UNK]" from being generated.
        predicted_logits = predicted_logits + self.prediction_mask

        # Sample the output logits to generate token IDs (use random.categorical).
        predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
        predicted_ids = tf.squeeze(predicted_ids, axis=-1)

        # Convert from token ids to characters
        predicted_chars = self.chars_from_ids(predicted_ids)

        # Return the characters and model state.
        return predicted_chars, states_1, states_2

In [32]:
one_char_model = OneChar(model, id2char, char2id, temperature=0.8)
# Used lower temperature for less random generation
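
To see what the temperature and the `[UNK]` mask do to the sampling distribution, here is a minimal plain-Python sketch. The function name and the logits are made up for illustration; the class above does the same steps on tensors and then samples with `tf.random.categorical`.

```python
import math


def masked_temperature_probs(logits, temperature, masked_ids):
    """Divide logits by the temperature, mask some ids, and apply softmax."""
    scaled = [z / temperature for z in logits]
    for i in masked_ids:
        scaled[i] = -math.inf   # exp(-inf) is 0, so this id is never sampled
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [0.3, 2.0, 1.0, 0.5]   # made-up logits; id 0 plays the role of [UNK]

for t in (1.0, 0.8, 0.2):
    probs = masked_temperature_probs(logits, t, masked_ids=[0])
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate the probability on the highest logit, which makes the generated text more predictable; higher temperatures flatten the distribution.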

Run it in a loop to generate some text. Looking at the generated text, you'll see that the model knows when to capitalize and when to start a new paragraph, and that it imitates a Shakespeare-like vocabulary. With the small number of training epochs, it has not yet learned to form coherent sentences.

In [None]:
start = time.time()
states_1 = None
states_2 = None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(1000):
    # The model has two GRU layers, so both states are threaded through the loop.
    next_char, states_1, states_2 = one_char_model.generate_one_char(
        next_char, states_1=states_1, states_2=states_2)

    result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

ROMEO:
I prithee: my most gracious lord.

ABRAHAM:
I do beauty tenger one.

PAULINA:
Good morrow, gentle Isabel!
Seeet BerWaffully my uncle well undereadest,
To think it still from Frontence ta'en.

BAPTISTA:
Well, girl, sir! no.

MIRANDA:
O virtue gall men thunder
Than set how to bring these frown Katch'd and rivers
Or worthy babe:' the cause of all the facred of himself
and abroach 'gainst the heavens fond of the sea,
And let me go and women air.

BAPTISTA:
Now, bride-a! which I wore my poor night!

GrOMNO:
God save your part, you look you, sir, ha?

GONZALO:
I am a puppet of home; come, I beseech you,
Where be these confessor, unless ye need not
where never after how to close you taught to speak,
And now but sweeter than the sea for to sits of bawd.

GRUMIO:
Nay, tarry; I cannot think it would have married with me.

PETRUCHIO:
A juleare! I am no brother's unjust,
Gentlemen, for in such a gently rid
Who sinks nothing else would grown in a
mouth: my raven with his maids!
And like a wh

Now play with the model and hyperparameters (epochs, ...) and run this again to see if the results have improved.

In [33]:
start = time.time()
states_1=None
states_2=None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(1000):
  next_char, states_1, states_2 = one_char_model.generate_one_char(next_char, states_1=states_1, states_2=states_2)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

ROMEO:
Nay, not were worth as should have assured thee,
And chief your table strew me. Thus far off from angry now
Of the year of whom, they can make foul winds,
Breathe without the devil's dam: conceives her there;
Or bide thee, poor surfelon, to Belken his eyes
Will have my rapies bear me here.

DUKE VINCENTIO:
That shall not be much amiss: Yet, as 'twere a brother
As fresh and bear. O rack, for those all things should be
unlikedies in the sun,
And bloody dagger with him, he'll be my head.

TRANIO:
I love no chiders, sir. Bioneel, dispatch;
Why story peace is a good friar Bianca craves you! hear you?

DUKE VINCENTIO:
You have told me too malapett.

LUCIO:

ISABELLA:
Thou liest. Is't not Hortensio?

BIANCA:
Hear, take the urgener hall! Go, this is well,
And to my state grew so far that we would not do't,
Let me embrace thee in himself.

DUKE VINCENTIO:
Look where he comes.

MENENIUS:
Ay, if you call?

POMPEY:
Truly, sir, in my poor opinion, the mine own priest
And promise them such en