# Character-level Generation with Sequential Models

There are two common kinds of text generation:

1.   Character-level
2.   Word-level

In this assignment we focus on the first one. Using a sequence of characters, we will train a model to predict the next character in the sequence. We will walk you through the process step by step.

First, you have to import some required packages by running the cell below.




In [None]:
#@title Import Required Packages

import tensorflow as tf
import numpy as np
import os
import time

Now it's time for our data. In this assignment we will be using the "Shakespeare" dataset.

In [None]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

# Read and decode
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')

Let's take a look at the data (the first 250 characters).

In [None]:
print(text[:250])

We now need to build up the vocabulary by finding the unique characters:

In [None]:
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

## Preprocessing the text

We now need to convert these strings to numerical representations so that our model can understand them. To do this you will need to use:

```
tf.keras.layers.StringLookup
```
and then pass it the vocabulary we created in the previous part.

However, the text first needs to be tokenized, that is, split into individual characters.
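
To make the idea of a lookup concrete, here is a minimal plain-Python sketch of a character-to-id table. It is not the Keras layer itself: `sample_vocab`, `char_to_index`, and `encode` are hypothetical names, and reserving index 0 for unknown characters is an assumption meant to mirror how `StringLookup` is understood to behave when no mask token is set.

```python
# Plain-Python stand-in for a character lookup table (not the Keras layer).
sample_vocab = sorted(set("keep store"))  # hypothetical mini-vocabulary

# Assumption: index 0 is reserved for unknown characters, known ones start at 1.
char_to_index = {ch: i + 1 for i, ch in enumerate(sample_vocab)}

def encode(word):
    """Map each character to its integer id, using 0 for unknown characters."""
    return [char_to_index.get(ch, 0) for ch in word]

print(encode("keep"))  # every character is in the vocabulary
print(encode("zoo"))   # 'z' is not in the vocabulary, so it maps to 0
```

The layer you build in the next cell plays the same role, but operates on tensors of characters rather than Python strings.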





In [None]:
sample_text = ['keep', 'store']

chars = tf.strings.unicode_split(sample_text, input_encoding='UTF-8')

char2id = ''' Now use "tf.keras.layers.StringLookup" to convert text to ids '''

ids = char2id(chars)

print(ids)

Since our goal is to generate text and not just understand it, we need to convert the machine-readable numeric data back to human-readable text. Therefore, we need a way to convert these ids back to strings.

Use the same layer you used for converting characters into ids, but this time set the option <font color='red'>invert=True</font>.
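
As a rough illustration of the inverse direction, the sketch below maps ids back to characters and joins them into a string using plain Python. The table `index_to_char` and the function `decode` are hypothetical, and the `[UNK]` entry at index 0 is an assumption about how unknown characters are represented.

```python
# Hypothetical inverse table: position in the list is the id.
index_to_char = ["[UNK]", " ", "e", "k", "o", "p", "r", "s", "t"]

def decode(ids):
    """Look up each id and join the characters into one string."""
    return "".join(index_to_char[i] for i in ids)

print(decode([3, 2, 2, 5]))
print(decode([0, 4, 4]))
```

In the notebook, the lookup step corresponds to `id2char` and the joining step to `id2text`.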

In [None]:
id2char = ''' Your code '''

chars = id2char(ids)
print(chars)

# now we want to use a code to join chars into strings
# Hint: You should use a method from tf.strings called reduce_join
def id2text(ids):
  ''' Your Code '''
  pass

## Creating the dataset

In this section we divide our text into sequences. Each sequence has a fixed length, <font color='red'>seq_length</font>, which we define below.

The target sequence for each input sequence has the same length, but is shifted one character to the right. So you should break the text into chunks of <font color='red'>seq_length+1</font>. For instance, given a seq_length of 5 and "Python" as the text, the input sequence would be "Pytho" and the target would be "ython".
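
The shift can be sketched on an ordinary Python string. This is only an illustration of the idea; `shift_pair` is a hypothetical helper, and in the notebook the same operation is applied to tensors of ids.

```python
def shift_pair(chunk):
    """Split a chunk of seq_length + 1 items into (input, target)."""
    # The input drops the last item; the target drops the first one.
    return chunk[:-1], chunk[1:]

seq_length = 5
chunk = "Python"[:seq_length + 1]  # a chunk of seq_length + 1 characters

input_text, target_text = shift_pair(chunk)
print(input_text, target_text)
```

At every position, the target character is the one that follows the input character in the original text.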

To do this you should use:
```
tf.data.Dataset.from_tensor_slices
```
to convert the text vector into a sequence of character indices.


In [None]:
all_ids = char2id(tf.strings.unicode_split(text, input_encoding='UTF-8'))
print(all_ids)

ids_dataset = ''' Your Code (use from_tensor_slices) '''

for ids in ids_dataset.take(10):
    print(id2char(ids).numpy().decode('utf-8'))

In [None]:
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

Now use the <font color='red'>batch</font> method to convert these characters to sequences with the desired length.
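
A minimal sketch of what grouping a flat stream into fixed-length chunks means, using a plain list. `chunk_ids` and `toy_ids` are hypothetical, and dropping the short tail is a choice made here (in `tf.data` it would correspond to `drop_remainder=True`), which also matches the integer division used for `examples_per_epoch`.

```python
def chunk_ids(ids, chunk_size):
    """Group a flat list into consecutive chunks, dropping any short tail."""
    n_full = len(ids) // chunk_size
    return [ids[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]

toy_ids = list(range(14))  # stand-in for the stream of character ids
seq_length = 4

chunks = chunk_ids(toy_ids, seq_length + 1)
print(chunks)  # two chunks of 5 ids; the 4 leftover ids are dropped
```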

In [None]:
sequences = ''' Your Code '''

for seq in sequences.take(1):
  print(id2char(seq))
  
# Changing tokens back to text
for seq in sequences.take(5):
  print(id2text(seq).numpy())

To train your model you need a dataset consisting of (input, label) pairs, where input and label are both sequences. At each time step, the input is the current character and the label is the next character. Now write a function that takes a sequence as input, duplicates it, and shifts it to align the input and label for each time step:

In [None]:
def split_input_target(sequence):
    input_text = '''Your Code'''
    target_text = '''Your Code'''
    return input_text, target_text

Example:

In [None]:
print(split_input_target(list("Hello")))

# Should return:
''' ([H, e, l, l],
    [e, l, l, o]) '''

In [None]:
dataset = sequences.map(split_input_target)

In [None]:
for input_example, target_example in dataset.take(1):
    print("Input :", id2text(input_example).numpy())
    print("Target:", id2text(target_example).numpy())

## Creating training batches

Now you should shuffle the data and pack it into batches.
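
The role of the buffer size can be illustrated with a rough plain-Python model of a shuffle buffer followed by batching. This is a sketch of the general idea, not `tf.data`'s implementation; `buffered_shuffle` and `batched` are hypothetical helpers, and dropping the incomplete final batch is an assumption.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in a locally shuffled order using a fixed-size buffer."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill the buffer first
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # and replace it with the new one
    rng.shuffle(buffer)          # flush what is left at the end
    yield from buffer

def batched(items, batch_size):
    """Group items into full batches; an incomplete last batch is dropped."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []

shuffled = list(buffered_shuffle(range(20), buffer_size=5))
batches = list(batched(shuffled, batch_size=8))
print(batches)
```

A small buffer only mixes nearby elements, which is why the buffer size trades memory for shuffling quality.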

In [None]:
BATCH_SIZE = 64

# Buffer size is used for shuffling the dataset
BUFFER_SIZE = 1000

dataset = ''' Your code: shuffle and batch the data, then call .prefetch(tf.data.experimental.AUTOTUNE) at the end '''

print(dataset)

## Building the model


Here you should build your model. Please use the following for your model:


*   An embedding layer
*   An RNN layer (LSTM or GRU)
*   A dense layer



In [None]:
# Length of the vocabulary in chars
vocab_size = len(vocab)

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

In [None]:
class GenModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = '''embedding layer'''
    self.gru = '''GRU or LSTM layer'''
    self.dense = '''dense'''

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    '''
    pass the inputs through the embedding layer, the RNN layer, and then the dense
    layer. You should also check for initial states 
    
    '''

    if return_state:
      return x, states
    else:
      return x

In [None]:
model = GenModel(
    # Be sure the vocabulary size matches the `StringLookup` layers.
    vocab_size=len(char2id.get_vocabulary()),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

## Check the model

Now let's check our model to see if it behaves as expected.

In [None]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

In [None]:
model.summary()
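
One way to sanity-check the summary is to count the parameters by hand. The sketch below does this under stated assumptions: a vocabulary size of 66 (hypothetical, standing for 65 characters plus one `[UNK]` slot), a GRU layer with Keras's default `reset_after=True` bias layout, and the hyperparameters defined above. An LSTM or a different vocabulary size would give different numbers.

```python
def embedding_params(vocab_size, embedding_dim):
    """One embedding vector per vocabulary entry."""
    return vocab_size * embedding_dim

def gru_params(input_dim, units):
    """Three weight blocks (update gate, reset gate, candidate state).

    Assumes an input kernel, a recurrent kernel, and two bias vectors
    per block, as with reset_after=True.
    """
    return 3 * (input_dim * units + units * units + 2 * units)

def dense_params(input_dim, output_dim):
    """Weight matrix plus bias."""
    return input_dim * output_dim + output_dim

vocab_size = 66      # hypothetical value, see the note above
embedding_dim = 256
rnn_units = 1024

total = (embedding_params(vocab_size, embedding_dim)
         + gru_params(embedding_dim, rnn_units)
         + dense_params(rnn_units, vocab_size))
print(total)
```

If the total printed by `model.summary()` differs, check the vocabulary size and the type of RNN layer first.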

In [None]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

print(sampled_indices)

In [None]:
print("Input:\n", id2text(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", id2text(sampled_indices).numpy())

## Train the model

We have now built our model. The only part left is to train and then test the model.

We need a loss function. Please choose the correct loss function from the following options:


*   Categorical Cross Entropy
*   Sparse Categorical Cross Entropy
*   Binary Cross Entropy
*   MSE

In [None]:
loss = ''' Your Code '''

In [None]:
example_batch_loss = loss(target_example_batch, example_batch_predictions)
mean_loss = example_batch_loss.numpy().mean()
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", mean_loss)

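Before any training, the model's predictions should be close to uniform, so the exponential of the mean loss should be roughly equal to the vocabulary size: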

In [None]:
tf.exp(mean_loss).numpy()
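
Why exponentiate the mean loss? A small plain-Python sketch, assuming a cross-entropy loss computed from logits: if every logit is equal, each of the V characters gets probability 1/V, the loss is ln(V), and its exponential is V. The function `cross_entropy_from_logits` and the vocabulary size of 66 are hypothetical.

```python
import math

def cross_entropy_from_logits(logits, target):
    """Negative log-probability of `target` under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[target]

vocab_size = 66                       # hypothetical vocabulary size
uniform_logits = [0.0] * vocab_size   # a model with no preference at all

loss_value = cross_entropy_from_logits(uniform_logits, target=3)
print(loss_value, math.exp(loss_value))
```

A value far above the vocabulary size for the untrained model suggests that the initial predictions are confidently wrong.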

Compile the model:

In [None]:
model.compile(optimizer='adam', loss=loss)

Configure Checkpoints:

In [None]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
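
The `{epoch}` part of the file name is a placeholder. As far as we understand, the callback fills it in with the epoch number using ordinary Python string formatting, which the following sketch imitates (the loop is a stand-in, not Keras code):

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Stand-in for the end of each epoch: the placeholder is replaced by a number.
for epoch in (1, 2, 3):
    print(checkpoint_prefix.format(epoch=epoch))
```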

In [None]:
EPOCHS = 20

history = ''' Train the model given the dataset, epochs, and callback '''

## Generate Texts

Here we write a class to generate characters based on the model we trained.
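
Two ideas in the class below can be sketched in plain Python first: dividing the logits by a temperature, and adding a mask of `-inf` values so that some ids can never be sampled. This is an illustration under stated assumptions, not the TensorFlow code you are asked to write; `sample_next_id` and the toy `logits` and `mask` are hypothetical.

```python
import math
import random

def sample_next_id(logits, mask, temperature=1.0, rng=random):
    """Sample one id from temperature-scaled, masked logits."""
    # Lower temperature sharpens the distribution; the mask adds -inf
    # to the ids that must never be generated.
    scaled = [x / temperature + m for x, m in zip(logits, mask)]
    peak = max(scaled)
    # exp(-inf) is 0.0, so masked ids get zero sampling weight.
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, 0.1]        # toy scores for 4 ids
mask = [-math.inf, 0.0, 0.0, 0.0]    # id 0 plays the role of "[UNK]"

rng = random.Random(0)
samples = [sample_next_id(logits, mask, temperature=0.5, rng=rng)
           for _ in range(200)]
print(sorted(set(samples)))  # id 0 never appears
```

Even though id 0 has the highest raw score here, the mask removes it from the candidates entirely.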

In [None]:
class OneChar(tf.keras.Model):
  def __init__(self, model, id2char, char2id, temperature=1.0):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = id2char
    self.ids_from_chars = char2id

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape='''Your Code''')
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_char(self, inputs, states=None):
    # Split the input strings into characters (tokens).
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    # Convert tokens to ids
    input_ids = '''Your Code'''

    # Run the model and get the states and predicted logits
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = '''Your Code'''

    # Only use the last prediction.
    predicted_logits = '''Your Code'''
    predicted_logits = predicted_logits/self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = '''Your Code'''

    # Sample the output logits to generate token IDs (use random.categorical).
    predicted_ids = '''Your Code'''
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)

    # Convert from token ids to characters
    predicted_chars = '''Your Code'''

    # Return the characters and model state.
    return predicted_chars, states

In [None]:
one_char_model = OneChar(model, id2char, char2id)

Run it in a loop to generate some text. Looking at the generated text, you'll see that the model knows when to capitalize, how to break paragraphs, and how to imitate a Shakespeare-like vocabulary. With this small number of training epochs, it has not yet learned to form coherent sentences.

In [None]:
start = time.time()
states = None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(1000):
  next_char, states = '''Your Code'''
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

Now play with the model and hyperparameters (epochs, ...) and run this again to see if the results have improved.

In [None]:
start = time.time()
states = None
next_char = tf.constant(['ROMEO:'])
result = [next_char]

for n in range(1000):
  next_char, states = '''Your Code'''
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)