### Libraries import

In [12]:
import tensorflow as tf

import numpy as np
import os
import time

### Download the dataset

In [2]:
!cd sample_data

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Read the data

First, we read the CSV file of common English words and convert it into a single string.

In [13]:
import csv

text = ''
with open('/content/drive/MyDrive/android_dev.txt', 'rt') as f:
  data = csv.reader(f)
  for row in data:
    text += str(row[0]) + " "

Let's look at the text:

In [14]:
text



Let's do some preprocessing on the text using the `re` regular-expression library.

In [15]:
import re
# Punctuation characters to strip from the text
punc = '''!()-[]{};:",<>./?@#$%^&*_~'''

# Replace tabs and newlines with spaces
text = re.sub(r'\t', ' ', text)
text = re.sub(r'\n', ' ', text)
# Drop the special markers <s>, </s> and <unk>
text = re.sub(r'<s>', '', text)
text = re.sub(r'</s>', '', text)
text = re.sub(r'<unk>', '', text)
# Remove every punctuation character (including '_') in a single pass
text = text.translate(str.maketrans('', '', punc))


In [16]:
text



Count the unique characters in the text:

In [17]:
# The unique characters in the file
vocab = sorted(set(text))
print(f'{len(vocab)} unique characters')

54 unique characters


## Process the text

### Vectorize the text

Before training, we need to convert the strings to a numerical representation so that the network can process them.

The `tf.keras.layers.StringLookup` layer can convert each character into a numeric ID. It just needs the text to be split into tokens first.

In [18]:
example_texts = ['abcdefg', 'xyz']

chars = tf.strings.unicode_split(example_texts, input_encoding='UTF-8')

chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

Now create a `tf.keras.layers.StringLookup` layer, which we will use as a function:

In [19]:
ids_from_chars = tf.keras.layers.StringLookup(
    vocabulary=list(vocab), mask_token=None)

It converts from tokens to character IDs:

In [20]:
ids = ids_from_chars(chars)
ids

<tf.RaggedTensor [[29, 30, 31, 32, 33, 34, 35], [52, 53, 54]]>
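
The IDs start at 1 rather than 0 because, with `mask_token=None`, the layer reserves index 0 for the `[UNK]` (out-of-vocabulary) token. This is also why `vocab_size` is 55 although the text has 54 unique characters. The following plain-Python sketch mimics that mapping and its inverse on a toy text; the names `toy_vocab`, `encode` and `decode` are illustrative and not part of the notebook.

```python
# Toy vocabulary built the same way as `vocab` in the notebook.
toy_vocab = sorted(set("hello world"))   # hypothetical sample text
# StringLookup with mask_token=None puts the [UNK] token at index 0.
lookup_vocab = ["[UNK]"] + toy_vocab
char_to_id = {ch: i for i, ch in enumerate(lookup_vocab)}

def encode(s):
    """Map each character to its ID; unknown characters map to 0."""
    return [char_to_id.get(ch, 0) for ch in s]

def decode(ids):
    """Inverse lookup: map IDs back to characters and join them."""
    return "".join(lookup_vocab[i] for i in ids)

toy_ids = encode("hello")
print(toy_ids)                 # [4, 3, 5, 5, 6]
print(decode(toy_ids))         # hello
print(decode(encode("hex")))   # he[UNK]  ('x' is not in the toy vocabulary)
```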

Since we want to generate text, even if we stop at the end of the first predicted word, it is also important to invert this representation and recover human-readable strings from it. For this you can use `tf.keras.layers.StringLookup` with the parameter `invert=True`. Note that we create a second instance of the layer rather than reusing the first one.

We pass the result of the first layer's `get_vocabulary()` method as the vocabulary so that the `[UNK]` token is set the same way.

In [21]:
chars_from_ids = tf.keras.layers.StringLookup(
    vocabulary=ids_from_chars.get_vocabulary(), invert=True, mask_token=None)

As indicated, this layer recovers the characters from the vectors of IDs, and returns them as a `tf.RaggedTensor` of characters:

In [22]:
chars = chars_from_ids(ids)
chars

<tf.RaggedTensor [[b'a', b'b', b'c', b'd', b'e', b'f', b'g'], [b'x', b'y', b'z']]>

We can use `tf.strings.reduce_join` to join the predicted characters back into strings.

In [23]:
tf.strings.reduce_join(chars, axis=-1).numpy()

array([b'abcdefg', b'xyz'], dtype=object)

In [24]:
def text_from_ids(ids):
  return tf.strings.reduce_join(chars_from_ids(ids), axis=-1)

### The prediction task

In general, given a character or a sequence of characters, the model predicts the most probable next character. The input to the model will be a sequence of characters, and you train the model to predict the output: the following character at each time step.


### Create training examples and targets

Next divide the text into example sequences. Each input sequence will contain `seq_length` characters from the text.

For each input sequence, the corresponding targets contain the same length of text, except shifted one character to the right.

So break the text into chunks of `seq_length+1`.
To do this first use the `tf.data.Dataset.from_tensor_slices` function to convert the text vector into a stream of character indices.

In [25]:
all_ids = ids_from_chars(tf.strings.unicode_split(text, 'UTF-8'))
all_ids

<tf.Tensor: shape=(334017,), dtype=int64, numpy=array([11,  1, 40, ..., 41, 33,  1])>

In [26]:
ids_dataset = tf.data.Dataset.from_tensor_slices(all_ids)

In [27]:
for ids in ids_dataset.take(10):
    print(chars_from_ids(ids).numpy().decode('utf-8'))

I
 
l
i
k
e
 
t
h
e


In [28]:
seq_length = 10

The `batch` method lets you easily convert the individual characters to sequences of the size chosen above.

In [29]:
sequences = ids_dataset.batch(seq_length+1, drop_remainder=True)

for seq in sequences.take(1):
  print(chars_from_ids(seq))

tf.Tensor([b'I' b' ' b'l' b'i' b'k' b'e' b' ' b't' b'h' b'e' b' '], shape=(11,), dtype=string)


For training, the dataset needs to provide `(input, label)` pairs, where `input` and `label` are sequences of characters. At each time step the input is the current character and the label is the next character.

In [30]:
def split_input_target(sequence):
    input_text = sequence[:-1]
    target_text = sequence[1:]
    return input_text, target_text

In [31]:
split_input_target(list("InternetOfThings"))

(['I', 'n', 't', 'e', 'r', 'n', 'e', 't', 'O', 'f', 'T', 'h', 'i', 'n', 'g'],
 ['n', 't', 'e', 'r', 'n', 'e', 't', 'O', 'f', 'T', 'h', 'i', 'n', 'g', 's'])

We map the `split_input_target` function over the entire dataset.

In [32]:
dataset = sequences.map(split_input_target)

Let's look at an example

In [33]:
for input_example, target_example in dataset.take(1):
    print("Input :", text_from_ids(input_example).numpy())
    print("Target:", text_from_ids(target_example).numpy())

Input : b'I like the'
Target: b' like the '
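
The chunk-and-shift steps above can also be sketched without TensorFlow, which makes the off-by-one relationship between input and target easy to inspect. This is only an illustration: `make_examples` and the sample sentence are hypothetical, and the sketch works on characters directly instead of on IDs.

```python
def make_examples(text, seq_length):
    """Cut text into chunks of seq_length + 1 characters (dropping the
    remainder) and split each chunk into an (input, target) pair."""
    chunk = seq_length + 1
    pairs = []
    for i in range(len(text) // chunk):
        seq = text[i * chunk:(i + 1) * chunk]
        # Input drops the last character, target drops the first one.
        pairs.append((seq[:-1], seq[1:]))
    return pairs

examples = make_examples("I like the weather today", seq_length=10)
for inp, tgt in examples:
    print(repr(inp), "->", repr(tgt))
# 'I like the' -> ' like the '
# 'weather to' -> 'eather tod'
```

Each target character sits one position after the input character at the same index, which is what the model is trained to predict at every time step.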


### Create training batches

You used `tf.data` to split the text into manageable sequences. Before feeding this data into the model, you need to shuffle it and divide it into batches of the desired size.

In [34]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
BUFFER_SIZE = 10000

dataset = (
    dataset
    .shuffle(BUFFER_SIZE)
    .batch(BATCH_SIZE, drop_remainder=True)
    .prefetch(tf.data.experimental.AUTOTUNE))

dataset

<PrefetchDataset element_spec=(TensorSpec(shape=(64, 10), dtype=tf.int64, name=None), TensorSpec(shape=(64, 10), dtype=tf.int64, name=None))>
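
As a rough sanity check, the number of sequences and of batches per epoch follows from the sizes shown earlier. The sketch below only redoes that arithmetic, taking the 334017 character IDs reported for `all_ids`; the variable names are illustrative.

```python
total_chars = 334017   # length of all_ids reported above
seq_length = 10
batch_size = 64
buffer_size = 10000

# Each sequence consumes seq_length + 1 characters; the remainder is dropped.
num_sequences = total_chars // (seq_length + 1)
# drop_remainder=True also discards the last incomplete batch.
steps_per_epoch = num_sequences // batch_size

print(num_sequences)                 # 30365
print(steps_per_epoch)               # 474
print(buffer_size < num_sequences)   # True
```

Because the shuffle buffer is smaller than the number of sequences, the shuffle is approximate rather than a uniform permutation of the whole dataset.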

## Build The Model

This section defines the model as a `keras.Model` subclass.

This model has three layers:

* `tf.keras.layers.Embedding`: The input layer. A trainable lookup table that maps each character ID to a vector with `embedding_dim` dimensions.
* `tf.keras.layers.GRU`: A type of RNN with size `units=rnn_units`, used here as an alternative to an LSTM.
* `tf.keras.layers.Dense`: The output layer, with `vocab_size` outputs. It outputs one logit for each character in the vocabulary.

In [35]:
# Length of the vocabulary in StringLookup Layer
vocab_size = len(ids_from_chars.get_vocabulary())

# The embedding dimension
embedding_dim = 256

# Number of RNN units
rnn_units = 1024

Definition of the model

In [36]:
class ActualWordPredictionModel(tf.keras.Model):
  def __init__(self, vocab_size, embedding_dim, rnn_units):
    super().__init__()
    self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
    self.gru = tf.keras.layers.GRU(rnn_units,
                                   return_sequences=True,
                                   return_state=True)
    self.dense = tf.keras.layers.Dense(vocab_size)

  def call(self, inputs, states=None, return_state=False, training=False):
    x = inputs
    x = self.embedding(x, training=training)
    if states is None:
      states = self.gru.get_initial_state(x)
    x, states = self.gru(x, initial_state=states, training=training)
    x = self.dense(x, training=training)

    if return_state:
      return x, states
    else:
      return x

In [37]:
model = ActualWordPredictionModel(
    vocab_size=vocab_size,
    embedding_dim=embedding_dim,
    rnn_units=rnn_units)

Note: For training you could use a `keras.Sequential` model here. To generate text later you'll need to manage the RNN's internal state. It's simpler to include the state input and output options upfront than to rearrange the model architecture later.

## Run the model

First check the shape of the output:

In [38]:
for input_example_batch, target_example_batch in dataset.take(1):
    example_batch_predictions = model(input_example_batch)
    print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 10, 55) # (batch_size, sequence_length, vocab_size)


Let's take a look at the summary of the model:

In [39]:
model.summary()

Model: "actual_word_prediction_model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       multiple                  14080     
                                                                 
 gru (GRU)                   multiple                  3938304   
                                                                 
 dense (Dense)               multiple                  56375     
                                                                 
Total params: 4,008,759
Trainable params: 4,008,759
Non-trainable params: 0
_________________________________________________________________
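
The parameter counts in the summary can be reproduced by hand. The sketch below assumes the default Keras GRU configuration in TensorFlow 2 (`reset_after=True`), in which each of the three gates has an input kernel, a recurrent kernel and two bias vectors; if the layer were configured differently, the GRU figure would change.

```python
vocab_size = 55
embedding_dim = 256
rnn_units = 1024

# Embedding: one embedding_dim-sized vector per vocabulary entry.
embedding_params = vocab_size * embedding_dim
# GRU: 3 gates, each with input weights, recurrent weights and 2 biases.
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)
# Dense: weight matrix plus one bias per output logit.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
# 14080 3938304 56375 4008759
```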


To get actual predictions from the model, you need to sample from the output distribution to obtain character indices. This distribution is defined by the logits over the character vocabulary.

Example:

In [40]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices, axis=-1).numpy()

This gives us, at each timestep, a prediction of the next character index:

In [41]:
sampled_indices

array([17, 32, 30, 54, 49, 54, 24, 51, 31, 21])

Decode these to see the text predicted by this untrained model:

In [42]:
print("Input:\n", text_from_ids(input_example_batch[0]).numpy())
print()
print("Next Char Predictions:\n", text_from_ids(sampled_indices).numpy())

Input:
 b's that I h'

Next Char Predictions:
 b'OdbzuzVwcS'


## Training step of the model

### Attach an optimizer, and a loss function

The standard `tf.keras.losses.sparse_categorical_crossentropy` loss function works in this case because it is applied across the last dimension of the predictions.

Because the model returns logits, we have to set the `from_logits` flag.


In [43]:
loss = tf.losses.SparseCategoricalCrossentropy(from_logits=True)

And here we just see if everything works fine...

In [44]:
example_batch_mean_loss = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("Mean loss:        ", example_batch_mean_loss)

Prediction shape:  (64, 10, 55)  # (batch_size, sequence_length, vocab_size)
Mean loss:         tf.Tensor(4.0076737, shape=(), dtype=float32)


In [45]:
tf.exp(example_batch_mean_loss).numpy()

55.01873
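
The exponential of the mean loss is a useful sanity check: an untrained model should spread its probability roughly evenly over the vocabulary, so its cross-entropy should be close to ln(vocab_size) and the exponential close to `vocab_size` itself. A short check of that arithmetic for the 55-entry vocabulary used here, assuming an exactly uniform prediction:

```python
import math

vocab_size = 55

# Cross-entropy of a uniform prediction: -log(1 / vocab_size).
uniform_loss = -math.log(1.0 / vocab_size)
print(round(uniform_loss, 4))        # 4.0073
# Exponentiating recovers the vocabulary size.
print(round(math.exp(uniform_loss))) # 55
```

The measured values above (a loss of about 4.0077 and an exponential of about 55.02) are close to these figures, which suggests the freshly initialized model is behaving as expected.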

Configure the training with the `tf.keras.Model.compile` method, passing it the `tf.keras.optimizers.Adam` optimizer (default learning rate 0.001) and the selected loss function:

In [46]:
model.compile(optimizer='adam', loss=loss)

### Configure checkpoints

Use a `tf.keras.callbacks.ModelCheckpoint` to ensure that checkpoints are saved during training:

In [47]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)
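
The `{epoch}` placeholder in the prefix is filled in by the callback when it saves, using Python string formatting with a 1-based epoch number. The sketch below only illustrates how the prefix expands; the exact files written on disk (and their extensions) depend on the TensorFlow version and save format, so treat the names as prefixes.

```python
import os

checkpoint_dir = './training_checkpoints'
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

# Expand the placeholder for the first, second and last of 30 epochs.
prefixes = [checkpoint_prefix.format(epoch=e) for e in (1, 2, 30)]
for p in prefixes:
    print(p)
```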

### Execute the training

In [48]:
EPOCHS = 30

In [49]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


## Generate text

The following makes a single step prediction:

In [50]:
class OneStep(tf.keras.Model):
  def __init__(self, model, chars_from_ids, ids_from_chars, temperature=0.5):
    super().__init__()
    self.temperature = temperature
    self.model = model
    self.chars_from_ids = chars_from_ids
    self.ids_from_chars = ids_from_chars

    # Create a mask to prevent "[UNK]" from being generated.
    skip_ids = self.ids_from_chars(['[UNK]'])[:, None]
    sparse_mask = tf.SparseTensor(
        # Put a -inf at each bad index.
        values=[-float('inf')]*len(skip_ids),
        indices=skip_ids,
        # Match the shape to the vocabulary
        dense_shape=[len(ids_from_chars.get_vocabulary())])
    self.prediction_mask = tf.sparse.to_dense(sparse_mask)

  @tf.function
  def generate_one_step(self, inputs, states=None):
    # Convert strings to token IDs.
    input_chars = tf.strings.unicode_split(inputs, 'UTF-8')
    input_ids = self.ids_from_chars(input_chars).to_tensor()

    # Run the model.
    # predicted_logits.shape is [batch, char, next_char_logits]
    predicted_logits, states = self.model(inputs=input_ids, states=states,
                                          return_state=True)
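    # Debug print: inside a tf.function this runs only during tracing,
    # so it shows the symbolic tensor, not its values.
    print(predicted_logits)
    # Only use the last prediction.
    predicted_logits = predicted_logits[:, -1, :]
    # Scale by the temperature: values below 1 sharpen the distribution.
    predicted_logits = predicted_logits / self.temperature
    # Apply the prediction mask: prevent "[UNK]" from being generated.
    predicted_logits = predicted_logits + self.prediction_mask

    # Sample the output logits to generate token IDs.
    predicted_ids = tf.random.categorical(predicted_logits, num_samples=1)
    print(predicted_ids)  # Debug print, also executed only during tracing.
    predicted_ids = tf.squeeze(predicted_ids, axis=-1)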

    # Convert from token ids to characters
    predicted_chars = self.chars_from_ids(predicted_ids)
    
    # Return the characters and model state.
    return predicted_chars, states

In [51]:
one_step_model = OneStep(model, chars_from_ids, ids_from_chars)
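
`generate_one_step` divides the logits by `temperature` before sampling and adds `-inf` to the `[UNK]` logit. The plain-Python sketch below shows both effects on a made-up four-entry vocabulary; the logit values and the helper names `softmax` and `sampling_probs` are hypothetical and serve only to illustrate the idea.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.0, 2.0, 1.0, 0.5]          # hypothetical; index 0 plays [UNK]
mask = [-math.inf, 0.0, 0.0, 0.0]      # -inf removes [UNK] from sampling

def sampling_probs(logits, temperature):
    """Probabilities actually sampled from after scaling and masking."""
    return softmax([x / temperature + m for x, m in zip(logits, mask)])

p_default = sampling_probs(logits, temperature=1.0)
p_sharp = sampling_probs(logits, temperature=0.5)

print([round(p, 3) for p in p_default])
print([round(p, 3) for p in p_sharp])
```

With the lower temperature the most likely character receives a larger share of the probability, so the output becomes more predictable, and the masked entry has probability zero in both cases.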

Running it in a loop generates text. Since we do not need a full passage of text (we already have a model doing that), it is enough for us to predict until the next space character is encountered.

In [52]:
start = time.time()
states = None
next_char = tf.constant(['bit'])
result = [next_char]

for n in range(1000):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

  if next_char == ' ':
    break

result = tf.strings.join(result)
end = time.time()
print(result[0].numpy().decode('utf-8'), '\n\n' + '_'*80)
print('\nRun time:', end - start)

Tensor("actual_word_prediction_model/dense/BiasAdd:0", shape=(1, None, 55), dtype=float32)
Tensor("categorical/Multinomial:0", shape=(1, 1), dtype=int64)
bit  

________________________________________________________________________________

Run time: 0.725339412689209


The model can also generate text *faster* when it is given a batch of start strings: several outputs then take about the same time as a single one. The example below still uses a single seed, `'hel'`, but runs for a fixed 200 steps instead of stopping at the first space.

In [53]:
start = time.time()
states = None
next_char = tf.constant(['hel'])
result = [next_char]

for n in range(200):
  next_char, states = one_step_model.generate_one_step(next_char, states=states)
  result.append(next_char)

result = tf.strings.join(result)
end = time.time()
print(result, '\n\n' + '_'*80)
print('\nRun time:', end - start)

Tensor("actual_word_prediction_model/dense/BiasAdd:0", shape=(1, None, 55), dtype=float32)
Tensor("categorical/Multinomial:0", shape=(1, 1), dtype=int64)
tf.Tensor([b"hell out of the parts near or run lent and with most a couple a week Preven seemed not the side of the second see any post you failed But Le's store The plopers and this will be most as well so much Look"], shape=(1,), dtype=string) 

________________________________________________________________________________

Run time: 1.1530115604400635


## Export the generator

This single-step model can easily be saved, which allows us to use it in the device code as well, provided the TensorFlow library is available there.

In [54]:
tf.saved_model.save(one_step_model, 'one_step')
one_step_reloaded = tf.saved_model.load('one_step')



Tensor("actual_word_prediction_model/dense/BiasAdd:0", shape=(1, None, 55), dtype=float32)
Tensor("categorical/Multinomial:0", shape=(1, 1), dtype=int64)
Tensor("actual_word_prediction_model/dense/BiasAdd:0", shape=(1, None, 55), dtype=float32)
Tensor("categorical/Multinomial:0", shape=(1, 1), dtype=int64)




In [61]:
states = None
next_char = tf.constant(['i'])
result = [next_char]

for n in range(100):
  next_char, states = one_step_reloaded.generate_one_step(next_char, states=states)
  result.append(next_char)

print(tf.strings.join(result)[0].numpy().decode("utf-8"))

ing the MAFChbl they really get a green show Its a good prone to stom a buspects and just because the
