<a href="https://colab.research.google.com/github/vtkachuk4/next_word_predictor/blob/master/word_based_word_predictor_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview


This notebook creates and trains a next-word predictor model.

The model was trained on Shakespeare text so the next word predictions are in the style of Shakespeare :)

The accompanying next-word predictor lives in the GitHub repository https://github.com/vtkachuk4/next_word_predictor.git, in the notebook "Next-word Predictor".

After training, the model was saved, downloaded, and then uploaded to the corresponding GitHub repository for later use.

The source code for this model is largely based on the sample code provided here: https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/text_generation.ipynb#scrollTo=WvuwZBX5Ogfd

The character-based predictor from that sample was changed to a word-based predictor, with word embeddings taken from the pre-trained Universal Sentence Encoder: https://tfhub.dev/google/universal-sentence-encoder/4

All sources are cited in comments above the code they apply to.

# Model Creation and Training

In [0]:
import tensorflow as tf

import numpy as np
import os
import time

In [0]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

In [99]:
# Read, then decode for py2 compat.
text = open(path_to_file, 'rb').read().decode(encoding='utf-8')
# len() on a string counts characters, not words
print ('Length of text: {} characters'.format(len(text)))

Length of text: 1115394 characters
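`len` applied to a Python string counts characters, so the figure above is a character count; a word count only becomes available once the text is split into tokens. A minimal sketch of the difference, using a made-up sample string rather than the full corpus:

```python
# Hypothetical sample; the notebook's corpus is the full Shakespeare file.
sample = "First Citizen:\nBefore we proceed any further, hear me speak."

num_chars = len(sample)          # counts every character, including spaces and newlines
num_words = len(sample.split())  # counts whitespace-separated tokens

print("characters:", num_chars)
print("words:", num_words)
```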


In [0]:
# source: https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/
# used for cleaning up the Shakespeare text
import string

# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens
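To see what this cleaning does in practice, here is a small self-contained sketch that repeats the same steps on a made-up line (the sample string is an assumption, not corpus text):

```python
import string

def clean_doc(doc):
    # Same steps as the cell above: split on whitespace, strip punctuation,
    # keep alphabetic tokens only, lower-case.
    doc = doc.replace('--', ' ')
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in doc.split()]
    return [w.lower() for w in tokens if w.isalpha()]

# Hypothetical sample line, not taken from the corpus.
sample = "We know't--hear me speak! It's the 1st word."
cleaned = clean_doc(sample)
print(cleaned)
```

Because apostrophes count as punctuation, contractions are fused into a single token (`know't` becomes `knowt`), and tokens containing digits are dropped entirely.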

In [101]:
# source: https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/
# Double check the text is actually clean
tokens = clean_doc(text)
unique_words = sorted(set(tokens))
print(tokens[:20])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(set(tokens)))

['first', 'citizen', 'before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak', 'all', 'speak', 'speak', 'first', 'citizen', 'you', 'are', 'all', 'resolved', 'rather']
Total Tokens: 202820
Unique Tokens: 12669


In [0]:
# Rename the variables so the rest of the sample code can be reused unchanged
vocab = unique_words
text = tokens

In [0]:
# Creating a mapping from unique words to indices
word2idx = {u:i for i, u in enumerate(vocab)}
idx2word = np.array(vocab)

text_as_int = np.array([word2idx[c] for c in text])
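The two lookups are inverses of each other, so encoding and then decoding should give back the original tokens. A minimal sketch with a made-up token list and plain Python lists in place of NumPy arrays:

```python
# Hypothetical token list standing in for the cleaned corpus.
tokens = ['first', 'citizen', 'before', 'we', 'proceed', 'first', 'citizen']

vocab = sorted(set(tokens))                           # unique words, alphabetical
word2idx = {word: i for i, word in enumerate(vocab)}  # word -> integer id
idx2word = list(vocab)                                # integer id -> word

encoded = [word2idx[w] for w in tokens]
decoded = [idx2word[i] for i in encoded]

print(vocab)
print(encoded)
```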

In [106]:
# print the first 20 words and corresponding int representations
print('{')
for word,_ in zip(word2idx, range(20)):
    print('  {:4s}: {:3d},'.format(repr(word), word2idx[word]))
print('  ...\n}')

{
  'a' :   0,
  'abandond':   1,
  'abase':   2,
  'abate':   3,
  'abated':   4,
  'abbey':   5,
  'abbot':   6,
  'abed':   7,
  'abels':   8,
  'abet':   9,
  'abhor':  10,
  'abhorrd':  11,
  'abhorred':  12,
  'abhorring':  13,
  'abhors':  14,
  'abhorson':  15,
  'abide':  16,
  'abides':  17,
  'abilities':  18,
  'ability':  19,
  ...
}


In [107]:
# Show how the first 13 words from the text are mapped to integers
print ('{} ---- words mapped to int ---- > {}'.format(repr(text[:13]), text_as_int[:13]))

['first', 'citizen', 'before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak', 'all', 'speak', 'speak'] ---- words mapped to int ---- > [ 4143  1933   924 12174  8393   450  4536  5129  6744 10182   302 10182
 10182]


In [108]:
# The maximum length sentence we want for a single input in words
seq_length = 100
examples_per_epoch = len(text)//(seq_length+1)

# Create training examples / targets
word_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)

for i in word_dataset.take(5):
  print(idx2word[i.numpy()])

first
citizen
before
we
proceed


In [109]:
# print first five sequences
sequences = word_dataset.batch(seq_length+1, drop_remainder=True)

for item in sequences.take(5):
  print(repr(' '.join(idx2word[item.numpy()])))

'first citizen before we proceed any further hear me speak all speak speak first citizen you are all resolved rather to die than to famish all resolved resolved first citizen first you know caius marcius is chief enemy to the people all we knowt we knowt first citizen let us kill him and well have corn at our own price ist a verdict all no more talking ont let it be done away away second citizen one word good citizens first citizen we are accounted poor citizens the patricians good what authority surfeits on would relieve us if they would yield'
'us but the superfluity while it were wholesome we might guess they relieved us humanely but they think we are too dear the leanness that afflicts us the object of our misery is as an inventory to particularise their abundance our sufferance is a gain to them let us revenge this with our pikes ere we become rakes for the gods know i speak this in hunger for bread not in thirst for revenge second citizen would you proceed especially against caiu

In [0]:
# Create the target sequence by offsetting the input by one word
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text

dataset = sequences.map(split_input_target)
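The batching and splitting can be followed by hand on a short list of ids. This is a plain-Python sketch of the same idea, not the `tf.data` implementation; `make_examples` is a hypothetical helper name:

```python
def make_examples(ids, seq_length):
    """Cut ids into chunks of seq_length + 1 and split each into (input, target)."""
    chunk_len = seq_length + 1
    examples = []
    # Stop before any incomplete trailing chunk, mirroring drop_remainder=True.
    for start in range(0, len(ids) - chunk_len + 1, chunk_len):
        chunk = ids[start:start + chunk_len]
        examples.append((chunk[:-1], chunk[1:]))
    return examples

ids = list(range(10))          # hypothetical word ids 0..9
examples = make_examples(ids, seq_length=3)
for inp, tgt in examples:
    print(inp, '->', tgt)
```

Each target is the input shifted left by one position, and the trailing ids that do not fill a whole chunk are discarded.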

In [111]:
for input_example, target_example in  dataset.take(1):
  print ('Input data: ', repr(' '.join(idx2word[input_example.numpy()])))
  print ('Target data:', repr(' '.join(idx2word[target_example.numpy()])))

Input data:  'first citizen before we proceed any further hear me speak all speak speak first citizen you are all resolved rather to die than to famish all resolved resolved first citizen first you know caius marcius is chief enemy to the people all we knowt we knowt first citizen let us kill him and well have corn at our own price ist a verdict all no more talking ont let it be done away away second citizen one word good citizens first citizen we are accounted poor citizens the patricians good what authority surfeits on would relieve us if they would'
Target data: 'citizen before we proceed any further hear me speak all speak speak first citizen you are all resolved rather to die than to famish all resolved resolved first citizen first you know caius marcius is chief enemy to the people all we knowt we knowt first citizen let us kill him and well have corn at our own price ist a verdict all no more talking ont let it be done away away second citizen one word good citizens first citize

In [112]:
for i, (input_idx, target_idx) in enumerate(zip(input_example[:5], target_example[:5])):
    print("Step {:4d}".format(i))
    print("  input: {} ({:s})".format(input_idx, repr(idx2word[input_idx])))
    print("  expected output: {} ({:s})".format(target_idx, repr(idx2word[target_idx])))

Step    0
  input: 4143 ('first')
  expected output: 1933 ('citizen')
Step    1
  input: 1933 ('citizen')
  expected output: 924 ('before')
Step    2
  input: 924 ('before')
  expected output: 12174 ('we')
Step    3
  input: 12174 ('we')
  expected output: 8393 ('proceed')
Step    4
  input: 8393 ('proceed')
  expected output: 450 ('any')


In [113]:
# Batch size
BATCH_SIZE = 64

# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 10000

dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

dataset

<BatchDataset shapes: ((64, 100), (64, 100)), types: (tf.int64, tf.int64)>
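The buffer comment describes the general idea of a buffered shuffle. The following is a rough plain-Python sketch of that idea, assuming `buffer_size >= 1`; it is not TensorFlow's actual implementation, and `buffered_shuffle` is a hypothetical name:

```python
import random

def buffered_shuffle(items, buffer_size, seed=None):
    """Yield items in a shuffled order while holding at most buffer_size of them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # still filling the buffer
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Input exhausted: drain what is left in random order.
    rng.shuffle(buffer)
    yield from buffer

shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=0))
print(shuffled)
```

When the buffer is smaller than the dataset the shuffle is only local. Here the 202820 tokens yield 2008 sequences of 101 words, so a buffer of 10000 holds all of them and the shuffle covers the whole dataset.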

In [0]:
# Length of the vocabulary in words
vocab_size = len(vocab)

# The embedding dimension, based on universal sentence encoder output
embedding_dim = 512

# Number of RNN units
rnn_units = 1024

In [0]:
# source: https://tfhub.dev/google/universal-sentence-encoder/4
import tensorflow_hub as hub
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# source: https://medium.com/@pi19404/using-pre-trained-word-vector-embeddings-for-sequence-classification-using-lstm-277dee188348
# create an embedding_matrix that maps word idx representations to their 
# corresponding word embeddings
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word in vocab:
    embedding = embed([word]).numpy()[0]
    embedding_matrix[word2idx[word]] = embedding




In [0]:
# Define a build_model function whose Embedding layer is initialized with the
# pre-trained Universal Sentence Encoder vectors and kept frozen
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim,
                              weights = [embedding_matrix],
                              batch_input_shape=[batch_size, None],
                              trainable=False),
    tf.keras.layers.GRU(rnn_units,
                        return_sequences=True,
                        stateful=True,
                        recurrent_initializer='glorot_uniform'),
    tf.keras.layers.Dense(vocab_size)
  ])
  return model

In [0]:
model = build_model(
  vocab_size = len(vocab),
  embedding_dim=embedding_dim,
  rnn_units=rnn_units,
  batch_size=BATCH_SIZE)

In [118]:
for input_example_batch, target_example_batch in dataset.take(1):
  example_batch_predictions = model(input_example_batch)
  print(example_batch_predictions.shape, "# (batch_size, sequence_length, vocab_size)")

(64, 100, 12669) # (batch_size, sequence_length, vocab_size)


In [119]:
model.summary()

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (64, None, 512)           6486528   
_________________________________________________________________
gru_4 (GRU)                  (64, None, 1024)          4724736   
_________________________________________________________________
dense_4 (Dense)              (64, None, 12669)         12985725  
Total params: 24,196,989
Trainable params: 17,710,461
Non-trainable params: 6,486,528
_________________________________________________________________
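The parameter counts in the summary can be reproduced by hand. The sketch below assumes the GRU uses the Keras default `reset_after=True`, which gives each of its three weight sets two bias vectors; the other figures come straight from the notebook:

```python
vocab_size = 12669      # unique tokens reported earlier in the notebook
embedding_dim = 512
rnn_units = 1024

# Embedding: one vector of size embedding_dim per vocabulary entry.
embedding_params = vocab_size * embedding_dim

# GRU: three weight sets (update gate, reset gate, candidate state), each with
# an input kernel, a recurrent kernel and two bias vectors (assumes reset_after=True).
gru_params = 3 * (rnn_units * (embedding_dim + rnn_units) + 2 * rnn_units)

# Dense: a weight matrix from rnn_units to vocab_size, plus one bias per word.
dense_params = rnn_units * vocab_size + vocab_size

total = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total)
```

The embedding weights are the only non-trainable ones, since the layer is built with `trainable=False`.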


In [0]:
sampled_indices = tf.random.categorical(example_batch_predictions[0], num_samples=1)
sampled_indices = tf.squeeze(sampled_indices,axis=-1).numpy()

In [121]:
sampled_indices

array([ 8360, 11220,  9968,  2005,  4103, 10214, 11890,  7272,  3780,
       11352, 11222,  5920,  4668,  7819,   343,  9084,  2534, 11293,
        1012,  2978,  7378,  2768,  9326,  5201, 12011,  5824,  8538,
        9397, 11083,  2306,  3804,  6950,  7286, 10304, 10581,  2367,
        9146,  1131,  5604,  6454, 10701,  7720,  6276,  3688,   911,
        3300,  8955,  7732, 11823,  6279,  1493,  4226,  5804,  8297,
        5991,  5114,  1036,  3894,  4656,  8545, 11948,   580, 12597,
       10467,  3863,  9072,  4738, 12421,  9068, 10134, 12567, 11085,
       10327,  8149,  9686,  5852,  3684,  4195, 10836,  8979,  7448,
        5740, 10692, 11731,  6469,  3143,  2181,  1490,  1950, 12017,
        9606,   340, 11774,  1021,  8782,  4923,  4128,  2560,  5566,
       11144])

In [122]:
print("Input: \n", repr(" ".join(idx2word[input_example_batch[0]])))
print()
print("Next word Predictions: \n", repr(" ".join(idx2word[sampled_indices ])))

Input: 
 'it was senseless twas nothing to geld a codpiece of a purse i could have filed keys off that hung in chains no hearing no feeling but my sirs song and admiring the nothing of it so that in this time of lethargy i picked and cut most of their festival purses and had not the old man come in with a whoobub against his daughter and the kings son and scared my choughs from the chaff i had not left a purse alive in the whole army camillo nay but my letters by this means being there so soon'

Next word Predictions: 
 'priesthood tongue sleepy clouded filed spheres vassals new excepting tribunes tongues jouncing gibingly patron alter revive crave traind bennet determinate nourish dearth saint helmet visitors invisible psalms scandalous threatened contain exhibit misery newness stabs subject cony ripening bitterness imprisont loose suppresseth palsy letters entrails bedward doubleness repented paper urging level bull flock interred preparation kernels headlong beshrew fairly get puddi

In [123]:
def loss(labels, logits):
  return tf.keras.losses.sparse_categorical_crossentropy(labels, logits, from_logits=True)

example_batch_loss  = loss(target_example_batch, example_batch_predictions)
print("Prediction shape: ", example_batch_predictions.shape, " # (batch_size, sequence_length, vocab_size)")
print("scalar_loss:      ", example_batch_loss.numpy().mean())

Prediction shape:  (64, 100, 12669)  # (batch_size, sequence_length, vocab_size)
scalar_loss:       9.44701
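An untrained model should score close to a uniform guess over the vocabulary, for which the cross-entropy is the natural log of the vocabulary size. A minimal pure-Python sketch of that check (the helper name is hypothetical, and this is not the Keras implementation):

```python
import math

def sparse_crossentropy_from_logits(label, logits):
    """Cross-entropy for one position: log-sum-exp of the logits minus the true logit."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum_exp - logits[label]

vocab_size = 12669
uniform_logits = [0.0] * vocab_size   # a model with no preference among words
baseline = sparse_crossentropy_from_logits(label=0, logits=uniform_logits)
print(round(baseline, 4))
```

The measured starting loss of 9.44701 sits almost exactly at this baseline, which suggests the initial predictions are close to uniform.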


In [0]:
model.compile(optimizer='adam', loss=loss)

In [0]:
# Directory where the checkpoints will be saved
checkpoint_dir = './training_checkpoints'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")

checkpoint_callback=tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True)

In [0]:
EPOCHS=10

In [127]:
history = model.fit(dataset, epochs=EPOCHS, callbacks=[checkpoint_callback])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [128]:
tf.train.latest_checkpoint(checkpoint_dir)

'./training_checkpoints/ckpt_10'

In [0]:
model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)

model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))

model.build(tf.TensorShape([1, None]))

In [130]:
model.summary()

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_5 (Embedding)      (1, None, 512)            6486528   
_________________________________________________________________
gru_5 (GRU)                  (1, None, 1024)           4724736   
_________________________________________________________________
dense_5 (Dense)              (1, None, 12669)          12985725  
Total params: 24,196,989
Trainable params: 17,710,461
Non-trainable params: 6,486,528
_________________________________________________________________


In [0]:
def generate_text(model, start_tokens):
  # Evaluation step (generating text using the learned model)

  # Number of words to generate
  num_generate = 1

  # Converting our start tokens to numbers (vectorizing)
  input_eval = [word2idx[s] for s in start_tokens]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store the generated words
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # sample the next word from the categorical distribution given by the logits
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted word as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2word[predicted_id])

  return (' '.join(start_tokens) + ' ' + ' '.join(text_generated))
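Dividing the logits by the temperature before sampling changes how peaked the resulting distribution is. A small pure-Python sketch of the effect, using made-up logits for three candidate words:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]   # hypothetical scores for three candidate words
cold = softmax_with_temperature(logits, temperature=0.5)
neutral = softmax_with_temperature(logits, temperature=1.0)
hot = softmax_with_temperature(logits, temperature=2.0)
print(cold, neutral, hot)
```

A temperature below 1 pushes more probability onto the top-scoring word, while a temperature above 1 flattens the distribution so less likely words are sampled more often.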

In [132]:
print(generate_text(model, start_tokens=['romeo', 'oh', 'art', 'thou']))

romeo oh art thou said
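`generate_text` looks each start token up in `word2idx` directly, so a word outside the vocabulary, or one with capital letters, raises a `KeyError`. One possible guard is to normalize and filter the tokens first; `known_tokens` below is a hypothetical helper, not part of the notebook:

```python
def known_tokens(start_tokens, word2idx):
    """Lower-case the tokens and keep only those present in the vocabulary."""
    lowered = [t.lower() for t in start_tokens]
    return [t for t in lowered if t in word2idx]

# Hypothetical miniature vocabulary; the notebook's has 12669 entries.
word2idx = {'art': 0, 'oh': 1, 'romeo': 2, 'thou': 3}
safe = known_tokens(['Romeo', 'oh', 'art', 'thou', 'smartphone'], word2idx)
print(safe)
```

Silently dropping unknown words changes the prompt, so a caller may prefer to warn the user or fall back to a default when the filtered list comes back empty.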


In [133]:
# source: https://colab.research.google.com/github/googlecolab/colabtools/blob/master/notebooks/colab-github-demo.ipynb
# Save the entire model as a SavedModel.
!mkdir -p saved_model
model.save('saved_model/model_10_epochs') 

# List the saved models
!ls saved_model

# The SavedModel directory contains an assets folder, saved_model.pb, and a variables folder.
!ls saved_model/model_10_epochs

INFO:tensorflow:Assets written to: saved_model/model_10_epochs/assets



model_10_epochs  my_model
assets	saved_model.pb	variables


In [0]:
# for mounting drive 
#from google.colab import drive
#drive.mount('/content/drive')

In [0]:
# to save the model in Drive so you can download it
#model_save_name = 'model_10_epochs'
#path = F"/content/drive/My Drive/{model_save_name}" 
#model.save(path)