<a href="https://colab.research.google.com/github/vtkachuk4/next_word_predictor/blob/master/Next_word_Predictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Overview

This notebook predicts the next word of a sentence using a pre-trained RNN model that uses a Universal Sentence Encoder for word embeddings.

The model was trained on Shakespeare text, so its next-word predictions are in the style of Shakespeare :)

The accompanying model and training code are in the notebook word_based_word_predictor_model in the GitHub repository: https://github.com/vtkachuk4/next_word_predictor.git

After training, the model was saved, downloaded, and then uploaded to the same GitHub repository for later use.

The source code for this model is largely based on the sample code provided here: https://colab.research.google.com/github/tensorflow/docs/blob/master/site/en/tutorials/text/text_generation.ipynb#scrollTo=WvuwZBX5Ogfd

The char-based predictor in the sample code was changed to a word-based predictor and trained using a pre-trained Universal Sentence Encoder (https://tfhub.dev/google/universal-sentence-encoder/4) for word embeddings.

All sources are cited in comments above any borrowed code.

# Setup




In [0]:
import tensorflow as tf

import numpy as np
import os
import time

In [2]:
path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')

Downloading data from https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt


In [0]:
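# Read the file as bytes, then decode to str (kept for py2 compat).
# The with-block closes the file handle once the text has been read.
with open(path_to_file, 'rb') as f:
    text = f.read().decode(encoding='utf-8')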

In [0]:
# source: https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/
# used for cleaning up the Shakespeare text
import string

# turn a doc into clean tokens
def clean_doc(doc):
    # replace '--' with a space ' '
    doc = doc.replace('--', ' ')
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in tokens]
    # remove remaining tokens that are not alphabetic
    tokens = [word for word in tokens if word.isalpha()]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens
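
As a quick sanity check of the cleaning steps, here is a minimal sketch that applies the same logic to a made-up sentence (the sample string is an assumption, not taken from the Shakespeare file). The function is repeated in condensed form so the sketch runs on its own. Note that stripping punctuation also removes apostrophes, so contractions such as "o'er" become "oer".

```python
import string

def clean_doc(doc):
    # Condensed copy of the cleaning steps: split on whitespace, strip
    # punctuation, drop non-alphabetic tokens, lower-case the rest.
    doc = doc.replace('--', ' ')
    table = str.maketrans('', '', string.punctuation)
    tokens = [w.translate(table) for w in doc.split()]
    return [w.lower() for w in tokens if w.isalpha()]

# Hypothetical sample sentence, chosen to exercise each cleaning step.
sample = "O'er the hills, 'tis 3 o'clock--away!"
sample_tokens = clean_doc(sample)
print(sample_tokens)
# ['oer', 'the', 'hills', 'tis', 'oclock', 'away']
```

The digit token "3" is dropped by the `isalpha()` filter, and "--" is treated as a word separator.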

In [5]:
# source: https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/
# Double check the text is actually clean
tokens = clean_doc(text)
unique_words = sorted(set(tokens))
print(tokens[:20])
print('Total Tokens: %d' % len(tokens))
print('Unique Tokens: %d' % len(unique_words))

['first', 'citizen', 'before', 'we', 'proceed', 'any', 'further', 'hear', 'me', 'speak', 'all', 'speak', 'speak', 'first', 'citizen', 'you', 'are', 'all', 'resolved', 'rather']
Total Tokens: 202820
Unique Tokens: 12669


In [0]:
# This is done only to allow easy integration with the previous sample code
vocab = unique_words
text = tokens

In [0]:
# Creating a mapping from unique words to indices
word2idx = {u:i for i, u in enumerate(vocab)}
idx2word = np.array(vocab)

text_as_int = np.array([word2idx[c] for c in text])
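
To illustrate the word-to-index mapping, here is a small sketch on a toy token list (the `toy_` names and the six-word sample are assumptions for illustration only). It uses plain Python lists instead of NumPy arrays so that it runs without extra dependencies, and shows that encoding followed by decoding recovers the original tokens.

```python
# Hypothetical toy corpus; the notebook uses the cleaned Shakespeare tokens.
toy_tokens = ['to', 'be', 'or', 'not', 'to', 'be']

# Sorted unique words define the vocabulary, as in the cell above.
toy_vocab = sorted(set(toy_tokens))          # ['be', 'not', 'or', 'to']
toy_word2idx = {word: i for i, word in enumerate(toy_vocab)}
toy_idx2word = list(toy_vocab)

# Encode words as integer ids, then decode them back.
encoded = [toy_word2idx[w] for w in toy_tokens]
decoded = [toy_idx2word[i] for i in encoded]

print(encoded)  # [3, 0, 2, 1, 3, 0]
print(decoded)  # ['to', 'be', 'or', 'not', 'to', 'be']
```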

In [14]:
# load the model saved in the Github repo
!git clone https://github.com/vtkachuk4/next_word_predictor.git

next_word_predictor_model = tf.keras.models.load_model('next_word_predictor/models/model_10_epochs')

# Check its architecture
next_word_predictor_model.summary()

fatal: destination path 'next_word_predictor' already exists and is not an empty directory.
Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (1, None, 512)            6486528   
_________________________________________________________________
gru_3 (GRU)                  (1, None, 1024)           4724736   
_________________________________________________________________
dense_3 (Dense)              (1, None, 12669)          12985725  
Total params: 24,196,989
Trainable params: 24,196,989
Non-trainable params: 0
_________________________________________________________________
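
The parameter counts in the summary can be reproduced by hand. The sketch below takes the layer sizes from the output shapes above and assumes the GRU keeps two bias vectors per gate (the Keras `reset_after=True` behaviour); that assumption is what makes the GRU count match the reported 4,724,736.

```python
vocab_size = 12669      # unique tokens, also the Dense output width
embedding_dim = 512     # Embedding output width
rnn_units = 1024        # GRU output width

# Embedding: one vector of size embedding_dim per vocabulary word.
embedding_params = vocab_size * embedding_dim

# GRU: three gates, each with an input kernel, a recurrent kernel,
# and (assumed) two bias vectors.
gru_params = 3 * (embedding_dim * rnn_units
                  + rnn_units * rnn_units
                  + 2 * rnn_units)

# Dense: weights plus one bias per output unit.
dense_params = (rnn_units + 1) * vocab_size

total_params = embedding_params + gru_params + dense_params
print(embedding_params, gru_params, dense_params, total_params)
# 6486528 4724736 12985725 24196989
```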


In [0]:
def generate_text(model, start_tokens):
  # Evaluation step (generating text using the learned model)

  # Number of words to generate
  num_generate = 1

  # Converting our start tokens to vocabulary indices (vectorizing)
  input_eval = [word2idx[s] for s in start_tokens]
  input_eval = tf.expand_dims(input_eval, 0)

  # Empty list to store our results
  text_generated = []

  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  # Experiment to find the best setting.
  temperature = 1.0

  # Here batch size == 1
  model.reset_states()
  for i in range(num_generate):
      predictions = model(input_eval)
      # remove the batch dimension
      predictions = tf.squeeze(predictions, 0)

      # using a categorical distribution to predict the word returned by the model
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()

      # We pass the predicted word as the next input to the model
      # along with the previous hidden state
      input_eval = tf.expand_dims([predicted_id], 0)

      text_generated.append(idx2word[predicted_id])

  return ' '.join(text_generated)
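
To see why dividing the predictions by `temperature` changes how predictable the sampled word is, the following sketch applies a softmax to a few made-up scores at different temperatures. The helper `softmax_with_temperature` and the scores are hypothetical, and the sketch assumes the model outputs unnormalized scores (logits), as in the tutorial this notebook is based on. It uses only the standard library and does not call TensorFlow.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    # Scale the scores, then apply a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a three-word vocabulary.
toy_logits = [2.0, 1.0, 0.0]

probs_cold = softmax_with_temperature(toy_logits, 0.5)
probs_mid = softmax_with_temperature(toy_logits, 1.0)
probs_hot = softmax_with_temperature(toy_logits, 2.0)

for name, probs in [('0.5', probs_cold), ('1.0', probs_mid), ('2.0', probs_hot)]:
    print(name, [round(p, 3) for p in probs])
```

At a lower temperature the most likely word takes a larger share of the probability mass, so sampling is more predictable; at a higher temperature the distribution flattens and less likely words are drawn more often.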

# Next-word Generation

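Enter any start sentence (start_words) that is at least one word long. Each word must be lowercase and appear in the training vocabulary, because generate_text looks every word up in word2idx.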
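
Because every start word is looked up in the vocabulary, a word the model has never seen raises a `KeyError`. One possible way to guard against that is sketched below; `prepare_start_words` is a hypothetical helper that is not part of the original notebook, and the small vocabulary is made up for the example. It applies the same cleaning as `clean_doc` and reports unknown words up front.

```python
import string

def prepare_start_words(sentence, known_words):
    # Hypothetical helper: clean a raw sentence the same way the training
    # text was cleaned, then check each word against the vocabulary.
    table = str.maketrans('', '', string.punctuation)
    words = [w.translate(table).lower()
             for w in sentence.replace('--', ' ').split()]
    words = [w for w in words if w.isalpha()]
    if not words:
        raise ValueError('Enter at least one word.')
    unknown = [w for w in words if w not in known_words]
    if unknown:
        raise ValueError(f'Words not in the vocabulary: {unknown}')
    return words

# Made-up vocabulary for illustration only.
toy_vocab = {'romeo', 'oh', 'art', 'thou'}
print(prepare_start_words('Romeo, oh art thou!', toy_vocab))
# ['romeo', 'oh', 'art', 'thou']
```

In the notebook, `word2idx` could be passed as `known_words`, since the membership test works on dictionary keys.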

In [0]:
start_words = ['romeo', 'oh', 'art', 'thou']

The following cell prints the predicted word and the full sentence including the predicted word.

In [31]:
next_word_prediction = generate_text(next_word_predictor_model, start_tokens=start_words)
print(f'Next-word prediction: {next_word_prediction}')
print(f'Full sentence with predicted word: {" ".join(start_words) + " " + next_word_prediction}')

Next-word prediction: time
Full sentence with predicted word: romeo oh art thou time
