# **Phrase Generator Model**

This notebook trains a model using Keras (with a TensorFlow backend) to finish your phrase in the style of author/poet William Shakespeare.

Below we do the following:
1. Set up the training environment.
2. Load and clean the Shakespeare text samples.
3. Train a word-level neural network language model.
4. Convert the model to CoreML format.
5. Deliver the model to an app using Skafos.


In [0]:
# First, let's install the tools and dependencies we need. 
!pip install keras skafos coremltools

In [0]:
# Import tools and libraries
import os
import re
import zipfile
import urllib.request

import skafos
from skafos import models
import numpy as np
from keras.optimizers import RMSprop
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical

In [0]:
# Check the skafos version
skafos.get_version()

## Data Preparation
The training data for this example are text samples from several works by William Shakespeare.

The code below does the following:

- Downloads the data from a public S3 bucket provided by Skafos.
- Defines some helper functions to parse the text.
- Tokenizes the text.
- Prepares the training data, building input sequences for the model.
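
The last step above, building input sequences, amounts to sliding a fixed-size window over the token stream and using the word that follows each window as its label. Here is a minimal standard-library sketch of that idea. It assumes a toy whitespace tokenizer; `build_vocab` and `make_windows` are illustrative names rather than Keras functions, and `window` and `stride` stand in for the notebook's `maxlen` and `step`. Note that the Keras `Tokenizer` orders indices by word frequency, whereas this sketch simply numbers words in order of first appearance.

```python
def build_vocab(words):
    """Assign each distinct word an index starting at 1 (0 is left free for padding)."""
    vocab = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab) + 1
    return vocab


def make_windows(tokens, window, stride):
    """Return (inputs, labels): each input holds `window` tokens, its label is the token that follows."""
    inputs, labels = [], []
    for start in range(0, len(tokens) - window, stride):
        inputs.append(tokens[start:start + window])
        labels.append(tokens[start + window])
    return inputs, labels


# Toy corpus (illustrative only)
words = "to be or not to be that is the question".split()
vocab = build_vocab(words)
tokens = [vocab[w] for w in words]

inputs, labels = make_windows(tokens, window=3, stride=2)
for seq, label in zip(inputs, labels):
    print(seq, "->", label)
```

A larger stride yields fewer, less overlapping training samples; a stride of 1 would use every position in the corpus.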

In [0]:
# Specify the data set download url
data_path = "Shakespeare.zip"
data_url = "https://s3.amazonaws.com/skafos.example.data/PhraseGenModel/{}".format(data_path)

# Download the dataset
retrieve = urllib.request.urlretrieve(data_url, data_path)

# Unzip
with zipfile.ZipFile(data_path, 'r') as zip_ref:
    zip_ref.extractall()


In [0]:
# Helper functions
# Remove stage directions and comments (anything enclosed in angle brackets)
def remove_stage_dir(text):
    text = re.sub(r"<.*?>", "", text)
    text = re.sub(r"\s+", " ", text)
    return text

# Remove the word "SPEECH" and the number that follows it in the corpus
def remove_SPEECH(text):
    text = re.sub(r"SPEECH \d+", "", text)
    text = re.sub(r"\s+", " ", text)
    return text


In [0]:
# Read in Shakespeare files

in_sentences = []

for filename in os.listdir():
    if filename.endswith(".txt"):
        # Read the whole file; the context manager closes it afterwards
        with open(filename, encoding="utf-8-sig", mode="r") as f:
            text = f.read()
        # Chop up into sentences
        split_text = re.split(r' *[\.\?!][\'"\)\]]* *', remove_stage_dir(text))
        for chunk in split_text:
            in_sentences.append(chunk.strip())

print(in_sentences[0:10])

In [0]:
# Some constants
# Length of extracted text sample
maxlen = 10
# Stride of sampling
step = 2
# This holds our sample sequences
sentences = []
# This holds the next word (as training label)
next_word = []

In [0]:
# Prepare the Tokenizer
tokenizer = Tokenizer()
tokenizer.fit_on_texts(list(in_sentences))
list_tokenized_train = tokenizer.texts_to_sequences(list(in_sentences))

In [0]:
# Get vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print(f'{vocab_size} total unique words in our training data corpus', flush=True)

In [0]:
# Stick the encoded words back together as one long sequence
token_word = [token for sentence in list_tokenized_train for token in sentence]

# Sample from the sequence
for i in range(0, len(token_word) - maxlen, step):
    sentences.append(token_word[i: i + maxlen])
    next_word.append(token_word[i + maxlen])
print('Number of sentences:', len(sentences))

In [0]:
# Prepare the training data sequences
x = np.asarray(sentences)
y = to_categorical(next_word, num_classes=vocab_size)
seq_length = x.shape[1]

In [0]:
# Do some garbage collection
del(sentences, in_sentences, next_word, token_word)

## Model Training
The phrase generation model takes sequences of tokenized text as input and tries to predict the most likely next word from the vocabulary. You can create phrases by repeatedly feeding previous predictions back into the model, adding a single word at a time to the phrase, almost like a "digital Shakespeare". The Keras model uses three layer types in the neural network: Embedding, LSTM, and Dense. Links to relevant documentation are provided in the cell below.
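
The feed-back loop described above can be sketched without any trained network. In the minimal example below, `toy_predict_probs` is a hypothetical stand-in for the model's prediction step (it is not part of Keras), and the loop always picks the highest-probability token, which is one simple decoding choice among several.

```python
def greedy_generate(seed_tokens, n_new, window, predict_probs):
    """Append n_new tokens, each chosen as the argmax of the predicted distribution."""
    tokens = list(seed_tokens)
    for _ in range(n_new):
        context = tokens[-window:]  # keep only the most recent `window` tokens
        probs = predict_probs(context)
        next_token = max(range(len(probs)), key=probs.__getitem__)
        tokens.append(next_token)  # the prediction becomes part of the next input
    return tokens


def toy_predict_probs(context):
    """Hypothetical stand-in for a trained model: favors (last token + 1) mod 5."""
    vocab_size = 5
    probs = [0.05] * vocab_size
    probs[(context[-1] + 1) % vocab_size] = 0.8
    return probs


generated = greedy_generate([1, 2], n_new=4, window=2, predict_probs=toy_predict_probs)
print(generated)
```

With a real model, `predict_probs` would return the softmax output over the whole vocabulary, and each generated index would be mapped back to a word.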

In [0]:
# Create the model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=256, input_length=seq_length))  # Docs: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding
model.add(LSTM(units=256))                                                           # Docs: https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM
model.add(Dense(vocab_size, activation='softmax'))                                   # Docs: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense
model.summary()  # prints the summary itself and returns None, so no print() is needed

# Compile the model
# Since our labels are one-hot encoded, use `categorical_crossentropy` as the loss
optimizer = RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy']) # keep track of accuracy along the way

In [0]:
# Train the model for a few epochs
model.fit(x, y, batch_size=256, epochs=15)


In [0]:
# Pick up training from where you last left off
# With initial_epoch=15 and epochs=20, training resumes at epoch 16 and stops after epoch 20 (5 more epochs)
model.fit(x, y, batch_size=256, initial_epoch=15, epochs=20)

## Model Validation
Below we invert and export the tokenizer's word index so we can look up a word by its index. Then we test the newly trained model with some sample text.
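
One detail to keep in mind when exporting the inverted map: JSON object keys are always strings, so integer indices come back as `"1"`, `"2"`, and so on when the file is read again. The sketch below shows the round trip with a small hypothetical `word_index` (a stand-in for `tokenizer.word_index`) and one way to restore integer keys on the consuming side.

```python
import json

# Hypothetical stand-in for tokenizer.word_index
word_index = {"the": 1, "king": 2, "crown": 3}

# Invert the mapping: index -> word
index_word = {index: word for word, index in word_index.items()}

# Round-trip through JSON, as happens when the lookup file is saved and reloaded
restored = json.loads(json.dumps(index_word))
print(restored)  # the keys are now strings

# Convert keys back to integers before looking up predicted indices
restored_int_keys = {int(index): word for index, word in restored.items()}
print(restored_int_keys[2])
```

Whichever side reads the lookup file needs to either convert the keys as above or look words up by the string form of the predicted index.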

In [0]:
import json

# Invert the tokenizer map so we can look up a word by its index
index_word_lookup = dict(map(reversed, tokenizer.word_index.items()))
index_word_lookup_file = 'index_word_lookup.json'

# Save it to a JSON file
with open(index_word_lookup_file, 'w') as fp:
    json.dump(index_word_lookup, fp)

In [0]:
from keras.preprocessing.sequence import pad_sequences

# Function to generate new text based on the input
def generate_text(seed_text, next_words, max_sequence_len, model):
    for j in range(next_words):
        token_list = pad_sequences(
            sequences=tokenizer.texts_to_sequences([seed_text]),
            maxlen=max_sequence_len,
            padding='pre'
        )
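        # Take the argmax of the predicted probabilities (predict_classes is not available in newer Keras releases)
        predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)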
        # Generate the output word
        seed_text += " " + index_word_lookup[predicted[0]]
    return seed_text

In [0]:
# Test out the language model by passing in some seed text and the number of words
generate_text("You shall go see", 3, maxlen, model)

## Deliver your model to an iOS App with Skafos

As a final step, we convert the model from a Keras object to CoreML format so it can run on mobile devices. After conversion, we use the [Skafos SDK](https://sdk.skafos.ai) to upload it to Skafos!

Before running the steps below, you will need to:

- [Sign up for a Skafos account](https://dashboard.skafos.ai/sign-up) if you haven't already.
- Navigate to the [account settings page on Skafos](https://dashboard.skafos.ai/settings/account) to get an API token.

In [0]:
import coremltools

# Convert the language model to Core ML format
model_name = "PhraseGenModel"
coreml_model_name = model_name + ".mlmodel"
coreml_model = coremltools.converters.keras.convert(
    model,
    input_names=['tokenizedInputSeq'],
    output_names=['tokenProbs']
)

# Add description information (if you want) and save the file
coreml_model.short_description = 'Predicts the most likely next word given a string of text'
coreml_model.input_description['tokenizedInputSeq'] = 'An array of tokenized text'
coreml_model.output_description['tokenProbs'] = 'An array of token probabilities across the entire vocabulary'
coreml_model.save(coreml_model_name)

In [0]:
# Skafos SDK Upload Model Version 

# Set your API Token first for repeated use
os.environ["SKAFOS_API_TOKEN"] = "<YOUR-SKAFOS-API-TOKEN>"

In [0]:
# Get a summary of your existing apps and models on Skafos, so you can determine where to deliver this model.
res = skafos.summary()
print(res)

In [0]:
# You can retrieve this info with skafos.summary()
org_name = "<YOUR-SKAFOS-ORG-NAME>"    # Example: "mike-gmail-com-467h2"
app_name = "<YOUR-SKAFOS-APP-NAME>"    # Example: "PhraseGenerator"
model_name = "<YOUR-SKAFOS-MODEL-NAME>"       # Example: "PhraseGenModel"

# Upload model version to Skafos
model_upload_result = models.upload_version(
    files = [coreml_model_name, index_word_lookup_file],
    description = "Shakespeare model",
    org_name = org_name,
    app_name = app_name,
    model_name = model_name
)