# TensorFlow. Word Generation with LSTM (new model)

Word generation in NLP involves creating meaningful text. Google TensorFlow, an open-source machine learning framework, offers tools to build and train models for this. This notebook covers the basics, capabilities, applications, and steps to develop word generation models using TensorFlow.

![](./assets/cover.jpg)

Blog Post: [TensorFlow. Word Generation with LSTM](https://vitalyzhukov.com/en/tensorflow-word-generation-with-lstm)

# Prerequisites

To work with TensorFlow models, you will need two open-source Python libraries: NumPy (Numerical Python) and TensorFlow. Matplotlib is installed as well; it is used later to plot the predictions.

In [None]:
!pip install tensorflow
!pip install numpy
!pip install matplotlib

Let's import the required packages:

In [None]:
# Import TensorFlow
import tensorflow as tf

# Import numpy
import numpy as np

## Dataset  

To train the model, we must supply both input data and the target output. The code below creates these data sets.

1. Download the dictionary

In [None]:
# Download dictionary
path_to_dict = tf.keras.utils.get_file(
    "popular.txt", "https://raw.githubusercontent.com/dolph/dictionary/master/popular.txt"
)

# Read the dictionary as UTF-8 text (avoid shadowing the built-in dict)
with open(path_to_dict, encoding="utf-8") as dict_file:
    dict_text = dict_file.read()

# The dictionary contains one word per line; split the text to get a list of words
words = dict_text.splitlines()

2. Tokenize the characters

In [None]:
# Create a new instance of the tokenizer
tokenizer = tf.keras.preprocessing.text.Tokenizer(char_level=True)

# Create tokens for each character in the words list
tokenizer.fit_on_texts(words)

# Add meta-token representing end of word
tokenizer.word_index["<END>"] = len(tokenizer.word_index) + 1
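
As a rough illustration of what character-level tokenization produces, the sketch below builds a similar character-to-index mapping in plain Python on a toy word list. The helper `build_char_index` and the alphabetical tie-breaking are assumptions made for this example; the Keras tokenizer may order equally frequent characters differently.

```python
from collections import Counter

def build_char_index(words):
    """Map each character to an integer, most frequent first, starting at 1."""
    counts = Counter(ch for word in words for ch in word.lower())
    # Index 0 is left unused so it can serve as the padding value.
    # Ties are broken alphabetically to keep the toy output deterministic.
    ordered = sorted(counts, key=lambda ch: (-counts[ch], ch))
    return {ch: i for i, ch in enumerate(ordered, start=1)}

toy_words = ["cat", "act", "tea"]
char_index = build_char_index(toy_words)

# Reserve one more index for the end-of-word marker.
char_index["<END>"] = len(char_index) + 1

# Encode each word as a list of character indices.
encoded = [[char_index[ch] for ch in word] for word in toy_words]

print(char_index)  # {'a': 1, 't': 2, 'c': 3, 'e': 4, '<END>': 5}
print(encoded)     # [[3, 1, 2], [1, 3, 2], [2, 4, 1]]
```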

3. Generate data sets

In [None]:
# Size of vocabulary
vocab_size = len(tokenizer.word_index) + 1

# Length of the longest word, plus one for the end-of-word token
max_length = max(map(len, words)) + 1

# Words encoded as lists of character indices
encoded_words = tokenizer.texts_to_sequences(words)

meta_token = tokenizer.word_index["<END>"]

# Training sequences: each one is a word prefix whose last token is the target
input_data = []

for seq in encoded_words:
    # Append the end of word token
    alt_seq = seq + [meta_token]
    for i in range(1, len(alt_seq)):
        input_data.append(alt_seq[:i + 1])

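# Left-pad the sequences with zeros so they all have the same length
input_data = np.array(tf.keras.utils.pad_sequences(input_data, maxlen=max_length, padding='pre'))

# Split each sequence into the input prefix and the target token (the last column).
# Despite its name, validate_data holds the training targets, not a validation set.
input_data, validate_data = input_data[:,:-1], input_data[:,-1]
validate_data = tf.keras.utils.to_categorical(validate_data, num_classes=vocab_size)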
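
To make the shape of the training data concrete, here is a small NumPy-only sketch that applies the same prefix, padding, and split steps to a single toy word. The token values, `END`, and the sizes are made up for the example and are not taken from the real dictionary.

```python
import numpy as np

END = 4          # hypothetical index of the end-of-word token
vocab_size = 5   # indices 0..4, where 0 is the padding value
max_length = 4   # longest toy word (3 characters) + 1 for END

seq = [3, 1, 2] + [END]  # toy encoding of one word plus END

# Every prefix with at least two tokens; the last token of each is the target.
prefixes = [seq[:i + 1] for i in range(1, len(seq))]

# Left-pad with zeros to a common length.
padded = np.array([[0] * (max_length - len(p)) + p for p in prefixes])

# Inputs are all columns but the last; targets are the last column.
inputs, targets = padded[:, :-1], padded[:, -1]
one_hot = np.eye(vocab_size)[targets]

print(inputs)   # [[0 0 3] [0 3 1] [3 1 2]]
print(targets)  # [1 2 4]
```

With this range, the first character of a word never appears as a target; the model only learns to continue a prefix of at least one character.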


# Model

The following code creates a new sequential model with five layers: an embedding layer, three stacked LSTM layers, and a dense softmax output layer.

In [None]:
model = tf.keras.Sequential()

# Input length must match the training data: max_length - 1 tokens per prefix
model.add(tf.keras.Input(shape=(max_length - 1,), name="input"))
model.add(tf.keras.layers.Embedding(vocab_size, 16))
model.add(tf.keras.layers.LSTM(128, return_sequences=True))
model.add(tf.keras.layers.LSTM(128, return_sequences=True))
model.add(tf.keras.layers.LSTM(128))
model.add(tf.keras.layers.Dense(vocab_size, activation="softmax"))

model.compile(
    optimizer=tf.keras.optimizers.Adam(),
    loss=tf.keras.losses.CategoricalCrossentropy(),
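    # CategoricalAccuracy compares the most probable class with the one-hot target;
    # Accuracy would test raw softmax values for exact equality
    metrics=[tf.keras.metrics.CategoricalAccuracy()]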
)
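
For a sense of the model's size, the sketch below estimates the number of trainable parameters per layer with the standard formulas, assuming default layer settings (biases enabled). The `vocab_size` value is hypothetical; with the real value, the totals should line up with what `model.summary()` reports.

```python
def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * ((input_dim + units) * units + units)

vocab_size = 30      # hypothetical; the real value depends on the dictionary
embedding_dim = 16
units = 128

params = {
    "embedding": vocab_size * embedding_dim,
    "lstm_1": lstm_params(embedding_dim, units),
    "lstm_2": lstm_params(units, units),
    "lstm_3": lstm_params(units, units),
    "dense": units * vocab_size + vocab_size,
}
total = sum(params.values())

print(params)
print(total)  # 341758 for vocab_size = 30
```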

## Model Training
Now we can train the model by specifying the batch size and the number of epochs (full passes over the training data):

In [None]:
model.fit(input_data, validate_data, batch_size=512, epochs=10)
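
Training reports a loss and an accuracy for each epoch. As a rough sketch of what categorical cross-entropy and a class-matching accuracy measure for one-hot targets, the NumPy snippet below computes both by hand. All values are made up for illustration, and Keras additionally clips probabilities for numerical stability.

```python
import numpy as np

# Toy softmax outputs for three samples over four classes.
y_pred = np.array([
    [0.10, 0.70, 0.10, 0.10],
    [0.30, 0.20, 0.40, 0.10],
    [0.25, 0.25, 0.25, 0.25],
])
# One-hot targets for classes 1, 0, and 3.
y_true = np.eye(4)[[1, 0, 3]]

# Categorical cross-entropy: mean negative log-probability of the true class.
loss = -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Categorical accuracy: share of samples whose most probable class is the true one.
accuracy = np.mean(np.argmax(y_pred, axis=1) == np.argmax(y_true, axis=1))

print(round(float(loss), 4), float(accuracy))
```

Only the first sample is classified correctly here, so the accuracy is one third even though every predicted row is a valid probability distribution.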

## Prediction

The show_probability_bar function draws a bar chart of the predicted probability distribution for the next character.

In [None]:
# Import Matplotlib for Visualization purpose
import matplotlib.pyplot as plt

def show_probability_bar(prefix):
    meta_token = tokenizer.word_index["<END>"]  
    bar_chars = []
    bar_vals = []

    encoded = tokenizer.texts_to_sequences([prefix])
    # Pad to the input length the model was built with
    encoded = tf.keras.utils.pad_sequences(encoded, maxlen=model.input_shape[1], padding="pre")
    predicted_characters = np.asarray(model.predict(encoded, verbose=0, batch_size=1)[0]).astype('float64')

    for char, index in tokenizer.word_index.items():
        val = round(predicted_characters[index], 2)
        # Print the probability of the end-of-word token (compare indices, not the key string)
        if index == meta_token:
            print(val)
        if val > 0.0: # skip characters whose probability rounds to zero
            bar_chars.append(char.upper())
            bar_vals.append(val)

    plt.bar(bar_chars, bar_vals)
    plt.xlabel("Character")
    plt.ylabel("Probability")
    plt.title("The next character after \"" + prefix + "\" and its probability") 
    plt.show()

Display the probabilities for the input "mic":

In [None]:
show_probability_bar("mic")

The predict_next_character function returns the token index of the next character. When crazy_index is 0 or None, it returns the most probable token; otherwise it picks one at random from the crazy_index most probable tokens.

In [None]:
def predict_next_character(prefix, crazy_index:int):
    """Predict the next character

    :param prefix: Existing part of the word
    :param crazy_index: Number of most probable characters to choose from at random; 0 or None selects the most probable one.
    :return: Token index of the predicted character
    """
    encoded = tokenizer.texts_to_sequences([prefix])
    # Pad to the input length the model was built with
    encoded = tf.keras.utils.pad_sequences(encoded, maxlen=model.input_shape[1], padding="pre")
    predicted_characters = np.asarray(model.predict(encoded, verbose=0, batch_size=1)[0]).astype('float64')

    if crazy_index is None or crazy_index == 0:
        return np.argmax(predicted_characters)

    crazy_index = min(crazy_index, len(predicted_characters))

    # indices of the top {crazy_index} most probable characters
    candidate_args = np.argsort(predicted_characters, axis=0)[-crazy_index:]
    probas = np.take(predicted_characters, candidate_args)

    # flatten the differences between candidates, then normalize into sampling weights
    weights = np.exp(np.arctan(probas))
    weights = weights / np.sum(weights)

    # randomly pick one of the candidates according to the weights
    draw = np.random.multinomial(1, weights, 1)

    return candidate_args[np.argmax(draw)]
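
The sampling step can be tried in isolation. The sketch below applies the same idea (keep the top k candidates, weight them by `exp(arctan(p))`, draw one) to a made-up probability vector, without the model. The function name `sample_top_k` and the seeded generator are assumptions added for the example.

```python
import numpy as np

def sample_top_k(probabilities, k, rng):
    """Pick one index among the k most probable, using flattened weights."""
    probabilities = np.asarray(probabilities, dtype="float64")
    k = min(k, len(probabilities))
    # argsort is ascending, so the last k entries are the most probable.
    candidates = np.argsort(probabilities)[-k:]
    # exp(arctan(p)) compresses the gaps between probabilities, so the
    # less likely candidates are drawn more often than p alone suggests.
    weights = np.exp(np.arctan(probabilities[candidates]))
    weights /= weights.sum()
    return int(rng.choice(candidates, p=weights)), weights

rng = np.random.default_rng(0)  # seeded only to make the sketch repeatable
toy = [0.05, 0.60, 0.05, 0.30]
index, weights = sample_top_k(toy, 2, rng)

print(index)    # either 1 or 3
print(weights)  # roughly [0.44, 0.56] for probabilities 0.30 and 0.60
```

The original probabilities differ by a factor of two, while the sampling weights are close to even, which is what makes larger crazy_index values produce more unusual words.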

# Word Generation

To generate the entire word, we'll create a generate_words function that will call predict_next_character repeatedly.

In [None]:
def generate_words(prefix, no_words:int, crazy_index:int):
    """Generate words

    :param prefix: Existing part of the word
    :param no_words: Number of words to generate
    :param crazy_index: Number of most probable characters to choose from at each step.
    :return: List of generated words
    """
    max_text_length = 20
    meta_token = tokenizer.word_index["<END>"]
    words = []

    for _ in range(no_words):
        word_prefix = prefix
        for _ in range(max_text_length):
            predicted_character = predict_next_character(word_prefix, crazy_index)

            # stop if the next token is the meta token representing the end of the word
            if predicted_character == meta_token:
                break

            # convert the token index to a character
            predicted_char = tokenizer.sequences_to_texts([[predicted_character]])

            # append the character to the result word
            word_prefix = word_prefix + predicted_char[0]

        words.append(word_prefix)

    return words

In [None]:
generate_words("", 5, 2)
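
The control flow of the generation loop can be checked without a trained model. In the sketch below, two hypothetical stand-in predictors replace the network: one spells a fixed word and then emits the end marker, the other never stops, which shows the length cap at work.

```python
END = "<END>"

def generate_word(prefix, next_char, max_length=20):
    """Extend prefix one character at a time until END or max_length steps."""
    word = prefix
    for _ in range(max_length):
        ch = next_char(word)
        if ch == END:
            break
        word += ch
    return word

# Hypothetical stand-in for the model: spells "micro" and then stops.
def scripted_predictor(word, target="micro"):
    return target[len(word)] if len(word) < len(target) else END

# A predictor that never emits END, to exercise the length cap.
def endless_predictor(word):
    return "a"

print(generate_word("mi", scripted_predictor))                # micro
print(generate_word("", endless_predictor, max_length=5))     # aaaaa
```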

## Save the Model for Future Use

In [None]:
model.save("new_model_word_generation.keras")
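
The saved file can later be reloaded with `tf.keras.models.load_model`. The character-to-index mapping lives in the tokenizer rather than in the model, so it is worth storing alongside it. A minimal sketch, assuming a JSON file is acceptable; the file name and the toy mapping are placeholders for `tokenizer.word_index`:

```python
import json
import os
import tempfile

# Toy mapping standing in for tokenizer.word_index.
word_index = {"a": 1, "t": 2, "c": 3, "<END>": 4}

# Hypothetical file name; a temporary directory keeps the sketch side-effect free.
path = os.path.join(tempfile.mkdtemp(), "word_index.json")

# Save the mapping next to the model file.
with open(path, "w", encoding="utf-8") as f:
    json.dump(word_index, f)

# Load it back when the model is reused.
with open(path, encoding="utf-8") as f:
    restored = json.load(f)

print(restored == word_index)  # True
```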