# Language Model: Bi-LSTM (Keras)
### Written by: Rodrigo Escandon

# Executive Summary

A natural language processing model was developed with machine learning to estimate the probability of the word that follows a given sequence of words. The model was trained on nursery rhymes, with the goal of predicting the next word in a rhyme. It was built in Python using Keras (TensorFlow backend) and NumPy to structure and analyze the data set.

## Model Performance

The model predicted the next word with 100% accuracy for the two nursery rhymes provided, which are the same rhymes it was trained on.
The model runs on the TensorFlow backend and uses a bidirectional LSTM architecture to process the text. Single predictors were created to generate both complete nursery rhymes and parts of them, and the model reproduced both correctly. Note that the single predictor seeded with the final verse deliberately requests more words than remain in that verse. This shows that the model can detect the end of the rhyme and stop generating words: it predicts the padding index, which maps to no word.


In [1]:
import numpy as np
from numpy import array
import keras
import tensorflow as tf
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical, pad_sequences
#from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from keras.layers import Embedding

In [2]:
# Generating a sequence from a language model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
    in_text = seed_text
    # generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integers
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # keep the last max_length tokens and post-pad shorter inputs with zeros
        encoded = pad_sequences([encoded], maxlen=max_length, padding='post')
        # predict probabilities for each word and take the most likely index
        # (replaces the older model.predict_classes call)
        result = np.argmax(model.predict(encoded, verbose=0))
        # map the predicted index to its word; index 0 (padding) matches no word,
        # so nothing is appended once the end of the rhyme is reached
        for word, index in tokenizer.word_index.items():
            if index == result:
                in_text = in_text + ' ' + word
                break
    return in_text
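The function above keeps looping even after the model predicts the padding index, because index 0 simply matches no word. A minimal sketch of a variant that stops explicitly is shown below. The names `generate_until_pad`, `predict_fn`, `toy_index` and `toy_predict` are hypothetical and not part of the notebook; `toy_predict` is a stand-in for `model.predict` so the sketch runs without a trained model.

```python
import numpy as np

def generate_until_pad(predict_fn, word_index, max_length, seed_text, n_words):
    """Greedy generation that stops at the padding index (hypothetical helper)."""
    index_word = {i: w for w, i in word_index.items()}
    words = [w for w in seed_text.lower().split() if w in word_index]
    out_text = seed_text
    for _ in range(n_words):
        # keep the last max_length ids and post-pad with zeros, as in generate_seq
        ids = [word_index[w] for w in words][-max_length:]
        ids = ids + [0] * (max_length - len(ids))
        probs = predict_fn(np.array([ids]))
        result = int(np.argmax(probs))
        if result == 0:  # padding index: nothing left to generate
            break
        words.append(index_word[result])
        out_text += ' ' + index_word[result]
    return out_text

# Toy stand-in for model.predict (assumption): always predicts "previous id + 1",
# and the padding index 0 after the last word.
toy_index = {'and': 1, 'jill': 2, 'came': 3, 'tumbling': 4, 'after': 5}

def toy_predict(batch):
    probs = np.zeros((1, len(toy_index) + 1))
    last = [i for i in batch[0] if i != 0][-1]
    probs[0, last + 1 if last < len(toy_index) else 0] = 1.0
    return probs

print(generate_until_pad(toy_predict, toy_index, 4, 'And Jill came tumbling', 5))
```

Although five words are requested, the toy run appends only `after` and then stops, which mirrors the behaviour the notebook demonstrates with the final verse.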

In [3]:
# source text
data = [""" Jack and Jill went up the hill\n
 To fetch a pail of water\n
 Jack fell down and broke his crown\n
 And Jill came tumbling after\n """,
"""Baa, baa, black sheep\n
 Have you any wool?\n
 Yes sir, yes sir\n
 Three bags full.\n
 One for my master\n
 And one for the dame\n
 One for the little boy\n
 Who lives down the lane."""]
# integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
encoded = tokenizer.texts_to_sequences(data)
# retrieve vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
# encode 4 words -> 1 word
sequences = list()
for a in range(len(encoded)):
    for i in range(len(encoded[a])):
        sequence = encoded[a][i:i+5]
        sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
# pad sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='post')
print('Max Sequence Length: %d' % max_length)

Vocabulary Size: 44
Total Sequences: 59
Max Sequence Length: 5
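The counts printed above can be reproduced without Keras. The sketch below assumes a simplified tokenizer (lowercase, letters only) in place of `Tokenizer`, and it assigns indices alphabetically rather than by word frequency, so the integer ids differ from the notebook's while the counts should not. `simple_tokens` is a hypothetical helper.

```python
import re

rhymes = [
    "Jack and Jill went up the hill To fetch a pail of water "
    "Jack fell down and broke his crown And Jill came tumbling after",
    "Baa, baa, black sheep Have you any wool? Yes sir, yes sir "
    "Three bags full. One for my master And one for the dame "
    "One for the little boy Who lives down the lane.",
]

def simple_tokens(text):
    # lowercase and keep alphabetic runs only (simplified stand-in for Tokenizer)
    return re.findall(r"[a-z]+", text.lower())

token_lists = [simple_tokens(r) for r in rhymes]
vocab = sorted({w for toks in token_lists for w in toks})
word_index = {w: i + 1 for i, w in enumerate(vocab)}  # 0 is reserved for padding

window = 5  # 4 input words + 1 target word
sequences = []
for toks in token_lists:
    ids = [word_index[w] for w in toks]
    for i in range(len(ids)):
        seq = ids[i:i + window]
        # windows near the end of a rhyme are shorter; post-pad them with zeros
        sequences.append(seq + [0] * (window - len(seq)))

print('Vocabulary Size:', len(word_index) + 1)
print('Total Sequences:', len(sequences))
```

One window starts at every word, so the number of sequences equals the total word count of both rhymes. The last four windows of each rhyme end in padding, which is how the model learns to predict index 0 once a rhyme is finished.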


In [4]:
# Splitting input and output elements
sequences = array(sequences)
X, y = sequences[:,:-1],sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
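To make the split concrete, here is a small NumPy-only sketch on toy data. The three rows and the vocabulary size of 6 are assumptions chosen for illustration, and `np.eye(...)[y]` is used in place of `to_categorical` to build the same kind of one-hot rows.

```python
import numpy as np

vocab_size = 6  # toy value (assumption); the notebook's vocabulary size is 44
sequences = np.array([
    [1, 2, 3, 4, 5],   # full window: four inputs, target 5
    [2, 3, 4, 5, 0],   # padded window: target is the padding index
    [3, 4, 5, 0, 0],
])

# first four columns are the inputs, the last column is the target word id
X, y = sequences[:, :-1], sequences[:, -1]

# one-hot encode the targets: row k of the identity matrix encodes class k
y_onehot = np.eye(vocab_size, dtype='float32')[y]

print(X.shape, y_onehot.shape)
print(y_onehot[0])
```

Rows whose target is 0 are the padded end-of-rhyme windows, so class 0 acts as an implicit end marker.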

In [5]:
# Model Creation, Compiling and Summary
# The LSTM layer is the RNN architecture used for the neural network.
# A bidirectional wrapper has been added to process the sequence in both directions.
# The softmax function is used for the output layer.
model = Sequential()
model.add(Embedding(vocab_size, 10, input_length=max_length-1))
model.add(keras.layers.Bidirectional(LSTM(50,dropout=0.15,recurrent_dropout=0.15)))
model.add(Dense(vocab_size, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 4, 10)             440       
                                                                 
 bidirectional (Bidirectiona  (None, 100)              24400     
 l)                                                              
                                                                 
 dense (Dense)               (None, 44)                4444      
                                                                 
Total params: 29,284
Trainable params: 29,284
Non-trainable params: 0
_________________________________________________________________
None
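The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for embedding, LSTM and dense layers with the sizes from the model definition; the variable names are illustrative only.

```python
vocab_size, embed_dim, lstm_units = 44, 10, 50

# embedding: one vector of size embed_dim per vocabulary entry
embedding_params = vocab_size * embed_dim

# one LSTM direction: 4 gates, each with input weights, recurrent weights and a bias
lstm_one_direction = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# the bidirectional wrapper holds a forward and a backward copy
bidirectional_params = 2 * lstm_one_direction

# dense: the concatenated forward and backward outputs (2 * lstm_units) feed vocab_size units
dense_params = (2 * lstm_units) * vocab_size + vocab_size

total_params = embedding_params + bidirectional_params + dense_params
print(embedding_params, bidirectional_params, dense_params, total_params)
```

The bidirectional layer's output width of 100 in the summary is the two 50-unit LSTM outputs concatenated.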


In [6]:
# Model Fitting
call=tf.keras.callbacks.EarlyStopping(monitor='loss',patience=5,restore_best_weights=True)
model.fit(X, y, epochs=250, verbose=2, callbacks=[call])

Epoch 1/250
2/2 - 6s - loss: 3.7845 - accuracy: 0.0339 - 6s/epoch - 3s/step
Epoch 2/250
2/2 - 0s - loss: 3.7806 - accuracy: 0.0339 - 34ms/epoch - 17ms/step
Epoch 3/250
2/2 - 0s - loss: 3.7779 - accuracy: 0.1356 - 26ms/epoch - 13ms/step
Epoch 4/250
2/2 - 0s - loss: 3.7737 - accuracy: 0.1356 - 34ms/epoch - 17ms/step
Epoch 5/250
2/2 - 0s - loss: 3.7703 - accuracy: 0.1356 - 32ms/epoch - 16ms/step
Epoch 6/250
2/2 - 0s - loss: 3.7672 - accuracy: 0.1356 - 50ms/epoch - 25ms/step
Epoch 7/250
2/2 - 0s - loss: 3.7633 - accuracy: 0.1356 - 33ms/epoch - 17ms/step
Epoch 8/250
2/2 - 0s - loss: 3.7591 - accuracy: 0.1356 - 28ms/epoch - 14ms/step
Epoch 9/250
2/2 - 0s - loss: 3.7550 - accuracy: 0.1356 - 30ms/epoch - 15ms/step
Epoch 10/250
2/2 - 0s - loss: 3.7492 - accuracy: 0.1356 - 34ms/epoch - 17ms/step
Epoch 11/250
2/2 - 0s - loss: 3.7439 - accuracy: 0.1356 - 32ms/epoch - 16ms/step
Epoch 12/250
2/2 - 0s - loss: 3.7376 - accuracy: 0.1356 - 51ms/epoch - 25ms/step
Epoch 13/250
2/2 - 0s - loss: 3.7310 - ac

Epoch 103/250
2/2 - 0s - loss: 1.3867 - accuracy: 0.6271 - 17ms/epoch - 8ms/step
Epoch 104/250
2/2 - 0s - loss: 1.3920 - accuracy: 0.5593 - 17ms/epoch - 9ms/step
Epoch 105/250
2/2 - 0s - loss: 1.3573 - accuracy: 0.6102 - 17ms/epoch - 9ms/step
Epoch 106/250
2/2 - 0s - loss: 1.2857 - accuracy: 0.7119 - 33ms/epoch - 17ms/step
Epoch 107/250
2/2 - 0s - loss: 1.2345 - accuracy: 0.6780 - 33ms/epoch - 17ms/step
Epoch 108/250
2/2 - 0s - loss: 1.2776 - accuracy: 0.6949 - 18ms/epoch - 9ms/step
Epoch 109/250
2/2 - 0s - loss: 1.2185 - accuracy: 0.6780 - 20ms/epoch - 10ms/step
Epoch 110/250
2/2 - 0s - loss: 1.1638 - accuracy: 0.7627 - 30ms/epoch - 15ms/step
Epoch 111/250
2/2 - 0s - loss: 1.2287 - accuracy: 0.7458 - 32ms/epoch - 16ms/step
Epoch 112/250
2/2 - 0s - loss: 1.1737 - accuracy: 0.7119 - 16ms/epoch - 8ms/step
Epoch 113/250
2/2 - 0s - loss: 1.1122 - accuracy: 0.7966 - 35ms/epoch - 17ms/step
Epoch 114/250
2/2 - 0s - loss: 1.1128 - accuracy: 0.7288 - 31ms/epoch - 16ms/step
Epoch 115/250
2/2 - 0

<keras.callbacks.History at 0x2442bb07e20>
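A few values in the training log can be sanity-checked with simple arithmetic. The sketch below assumes the default `batch_size` of 32, since `model.fit` is called without one, and compares the first-epoch loss with the cross-entropy of a uniform guess over the vocabulary.

```python
import math

n_samples, vocab_size = 59, 44
batch_size = 32  # Keras default for model.fit when batch_size is not given

# cross-entropy of a uniform prediction over all classes
uniform_loss = math.log(vocab_size)
# number of batches needed to cover all samples once
steps_per_epoch = math.ceil(n_samples / batch_size)

print(round(uniform_loss, 4))
print(steps_per_epoch)
# accuracy values corresponding to 2 and 8 correct predictions out of 59
print(round(2 / n_samples, 4), round(8 / n_samples, 4))
```

The first-epoch loss of about 3.78 is consistent with an untrained model spreading its probability almost evenly over the 44 classes, and the `2/2` prefix on each log line matches two batches per epoch.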

In [7]:
# Single predictors: each generates a set number of words (22, 5, 5, 5) after a four-word seed
# The first predictor generates the full nursery rhyme
# The last predictor, seeded with the final verse, requests more words than remain in the verse.
print(generate_seq(model, tokenizer, max_length-1, 'Jack and Jill went', 22))
print(generate_seq(model, tokenizer, max_length-1, 'fell down and broke', 5))
print(generate_seq(model, tokenizer, max_length-1, 'pail of water Jack', 5))
print(generate_seq(model, tokenizer, max_length-1, 'And Jill came tumbling', 5))

Jack and Jill went up the hill to fetch a pail of water jack fell down and broke his crown and jill came tumbling after
fell down and broke his crown and jill came
pail of water Jack fell down and broke his
And Jill came tumbling after


In [8]:
# Single predictors: each generates a set number of words (31, 5, 5, 5) after a four-word seed
# The first predictor generates the full nursery rhyme
# The last predictor, seeded with the final verse, requests more words than remain in the verse.
print(generate_seq(model, tokenizer, max_length-1, 'Baa baa black sheep', 31))
print(generate_seq(model, tokenizer, max_length-1, 'for my master and', 5))
print(generate_seq(model, tokenizer, max_length-1, 'one for the dame', 5))
print(generate_seq(model, tokenizer, max_length-1, 'boy who lives down', 5))

Baa baa black sheep have you any wool yes sir yes sir three bags full one for my master and one for the dame one for the little boy who lives down the lane
for my master and one for the dame one
one for the dame one for the little boy
boy who lives down the lane

