## How to Develop a Word-Based Neural Language Model

We know language modeling involves predicting the next word in a sequence given the sequence of words already present.

A language model is a key element in many NLP models such as machine translation and speech recognition.

The choice of how the language model is framed must match how the language model is intended to be used.

#### Framing Language Modeling

A statistical language model is learned from raw text and predicts the probability of the next word in the sequence given the words already present in the sequence. Language models are a key component in larger models for challenging NLP problems, like machine translation and speech recognition. They can also be developed as standalone models and used for generating new sequences that have the same statistical properties as the source text.

Language models both learn and predict one word at a time. The training of the network involves providing sequences of words as input, processed one at a time, where a prediction can be made and learned for each input sequence. Similarly, when making predictions, the process can be seeded with one or a few words; predicted words are then gathered and presented as input on subsequent predictions to build up a generated output sequence.
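
The seed-and-feed-back loop described above can be sketched without any neural network. The snippet below is only an illustration: it swaps the trained model for a hypothetical lookup table that returns the most frequent follower of each word in a toy text. The names `next_word` and `generate` are made up for this sketch and are not part of any library.

```python
from collections import Counter, defaultdict

# Toy text; lowercasing and whitespace splitting are assumptions of this sketch.
text = ("Jack and Jill went up the hill to fetch a pail of water "
        "Jack fell down and broke his crown and Jill came tumbling after")
tokens = text.lower().split()

# Stand-in for a trained model: count which word follows which.
followers = defaultdict(Counter)
for current, following in zip(tokens, tokens[1:]):
    followers[current][following] += 1

def next_word(word):
    """Return the most frequent follower of `word`, or None if it has none."""
    counts = followers.get(word)
    if not counts:
        return None
    # max() returns the first maximum, so the first-seen follower wins ties.
    return max(counts, key=counts.get)

def generate(seed, n_words):
    """Seed with one word, then feed each prediction back in as the next input."""
    current, result = seed, [seed]
    for _ in range(n_words):
        current = next_word(current)
        if current is None:
            break
        result.append(current)
    return " ".join(result)

generated = generate("jack", 6)
print(generated)
```

A neural language model replaces the lookup table with a learned probability distribution over the vocabulary, but the loop around it stays the same.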

Therefore, each model will involve splitting the source text into input and output sequences, such that the model can learn to predict words. There are many ways to frame the sequences from a source text for language modeling.

In [8]:
from numpy import array
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding

#### The Data - Jack and Jill Rhyme

In [9]:
# text
data = """Jack and Jill went up the hill\n 
        To fetch a pail of water\n 
        Jack fell down and broke his crown\n 
        And Jill came tumbling after\n"""

#### Model 1: One-Word-In, One-Word-Out Sequences

We can start with a simple word-in, word-out model that predicts the next word in the sequence:
    
                    X,           y
                    Jack,       and
                    and,        Jill
                    Jill,       went
                    ...
                    
We will fit our model to predict a probability distribution across all words in the vocabulary. This means that we need to turn the output element from a single integer into a one-hot encoding, with a 0 for every word in the vocabulary and a 1 at the index of the actual word. This gives the network a ground truth to aim for, from which we can calculate error and update the model.
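
As a rough illustration of what a one-hot target looks like, here is a minimal pure-Python sketch. The helper name `one_hot` and the 6-slot vocabulary are assumptions made for the example; the notebook itself relies on `to_categorical` for this step.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    if not 0 <= index < num_classes:
        raise ValueError("index must lie in [0, num_classes)")
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

# Word index 3 in a hypothetical 6-slot vocabulary (slot 0 is never a word).
encoded_target = one_hot(3, 6)
print(encoded_target)
```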

In [10]:
# Create the tokenizer class and fit on texts
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]

In [11]:
encoded

[2,
 1,
 3,
 4,
 5,
 6,
 7,
 8,
 9,
 10,
 11,
 12,
 13,
 2,
 14,
 15,
 1,
 16,
 17,
 18,
 1,
 3,
 19,
 20,
 21]

We will need to know the size of the vocabulary later for both defining the word embedding layer in the model, and for encoding output words using a one hot encoding. The size of the vocabulary can be retrieved from the trained Tokenizer using the word_index attribute.

In [12]:
# Determining the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)

Vocabulary Size: 22


The vocabulary contains 21 unique words, and running the above prints a size of 22. We add one because the largest encoded word must be usable as an array index: words are encoded 1 to 21, so the array needs indices 0 to 21, or 22 positions.
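
To see where the extra slot comes from, the word index can be rebuilt by hand. This is a rough pure-Python approximation of what the tokenizer does for this particular text (lowercase, split on whitespace, rank words by frequency). It is not the library's implementation, and `build_word_index` is a name made up for the illustration.

```python
from collections import Counter

def build_word_index(text):
    """Map each word to an integer, most frequent first, starting at 1."""
    counts = Counter(text.lower().split())
    # sorted() is stable, so equally frequent words keep first-seen order.
    ranked = sorted(counts, key=lambda word: -counts[word])
    return {word: rank for rank, word in enumerate(ranked, start=1)}

rhyme = ("Jack and Jill went up the hill To fetch a pail of water "
         "Jack fell down and broke his crown And Jill came tumbling after")
word_index = build_word_index(rhyme)

largest_index = max(word_index.values())  # highest integer assigned to a word
vocab_size = largest_index + 1            # index 0 is never assigned to a word
print(len(word_index), largest_index, vocab_size)
```

Because no word is ever mapped to 0, an array that can be indexed by the largest word integer needs one more position than there are words.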

Next, we need to create sequences of words to fit the model with one word as input and one word as output.

In [13]:
# create word > word sequences
sequences = list()
for i in range(1, len(encoded)):
    sequence = encoded[i-1:i+1]
    sequences.append(sequence)
print('Total Sequences: %d' %len(sequences))

Total Sequences: 24


In [14]:
# Split sequences into X and y elements
sequences = array(sequences)
X, y = sequences[:, 0], sequences[:, 1]
print(X, y)

[ 2  1  3  4  5  6  7  8  9 10 11 12 13  2 14 15  1 16 17 18  1  3 19 20] [ 1  3  4  5  6  7  8  9 10 11 12 13  2 14 15  1 16 17 18  1  3 19 20 21]


In [15]:
# onehot encode outputs
y = to_categorical(y, num_classes=vocab_size)

In [16]:
y

array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
        0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.,

We are now ready to define the model. The model uses a learned word embedding in the input layer. This has one real-valued vector for each word in the vocabulary, where each word vector has a specified length. In this case we will use a 10-dimensional projection.

The input sequence contains a single word, therefore the input_length = 1.

The model has a single hidden LSTM layer with 50 units. This is far more than needed. The output layer is comprised of one neuron for each word in the vocabulary and uses a softmax activation function to ensure the output is normalized to look like a probability.
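
For intuition about the output layer, the softmax function can be sketched in a few lines of plain Python. The input scores below are arbitrary numbers chosen for the example, not values produced by the model.

```python
import math

def softmax(scores):
    """Turn arbitrary real-valued scores into positive values that sum to 1."""
    shift = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
```

The largest score receives the largest share, and every output lies strictly between 0 and 1, which is why the result can be read as a probability for each word.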

In [17]:
# Define the model: We'll use the sequential model because we will need to predict classes
def word_pred_model(vocab_size):
    model = Sequential()
    model.add(Embedding(vocab_size, 10, input_length=1))
    model.add(LSTM(50))
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

In [18]:
model = word_pred_model(vocab_size)

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 1, 10)             220       
_________________________________________________________________
lstm_1 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_1 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
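
The parameter counts in the summary can be checked by hand. The arithmetic below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the variable names are illustrative.

```python
vocab_size = 22   # from the tokenizer, including the unused index 0
embed_dim = 10    # length of each word vector
lstm_units = 50   # size of the LSTM hidden state

# Embedding: one embed_dim-long vector per vocabulary slot.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias.
lstm_params = 4 * (embed_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense: one weight per (hidden unit, output word) pair plus one bias per word.
dense_params = lstm_units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Note that none of these counts depend on the input length, which is why the later models report the same totals.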


In [19]:
# Fit the model
model.fit(X, y, epochs=500, verbose=2)


Epoch 1/500
 - 0s - loss: 3.0912 - accuracy: 0.0000e+00
Epoch 2/500
 - 0s - loss: 3.0904 - accuracy: 0.0000e+00
Epoch 3/500
 - 0s - loss: 3.0897 - accuracy: 0.0833
Epoch 4/500
 - 0s - loss: 3.0889 - accuracy: 0.0833
Epoch 5/500
 - 0s - loss: 3.0882 - accuracy: 0.1250
Epoch 6/500
 - 0s - loss: 3.0874 - accuracy: 0.1250
Epoch 7/500
 - 0s - loss: 3.0866 - accuracy: 0.1250
Epoch 8/500
 - 0s - loss: 3.0859 - accuracy: 0.1250
Epoch 9/500
 - 0s - loss: 3.0851 - accuracy: 0.1250
Epoch 10/500
 - 0s - loss: 3.0843 - accuracy: 0.1250
Epoch 11/500
 - 0s - loss: 3.0835 - accuracy: 0.1250
Epoch 12/500
 - 0s - loss: 3.0827 - accuracy: 0.1250
Epoch 13/500
 - 0s - loss: 3.0819 - accuracy: 0.1250
Epoch 14/500
 - 0s - loss: 3.0810 - accuracy: 0.1250
Epoch 15/500
 - 0s - loss: 3.0802 - accuracy: 0.1250
Epoch 16/500
 - 0s - loss: 3.0793 - accuracy: 0.1250
Epoch 17/500
 - 0s - loss: 3.0784 - accuracy: 0.1250
Epoch 18/500
 - 0s - loss: 3.0775 - accuracy: 0.1250
Epoch 19/500
 - 0s - loss: 3.0766 - accuracy: 0

Epoch 155/500
 - 0s - loss: 2.0924 - accuracy: 0.5417
Epoch 156/500
 - 0s - loss: 2.0763 - accuracy: 0.5417
Epoch 157/500
 - 0s - loss: 2.0600 - accuracy: 0.5417
Epoch 158/500
 - 0s - loss: 2.0438 - accuracy: 0.5417
Epoch 159/500
 - 0s - loss: 2.0275 - accuracy: 0.5833
Epoch 160/500
 - 0s - loss: 2.0113 - accuracy: 0.5833
Epoch 161/500
 - 0s - loss: 1.9950 - accuracy: 0.5833
Epoch 162/500
 - 0s - loss: 1.9787 - accuracy: 0.5833
Epoch 163/500
 - 0s - loss: 1.9624 - accuracy: 0.5833
Epoch 164/500
 - 0s - loss: 1.9461 - accuracy: 0.6250
Epoch 165/500
 - 0s - loss: 1.9298 - accuracy: 0.6250
Epoch 166/500
 - 0s - loss: 1.9135 - accuracy: 0.6250
Epoch 167/500
 - 0s - loss: 1.8973 - accuracy: 0.6667
Epoch 168/500
 - 0s - loss: 1.8810 - accuracy: 0.6667
Epoch 169/500
 - 0s - loss: 1.8648 - accuracy: 0.7500
Epoch 170/500
 - 0s - loss: 1.8486 - accuracy: 0.7500
Epoch 171/500
 - 0s - loss: 1.8325 - accuracy: 0.7500
Epoch 172/500
 - 0s - loss: 1.8163 - accuracy: 0.8333
Epoch 173/500
 - 0s - loss: 

Epoch 307/500
 - 0s - loss: 0.4496 - accuracy: 0.8750
Epoch 308/500
 - 0s - loss: 0.4458 - accuracy: 0.8750
Epoch 309/500
 - 0s - loss: 0.4420 - accuracy: 0.8750
Epoch 310/500
 - 0s - loss: 0.4383 - accuracy: 0.8750
Epoch 311/500
 - 0s - loss: 0.4346 - accuracy: 0.8750
Epoch 312/500
 - 0s - loss: 0.4310 - accuracy: 0.8750
Epoch 313/500
 - 0s - loss: 0.4275 - accuracy: 0.8750
Epoch 314/500
 - 0s - loss: 0.4240 - accuracy: 0.8750
Epoch 315/500
 - 0s - loss: 0.4206 - accuracy: 0.8750
Epoch 316/500
 - 0s - loss: 0.4173 - accuracy: 0.8750
Epoch 317/500
 - 0s - loss: 0.4140 - accuracy: 0.8750
Epoch 318/500
 - 0s - loss: 0.4108 - accuracy: 0.8750
Epoch 319/500
 - 0s - loss: 0.4077 - accuracy: 0.8750
Epoch 320/500
 - 0s - loss: 0.4046 - accuracy: 0.8750
Epoch 321/500
 - 0s - loss: 0.4015 - accuracy: 0.8750
Epoch 322/500
 - 0s - loss: 0.3985 - accuracy: 0.8750
Epoch 323/500
 - 0s - loss: 0.3956 - accuracy: 0.8750
Epoch 324/500
 - 0s - loss: 0.3927 - accuracy: 0.8750
Epoch 325/500
 - 0s - loss: 

Epoch 459/500
 - 0s - loss: 0.2424 - accuracy: 0.8750
Epoch 460/500
 - 0s - loss: 0.2420 - accuracy: 0.8750
Epoch 461/500
 - 0s - loss: 0.2417 - accuracy: 0.8750
Epoch 462/500
 - 0s - loss: 0.2413 - accuracy: 0.8750
Epoch 463/500
 - 0s - loss: 0.2410 - accuracy: 0.8750
Epoch 464/500
 - 0s - loss: 0.2406 - accuracy: 0.8750
Epoch 465/500
 - 0s - loss: 0.2403 - accuracy: 0.8750
Epoch 466/500
 - 0s - loss: 0.2400 - accuracy: 0.8750
Epoch 467/500
 - 0s - loss: 0.2397 - accuracy: 0.8750
Epoch 468/500
 - 0s - loss: 0.2393 - accuracy: 0.8750
Epoch 469/500
 - 0s - loss: 0.2390 - accuracy: 0.8750
Epoch 470/500
 - 0s - loss: 0.2387 - accuracy: 0.8750
Epoch 471/500
 - 0s - loss: 0.2384 - accuracy: 0.8750
Epoch 472/500
 - 0s - loss: 0.2381 - accuracy: 0.8750
Epoch 473/500
 - 0s - loss: 0.2378 - accuracy: 0.8750
Epoch 474/500
 - 0s - loss: 0.2375 - accuracy: 0.8750
Epoch 475/500
 - 0s - loss: 0.2372 - accuracy: 0.8750
Epoch 476/500
 - 0s - loss: 0.2369 - accuracy: 0.8750
Epoch 477/500
 - 0s - loss: 

<keras.callbacks.callbacks.History at 0x7fe0b3002748>

In [20]:
in_text = 'Jack'
print(in_text)
encoded = tokenizer.texts_to_sequences([in_text])[0]
encoded = array(encoded)
yhat = model.predict_classes(encoded, verbose=0)
for word, index in tokenizer.word_index.items():
    if index == yhat:
        print(word)

Jack
and
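
`predict_classes` returns the index of the highest-probability word. Where that method is not available (to our knowledge it has been removed from `Sequential` in later TensorFlow/Keras releases), the same result can be obtained by taking the argmax of the predicted probabilities. The sketch below shows the idea on a made-up probability vector and a made-up three-word index, so it runs without a trained model.

```python
def argmax(values):
    """Return the position of the largest value (the first one wins ties)."""
    return max(range(len(values)), key=lambda i: values[i])

# Hypothetical word index and model output, for illustration only.
word_index = {"and": 1, "jack": 2, "jill": 3}
index_to_word = {index: word for word, index in word_index.items()}
probabilities = [0.01, 0.90, 0.04, 0.05]  # slot 0 is the unused padding index

predicted_index = argmax(probabilities)
predicted_word = index_to_word.get(predicted_index, "")
print(predicted_index, predicted_word)
```

With a real model this would correspond to something like `model.predict(encoded).argmax(axis=-1)`. Building a reverse dictionary once also avoids the linear search through `tokenizer.word_index` on every prediction.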


In [23]:
# Generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
    in_text, result = seed_text, seed_text
    #generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        encoded = array(encoded)
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text, result = out_word, result+' '+out_word
    return result

In [24]:
# Evaluate 
print(generate_seq(model, tokenizer, 'Jack', 6))

Jack and jill went up the hill


In [25]:
# Evaluate 
print(generate_seq(model, tokenizer, 'fell', 6))

fell down and jill went up the


In [26]:
# Evaluate 
print(generate_seq(model, tokenizer, 'and', 6))

and jill went up the hill to


In [27]:
# Evaluate 
print(generate_seq(model, tokenizer, 'to', 6))

to fetch a pail of water jack


In [28]:
# Evaluate 
print(generate_seq(model, tokenizer, 'fetch', 6))

fetch a pail of water jack and


#### Model 2: Line-by-Line Sequence

We saw the approach of using one word to predict the next. We'll now try another approach: splitting the source text line by line, then breaking each line down into a series of word sequences that build up. For example:

    X,                                 y
    _, _, _, _, _, Jack,               and
    _, _, _, _, Jack, and,             Jill
    _, _, _, Jack, and, Jill,          went
    _, _, Jack, and, Jill, went,       up
    _, Jack, and, Jill, went, up,      the
    Jack, and, Jill, went, up, the,    hill

This approach may allow the model to use the context of each line in those cases where a simple one-word-in, one-word-out model creates ambiguity. In this case, this comes at the cost of predicting words across lines, which might be fine for now if we are only interested in modeling and generating lines of text.

Note from the above illustration that we'll have to pad our sequences so that every input has the same fixed length. This is a requirement when using Keras.
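
The build-up and padding steps can be sketched in plain Python before turning to the Keras helpers. The functions `line_prefixes` and `pre_pad` are hypothetical names for this illustration. `pre_pad` pads on the left and, for over-long input, keeps only the last items, which as far as we know mirrors the default behaviour of `pad_sequences` with `padding='pre'`.

```python
def line_prefixes(encoded_line):
    """Return every prefix of the line that contains at least two words."""
    return [encoded_line[:i + 1] for i in range(1, len(encoded_line))]

def pre_pad(sequence, maxlen, pad_value=0):
    """Left-pad with pad_value; keep only the last maxlen items if too long."""
    trimmed = sequence[-maxlen:]
    return [pad_value] * (maxlen - len(trimmed)) + trimmed

# "Jack and Jill went up the hill" as integers, using the encoding shown earlier.
line = [2, 1, 3, 4, 5, 6, 7]
prefixes = line_prefixes(line)
max_length = max(len(p) for p in prefixes)
padded = [pre_pad(p, max_length) for p in prefixes]

for row in padded:
    # The last item is the target word; everything before it is the input.
    print(row[:-1], "->", row[-1])
```

Padding on the left keeps the most recent words next to the position being predicted, so the target is always the final element of each row.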

In [30]:
from keras.preprocessing.sequence import pad_sequences

In [29]:
# Create line sequences
sequences = list()
for line in data.split('\n'):
    encoded = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(encoded)):
        sequence = encoded[:i+1]
        sequences.append(sequence)
print('Total Sequences: %d' %len(sequences))

Total Sequences: 21


In [31]:
# Extract maximum length and pad input sequences using 'pre' padding
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print('Max Sequence Length: %d'% max_length)

Max Sequence Length: 7


In [38]:
# Split sequences into input and output elements
sequences = array(sequences)
X = sequences[:,:-1] 
y = sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)

In [42]:
# We can use the same model as before, we'll just edit the input length which will be the max_length - 1
# Define the model: We'll use the sequential model because we will need to predict classes
def word_pred_model(vocab_size):
    model = Sequential()
    model.add(Embedding(vocab_size, 10, input_length=max_length-1))
    model.add(LSTM(50))
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

In [43]:
model=word_pred_model(vocab_size)

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_3 (Embedding)      (None, 6, 10)             220       
_________________________________________________________________
lstm_3 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_3 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________


In [44]:
model.fit(X, y, epochs=500, verbose=2)


Epoch 1/500
 - 0s - loss: 3.0926 - accuracy: 0.0000e+00
Epoch 2/500
 - 0s - loss: 3.0913 - accuracy: 0.0476
Epoch 3/500
 - 0s - loss: 3.0899 - accuracy: 0.0952
Epoch 4/500
 - 0s - loss: 3.0884 - accuracy: 0.1429
Epoch 5/500
 - 0s - loss: 3.0868 - accuracy: 0.1429
Epoch 6/500
 - 0s - loss: 3.0852 - accuracy: 0.1905
Epoch 7/500
 - 0s - loss: 3.0836 - accuracy: 0.1905
Epoch 8/500
 - 0s - loss: 3.0818 - accuracy: 0.1905
Epoch 9/500
 - 0s - loss: 3.0801 - accuracy: 0.1429
Epoch 10/500
 - 0s - loss: 3.0782 - accuracy: 0.1429
Epoch 11/500
 - 0s - loss: 3.0764 - accuracy: 0.0952
Epoch 12/500
 - 0s - loss: 3.0744 - accuracy: 0.0952
Epoch 13/500
 - 0s - loss: 3.0724 - accuracy: 0.0952
Epoch 14/500
 - 0s - loss: 3.0702 - accuracy: 0.0952
Epoch 15/500
 - 0s - loss: 3.0679 - accuracy: 0.0952
Epoch 16/500
 - 0s - loss: 3.0655 - accuracy: 0.0952
Epoch 17/500
 - 0s - loss: 3.0630 - accuracy: 0.0952
Epoch 18/500
 - 0s - loss: 3.0603 - accuracy: 0.0952
Epoch 19/500
 - 0s - loss: 3.0575 - accuracy: 0.095

Epoch 155/500
 - 0s - loss: 0.7825 - accuracy: 0.8571
Epoch 156/500
 - 0s - loss: 0.7739 - accuracy: 0.8571
Epoch 157/500
 - 0s - loss: 0.7658 - accuracy: 0.8571
Epoch 158/500
 - 0s - loss: 0.7577 - accuracy: 0.8571
Epoch 159/500
 - 0s - loss: 0.7494 - accuracy: 0.8571
Epoch 160/500
 - 0s - loss: 0.7409 - accuracy: 0.8571
Epoch 161/500
 - 0s - loss: 0.7327 - accuracy: 0.8571
Epoch 162/500
 - 0s - loss: 0.7250 - accuracy: 0.8571
Epoch 163/500
 - 0s - loss: 0.7176 - accuracy: 0.8571
Epoch 164/500
 - 0s - loss: 0.7102 - accuracy: 0.8571
Epoch 165/500
 - 0s - loss: 0.7028 - accuracy: 0.8571
Epoch 166/500
 - 0s - loss: 0.6953 - accuracy: 0.8571
Epoch 167/500
 - 0s - loss: 0.6880 - accuracy: 0.8571
Epoch 168/500
 - 0s - loss: 0.6810 - accuracy: 0.8571
Epoch 169/500
 - 0s - loss: 0.6743 - accuracy: 0.8571
Epoch 170/500
 - 0s - loss: 0.6678 - accuracy: 0.8571
Epoch 171/500
 - 0s - loss: 0.6614 - accuracy: 0.8571
Epoch 172/500
 - 0s - loss: 0.6550 - accuracy: 0.8571
Epoch 173/500
 - 0s - loss: 

Epoch 307/500
 - 0s - loss: 0.2405 - accuracy: 0.9524
Epoch 308/500
 - 0s - loss: 0.2382 - accuracy: 0.9524
Epoch 309/500
 - 0s - loss: 0.2372 - accuracy: 0.9524
Epoch 310/500
 - 0s - loss: 0.2363 - accuracy: 0.9524
Epoch 311/500
 - 0s - loss: 0.2339 - accuracy: 0.9524
Epoch 312/500
 - 0s - loss: 0.2319 - accuracy: 0.9524
Epoch 313/500
 - 0s - loss: 0.2308 - accuracy: 0.9524
Epoch 314/500
 - 0s - loss: 0.2295 - accuracy: 0.9524
Epoch 315/500
 - 0s - loss: 0.2276 - accuracy: 0.9524
Epoch 316/500
 - 0s - loss: 0.2258 - accuracy: 0.9524
Epoch 317/500
 - 0s - loss: 0.2247 - accuracy: 0.9524
Epoch 318/500
 - 0s - loss: 0.2234 - accuracy: 0.9524
Epoch 319/500
 - 0s - loss: 0.2216 - accuracy: 0.9524
Epoch 320/500
 - 0s - loss: 0.2200 - accuracy: 0.9524
Epoch 321/500
 - 0s - loss: 0.2188 - accuracy: 0.9524
Epoch 322/500
 - 0s - loss: 0.2175 - accuracy: 0.9524
Epoch 323/500
 - 0s - loss: 0.2158 - accuracy: 0.9524
Epoch 324/500
 - 0s - loss: 0.2144 - accuracy: 0.9524
Epoch 325/500
 - 0s - loss: 

Epoch 459/500
 - 0s - loss: 0.1147 - accuracy: 0.9524
Epoch 460/500
 - 0s - loss: 0.1144 - accuracy: 0.9524
Epoch 461/500
 - 0s - loss: 0.1141 - accuracy: 0.9524
Epoch 462/500
 - 0s - loss: 0.1138 - accuracy: 0.9524
Epoch 463/500
 - 0s - loss: 0.1134 - accuracy: 0.9524
Epoch 464/500
 - 0s - loss: 0.1131 - accuracy: 0.9524
Epoch 465/500
 - 0s - loss: 0.1128 - accuracy: 0.9524
Epoch 466/500
 - 0s - loss: 0.1125 - accuracy: 0.9524
Epoch 467/500
 - 0s - loss: 0.1122 - accuracy: 0.9524
Epoch 468/500
 - 0s - loss: 0.1119 - accuracy: 0.9524
Epoch 469/500
 - 0s - loss: 0.1116 - accuracy: 0.9524
Epoch 470/500
 - 0s - loss: 0.1113 - accuracy: 0.9524
Epoch 471/500
 - 0s - loss: 0.1110 - accuracy: 0.9524
Epoch 472/500
 - 0s - loss: 0.1108 - accuracy: 0.9524
Epoch 473/500
 - 0s - loss: 0.1105 - accuracy: 0.9524
Epoch 474/500
 - 0s - loss: 0.1102 - accuracy: 0.9524
Epoch 475/500
 - 0s - loss: 0.1099 - accuracy: 0.9524
Epoch 476/500
 - 0s - loss: 0.1096 - accuracy: 0.9524
Epoch 477/500
 - 0s - loss: 

<keras.callbacks.callbacks.History at 0x7fe0a0449668>

We can then reuse the `generate_seq` function above, with a small edit to pad the input, to make predictions given a seed sequence.

In [45]:
# Generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words, max_length):
    in_text = seed_text
    #generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' '+out_word
    return in_text

In [46]:
# Generate word
print(generate_seq(model, tokenizer, 'Jack', 4, max_length-1))

Jack fell down and broke


In [50]:
print(generate_seq(model, tokenizer, 'came', 4, max_length-1))

came fetch fetch pail of


We can see a few errors in some predictions. This is expected: many words never appear at the beginning of a line, only within one, so when such a word (for example "came") is used as the seed, the network receives an input unlike anything it saw during training and its output is unreliable.
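
One way to check this explanation is to list which words ever start a line, since those are the only words the model saw with nothing but padding in front of them. A small sketch, using plain string handling rather than the Keras tokenizer:

```python
lines = [
    "Jack and Jill went up the hill",
    "To fetch a pail of water",
    "Jack fell down and broke his crown",
    "And Jill came tumbling after",
]

# Words that appear as the first word of a line.
line_starts = {line.lower().split()[0] for line in lines}

for seed in ("jack", "came"):
    if seed in line_starts:
        print(seed, "starts a training line")
    else:
        print(seed, "never starts a training line")
```

Seeds drawn from `line_starts` match inputs the model was trained on; any other single-word seed asks it to extrapolate.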

#### Model 3: Two-Words-In, One-Word-Out Sequence

We can use an intermediate between the one-word-in and the whole-sentence-in approaches and pass in a sub-sequence of words as input. This provides a trade-off between the two framings, allowing new lines to be generated and generation to be picked up mid-line. We will use two words as input to predict one word as output. We will prepare the sequences as in the first example, but with different offsets.
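
The fixed-size window framing can be written as a small general helper. This is a sketch in plain Python; `sliding_windows` is a hypothetical name, and the integer list is copied from the tokenizer output shown earlier.

```python
def sliding_windows(encoded, n_inputs):
    """Return every run of n_inputs context words plus the word that follows."""
    width = n_inputs + 1
    return [encoded[i:i + width] for i in range(len(encoded) - width + 1)]

# Integer-encoded rhyme, copied from the tokenizer output shown earlier.
encoded = [2, 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
           2, 14, 15, 1, 16, 17, 18, 1, 3, 19, 20, 21]

windows = sliding_windows(encoded, 2)
inputs = [w[:-1] for w in windows]   # the context words
targets = [w[-1] for w in windows]   # the word to predict
print(len(windows), inputs[0], targets[0])
```

Setting `n_inputs` to 1 reproduces the one-word-in framing of the first model, so the context size becomes a single parameter to experiment with.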

In [53]:
data

'Jack and Jill went up the hill\n \n        To fetch a pail of water\n \n        Jack fell down and broke his crown\n \n        And Jill came tumbling after\n'

In [54]:
# Integer encode sequences of words
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]

In [56]:
# Retrieve vocabulary
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary size: %d' % vocab_size)

Vocabulary size: 22


In [57]:
# encode 2 words to 1 word
sequences = list()
for i in range(2, len(encoded)):
    sequence = encoded[i-2:i+1]
    sequences.append(sequence)
print(sequences)

[[2, 1, 3], [1, 3, 4], [3, 4, 5], [4, 5, 6], [5, 6, 7], [6, 7, 8], [7, 8, 9], [8, 9, 10], [9, 10, 11], [10, 11, 12], [11, 12, 13], [12, 13, 2], [13, 2, 14], [2, 14, 15], [14, 15, 1], [15, 1, 16], [1, 16, 17], [16, 17, 18], [17, 18, 1], [18, 1, 3], [1, 3, 19], [3, 19, 20], [19, 20, 21]]


In [58]:
# pad sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print(max_length)

3


In [59]:
# Split into input and output elements
sequences = array(sequences)
X = sequences[:,:-1]
y = sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)

In [60]:
# We can use the same model as before, we'll just edit the input length which will be the max_length - 1
# Define the model: We'll use the sequential model because we will need to predict classes
def word_pred_model(vocab_size):
    model = Sequential()
    model.add(Embedding(vocab_size, 10, input_length=max_length-1))
    model.add(LSTM(50))
    model.add(Dense(vocab_size, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model

In [61]:
model = word_pred_model(vocab_size)
model.fit(X, y, epochs=500, verbose=2)

Model: "sequential_4"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_4 (Embedding)      (None, 2, 10)             220       
_________________________________________________________________
lstm_4 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_4 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________



Epoch 1/500
 - 1s - loss: 3.0916 - accuracy: 0.0000e+00
Epoch 2/500
 - 0s - loss: 3.0910 - accuracy: 0.0000e+00
Epoch 3/500
 - 0s - loss: 3.0901 - accuracy: 0.0870
Epoch 4/500
 - 0s - loss: 3.0893 - accuracy: 0.0870
Epoch 5/500
 - 0s - loss: 3.0885 - accuracy: 0.0870
Epoch 6/500
 - 0s - loss: 3.0876 - accuracy: 0.0870
Epoch 7/500
 - 0s - loss: 3.0867 - accuracy: 0.0870
Epoch 8/500
 - 0s - loss: 3.0858 - accuracy: 0.0870
Epoch 9/500
 - 0s - loss: 3.0849 - accuracy: 0.0870
Epoch 10/500
 - 0s - loss: 3.0839 - accuracy: 0.0870
Epoch 11/500
 - 0s - loss: 3.0830 - accuracy: 0.0870
Epoch 12/500
 - 0s - loss: 3.0820 - accuracy: 0.0870
Epoch 13/500
 - 0s - loss: 3.0810 - accuracy: 0.0870
Epoch 14/500
 - 0s - loss: 3.0800 - accuracy: 0.0870
Epoch 15/500
 - 0s - loss: 3.0790 - accuracy: 0.0870
Epoch 16/500
 - 0s - loss: 3.0779 - accuracy: 0.0870
Epoch 17/500
 - 0s - loss: 3.0768 - accuracy: 0.0870
Epoch 18/500
 - 0s - loss: 3.0756 - accuracy: 0.0870
Epoch 19/500
 - 0s - loss: 3.0745 - accuracy: 0

Epoch 155/500
 - 0s - loss: 1.1486 - accuracy: 0.8696
Epoch 156/500
 - 0s - loss: 1.1254 - accuracy: 0.8696
Epoch 157/500
 - 0s - loss: 1.1025 - accuracy: 0.8696
Epoch 158/500
 - 0s - loss: 1.0799 - accuracy: 0.8696
Epoch 159/500
 - 0s - loss: 1.0576 - accuracy: 0.9130
Epoch 160/500
 - 0s - loss: 1.0357 - accuracy: 0.9130
Epoch 161/500
 - 0s - loss: 1.0141 - accuracy: 0.9130
Epoch 162/500
 - 0s - loss: 0.9929 - accuracy: 0.9130
Epoch 163/500
 - 0s - loss: 0.9719 - accuracy: 0.9130
Epoch 164/500
 - 0s - loss: 0.9514 - accuracy: 0.9130
Epoch 165/500
 - 0s - loss: 0.9311 - accuracy: 0.9130
Epoch 166/500
 - 0s - loss: 0.9113 - accuracy: 0.9130
Epoch 167/500
 - 0s - loss: 0.8918 - accuracy: 0.9130
Epoch 168/500
 - 0s - loss: 0.8727 - accuracy: 0.9130
Epoch 169/500
 - 0s - loss: 0.8539 - accuracy: 0.9130
Epoch 170/500
 - 0s - loss: 0.8356 - accuracy: 0.9130
Epoch 171/500
 - 0s - loss: 0.8176 - accuracy: 0.9130
Epoch 172/500
 - 0s - loss: 0.7999 - accuracy: 0.9130
Epoch 173/500
 - 0s - loss: 

Epoch 307/500
 - 0s - loss: 0.1058 - accuracy: 0.9565
Epoch 308/500
 - 0s - loss: 0.1053 - accuracy: 0.9565
Epoch 309/500
 - 0s - loss: 0.1047 - accuracy: 0.9565
Epoch 310/500
 - 0s - loss: 0.1042 - accuracy: 0.9565
Epoch 311/500
 - 0s - loss: 0.1037 - accuracy: 0.9565
Epoch 312/500
 - 0s - loss: 0.1032 - accuracy: 0.9565
Epoch 313/500
 - 0s - loss: 0.1027 - accuracy: 0.9565
Epoch 314/500
 - 0s - loss: 0.1022 - accuracy: 0.9565
Epoch 315/500
 - 0s - loss: 0.1017 - accuracy: 0.9565
Epoch 316/500
 - 0s - loss: 0.1012 - accuracy: 0.9565
Epoch 317/500
 - 0s - loss: 0.1007 - accuracy: 0.9565
Epoch 318/500
 - 0s - loss: 0.1003 - accuracy: 0.9565
Epoch 319/500
 - 0s - loss: 0.0998 - accuracy: 0.9565
Epoch 320/500
 - 0s - loss: 0.0994 - accuracy: 0.9565
Epoch 321/500
 - 0s - loss: 0.0990 - accuracy: 0.9565
Epoch 322/500
 - 0s - loss: 0.0985 - accuracy: 0.9565
Epoch 323/500
 - 0s - loss: 0.0981 - accuracy: 0.9565
Epoch 324/500
 - 0s - loss: 0.0977 - accuracy: 0.9565
Epoch 325/500
 - 0s - loss: 

Epoch 459/500
 - 0s - loss: 0.0736 - accuracy: 0.9565
Epoch 460/500
 - 0s - loss: 0.0735 - accuracy: 0.9565
Epoch 461/500
 - 0s - loss: 0.0734 - accuracy: 0.9565
Epoch 462/500
 - 0s - loss: 0.0733 - accuracy: 0.9565
Epoch 463/500
 - 0s - loss: 0.0733 - accuracy: 0.9565
Epoch 464/500
 - 0s - loss: 0.0732 - accuracy: 0.9565
Epoch 465/500
 - 0s - loss: 0.0731 - accuracy: 0.9565
Epoch 466/500
 - 0s - loss: 0.0731 - accuracy: 0.9565
Epoch 467/500
 - 0s - loss: 0.0730 - accuracy: 0.9565
Epoch 468/500
 - 0s - loss: 0.0729 - accuracy: 0.9565
Epoch 469/500
 - 0s - loss: 0.0728 - accuracy: 0.9565
Epoch 470/500
 - 0s - loss: 0.0728 - accuracy: 0.9565
Epoch 471/500
 - 0s - loss: 0.0727 - accuracy: 0.9565
Epoch 472/500
 - 0s - loss: 0.0726 - accuracy: 0.9565
Epoch 473/500
 - 0s - loss: 0.0726 - accuracy: 0.9565
Epoch 474/500
 - 0s - loss: 0.0725 - accuracy: 0.9565
Epoch 475/500
 - 0s - loss: 0.0724 - accuracy: 0.9565
Epoch 476/500
 - 0s - loss: 0.0724 - accuracy: 0.9565
Epoch 477/500
 - 0s - loss: 

<keras.callbacks.callbacks.History at 0x7fe0a06977f0>

In [62]:
# Generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words, max_length):
    in_text = seed_text
    #generate a fixed number of words
    for _ in range(n_words):
        # encode the text as integer
        encoded = tokenizer.texts_to_sequences([in_text])[0]
        # pre-pad sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
        # predict a word in the vocabulary
        yhat = model.predict_classes(encoded, verbose=0)
        # map predicted word index to word
        out_word = ''
        for word, index in tokenizer.word_index.items():
            if index == yhat:
                out_word = word
                break
        # append to input
        in_text += ' '+out_word
    return in_text

In [68]:
#evaluate the model
print(generate_seq(model, tokenizer, 'Jack and', 3, max_length-1))

Jack and jill came tumbling


In [70]:
#evaluate the model
print(generate_seq(model, tokenizer, 'fell down', 5, max_length-1))

fell down and broke his crown and


We can see that the choice of how the language model is framed must be compatible with the requirements on how the model will be used. Careful design is required when using language models in general, perhaps followed up by spot testing with sequence generation to confirm that the model requirements have been met.