
# Develop a Word-Based Neural Language Model

Language modeling involves predicting the next word in a sequence given the sequence of words already present. A language model is a key element in many natural language processing models such as machine translation and speech recognition. The choice of how the language model is framed must match how the language model is intended to be used.

Among neural language models, word-based models offer a general, flexible, and powerful approach to language modeling.

We will cover the following topics:

* Develop a good framing of a word-based language model for a given application.
* Develop one-word, two-word, and line-based framings for word-based language models.
* Generate sequences using a fit language model.

## Setup

```python
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
from pickle import dump

import matplotlib.pyplot as plt

from tensorflow.keras import backend as keras_backend
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, LSTM, Embedding

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import plot_model, to_categorical

%matplotlib inline
```

## Jack and Jill Nursery Rhyme

Jack and Jill is a simple nursery rhyme. It consists of four lines, as follows:

Jack and Jill went up the hill
To fetch a pail of water
Jack fell down and broke his crown
And Jill came tumbling after

We will use this as our source text for exploring different framings of a word-based language model.

```python
# source text
data = """ Jack and Jill went up the hill\n
  To fetch a pail of water\n
  Jack fell down and broke his crown\n
  And Jill came tumbling after\n """
```

## Framing Language Modeling

A statistical language model is learned from raw text and predicts the probability of the next word in the sequence given the words already present in the sequence. Language models are a key component in larger models for challenging natural language processing problems, like machine translation and speech recognition. They can also be developed as standalone models and used for generating new sequences that have the same statistical properties as the source text.

Language models both learn and predict one word at a time. Training involves providing sequences of words as input, processed one at a time, with a prediction made and learned for each input sequence. Similarly, when making predictions, the process can be seeded with one or a few words; the predicted words are then gathered and presented as input on subsequent predictions to build up a generated output sequence.

Therefore, each model will involve splitting the source text into input and output sequences, such that the model can learn to predict words. There are many ways to frame the sequences from a source text for language modeling. We will explore 3 different ways of developing word-based language models in the Keras deep learning library. There is no single best approach, just different framings that may suit different applications.

## Model 1: One-Word-In, One-Word-Out Sequences

We can start with a very simple model. Given one word as input, the model will learn to predict the next word in the sequence.

```python
X,      y
Jack,  and
and,   Jill
Jill,  went
...,   ...
```

**Step-1**

The first step is to encode the text as integers. Each lowercase word in the source text is assigned a unique integer and we can convert the sequences of words to sequences of integers.

```python
# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]
```

Keras provides the `Tokenizer` class that can be used to perform this encoding. First, the Tokenizer is fit on the source text to develop the mapping from words to unique integers. Then sequences of text can be converted to sequences of integers by calling the `texts_to_sequences()` method.
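To make the mapping concrete, here is a minimal pure-Python sketch of the same idea. It is a stand-in written for illustration, not the Keras implementation: it assumes words are lowercased, split on whitespace, and numbered from 1 in order of decreasing frequency.

```python
from collections import Counter

text = """Jack and Jill went up the hill
To fetch a pail of water
Jack fell down and broke his crown
And Jill came tumbling after"""

# lowercase and split on whitespace (the rhyme has no punctuation to strip)
words = text.lower().split()

# rank words by frequency; ties keep their order of first appearance
counts = Counter(words)
ranked = sorted(counts, key=lambda w: -counts[w])

# index 0 is left unused, so the most frequent word gets index 1
word_index = {word: i + 1 for i, word in enumerate(ranked)}

# convert the text to a sequence of integers
encoded = [word_index[w] for w in words]

print(word_index)
print(encoded[:7])
```

The word "and" appears three times, so it receives index 1 in this sketch, and the 25 words of the rhyme become a list of 25 integers.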

**Step-2**

We will need to know the size of the vocabulary later, both for defining the word embedding layer in the model and for encoding output words using a one hot encoding. The size of the vocabulary can be retrieved from the fit Tokenizer by accessing the `word_index` attribute.

```python
# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print('Vocabulary Size: %d' % vocab_size)
```

The size of the vocabulary is 21 words. We add one because the largest encoded word must be usable as an array index: words are encoded 1 to 21, so array indices run from 0 to 21, which is 22 positions.

**Step-3**

Next, we need to create sequences of words to fit the model with one word as input and one word as output.

```python
# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
  # each sequence is a pair: the current word and the word that follows it
  sequence = encoded[i - 1:i + 1]
  sequences.append(sequence)
print('Total Sequences: %d' % len(sequences))
```

**Step-4**

We can then split the sequences into input (X) and output elements (y). This is straightforward as we only have two columns in the data.

```python
# split into X and y elements
sequences = np.array(sequences)
X, y = sequences[:, 0], sequences[:, 1]
```

**Step-5**

We will fit our model to predict a probability distribution across all words in the vocabulary. That means that we need to turn the output element from a single integer into a one hot encoding with a 0 for every word in the vocabulary and a 1 for the actual word that the value represents.

This gives the network a ground truth to aim for from which we can calculate error and update the model. Keras provides the `to_categorical()` function that we can use to convert the integer to a one hot encoding while specifying the number of classes as the vocabulary size.

```python
# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)
```
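As an illustration of what a one hot encoded output looks like, the following sketch builds the vectors by hand in plain Python. The `one_hot` helper and the example values in `y` are assumptions made for this illustration; the tutorial itself relies on `to_categorical()`.

```python
def one_hot(index, num_classes):
    # vector of zeros with a single 1 at the position of the word's integer
    vector = [0] * num_classes
    vector[index] = 1
    return vector

vocab_size = 22  # 21 words plus the unused index 0
y = [1, 3, 4]    # example integer-encoded output words

encoded_y = [one_hot(i, vocab_size) for i in y]
for i, row in zip(y, encoded_y):
    print(i, row)
```

Each row has 22 positions, which is why the vocabulary size includes the extra slot for index 0.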

**Step-6**

We are now ready to define the neural network model. The model uses a learned word embedding in the input layer. This has one real-valued vector for each word in the vocabulary, where each word vector has a specified length. In this case we will use a 10-dimensional projection. The input sequence contains a single word, therefore `input_length=1`.

```python
# define the model
def define_model(vocab_size):
  model = Sequential()
  model.add(Embedding(vocab_size, 10, input_length=1))
  model.add(LSTM(50))
  model.add(Dense(vocab_size, activation='softmax'))
  # compile network
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  # summarize defined model
  model.summary()
  plot_model(model, to_file='model.png', show_shapes=True)
  return model
```
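As a sanity check on this architecture, the parameter counts that `model.summary()` reports can be reproduced by hand. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias.

```python
vocab_size = 22       # vocabulary size, including the unused index 0
embedding_dim = 10    # length of each word vector
lstm_units = 50       # memory units in the LSTM layer

# Embedding: one vector per vocabulary index
embedding_params = vocab_size * embedding_dim

# LSTM: four gates, each with input weights, recurrent weights, and a bias
lstm_params = 4 * (embedding_dim * lstm_units + lstm_units * lstm_units + lstm_units)

# Dense: one weight per (LSTM unit, word) pair plus one bias per word
dense_params = lstm_units * vocab_size + vocab_size

total = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total)
# 220 12200 1122 13542
```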



Tying all of this together, the complete code listing is provided below.

```python
# generate a sequence from the model
def generate_seq(model, tokenizer, seed_text, n_words):
  in_text, result = seed_text, seed_text

  # generate a fixed number of words
  for _ in range(n_words):
    # encode the text as integer
    encoded = tokenizer.texts_to_sequences([in_text])[0]
    encoded = np.array(encoded)

    # predict the index of the most likely next word
    # (argmax over predict() replaces predict_classes(), which newer TensorFlow 2.x releases no longer provide)
    yhat = np.argmax(model.predict(encoded, verbose=0), axis=-1)[0]
    # map predicted word index to word
    out_word = ''
    for word, index in tokenizer.word_index.items():
      if index == yhat:
        out_word = word
        break

    # the predicted word becomes the input for the next step
    in_text, result = out_word, result + ' ' + out_word
  return result

# define the model
def define_model(vocab_size):
  model = Sequential()
  model.add(Embedding(vocab_size, 10, input_length=1))
  model.add(LSTM(50))
  model.add(Dense(vocab_size, activation='softmax'))

  # compile model
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  # summarize defined model
  model.summary()
  plot_model(model, to_file='model.png', show_shapes=True)
  return model

# source text
data = """
  Jack and Jill went up the hill\n
  To fetch a pail of water\n
  Jack fell down and broke his crown\n
  And Jill came tumbling after\n
"""

# integer encode text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])
encoded = tokenizer.texts_to_sequences([data])[0]

# vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print(f'Vocabulary Size: {vocab_size}')

# create word -> word sequences
sequences = list()
for i in range(1, len(encoded)):
  sequence = encoded[i - 1: i + 1]
  sequences.append(sequence)
print(f'Total Sequences: {len(sequences)}')

# split into X and y elements
sequences = np.array(sequences)
X, y = sequences[:, 0], sequences[:, 1]

# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)

# define model
model = define_model(vocab_size)

# fit model
model.fit(X, y, epochs=500, verbose=2)

# evaluate the model
print(generate_seq(model, tokenizer, 'Jack', 6))
```

```
Vocabulary Size: 22
Total Sequences: 24
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 1, 10)             220       
_________________________________________________________________
lstm (LSTM)                  (None, 50)                12200     
_________________________________________________________________
dense (Dense)                (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
Epoch 1/500
1/1 - 0s - loss: 3.0915 - accuracy: 0.0000e+00
Epoch 2/500
1/1 - 0s - loss: 3.0908 - accuracy: 0.0417
Epoch 3/500
1/1 - 0s - loss: 3.0900 - accuracy: 0.1250
Epoch 4/500
1/1 - 0s - loss: 3.0893 - accuracy: 0.1250
Epoch 5/500
1/1 - 0s - loss: 3.0886 - accuracy: 0.1250
Epoch 6/500
1/1 - 0s - loss: 3.0878 - accuracy: 0.1250
...
```

```python
# evaluate the model
print(generate_seq(model, tokenizer, 'And', 6))
```

```
And jill went up the hill to
```

```python
print(generate_seq(model, tokenizer, 'To', 6))
```

```
To fetch a pail of water jack
```


This is a good first-cut language model, but it does not take full advantage of the LSTM's ability to handle sequences of input, which would let it use a broader context to disambiguate some of the ambiguous pairwise sequences.
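To see where the pairwise ambiguity comes from, the following plain-Python sketch lists every word in the rhyme that is followed by more than one distinct word. It works on raw lowercase words rather than the Tokenizer output, which is a simplification for illustration.

```python
from collections import defaultdict

text = """Jack and Jill went up the hill
To fetch a pail of water
Jack fell down and broke his crown
And Jill came tumbling after"""

words = text.lower().split()

# collect the set of words observed directly after each word
followers = defaultdict(set)
for current, nxt in zip(words, words[1:]):
    followers[current].add(nxt)

# keep only the words with more than one possible successor
ambiguous = {w: sorted(n) for w, n in followers.items() if len(n) > 1}
for word, options in ambiguous.items():
    print(word, '->', options)
```

Given only "jack" as input, a one-word model cannot tell whether "and" or "fell" should come next; the same holds for "and" and "jill".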

## Model 2: Line-by-Line Sequence

Another approach is to split up the source text line-by-line, then break each line down into a series of words that build up. 

```python
X,                                 y
_, _, _, _, _, Jack,              and
_, _, _, _, Jack, and,            Jill
_, _, _, Jack, and, Jill,         went
_, _, Jack, and, Jill, went,      up
_, Jack, and, Jill, went, up,     the
Jack, and, Jill, went, up, the,   hill
```

This approach may allow the model to use the context of each line in those cases where a simple one-word-in, one-word-out model creates ambiguity. This comes at the cost of not predicting words across lines, which might be fine for now if we are only interested in modeling and generating lines of text. Note that in this representation, we will need to pad the sequences so that they all meet a fixed input length.

**Step-1**

First, we can create the sequences of integers, line-by-line by using the Tokenizer already fit on the source text.
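The build-up of sequences within each line can be sketched in plain Python as follows. For readability this sketch keeps the words themselves instead of their integer codes, which is an assumption made for illustration only.

```python
lines = [
    'Jack and Jill went up the hill',
    'To fetch a pail of water',
    'Jack fell down and broke his crown',
    'And Jill came tumbling after',
]

sequences = []
for line in lines:
    words = line.lower().split()
    # every prefix of the line with at least two words becomes one sequence
    for i in range(1, len(words)):
        sequences.append(words[:i + 1])

print(len(sequences))
for seq in sequences[:3]:
    print(seq)
```

A line of n words yields n - 1 sequences, and no sequence crosses a line boundary.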

**Step-2**

Next, we can pad the prepared sequences. We can do this using the `pad_sequences()` function provided in Keras. This first involves finding the longest sequence, then using that as the length to which all other sequences are padded.
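The effect of pre-padding can be shown with a small stand-in written in plain Python. The `pre_pad` helper below is hypothetical and only mimics the behavior described above; it is not the Keras function.

```python
def pre_pad(sequences, value=0):
    # find the longest sequence, then left-pad every sequence to that length
    max_length = max(len(seq) for seq in sequences)
    return [[value] * (max_length - len(seq)) + seq for seq in sequences]

# example integer-encoded sequences of growing length
sequences = [[2, 1], [2, 1, 3], [2, 1, 3, 4]]
padded = pre_pad(sequences)

for row in padded:
    print(row)
```

Padding on the left keeps the most recent words, and the word to be predicted, at the end of every row.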

**Step-3**

Next, we can split the sequences into input and output elements, much like before.
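As a small illustration with made-up padded rows, the last column of each row is the output word and everything before it is the input:

```python
# example pre-padded, integer-encoded sequences
padded = [
    [0, 0, 2, 1],
    [0, 2, 1, 3],
    [2, 1, 3, 4],
]

# all but the last element form the input; the last element is the output
X = [row[:-1] for row in padded]
y = [row[-1] for row in padded]

print(X)
print(y)
```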

**Step-4**

The model can then be defined as before, except the input sequences are now longer than a single word. Specifically, they are `max_length - 1` in length; we subtract 1 because the maximum sequence length we calculated includes both the input and output elements.

**Step-5**

We can use the model to generate new sequences as before. The `generate_seq()` function can be updated to build up an input sequence by adding predictions to the list of input words each iteration.
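The build-up loop can be sketched without a trained network. In the example below, `predict_next` is a hypothetical stand-in for the model: a lookup table from the words seen so far to the next word. Only the loop structure is the point here.

```python
# hypothetical stand-in for a trained model
def predict_next(context):
    table = {
        ('jack',): 'fell',
        ('jack', 'fell'): 'down',
        ('jack', 'fell', 'down'): 'and',
        ('jack', 'fell', 'down', 'and'): 'broke',
    }
    return table[tuple(context)]

def generate(seed_text, n_words):
    in_text = seed_text
    for _ in range(n_words):
        # the whole text generated so far is the context for the next prediction
        context = in_text.lower().split()
        out_word = predict_next(context)
        # append the prediction so it becomes part of the next input
        in_text += ' ' + out_word
    return in_text

print(generate('Jack', 4))
```

Unlike the one-word model, the input grows by one word on every iteration, so each prediction can draw on everything generated so far.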


Tying all of this together, the complete code example is provided below.

```python
# generate a sequence from the model
def generate_seq(model, tokenizer, max_length, seed_text, n_words):
  in_text = seed_text

  # generate a fixed number of words
  for _ in range(n_words):
    # encode the text as integer
    encoded = tokenizer.texts_to_sequences([in_text])[0]
    # pre-pad sequences to a fixed length
    encoded = pad_sequences([encoded], maxlen=max_length, padding='pre')
    # predict the index of the most likely next word
    # (argmax over predict() replaces predict_classes(), which newer TensorFlow 2.x releases no longer provide)
    yhat = np.argmax(model.predict(encoded, verbose=0), axis=-1)[0]
    # map predicted word index to word
    out_word = ''
    for word, index in tokenizer.word_index.items():
      if index == yhat:
        out_word = word
        break

    # append the prediction so it is part of the next input
    in_text += ' ' + out_word
  return in_text

# define the model
def define_model(vocab_size, max_length):
  model = Sequential()
  model.add(Embedding(vocab_size, 10, input_length=max_length - 1))
  model.add(LSTM(50))
  model.add(Dense(vocab_size, activation='softmax'))

  # compile model
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  # summarize defined model
  model.summary()
  plot_model(model, to_file='model.png', show_shapes=True)
  return model

# source text
data = """
  Jack and Jill went up the hill\n
  To fetch a pail of water\n
  Jack fell down and broke his crown\n
  And Jill came tumbling after\n
"""

# prepare the tokenizer on the source text
tokenizer = Tokenizer()
tokenizer.fit_on_texts([data])

# determine the vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print(f'Vocabulary Size: {vocab_size}')

# create line-based sequences
sequences = list()
for line in data.split('\n'):
  encoded = tokenizer.texts_to_sequences([line])[0]
  for i in range(1, len(encoded)):
    sequence = encoded[: i + 1]
    sequences.append(sequence)
print(f'Total Sequences: {len(sequences)}')

# pad input sequences
max_length = max([len(seq) for seq in sequences])
sequences = pad_sequences(sequences, maxlen=max_length, padding='pre')
print(f'Max Sequence Length: {max_length}')

# split into input and output elements
sequences = np.array(sequences)
X, y = sequences[:, :-1], sequences[:, -1]
# one hot encode outputs
y = to_categorical(y, num_classes=vocab_size)

# define model
model = define_model(vocab_size, max_length)

# fit model
model.fit(X, y, epochs=500, verbose=2)

# evaluate the model
print(generate_seq(model, tokenizer, max_length - 1, 'Jack', 4))
```

```
Vocabulary Size: 22
Total Sequences: 21
Max Sequence Length: 7
Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 6, 10)             220       
_________________________________________________________________
lstm_2 (LSTM)                (None, 50)                12200     
_________________________________________________________________
dense_2 (Dense)              (None, 22)                1122      
Total params: 13,542
Trainable params: 13,542
Non-trainable params: 0
_________________________________________________________________
Epoch 1/500
1/1 - 0s - loss: 3.0900 - accuracy: 0.0000e+00
Epoch 2/500
1/1 - 0s - loss: 3.0885 - accuracy: 0.0952
Epoch 3/500
1/1 - 0s - loss: 3.0869 - accuracy: 0.0952
Epoch 4/500
1/1 - 0s - loss: 3.0853 - accuracy: 0.0952
Epoch 5/500
1/1 - 0s - loss: 3.0837 - accuracy: 0.0952
Epoch 6/500
1/1 - 0s - loss: 3.0820 - ...
```

```python
print(generate_seq(model, tokenizer, max_length - 1, 'Jill', 4))
```

```
Jill jill came tumbling after
```

```python
print(generate_seq(model, tokenizer, max_length - 1, 'fell', 6))
print(generate_seq(model, tokenizer, max_length - 1, 'And', 4))
```

```
fell fell down and broke his crown
And jill came tumbling after
```


## Model 3: Two-Words-In, One-Word-Out Sequence