
# Develop a Character-Based Neural Language Model

A language model predicts the next word in the sequence based on the specific words that have come before it in the sequence. It is also possible to develop language models at the character level using neural networks. The benefit of character-based language models is their small vocabulary and flexibility in handling any words, punctuation, and other document structure. This comes at the cost of requiring larger models that are slower to train.

Nevertheless, in the field of neural language models, character-based models offer a lot of promise for a general, flexible and powerful approach to language modeling.

We will cover the following topics:

* Prepare text for character-based language modeling.
* Develop a character-based language model using LSTMs.
* Use a trained character-based language model to generate text.

## Setup

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
from pickle import dump

import matplotlib.pyplot as plt

from tensorflow.keras import backend as keras_backend
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, LSTM

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import plot_model, to_categorical

%matplotlib inline

## Sing a Song of Sixpence Dataset

The nursery rhyme Sing a Song of Sixpence is well known in the West. The first verse is common, but there is also a four-verse version that we will use to develop our character-based language model. It is short, so fitting the model will be fast, but not so short that we won't see anything interesting.

## Data Preparation

The first step is to prepare the text data, starting with the design of the language model. This section covers the following steps:

1. Language Model Design
2. Load Text
3. Clean Text
4. Create Sequences
5. Save Sequences




### Language Model Design

A language model must be trained on the text, and in the case of a character-based language model, the input and output sequences must be characters. The number of characters used as input will also define the number of characters that will need to be provided to the model in order to elicit the first predicted character. After the first character has been generated, it can be appended to the input sequence and used as input for the model to generate the next character.

Longer sequences offer more context for the model to learn what character to output next but take longer to train and impose more burden on seeding the model when generating text.
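The feed-back loop described above can be sketched without any neural network. In the minimal example below, the "model" is only a lookup table built from a toy string; `generate`, `table`, and `toy_text` are hypothetical names chosen for illustration and are not part of this notebook.

```python
def generate(predict_next, seed, n_chars, seq_length):
    """Repeatedly predict one character and feed it back in as input."""
    text = seed
    for _ in range(n_chars):
        context = text[-seq_length:]   # only the last seq_length characters are used
        text += predict_next(context)  # append the prediction to the input
    return text

# Stand-in "model" (an assumption): map each 2-character context to the next character.
toy_text = "abcabcabc"
seq_length = 2
table = {toy_text[i - seq_length:i]: toy_text[i]
         for i in range(seq_length, len(toy_text))}

generated = generate(table.get, "ab", 4, seq_length)
print(generated)
```

A trained LSTM plays the role of `predict_next` later in this tutorial; the loop structure stays the same.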

We will use an arbitrary length of 10 characters for this model. There is not a lot of text, and 10 characters is a few words. We can now transform the raw text into a form that our model can learn; specifically, input and output sequences of characters.
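As a rough sketch of what "input and output sequences of characters" means, the example below slides a window of 10 input characters plus 1 output character over the first line of the rhyme. The variable names are illustrative assumptions.

```python
# First line of the rhyme, used here as a short stand-in for the full text.
text = "Sing a song of sixpence"
length = 10  # number of input characters per sequence

# Each window holds `length` input characters plus 1 output character.
windows = [text[i - length:i + 1] for i in range(length, len(text))]

for window in windows[:3]:
    # Everything but the last character is input; the last one is the target.
    print(repr(window[:-1]), '->', repr(window[-1]))
```

Each window is 11 characters long, which is why the saved sequences later split cleanly into 10 inputs and 1 output.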

Tying all of this together, the complete code listing is provided below.

In [27]:
# load doc into memory
def load_doc(filename):
  # open the file as read only
  file = open(filename, 'r')
  # read all text
  text = file.read()
  # close the file
  file.close()

  return text

# save sequences to file, one sequence per line
def save_doc(lines, filename):
  data = '\n'.join(lines)
  file = open(filename, 'w')
  file.write(data)
  file.close()

# load text
raw_text = load_doc('rhyme.txt')
print(raw_text)

# clean text
tokens = raw_text.split()
raw_text = ' '.join(tokens)

# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
  seq = raw_text[i - length: i + 1]   # select `length` input characters plus 1 output character
  sequences.append(seq)
print(f'\nTotal Sequences: {len(sequences)}')

# save sequences to file
print(sequences[:10])
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)

Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie.

When the pie was opened
The birds began to sing;
Wasn't that a dainty dish,
To set before the king.

The king was in his counting house,
Counting out his money;
The queen was in the parlour,
Eating bread and honey.

The maid was in the garden,
Hanging out the clothes,
When down came a blackbird
And pecked off her nose.

Total Sequences: 399
['Sing a song', 'ing a song ', 'ng a song o', 'g a song of', ' a song of ', 'a song of s', ' song of si', 'song of six', 'ong of sixp', 'ng of sixpe']


## Train Language Model

We will develop a neural language model for the prepared sequence data. The model will read encoded characters and predict the next character in the sequence.

A Long Short-Term Memory recurrent neural network hidden layer will be used to learn the context from the input sequence in order to make the predictions.

We will work through the following steps:

1. Load Data
2. Encode Sequences
3. Split Inputs and Output
4. Fit Model
5. Save Model

### Load Data

The first step is to load the prepared character sequence data from `char_sequences.txt`.

```python
# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')
```

### Encode Sequences

The sequences of characters must be encoded as integers. This means that each unique character will be assigned a specific integer value and each sequence of characters will be encoded as a sequence of integers. We can create the mapping given a sorted set of unique characters in the raw input data. The mapping is a dictionary of character values to integer values.

```python
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
```

Next, we can process each sequence of characters one at a time and use the dictionary mapping to look up the integer value for each character.

```python
sequences = list()
for line in lines:
  # integer encode line
  encoded_seq = [mapping[char] for char in line]
  sequences.append(encoded_seq)
```

The result is a list of integer lists. We need to know the size of the vocabulary later. We can retrieve this as the size of the dictionary mapping.

```python
# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)
```
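To make the mapping and encoding steps concrete, the sketch below applies the same construction to a toy string. The string and the `toy_` names are assumptions for illustration only.

```python
# Toy input standing in for the contents of char_sequences.txt (assumed).
toy_text = "a song"

# Same construction as above: sorted unique characters -> integer index.
toy_chars = sorted(set(toy_text))
toy_mapping = {c: i for i, c in enumerate(toy_chars)}

# Look up the integer for each character in turn.
toy_encoded = [toy_mapping[c] for c in toy_text]

print(toy_mapping)
print(toy_encoded)
```

Because the characters are sorted before being numbered, the same text always produces the same mapping.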


### Split Inputs and Output

Now that the sequences have been integer encoded, we can separate the columns into input and output sequences of characters. We can do this using a simple array slice.

```python
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
```

Next, we need to one hot encode each character. That is, each character becomes a vector as long as the vocabulary (38 elements) with a 1 marked for the specific character. This provides a more precise input representation for the network. It also provides a clear objective for the network to predict, where a probability distribution over characters can be output by the model and compared to the ideal case of all 0 values with a 1 for the actual next character.

We can use the `to_categorical()` function in the Keras API to one hot encode the input and output sequences.

```python
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = np.array(sequences)
y = to_categorical(y, num_classes=vocab_size)
```
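For intuition, one hot encoding can be written out by hand. The sketch below is a plain-Python illustration, not the Keras implementation; `one_hot` and the toy values are hypothetical.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1 at position `index`."""
    vector = [0] * num_classes
    vector[index] = 1
    return vector

# Hypothetical 4-character vocabulary and a short integer-encoded sequence.
toy_vocab_size = 4
toy_sequence = [2, 0, 3]

toy_one_hot = [one_hot(i, toy_vocab_size) for i in toy_sequence]
for row in toy_one_hot:
    print(row)
```

A sequence of 3 integers becomes 3 rows of 4 values each; in the same way, each 10-character input here becomes a 10 by 38 array.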

### Fit Model

The model is defined with an input layer that takes sequences that have 10 time steps and 38 features for the one hot encoded input sequences. Rather than specify these numbers, we use the second and third dimensions on the X input data. This is so that if we change the length of the sequences or size of the vocabulary, we do not need to change the model definition. 

The model has a single LSTM hidden layer with 75 memory cells, chosen with a little trial and error. 

The model has a fully connected output layer that outputs one vector with a probability distribution across all characters in the vocabulary. 

A softmax activation function is used on the output layer to ensure the output has the properties of a probability distribution.

```python
# define the model
def define_model(X):
  model = Sequential()
  model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
  model.add(Dense(vocab_size, activation='softmax'))
  # compile model
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  # summarize defined model
  model.summary()
  plot_model(model, to_file='model.png', show_shapes=True)
  return model
```
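As a sanity check on the model definition, the parameter counts and the number of batches per epoch can be worked out by hand. The sketch below assumes a standard LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and the Keras default batch size of 32; the variable names are illustrative.

```python
units = 75         # LSTM memory cells, as in define_model()
features = 38      # vocabulary size reported for this dataset
n_sequences = 399  # total sequences reported during data preparation
batch_size = 32    # Keras default when fit() is not given a batch_size

# An LSTM layer has 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (units * (features + units) + units)
# The Dense layer has one weight per (unit, class) pair plus one bias per class.
dense_params = units * features + features
# Batches per epoch, rounded up.
batches_per_epoch = -(-n_sequences // batch_size)

print(lstm_params, dense_params, lstm_params + dense_params, batches_per_epoch)
```

These figures agree with the model summary and the `13/13` progress counter printed by the complete example.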

The model is learning a multiclass classification problem, therefore we use the categorical log loss intended for this type of problem. The efficient Adam implementation of gradient descent is used to optimize the model, and accuracy is reported at the end of each epoch. The model is fit for 100 training epochs, again found with a little trial and error.
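To show how softmax and the categorical log loss fit together, here is a plain-Python sketch (not the Keras implementation; the function names and the choice of target index are assumptions). It computes the loss for a model that has learned nothing yet and therefore spreads its probability evenly over all 38 characters.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def categorical_crossentropy(one_hot_target, probs):
    """Negative log probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probs))

vocab = 38  # vocabulary size reported for this dataset

# Equal scores for every character give a uniform distribution of 1/38 each.
uniform = softmax([0.0] * vocab)
target = [0] * vocab
target[5] = 1  # arbitrary "true" next character, chosen for illustration

loss = categorical_crossentropy(target, uniform)
print(round(loss, 2))
```

The result equals the natural log of 38, roughly 3.64, which is close to the loss reported for the first training epoch. Training then drives the loss down as the model moves probability toward the correct characters.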

### Save Model

After the model is fit, we save it to file for later use. The Keras model API provides the save() function that we can use to save the model to a single file, including weights and topology information.

```python
# save the model to file
model.save('model.h5')
```

We also save the mapping from characters to integers that we will need to encode any input when using the model and decode any output from the model.

```python
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))
```

### Complete Example

Tying all of this together, the complete code listing for fitting the character-based neural language model is listed below.

In [28]:
from os import listdir
from pickle import dump


# define the model
def define_model(X):
  model = Sequential()
  model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
  model.add(Dense(vocab_size, activation='softmax'))
  # compile model
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  # summarize defined model
  model.summary()
  plot_model(model, to_file='model.png', show_shapes=True)
  return model

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
  # integer encode line
  encoded_seq = [mapping[char] for char in line]
  sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print(f'Vocabulary Size: {str(vocab_size)}')

# separate into input and output
sequences = np.array(sequences)
X, y = sequences[:, :-1], sequences[:, -1]
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = np.array(sequences)
y = to_categorical(y, num_classes=vocab_size)

# define model
model = define_model(X)

# fit model
model.fit(X, y, epochs=100, verbose=2)

# save the model to file
model.save('model.h5')

# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))

Vocabulary Size: 38
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 75)                34200     
_________________________________________________________________
dense_1 (Dense)              (None, 38)                2888      
Total params: 37,088
Trainable params: 37,088
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
13/13 - 0s - loss: 3.6000 - accuracy: 0.0902
Epoch 2/100
13/13 - 0s - loss: 3.4634 - accuracy: 0.1679
Epoch 3/100
13/13 - 0s - loss: 3.1387 - accuracy: 0.1905
Epoch 4/100
13/13 - 0s - loss: 3.0259 - accuracy: 0.1905
Epoch 5/100
13/13 - 0s - loss: 3.0062 - accuracy: 0.1905
Epoch 6/100
13/13 - 0s - loss: 2.9777 - accuracy: 0.1905
Epoch 7/100
13/13 - 0s - loss: 2.9739 - accuracy: 0.1905
Epoch 8/100
13/13 - 0s - loss: 2.9422 - accuracy: 0.1905
Epoch 9/100
13/13 - 0s - loss: 2.947

We are now ready to use the trained model to generate text.

## Generate Text

We will use the learned language model to generate new sequences of text that have the same statistical properties. This section is divided into two parts:

1. Load Model
2. Generate Characters

### Load Model

The first step is to load the model saved to the file model.h5.


```python
# load the model
model = load_model('model.h5')
```

We also need to load the pickled dictionary for mapping characters to integers from the file mapping.pkl.

```python
# load the mapping
from pickle import load
mapping = load(open('mapping.pkl', 'rb'))
```








### Generate Characters

We must provide sequences of 10 characters as input to the model in order to start the generation process. We will pick these manually. A given input sequence will need to be prepared in the same way as preparing the training data for the model. 

First, the sequence of characters must be integer encoded using the loaded mapping.

```python
# encode the characters as integers
encoded = [mapping[char] for char in in_text]
```

Next, the integers need to be one hot encoded using the `to_categorical()` Keras function. We also need to reshape the sequence to be three-dimensional, as we only have one sequence and LSTMs require all input to be three-dimensional (samples, time steps, features).

```python
# one hot encode
encoded = to_categorical(encoded, num_classes=len(mapping))
encoded = encoded.reshape(1, encoded.shape[0], encoded.shape[1])
```

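We can then use the model to predict the next character in the sequence. We use `predict_classes()` instead of `predict()` to directly select the integer for the character with the highest probability instead of getting the full probability distribution across the entire set of characters. Later TensorFlow 2.x releases removed `predict_classes()`; there, taking `np.argmax()` over the output of `predict()` yields the same integer.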

```python
# predict character
yhat = model.predict_classes(encoded, verbose=0)
```
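Selecting the class with the highest probability amounts to taking an argmax over the predicted distribution. A minimal plain-Python sketch, using a made-up distribution over a 4-character vocabulary:

```python
# Hypothetical probability distribution over a 4-character vocabulary.
toy_probs = [0.10, 0.65, 0.05, 0.20]

# Greedy choice: index of the largest probability.
toy_yhat = max(range(len(toy_probs)), key=lambda i: toy_probs[i])
print(toy_yhat)
```

Because the choice is always the single most likely character, generation from a given seed is deterministic.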

We can then decode this integer by looking up the mapping to see the character to which it maps.

```python
out_char = ''
for char, index in mapping.items():
  if index == yhat:
    out_char = char
    break
```
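The loop above scans the whole dictionary for every generated character. One possible alternative, sketched here with a hypothetical `demo_mapping`, is to invert the dictionary once and then look each integer up directly.

```python
# Hypothetical character-to-integer mapping, in the same form as `mapping`.
demo_mapping = {' ': 0, 'a': 1, 'g': 2, 'n': 3}

# Invert once so each lookup is a single dictionary access.
demo_index_to_char = {index: char for char, index in demo_mapping.items()}

demo_yhat = 2
demo_char = demo_index_to_char[demo_yhat]
print(demo_char)
```

For a vocabulary of 38 characters the difference in speed is negligible; the inverted dictionary mainly makes the intent easier to read.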

This character can then be added to the input sequence. We then need to make sure that the input sequence is 10 characters by truncating the first character from the input sequence text.

We can use the `pad_sequences()` function from the Keras API that can perform this truncation operation.
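The truncation itself is simple to picture. The sketch below is a plain-Python imitation of the behaviour relied on here (drop items from the front, and pad at the front with zeros when the input is too short); `keep_last` is a hypothetical helper, not the Keras function.

```python
def keep_last(encoded, maxlen, pad_value=0):
    """Keep the last `maxlen` items, left-padding with `pad_value` if too short."""
    trimmed = encoded[-maxlen:]
    return [pad_value] * (maxlen - len(trimmed)) + trimmed

# Hypothetical integer-encoded text that has grown to 12 characters.
toy_encoded = list(range(1, 13))

truncated = keep_last(toy_encoded, 10)  # the two oldest items are dropped
padded = keep_last([7, 8, 9], 5)        # short input is padded at the front

print(truncated)
print(padded)
```

Truncating from the front keeps the most recent characters, which are the ones the model needs as context for its next prediction.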


### Complete Example

We can put all of this together in a single example.

In [29]:
from pickle import load

# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
  in_text = seed_text
  # generate a fixed number of characters
  for _ in range(n_chars):
    # encode the characters as integers
    encoded = [mapping[char] for char in in_text]
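    # truncate sequences to a fixed length
    encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
    # one hot encode; pad_sequences returned shape (1, seq_length), so the result
    # is already (samples, time steps, features) and must not be reshaped again
    encoded = to_categorical(encoded, num_classes=len(mapping))
    # predict character: argmax over predict() is equivalent to predict_classes(),
    # which later TensorFlow 2.x releases removed
    yhat = int(np.argmax(model.predict(encoded, verbose=0), axis=-1)[0])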
    # reverse map integer to character
    out_char = ''
    for char, index in mapping.items():
      if index == yhat:
        out_char = char
        break
    # append to input
    in_text += out_char
  return in_text

# load the model
model = load_model('model.h5')

# load the mapping
mapping = load(open('mapping.pkl', 'rb'))



In [30]:
# test start of rhyme
print(generate_seq(model, mapping, 10, 'Sing a son', 20))


In [31]:
# test mid-line
print(generate_seq(model, mapping, 10, 'king was i', 20))


In [32]:
# test not in original
print(generate_seq(model, mapping, 10, 'hello worl', 20))
