<a href="https://colab.research.google.com/github/rahiakela/deep-learning-for-nlp-by-jason-brownlee/blob/part-5-language-modeling/1_develop_character_based_neural_language_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Develop a Character-Based Neural Language Model

A language model predicts the next word in a sequence based on the specific words that have come before it. It is also possible to develop language models at the character level using neural networks. The benefit of character-based language models is their small vocabulary and flexibility in handling any words, punctuation, and other document structure. This comes at the cost of requiring larger models that are slower to train.

Nevertheless, in the field of neural language models, character-based models offer a lot of promise for a general, flexible and powerful approach to language modeling.

We will cover the following topics:

* Prepare text for character-based language modeling.
* Develop a character-based language model using LSTMs.
* Use a trained character-based language model to generate text.

## Setup

In [0]:
from __future__ import absolute_import, division, print_function, unicode_literals

try:
  # %tensorflow_version only exists in Colab.
  %tensorflow_version 2.x
except Exception:
  pass
import tensorflow as tf
from tensorflow import keras
import numpy as np
import pandas as pd
from pickle import dump

import matplotlib.pyplot as plt

from tensorflow.keras import backend as keras_backend
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import plot_model, to_categorical

%matplotlib inline

## Sing a Song of Sixpence Dataset

The nursery rhyme Sing a Song of Sixpence is well known in the West. The first verse is common, but there is also a four-verse version that we will use to develop our character-based language model. It is short, so fitting the model will be fast, but not so short that we won't see anything interesting.

## Data Preparation

The first step is to prepare the text data. We will start by defining the type of language model, then work through the remaining steps:

1. Language Model Design
2. Load Text
3. Clean Text
4. Create Sequences
5. Save Sequences




### Language Model Design

A language model must be trained on the text, and in the case of a character-based language model, the input and output sequences must be characters. The number of characters used as input will also define the number of characters that will need to be provided to the model in order to elicit the first predicted character. After the first character has been generated, it can be appended to the input sequence and used as input for the model to generate the next character.

Longer sequences offer more context for the model to learn what character to output next but take longer to train and impose more burden on seeding the model when generating text.
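The generation loop described above can be sketched without a trained network. In this minimal sketch, `predict_next` is a hypothetical stand-in for the model (a simple lookup table built from a short sample string), and `sample_text`, `generate`, and `seq_length` are illustrative names rather than part of the tutorial code.

```python
# Sketch only: `predict_next` is a hypothetical stand-in for the trained model.
sample_text = "Sing a song of sixpence, A pocket full of rye."
seq_length = 10

# Lookup table from each 10-character context to the character that follows it.
next_char = {}
for i in range(seq_length, len(sample_text)):
    next_char[sample_text[i - seq_length:i]] = sample_text[i]

def predict_next(context):
    # A real model would return the most likely character given the context.
    return next_char[context]

def generate(seed, n_chars):
    text = seed
    for _ in range(n_chars):
        context = text[-seq_length:]     # keep only the last 10 characters
        text += predict_next(context)    # append the prediction and slide on
    return text

generated = generate("Sing a son", 8)
print(generated)
```

The seed must supply exactly as many characters as the model expects, which is the seeding burden that grows with longer input sequences.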

We will use an arbitrary length of 10 characters for this model. There is not a lot of text, and 10 characters is a few words. We can now transform the raw text into a form that our model can learn; specifically, input and output sequences of characters.
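As a minimal sketch of that transformation, the snippet below slides a window of 10 input characters plus 1 output character across a short string. The string `sample_text` is an illustrative stand-in for the cleaned rhyme, and the variable names are assumptions made for this example.

```python
# Sketch: each training sequence holds 10 input characters and 1 output character.
sample_text = "Sing a song of sixpence, A pocket full of rye."
length = 10  # number of input characters per sequence

windows = []
for i in range(length, len(sample_text)):
    window = sample_text[i - length:i + 1]  # 10 inputs + 1 target
    windows.append(window)

# Split the first window into its input part and its target character.
first_inputs, first_target = windows[0][:-1], windows[0][-1]
print(repr(first_inputs), '->', repr(first_target))
print(len(windows))
```

A text of n characters yields n - 10 sequences, because the first 10 characters can only ever serve as input.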

Tying all of this together, the complete code listing is provided below.

In [2]:
# load doc into memory
def load_doc(filename):
  # open the file as read only; the with block closes it automatically
  with open(filename, 'r') as file:
    # read all text
    text = file.read()
  return text

# save sequences to file, one sequence per line
def save_doc(lines, filename):
  data = '\n'.join(lines)
  with open(filename, 'w') as file:
    file.write(data)

# load text
raw_text = load_doc('rhyme.txt')
print(raw_text)

# clean text
tokens = raw_text.split()
raw_text = ' '.join(tokens)

# organize into sequences of characters
length = 10
sequences = list()
for i in range(length, len(raw_text)):
  seq = raw_text[i - length: i + 1]   # select 10 input characters plus 1 output character
  sequences.append(seq)
print(f'\nTotal Sequences: {len(sequences)}')
print(sequences[:10])

# save sequences to file
out_filename = 'char_sequences.txt'
save_doc(sequences, out_filename)

Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie.

When the pie was opened
The birds began to sing;
Wasn't that a dainty dish,
To set before the king.

The king was in his counting house,
Counting out his money;
The queen was in the parlour,
Eating bread and honey.

The maid was in the garden,
Hanging out the clothes,
When down came a blackbird
And pecked off her nose.

Total Sequences: 399
['Sing a song', 'ing a song ', 'ng a song o', 'g a song of', ' a song of ', 'a song of s', ' song of si', 'song of six', 'ong of sixp', 'ng of sixpe']


## Train Language Model

We will develop a neural language model for the prepared sequence data. The
model will read encoded characters and predict the next character in the sequence. 

A Long Short-Term Memory recurrent neural network hidden layer will be used to learn the context from the input sequence in order to make the predictions.

We will work through the following steps:

1. Load Data
2. Encode Sequences
3. Split Inputs and Output
4. Fit Model
5. Save Model

### Load Data

The first step is to load the prepared character sequence data from char_sequences.txt.

```python
# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')
```

### Encode Sequences

The sequences of characters must be encoded as integers. This means that each unique character will be assigned a specific integer value and each sequence of characters will be encoded as a sequence of integers. We can create the mapping given a sorted set of unique characters in the raw input data. The mapping is a dictionary of character values to integer values.

```python
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
```

Next, we can process each sequence of characters one at a time and use the dictionary mapping to look up the integer value for each character.

```python
sequences = list()
for line in lines:
  # integer encode line
  encoded_seq = [mapping[char] for char in line]
  sequences.append(encoded_seq)
```
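To make the encoding concrete, here is a small sketch on two hand-written sequences. The `lines` list is illustrative data, not the contents of the saved file. Note that the mapping is built from the newline-joined text, so the newline character also receives an index, just as it does when `raw_text` is the full file content.

```python
# Sketch: build a character-to-integer mapping and encode two sample lines.
lines = ['Sing a song', 'ing a song ']
raw = '\n'.join(lines)  # mimics the file content, newline separators included

chars = sorted(set(raw))
mapping = {c: i for i, c in enumerate(chars)}

# Look up the integer for every character in every line.
encoded = [[mapping[ch] for ch in line] for line in lines]

print(mapping)
print(encoded[0])
```

Sorting the characters first keeps the mapping deterministic, so the same text always produces the same integer codes.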

The result is a list of integer lists. We need to know the size of the vocabulary later. We can retrieve this as the size of the dictionary mapping.

```python
# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)
```


### Split Inputs and Output

Now that the sequences have been integer encoded, we can separate the columns into input and output sequences of characters. We can do this using a simple array slice.

```python
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]
```

Next, we need to one hot encode each character. That is, each character becomes a vector as long as the vocabulary (38 elements) with a 1 marked for the specific character. This provides a more precise input representation for the network. It also provides a clear objective for the
network to predict, where a probability distribution over characters can be output by the model and compared to the ideal case of all 0 values with a 1 for the actual next character. 

We can use the to_categorical() function in the Keras API to one hot encode the input and output sequences.

```python
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = np.array(sequences)
y = to_categorical(y, num_classes=vocab_size)
```
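The effect of one hot encoding on the data shape can be illustrated in plain Python. This is a sketch of the idea only, not the Keras implementation. The helper `one_hot` and the toy vocabulary of 5 characters are assumptions made for the example.

```python
# Sketch: one hot encode integer sequences by hand to see the resulting shape.
def one_hot(index, num_classes):
    vec = [0] * num_classes
    vec[index] = 1  # mark the position of the character
    return vec

vocab_size = 5                      # toy vocabulary, not the 38 used in the tutorial
X_int = [[0, 3, 1], [2, 2, 4]]      # two input sequences of three characters
y_int = [4, 0]                      # one output character per sequence

X_onehot = [[one_hot(i, vocab_size) for i in seq] for seq in X_int]
y_onehot = [one_hot(i, vocab_size) for i in y_int]

# (samples, time steps, features) is the layout the LSTM layer expects
shape = (len(X_onehot), len(X_onehot[0]), len(X_onehot[0][0]))
print(shape)
print(y_onehot[0])
```

With the real data the same layout becomes (samples, 10, 38): one row per sequence, 10 time steps, and one feature per character in the vocabulary.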

### Fit Model

The model is defined with an input layer that takes sequences that have 10 time steps and 38 features for the one hot encoded input sequences. Rather than specify these numbers, we use the second and third dimensions on the X input data. This is so that if we change the length of the sequences or size of the vocabulary, we do not need to change the model definition. 

The model has a single LSTM hidden layer with 75 memory cells, chosen with a little trial and error. 

The model has a fully connected output layer that outputs one vector with a probability distribution across all characters in the vocabulary. 

A softmax activation function is used on the output layer to ensure the output has the properties of a probability distribution.

```python
# define the model
def define_model(X):
  model = Sequential()
  model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
  model.add(Dense(vocab_size, activation='softmax'))
  # compile model
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  # summarize defined model
  model.summary()
  plot_model(model, to_file='model.png', show_shapes=True)
  return model
```
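The parameter counts that model.summary() reports for this architecture can be checked by hand. The sketch below assumes a standard LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and a Dense layer with a bias. The helper names are illustrative.

```python
# Sketch: reproduce the parameter counts for LSTM(75) on 38 features and Dense(38).
def lstm_params(units, features):
    # four gates, each with input weights, recurrent weights, and a bias vector
    return 4 * (units * (features + units) + units)

def dense_params(inputs, outputs):
    # one weight per input-output pair plus one bias per output
    return inputs * outputs + outputs

lstm = lstm_params(units=75, features=38)
dense = dense_params(inputs=75, outputs=38)
total = lstm + dense

print(lstm, dense, total)
```

These figures agree with the 34,200, 2,888, and 37,088 parameters shown in the model summary of the complete example.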

The model is learning a multiclass classification problem, therefore we use the categorical log loss intended for this type of problem. The efficient Adam implementation of gradient descent is used to optimize the model, and accuracy is reported at the end of each epoch. The model is fit for 100 training epochs, again found with a little trial and error.
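To give a feel for the loss values, the following sketch computes a softmax distribution and the categorical cross-entropy for a single prediction in plain Python. The function names are illustrative, and the uniform scores are an assumption standing in for an untrained model.

```python
import math

# Sketch: softmax turns raw scores into a probability distribution.
def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Categorical cross-entropy against a one hot target reduces to
# the negative log probability assigned to the correct character.
def categorical_cross_entropy(probs, target_index):
    return -math.log(probs[target_index])

vocab_size = 38
uniform = softmax([0.0] * vocab_size)  # equal score for every character
loss_uniform = categorical_cross_entropy(uniform, 0)

print(round(loss_uniform, 4))
```

A model that spreads probability evenly over 38 characters scores a loss of ln(38), roughly 3.64, which is close to the first-epoch loss in the training log of the complete example.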

### Save Model

After the model is fit, we save it to file for later use. The Keras model API provides the save() function that we can use to save the model to a single file, including weights and topology information.

```python
# save the model to file
model.save('model.h5')
```

We also save the mapping from characters to integers that we will need to encode any input when using the model and decode any output from the model.

```python
# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))
```
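Decoding a prediction needs the mapping in the opposite direction, from integer back to character. The sketch below uses a toy mapping and an in-memory pickle round trip in place of the mapping.pkl file. The names `index_to_char` and `predicted` are assumptions made for illustration.

```python
import pickle

# Sketch: restore a saved mapping and invert it to decode model output.
mapping = {' ': 0, 'a': 1, 'g': 2, 'n': 3}       # toy character-to-integer mapping
restored = pickle.loads(pickle.dumps(mapping))    # stands in for dump() then load()

# Invert the dictionary so integers can be looked up as characters.
index_to_char = {i: c for c, i in restored.items()}

predicted = [1, 0, 2, 3]  # hypothetical integer predictions
decoded = ''.join(index_to_char[i] for i in predicted)
print(repr(decoded))
```

Because every character maps to a unique integer, the inverted dictionary loses no information.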

### Complete Example

Tying all of this together, the complete code listing for fitting the character-based neural language model is listed below.

In [5]:
from pickle import dump


# define the model
def define_model(X):
  model = Sequential()
  model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
  model.add(Dense(vocab_size, activation='softmax'))
  # compile model
  model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
  # summarize defined model
  model.summary()
  plot_model(model, to_file='model.png', show_shapes=True)
  return model

# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
  # integer encode line
  encoded_seq = [mapping[char] for char in line]
  sequences.append(encoded_seq)

# vocabulary size
vocab_size = len(mapping)
print(f'Vocabulary Size: {str(vocab_size)}')

# separate into input and output
sequences = np.array(sequences)
X, y = sequences[:, :-1], sequences[:, -1]
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = np.array(sequences)
y = to_categorical(y, num_classes=vocab_size)

# define model
model = define_model(X)

# fit model
model.fit(X, y, epochs=100, verbose=2)

# save the model to file
model.save('model.h5')

# save the mapping
dump(mapping, open('mapping.pkl', 'wb'))

Vocabulary Size: 38
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 75)                34200     
_________________________________________________________________
dense_1 (Dense)              (None, 38)                2888      
Total params: 37,088
Trainable params: 37,088
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
13/13 - 0s - loss: 3.6141 - accuracy: 0.0827
Epoch 2/100
13/13 - 0s - loss: 3.5121 - accuracy: 0.1930
Epoch 3/100
13/13 - 0s - loss: 3.1909 - accuracy: 0.1905
Epoch 4/100
13/13 - 0s - loss: 3.0726 - accuracy: 0.1905
Epoch 5/100
13/13 - 0s - loss: 3.0231 - accuracy: 0.1905
Epoch 6/100
13/13 - 0s - loss: 2.9948 - accuracy: 0.1905
Epoch 7/100
13/13 - 0s - loss: 2.9785 - accuracy: 0.1905
Epoch 8/100
13/13 - 0s - loss: 2.9526 - accuracy: 0.1905
Epoch 9/100
13/13 - 0s - loss: 2.947

We are now ready to develop our model.

## Generate Text

In this section, we will develop a multichannel convolutional neural network for the sentiment analysis prediction problem. This section is divided into 3 parts:

1. Encode Data
2. Define Model
3. Complete Example

### Encode Data

The first step is to load the cleaned training dataset. The function below, named load_dataset(), can be called to load the pickled training dataset.


```python
from pickle import load

# load a clean dataset
def load_dataset(filename):
  return load(open(filename, 'rb'))

trainLines, trainLabels = load_dataset('train.pkl')
```

Next, we must fit a Keras Tokenizer on the training dataset. We will use this tokenizer to both define the vocabulary for the Embedding layer and encode the review documents as integers.

```python
# fit a tokenizer
def create_tokenizer(lines):
  tokenizer = Tokenizer()
  tokenizer.fit_on_texts(lines)
  return tokenizer
```

We also need to know the maximum length of input sequences as input for the model and to pad all sequences to the fixed length.

The function max_length() below will calculate the maximum length (number of words) for all reviews in the training dataset.

```python
# calculate the maximum document length
def max_length(lines):
  return max([len(s.split()) for s in lines])
```

We also need to know the size of the vocabulary for the Embedding layer. This can be calculated from the prepared Tokenizer.

```python
# calculate vocabulary size
vocab_size = len(tokenizer.word_index) + 1
```

Finally, we can integer encode and pad the clean movie review text. The function below, named encode_text(), will both encode and pad text data to the maximum review length.

```python
# encode a list of lines
def encode_text(tokenizer, lines, length):
  # integer encode
  encoded = tokenizer.texts_to_sequences(lines)
  # pad encoded sequences
  padded = pad_sequences(encoded, maxlen=length, padding='post')
  return padded
```
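The encoding steps above can be illustrated without Keras. The sketch below is a simplified stand-in, not the Keras Tokenizer or pad_sequences implementation: it assigns lower indices to more frequent words, breaks ties alphabetically (an assumption of this sketch), reserves 0 for padding, and pads at the end of each sequence. The function names and the two sample documents are hypothetical.

```python
from collections import Counter

# Sketch: build a word index where frequent words get the lowest integers.
def build_word_index(lines):
    counts = Counter(word for line in lines for word in line.lower().split())
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i + 1 for i, word in enumerate(ordered)}  # 0 is kept for padding

# Sketch: integer encode each line and pad with zeros at the end ('post' padding).
def encode_and_pad(word_index, lines, length):
    padded = []
    for line in lines:
        ids = [word_index[w] for w in line.lower().split() if w in word_index]
        padded.append(ids + [0] * (length - len(ids)))
    return padded

docs = ['a great film', 'not a great film at all']
word_index = build_word_index(docs)
length = max(len(d.split()) for d in docs)   # maximum document length
vocab_size = len(word_index) + 1             # plus one for the padding index
encoded = encode_and_pad(word_index, docs, length)

print(length, vocab_size)
print(encoded)
```

The extra 1 in the vocabulary size accounts for index 0, which never corresponds to a real word and is used only for padding.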








### Define Model

A standard model for document classification is to use an Embedding layer as input, followed by a one-dimensional convolutional neural network, pooling layer, and then a prediction output layer. The kernel size in the convolutional layer defines the number of words to consider as the convolution is passed across the input text document, providing a grouping parameter. 

A multi-channel convolutional neural network for document classification involves using multiple versions of the standard model with different sized kernels. This allows the document to be processed at different resolutions or different n-grams (groups of words) at a time, whilst the model learns how to best integrate these interpretations.

In Keras, a multiple-input model can be defined using the functional API. We will define a model with three input channels for processing 4-grams, 6-grams, and 8-grams of movie review text. Each channel is comprised of the following elements:

* **Input layer** that defines the length of input sequences.
* **Embedding layer** set to the size of the vocabulary and 100-dimensional real-valued representations.
* **Conv1D layer** with 32 filters and a kernel size set to the number of words to read at once.
* **Dropout layer** with a rate of 0.5 applied to the convolutional output.
* **MaxPooling1D layer** to consolidate the output from the convolutional layer.
* **Flatten layer** to reduce the three-dimensional output to two dimensions for concatenation.

The outputs from the three channels are concatenated into a single vector and processed by a Dense layer and an output layer.

```python
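from tensorflow.keras.layers import Input, Embedding, Conv1D, Dropout, MaxPooling1D, Flatten, Dense, concatenate
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model

# define the model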
def define_model(length, vocab_size):

  # channel 1
  inputs1 = Input(shape=(length,))
  embedding1 = Embedding(vocab_size, 100)(inputs1)
  conv1 = Conv1D(filters=32, kernel_size=4, activation='relu')(embedding1)
  drop1 = Dropout(0.5)(conv1)
  pool1 = MaxPooling1D(pool_size=2)(drop1)
  flat1 = Flatten()(pool1)

  # channel 2
  inputs2 = Input(shape=(length,))
  embedding2 = Embedding(vocab_size, 100)(inputs2)
  conv2 = Conv1D(filters=32, kernel_size=6, activation='relu')(embedding2)
  drop2 = Dropout(0.5)(conv2)
  pool2 = MaxPooling1D(pool_size=2)(drop2)
  flat2 = Flatten()(pool2)

  # channel 3
  inputs3 = Input(shape=(length,))
  embedding3 = Embedding(vocab_size, 100)(inputs3)
  conv3 = Conv1D(filters=32, kernel_size=8, activation='relu')(embedding3)
  drop3 = Dropout(0.5)(conv3)
  pool3 = MaxPooling1D(pool_size=2)(drop3)
  flat3 = Flatten()(pool3)

  # merge
  merged = concatenate([flat1, flat2, flat3])

  # interpretation
  dense1 = Dense(10, activation='relu')(merged)
  outputs = Dense(1, activation='sigmoid')(dense1)
  model = Model(inputs=[inputs1, inputs2, inputs3], outputs=outputs)

  # compile
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
  
  # summarize
  model.summary()
  plot_model(model, show_shapes=True, to_file='multichannel.png')
  return model
```
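The size of the vector produced by each channel follows from the layer settings. The sketch below assumes the Keras defaults of 'valid' padding and a stride of 1 for Conv1D, and non-overlapping windows for MaxPooling1D. The input length of 100 words is a hypothetical value chosen for illustration.

```python
# Sketch: work out how many values each channel contributes after Flatten.
def channel_output_size(length, kernel_size, filters=32, pool_size=2):
    conv_len = length - kernel_size + 1   # 'valid' padding, stride 1
    pool_len = conv_len // pool_size      # non-overlapping pooling windows
    return pool_len * filters             # Flatten: time steps x filters

length = 100  # hypothetical maximum document length
sizes = [channel_output_size(length, k) for k in (4, 6, 8)]
merged = sum(sizes)  # width of the concatenated vector fed to the Dense layer

print(sizes, merged)
```

Larger kernels yield slightly shorter outputs, so the three channels contribute vectors of different widths to the merged layer.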



We can put all of this together in a single example.

In [0]:
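from pickle import load

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Input, Embedding, Conv1D, Dropout, MaxPool1D, Flatten, Concatenate
from tensorflow.keras.models import Model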


# load dataset
def load_dataset(filename):
  # load dataset
  return load(open(filename, 'rb'))

# fit a tokenizer
def create_tokenizer(lines):
  tokenizer = Tokenizer()
  tokenizer.fit_on_texts(lines)
  return tokenizer

# calculate the maximum document length
def max_length(lines):
  return max([len(line.split()) for line in lines])

# encode a list of lines
def encode_text(tokenizer, lines, max_length):
  # integer encode
  encoded = tokenizer.texts_to_sequences(lines)
  # pad sequences
  padded = pad_sequences(encoded, maxlen=max_length, padding='post')

  return padded

# define the model
def define_model(max_length, vocab_size):

  print('Creating channel 1.......')
  # channel 1
  inputs1 = Input(shape=(max_length, ))
  embedding1 = Embedding(vocab_size, 100)(inputs1)
  conv1 = Conv1D(filters=32, kernel_size=4, activation='relu')(embedding1)
  drop1 = Dropout(0.5)(conv1)
  pool1 = MaxPool1D(pool_size=2)(drop1)
  flat1 = Flatten()(pool1)

  print('Creating channel 2.......')
  # channel 2
  inputs2 = Input(shape=(max_length, ))
  embedding2 = Embedding(vocab_size, 100)(inputs2)
  conv2 = Conv1D(filters=32, kernel_size=6, activation='relu')(embedding2)
  drop2 = Dropout(0.5)(conv2)
  pool2 = MaxPool1D(pool_size=2)(drop2)
  flat2 = Flatten()(pool2)

  print('Creating channel 3.......')
  # channel 3
  inputs3 = Input(shape=(max_length, ))
  embedding3 = Embedding(vocab_size, 100)(inputs3)
  conv3 = Conv1D(filters=32, kernel_size=8, activation='relu')(embedding3)
  drop3 = Dropout(0.5)(conv3)
  pool3 = MaxPool1D(pool_size=2)(drop3)
  flat3 = Flatten()(pool3)

  print('Merging all channels.......')
  # merge all channels; Concatenate is a layer, so instantiate it before calling it
  merged_layer = Concatenate()([flat1, flat2, flat3])

  # interpretation
  dense_layer = Dense(10, activation='relu')(merged_layer)
  output_layer = Dense(1, activation='sigmoid')(dense_layer)

  print('Creating model.......')
  # create model
  model = Model(inputs=[inputs1, inputs2, inputs3], outputs=output_layer)

  # compile model
  model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

  # summarize defined model
  model.summary()

  # plot model architecture
  plot_model(model, to_file='model.png', show_shapes=True)

  return model

print('Loading dataset.......')
# load training dataset
trainLines, trainLabels = load_dataset('train.pkl')
# convert to array
trainLines = np.array(trainLines)
trainLabels = np.array(trainLabels)

# create the tokenizer
tokenizer = create_tokenizer(trainLines)

# calculate the maximum document length
# (stored as length so the max_length() function is not overwritten)
length = max_length(trainLines)
print(f'Maximum document length: {length}')

# calculate vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print(f'Vocabulary size: {vocab_size}')

# encode data
trainX = encode_text(tokenizer, trainLines, length)

print('Creating model.......')
# define model
model = define_model(length, vocab_size)

print('Training model.......')
# fit model
model.fit([trainX, trainX, trainX], trainLabels, epochs=7, batch_size=16, verbose=1)

# save the model
model.save('model.h5')

## Evaluate Model

We can evaluate the fit model by predicting the sentiment on all reviews in the unseen test dataset. Using the data loading functions developed in the previous section, we can load and encode both the training and test datasets.

In [0]:
from tensorflow.keras.models import load_model

# load datasets
trainLines, trainLabels = load_dataset('train.pkl')
testLines, testLabels = load_dataset('test.pkl')

# create tokenizer
tokenizer = create_tokenizer(trainLines)

# calculate max document length
length = max_length(trainLines)
print(f'Max document length: {length}')

# calculate vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print(f'Vocabulary size: {vocab_size}')

# encode data
trainX = encode_text(tokenizer, trainLines, length)
testX = encode_text(tokenizer, testLines, length)

# load the model
model = load_model('model.h5')

# evaluate model on training dataset
_, acc = model.evaluate([trainX, trainX, trainX], trainLabels, verbose=0)
print(f'Train Accuracy: {acc * 100}')

# evaluate model on test dataset
_, acc = model.evaluate([testX, testX, testX], testLabels, verbose=0)
print(f'Test Accuracy: {acc * 100}')

We can see that, as expected, the skill on the training dataset is excellent, here at 100% accuracy. The skill of the model on the unseen test dataset is also very impressive at 88.5%, which is above the skill of the model reported in the 2014 paper.