Building a character-based Natural Language Model (NLM)

- The model is fed one fixed-length sequence of characters at a time
- The next character is predicted from the sequence of characters fed into the model


In this example, we will use the nursery rhyme "Sing a Song of Sixpence".

### Loading text data

In [2]:
# load doc into memory 
def load_doc(filename): 
    # open the file as read only 
    file = open(filename, 'r') 
    # read all text 
    text = file.read() 
    # close the file 
    file.close() 
    return text

In [3]:
# load text
raw_text = load_doc('./sing_a_song_of_sixpence')
print(raw_text)

Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie. 

When the pie was opened 
The birds began to sing;
Wasn't that a dainty dish,
To set before the king. 

The king was in his counting house, 
Counting out his money; 
The queen was in the parlour, 
Eating bread and honey. 

The maid was in the garden, 
Hanging out the clothes, 
When down came a blackbird 
And pecked off her nose.


### Data cleaning

In [4]:
# collapse newlines and repeated whitespace into single spaces
tokens = raw_text.split()
raw_text = ' '.join(tokens)
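
To see what the cleaning step does, here is a minimal sketch on a made-up string (`sample`, `sample_tokens` and `cleaned` are hypothetical names, not part of the notebook). Called without arguments, `str.split()` splits on any run of whitespace, including newlines, so re-joining the pieces with single spaces turns the poem into one continuous line.

```python
# Hypothetical sample with newlines and trailing spaces, for illustration only
sample = "Baked in a pie. \n\nWhen the pie was opened \n"

# split() with no arguments splits on any whitespace run and drops empty pieces
sample_tokens = sample.split()

# join the pieces back with single spaces to get one continuous line
cleaned = ' '.join(sample_tokens)
print(repr(cleaned))  # 'Baked in a pie. When the pie was opened'
```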

### Creating a list of sequences

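The input sequence length is limited to 10 characters. Each stored sequence holds 11 characters: the 10 input characters plus the character to predict.

Each sequence is obtained from the previous one by dropping its first character and appending the next character from the source text. A few example sequences are printed below.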

In [5]:
length = 10
i=10
print(i)
raw_text[i-length:i+1]

10


'Sing a song'

In [6]:
i=11
print(i)

raw_text[i-length:i+1]

11


'ing a song '

In [7]:
length = 10
sequences = list()
for i in range(length, len(raw_text)):
    # select a window of length + 1 characters ending at position i
    seq = raw_text[i-length:i+1] 
    sequences.append(seq)
print("Total sequences: ", len(sequences))

Total sequences:  399
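
The loop above produces one window per position from `length` to the end of the text, so the number of sequences is `len(raw_text) - length`; a count of 399 therefore corresponds to a cleaned text of 409 characters. The sketch below repeats the same windowing on the first line of the poem. `make_sequences`, `toy_text` and `toy_sequences` are names introduced here for illustration only.

```python
def make_sequences(text, length):
    """Return every window of length + 1 characters, sliding one character at a time."""
    return [text[i - length:i + 1] for i in range(length, len(text))]

toy_text = "Sing a song of sixpence,"  # 24 characters
toy_sequences = make_sequences(toy_text, 10)

print(len(toy_sequences))   # 14, i.e. len(toy_text) - 10
print(toy_sequences[0])     # Sing a song
print(toy_sequences[-1])    # f sixpence,
```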


### Saving sequences into file

In [8]:
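def sav_doc(lines, filename):
    # write one sequence per line
    data = '\n'.join(lines)
    # the with-block closes the file (the original `file.close` never called close)
    with open(filename, 'w') as file:
        file.write(data)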

In [9]:
# save sequences to file
out_filename = 'char_sequences_sing_a_song.txt'
sav_doc(sequences, out_filename)

## Training Language Model

In [26]:
# load doc into memory 
def load_doc(filename): 
    # open the file as read only 
    file = open(filename, 'r') 
    # read all text 
    text = file.read() 
    # close the file 
    file.close() 
    return text 

# load 
in_filename = './char_sequences_sing_a_song.txt'
raw_text = load_doc(in_filename) 
lines = raw_text.split('\n')

### Encoding the characters in sequence

Each character is assigned an integer value, so that every sequence of characters becomes a sequence of integers.

In [27]:
chars = sorted(list(set(raw_text)))

mapping = dict((c,i) for i, c in enumerate(chars))

In [28]:
print("Checking the mapping of a random char:",mapping['C'])

Checking the mapping of a random char: 8


In [29]:
# checking vocab size
vocab_size = len(mapping)
print("vocab size: ", vocab_size)

vocab size:  38


In [30]:
# processing each sequence and mapping int values

sequences = list()
for line in lines:
    encoded_seq = [mapping[char] for char in line]
    
    # store
    sequences.append(encoded_seq)
    
sequences[:5]

[[12, 23, 27, 21, 1, 15, 1, 32, 28, 27, 21],
 [23, 27, 21, 1, 15, 1, 32, 28, 27, 21, 1],
 [27, 21, 1, 15, 1, 32, 28, 27, 21, 1, 28],
 [21, 1, 15, 1, 32, 28, 27, 21, 1, 28, 20],
 [1, 15, 1, 32, 28, 27, 21, 1, 28, 20, 1]]
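
The encoding step can be traced by hand on a tiny string. The sketch below uses made-up names (`toy_text`, `toy_chars`, `toy_mapping`) and is only meant to illustrate the technique. Note that in the notebook `chars` is built from the whole contents of the sequence file, so the newline separator also receives an index; this would explain why the space maps to 1 rather than 0 in the output above.

```python
toy_text = "a pie"

# unique characters in sorted order define the vocabulary
toy_chars = sorted(set(toy_text))
print(toy_chars)  # [' ', 'a', 'e', 'i', 'p']

# each character gets its position in the sorted vocabulary
toy_mapping = {c: i for i, c in enumerate(toy_chars)}

# the string becomes a list of integers
toy_encoded = [toy_mapping[c] for c in toy_text]
print(toy_encoded)  # [1, 0, 4, 3, 2]
```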

### Splitting the input and output

- Selecting the last column as y and the remaining columns as X
- One-hot encoding the data with Keras `to_categorical`

In [31]:
import numpy as np
sequences = np.array(sequences)
X, y = sequences[:,:-1], sequences[:,-1]

In [32]:
from keras.utils import to_categorical

In [33]:
# one-hot encoding
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]

In [34]:
X = np.array(sequences)

In [35]:
y = to_categorical(y, num_classes=vocab_size)
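
After these cells, `X` should have shape (399, 10, 38) and `y` shape (399, 38): one row per sequence, 10 input characters, and a 38-wide one-hot vector per character. The sketch below shows the same split and encoding on a toy array. It uses `np.eye` as a NumPy stand-in for `to_categorical` so that it runs without Keras; the toy names and values are assumptions made for illustration.

```python
import numpy as np

toy_vocab_size = 5
# two toy integer sequences of 4 characters each (3 inputs + 1 target)
toy_sequences = np.array([[1, 0, 4, 3],
                          [0, 4, 3, 2]])

# last column is the target, the rest is the input
X_int, y_int = toy_sequences[:, :-1], toy_sequences[:, -1]

# indexing the identity matrix by class id yields one-hot rows
identity = np.eye(toy_vocab_size, dtype='float32')
X_onehot = identity[X_int]
y_onehot = identity[y_int]

print(X_onehot.shape)  # (2, 3, 5)
print(y_onehot.shape)  # (2, 5)
```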

## Fitting model

- Creating a model with a single LSTM layer
- Single hidden layer with 75 memory cells
- Softmax activation function for the output layer

In [51]:
from pickle import dump 
from keras.utils import to_categorical 
from keras.utils.vis_utils import plot_model 
from keras.models import Sequential 
from keras.layers import Dense 
from keras.layers import LSTM

In [43]:
def define_model(X):
    model = Sequential()
    model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
    model.add(Dense(vocab_size, activation='softmax'))
    
    # compile model
    model.compile(loss='categorical_crossentropy'
                  , optimizer='adam'
                  , metrics=['accuracy']
                 )
    
    # summarizing
    model.summary()
#     plot_model(model, to_file='model.png', show_shapes=True)
    
    return model

In [44]:
# model
model = define_model(X)

Model: "sequential_3"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_3 (LSTM)                (None, 75)                34200     
_________________________________________________________________
dense_3 (Dense)              (None, 38)                2888      
Total params: 37,088
Trainable params: 37,088
Non-trainable params: 0
_________________________________________________________________
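
The parameter counts in the summary can be checked by hand. An LSTM layer has four gates, each with input weights, recurrent weights and a bias; a dense layer has one weight per input-output pair plus a bias per output. The helper functions below are hypothetical names written for this check, not Keras functions.

```python
def lstm_param_count(units, input_dim):
    # 4 gates, each with (input_dim + units) weights per unit plus one bias per unit
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(units, input_dim):
    # one weight per input-output pair plus one bias per output unit
    return input_dim * units + units

lstm_params = lstm_param_count(units=75, input_dim=38)
dense_params = dense_param_count(units=38, input_dim=75)

print(lstm_params)                 # 34200
print(dense_params)                # 2888
print(lstm_params + dense_params)  # 37088
```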


In [45]:
# fit model
model.fit(X, y, epochs=100, verbose=2)

Epoch 1/100
 - 2s - loss: 3.6113 - accuracy: 0.0827
Epoch 2/100
 - 0s - loss: 3.4816 - accuracy: 0.1905
Epoch 3/100
 - 0s - loss: 3.1917 - accuracy: 0.1905
Epoch 4/100
 - 0s - loss: 3.0616 - accuracy: 0.1905
Epoch 5/100
 - 0s - loss: 3.0190 - accuracy: 0.1905
Epoch 6/100
 - 0s - loss: 2.9964 - accuracy: 0.1905
Epoch 7/100
 - 0s - loss: 2.9737 - accuracy: 0.1905
Epoch 8/100
 - 0s - loss: 2.9557 - accuracy: 0.1905
Epoch 9/100
 - 0s - loss: 2.9458 - accuracy: 0.1905
Epoch 10/100
 - 0s - loss: 2.9247 - accuracy: 0.1905
Epoch 11/100
 - 0s - loss: 2.9097 - accuracy: 0.1905
Epoch 12/100
 - 0s - loss: 2.8861 - accuracy: 0.1905
Epoch 13/100
 - 0s - loss: 2.8634 - accuracy: 0.1905
Epoch 14/100
 - 0s - loss: 2.8408 - accuracy: 0.2080
Epoch 15/100
 - 0s - loss: 2.8031 - accuracy: 0.1930
Epoch 16/100
 - 0s - loss: 2.7687 - accuracy: 0.2080
Epoch 17/100
 - 0s - loss: 2.7307 - accuracy: 0.2481
Epoch 18/100
 - 0s - loss: 2.6970 - accuracy: 0.2030
Epoch 19/100
 - 0s - loss: 2.6345 - accuracy: 0.2431
...

<keras.callbacks.callbacks.History at 0x257a8831d68>
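
A useful reference point when reading the log: a model that assigns equal probability to all 38 characters has a categorical cross-entropy of ln(38), which is close to the loss reported after the first epoch (3.6113). The short check below computes that baseline; `n_classes` and `uniform_loss` are names introduced here.

```python
import math

n_classes = 38  # vocabulary size reported earlier in the notebook

# cross-entropy of a uniform prediction: -log(1 / n_classes)
uniform_loss = -math.log(1.0 / n_classes)
print(round(uniform_loss, 4))  # 3.6376
```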

In [47]:
# saving the model
model.save('model.h5')

# saving the mapping
with open('mapping.pkl', 'wb') as f:
    dump(mapping, f)

## Generate Text

### Generate characters

In [108]:
from keras.preprocessing.sequence import pad_sequences

# generate a sequence of characters with a language model 
def generate_seq(model, mapping, seq_length, seed_text, n_chars): 
    in_text = seed_text 
    # generate a fixed number of characters 
    for _ in range(n_chars):
        # encode the characters as integers 
        encoded = [mapping[char] for char in in_text] 
        
        # truncate sequences to a fixed length 
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        
        # one hot encode 
        encoded = to_categorical(encoded, num_classes=len(mapping)) 
#         encoded = encoded.reshape(1, encoded.shape[0], encoded.shape[1]) 
        
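        # predict the most likely next character (np.argmax over the softmax
        # output; predict_classes is not available in newer Keras releases)
        yhat = np.argmax(model.predict(encoded, verbose=0), axis=-1)[0]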
        
        # reverse map integer to character 
        out_char = ''
        for char, index in mapping.items(): 
            if index == yhat: 
                out_char = char 
                break 
        # append to input 
        in_text += out_char 
    return in_text
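
The reverse lookup inside `generate_seq` scans the whole mapping for every generated character. One possible alternative, sketched here with a toy mapping, is to build the inverse dictionary once and index into it. `toy_mapping`, `index_to_char` and `predicted` are illustrative names and values, not taken from the notebook.

```python
# toy character-to-integer mapping, for illustration only
toy_mapping = {' ': 0, 'a': 1, 'e': 2, 'i': 3, 'p': 4}

# invert it once: integer -> character
index_to_char = {index: char for char, index in toy_mapping.items()}

# decode a list of predicted class ids back into text
predicted = [4, 3, 2]
decoded = ''.join(index_to_char[i] for i in predicted)
print(decoded)  # pie
```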

In [67]:
from keras.models import load_model
# load the model
model = load_model('model.h5')

In [68]:
from pickle import load
# load mapping
with open('mapping.pkl', 'rb') as f:
    mapping = load(f)

In [115]:
# test start of rhyme 
print(generate_seq(model, mapping, 10, 'Sing a son', 20)) 
# test mid-line 
print(generate_seq(model, mapping, 10, 'king was i', 15)) 
# test not in original 
print(generate_seq(model, mapping, 10, 'sheep was in', 20))

Sing a song of sixpence, A poc
king was in his counting 
sheep was ing os cint oos thello
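
The third seed is 12 characters long, but the model only ever sees 10 characters. With `truncating='pre'`, `pad_sequences` drops items from the front of a sequence that is too long, and by default it pads at the front when a sequence is too short. The function below is a pure-Python imitation of that behaviour, written for illustration under the assumption that `maxlen` is positive; it is not the Keras implementation.

```python
def pad_or_truncate_pre(seq, maxlen, value=0):
    """Keep the last maxlen items; left-pad with value if the sequence is shorter."""
    seq = list(seq)[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

# a seed longer than the window loses its first characters
print(''.join(pad_or_truncate_pre('sheep was in', 10)))  # eep was in

# a seed shorter than the window is padded at the front
print(pad_or_truncate_pre([5, 6, 7], 5))  # [0, 0, 5, 6, 7]
```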


### Summary
- Data preparation for the LSTM model: the text was split into fixed-length sequences of characters
- Characters were first mapped to integers, and the integers were then one-hot encoded before being fed to the model
- The character model generates text from a seed string, given the number of characters to predict