## Character-Based Neural Language Model

The nursery rhyme Sing a Song of Sixpence is well known in the West. The first verse is the
best known, but there is also a four-verse version that we will use to develop our character-based
language model. It is short, so fitting the model will be fast, but not so short that we won't see
anything interesting. The complete four-verse version used as source text is printed in the Load Text section below.

### Language Model Design 

In [None]:
# We must train our language model on text. For a character-based language model,
# the input and output sequences are sequences of characters.

# The number of characters used as input also defines how many characters must be provided to the model
# in order to elicit the first predicted character. After the first character has been generated, it
# can be appended to the input sequence and used as input for the model to generate the next character.

# Longer sequences offer more context but take longer to train.
# We will use an arbitrary length of 10 characters for this model.
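
The rolling input window described in the comments above can be illustrated without any model. This is a minimal sketch: `roll_window` is a hypothetical helper name, and the "predicted" characters are simply supplied by hand rather than produced by a network.

```python
SEQ_LENGTH = 10  # same arbitrary window length as in this notebook

def roll_window(window, new_char, seq_length=SEQ_LENGTH):
    """Append the newly generated character and keep only the last seq_length characters."""
    return (window + new_char)[-seq_length:]

window = 'sing a son'      # the 10 seed characters needed for the first prediction
windows = []
for predicted in 'g of':   # hand-supplied stand-ins for model predictions
    window = roll_window(window, predicted)
    windows.append(window)

for w in windows:
    print(repr(w))
```

Each step drops the oldest character and appends the newest one, so the model always sees exactly 10 characters.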



### Load Text 

In [1]:
import os

In [2]:
def load_doc(filename):
    # read the whole file into one string; the with-block closes the file for us
    with open(filename,'r') as file:
        text = file.read()
    return text

In [4]:
raw_text = load_doc('ryhme.txt')
print(raw_text)

Sing a song of sixpence,
A pocket full of rye.
Four and twenty blackbirds,
Baked in a pie.
When the pie was opened
The birds began to sing;
Wasn't that a dainty dish,
To set before the king.
The king was in his counting house,
Counting out his money;
The queen was in the parlour,
Eating bread and honey.
The maid was in the garden,
Hanging out the clothes,
When down came a blackbird
And pecked off her nose.



### Clean Data

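We will strip all of the newline characters so that we have one long sequence of characters separated only by white space. Note that `clean_data` as written also discards every token that is not purely alphabetic, so words with attached punctuation (for example "sixpence," and "rye.") are dropped entirely, as the printed output shows.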

In [15]:
import string
def clean_data(text):
    tokens = text.split()
    tokens = [t for t in tokens if t not in string.punctuation]
    tokens = [t for t in tokens if t.isalpha()]
    tokens = [t.lower() for t in tokens]
    tokens = ' '.join(tokens)
    return tokens

In [18]:
raw_text = clean_data(raw_text)

In [19]:
print(raw_text)

sing a song of a pocket full of four and twenty baked in a when the pie was opened the birds began to that a dainty to set before the the king was in his counting counting out his the queen was in the eating bread and the maid was in the hanging out the when down came a blackbird and pecked off her
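
If the goal is to keep words such as "sixpence" rather than drop them, one option is to remove the punctuation characters before splitting. The sketch below is an alternative, not the function used in this notebook; `clean_keep_words` is a hypothetical name, and using it would change the sequence count and vocabulary shown in the later outputs.

```python
import string

def clean_keep_words(text):
    """Lowercase the text, delete punctuation characters, and collapse all whitespace."""
    table = str.maketrans('', '', string.punctuation)   # maps each punctuation char to None
    tokens = text.translate(table).lower().split()
    tokens = [t for t in tokens if t.isalpha()]          # still drop anything non-alphabetic
    return ' '.join(tokens)

sample = "Sing a song of sixpence,\nA pocket full of rye."
print(clean_keep_words(sample))
```

With this variant a contraction such as "Wasn't" becomes "wasnt", because the apostrophe is deleted along with the other punctuation.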


### Create Sequences

In [20]:
# each input sequence is 10 characters long with one output character, making each sequence 11 characters long
length = 10
def create_seq(raw_text):
    # build the list locally so that re-running the cell does not append duplicates
    sequences = list()
    for i in range(length,len(raw_text)):
        sequences.append(raw_text[i-length:i+1])
    print('Total Sequences',len(sequences))
    return sequences

In [21]:
sequences = create_seq(raw_text)

Total Sequences 289
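
The count follows directly from the loop bounds: a text of `n` characters yields `n - length` windows, so 289 sequences correspond to a cleaned text of 299 characters. A small sketch on a toy string makes the windowing visible; `make_windows` is a hypothetical helper that mirrors the slicing used above.

```python
def make_windows(text, length):
    """Return every substring of length + 1 characters: `length` inputs plus one target."""
    return [text[i - length:i + 1] for i in range(length, len(text))]

toy = 'sing a song of'            # 14 characters
windows = make_windows(toy, 10)
print(len(windows))               # len(toy) - 10

for w in windows:
    # first 10 characters are the input, the last character is the target
    print(repr(w[:-1]), '->', repr(w[-1]))
```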


In [22]:
print(sequences)

['sing a song', 'ing a song ', 'ng a song o', 'g a song of', ' a song of ', 'a song of a', ' song of a ', 'song of a p', 'ong of a po', 'ng of a poc', 'g of a pock', ' of a pocke', 'of a pocket', 'f a pocket ', ' a pocket f', 'a pocket fu', ' pocket ful', 'pocket full', 'ocket full ', 'cket full o', 'ket full of', 'et full of ', 't full of f', ' full of fo', 'full of fou', 'ull of four', 'll of four ', 'l of four a', ' of four an', 'of four and', 'f four and ', ' four and t', 'four and tw', 'our and twe', 'ur and twen', 'r and twent', ' and twenty', 'and twenty ', 'nd twenty b', 'd twenty ba', ' twenty bak', 'twenty bake', 'wenty baked', 'enty baked ', 'nty baked i', 'ty baked in', 'y baked in ', ' baked in a', 'baked in a ', 'aked in a w', 'ked in a wh', 'ed in a whe', 'd in a when', ' in a when ', 'in a when t', 'n a when th', ' a when the', 'a when the ', ' when the p', 'when the pi', 'hen the pie', 'en the pie ', 'n the pie w', ' the pie wa', 'the pie was', 'he pie was ', 'e pie wa

In [23]:
def save_doc(lines,filename):
    text = '\n'.join(lines)
    file = open(filename,'w')
    file.write(text)
    file.close()

In [25]:
output_file = 'char_seq.txt'
save_doc(sequences,output_file)

Data Preparation is done.

## Train Language Model

In this section, we will develop a neural language model for the prepared sequence data. The
model will read encoded characters and predict the next character in the sequence. A Long
Short-Term Memory recurrent neural network hidden layer will be used to learn the context
from the input sequence in order to make the predictions.


In [26]:
in_filename = 'char_seq.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split('\n')

In [27]:
print(lines)

['sing a song', 'ing a song ', 'ng a song o', 'g a song of', ' a song of ', 'a song of a', ' song of a ', 'song of a p', 'ong of a po', 'ng of a poc', 'g of a pock', ' of a pocke', 'of a pocket', 'f a pocket ', ' a pocket f', 'a pocket fu', ' pocket ful', 'pocket full', 'ocket full ', 'cket full o', 'ket full of', 'et full of ', 't full of f', ' full of fo', 'full of fou', 'ull of four', 'll of four ', 'l of four a', ' of four an', 'of four and', 'f four and ', ' four and t', 'four and tw', 'our and twe', 'ur and twen', 'r and twent', ' and twenty', 'and twenty ', 'nd twenty b', 'd twenty ba', ' twenty bak', 'twenty bake', 'wenty baked', 'enty baked ', 'nty baked i', 'ty baked in', 'y baked in ', ' baked in a', 'baked in a ', 'aked in a w', 'ked in a wh', 'ed in a whe', 'd in a when', ' in a when ', 'in a when t', 'n a when th', ' a when the', 'a when the ', ' when the p', 'when the pi', 'hen the pie', 'en the pie ', 'n the pie w', ' the pie wa', 'the pie was', 'he pie was ', 'e pie wa

### Encode Sequences

The sequences of characters must be encoded as integers. This means that each unique character will be assigned a specific integer value and each sequence of characters will be encoded as a sequence of integers.

In [34]:
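# we can create the mapping from a sorted set of the unique characters
# note: raw_text is the saved file contents, so the newline that separates the sequences is part of the set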

chars = sorted(list(set(raw_text)))
mapping = dict((c,i) for i,c in enumerate(chars))

In [35]:
print(mapping)

{'\n': 0, ' ': 1, 'a': 2, 'b': 3, 'c': 4, 'd': 5, 'e': 6, 'f': 7, 'g': 8, 'h': 9, 'i': 10, 'k': 11, 'l': 12, 'm': 13, 'n': 14, 'o': 15, 'p': 16, 'q': 17, 'r': 18, 's': 19, 't': 20, 'u': 21, 'w': 22, 'y': 23}


In [36]:
print(chars)

['\n', ' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'w', 'y']


In [37]:
# Encoding each character according to our above mapping
encoded_sequences = list()

for line in lines:
    encode_seq = [mapping[char] for char in line]
    
    encoded_sequences.append(encode_seq)


In [39]:
print(encoded_sequences[0])

[19, 10, 14, 8, 1, 2, 1, 19, 15, 14, 8]
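
A quick way to sanity-check an encoding is to invert the mapping and decode the integers again. The sketch below rebuilds the mapping from the character list printed above so that it runs on its own; `inverse_mapping` and `decode` are hypothetical names introduced for illustration.

```python
# the 24 characters printed above, in sorted order
chars = ['\n', ' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k',
         'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'w', 'y']
mapping = {c: i for i, c in enumerate(chars)}
inverse_mapping = {i: c for c, i in mapping.items()}   # integer -> character

def decode(encoded):
    """Turn a list of integers back into the string it was encoded from."""
    return ''.join(inverse_mapping[i] for i in encoded)

first = [19, 10, 14, 8, 1, 2, 1, 19, 15, 14, 8]
print(repr(decode(first)))
```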


In [41]:
vocab_size = len(mapping)
print(vocab_size)

24


In [55]:
# Splitting into inputs and outputs 
import numpy as np
encoded_sequences = np.array(encoded_sequences)
X,y = encoded_sequences[:,:-1], encoded_sequences[:,-1] 

In [56]:
print(X[0])
print(y[0])
print(X[1])

[19 10 14  8  1  2  1 19 15 14]
8
[10 14  8  1  2  1 19 15 14  8]


In [57]:
print(X.shape)
print(y.shape)

(289, 10)
(289,)


In [58]:
# One-hot encode each character, so that each character becomes a vector as long as
# the vocabulary (24 elements) with a 1 marking the specific character.
# We use the to_categorical() function to one-hot encode the input and output sequences.

from tensorflow.keras.utils import to_categorical

onehot_encoded_seq = [to_categorical(x,num_classes=vocab_size) for x in X]
X = np.array(onehot_encoded_seq)

y = to_categorical(y,num_classes=vocab_size)

In [59]:
print(X.shape)
print(y.shape)

(289, 10, 24)
(289, 24)
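
To see what the one-hot step does to the array shapes, the same transformation can be reproduced with plain NumPy by indexing an identity matrix. This is only an illustrative sketch using the first two encoded sequences shown earlier; the notebook itself uses `to_categorical()`.

```python
import numpy as np

vocab_size = 24
# first two encoded input sequences and their target characters from the outputs above
X_int = np.array([[19, 10, 14, 8, 1, 2, 1, 19, 15, 14],
                  [10, 14, 8, 1, 2, 1, 19, 15, 14, 8]])
y_int = np.array([8, 1])

identity = np.eye(vocab_size, dtype='float32')
X_onehot = identity[X_int]    # each integer selects one row of the identity matrix
y_onehot = identity[y_int]

print(X_onehot.shape)         # (samples, time steps, vocabulary size)
print(y_onehot.shape)         # (samples, vocabulary size)
```

Taking `argmax` over the last axis recovers the original integers, which is the operation applied to the model output during generation.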


In [66]:
# only the imports that the following cells actually use
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras import Sequential

In [76]:
# Define and compile the model



def define_model(X):
    model = Sequential()
    model.add(LSTM(50,input_shape=(X.shape[1],X.shape[2])))
    
    model.add(Dense(50,activation='relu'))
    model.add(Dense(100,activation='relu'))
    model.add(Dense(150,activation='relu'))
    model.add(Dense(vocab_size,activation='softmax'))
    
    model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
    
    model.summary()
    
    return model

In [78]:
model = define_model(X)
model.fit(X,y,epochs=200,verbose=2)

Model: "sequential_5"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_7 (LSTM)                (None, 50)                15000     
_________________________________________________________________
dense_9 (Dense)              (None, 50)                2550      
_________________________________________________________________
dense_10 (Dense)             (None, 100)               5100      
_________________________________________________________________
dense_11 (Dense)             (None, 150)               15150     
_________________________________________________________________
dense_12 (Dense)             (None, 24)                3624      
Total params: 41,424
Trainable params: 41,424
Non-trainable params: 0
_________________________________________________________________
Epoch 1/200
10/10 - 3s - loss: 3.1634 - accuracy: 0.1626
Epoch 2/200
10/10 - 0s - loss: 3.1022 - accuracy: 0.2111
Ep

Epoch 125/200
10/10 - 0s - loss: 0.2655 - accuracy: 0.9031
Epoch 126/200
10/10 - 0s - loss: 0.2520 - accuracy: 0.9446
Epoch 127/200
10/10 - 0s - loss: 0.2128 - accuracy: 0.9412
Epoch 128/200
10/10 - 0s - loss: 0.1864 - accuracy: 0.9585
Epoch 129/200
10/10 - 0s - loss: 0.1523 - accuracy: 0.9723
Epoch 130/200
10/10 - 0s - loss: 0.1364 - accuracy: 0.9792
Epoch 131/200
10/10 - 0s - loss: 0.1231 - accuracy: 0.9827
Epoch 132/200
10/10 - 0s - loss: 0.1085 - accuracy: 0.9862
Epoch 133/200
10/10 - 0s - loss: 0.1037 - accuracy: 0.9827
Epoch 134/200
10/10 - 0s - loss: 0.1281 - accuracy: 0.9792
Epoch 135/200
10/10 - 0s - loss: 0.1152 - accuracy: 0.9862
Epoch 136/200
10/10 - 0s - loss: 0.1004 - accuracy: 0.9896
Epoch 137/200
10/10 - 0s - loss: 0.1602 - accuracy: 0.9550
Epoch 138/200
10/10 - 0s - loss: 0.3865 - accuracy: 0.8512
Epoch 139/200
10/10 - 0s - loss: 0.2887 - accuracy: 0.9100
Epoch 140/200
10/10 - 0s - loss: 0.1896 - accuracy: 0.9516
Epoch 141/200
10/10 - 0s - loss: 0.1487 - accuracy: 0.96

<tensorflow.python.keras.callbacks.History at 0x1ec7e5d85e0>

In [79]:
model.save('CharBasedLanguageModeling.h5')
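
The parameter counts in the model summary above can be checked by hand. The sketch below uses the standard formulas for an LSTM layer (four gates, each with input weights, recurrent weights, and a bias) and for a dense layer; the helper names are hypothetical. It also shows why each epoch reports 10 steps: `fit()` was called without `batch_size`, so the Keras default of 32 applies to the 289 samples.

```python
import math

def lstm_params(units, input_dim):
    # 4 gates, each with (input_dim + units) weights per unit plus one bias per unit
    return 4 * (units * (input_dim + units) + units)

def dense_params(inputs, units):
    # one weight per input per unit, plus one bias per unit
    return inputs * units + units

counts = [
    lstm_params(50, 24),      # LSTM(50) on 24 one-hot features
    dense_params(50, 50),
    dense_params(50, 100),
    dense_params(100, 150),
    dense_params(150, 24),    # softmax output over the vocabulary
]
print(counts, sum(counts))

steps_per_epoch = math.ceil(289 / 32)   # 32 is the default batch_size of fit()
print(steps_per_epoch)
```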

In [84]:
#save the mapping 
from pickle import dump
dump(mapping,open('mapping.pkl','wb'))

### Generate Characters

We must provide a sequence of 10 characters as input to the model in order to start the generation process. We will pick these seeds manually. A given input sequence must be prepared in the same way as the training data for the model.
We will use the following steps:

1) The sequence of characters must be integer encoded using the loaded mapping.

2) The integers need to be one-hot encoded using the `to_categorical()` Keras function.

3) The input must be 3-dimensional, because LSTMs require all input to have the shape (samples, time steps, features) even when we only have one sequence. Since `pad_sequences()` is given a list containing one sequence, the one-hot encoded result already has the shape (1, 10, 24), so no explicit reshape is needed.

4) Use the model to predict the next character in the sequence. We use `predict_classes()` instead of `predict()` to directly select the integer for the character with the highest probability, instead of getting the full probability distribution across the entire set of characters. In TensorFlow releases where `predict_classes()` is no longer available, taking the `argmax` of the `predict()` output gives the same integer.

5) We can then decode this integer by looking up the mapping to see the character to which it maps.

6) This character can then be added to the input sequence. We then need to make sure that the input sequence is 10 characters long by truncating the first character from the input sequence text.
We can use the `pad_sequences()` function from the Keras API to perform this truncation
operation.
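
The preparation steps above can be sketched for a single seed using NumPy only, which makes each intermediate shape easy to inspect. This is an illustrative sketch under stated assumptions: `prepare_input` is a hypothetical helper that imitates truncation and left-padding with plain list operations, and `fake_probs` is a hand-made probability vector standing in for the model output.

```python
import numpy as np

chars = ['\n', ' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k',
         'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'w', 'y']
mapping = {c: i for i, c in enumerate(chars)}
inverse_mapping = {i: c for c, i in mapping.items()}

def prepare_input(text, mapping, seq_length):
    encoded = [mapping[c] for c in text]                      # step 1: integer encode
    encoded = encoded[-seq_length:]                           # keep the most recent characters
    encoded = [0] * (seq_length - len(encoded)) + encoded     # left-pad short seeds with 0
    onehot = np.eye(len(mapping), dtype='float32')[encoded]   # step 2: one-hot encode
    return onehot.reshape(1, seq_length, len(mapping))        # step 3: (samples, steps, features)

seed = 'sing a son'
x = prepare_input(seed, mapping, 10)
print(x.shape)

# step 4 (stand-in): a made-up distribution that favours 'g'
fake_probs = np.full(len(mapping), 0.1 / (len(mapping) - 1))
fake_probs[mapping['g']] = 0.9
next_index = int(np.argmax(fake_probs))

next_char = inverse_mapping[next_index]     # step 5: decode the integer
new_text = seed + next_char                 # step 6: append; truncation happens on the next pass
print(repr(new_text), repr(new_text[-10:]))
```

Note that in this mapping the padding value 0 is also the index of the newline character, so seeds shorter than 10 characters are effectively padded with newlines.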


In [109]:
# generate a sequence of characters with a language model

def generate_seq(model,mapping, seq_length, seed_text, n_chars):
    in_txt = clean_data(seed_text)
    # reverse lookup from integer to character
    index_to_char = dict((i,c) for c,i in mapping.items())

    # generate n_chars characters, one at a time
    for _ in range(n_chars):

        # encode the text as integers
        encoded_seq = [mapping[char] for char in in_txt]

        # truncate the sequence to a fixed length, keeping the most recent characters
        encoded_seq = pad_sequences([encoded_seq],maxlen=seq_length,truncating='pre')

        # one-hot encode; the result already has shape (1, seq_length, vocab_size)
        encoded_seq = to_categorical(encoded_seq,num_classes=len(mapping))

        # pick the most probable class; this is the value predict_classes() returned
        # in the TensorFlow releases that still provide it
        probs = model.predict(encoded_seq,verbose=0)
        yhat = int(np.argmax(probs,axis=-1)[0])

        # integer to character
        in_txt += index_to_char[yhat]

    return in_txt
        


In [85]:
from pickle import load
from tensorflow.keras.models import load_model

model = load_model('CharBasedLanguageModeling.h5')
mapping = load(open('mapping.pkl','rb'))

In [111]:
# test start of rhyme
print(generate_seq(model, mapping, 10, 'Sing a son', 20))
# test mid-line
print(generate_seq(model, mapping, 10, 'king was i', 20))
# test not in original
print(generate_seq(model, mapping, 10, 'hello world', 20))

sing a song of a pocket full o
king was in his counting count
hello world of crfackddnd pblkb
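
The seed "hello world" works only because every one of its characters happens to occur in the rhyme. The mapping has no entry for letters such as j, v, x, or z, so a seed containing them would raise a `KeyError` at the `mapping[char]` lookup. A small check like the one below can be run before generation; `unknown_chars` is a hypothetical helper, not part of the notebook.

```python
chars = ['\n', ' ', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k',
         'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'w', 'y']
mapping = {c: i for i, c in enumerate(chars)}

def unknown_chars(seed_text, mapping):
    """Return the sorted characters of the lowercased seed that the mapping cannot encode."""
    return sorted({c for c in seed_text.lower() if c not in mapping})

print(unknown_chars('hello world', mapping))   # every character is covered
print(unknown_chars('five jazz', mapping))     # j, v and z never appear in the training text
```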


In [94]:
in_txt = clean_data('Sing a son')
print(in_txt)    

sing a son


In [95]:
encoded_seq = [mapping[char] for char in in_txt]
print(encoded_seq)

[19, 10, 14, 8, 1, 2, 1, 19, 15, 14]


In [96]:
encoded_seq = pad_sequences([encoded_seq],maxlen=10,truncating='pre')
print(encoded_seq)

[[19 10 14  8  1  2  1 19 15 14]]


In [97]:
encoded_seq = to_categorical(encoded_seq,num_classes=len(mapping))
print(encoded_seq.shape)


(1, 10, 24)


In [106]:
print(mapping.items())

dict_items([('\n', 0), (' ', 1), ('a', 2), ('b', 3), ('c', 4), ('d', 5), ('e', 6), ('f', 7), ('g', 8), ('h', 9), ('i', 10), ('k', 11), ('l', 12), ('m', 13), ('n', 14), ('o', 15), ('p', 16), ('q', 17), ('r', 18), ('s', 19), ('t', 20), ('u', 21), ('w', 22), ('y', 23)])
