#  Character-Based Neural Language Model in Keras

 A language model predicts the next word in a sequence based on the specific words that have come before it. It is also possible to build language models at the character level, where the model predicts the next character from the characters that precede it. Character-based models have a small vocabulary and can handle any word, punctuation, or other document structure. This comes at the cost of requiring larger models that are slower to train. Nevertheless, in the field of neural language models, character-based models offer a lot of promise for a general, flexible and powerful approach to language modeling. After completing this section you will know:
 How to prepare text for character-based language modeling.
 How to develop a character-based language model using LSTMs.
 How to use a trained character-based language model to generate text.


# Data Preparation

The first step is to prepare the text data. We start by loading the source text into memory with a small helper function, load_doc().

In [8]:
from numpy import array
from keras.utils import to_categorical
from keras.utils.vis_utils import plot_model
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM

# load doc into memory
def load_doc(filename):
	# open the file as read only; the with block closes it automatically
	with open(filename, 'r') as file:
		# read all text
		text = file.read()
	return text

# Clean Text 
Next, we need to clean the loaded text. We will not do much to it in this example. Specifically,
we will strip all of the newline characters so that we have one long sequence of characters
separated only by single spaces.



In [11]:
# save sequences to file, one sequence per line
def save_doc(lines, filename):
	data = '\n'.join(lines)
	# the with block closes the file even if the write fails
	with open(filename, 'w') as file:
		file.write(data)

# load text
raw_text = load_doc('el_quijote.txt')
#raw_text = load_doc('/floyd/input/dataset2/el_quijote.txt')
#raw_text = load_doc('rhyme.txt')

#print(raw_text)
# clean
tokens = raw_text.split()
raw_text = ' '.join(tokens)
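
As a small illustration of this cleaning step, the sketch below applies the same split-and-join idea to a short string. The sample string is made up for the example and is not read from el_quijote.txt.

```python
# Hypothetical sample text, used only to illustrate the cleaning step
sample = "En un lugar\nde la Mancha,\n\n  de cuyo nombre\tno quiero acordarme"

# split() with no arguments splits on any run of whitespace
# (spaces, tabs and newlines), so empty tokens never appear
sample_tokens = sample.split()

# rejoin with single spaces to get one long line of text
cleaned = ' '.join(sample_tokens)

print(cleaned)
```

Because every run of whitespace collapses to one space, the cleaned text contains no newline characters, which matters later when the sequences are saved one per line.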


# Create Sequences
Now that we have a long string of characters, we can create the input-output sequences used to
train the model. Each input sequence will be 20 characters with one output character, making
each sequence 21 characters long. We can create the sequences by enumerating the characters
in the text, starting at the 21st character at index 20. The sequences are saved to a file with
the function save_doc().

In [12]:
# organize into sequences of characters
length = 20
sequences = list()
for i in range(length, len(raw_text)):
	# select a window of `length` input characters plus one output character
	seq = raw_text[i-length:i+1]
	# store
	sequences.append(seq)
print('Total Sequences: %d' % len(sequences))
# save sequences to file
out_filename = 'char_sequences.txt'
#print (sequences)
save_doc(sequences, out_filename)

Total Sequences: 1038375
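
To make the windowing easier to follow, here is a minimal sketch of the same loop on a toy string with a much shorter window. The helper name `make_sequences` and the toy text are assumptions introduced for this illustration only.

```python
def make_sequences(text, length):
    """Return every window of `length` input characters plus 1 output character."""
    return [text[i - length:i + 1] for i in range(length, len(text))]

# Hypothetical toy text and a short window so the result is easy to read
toy_text = "hello world"
toy_sequences = make_sequences(toy_text, length=4)

for seq in toy_sequences:
    # the first `length` characters are the input, the last one is the target
    print(repr(seq[:-1]), '->', repr(seq[-1]))

print('Total Sequences: %d' % len(toy_sequences))
```

A text of N characters yields N - length sequences, each length + 1 characters long, since the window slides forward one character at a time.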


# Train Language Model

The model will read encoded characters and predict the next character in the sequence. The first step is to load the prepared character sequence data from char_sequences.txt.

In [13]:

# load doc into memory
def load_doc(filename):
  # open the file as read only
  file = open(filename, 'r' )
  # read all text
  text = file.read()
  # close the file
  file.close()
  return text
# load
in_filename = 'char_sequences.txt'
raw_text = load_doc(in_filename)
lines = raw_text.split( '\n' )
#print lines


# Dictionary Mapping
We can create the mapping given a sorted set of unique characters in the
raw input data. The mapping is a dictionary of character values to integer values.
 

In [14]:
# integer encode sequences of characters
chars = sorted(list(set(raw_text)))
mapping = dict((c, i) for i, c in enumerate(chars))
sequences = list()
for line in lines:
	# integer encode line
	encoded_seq = [mapping[char] for char in line]
	# store
	sequences.append(encoded_seq)
# vocabulary size
vocab_size = len(mapping)
print('Vocabulary Size: %d' % vocab_size)
print(mapping)

Vocabulary Size: 89
{'t': 70, '<': 22, 'q': 67, 'B': 25, '”': 88, '8': 18, 'F': 29, '9': 19, ':': 20, 'a': 52, 'U': 44, 'd': 55, 'r': 68, 'V': 45, ']': 51, '[': 50, 'g': 58, 'I': 32, '¿': 79, '4': 14, 'E': 28, "'": 4, 'A': 24, 'M': 36, '1': 11, '-': 8, '«': 77, 'l': 62, '0': 10, 'v': 72, ';': 21, 'N': 37, 'x': 73, 'C': 26, '\n': 0, '’': 86, '6': 16, '‘': 85, 'D': 27, '(': 5, '́': 81, '2': 12, 'H': 31, 'e': 56, 'p': 66, '3': 13, '?': 23, '!': 2, ' ': 1, '̈': 83, 'W': 46, 'u': 71, 'Y': 48, 'T': 43, 'R': 41, '¡': 76, 'K': 34, 'b': 53, 'P': 39, '"': 3, '–': 84, '̀': 80, 'm': 63, 'n': 64, 's': 69, '̃': 82, 'G': 30, 'Q': 40, '.': 9, 'O': 38, 'S': 42, '“': 87, ',': 7, 'Z': 49, '5': 15, 'j': 61, '»': 78, 'L': 35, 'h': 59, 'X': 47, 'y': 74, 'i': 60, ')': 6, 'o': 65, 'J': 33, 'z': 75, 'c': 54, '7': 17, 'f': 57}
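
The mapping is easier to inspect on a tiny example. The sketch below builds one for a made-up string and also builds the inverse dictionary, which is useful later for turning predicted integers back into characters. The names `toy_mapping` and `inverse_mapping` are introduced here for illustration.

```python
# Hypothetical toy text, used only to illustrate the mapping
toy_text = "abracadabra"

# sorted() makes the character-to-integer assignment reproducible
toy_chars = sorted(set(toy_text))
toy_mapping = {c: i for i, c in enumerate(toy_chars)}

# integer encode the text, one integer per character
toy_encoded = [toy_mapping[c] for c in toy_text]

# invert the dictionary to decode integers back to characters
inverse_mapping = {i: c for c, i in toy_mapping.items()}
toy_decoded = ''.join(inverse_mapping[i] for i in toy_encoded)

print(toy_mapping)
print(toy_encoded)
print(toy_decoded)
```

Since every character gets a distinct integer, encoding followed by decoding returns the original text unchanged.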


# Encode Sequences
The sequences of characters must be encoded as integers: each unique character is assigned a
specific integer value, so each sequence of characters becomes a sequence of integers. This was
done with the mapping in the previous step. We can now separate the columns into input and
output sequences using a simple array slice, and then one hot encode both with to_categorical()
so that each character becomes a vector as long as the vocabulary.

In [6]:
# separate into input and output
sequences = array(sequences)
print('vectorizing sequences')
#print (sequences[:,:-1],sequences[:,-1])
X, y = sequences[:,:-1], sequences[:,-1]
#print ('these are the sequences', X, 'and y', y)
sequences = [to_categorical(x, num_classes=vocab_size) for x in X]
X = array(sequences)
y = to_categorical(y, num_classes=vocab_size)
print (X[0],X[1],'this is the',y[0],y[1])


vectorizing sequences
[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 
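
The resulting shapes can be hard to read from the raw printout, so here is a minimal pure-Python sketch of the same split and one hot encoding on toy data. The helper `one_hot` is a stand-in written for this example, not the Keras `to_categorical()` function, and the toy sequences and vocabulary size are assumptions.

```python
def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector

# Hypothetical toy data: 3 integer-encoded sequences of 4 inputs + 1 output,
# drawn from a vocabulary of 5 characters
toy_vocab_size = 5
toy_sequences = [
    [0, 1, 4, 0, 2],
    [1, 4, 0, 2, 0],
    [4, 0, 2, 0, 3],
]

# all elements but the last are inputs, the last element is the target
X_int = [seq[:-1] for seq in toy_sequences]
y_int = [seq[-1] for seq in toy_sequences]

# inputs become (samples, time steps, features); targets (samples, features)
X_onehot = [[one_hot(i, toy_vocab_size) for i in seq] for seq in X_int]
y_onehot = [one_hot(i, toy_vocab_size) for i in y_int]

print(len(X_onehot), len(X_onehot[0]), len(X_onehot[0][0]))
print(y_onehot[0])
```

Each input sample is a small matrix with one row per time step and one column per vocabulary entry, which is the three-dimensional layout the LSTM layer expects.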

# Define the language model
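The model is defined with an input layer whose shape is taken from X: one time step per input character and one feature per vocabulary entry for the one hot encoded input sequences. For the 38-character vocabulary of the run shown below with 10-character inputs, that is 10 time steps and 38 features; for sequences prepared with length = 20 over an 89-character vocabulary, it is 20 time steps and 89 features. The model has a single LSTM hidden layer with 75 memory cells, chosen with a little trial and error. The model has a fully connected output layer that outputs one vector with a probability distribution across all characters in the vocabulary. A softmax activation function is used on
the output layer to ensure the output has the properties of a probability distribution.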
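
The parameter counts that model.summary() reports for this architecture can be checked by hand. The sketch below assumes a standard LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a standard fully connected layer; the helper names are introduced for this illustration.

```python
def lstm_param_count(units, features):
    """Weights of a standard LSTM layer: 4 gates, each with input weights,
    recurrent weights and a bias vector."""
    return 4 * (units * features + units * units + units)

def dense_param_count(inputs, outputs):
    """Weights of a fully connected layer: one weight per input/output pair
    plus one bias per output."""
    return inputs * outputs + outputs

units = 75      # LSTM memory cells, from the description above
features = 38   # vocabulary size in the run whose summary is shown

lstm_params = lstm_param_count(units, features)
dense_params = dense_param_count(units, features)

print(lstm_params, dense_params, lstm_params + dense_params)
```

Note that the number of time steps does not appear in either formula: the LSTM reuses the same weights at every step, so a longer input sequence does not add parameters.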


In [7]:
from pickle import dump

def define_model(X):
    model = Sequential()
    # input shape is (time steps, features), taken from the one hot encoded X
    model.add(LSTM(75, input_shape=(X.shape[1], X.shape[2])))
    model.add(Dense(vocab_size, activation='softmax'))
    # compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # summarize defined model
    model.summary()
    plot_model(model, to_file='model.png', show_shapes=True)
    return model

model = define_model(X)
# fit model
model.fit(X, y, epochs=100, verbose=2)
# save the model to file
model.save('/home/raul/clases-pln/modelo-caracter/model.h5')
# save the mapping; the with block closes the file after writing
with open('/home/raul/clases-pln/modelo-caracter/mapping.pkl', 'wb') as f:
    dump(mapping, f)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
lstm_1 (LSTM)                (None, 75)                34200     
_________________________________________________________________
dense_1 (Dense)              (None, 38)                2888      
Total params: 37,088
Trainable params: 37,088
Non-trainable params: 0
_________________________________________________________________
Epoch 1/100
 - 1s - loss: 3.6175 - acc: 0.0852
Epoch 2/100
 - 0s - loss: 3.5401 - acc: 0.1880
Epoch 3/100
 - 0s - loss: 3.2740 - acc: 0.1905
Epoch 4/100
 - 0s - loss: 3.0777 - acc: 0.1905
Epoch 5/100
 - 0s - loss: 3.0246 - acc: 0.1905
Epoch 6/100
 - 0s - loss: 2.9931 - acc: 0.1905
Epoch 7/100
 - 0s - loss: 2.9760 - acc: 0.1905
Epoch 8/100
 - 0s - loss: 2.9636 - acc: 0.1905
Epoch 9/100
 - 0s - loss: 2.9500 - acc: 0.1905
Epoch 10/100
 - 0s - loss: 2.9286 - acc: 0.1905
Epoch 11/100
 - 0s - loss: 2.9114 - acc: 0.1905
Epoch 12/100
 -

# Generate Text
We can now use the trained model to generate new text. The saved model and mapping are loaded from file. A seed text is integer encoded with the mapping, truncated or padded to the input length the model was trained with, one hot encoded, and passed to the model to predict the next character. The predicted character is appended to the text and the process repeats for the required number of characters.

In [8]:
import numpy as np
from pickle import load
#from numpy import array
from keras.models import load_model
from keras.utils import to_categorical
from keras.preprocessing.sequence import pad_sequences

# generate a sequence of characters with a language model
def generate_seq(model, mapping, seq_length, seed_text, n_chars):
    in_text = seed_text
    # generate a fixed number of characters
    for _ in range(n_chars):
        # encode the characters as integers
        encoded = [mapping[char] for char in in_text]
        # truncate sequences to a fixed length
        encoded = pad_sequences([encoded], maxlen=seq_length, truncating='pre')
        # one hot encode
        encoded = to_categorical(encoded, num_classes=len(mapping))
        #encoded = encoded.reshape(1, encoded.shape[0], encoded.shape[1])
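        # predict character: take the argmax over the softmax output
        # (predict_classes() is not available in recent Keras releases)
        yhat = np.argmax(model.predict(encoded, verbose=0), axis=-1)[0]
        #print(yhat)
        # reverse map integer to character
        out_char = ''
        for char, index in mapping.items():
            if index == yhat:
                out_char = char
                break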
        # append to input
        in_text += out_char
    return in_text
# load the model
model = load_model('model.h5')
# load the mapping
mapping = load(open('mapping.pkl', 'rb'))
print(mapping)
# test start of rhyme
print(generate_seq(model, mapping, 10, 'Sing a son', 25))
# test mid-line
print(generate_seq(model, mapping, 10, 'king was i', 25))
# test not in original
print(generate_seq(model, mapping, 10, 'The queen', 25))

{'c': 17, 'k': 24, 'p': 29, 't': 33, "'": 2, 'S': 12, 'a': 15, 'w': 35, ',': 3, 'l': 25, ' ': 1, 'T': 13, 'B': 7, 'E': 9, 'H': 11, '\n': 0, 'F': 10, 'o': 28, 'q': 30, 'C': 8, ';': 5, 'h': 22, 'y': 37, 'd': 18, 'x': 36, 'e': 19, 'W': 14, 'm': 26, 'r': 31, 'i': 23, 'f': 20, 'g': 21, '.': 4, 'n': 27, 'u': 34, 'A': 6, 's': 32, 'b': 16}
Sing a song of sixpence, A pocket f
king was in his counting house, Cou
The queen was in the garden, Hangi
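
The third seed, 'The queen', is only 9 characters long, so it is padded before being fed to the model, while longer text is cut down to its last 10 characters. The pure-Python sketch below mimics that windowing, assuming the default behavior of pad_sequences (padding at the front with 0) combined with truncating='pre'. The helper `fit_window` and the toy mapping are assumptions made for this illustration, not part of Keras.

```python
def fit_window(encoded, seq_length, pad_value=0):
    """Keep the last `seq_length` codes and left-pad with `pad_value`
    when the input is shorter than the window."""
    window = encoded[-seq_length:]
    return [pad_value] * (seq_length - len(window)) + window

# Hypothetical toy mapping built the same way as in the notebook;
# '\n' sorts first, so it receives index 0
toy_chars = sorted(set("\nThe queen was"))
toy_mapping = {c: i for i, c in enumerate(toy_chars)}
inverse_mapping = {i: c for c, i in toy_mapping.items()}

short_seed = [toy_mapping[c] for c in "The queen"]      # 9 characters
long_seed = [toy_mapping[c] for c in "The queen was"]   # 13 characters

short_window = fit_window(short_seed, 10)
long_window = fit_window(long_seed, 10)

# decode the windows again to see what the model would receive
short_text = ''.join(inverse_mapping[i] for i in short_window)
long_text = ''.join(inverse_mapping[i] for i in long_window)
print(repr(short_text))
print(repr(long_text))
```

In both mappings printed in this notebook the newline character has index 0, so a padded seed looks to the model as if it started with a newline. The inverse dictionary used here for decoding could also replace the linear search over mapping.items() in generate_seq().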
