Import the required libraries

In [15]:
from keras.preprocessing.sequence import pad_sequences
from keras.layers import Embedding, LSTM, Dense,Dropout
from keras.preprocessing.text import Tokenizer
from keras.callbacks import EarlyStopping
from keras.models import Sequential
import keras.utils as ku 
import numpy as np

In [2]:
# load the text and convert it to lowercase
filename = "wonderland.txt"
with open(filename, encoding="utf8") as f:
    raw_text = f.read()
raw_text = raw_text.lower()

In the dataset preparation step, we first perform tokenization, which is the process of extracting tokens (terms / words) from a corpus. Keras has a built-in Tokenizer class for this, which can be used to obtain the tokens and their index in the corpus.
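To make the idea of "tokens and their index" concrete, here is a minimal pure-Python sketch. It is not how Keras implements its Tokenizer; the helper names `build_word_index` and `texts_to_sequence` and the toy corpus are assumptions made up for illustration. It only mirrors the general behaviour: more frequent words get smaller indices, and index 0 is left free for padding.

```python
from collections import Counter
import re

def build_word_index(lines):
    """Map each word to an integer index, most frequent word first (index 1)."""
    counts = Counter()
    for line in lines:
        # Lowercase and treat runs of letters/apostrophes as tokens (a simplification).
        counts.update(re.findall(r"[a-z']+", line.lower()))
    # most_common() sorts by descending count; indices start at 1 so 0 can mean "padding".
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def texts_to_sequence(line, word_index):
    """Replace each known word in a line with its index; unknown words are skipped."""
    words = re.findall(r"[a-z']+", line.lower())
    return [word_index[w] for w in words if w in word_index]

toy_corpus = ["the cat sat", "the cat ran", "the dog"]  # hypothetical corpus
word_index = build_word_index(toy_corpus)
print(word_index)
print(texts_to_sequence("the dog sat", word_index))
```

In this toy corpus "the" is the most frequent word, so it receives index 1, and a sentence becomes a list of such indices.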

In [3]:
tokenizer = Tokenizer()


In [10]:
def dataset_preparation(data):
    corpus = data.lower().split("\n")    
    tokenizer.fit_on_texts(corpus)
    total_words = len(tokenizer.word_index) + 1
    
    # Next, convert the corpus into a flat dataset of n-gram token sequences:
    # every prefix of a line (of length 2 or more) becomes one training sequence.
    input_sequences = []
    for line in corpus:
        token_list = tokenizer.texts_to_sequences([line])[0]
        for i in range(1, len(token_list)):
            n_gram_sequence = token_list[:i+1]
            input_sequences.append(n_gram_sequence)
    # The generated sequences have different lengths.
    # Before training the model, we pad them so that they all have equal length.
    # Keras provides the pad_sequences function for this purpose.

    max_sequence_len = max([len(x) for x in input_sequences])
    input_sequences = np.array(pad_sequences(input_sequences,maxlen=max_sequence_len, padding='pre'))
    
    # To feed this data into a learning model, we need predictors and a label.
    # All tokens of a sequence except the last are the predictors; the last token (the next word) is the label.
    
    predictors, label = input_sequences[:,:-1],input_sequences[:,-1]
    label = ku.to_categorical(label, num_classes=total_words)
    
    return predictors, label,max_sequence_len, total_words
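The steps inside dataset_preparation can be hard to picture from the code alone, so here is a small pure-Python sketch of the same idea on a single line of text. The token ids and the helper names `prefix_sequences` and `pad_pre` are made up for illustration; the real function relies on Keras' pad_sequences and NumPy slicing instead.

```python
def prefix_sequences(token_list):
    """Return every prefix of length >= 2, like the inner loop of dataset_preparation."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]

def pad_pre(seq, maxlen, value=0):
    """Left-pad a sequence with `value` up to `maxlen`, keeping the last `maxlen` items."""
    seq = seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

tokens = [54, 1308, 250, 7]  # hypothetical token ids for one line of text
seqs = prefix_sequences(tokens)
maxlen = max(len(s) for s in seqs)
padded = [pad_pre(s, maxlen) for s in seqs]

# Everything except the last token is the predictor; the last token is the label.
predictors_demo = [row[:-1] for row in padded]
labels_demo = [row[-1] for row in padded]

for x, y in zip(predictors_demo, labels_demo):
    print(x, "->", y)
```

Each row of predictors is a left-padded prefix of the line, and its label is the word that follows that prefix.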

In [5]:
predictors, label, max_sequence_len, total_words = dataset_preparation(raw_text)

In [6]:
label

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]], dtype=float32)
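The label array above looks like it contains only zeros, most likely because the display elides the middle columns: each row is a one-hot vector with as many columns as there are words in the vocabulary. A minimal pure-Python sketch of one-hot encoding, assuming a tiny vocabulary of four classes (the helper name `one_hot` is hypothetical and stands in for what to_categorical does):

```python
def one_hot(labels, num_classes):
    """Turn integer class labels into rows with a single 1.0 at the label's position."""
    rows = []
    for idx in labels:
        row = [0.0] * num_classes
        row[idx] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([2, 0, 3], num_classes=4)
for row in encoded:
    print(row)
```

Every row sums to 1, and the position of the 1.0 is the index of the next word.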

In [7]:
predictors

array([[   0,    0,    0, ...,    0,    0,   54],
       [   0,    0,    0, ...,    0,   54, 1308],
       [   0,    0,    0, ...,   54, 1308,  250],
       ...,
       [   0,    0,    0, ..., 3398,    4,  278],
       [   0,    0,    0, ...,    4,  278,   38],
       [   0,    0,    0, ...,  278,   38,  497]])

Recurrent Neural Networks

Unlike feed-forward neural networks, in which activations propagate in one direction only (from inputs to outputs), Recurrent Neural Networks feed the output of a layer back in as part of its input at the next time step. This creates loops in the network architecture which act as a ‘memory state’ of the neurons. This state gives the network the ability to remember what it has seen so far in the sequence.
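The loop described above can be sketched with a single recurrent neuron in pure Python. This is only an illustration: the weights `w_x`, `w_h` and the bias `b` are arbitrary values chosen for the example, not something learned from the data.

```python
import math

def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-neuron tanh RNN over a sequence and return the hidden state at each step."""
    h = 0.0  # the memory state starts empty
    states = []
    for x in inputs:
        # The new state depends on the current input AND on the previous state.
        h = math.tanh(w_x * x + w_h * h + b)
        states.append(h)
    return states

states = rnn_forward([1.0, 0.0, 0.0])
print(states)
```

Only the first input is non-zero, yet the later states are non-zero too: the first input is still "remembered" through the loop, although its influence fades with every step.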

The memory state gives RNNs an advantage over traditional neural networks, but they suffer from a problem called the Vanishing Gradient. When gradients are propagated back through a large number of layers or time steps, they shrink, and it becomes really hard for the network to learn and tune the parameters of the earlier layers. To address this problem, a type of RNN called the LSTM (Long Short-Term Memory) was developed.
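A rough numeric illustration of why the gradient vanishes, under a strong simplifying assumption: a one-neuron tanh RNN whose hidden value stays the same at every step, so each step multiplies the gradient by the same factor. The function name `gradient_after` and the values of `w_h` and `h` are made up for this toy example.

```python
def gradient_after(steps, w_h=0.8, h=0.5):
    """Toy estimate of how much gradient survives `steps` steps back in time.

    Assumes the hidden value is `h` at every step, so each step contributes
    the factor w_h * (1 - h**2), since the derivative of tanh(a) is 1 - tanh(a)**2.
    """
    factor = w_h * (1.0 - h ** 2)
    return factor ** steps

for steps in (1, 10, 50):
    print(steps, gradient_after(steps))
```

Because the per-step factor is below 1 here, the product shrinks exponentially, and after a few dozen steps almost no gradient reaches the earliest inputs.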

LSTMs have an additional state called the ‘cell state’, through which the network adjusts the flow of information. The advantage of this state is that the model can remember or forget what it has learned more selectively. To learn more about LSTMs, here is a great post. Let's build an LSTM model in our code. I have added the following layers to the model.

Input Layer : Takes the sequence of words as input (an Embedding layer in the code below).

LSTM Layer : Computes the output using LSTM units. The code below uses 150 units in the layer, but this number can be fine-tuned later.

Dropout Layer : A regularisation layer which randomly turns off the activations of some neurons in the LSTM layer. It helps in preventing overfitting.

Output Layer : Computes a probability for every word in the vocabulary; the most probable word is the predicted next word.
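To make the cell state of the LSTM layer a little more concrete, here is a sketch of a single LSTM step with scalar values in pure Python. It follows the commonly used gate formulation (forget, input and output gates plus a candidate value); the weights are hypothetical and hand-picked to force the gates open or shut, and the Keras layer works on vectors with learned weights rather than on scalars like this.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, w):
    """One scalar LSTM step. `w` maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = w[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much of the new candidate to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # the new candidate value
    c = f * c_prev + i * g           # updated cell state
    h = o * math.tanh(c)             # updated hidden state
    return h, c

# Hypothetical weights: "remember" keeps the old cell state, "overwrite" replaces it.
remember = {"forget": (0.0, 0.0, 20.0), "input": (0.0, 0.0, -20.0),
            "output": (0.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}
overwrite = {"forget": (0.0, 0.0, -20.0), "input": (0.0, 0.0, 20.0),
             "output": (0.0, 0.0, 0.0), "candidate": (1.0, 0.0, 0.0)}

_, c_kept = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, w=remember)
_, c_new = lstm_step(x=1.0, h_prev=0.0, c_prev=2.0, w=overwrite)
print(c_kept, c_new)
```

With the forget gate open and the input gate shut, the old cell state passes through almost unchanged; with the gates reversed, it is replaced by the new candidate. This is the selective remembering and forgetting mentioned above.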

In [32]:
def create_model(predictors, label, max_sequence_len, total_words):
    input_len = max_sequence_len - 1
    model = Sequential()
    model.add(Embedding(total_words, 10, input_length=input_len))
    model.add(LSTM(150))
    model.add(Dropout(0.1))
    model.add(Dense(total_words, activation='softmax'))
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    model.fit(predictors, label, epochs=25, verbose=1)
    return model
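The last Dense layer in create_model uses a softmax activation, which turns the raw scores for all words into probabilities that sum to 1. A minimal pure-Python sketch of softmax, using three made-up scores in place of a real vocabulary:

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for a 3-word vocabulary
predicted_index = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_index)
```

The index with the highest probability is taken as the predicted next word.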

Great, our model architecture is now ready and we can train it using our data. Next, let's write a function that predicts the next word based on the input words (the seed text). We first tokenize the seed text, pad the sequence and pass it into the trained model to get the predicted word. Repeating this and appending each predicted word to the seed text produces the predicted sequence.

In [27]:
def generate_text(seed_text, next_words, max_sequence_len, model):
    for _ in range(next_words):
        token_list = tokenizer.texts_to_sequences([seed_text])[0]
        token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
        # predict_classes is not available in newer Keras releases,
        # so take the index of the highest softmax probability instead
        predicted = int(np.argmax(model.predict(token_list, verbose=0), axis=-1)[0])
        output_word = ""
        # look up the word that belongs to the predicted index
        for word, index in tokenizer.word_index.items():
            if index == predicted:
                output_word = word
                break
        seed_text += " " + output_word
    return seed_text
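The generation loop above scans the whole vocabulary to turn a predicted index back into a word. A reverse dictionary does the same lookup in one step. The sketch below is pure Python with a made-up four-word vocabulary, and `fake_predict_next` is a stand-in for the trained model that simply cycles through the indices. The Keras Tokenizer should also expose a ready-made `index_word` mapping in recent versions, but treat that as something to verify for your installed version.

```python
word_index = {"alice": 1, "was": 2, "a": 3, "little": 4}  # hypothetical vocabulary
index_word = {index: word for word, index in word_index.items()}

def lookup(predicted_index):
    """Return the word for an index, or an empty string for padding/unknown indices."""
    return index_word.get(int(predicted_index), "")

def fake_predict_next(ids):
    """Stand-in for the trained model: cycles through indices 1..4."""
    return ids[-1] % 4 + 1

def generate_ids(seed_ids, next_words, predict_next):
    """Append one predicted index at a time, feeding each prediction back in."""
    ids = list(seed_ids)
    for _ in range(next_words):
        ids.append(predict_next(ids))
    return ids

ids = generate_ids([1, 2], next_words=3, predict_next=fake_predict_next)
generated = " ".join(lookup(i) for i in ids)
print(generated)
```

The structure is the same as in generate_text: each predicted word is appended to the input before the next prediction is made.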

In [33]:
X, Y, max_len, total_words = dataset_preparation(raw_text)
model = create_model(X, Y, max_len, total_words)

Epoch 1/25
Epoch 2/25
Epoch 3/25
Epoch 4/25
Epoch 5/25
Epoch 6/25
Epoch 7/25
Epoch 8/25
Epoch 9/25
Epoch 10/25
Epoch 11/25
Epoch 12/25
Epoch 13/25
Epoch 14/25
Epoch 15/25
Epoch 16/25
Epoch 17/25
Epoch 18/25
Epoch 19/25
Epoch 20/25
Epoch 21/25
Epoch 22/25
Epoch 23/25
Epoch 24/25
Epoch 25/25


In [34]:
type(model)

keras.engine.sequential.Sequential

In [35]:
text = generate_text("alice was", 3, max_len,model)
print(text)

alice was a little irritated


In [36]:
text = generate_text("the forest", 3, max_len, model)
print(text)

the forest queen was a


In [40]:
text = generate_text("poisonous cake", 3, max_len, model)
print(text)

poisonous cake ‘i don’t have
