# Word Prediction

In this notebook, we predict the next word that the writer is going to type. This helps us evaluate how well the neural network has learned the dependencies between the letters that combine to form a word. It also gives an idea of how well the model has learned the order of different types of words in a sentence.

Code segments [1] to [5] are the same as those in the 'train.ipynb' notebook, where their detailed explanation can be found.

In [1]:
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, Activation
from tensorflow.keras.optimizers import RMSprop, Adam
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

In [2]:
SEQ_LENGTH = 100

In [3]:
def buildmodel(VOCABULARY):
    model = Sequential()
    model.add(LSTM(256, input_shape = (SEQ_LENGTH, 1), return_sequences = True))
    model.add(Dropout(0.2))
    model.add(LSTM(256))
    model.add(Dropout(0.2))
    model.add(Dense(VOCABULARY, activation = 'softmax'))
    model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
    return model

In [4]:
# Read the whole book and lowercase it; the with-block closes the file afterwards
with open('wonderland.txt', encoding = 'utf8') as file:
    raw_text = file.read()
raw_text = raw_text.lower()

In [5]:
# Vocabulary before cleaning
chars = sorted(set(raw_text))
print(chars)
# Remove unwanted characters and the byte-order mark ('\ufeff')
bad_chars = ['#', '*', '@', '_', '\ufeff']
for bad_char in bad_chars:
    raw_text = raw_text.replace(bad_char, "")
# Vocabulary after cleaning
chars = sorted(set(raw_text))
print(chars)
VOCABULARY = len(chars)

# Lookup tables between characters and their integer indices
int_to_char = dict((i, c) for i, c in enumerate(chars))
char_to_int = dict((c, i) for i, c in enumerate(chars))

['\n', ' ', '!', '#', '$', '%', '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '‘', '’', '“', '”', '\ufeff']
['\n', ' ', '!', '$', '%', '(', ')', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '‘', '’', '“', '”']
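
To make the two lookup tables concrete, here is a minimal sketch on a toy string instead of the full book. The names `toy_text`, `encode`, and `decode` are illustrative and do not appear elsewhere in this notebook.

```python
# Toy stand-in for raw_text; the notebook builds the same kind of tables from the full book.
toy_text = "hello world"

toy_chars = sorted(set(toy_text))
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
toy_int_to_char = {i: c for i, c in enumerate(toy_chars)}

def encode(s, table):
    """Map each character of s to its integer index."""
    return [table[c] for c in s]

def decode(indices, table):
    """Map integer indices back to a string."""
    return ''.join(table[i] for i in indices)

codes = encode("low", toy_char_to_int)
print(toy_chars)
print(codes, decode(codes, toy_int_to_char))
```

Encoding followed by decoding returns the original string, which is the property the prediction cell relies on when it turns model outputs back into characters.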


Now that the model has been defined, the input file preprocessed, and the vocabulary rebuilt as in train.ipynb, we are ready to proceed. The model is built, the weights with the lowest loss (obtained in the last epoch of training) are loaded, and the model is recompiled.

In [6]:
filename = 'saved_models/weights-improvement-49-1.3420.hdf5'
model = buildmodel(VOCABULARY)
model.load_weights(filename)
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
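
The checkpoint name above is hard-coded. As a rough sketch, the lowest-loss file could instead be picked by parsing the names, assuming every checkpoint follows the pattern `weights-improvement-<epoch>-<loss>.hdf5` suggested by the one filename used here. The helper `best_checkpoint` and the example listing are hypothetical.

```python
import re

# Assumed naming scheme: weights-improvement-<epoch>-<loss>.hdf5
CHECKPOINT_PATTERN = re.compile(r'weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$')

def best_checkpoint(filenames):
    """Return the filename with the lowest loss, or None if none match."""
    best_name, best_loss = None, float('inf')
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match is None:
            continue  # skip files that do not follow the assumed scheme
        loss = float(match.group(2))
        if loss < best_loss:
            best_name, best_loss = name, loss
    return best_name

# Hypothetical directory listing; only the last name appears in this notebook.
example_files = [
    'saved_models/weights-improvement-01-2.9500.hdf5',
    'saved_models/weights-improvement-25-1.6100.hdf5',
    'saved_models/weights-improvement-49-1.3420.hdf5',
]
print(best_checkpoint(example_files))
```

In practice the list would come from the contents of the `saved_models` directory, and the returned name would be passed to `model.load_weights`.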

In [7]:
from ipywidgets import widgets
from IPython.display import display

The model was defined to take inputs of exactly 100 characters. When the user first starts typing, the input string is shorter than 100 characters. To solve this, the input is padded with spaces at the beginning to bring the total length to 100 characters. Once the total length exceeds 100 characters, only the last 100 characters are fed to the model, since that is the window size it was built for; any earlier text is dropped.
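
The padding and truncation rule can be sketched in isolation. This is a simplified illustration, not the notebook's exact code: `make_window` is a hypothetical helper, and the toy call uses a window of 5 instead of 100 so the result is easy to read.

```python
def make_window(indices, pad_index, seq_length=100):
    """Return exactly seq_length indices: left-padded if short, cut to the tail if long."""
    tail = indices[-seq_length:]
    return [pad_index] * (seq_length - len(tail)) + tail

# Toy values: index 0 stands in for char_to_int[' '].
print(make_window([3, 4], pad_index=0, seq_length=5))
print(make_window([1, 2, 3, 4, 5, 6, 7], pad_index=0, seq_length=5))
```

In the notebook, the resulting window is reshaped to `(1, SEQ_LENGTH, 1)` and divided by `VOCABULARY` before it is passed to `model.predict`.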

Succeeding characters are predicted by the model until a space or full stop is encountered. The predicted characters are then joined to form the next word.
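
The stopping rule can be tried out without the trained network. The sketch below replaces the model with a stand-in function, so `predict_word`, `fake_predict`, the toy vocabulary, and the `max_chars` cap are all assumptions made for illustration.

```python
def predict_word(window, predict_fn, int_to_char, stop_chars=(' ', '.'), max_chars=30):
    """Greedily append the most likely character until a stop character appears.

    predict_fn stands in for the model: it takes the current window (a list of
    integer indices) and returns one score per vocabulary entry.
    """
    window = list(window)  # work on a copy
    word = []
    while len(word) < max_chars:
        scores = predict_fn(window)
        index = max(range(len(scores)), key=scores.__getitem__)  # argmax
        char = int_to_char[index]
        word.append(char)
        window = window[1:] + [index]  # slide the window forward by one character
        if char in stop_chars:
            break
    return ''.join(word)

# Toy vocabulary and a fake "model" that spells "cat " from the last character seen.
toy_int_to_char = {0: ' ', 1: 'a', 2: 'c', 3: 't'}
next_index = {0: 2, 2: 1, 1: 3, 3: 0}  # ' '->'c', 'c'->'a', 'a'->'t', 't'->' '

def fake_predict(window):
    scores = [0.0] * len(toy_int_to_char)
    scores[next_index[window[-1]]] = 1.0
    return scores

print(repr(predict_word([0, 0, 0], fake_predict, toy_int_to_char)))
```

The `max_chars` cap is a precaution: greedy decoding has no guarantee of ever producing a space or full stop, so an upper bound keeps the loop finite.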

In [8]:
original_text = ''     # text as of the last submit
predicted_text = []    # words predicted so far

text = widgets.Text()
display(text)

def handle_submit(sender):
    global predicted_text
    global original_text

    original_text = text.value

    # Lowercase to match the training text and skip characters outside the vocabulary.
    typed = [char_to_int[c] for c in text.value.lower() if c in char_to_int]
    # A trailing space marks the end of the last typed word.
    typed.append(char_to_int[' '])

    # Keep the last SEQ_LENGTH characters; left-pad with spaces when the input is shorter.
    inp_text = typed[-SEQ_LENGTH:]
    inp_text = [char_to_int[' ']] * (SEQ_LENGTH - len(inp_text)) + inp_text

    # Predict one character at a time until a space or full stop is produced.
    # The length cap keeps the loop finite if the model never emits either.
    next_word = []
    next_char = ''
    while next_char not in (' ', '.') and len(next_word) < SEQ_LENGTH:
        X = np.reshape(inp_text, (1, SEQ_LENGTH, 1))
        probs = model.predict(X / float(VOCABULARY), verbose=0)
        index = int(np.argmax(probs))
        next_char = int_to_char[index]
        next_word.append(next_char)
        inp_text = inp_text[1:] + [index]  # slide the window forward by one character

    predicted_text = predicted_text + [''.join(next_word)]
    print(" " + ''.join(next_word), end='|')

# Run the handler each time Enter is pressed in the text box.
text.on_submit(handle_submit)

Text(value='')

 use | of | the |