# Word Prediction

In this notebook, we predict the next word that the writer is going to type. This helps us evaluate how well the neural network has learned the dependencies between the letters that combine to form a word. It also gives an idea of how much the model has learned about the order of different types of words in a sentence.

Code segments [1] to [5] are the same as those in the 'train.ipynb' notebook, where they are explained in detail.

In [1]:
import numpy as np
import keras
from keras.models import Sequential
from keras.layers import Dense, LSTM, Dropout, Activation
from keras.optimizers import RMSprop, Adam
from keras.callbacks import ModelCheckpoint
from keras.utils import np_utils

Using TensorFlow backend.


In [2]:
SEQ_LENGTH = 100

In [3]:
def buildmodel(VOCABULARY):
    model = Sequential()
    model.add(LSTM(256, input_shape = (SEQ_LENGTH, 1), return_sequences = True))
    model.add(Dropout(0.2))
    model.add(LSTM(256))
    model.add(Dropout(0.2))
    model.add(Dense(VOCABULARY, activation = 'softmax'))
    model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
    return model

In [4]:
# Read the whole book; the context manager closes the file afterwards.
with open('wonderland.txt', encoding = 'utf8') as file:
    raw_text = file.read()
raw_text = raw_text.lower()

In [5]:
chars = sorted(list(set(raw_text)))
print(chars)
bad_chars = ['#', '*', '@', '_', '\ufeff']
for bad_char in bad_chars:
    raw_text = raw_text.replace(bad_char, "")
chars = sorted(list(set(raw_text)))
print(chars)
VOCABULARY = len(chars)

int_to_char = dict((i, c) for i, c in enumerate(chars))
char_to_int = dict((c, i) for i, c in enumerate(chars))

['\n', ' ', '!', '#', '$', '%', '(', ')', '*', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '@', '[', ']', '_', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '‘', '’', '“', '”', '\ufeff']
['\n', ' ', '!', '$', '%', '(', ')', ',', '-', '.', '/', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', ':', ';', '?', '[', ']', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '‘', '’', '“', '”']


Now that the model has been defined, the input file preprocessed, and the vocabulary rebuilt as in train.ipynb, we are ready to proceed. The weights from the last epoch of training, which gave the lowest loss, are loaded into a freshly built model, and the model is recompiled.

In [6]:
filename = 'saved_models/weights-improvement-49-1.3420.hdf5'
model = buildmodel(VOCABULARY)
model.load_weights(filename)
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')
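
The checkpoint above is chosen by hand. As a minimal sketch, the lowest-loss file could also be picked automatically by parsing the loss from each file name. The pattern `weights-improvement-<epoch>-<loss>.hdf5` is inferred from the single file name used above, and the helper `best_checkpoint` and the example file list are hypothetical.

```python
import re

# Assumed naming pattern, inferred from the file name used in this notebook:
# weights-improvement-<epoch>-<loss>.hdf5
CHECKPOINT_PATTERN = re.compile(r"weights-improvement-(\d+)-(\d+\.\d+)\.hdf5$")

def best_checkpoint(filenames):
    """Return the file name with the lowest loss, or None if none match."""
    best_name, best_loss = None, float("inf")
    for name in filenames:
        match = CHECKPOINT_PATTERN.search(name)
        if match is None:
            continue  # skip files that do not follow the assumed pattern
        loss = float(match.group(2))
        if loss < best_loss:
            best_name, best_loss = name, loss
    return best_name

# Hypothetical file names for illustration; only the last one appears in this notebook.
example_files = [
    "saved_models/weights-improvement-01-2.9100.hdf5",
    "saved_models/weights-improvement-25-1.6200.hdf5",
    "saved_models/weights-improvement-49-1.3420.hdf5",
]
print(best_checkpoint(example_files))
```

In practice the list could come from something like `glob.glob('saved_models/*.hdf5')`, assuming all checkpoints are stored in that directory.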

In [7]:
from ipywidgets import widgets
from IPython.display import display

The model is defined to take inputs of exactly 100 characters. When the user first starts typing, the input string is shorter than 100 characters, so it is padded with spaces at the beginning to bring its total length to 100. Once the input exceeds 100 characters, only the last 100 characters are fed to the model. Because the LSTM layers here are not stateful, anything earlier than this window is not seen by the model.
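
The padding and truncation described above can be sketched in isolation as follows. This is an illustrative helper, not the notebook's own code: the name `prepare_window` and the toy vocabulary are assumptions, and the division by the vocabulary size mirrors the scaling applied before `model.predict` in the cell below.

```python
def prepare_window(encoded, seq_length, pad_value, vocab_size):
    """Left-pad or truncate integer-encoded text to seq_length, then scale it."""
    window = list(encoded)[-seq_length:]                        # keep the most recent characters
    window = [pad_value] * (seq_length - len(window)) + window  # left-pad if too short
    return [value / float(vocab_size) for value in window]      # scale by vocabulary size

# Toy vocabulary for illustration only.
toy_char_to_int = {' ': 0, 'a': 1, 'b': 2, 'c': 3}

short = [toy_char_to_int[c] for c in "abc"]
print(prepare_window(short, seq_length=5, pad_value=0, vocab_size=4))
# [0.0, 0.0, 0.25, 0.5, 0.75]

long_text = [toy_char_to_int[c] for c in "abcabcabc"]
print(prepare_window(long_text, seq_length=5, pad_value=0, vocab_size=4))
# [0.5, 0.75, 0.25, 0.5, 0.75]
```

In the notebook the resulting window is additionally reshaped to `(1, SEQ_LENGTH, 1)` with `np.reshape` so that it matches the input shape of the first LSTM layer.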

Succeeding characters are predicted by the model until it predicts a space. The predicted characters are joined to form the next word suggested by the model.
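
The character-by-character loop described above can be sketched with a stand-in for the trained network. Everything here is hypothetical: `predict_next_word`, `toy_predict` and the toy vocabulary are illustrative names, and the `max_chars` limit is an added safety measure in case no space is ever predicted.

```python
def predict_next_word(window, predict_probs, int_to_char, max_chars=30):
    """Greedily pick the most likely character until a space is produced.

    predict_probs stands in for the trained model: it takes the current
    window of character indices and returns one probability per character.
    """
    window = list(window)
    word = []
    for _ in range(max_chars):
        probs = predict_probs(window)
        index = max(range(len(probs)), key=probs.__getitem__)  # argmax
        window = window[1:] + [index]                           # slide the window by one
        if int_to_char[index] == ' ':
            break                                               # a space ends the word
        word.append(int_to_char[index])
    return ''.join(word)

# Toy stand-in model: always predicts the index following the last one seen.
toy_int_to_char = {0: ' ', 1: 'c', 2: 'a', 3: 't'}

def toy_predict(window):
    nxt = (window[-1] + 1) % 4
    return [1.0 if i == nxt else 0.0 for i in range(4)]

print(predict_next_word([0, 0, 0], toy_predict, toy_int_to_char))  # cat
```

Unlike this sketch, the notebook's loop keeps the trailing space as part of the predicted word.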

In [8]:
original_text = []
predicted_text = []

text = widgets.Text()
display(text)

def handle_submit(sender):
    global predicted_text
    global original_text

    inp = list(text.value)

    # Split the input into the text already processed and the newly typed word.
    last_word = inp[len(original_text):]
    inp = inp[:len(original_text)]
    original_text = text.value
    last_word.append(' ')

    # Encode the characters as integers (all of them must be in the vocabulary).
    inp_text = [char_to_int[c] for c in inp]
    last_word = [char_to_int[c] for c in last_word]

    # Keep only the last SEQ_LENGTH characters, or left-pad with spaces.
    if len(inp_text) > SEQ_LENGTH:
        inp_text = inp_text[-SEQ_LENGTH:]
    if len(inp_text) < SEQ_LENGTH:
        space = char_to_int[' ']
        inp_text = [space] * (SEQ_LENGTH - len(inp_text)) + inp_text

    # Slide the window over the newly typed word. The LSTM layers are not
    # stateful, so no prediction is needed while doing this.
    inp_text = (inp_text + last_word)[-SEQ_LENGTH:]

    # Predict one character at a time until the model predicts a space.
    next_word = []
    next_char = ''
    while next_char != ' ':
        X = np.reshape(inp_text, (1, SEQ_LENGTH, 1))
        probs = model.predict(X/float(VOCABULARY))
        index = int(np.argmax(probs))
        next_char = int_to_char[index]
        next_word.append(next_char)
        inp_text = inp_text[1:] + [index]

    predicted_text = predicted_text + [''.join(next_word)]
    print(" " + ''.join(next_word), end='|')
    
text.on_submit(handle_submit)

 ’ | of | y | a | the | dia | out | see | it | see | and | she | kean | the | a | wind, | of | the | would | and | it |  | and | shink | an | was | i | and | dou’e | the | at | hear | of | the | hear | thing | to | heard | the | hev | at | must | do | you |

The text box above shows the text as typed by the user. The text used here is the opening of the famous children's book 'The Cat in the Hat' by Dr. Seuss, available [here](http://www.stylist.co.uk/books/100-best-opening-lines-from-childrens-books#gallery-1). As the text is typed, pressing Enter right after the last character of a word (before the space) gives the next-word suggestion, followed by a vertical bar to separate the words, as shown above and in the gif.

Next, we summarize the model's predictions in a table that lists each word typed by the user side by side with the word the model suggested before it was typed, as shown after the code segment below.

In [9]:
from tabulate import tabulate

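original_text = original_text.split()
# Shift the predictions by one so that each suggestion lines up with the word
# typed after it; the final suggestion has no actual word to compare against.
predicted_text.insert(0,"")
predicted_text.pop()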

table = []
for i in range(len(original_text)):
    table.append([original_text[i], predicted_text[i]])
print(tabulate(table, headers = ['Actual Word', 'Predicted Word']))

Actual Word    Predicted Word
-------------  ----------------
the
sun            ’
did            of
not            y
shine,         a
it             the
was            dia
too            out
wet            see
to             it
play,          see
so             and
we             she
sat            kean
in             the
the            a
house          wind,
all            of
that           the
cold,          would
cold           and
wet            it
day.
i              and
sat            shink
there          an
with           was
sally.         i
we             and
sat            dou’e
here           the
we             at
two            hear
and            of
we             the
said           hear
how            thing
we             to
wish           heard
we             the
had            hev
something      at
to             must
do.            do
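
The one-position shift used to build the table above can be sketched on toy data. The helper names and the word lists are hypothetical and are not taken from the run above. The exact-match count is an added illustration, not a metric computed in this notebook.

```python
def align_predictions(actual_words, predictions):
    """Pair each typed word with the suggestion made just before it was typed."""
    shifted = [""] + list(predictions[:-1])  # the first word has no prior suggestion
    return list(zip(actual_words, shifted))

def exact_matches(pairs):
    """Count pairs where the suggestion equals the typed word, ignoring punctuation."""
    def strip(word):
        return word.strip(" .,!?;:")
    return sum(1 for actual, guess in pairs if guess and strip(actual) == strip(guess))

# Toy data for illustration only.
actual = ["the", "cat", "sat", "down."]
preds = ["dog ", "sat ", "down ", "and "]  # preds[i] is made after actual[i] is typed

pairs = align_predictions(actual, preds)
print(pairs)
# [('the', ''), ('cat', 'dog '), ('sat', 'sat '), ('down.', 'down ')]
print(exact_matches(pairs))  # 2
```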


## Conclusions
Several observations can be made from the table above:
* Most of the words generated by the model are proper English words, although there are a number of exceptions. This shows that the model has a reasonable grasp of how letters combine to form words. Although this is trivial for a human, producing plausible words one character at a time is a considerable achievement for a computer model.
* The model has also learned some English grammar. In the example above, it often suggests a verb where a verb is expected, such as 'wet to see' in place of 'wet to play'. In other cases the suggested word differs from the actual one but still fits well; for example, 'we sat in the wind' is suggested in place of 'we sat in the house'. Results like these are promising, although the model still has a lot to learn in this area.
* There are a few drawbacks as well. One is that the model often suggests 'and' after both a comma and a full stop; this may be correct after a comma but is almost always wrong after a full stop.

Overall, this makes a nice demonstration of word prediction using RNNs with LSTM nodes. The performance of this model also gives a sense of how much more advanced the models behind phone keyboard suggestions, which are very accurate, must be. The model could be improved by training it further, tuning the hyperparameters, using a deeper network, etc.