# The Bee Movie Script Predictor
In this project, we use the entire Bee Movie script to build a predictive keyboard: given the last few words typed, the model suggests the next word.
We use a recurrent neural network (RNN) for this purpose because its hidden state carries information from earlier inputs in the sequence. Specifically, we use an LSTM (long short-term memory) layer, a special kind of RNN. Its gated cell state lets error signals be backpropagated through many time steps without shrinking as quickly, which helps to reduce the vanishing gradient problem.

#### Importing the libraries

In [1]:
import numpy as np
from nltk.tokenize import RegexpTokenizer
from tensorflow.keras.layers import LSTM
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import RMSprop
import matplotlib.pyplot as plt
import heapq
import warnings
warnings.filterwarnings('ignore')

#### Importing the dataset
First, we read the .txt file containing the entire script and convert it to lowercase. Then we split the text into a list of words, dropping punctuation and any token that contains a digit.

In [2]:
text = open("beeMovie.txt").read().lower()
print('Script Length', len(text))

Script Length 55315


In [3]:
tokenizer = RegexpTokenizer(r'\w+')
words = tokenizer.tokenize(text)
words = [item for item in words if item.isalpha()]  # Keep only purely alphabetic tokens (drops anything containing digits or underscores)
words[0:10]  # Show the first 10 words

['according',
 'to',
 'all',
 'known',
 'laws',
 'of',
 'aviation',
 'there',
 'is',
 'no']
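
To make the two cleaning steps concrete, here is a minimal standard-library sketch on a made-up sentence. It assumes that `re.findall(r"\w+", ...)` is an acceptable stand-in for `RegexpTokenizer(r'\w+').tokenize(...)`; the sample string and the names `sample`, `tokens` and `clean` are illustrative only.

```python
import re

# Made-up sample text (not taken from the script)
sample = "Barry's 3 friends flew 20 miles, non-stop!".lower()

# Step 1: split on runs of word characters, which drops punctuation
tokens = re.findall(r"\w+", sample)

# Step 2: keep only purely alphabetic tokens, which drops the numbers
clean = [tok for tok in tokens if tok.isalpha()]

print(tokens)
print(clean)
```

Note that contractions are split at the apostrophe, which is why fragments such as 's' or 've' can end up as separate entries in the word list.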

#### Feature Selection

We build a sorted list of the unique words in the script.
Then we build a dictionary that maps each word (the key) to its position in that list (the value).

In [4]:
uniqueWords = np.unique(words)
wordIndex = dict((c, i) for i, c in enumerate(uniqueWords))
wordIndex

{'a': 0,
 'able': 1,
 'abort': 2,
 'aborting': 3,
 'about': 4,
 'absolutely': 5,
 'absurd': 6,
 'according': 7,
 'account': 8,
 'across': 9,
 'action': 10,
 'actor': 11,
 'actual': 12,
 'actually': 13,
 'adam': 14,
 'addicted': 15,
 'adjusted': 16,
 'adrenaline': 17,
 'ads': 18,
 'advancement': 19,
 'advantage': 20,
 'advisory': 21,
 'affect': 22,
 'affects': 23,
 'affirmative': 24,
 'afraid': 25,
 'after': 26,
 'afternoon': 27,
 'aftertaste': 28,
 'again': 29,
 'against': 30,
 'agreed': 31,
 'ahead': 32,
 'aim': 33,
 'aiming': 34,
 'air': 35,
 'airport': 36,
 'alaska': 37,
 'alert': 38,
 'alive': 39,
 'all': 40,
 'allergic': 41,
 'allow': 42,
 'almost': 43,
 'alone': 44,
 'already': 45,
 'also': 46,
 'always': 47,
 'am': 48,
 'amazing': 49,
 'amen': 50,
 'amusement': 51,
 'an': 52,
 'anchor': 53,
 'and': 54,
 'angel': 55,
 'anger': 56,
 'angry': 57,
 'animal': 58,
 'animals': 59,
 'another': 60,
 'ant': 61,
 'antennae': 62,
 'antennas': 63,
 'antonio': 64,
 'anxiously': 65,
 'any': 66

Now the actual feature selection process begins.
We set the word limit to 4, which means that only the previous 4 words are used to predict the next word.
We create two lists, prevW and nextW.
prevW stores each window of 4 consecutive words, and nextW stores the word that follows each window.

In [5]:
wordLim = 4
prevW = []
nextW = []
for i in range(len(words) - wordLim):  # Stop wordLim words before the end so every window has a next word
    prevW.append(words[i:i + wordLim])
    nextW.append(words[i + wordLim])
print(prevW[0:5])
print("\n")
print(nextW[0:5])

[['according', 'to', 'all', 'known'], ['to', 'all', 'known', 'laws'], ['all', 'known', 'laws', 'of'], ['known', 'laws', 'of', 'aviation'], ['laws', 'of', 'aviation', 'there']]


['laws', 'of', 'aviation', 'there', 'is']
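
The same windowing can be traced by hand on a short list. This is a small sketch using the first seven words of the script as shown earlier; the names `toy_words`, `window`, `prev_windows` and `next_words` are illustrative. A list of N words yields N - 4 context/next-word pairs.

```python
# Toy word list: the first seven words of the script
toy_words = ["according", "to", "all", "known", "laws", "of", "aviation"]
window = 4  # plays the role of wordLim

prev_windows = []
next_words = []
for i in range(len(toy_words) - window):
    prev_windows.append(toy_words[i:i + window])  # the 4-word context
    next_words.append(toy_words[i + window])      # the word that follows it

# Every context window is paired with exactly one next word
for ctx, nxt in zip(prev_windows, next_words):
    print(" ".join(ctx), "->", nxt)

print("pairs:", len(prev_windows))
```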


#### One-Hot Encoding
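We create two boolean NumPy arrays.
x stores the features: for each sample, one one-hot vector per word in the 4-word window.
y stores the labels: one one-hot vector for the corresponding next word.
We iterate through prevW and nextW and, for each word, set the position given by wordIndex to 1 (True).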

In [6]:
x = np.zeros((len(prevW), wordLim, len(uniqueWords)), dtype=bool)
y = np.zeros((len(nextW), len(uniqueWords)), dtype=bool)
for i,eWords  in enumerate(prevW):
    for j, word in enumerate(eWords):
        x[i, j, wordIndex[word]] = 1
    y[i, wordIndex[nextW[i]]] = 1
print(x[0:3][2])
print("\n")
print(y[0:3][0])

[[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]


[False False False ... False False False]
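
The truncated printout hides the single True in each row. A small plain-Python sketch, with nested lists standing in for the NumPy arrays and a made-up five-word vocabulary, shows the same encoding and how to decode it back. The names `vocab`, `index`, `one_hot` and `decode` are hypothetical.

```python
# Made-up vocabulary, sorted like uniqueWords
vocab = sorted({"bee", "can", "fly", "no", "should"})
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with True at the word's index and False elsewhere."""
    row = [False] * len(vocab)
    row[index[word]] = True
    return row

def decode(row):
    """Recover the word from a one-hot row."""
    return vocab[row.index(True)]

context = ["no", "bee", "should", "fly"]
encoded = [one_hot(w) for w in context]  # shape: 4 rows x len(vocab) columns

for word, row in zip(context, encoded):
    print(word, row)
```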


### Building the model
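We use a single LSTM layer with 256 units, followed by a fully connected (Dense) layer with one output per vocabulary word and a softmax activation that turns the outputs into a probability distribution over the next word.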

In [7]:
kbModel = Sequential()
kbModel.add(LSTM(256, input_shape=(wordLim, len(uniqueWords))))
kbModel.add(Dense(len(uniqueWords)))
kbModel.add(Activation('softmax'))
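
To get a feel for the size of this model, the number of trainable parameters can be worked out by hand. The sketch below uses the standard formulas for an LSTM layer with biases and for a Dense layer. The vocabulary size of 1500 is a placeholder assumption, since the actual value of len(uniqueWords) is not shown above.

```python
def lstm_params(input_dim, units):
    # 4 gates (input, forget, cell candidate, output), each with
    # an input kernel, a recurrent kernel, and a bias vector
    return 4 * (input_dim * units + units * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

vocab_size = 1500  # ASSUMPTION: placeholder, replace with len(uniqueWords)
lstm_units = 256   # matches LSTM(256, ...) in the cell above

lstm_count = lstm_params(vocab_size, lstm_units)
dense_count = dense_params(lstm_units, vocab_size)

print("LSTM params: ", lstm_count)
print("Dense params:", dense_count)
print("Total params:", lstm_count + dense_count)
```

Calling kbModel.summary() reports the actual counts for the real vocabulary size.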

### Training the model
Using the RMSprop optimizer with a learning rate of 0.01, the model is trained for 4 epochs, with 5% of the samples held out for validation.

In [8]:
optimizer = RMSprop(learning_rate=0.01)
kbModel.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
hist = kbModel.fit(x, y, validation_split=0.05, batch_size=128, epochs=4, shuffle=True).history  # .history holds the per-epoch loss and accuracy

Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


In [9]:
hist

{'loss': [6.3454909324646,
  5.457568645477295,
  4.708028793334961,
  3.832205295562744],
 'accuracy': [0.045630648732185364,
  0.11934997886419296,
  0.18133878707885742,
  0.27744296193122864],
 'val_loss': [5.87396764755249,
  5.838569164276123,
  5.937586784362793,
  6.240148544311523],
 'val_accuracy': [0.10408163070678711,
  0.11224489659070969,
  0.11836734414100647,
  0.11428571492433548]}
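
The history above shows the training loss falling every epoch while the validation loss bottoms out early and then rises, which is the usual sign of overfitting. Below is a minimal sketch for finding the epoch with the lowest validation loss, using the values printed above rounded to four decimals. The names `val_loss`, `train_loss` and `gaps` are illustrative; in the notebook you would read hist['val_loss'] and hist['loss'] directly.

```python
# Per-epoch losses copied from the output above (rounded to 4 decimals)
val_loss = [5.8740, 5.8386, 5.9376, 6.2401]
train_loss = [6.3455, 5.4576, 4.7080, 3.8322]

# Index of the smallest validation loss; epochs are numbered from 1
best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__) + 1

# Gap between validation and training loss; a growing gap suggests overfitting
gaps = [round(v - t, 4) for v, t in zip(val_loss, train_loss)]

print("Best epoch by val_loss:", best_epoch)
print("val_loss - loss per epoch:", gaps)
```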

### Predictions
First, we need a function that cleans the given input (removing punctuation and numbers) and a function that one-hot encodes the cleaned input.
Then we need a function that picks the 5 most probable words from the model's output.
Finally, we create a function that runs the model on the encoded input and returns the 5 candidate words.


In [10]:
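def textCleaner(inp, check=True):
    # Lowercase, tokenize, and keep only purely alphabetic tokens
    inp = inp.lower()
    inp = tokenizer.tokenize(inp)
    inp = [item for item in inp if item.isalpha()]
    if check:
        return " ".join(inp[:wordLim])   # check=True: use the first wordLim words
    else:
        return " ".join(inp[-wordLim:])  # check=False: use the last wordLim words

def inputEncoder(string, check=True):
    # One-hot encode the cleaned input into shape (1, wordLim, vocabulary size)
    string = textCleaner(string, check)
    x = np.zeros((1, wordLim, len(uniqueWords)))
    print("Resulted Sequence\n")
    for t, word in enumerate(string.split()):
        print(word)
        x[0, t, wordIndex[word]] = 1  # Raises KeyError if the word is not in the script
    return x

def bestResult(preds):
    # Re-normalize the predicted probabilities (log followed by exp leaves them unchanged)
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)

    # Indices of the 5 largest probabilities, most probable first
    return heapq.nlargest(5, range(len(preds)), preds.take)

def predictor(string, check=True):
    # Return "0" for empty input; otherwise the 5 most probable next words
    if string == "":
        return("0")
    x = inputEncoder(string, check)
    prediction = kbModel.predict(x, verbose=0)[0]
    nextInd = bestResult(prediction)
    return [uniqueWords[idx] for idx in nextInd]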
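
The ranking step in bestResult can be seen in isolation with a plain list. This sketch uses made-up probabilities and a made-up vocabulary; `probs`, `toy_vocab` and `top_k` are hypothetical names. It uses `list.__getitem__` as the key where bestResult uses the NumPy array's `take` method.

```python
import heapq

toy_vocab = ["bee", "flower", "honey", "pollen", "queen", "wing"]
probs = [0.05, 0.40, 0.25, 0.10, 0.15, 0.05]  # made-up softmax output

def top_k(probabilities, k):
    """Return the indices of the k largest probabilities, largest first."""
    return heapq.nlargest(k, range(len(probabilities)), key=probabilities.__getitem__)

best = top_k(probs, 3)
suggestions = [toy_vocab[i] for i in best]
print(suggestions)
```

heapq.nlargest avoids sorting the whole vocabulary when only a handful of suggestions is needed.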

In [11]:
inputEncoder("Hello how are you doing".lower(),False)

Resulted Sequence

how
are
you
doing


array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])
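
Because inputEncoder looks every word up in wordIndex, a word that never occurs in the script raises a KeyError. One possible guard, sketched here in plain Python with a made-up vocabulary, is to drop unknown words before taking the context window. The function `known_context` and the toy `word_index` are hypothetical and not part of the notebook.

```python
word_index = {"are": 0, "bees": 1, "how": 2, "you": 3}  # made-up toy vocabulary

def known_context(text, word_index, limit=4, use_first=True):
    """Split text on whitespace, drop out-of-vocabulary words, keep `limit` words."""
    words = [w for w in text.lower().split() if w in word_index]
    return words[:limit] if use_first else words[-limit:]

kept = known_context("How are you zorbing bees", word_index)
empty = known_context("Zorbing zorbed", word_index)

print(kept)
print(empty)
```

Dropping words changes which four words the model sees, so this is a trade-off rather than a complete fix. Mapping unknown words to a dedicated token would be another option, but it would require retraining.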

Finally, let's see how our predictor works.
The tokenizer removes punctuation, and the isalpha filter then removes numbers. By default, the first four words of the input are used to make the prediction; passing False as the second argument uses the last four words instead.

In [12]:
inp="Wait. One of these flowers seems to be on the move."
print("sequence is: ",inp)
print("The possible words can be: ",predictor(inp))


sequence is:  Wait. One of these flowers seems to be on the move.
Resulted Sequence

wait
one
of
these
The possible words can be:  ['flowers', 'bees', 'life', 'roses', 'these']


In [13]:
inp="picking up a lot of bright yellow ."
print("sequence is: ",inp)
print("The possible words can be: ",predictor(inp,False))

sequence is:  picking up a lot of bright yellow .
Resulted Sequence

lot
of
bright
yellow
The possible words can be:  ['let', 'oould', 'yellow', 'as', 've']


In [14]:
inp="Are bees really dead or not"
print("sequence is: ",inp)
print("The possible words can be: ",predictor(inp,False))

sequence is:  Are bees really dead or not
Resulted Sequence

really
dead
or
not
The possible words can be:  ['dead', 'the', 'all', 'at', 'there']
