# Cogs 181
# Final Project - Saikiran Komatineni



## Imports
#### General Packages

In [121]:
%matplotlib inline
import matplotlib.pyplot as plt
import tensorflow as tf
import numpy as np
from scipy.spatial.distance import cdist

#### TensorFlow packages

In [176]:
# from tf.keras.models import Sequential  # This does not work!
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, GRU, Embedding, LSTM, SimpleRNN, Dropout
from tensorflow.python.keras.optimizers import Adam
from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

In [123]:
print("tensorflow version: ", tf.__version__)
print("keras version: ", tf.keras.__version__)

tensorflow version:  1.13.1
keras version:  2.2.4-tf


## Load Data
#### IMDB dataset

In [124]:
import imdb                              #download details stored in a separate file
imdb.data_dir = "data/IMDB/"             #set directory for where data is stored

imdb.maybe_download_and_extract()        #function declared in separate file

Data has apparently already been downloaded and unpacked.


#### Load the training- and test-sets.

In [125]:
x_train_text, y_train = imdb.load_data(train=True)
x_test_text, y_test = imdb.load_data(train=False)

print("Train-set size: ", len(x_train_text))
print("Test-set size:  ", len(x_test_text))

data_text = x_train_text + x_test_text

Train-set size:  25000
Test-set size:   25000


#### Example data and label 

Note: For the IMDB dataset, a label of 1.0 corresponds to a positive review and a label of 0.0 corresponds to a negative review.

In [126]:
print(x_train_text[1])
print("\n\nLabel: ", y_train[1])

Now, for all of the cinematographical buffs out there, this film may not rank high on your list of things to see. But if you know anything about plot development, profound truth, and the intentions that this film (the series) had, you'd understand my p.o.v.<br /><br />Granted, the specifics of the film are renderings of the writer, who cannot be expected to know what will happen in the end. But the film is biblically accurate and justifiably "scares" viewers into thinking about what may be. I'm a Christian, not due to this movie, but due to my personal decision to accept Jesus as my Savior. The film and potential that something similar to the circumstances portrayed therein can remarkably scare someone into thinking about their actions and decisions. It's not some cheap attempt to scare people into believing in God, but rather, a means to get your attention.<br /><br />As a Christian, I know I'll not be left behind, and thanks to movies like this, I can look beyond the superficialities

## Tokenizer

- Purpose: To convert human-interpretable text into computer-interpretable numbers in the form of matrices.
- Strategy: The approach is analogous to one-hot encoding in that each word in the text is converted to a unique integer. The model then uses these integers to tell the words apart.
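
To make the word-to-integer idea concrete, here is a minimal sketch in plain Python. It is not the Keras `Tokenizer` itself: the helper names `build_word_index` and `texts_to_sequences` and the toy texts are assumptions made for illustration, and the sketch splits on whitespace only, without the punctuation filtering a real tokenizer applies.

```python
from collections import Counter

def build_word_index(texts, num_words=None):
    """Map each word to an integer rank: 1 = most frequent word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # most_common() sorts by descending count; ties keep first-seen order.
    ranked = [word for word, _ in counts.most_common(num_words)]
    # Index 0 is left unused so it can later serve as the padding value.
    return {word: i for i, word in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index):
    """Replace each known word by its integer; unknown words are dropped."""
    return [[word_index[w] for w in text.lower().split() if w in word_index]
            for text in texts]

# Hypothetical toy corpus, not taken from the IMDB data.
toy_texts = ["the movie was good", "the movie was bad", "the plot"]
toy_index = build_word_index(toy_texts)
toy_sequences = texts_to_sequences(toy_texts, toy_index)
print(toy_index)
print(toy_sequences)
```

Frequent words receive small integers, which is the same ordering that appears in the vocabulary listing further down.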

In [288]:
num_words = 10000                                        #Maximum number of words to translate
tokenizer = Tokenizer(num_words=num_words)                #Tokenizer class from TensorFlow (Keras)

#### Model Fitting
- The tokenizer is fitted on both the training and the testing texts so that its vocabulary covers both. The texts are converted into their numerical representations in a later step, separately for training and testing.
- The entire vocabulary can be used by setting `num_words=None` 

In [289]:
#%%time
tokenizer.fit_on_texts(data_text)                                               

In [290]:
if num_words is None:
    num_words = len(tokenizer.word_index)

We can then inspect the vocabulary that has been gathered by the tokenizer. This is ordered by the number of occurrences of the words in the data-set. These integer-numbers are called word indices or "tokens" because they uniquely identify each word in the vocabulary.

In [291]:
tokenizer.word_index

{'the': 1,
 'and': 2,
 'a': 3,
 'of': 4,
 'to': 5,
 'is': 6,
 'br': 7,
 'in': 8,
 'it': 9,
 'i': 10,
 'this': 11,
 'that': 12,
 'was': 13,
 'as': 14,
 'for': 15,
 'with': 16,
 'movie': 17,
 'but': 18,
 'film': 19,
 'on': 20,
 'not': 21,
 'you': 22,
 'are': 23,
 'his': 24,
 'have': 25,
 'be': 26,
 'one': 27,
 'he': 28,
 'all': 29,
 'at': 30,
 'by': 31,
 'an': 32,
 'they': 33,
 'so': 34,
 'who': 35,
 'from': 36,
 'like': 37,
 'or': 38,
 'just': 39,
 'her': 40,
 'out': 41,
 'about': 42,
 'if': 43,
 "it's": 44,
 'has': 45,
 'there': 46,
 'some': 47,
 'what': 48,
 'good': 49,
 'when': 50,
 'more': 51,
 'very': 52,
 'up': 53,
 'no': 54,
 'time': 55,
 'my': 56,
 'even': 57,
 'would': 58,
 'she': 59,
 'which': 60,
 'only': 61,
 'really': 62,
 'see': 63,
 'story': 64,
 'their': 65,
 'had': 66,
 'can': 67,
 'me': 68,
 'well': 69,
 'were': 70,
 'than': 71,
 'much': 72,
 'we': 73,
 'bad': 74,
 'been': 75,
 'get': 76,
 'do': 77,
 'great': 78,
 'other': 79,
 'will': 80,
 'also': 81,
 'into': 82,
 'p

Convert all texts in the training-set and the test-set to lists of these tokens.

In [292]:
x_train_tokens = tokenizer.texts_to_sequences(x_train_text)
x_test_tokens = tokenizer.texts_to_sequences(x_test_text)

For example, here is a text from the training-set, followed by the list of tokens it corresponds to:

In [293]:
print("Sample text interpretation: \n\n", x_train_text[1])

sample = np.array(x_train_tokens[1])
print("\n\nSample numerical interpretation: ", sample)
print("\n\nData shape: ", sample.shape)

Sample text interpretation: 

 Now, for all of the cinematographical buffs out there, this film may not rank high on your list of things to see. But if you know anything about plot development, profound truth, and the intentions that this film (the series) had, you'd understand my p.o.v.<br /><br />Granted, the specifics of the film are renderings of the writer, who cannot be expected to know what will happen in the end. But the film is biblically accurate and justifiably "scares" viewers into thinking about what may be. I'm a Christian, not due to this movie, but due to my personal decision to accept Jesus as my Savior. The film and potential that something similar to the circumstances portrayed therein can remarkably scare someone into thinking about their actions and decisions. It's not some cheap attempt to scare people into believing in God, but rather, a means to get your attention.<br /><br />As a Christian, I know I'll not be left behind, and thanks to movies like this, I can 

## Padding and Truncating Data
- Purpose: Not all texts are of the same length. Although the RNN model can take in text of arbitrary length, reshaping the inputs so that they are all of the same length allows training to be done in batches.
- Strategy:
    - A: Use the longest text from the training data as the default size and pad all other texts so that they are of that size. This is computationally very expensive and wasteful, as it requires a lot of memory.
    - B: Write a custom data generator function that resizes all texts to a specific length around the average of the entire dataset. Although this is more complicated, it is more efficient.

In [294]:
num_tokens = [len(tokens) for tokens in x_train_tokens + x_test_tokens]
num_tokens = np.array(num_tokens)

In [295]:
print("Average number of tokens in a sequence: ", np.mean(num_tokens))
print("Maximum number of tokens in a sequence: ", np.max(num_tokens))

Average number of tokens in a sequence:  221.27716
Maximum number of tokens in a sequence:  2209


The max number of tokens we will allow is set to the average plus 2 standard deviations.

In [296]:
max_tokens = np.mean(num_tokens) + 2 * np.std(num_tokens)
max_tokens = int(max_tokens)
print("Maximum number of tokens allowed: ", max_tokens)
print("Percentage of data covered: ", np.sum(num_tokens < max_tokens) / len(num_tokens))

Maximum number of tokens allowed:  544
Percentage of data covered:  0.9453


#### Padding
Note: There are two options for padding: 'pre' and 'post'. They determine whether the front or the rear portion of the text gets padded/truncated. 
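
The difference between the two options can be sketched in plain Python. The helper `pad_or_truncate` below is a hypothetical stand-in written for illustration, not the Keras `pad_sequences` function, and it applies the same mode to both padding and truncation, as the cell below does.

```python
def pad_or_truncate(tokens, maxlen, mode="pre"):
    """Return a list of exactly `maxlen` tokens.

    mode='pre'  pads with zeros at the front and cuts tokens from the front.
    mode='post' pads with zeros at the end and cuts tokens from the end.
    """
    tokens = list(tokens)
    if len(tokens) >= maxlen:
        if mode == "pre":
            return tokens[len(tokens) - maxlen:]  # keep the last maxlen tokens
        return tokens[:maxlen]                    # keep the first maxlen tokens
    zeros = [0] * (maxlen - len(tokens))
    return zeros + tokens if mode == "pre" else tokens + zeros

short_seq = [5, 6, 7]
long_seq = [1, 2, 3, 4, 5, 6]
print(pad_or_truncate(short_seq, 5, "pre"))   # [0, 0, 5, 6, 7]
print(pad_or_truncate(short_seq, 5, "post"))  # [5, 6, 7, 0, 0]
print(pad_or_truncate(long_seq, 4, "pre"))    # [3, 4, 5, 6]
print(pad_or_truncate(long_seq, 4, "post"))   # [1, 2, 3, 4]
```

With 'pre', a long review loses its opening words and keeps its ending; with 'post', it is the other way around.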

In [297]:
pad = 'pre'                                #Arbitrarily chosen
x_train_pad = pad_sequences(x_train_tokens, maxlen=max_tokens,            
                            padding=pad, truncating=pad)                      #Padded training data 
x_test_pad = pad_sequences(x_test_tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)                       #Padded testing data

#Print shapes
print("Final training matrix shape: ", x_train_pad.shape)
print("Final testing matrix shape: ", x_test_pad.shape)

Final training matrix shape:  (25000, 544)
Final testing matrix shape:  (25000, 544)


We have now transformed the training-set into one big matrix of integers (tokens). For example, this is the token sequence of one review before padding:

In [298]:
np.array(x_train_tokens[1])

array([ 146,   15,   29,    4,    1, 4494,   41,   46,   11,   19,  200,
         21, 4033,  299,   20,  125, 1011,    4,  177,    5,   63,   18,
         43,   22,  118,  231,   42,  111,  979, 3241,  876,    2,    1,
       3215,   12,   11,   19,    1,  204,   66, 1421,  385,   56, 1494,
       1461, 2067,    7,    7, 2469,    1,    4,    1,   19,   23,    4,
          1,  563,   35,  576,   26,  862,    5,  118,   48,   80,  581,
          8,    1,  127,   18,    1,   19,    6, 1754,    2, 2744,  818,
         82,  532,   42,   48,  200,   26,  145,    3, 1545,   21,  693,
          5,   11,   17,   18,  693,    5,   56,  939, 2105,    5, 1727,
       2269,   14,   56,    1,   19,    2, 1050,   12,  137,  742,    5,
          1, 2282,  995, 9935,   67, 4557, 2389,  296,   82,  532,   42,
         65, 1640,    2, 3994,   44,   21,   47,  691,  602,    5, 2389,
         83,   82, 3611,    8,  545,   18,  248,    3,  811,    5,   76,
        125,  686,    7,    7,   14,    3, 1545,   

This has simply been padded to create the following sequence. Because we padded with 'pre', the Recurrent Neural Network first receives a long run of zeros and then the actual tokens. If we had padded with 'post', it would receive the integer-tokens first and then a long run of zeros, which may confuse the Recurrent Neural Network.

In [299]:
x_train_pad[1]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,   

## Tokenizer Inverse Map

Purpose: reconstruct text-strings from lists of tokens.

In [300]:
idx = tokenizer.word_index
inverse_map = dict(zip(idx.values(), idx.keys()))

Helper-function for converting a list of tokens back to a string of words.

In [301]:
def tokens_to_string(tokens):
    # Map from tokens back to words.
    words = [inverse_map[token] for token in tokens if token != 0]
    
    # Concatenate all words.
    text = " ".join(words)

    return text

For example, this is the original text from the data-set:

In [302]:
x_train_text[1]

'Now, for all of the cinematographical buffs out there, this film may not rank high on your list of things to see. But if you know anything about plot development, profound truth, and the intentions that this film (the series) had, you\'d understand my p.o.v.<br /><br />Granted, the specifics of the film are renderings of the writer, who cannot be expected to know what will happen in the end. But the film is biblically accurate and justifiably "scares" viewers into thinking about what may be. I\'m a Christian, not due to this movie, but due to my personal decision to accept Jesus as my Savior. The film and potential that something similar to the circumstances portrayed therein can remarkably scare someone into thinking about their actions and decisions. It\'s not some cheap attempt to scare people into believing in God, but rather, a means to get your attention.<br /><br />As a Christian, I know I\'ll not be left behind, and thanks to movies like this, I can look beyond the superficial

We can recreate this text except for punctuation and other symbols, by converting the list of tokens back to words:

In [303]:
tokens_to_string(x_train_tokens[1])

"now for all of the buffs out there this film may not rank high on your list of things to see but if you know anything about plot development profound truth and the intentions that this film the series had you'd understand my p o v br br granted the of the film are of the writer who cannot be expected to know what will happen in the end but the film is accurate and scares viewers into thinking about what may be i'm a christian not due to this movie but due to my personal decision to accept jesus as my the film and potential that something similar to the circumstances portrayed therein can remarkably scare someone into thinking about their actions and decisions it's not some cheap attempt to scare people into believing in god but rather a means to get your attention br br as a christian i know i'll not be left behind and thanks to movies like this i can look beyond the of entertainment acting and film to appreciate the depth that the film has to offer this is a movie you shouldn't not o

## Create the Recurrent Neural Network


In [304]:
model = Sequential()                              #Initializing the Sequential model using Keras

#### Layer #1 - Embedding Layer
Note: The embedding layer represents each token as a dense vector whose length is the embedding size. It can only be used as the first layer of the network.

In [305]:
embedding_size = 8                                #Arbitrarily chosen

#Add the first layer (Embedding) to the model by passing in the following parameters: maximum number of words
#in the vocabulary, output size, and length of the input matrix. The name of the layer is also specified.
model.add(Embedding(input_dim=num_words,
                    output_dim=embedding_size,
                    input_length=max_tokens,
                    name='layer_embedding'))

In [306]:
#model.add(GRU(units=32, return_sequences=True))

#### Layer #2 - GRU with 16 output units

In [307]:
#Returns sequences because there is another GRU following this

model.add(GRU(units=16, return_sequences=True))
#model.add(LSTM(units=16, return_sequences=True))
#model.add(SimpleRNN(units=16, return_sequences=True))

#### Layer #3 - GRU with 8 output units


In [308]:
#Returns sequences because there is another GRU following this

model.add(GRU(units=8, return_sequences=True))
#model.add(LSTM(units=8, return_sequences=True))
#model.add(SimpleRNN(units=8, return_sequences=True))

#### Layer #4 - GRU with 4 output units

In [309]:
model.add(GRU(units=4))
#model.add(LSTM(units=4))
#model.add(SimpleRNN(units=4))

#### Layer #5 - Dense with an output between 0 and 1

In [310]:
#rate = 0.5
#model.add(Dropout(rate))
model.add(Dense(1, activation='sigmoid'))

#### Optimizer - Adam

In [311]:
optimizer = Adam(lr=1e-3)
#optimizer = Adam(lr=0.01)

Compile the Keras model so it is ready for training.

In [312]:
model.compile(loss='binary_crossentropy',
              optimizer=optimizer,
              metrics=['accuracy'])

model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
layer_embedding (Embedding)  (None, 544, 8)            80000     
_________________________________________________________________
gru_23 (GRU)                 (None, 544, 32)           3936      
_________________________________________________________________
gru_24 (GRU)                 (None, 544, 16)           2352      
_________________________________________________________________
gru_25 (GRU)                 (None, 544, 8)            600       
_________________________________________________________________
gru_26 (GRU)                 (None, 4)                 156       
_________________________________________________________________
dropout_4 (Dropout)          (None, 4)                 0         
_________________________________________________________________
dense_6 (Dense)              (None, 1)                 5         
Total para
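
The parameter counts in the summary printed above can be reproduced by hand. The sketch below assumes the classic GRU formulation with a single bias vector per gate; the function names are hypothetical helpers, and the layer sizes are read off the summary rather than from the cells that build the model.

```python
def embedding_params(vocab_size, embedding_size):
    # One learned vector of length `embedding_size` per vocabulary entry.
    return vocab_size * embedding_size

def gru_params(input_size, units):
    # Three weight sets (update gate, reset gate, candidate state), each with
    # an input kernel, a recurrent kernel and one bias vector.
    return 3 * (units * (input_size + units) + units)

def dense_params(input_size, units):
    # Weight matrix plus one bias per output unit.
    return input_size * units + units

print(embedding_params(10000, 8))  # 80000
print(gru_params(8, 32))           # 3936
print(gru_params(32, 16))          # 2352
print(gru_params(16, 8))           # 600
print(gru_params(8, 4))            # 156
print(dense_params(4, 1))          # 5
```

Almost all of the parameters sit in the embedding matrix; the recurrent layers are small by comparison.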

## Train the Recurrent Neural Network

Note: Validation split is set to 5%.
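
As a rough sketch of what a 5% split means for 25000 training samples: Keras holds out the last fraction of the data passed to `fit`, before any shuffling. The helper `split_sizes` and the toy list below are assumptions made for illustration, not part of the Keras API.

```python
def split_sizes(num_samples, validation_split):
    """Return (train_size, validation_size) when the tail is held out."""
    split_at = int(num_samples * (1.0 - validation_split))
    return split_at, num_samples - split_at

train_size, val_size = split_sizes(25000, 0.05)
print(train_size, val_size)

# Hypothetical toy data to show which samples end up in the validation part.
samples = list(range(20))
cut, _ = split_sizes(len(samples), 0.25)
train_part, val_part = samples[:cut], samples[cut:]
print(val_part)
```

The sizes agree with the "Train on 23750 samples, validate on 1250 samples" line in the training log.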

In [313]:
%%time
model.fit(x_train_pad, y_train,
          validation_split=0.05, epochs=3, batch_size=64)

Train on 23750 samples, validate on 1250 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
CPU times: user 3h 27min 47s, sys: 16min 44s, total: 3h 44min 31s
Wall time: 1h 12min 51s


<tensorflow.python.keras.callbacks.History at 0x7f43a0c46710>

## Performance on Test-Set

Calculate its classification accuracy on the test-set.

In [None]:
%%time
result = model.evaluate(x_test_pad, y_test)



In [None]:
print("Accuracy: {0:.2%}".format(result[1]))

## Example of Mis-Classified Text

In [227]:
#Calculate prediction on the first 1000 texts
#%%time
y_pred = model.predict(x=x_test_pad[0:1000])
y_pred = y_pred.T[0]

In [157]:
cls_pred = np.array([1.0 if p>0.5 else 0.0 for p in y_pred])
cls_true = np.array(y_test[0:1000])

incorrect = np.where(cls_pred != cls_true)
incorrect = incorrect[0]

print("Number of incorrect predictions: ", len(incorrect))

Number of incorrect predictions:  134


Let us look at the first mis-classified text. We will use its index several times.

In [158]:
idx = incorrect[0]

text = x_test_text[idx]
print("Misclassified text: \n\n", text)

print("\n\nPredicted class: ", y_pred[idx])
print("True class: ", cls_true[idx])

Misclassified text: 



Predicted class:  0.16811341
True class:  1.0


## New Data

In [159]:
text1 = "This movie is fantastic! I really like it because it is so good!"
text2 = "Good movie!"
text3 = "Maybe I like this movie."
text4 = "Meh ..."
text5 = "If I were a drunk teenager then this movie might be good."
text6 = "Bad movie!"
text7 = "Not a good movie!"
text8 = "This movie really sucks! Can I get my money back please?"
texts = [text1, text2, text3, text4, text5, text6, text7, text8]

We first convert these texts to arrays of integer-tokens because that is needed by the model.

In [160]:
tokens = tokenizer.texts_to_sequences(texts)

To input texts with different lengths into the model, we also need to pad and truncate them.

In [161]:
tokens_pad = pad_sequences(tokens, maxlen=max_tokens,
                           padding=pad, truncating=pad)
tokens_pad.shape

(8, 544)

We can now use the trained model to predict the sentiment for these texts.

In [162]:
model.predict(tokens_pad)

array([[0.9533864 ],
       [0.88785267],
       [0.7494634 ],
       [0.8475419 ],
       [0.7336414 ],
       [0.5393318 ],
       [0.8263191 ],
       [0.4391596 ]], dtype=float32)

A value close to 0.0 means a negative sentiment and a value close to 1.0 means a positive sentiment. These numbers will vary every time you train the model.
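
To turn these scores into class labels, the same 0.5 cutoff used in the misclassification cell can be applied. A minimal sketch, where `to_label` is a hypothetical helper and the scores are copied from the output above:

```python
def to_label(probability, threshold=0.5):
    """Map a sigmoid output to a class: 1 = positive, 0 = negative."""
    return 1 if probability > threshold else 0

scores = [0.9533864, 0.88785267, 0.7494634, 0.8475419,
          0.7336414, 0.5393318, 0.8263191, 0.4391596]
labels = [to_label(p) for p in scores]
print(labels)  # [1, 1, 1, 1, 1, 1, 1, 0]
```

In this run only the last text falls below the cutoff, so "Bad movie!" and "Not a good movie!" would both be labelled positive.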

## Embeddings

The model cannot work on integer-tokens directly, because they are integer values that may range between 0 and the number of words in our vocabulary, e.g. 10000. So we need to convert the integer-tokens into vectors of values that are roughly between -1.0 and 1.0 which can be used as input to a neural network.

This mapping from integer-tokens to real-valued vectors is also called an "embedding". It is essentially just a matrix where each row contains the vector-mapping of a single token. This means we can quickly lookup the mapping of each integer-token by simply using the token as an index into the matrix. The embeddings are learned along with the rest of the model during training.
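
The lookup itself is nothing more than row indexing. Here is a minimal sketch with a toy matrix in plain Python; the sizes, the random initial values and the helper `embed` are assumptions made for illustration, not the trained weights of this model.

```python
import random

random.seed(0)
vocab_size, embedding_size = 6, 3   # toy sizes, not the notebook's 10000 x 8

# One row per token; values start small and random, as before training.
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(embedding_size)]
                    for _ in range(vocab_size)]

def embed(tokens):
    """Look up the embedding row for every token in the sequence."""
    return [embedding_matrix[token] for token in tokens]

sequence = [0, 0, 4, 1]             # a pre-padded token sequence
vectors = embed(sequence)
print(len(vectors), len(vectors[0]))  # 4 3
```

Each token, including the padding token 0, is replaced by its row, so a sequence of 4 tokens becomes 4 vectors of length 3.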

Ideally the embedding would learn a mapping where words that are similar in meaning also have similar embedding-values. Let us investigate if that has happened here.

First we need to get the embedding-layer from the model:

In [163]:
layer_embedding = model.get_layer('layer_embedding')

We can then get the weights used for the mapping done by the embedding-layer.

In [164]:
weights_embedding = layer_embedding.get_weights()[0]

Note that the weights are actually just a matrix with the number of words in the vocabulary times the vector length for each embedding. That's because it is basically just a lookup-matrix.

In [165]:
weights_embedding.shape

(10000, 8)

Let us get the integer-token for the word 'good', which is just an index into the vocabulary.

In [166]:
token_good = tokenizer.word_index['good']
token_good

49

Let us also get the integer-token for the word 'great'.

In [167]:
token_great = tokenizer.word_index['great']
token_great

78

These integer-tokens may be far apart and will depend on the frequency of those words in the data-set.

Now let us compare the vector-embeddings for the words 'good' and 'great'. Several of these values are similar, although some values are quite different. Note that these values will change every time you train the model.

In [168]:
weights_embedding[token_good]

array([-0.05894952,  0.0198911 ,  0.02607876,  0.06406524,  0.05504864,
       -0.00151495,  0.03171392,  0.0512884 ], dtype=float32)

In [169]:
weights_embedding[token_great]

array([-0.09971578,  0.13216725,  0.11536891,  0.10132793,  0.09753028,
        0.12597694,  0.10624902,  0.13722652], dtype=float32)

Similarly, we can compare the embeddings for the words 'bad' and 'horrible'.

In [170]:
token_bad = tokenizer.word_index['bad']
token_horrible = tokenizer.word_index['horrible']

In [171]:
weights_embedding[token_bad]

array([ 0.10676818, -0.09850182, -0.08702047, -0.10772773, -0.0772931 ,
       -0.11368319, -0.06965162, -0.10611054], dtype=float32)

In [172]:
weights_embedding[token_horrible]

array([ 0.17445901, -0.18954237, -0.19637075, -0.1768326 , -0.18044399,
       -0.17615777, -0.12885524, -0.13841864], dtype=float32)

### Sorted Words

We can also sort all the words in the vocabulary according to their "similarity" in the embedding-space. We want to see if words that have similar embedding-vectors also have similar meanings.

Similarity of embedding-vectors can be measured by different metrics, e.g. Euclidean distance or cosine distance.
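
As a small sketch of how the two metrics differ, both can be written out in plain Python. The vectors `u`, `v` and `w` are hypothetical toy values, not embeddings from the model.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity: 0 for the same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

u = [1.0, 2.0]
v = [2.0, 4.0]    # same direction as u, twice as long
w = [-1.0, -2.0]  # opposite direction to u

print(cosine_distance(u, v))     # close to 0
print(cosine_distance(u, w))     # close to 2
print(euclidean_distance(u, v))  # sqrt(5), about 2.236
```

Cosine distance ignores vector length and only compares direction, so its values always lie between 0 and 2.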

We have a helper-function for calculating these distances and printing the words in sorted order.

In [173]:
def print_sorted_words(word, metric='cosine'):
    """
    Print the words in the vocabulary sorted according to their
    embedding-distance to the given word.
    Different metrics can be used, e.g. 'cosine' or 'euclidean'.
    """

    # Get the token (i.e. integer ID) for the given word.
    token = tokenizer.word_index[word]

    # Get the embedding for the given word. Note that the
    # embedding-weight-matrix is indexed by the word-tokens
    # which are integer IDs.
    embedding = weights_embedding[token]

    # Calculate the distance between the embeddings for
    # this word and all other words in the vocabulary.
    distances = cdist(weights_embedding, [embedding],
                      metric=metric).T[0]
    
    # Get an index sorted according to the embedding-distances.
    # These are the tokens (integer IDs) for words in the vocabulary.
    sorted_index = np.argsort(distances)

    # Drop token 0 (the padding value, which has no word) here, so the
    # distances and words below stay aligned with each other.
    sorted_index = sorted_index[sorted_index != 0]

    # Sort the embedding-distances.
    sorted_distances = distances[sorted_index]

    # Sort all the words in the vocabulary according to their
    # embedding-distance. This is a bit excessive because we
    # will only print the top and bottom words.
    sorted_words = [inverse_map[token] for token in sorted_index]

    # Helper-function for printing words and embedding-distances.
    def _print_words(words, distances):
        for word, distance in zip(words, distances):
            print("{0:.3f} - {1}".format(distance, word))

    # Number of words to print from the top and bottom of the list.
    k = 10

    print("Distance from '{0}':".format(word))

    # Print the words with smallest embedding-distance.
    _print_words(sorted_words[0:k], sorted_distances[0:k])

    print("...")

    # Print the words with highest embedding-distance.
    _print_words(sorted_words[-k:], sorted_distances[-k:])

We can then print the words that are near and far from the word 'great' in terms of their vector-embeddings. Note that these may change each time you train the model.

In [174]:
print_sorted_words('great', metric='cosine')

Distance from 'great':
0.000 - great
0.004 - man'
0.005 - gem
0.005 - balloon
0.005 - delight
0.007 - finest
0.008 - refreshing
0.008 - impressed
0.009 - appreciated
0.009 - wonderfully
...
1.991 - dire
1.991 - forgettable
1.992 - graves
1.992 - stupidity
1.992 - jerky
1.992 - inexplicably
1.993 - owned
1.993 - awful
1.993 - fails
1.997 - jordan


Similarly, we can print the words that are near and far from the word 'worst' in terms of their vector-embeddings.

In [175]:
print_sorted_words('worst', metric='cosine')

Distance from 'worst':
0.000 - worst
0.005 - below
0.006 - ridiculous
0.006 - champion
0.007 - unfunny
0.009 - cannibals
0.010 - awful
0.010 - appalling
0.011 - monotonous
0.011 - education
...
1.991 - glad
1.992 - kinnear
1.992 - strung
1.992 - terrific
1.993 - favorite
1.993 - beautifully
1.993 - unforgettable
1.994 - parker
1.994 - finest
1.997 - man'
