# <font color='#6629b2'>Predicting sentiment ratings with neural networks using Keras</font>
### https://github.com/roemmele/keras-rnn-notebooks
by Melissa Roemmele, 10/23/17, roemmele @ ict.usc.edu

## <font color='#6629b2'>Overview</font>

I am going to show how to use the Keras library to build both a multilayer perceptron (MLP) model and a recurrent neural network (RNN) model that predict sentiment ratings for text sequences. Specifically, the models will predict the ratings associated with movie reviews.

### <font color='#6629b2'>Neural Networks for Language Data</font>

At a high level, neural networks encode some input variables via a set of parameters (weights) that are optimized to predict some output variable. The simplest type of neural network is a feed-forward multilayer perceptron (MLP), which operates on some feature representation of a linguistic input. Recurrent neural networks (RNNs) are an extension of this simple model that specifically model the sequential aspect of the input and thus are particularly useful for natural language processing tasks. The notebook demonstrates the code needed to assemble MLP and RNN models for an NLP task using the Keras library, as well as some data processing tools that facilitate building the model. 

If you understand how to structure the input and output of the model, and know the fundamental concepts in machine learning, then a high-level understanding of how a neural network works is sufficient for using Keras. You'll see that most of the code here is actually just data manipulation, and I'll visualize each step in this process. The code used to assemble the models themselves is more minimal. It is of course useful to know the details of how neural networks work, so you can reason about the results and modify the model to make it better. For a better understanding of neural networks and RNNs in particular, see the resources at the bottom of the notebook.

Here a neural network will be used to encode the text of a movie review, and this representation will be used to predict the numerical rating assigned by the reviewer. The model shown here can be applied to any task where the goal is to predict a numerical score associated with a piece of text. Hopefully you can substitute your own datasets and/or modify the code to adapt it to other tasks.

### <font color='#6629b2'>Keras</font>

[Keras](https://keras.io/) is a Python deep learning framework that lets you quickly put together neural network models with a minimal amount of code. It can be run on top of the mathematical optimization libraries [Theano](http://deeplearning.net/software/theano/) or [TensorFlow](https://www.tensorflow.org/) without you needing to know either of these underlying frameworks. It provides implementations of several of the layer architectures, objective functions, and optimization algorithms you need for building a model.

## <font color='#6629b2'>Dataset</font>

The [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) consists of 50,000 movie reviews from [IMDB](http://www.imdb.com/). The ratings are on a 1-10 scale, but the dataset only contains "polarized" reviews: positive reviews with a rating of 7 or higher, and negative reviews with a rating of 4 or lower. There are an equal number of positive and negative reviews. In the full dataset, the reviews are divided into train and test sets with 25,000 reviews each. Here I'm just going to load a sample training set of 100 reviews, but you can download the full dataset at the above link.

In [1]:
from __future__ import print_function #Python 2/3 compatibility for print statements
import pandas
pandas.set_option('display.max_colwidth', 170) #widen pandas rows display

I'll load the datasets using the [pandas library](https://pandas.pydata.org/), which is extremely useful for any task involving data storage and manipulation. This library puts a dataset into a readable table format, and makes it easy to retrieve specific columns and rows.

In [2]:
'''Load the training dataset'''

train_reviews = pandas.read_csv('dataset/example_train_imdb_reviews.csv', encoding='utf-8')
train_reviews[:10]

Unnamed: 0,Rating,Review
0,2,this movie only gets a second star because i work downtown and liked seeing it destroyed. the effects were pretty good- i hear it was the most expensive Korean film e...
1,8,"As I watched this movie, and I began to see its' characters develop I could feel this would be an excellent picture. When you get that feeling, and the movie indeed f..."
2,4,"this seemed an odd combination of Withnail and I with A Room with a View.. sometimes it worked, other times it did not. tragedy that they changed the name for the US ..."
3,9,"When I saw the Exterminators of year 3000 at first time, I had no expectations for that movie. Although, it wasn't so bad as I was thought. It's kind of Italian versi..."
4,9,"This is a very entertaining flick, considering the budget and its length. The storyline is hardly ever touched on in the movie world so it also brought a sense of nov..."
5,1,"""Trigger Man"" is definitely the most boring and silliest movie I've ever seen in my life. My aunt's holiday videos are more fascinating. The actors seem to be recrui..."
6,10,If you havn't seen this movie I highly recommend you do.It's an excellent true story.I love Alison Lohman she is so talented side note: I also loved her in 7th heaven...
7,9,"I went to see Fever Pitch with my Mom, and I can say that we both loved it. It wasn't the typical romantic comedy where someone is pining for the other, and blah blah..."
8,9,"First ever viewing: July 21, 2008 Very impressive screenplay and comedic acting and timing in this film. Now 40 years old, it has lost none of it's power. Neil Simon..."
9,7,"Weak, fast and multicolor,this is the Valvoline's movie in fact you can see always this brand of oil in a lot of scene. The real protagonist are the cars,weak perform..."


## <font color='#6629b2'>Preparing the data</font>

###  <font color='#6629b2'>Tokenization</font>

The first pre-processing step is to tokenize each of the reviews into (lowercased) individual words, since the RNN will encode the reviews word by word. For this I'll use [spaCy](https://spacy.io/), which is a fast and extremely user-friendly library that performs various language processing tasks. Once you load a spaCy model for a particular language, you can provide any text as input to the model (e.g. encoder(text)) and access its linguistic features.

In [3]:
'''Split texts into lists of words (tokens)'''

import spacy

encoder = spacy.load('en')

def text_to_tokens(text_seqs):
    token_seqs = [[word.lower_ for word in encoder(text_seq)] for text_seq in text_seqs]
    return token_seqs

train_reviews['Tokenized_Review'] = text_to_tokens(train_reviews['Review'])
    
train_reviews[['Review','Tokenized_Review']][:10]

Unnamed: 0,Review,Tokenized_Review
0,this movie only gets a second star because i work downtown and liked seeing it destroyed. the effects were pretty good- i hear it was the most expensive Korean film e...,"[this, movie, only, gets, a, second, star, because, i, work, downtown, and, liked, seeing, it, destroyed, ., the, effects, were, pretty, good-, i, hear, it, was, the,..."
1,"As I watched this movie, and I began to see its' characters develop I could feel this would be an excellent picture. When you get that feeling, and the movie indeed f...","[as, i, watched, this, movie, ,, and, i, began, to, see, its, ', characters, develop, i, could, feel, this, would, be, an, excellent, picture, ., when, you, get, that..."
2,"this seemed an odd combination of Withnail and I with A Room with a View.. sometimes it worked, other times it did not. tragedy that they changed the name for the US ...","[this, seemed, an, odd, combination, of, withnail, and, i, with, a, room, with, a, view, .., sometimes, it, worked, ,, other, times, it, did, not, ., tragedy, that, t..."
3,"When I saw the Exterminators of year 3000 at first time, I had no expectations for that movie. Although, it wasn't so bad as I was thought. It's kind of Italian versi...","[when, i, saw, the, exterminators, of, year, 3000, at, first, time, ,, i, had, no, expectations, for, that, movie, ., although, ,, it, was, n't, so, bad, as, i, was, ..."
4,"This is a very entertaining flick, considering the budget and its length. The storyline is hardly ever touched on in the movie world so it also brought a sense of nov...","[this, is, a, very, entertaining, flick, ,, considering, the, budget, and, its, length, ., the, storyline, is, hardly, ever, touched, on, in, the, movie, world, so, i..."
5,"""Trigger Man"" is definitely the most boring and silliest movie I've ever seen in my life. My aunt's holiday videos are more fascinating. The actors seem to be recrui...","["", trigger, man, "", is, definitely, the, most, boring, and, silliest, movie, i, 've, ever, seen, in, my, life, ., my, aunt, 's, holiday, videos, are, more, fascinati..."
6,If you havn't seen this movie I highly recommend you do.It's an excellent true story.I love Alison Lohman she is so talented side note: I also loved her in 7th heaven...,"[if, you, havn't, seen, this, movie, i, highly, recommend, you, do, ., it, 's, an, excellent, true, story, ., i, love, alison, lohman, she, is, so, talented, side, no..."
7,"I went to see Fever Pitch with my Mom, and I can say that we both loved it. It wasn't the typical romantic comedy where someone is pining for the other, and blah blah...","[i, went, to, see, fever, pitch, with, my, mom, ,, and, i, can, say, that, we, both, loved, it, ., it, was, n't, the, typical, romantic, comedy, where, someone, is, p..."
8,"First ever viewing: July 21, 2008 Very impressive screenplay and comedic acting and timing in this film. Now 40 years old, it has lost none of it's power. Neil Simon...","[first, ever, viewing, :, july, 21, ,, 2008, , very, impressive, screenplay, and, comedic, acting, and, timing, in, this, film, ., now, 40, years, old, ,, it, has, l..."
9,"Weak, fast and multicolor,this is the Valvoline's movie in fact you can see always this brand of oil in a lot of scene. The real protagonist are the cars,weak perform...","[weak, ,, fast, and, multicolor, ,, this, is, the, valvoline, 's, movie, in, fact, you, can, see, always, this, brand, of, oil, in, a, lot, of, scene, ., the, real, p..."


###  <font color='#6629b2'>Lexicon</font>

Then we need to assemble a lexicon (aka vocabulary) of words that the model needs to know. Each tokenized word in the reviews is added to the lexicon, and then each word is mapped to a numerical index that can be read by the model. Since large datasets may contain a huge number of unique words, it's common to filter out all words occurring fewer than a certain number of times, and replace them with some generic &lt;UNK&gt; token. The min_freq parameter in the function below defines this threshold. When assigning the indices, the number 1 will represent unknown words. The number 0 will represent "empty" word slots (padding), which is explained below. Therefore "real" words will have indices of 2 or higher.

In [4]:
'''Count tokens (words) in texts and add them to the lexicon'''

import pickle

def make_lexicon(token_seqs, min_freq=1):
    # First, count how often each word appears in the text.
    token_counts = {}
    for seq in token_seqs:
        for token in seq:
            if token in token_counts:
                token_counts[token] += 1
            else:
                token_counts[token] = 1

    # Then, assign each word to a numerical index. Filter words that occur less than min_freq times.
    lexicon = [token for token, count in token_counts.items() if count >= min_freq]
    # Indices start at 2. 0 is reserved for padding, and 1 for unknown words.
    lexicon = {token:idx + 2 for idx,token in enumerate(lexicon)}
    lexicon[u'<UNK>'] = 1 # Unknown words are those that occur fewer than min_freq times
    lexicon_size = len(lexicon)

    print("LEXICON SAMPLE ({} total items):".format(len(lexicon)))
    print(list(lexicon.items())[:20])
    
    return lexicon

lexicon = make_lexicon(token_seqs=train_reviews['Tokenized_Review'], min_freq=1)

with open('example_model/mlp_bow/lexicon.pkl', 'wb') as f: # Save the lexicon by pickling it
    pickle.dump(lexicon, f)

LEXICON SAMPLE (2630 total items):
[('this', 2), ('movie', 3), ('only', 4), ('gets', 5), ('a', 6), ('second', 7), ('star', 8), ('because', 9), ('i', 10), ('work', 11), ('downtown', 12), ('and', 13), ('liked', 14), ('seeing', 15), ('it', 16), ('destroyed', 17), ('.', 18), ('the', 19), ('effects', 20), ('were', 21)]
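
To make the effect of min_freq concrete, here is a minimal sketch on a tiny made-up corpus. The function toy_make_lexicon and the toy data are hypothetical stand-ins for illustration (it uses collections.Counter rather than the explicit counting loop above), but the indexing scheme is the same: 0 for padding, 1 for unknown words, 2 and up for real words.

```python
from collections import Counter

# Toy tokenized corpus (hypothetical data, not from the IMDB sample)
toy_token_seqs = [["great", "movie", "!"],
                  ["terrible", "movie", "!"],
                  ["great", "acting"]]

def toy_make_lexicon(token_seqs, min_freq=1):
    # Count every token, then keep those seen at least min_freq times
    counts = Counter(token for seq in token_seqs for token in seq)
    kept = [token for token, count in counts.items() if count >= min_freq]
    # Index 0 is reserved for padding and 1 for <UNK>, so real words start at 2
    lexicon = {token: idx + 2 for idx, token in enumerate(kept)}
    lexicon["<UNK>"] = 1
    return lexicon

full_lexicon = toy_make_lexicon(toy_token_seqs, min_freq=1)
filtered_lexicon = toy_make_lexicon(toy_token_seqs, min_freq=2)
print(len(full_lexicon), len(filtered_lexicon))  # 6 4
print(sorted(filtered_lexicon))  # ['!', '<UNK>', 'great', 'movie']
```

With min_freq=2, the words that occur only once ("terrible" and "acting") are dropped from the lexicon, so they would later be mapped to the &lt;UNK&gt; index.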


###  <font color='#6629b2'>From strings to numbers</font>

Once the lexicon is built, we can use it to transform each review from a list of string tokens into a list of numerical indices.

In [5]:
'''Convert each text from a list of tokens to a list of numbers (indices)'''

def tokens_to_idxs(token_seqs, lexicon):
    idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for token in token_seq]  
                                                                     for token_seq in token_seqs]
    return idx_seqs

train_reviews['Review_Idxs'] = tokens_to_idxs(token_seqs=train_reviews['Tokenized_Review'], 
                                              lexicon=lexicon)
                                   
train_reviews[['Tokenized_Review', 'Review_Idxs']][:10]

Unnamed: 0,Tokenized_Review,Review_Idxs
0,"[this, movie, only, gets, a, second, star, because, i, work, downtown, and, liked, seeing, it, destroyed, ., the, effects, were, pretty, good-, i, hear, it, was, the,...","[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 10, 24, 16, 25, 19, 26, 27, 28, 29, 30, 31, 18, 32, 19, 26, 27, 13, 33, 34, 35, 36, 1..."
1,"[as, i, watched, this, movie, ,, and, i, began, to, see, its, ', characters, develop, i, could, feel, this, would, be, an, excellent, picture, ., when, you, get, that...","[112, 10, 113, 2, 3, 51, 13, 10, 114, 74, 115, 116, 117, 118, 119, 10, 120, 121, 2, 122, 123, 124, 125, 126, 18, 127, 128, 58, 55, 129, 51, 13, 19, 3, 130, 131, 132, ..."
2,"[this, seemed, an, odd, combination, of, withnail, and, i, with, a, room, with, a, view, .., sometimes, it, worked, ,, other, times, it, did, not, ., tragedy, that, t...","[2, 169, 124, 170, 171, 39, 172, 13, 10, 173, 6, 174, 173, 6, 175, 176, 177, 16, 178, 51, 179, 180, 16, 93, 69, 18, 181, 55, 102, 182, 19, 183, 159, 19, 184, 185, 186..."
3,"[when, i, saw, the, exterminators, of, year, 3000, at, first, time, ,, i, had, no, expectations, for, that, movie, ., although, ,, it, was, n't, so, bad, as, i, was, ...","[127, 10, 200, 19, 201, 39, 202, 203, 166, 204, 73, 51, 10, 136, 205, 133, 159, 55, 3, 18, 206, 51, 16, 25, 44, 42, 89, 112, 10, 25, 207, 18, 16, 208, 209, 39, 210, 2..."
4,"[this, is, a, very, entertaining, flick, ,, considering, the, budget, and, its, length, ., the, storyline, is, hardly, ever, touched, on, in, the, movie, world, so, i...","[2, 77, 6, 137, 263, 266, 51, 267, 19, 268, 13, 116, 269, 18, 19, 270, 77, 271, 30, 272, 106, 215, 19, 3, 273, 42, 16, 228, 274, 6, 275, 39, 276, 18, 19, 64, 25, 277,..."
5,"["", trigger, man, "", is, definitely, the, most, boring, and, silliest, movie, i, 've, ever, seen, in, my, life, ., my, aunt, 's, holiday, videos, are, more, fascinati...","[287, 288, 289, 287, 77, 290, 19, 26, 291, 13, 292, 3, 10, 293, 30, 294, 215, 295, 76, 18, 295, 296, 208, 297, 298, 227, 299, 300, 18, 301, 19, 302, 303, 74, 123, 304..."
6,"[if, you, havn't, seen, this, movie, i, highly, recommend, you, do, ., it, 's, an, excellent, true, story, ., i, love, alison, lohman, she, is, so, talented, side, no...","[79, 128, 340, 294, 2, 3, 10, 341, 70, 128, 68, 18, 16, 208, 124, 125, 342, 221, 18, 10, 343, 344, 345, 346, 77, 42, 347, 348, 349, 237, 10, 228, 350, 351, 215, 352, ..."
7,"[i, went, to, see, fever, pitch, with, my, mom, ,, and, i, can, say, that, we, both, loved, it, ., it, was, n't, the, typical, romantic, comedy, where, someone, is, p...","[10, 360, 74, 115, 361, 362, 173, 295, 363, 51, 13, 10, 161, 162, 55, 364, 150, 350, 16, 18, 16, 25, 44, 19, 365, 366, 367, 91, 368, 77, 369, 159, 19, 179, 51, 13, 37..."
8,"[first, ever, viewing, :, july, 21, ,, 2008, , very, impressive, screenplay, and, comedic, acting, and, timing, in, this, film, ., now, 40, years, old, ,, it, has, l...","[204, 30, 397, 237, 398, 399, 51, 400, 301, 137, 401, 402, 13, 403, 64, 13, 404, 215, 2, 29, 18, 405, 406, 110, 407, 51, 16, 408, 409, 410, 39, 16, 208, 411, 18, 412,..."
9,"[weak, ,, fast, and, multicolor, ,, this, is, the, valvoline, 's, movie, in, fact, you, can, see, always, this, brand, of, oil, in, a, lot, of, scene, ., the, real, p...","[450, 51, 451, 13, 452, 51, 2, 77, 19, 453, 208, 3, 215, 454, 128, 161, 115, 455, 2, 456, 39, 457, 215, 6, 458, 39, 459, 18, 19, 378, 460, 227, 19, 461, 51, 450, 426,..."
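
The table above only shows words that are in the lexicon. As a small sketch of what happens to a word that is not, here is the same lookup on a made-up lexicon. The names toy_lexicon and toy_tokens_to_idxs are hypothetical; dict.get() is just a compact equivalent of the conditional expression used in tokens_to_idxs().

```python
# Hypothetical lexicon in the same format that make_lexicon() returns
toy_lexicon = {"<UNK>": 1, "great": 2, "movie": 3, "!": 4}

def toy_tokens_to_idxs(token_seq, lexicon):
    # dict.get() falls back to the <UNK> index for out-of-lexicon tokens
    return [lexicon.get(token, lexicon["<UNK>"]) for token in token_seq]

toy_idxs = toy_tokens_to_idxs(["great", "soundtrack", "!"], toy_lexicon)
print(toy_idxs)  # [2, 1, 4]
```

This matters most at test time, when reviews will contain words the model never saw during training.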


##  <font color='#6629b2'>Building a Multi-layer Perceptron</font>

Before I show how to build an RNN for this task, I'll demonstrate an even simpler model, a multilayer perceptron (MLP). Unlike an RNN, an MLP model is not a sequence model - it represents data as a flat matrix of features rather than a time-ordered sequence of features. For language data, this generally means that the word order of a sequence will not be explicitly encoded into a model. The importance of word order varies for different NLP tasks; in some cases, order-sensitive approaches do not necessarily perform better.

###  <font color='#6629b2'>Numerical lists to bag-of-words vectors</font>

The simplest and most common representation of a text in NLP is as a bag-of-words vector. A bag-of-words vector encodes a sequence as an array with a dimension for each word in the lexicon. The value for each dimension is the number of times the word corresponding to that dimension appears in the text. Thus a dataset of text sequences is encoded as a matrix where each row represents a sequence and each column represents a word whose value is the frequency of that word in the sequence. (It is also common to apply some weighting function to these values such as [tf-idf](https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html), but here we'll just use counts.)

In [6]:
'''Encode reviews as bag-of-words vectors'''

import numpy 

def idx_seqs_to_bows(idx_seqs, matrix_length):
    bow_seqs = numpy.array([numpy.bincount(numpy.array(idx_seq), minlength=matrix_length) 
                            for idx_seq in idx_seqs])
    return bow_seqs
    
bow_train_reviews = idx_seqs_to_bows(train_reviews['Review_Idxs'], 
                                     matrix_length=len(lexicon) + 1) #add one to length for padding
print("TRAIN INPUT:\n", bow_train_reviews)
print("SHAPE:", bow_train_reviews.shape, "\n")

#Show an example mapping string words to counts
lexicon_lookup = {idx: lexicon_item for lexicon_item, idx in lexicon.items()}
lexicon_lookup[0] = ""
pandas.DataFrame([(lexicon_lookup[idx], count) for idx, count in enumerate(bow_train_reviews[0])], 
                 columns=['Word', 'Count'])

TRAIN INPUT:
 [[0 0 4 ..., 0 0 0]
 [0 0 4 ..., 0 0 0]
 [0 0 1 ..., 0 0 0]
 ..., 
 [0 0 4 ..., 0 0 0]
 [0 0 1 ..., 0 0 0]
 [0 0 4 ..., 1 1 1]]
SHAPE: (100, 2631) 



Unnamed: 0,Word,Count
0,,0
1,<UNK>,0
2,this,4
3,movie,2
4,only,1
5,gets,1
6,a,3
7,second,2
8,star,1
9,because,1
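
Because the real matrix has thousands of columns, it can help to see the same numpy.bincount() call on a single short sequence. This is a toy sketch: the index sequence and the lexicon size are made up, but the call is the one used in idx_seqs_to_bows().

```python
import numpy

toy_idx_seq = [2, 3, 2, 4, 2]  # e.g. "great movie great ! great" under a toy lexicon
toy_lexicon_size = 4           # <UNK> plus three real words (hypothetical)

# One slot per index 0..4; position i holds how many times index i occurs
bow = numpy.bincount(numpy.array(toy_idx_seq), minlength=toy_lexicon_size + 1)
print(bow)  # [0 0 3 1 1]
```

The first two positions (padding and &lt;UNK&gt;) are zero here, and word order is gone: any reordering of the same five indices produces the same vector.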


###  <font color='#6629b2'>Keras Model</font>

To assemble the model, we'll use Keras' [Functional API](https://keras.io/getting-started/functional-api-guide/), which is one of two ways to use Keras to assemble models (the alternative is the [Sequential API](https://keras.io/getting-started/sequential-model-guide/), which is a bit simpler but has more constraints). A model consists of a series of layers. As shown in the code below, we initialize instances for each layer. Each layer can be called with another layer as input, e.g. Dense()(input_layer). A model instance is initialized with the Model() object, which defines the initial input and final output layers for that model. Before the model can be trained, the compile() function must be called with the loss function and optimization algorithm specified (see below).

###  <font color='#6629b2'>Layers</font>

We'll build an MLP with three layers:

**1. Input**: The input layer takes in the matrix of sequence vectors.

**2. Dense (sigmoid activation)**: A hidden [layer](https://keras.io/layers/core/#dense), which is what defines the model as a multi-layer perceptron. This layer transforms the input matrix by applying a nonlinear transformation function (here, the sigmoid function). Intuitively, this layer can be thought of as computing a "feature representation" of the input words matrix. 

**3. Dense (linear activation)**: An output layer that predicts the rating for the review based on its hidden representation given by the previous layer. This output is continuous (i.e. ranging from 1-10) rather than categorical, which means it has linear activation rather than nonlinear like the hidden layer (by default, activation='linear' for the Dense layer in Keras). The model gets feedback during training about what the actual ratings for the reviews should be.

The term "layer" is just an abstraction, when really all these layers are just matrices. The "weights" that connect the layers are also matrices. The process of training a neural network is a series of matrix multiplications. The weight matrices are the values that are adjusted during training in order for the model to learn to predict ratings. 
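
To illustrate the "layers are just matrices" point, here is a rough numpy sketch of the forward pass that the three layers perform. The sizes and random weights are made up for illustration, and this is not how Keras implements it internally; it only shows the shapes and operations involved.

```python
import numpy

rng = numpy.random.RandomState(0)
batch_size, n_input_nodes, n_hidden_nodes = 4, 6, 3  # toy sizes

# Fake bag-of-words counts standing in for the input layer
x = rng.randint(0, 3, size=(batch_size, n_input_nodes)).astype(float)

# Weight matrices and bias vectors (random here; training would adjust these)
w_hidden = rng.normal(size=(n_input_nodes, n_hidden_nodes))
b_hidden = numpy.zeros(n_hidden_nodes)
w_output = rng.normal(size=(n_hidden_nodes, 1))
b_output = numpy.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + numpy.exp(-z))

hidden = sigmoid(x.dot(w_hidden) + b_hidden)     # nonlinear hidden layer
predicted = hidden.dot(w_output) + b_output      # linear output layer
print(hidden.shape, predicted.shape)  # (4, 3) (4, 1)
```

Each row of predicted is the (untrained) rating for one review in the batch.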

###  <font color='#6629b2'>Parameters</font>

Our function for creating the model takes two parameters:

**n_input_nodes**: In the case of reviews encoded as bag-of-words vectors, this is the number of unique words in the lexicon, plus one to account for the padding represented by 0 values (which are only relevant for the RNN model, but this dimension can be included here without any cost to the model).

**n_hidden_nodes**: the number of dimensions in the hidden layer. This can be freely chosen; here, it is set to 500.

###  <font color='#6629b2'>Procedure</font>

The output of the model is a single continuous value (the predicted rating), making this a regression rather than a classification model. There is only one dimension in the output layer, which contains the predicted rating. All neural networks learn by updating the parameters (weights) to optimize an objective (loss) function. For this model, the objective is to minimize the mean squared error between the predicted ratings and the actual ratings for the training reviews, thus bringing the predicted ratings closer to the real ratings. The details of this process are extensive; see the resources at the bottom of the notebook if you want a deeper understanding. One huge benefit of Keras is that it implements many of these details for you. Not only does it already have implementations of the types of layer architectures, it also has many of the [loss functions](https://keras.io/losses/) and [optimization methods](https://keras.io/optimizers/) you need for training various models. The specific loss function and optimization method you use are specified when compiling the model with the model.compile() function.
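
As a small worked example of the objective, here is mean squared error computed by hand in plain Python. The ratings and predictions are made up; Keras computes this for you when loss="mean_squared_error" is given.

```python
def mean_squared_error(actual, predicted):
    # Average of the squared differences between actual and predicted values
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

actual_ratings = [10, 1, 4, 2]
close_predictions = [9.0, 2.0, 4.0, 3.0]
far_predictions = [5.0, 5.0, 5.0, 5.0]

mse_close = mean_squared_error(actual_ratings, close_predictions)
mse_far = mean_squared_error(actual_ratings, far_predictions)
print(mse_close)  # 0.75
print(mse_far)    # 12.75
```

Predictions that are closer to the actual ratings give a lower loss, which is the direction training pushes the weights.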

In [8]:
'''Create the Multi-layer Perceptron model'''

from keras.models import Model
from keras.layers import Input, Dense

def create_mlp_model(n_input_nodes, n_hidden_nodes):
    
    # Layer 1 - Technically the shape of this layer is (batch_size, n_input_nodes).
    # The batch size is implicitly included in the shape of the input, so it does not need to 
    # be specified as a dimension of the input.
    input_layer = Input(shape=(n_input_nodes,))
    #Shape = (batch_size, n_input_nodes)
    
    #Layer 2
    hidden_layer = Dense(units=n_hidden_nodes, activation='sigmoid')(input_layer)
    #Output shape = (batch_size, n_hidden_nodes)
    
    #Layer 3
    output_layer = Dense(units=1)(hidden_layer)
    #Output shape = (batch_size, 1)
    
    #Specify which layers are input and output, compile model with loss and optimization functions
    model = Model(inputs=[input_layer], outputs=output_layer)
    model.compile(loss="mean_squared_error", optimizer='adam')
    
    return model

Using Theano backend.


In [9]:
mlp_bow_model = create_mlp_model(n_input_nodes=len(lexicon) + 1, n_hidden_nodes=500)
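
To get a feel for the size of this model, here is a quick count of its trainable weights. It assumes each Dense layer has one weight per input-output connection plus one bias per output node (Keras Dense layers include a bias by default). The helper count_mlp_parameters is hypothetical, and 2631 is the input size shown in the bag-of-words matrix shape above.

```python
def count_mlp_parameters(n_input_nodes, n_hidden_nodes, n_output_nodes=1):
    # Hidden layer: weight matrix (inputs x hidden) plus one bias per hidden node
    hidden = n_input_nodes * n_hidden_nodes + n_hidden_nodes
    # Output layer: weight matrix (hidden x outputs) plus one bias per output node
    output = n_hidden_nodes * n_output_nodes + n_output_nodes
    return hidden + output

n_params = count_mlp_parameters(n_input_nodes=2631, n_hidden_nodes=500)
print(n_params)  # 1316501
```

You can compare this figure against the totals reported by calling mlp_bow_model.summary().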

###  <font color='#6629b2'>Training</font>

Now we can train an MLP model on the training reviews encoded as a bag-of-words matrix. Keras trains in batches: the weights are updated after each batch of reviews rather than after each individual review. The batch size is not part of the model definition; it is passed to the fit() function (here, 20), and if it isn't given, Keras will use its default (32). The training function also indicates the number of times to iterate through the training data (epochs). Keras reports the mean squared error loss after each epoch - if the model is learning correctly, it should progressively decrease.
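
As a quick bit of arithmetic on what these settings mean, the sketch below counts the weight updates implied by the values used in this section (100 training reviews, batch_size=20, epochs=100). It assumes one update per batch, with a final smaller batch if the data does not divide evenly.

```python
import math

n_reviews, batch_size, epochs = 100, 20, 100

# One weight update per batch; round up in case the last batch is smaller
updates_per_epoch = int(math.ceil(n_reviews / float(batch_size)))
total_updates = updates_per_epoch * epochs
print(updates_per_epoch, total_updates)  # 5 500
```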

In [11]:
'''Train the MLP model with bag-of-words representation'''

mlp_bow_model.fit(x=bow_train_reviews, y=train_reviews['Rating'], batch_size=20, epochs=100)
mlp_bow_model.save('example_model/mlp_bow/model.h5') #save model

Epoch 1/100
Epoch 2/100
Epoch 3/100
...

### <font color='#6629b2'>Predicting ratings for reviews</font>

Once the model is trained, we can use it to predict the ratings for the reviews in the test set. To demonstrate this, I'll load a saved model previously trained (for 25 epochs) on all 25,000 reviews in the training set. I'll apply this model to an example test set of 100 reviews (again, this is a tiny subset of the 25,000 reviews in the full test set provided at the above link).

In [12]:
'''Load saved model'''

# Load lexicon
with open('pretrained_model/mlp_bow/lexicon.pkl', 'rb') as f:
    mlp_bow_lexicon = pickle.load(f)

# Load MLP BOW model
from keras.models import load_model
mlp_bow_model = load_model('pretrained_model/mlp_bow/model.h5')

In [13]:
'''Load the test dataset, tokenize, and transform to numerical indices'''

test_reviews = pandas.read_csv('dataset/example_test_imdb_reviews.csv', encoding='utf-8')
test_reviews['Tokenized_Review'] = text_to_tokens(test_reviews['Review'])
test_reviews['Review_Idxs'] = tokens_to_idxs(token_seqs=test_reviews['Tokenized_Review'],
                                             lexicon=mlp_bow_lexicon)

In [14]:
'''Transform test reviews to a bag-of-words matrix'''

bow_test_reviews = idx_seqs_to_bows(test_reviews['Review_Idxs'], 
                                    matrix_length=len(mlp_bow_lexicon) + 1) #add one to length for padding

print("TEST INPUT:\n", bow_test_reviews)
print("SHAPE:", bow_test_reviews.shape, "\n")

TEST INPUT:
 [[ 0  3  0 ...,  0  0  0]
 [ 0  4  0 ...,  0  0  0]
 [ 0  1  0 ...,  0  0  0]
 ..., 
 [ 0 21  2 ...,  0  0  0]
 [ 0  5  2 ...,  0  0  0]
 [ 0  8  0 ...,  0  0  0]]
SHAPE: (100, 13409) 



Then we can call the predict() function on the test reviews to get the predicted ratings.

In [15]:
'''Show predicted ratings for test reviews alongside actual ratings'''

#Since ratings are integers, need to round predicted rating to nearest integer
test_reviews['MLP_BOW_Pred_Rating'] = numpy.round(mlp_bow_model.predict(bow_test_reviews)[:,0]).astype(int)
test_reviews[['Review', 'Rating', 'MLP_BOW_Pred_Rating']]

Unnamed: 0,Review,Rating,MLP_BOW_Pred_Rating
0,"First of all i'd like to say that this movie is the greatest thing that ever happened to mankind. It is the best out of all the excellent Muppet movies, and every oth...",10,10
1,"Terrible writing, highly contrived, from a ""do-gooder"" who knows absolutely nothing about race relations in L.A., or the USA in the present day. The gushing positive ...",1,2
2,"I didn't expect too much from this movie, but I was still disappointed. It's supposed to be a comedy, but there are only four or five scenes where I actually laughed,...",4,1
3,"Corey Haim is never going to be known as one of the great actors of his time, but at least in movies like ""Licensed To Drive"", he was more in his element... lowbrow h...",2,3
4,"Being a great fan of Disney, i was really disappointed when i watched this garbage.The animation was pretty,and the backgrounds were amazing,but i believe that good a...",3,3
5,Barbara Payton is the suppose-to-be sultry sexy young hot Chickie wife of the geezer plantation owner somewhere in a jungley back lot set at a cheap studio in Hollywo...,1,0
6,Three distinct and distant individuals' lives intersect with the brutal killing of one by another. The one-hour film only reveals the event that brings the three indi...,8,5
7,I never dreamed when I started watching this DVD that I would be totally mesmerized by it within minutes. The story was completely absorbing and entertaining. The act...,10,9
8,Forget Jimmy Stewart reliving his life and opt for this smart comedy of errors instead. I suppose only institutionalized sexism explains why this flick and Stanwyck's...,8,7
9,Have you ever wondered what its like to feel FREE? I am sure that each one of us know the meaning of freedom and never seriously think of using it to our advantage. H...,8,12
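
Notice that some predicted ratings in the table (0 and 12) fall outside the 1-10 scale, since a linear output layer is not bounded. One optional post-processing step, which is not part of the notebook's pipeline, is to clip the rounded predictions to the valid range. A minimal sketch with made-up model outputs:

```python
import numpy

raw_predictions = numpy.array([-0.3, 2.2, 7.6, 11.8])  # made-up model outputs

# Round to the nearest integer, then force values into the 1-10 rating scale
rounded = numpy.round(raw_predictions).astype(int)
clipped = numpy.clip(rounded, 1, 10)
print(rounded.tolist())  # [0, 2, 8, 12]
print(clipped.tolist())  # [1, 2, 8, 10]
```

Whether clipping is appropriate depends on the task; here it would only ever move a prediction closer to a valid rating.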


###  <font color='#6629b2'>Evaluation</font>

A common evaluation for regression models like this one is $R^2$, called the coefficient of determination. This metric indicates the proportion of variance in the output variable (the rating) that is predictable from the input variable (the review text). The best possible score is 1.0, which indicates the model always predicts the correct rating. The scikit-learn library provides several [evaluation metrics](http://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) for machine learning models, including $R^2$.

In [16]:
'''Evaluate the model with R^2'''

from sklearn.metrics import r2_score

r2 = r2_score(y_true=test_reviews['Rating'], y_pred=test_reviews['MLP_BOW_Pred_Rating'])
print("COEFFICIENT OF DETERMINATION (R2): {:3f}".format(r2))

COEFFICIENT OF DETERMINATION (R2): 0.438647


On the full test dataset of 25,000 reviews, the $R^2$ for this model is 0.545692.
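
For intuition about what this score measures, here is $R^2$ computed by hand as one minus the ratio of the residual sum of squares to the total sum of squares. The ratings are toy values, and the function r_squared is a hypothetical stand-in for illustration, not a replacement for r2_score().

```python
def r_squared(actual, predicted):
    mean_actual = sum(actual) / float(len(actual))
    # Squared error of the model's predictions
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Squared error of always predicting the mean rating
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_residual / ss_total

actual_ratings = [2, 4, 6, 8]
r2_perfect = r_squared(actual_ratings, [2, 4, 6, 8])
r2_mean = r_squared(actual_ratings, [5, 5, 5, 5])
r2_close = r_squared(actual_ratings, [3, 4, 6, 7])
print(r2_perfect)  # 1.0
print(r2_mean)     # 0.0
print(r2_close)    # 0.9
```

A model that always predicts the mean rating scores 0.0, and a model that does worse than that gets a negative score.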

###  <font color='#6629b2'>Alternative input to MLP: continuous bag-of-words vectors</font>

An alternative to the traditional bag-of-words representation is to encode sequences as a combination of their individual word embeddings. A word embedding is an n-dimensional vector of real values that together are intended to encode the "meaning" of a word. Word embedding models explicitly learn to represent words by trying to correctly predict other words that appear in the same context (or alternatively, trying to predict a word based on the context words). The result of these models is a set of embedding vectors such that words with similar meanings (should) end up having similar vectors.

####  <font color='#6629b2'>spaCy word embeddings</font>

The spaCy library provides [GloVe embeddings](https://spacy.io/usage/vectors-similarity) for each word, which can be accessed simply with word.vector after loading the text into spaCy. There are 300 dimensions in these embeddings.

In [17]:
emb_vector = encoder("creepy").vector
print(emb_vector)
print("SHAPE:", emb_vector.shape)

[ -2.44230002e-01  -5.93640029e-01  -5.52590013e-01  -9.91759971e-02
  -1.22070000e-01   2.27109998e-01   2.10749999e-01  -6.90429986e-01
   2.79650003e-01   1.40419996e+00   3.32180001e-02  -2.13149995e-01
  -3.46340001e-01  -2.74949998e-01   3.78939998e-03  -1.39689997e-01
   3.55099998e-02   1.69880003e-01  -5.39589999e-03  -3.28079998e-01
   6.06500030e-01   4.51980010e-02  -2.30459999e-02  -1.33680001e-01
  -4.16900009e-01  -3.26429993e-01   4.91929986e-02  -1.55489996e-01
   1.17660001e-01  -3.91609997e-01   2.16220006e-01  -1.48680001e-01
  -4.75100011e-01   1.84489995e-01   2.08820000e-01   1.67030007e-01
  -6.95450008e-02   6.93149984e-01  -5.08109987e-01   1.00450002e-01
   2.43379995e-01  -4.16370004e-01   8.56589973e-02   2.23639999e-02
   7.76250005e-01  -1.01549998e-02  -2.67170012e-01  -2.18829997e-02
  -1.12340006e-03  -1.35230003e-02   1.57180000e-02  -2.46270001e-02
   3.31739992e-01   5.85189983e-02   3.95569988e-02   2.00330004e-01
  -1.94079995e-01  -1.07039995e-02

spaCy also has a built-in similarity function that returns the cosine similarity between the GloVe vectors for two words. For example, the vector for "creepy" is more similar to that of "scary" than "nice", as expected. See the link to the spaCy documentation for other functions that operate on the vectors.
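
For reference, cosine similarity itself is simple to compute. The sketch below uses made-up 3-dimensional vectors in place of the real 300-dimensional embeddings, so the numbers are purely illustrative; the function name cosine_similarity is my own and is not part of spaCy.

```python
import numpy

def cosine_similarity(vec_a, vec_b):
    '''Cosine of the angle between two vectors: dot product divided by the product of their lengths'''
    return numpy.dot(vec_a, vec_b) / (numpy.linalg.norm(vec_a) * numpy.linalg.norm(vec_b))

# Made-up 3-dimensional vectors standing in for real 300-dimensional embeddings
toy_creepy = numpy.array([0.9, 0.1, -0.3])
toy_scary = numpy.array([0.8, 0.2, -0.4])
toy_nice = numpy.array([-0.5, 0.7, 0.6])

sim_scary = cosine_similarity(toy_creepy, toy_scary)
sim_nice = cosine_similarity(toy_creepy, toy_nice)
print("creepy vs. scary:", sim_scary)
print("creepy vs. nice:", sim_nice)
```

Values range from -1 (opposite directions) to 1 (same direction), so vectors that point in similar directions get a higher score regardless of their length.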

In [21]:
print(encoder("creepy").similarity(encoder("scary")))
print(encoder("creepy").similarity(encoder("nice")))

####  <font color='#6629b2'>Combining embeddings</font>

We can use the embeddings as an alternative to the simple bag-of-words input to the model, by averaging the embeddings for all words in the review across each corresponding dimension (you could also sum them). So instead of having an input matrix with a column for each word in the lexicon, each column represents a word embedding dimension. This is referred to as a continuous bag-of-words vector. The advantage of this representation over the standard bag-of-words representation is that it more explicitly represents the meaning of the words in the review. For example, two reviews may express similar content (and have similar ratings) but may vary in the exact words they use, so their similarity will be represented in the continuous bag-of-words vectors but less so in the standard vectors. The model can more readily learn that these reviews should receive similar ratings.

In [22]:
'''First encode reviews as sequences of word embeddings'''

def text_to_emb_seqs(seqs):
    emb_seqs = [numpy.array([word.vector for word in encoder(seq)]) for seq in seqs]
    return emb_seqs
    
emb_train_reviews = text_to_emb_seqs(train_reviews['Review'])

#Example of word embedding sequence for first review
pandas.DataFrame(list(zip(train_reviews['Tokenized_Review'][0], emb_train_reviews[0])),
                columns=['Word', 'Embedding'])

Unnamed: 0,Word,Embedding
0,this,"[-0.087595, 0.35502, 0.063868, 0.29292, -0.23635, -0.062773, -0.16105, -0.22842, 0.041587, 2.4844, -0.38217, 0.032806, 0.12348, -0.0018422, -0.13848, -0.0010005, -0.0..."
1,movie,"[0.2071, -0.47656, 0.15479, -0.38965, 0.48447, 0.59815, -0.060361, -0.66422, 0.53934, 1.8491, -0.30595, 0.35849, 0.4876, -0.17715, -0.15448, -0.016732, 0.49752, 0.607..."
2,only,"[-0.12253, 0.18693, 0.048162, -0.054006, 0.14699, -0.26139, -0.014913, -0.11215, 0.19526, 2.4849, 0.40386, 0.25374, -0.093879, -0.25714, 0.099929, 0.09112, -0.12029, ..."
3,gets,"[-0.65521, 0.19128, 0.047891, -0.061405, -0.25688, -0.14779, 0.068909, -0.14565, 0.089657, 2.5228, 0.19451, -0.277, -0.45222, -0.36297, -0.69617, 0.21397, 0.12312, 0...."
4,a,"[0.043798, 0.024779, -0.20937, 0.49745, 0.36019, -0.37503, -0.052078, -0.60555, 0.036744, 2.2085, -0.23389, -0.06836, -0.22355, -0.053989, -0.15198, -0.17319, 0.05335..."
5,second,"[0.19572, 0.39581, -0.05169, 0.42562, 0.41168, -0.23053, 0.038026, -0.13892, 0.17919, 2.4403, -0.027461, 0.16545, 0.53568, 0.048475, 0.23609, -0.26697, 0.080609, 1.13..."
6,star,"[0.44581, 0.2305, 0.077434, -0.10145, 0.49246, -0.55001, -0.0208, -0.24442, -0.052061, 2.0611, -0.037488, -0.61058, -0.17151, -0.22025, -0.085697, -0.16796, 0.22962, ..."
7,because,"[-0.20476, 0.19932, -0.39701, -0.12486, 0.031775, 0.12264, 0.065732, -0.11796, -0.14564, 3.0679, -0.077553, 0.082201, 0.12264, 0.19981, -0.35298, -0.25108, -0.073947,..."
8,i,"[0.18733, 0.40595, -0.51174, -0.55482, 0.039716, 0.12887, 0.45137, -0.59149, 0.15591, 1.5137, -0.8702, 0.050672, 0.15211, -0.19183, 0.11181, 0.12131, -0.27212, 1.6203..."
9,work,"[-3.0251e-05, 0.084473, -0.12865, -0.30777, -0.28069, 0.35496, -0.14539, -0.24887, 0.024285, 2.7458, -0.23302, 0.28894, -0.091569, -0.084646, 0.052142, 0.081376, -0.1..."


In [23]:
'''Encode reviews as continuous bag-of-words (mean of word embeddings)'''

def emb_seqs_to_cont_bows(emb_seqs):
    cont_bow_seqs =  numpy.array([numpy.mean(emb_seq, axis=0) for emb_seq in emb_seqs])
    return cont_bow_seqs

cont_bow_train_reviews = emb_seqs_to_cont_bows(emb_train_reviews)

print("TRAIN INPUT:\n", cont_bow_train_reviews)
print("SHAPE:", cont_bow_train_reviews.shape, "\n")

TRAIN INPUT:
 [[-0.03669927  0.17467687 -0.15197457 ..., -0.03327936  0.03567346
   0.11649708]
 [-0.00929734  0.14320128 -0.14427485 ..., -0.06805342  0.02270603
   0.09228051]
 [-0.03707573  0.1152564  -0.10762771 ..., -0.07170004 -0.00770159
   0.11234298]
 ..., 
 [ 0.00960823  0.16496091 -0.12408715 ..., -0.01148788 -0.02245397
   0.10386018]
 [-0.01765456  0.18093151 -0.18793055 ..., -0.0436646   0.0093438
   0.13540518]
 [-0.02792743  0.15883775 -0.12517592 ..., -0.02544595  0.01546408
   0.07832429]]
SHAPE: (100, 300) 
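
One edge case worth noting: if a review were an empty string, it would yield no embeddings, and numpy.mean over an empty sequence produces nan rather than a 300-dimensional vector. The sketch below is a hedged variant of the function above that falls back to a zero vector in that case. The name emb_seqs_to_cont_bows_safe and the n_dims parameter are hypothetical additions, and the toy input uses 4 dimensions instead of 300.

```python
import numpy

def emb_seqs_to_cont_bows_safe(emb_seqs, n_dims=300):
    '''Average each embedding sequence, falling back to a zero vector for empty sequences'''
    cont_bows = []
    for emb_seq in emb_seqs:
        emb_seq = numpy.asarray(emb_seq, dtype=float)
        if emb_seq.size == 0: # no tokens, so there is nothing to average
            cont_bows.append(numpy.zeros(n_dims))
        else:
            cont_bows.append(emb_seq.mean(axis=0))
    return numpy.array(cont_bows)

# Toy input with 4 dimensions: a two-word review and an empty review
toy_emb_seqs = [numpy.array([[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0]]), numpy.array([])]
toy_cont_bows = emb_seqs_to_cont_bows_safe(toy_emb_seqs, n_dims=4)
print(toy_cont_bows)
```

Whether a zero vector is the right fallback depends on the task; dropping empty reviews before encoding is another reasonable choice.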



####  <font color='#6629b2'>Continuous bag-of-words MLP</font>

Now we can train the same MLP model to predict ratings from the reviews encoded as continuous bag-of-words vectors. The only difference in the parameters of this model compared to the previous model is that n_input_nodes is equal to the number of embedding dimensions instead of the number of words in the lexicon.

In [24]:
mlp_cont_bow_model = create_mlp_model(n_input_nodes=cont_bow_train_reviews.shape[-1], n_hidden_nodes=500)

####  <font color='#6629b2'>Training</font>

In [26]:
'''Train the model'''

mlp_cont_bow_model.fit(x=cont_bow_train_reviews, y=train_reviews['Rating'], batch_size=20, epochs=5)
mlp_cont_bow_model.save('example_model/mlp_cont_bow/model.h5') #save model

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


####  <font color='#6629b2'>Prediction</font>

Again, I'll load a version of this same model that I previously trained on all 25,000 reviews in the training set and apply it to the example test set of 100 reviews.

In [27]:
'''Load saved model'''

mlp_cont_bow_model = load_model('pretrained_model/mlp_cont_bow/model.h5')

In [28]:
'''Transform test reviews to a continuous bag-of-words matrix'''

cont_bow_test_reviews = emb_seqs_to_cont_bows(text_to_emb_seqs(test_reviews['Review']))

print("TEST INPUT:\n", cont_bow_test_reviews)
print("SHAPE:", cont_bow_test_reviews.shape, "\n")

TEST INPUT:
 [[-0.014648    0.11290316 -0.15222767 ..., -0.07238988  0.03650679
   0.10832701]
 [-0.03069711  0.17876795 -0.16690718 ..., -0.07472198  0.04088686
   0.05915005]
 [-0.044769    0.17864974 -0.16780178 ..., -0.07106451  0.03871405
   0.07376453]
 ..., 
 [-0.02858565  0.13633321 -0.11530332 ..., -0.07878534 -0.00065763
   0.06962511]
 [-0.0039024   0.16644047 -0.12762967 ..., -0.02781611 -0.01023514
   0.08669586]
 [-0.04171144  0.1474475  -0.10443845 ..., -0.07805648 -0.01242174
   0.07046781]]
SHAPE: (100, 300) 



In [29]:
'''Show ratings predicted by this model alongside previous model and actual ratings'''

#Since ratings are integers, need to round predicted rating to nearest integer
test_reviews['MLP_Cont_BOW_Pred_Rating'] = numpy.round(mlp_cont_bow_model.predict(cont_bow_test_reviews)[:,0]).astype(int)
test_reviews[['Review', 'Rating', 'MLP_BOW_Pred_Rating', 'MLP_Cont_BOW_Pred_Rating']]

Unnamed: 0,Review,Rating,MLP_BOW_Pred_Rating,MLP_Cont_BOW_Pred_Rating
0,"First of all i'd like to say that this movie is the greatest thing that ever happened to mankind. It is the best out of all the excellent Muppet movies, and every oth...",10,10,7
1,"Terrible writing, highly contrived, from a ""do-gooder"" who knows absolutely nothing about race relations in L.A., or the USA in the present day. The gushing positive ...",1,2,4
2,"I didn't expect too much from this movie, but I was still disappointed. It's supposed to be a comedy, but there are only four or five scenes where I actually laughed,...",4,1,1
3,"Corey Haim is never going to be known as one of the great actors of his time, but at least in movies like ""Licensed To Drive"", he was more in his element... lowbrow h...",2,3,5
4,"Being a great fan of Disney, i was really disappointed when i watched this garbage.The animation was pretty,and the backgrounds were amazing,but i believe that good a...",3,3,1
5,Barbara Payton is the suppose-to-be sultry sexy young hot Chickie wife of the geezer plantation owner somewhere in a jungley back lot set at a cheap studio in Hollywo...,1,0,2
6,Three distinct and distant individuals' lives intersect with the brutal killing of one by another. The one-hour film only reveals the event that brings the three indi...,8,5,5
7,I never dreamed when I started watching this DVD that I would be totally mesmerized by it within minutes. The story was completely absorbing and entertaining. The act...,10,9,6
8,Forget Jimmy Stewart reliving his life and opt for this smart comedy of errors instead. I suppose only institutionalized sexism explains why this flick and Stanwyck's...,8,7,8
9,Have you ever wondered what its like to feel FREE? I am sure that each one of us know the meaning of freedom and never seriously think of using it to our advantage. H...,8,12,11
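
Note that some predictions in the table (such as 0, 11, and 12) fall outside the range of ratings a reviewer can give, because the regression output is unbounded. One option, which is not applied anywhere in this notebook, is to clip the rounded predictions to the valid range. The sketch below assumes ratings lie on a 1-10 scale; the function name round_and_clip and the raw predictions are made up for illustration.

```python
import numpy

def round_and_clip(pred_ratings, min_rating=1, max_rating=10):
    '''Round continuous predictions to integers and force them into the valid rating range'''
    rounded = numpy.round(pred_ratings).astype(int)
    return numpy.clip(rounded, min_rating, max_rating)

# Made-up raw model outputs, including values that fall outside the 1-10 scale
toy_raw_preds = numpy.array([9.7, 0.2, 11.6, 4.4, -0.8])
toy_clipped = round_and_clip(toy_raw_preds)
print(toy_clipped)
```

Clipping would change the evaluation scores, so the $R^2$ values reported in this notebook correspond to the unclipped predictions.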


####  <font color='#6629b2'>Evaluation</font>

In [30]:
'''Evaluate the model with R^2'''

r2 = r2_score(y_true=test_reviews['Rating'], y_pred=test_reviews['MLP_Cont_BOW_Pred_Rating'])
print("COEFFICIENT OF DETERMINATION (R2): {:3f}".format(r2))

COEFFICIENT OF DETERMINATION (R2): 0.505674


On the full test dataset of 25,000 reviews, the $R^2$ for this model is 0.494190. So it turns out this model overall does not actually do better at predicting ratings than the standard bag-of-words model.

##  <font color='#6629b2'>Building a Recurrent Neural Network </font>

Now I'll show how this same task can be modeled with an RNN, which processes text sequentially.

###  <font color='#6629b2'>Numerical lists to matrices</font>

The input representation for the RNN is different from the MLP because it explicitly encodes the order of words in the review. We'll return to the lists of word indices contained in train_reviews['Review_Idxs']. The input to the model will be these number sequences themselves. We need to put all the reviews in the training set into a single matrix, where each row is a review and each column is a word position in that sequence. This enables the model to process multiple sequences in parallel (batches) as opposed to one at a time. Using batches significantly speeds up training. However, each review has a different number of words, so we create a padded matrix whose width equals the length of the longest review in the training set. For all reviews with fewer words, we prepend the row with zeros representing empty word positions. We can tell Keras to ignore these zeros during training.

In [31]:
'''Create a padded matrix of input reviews'''

from keras.preprocessing.sequence import pad_sequences

def pad_idx_seqs(idx_seqs):
    max_seq_len = max([len(idx_seq) for idx_seq in idx_seqs]) # Get length of longest sequence
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len) # Keras provides a convenient padding function
    return padded_idxs

train_padded_idxs = pad_idx_seqs(train_reviews['Review_Idxs'])

print("TRAIN INPUT:\n", train_padded_idxs)
print("SHAPE:", train_padded_idxs.shape, "\n")

TRAIN INPUT:
 [[   0    0    0 ...,  110  111   97]
 [   0    0    0 ...,   69  168   18]
 [   0    0    0 ...,  199   29  176]
 ..., 
 [   0    0    0 ..., 2599 2337   18]
 [   0    0    0 ...,   18  301 2609]
 [   0    0    0 ...,   73  572   18]]
SHAPE: (100, 189) 
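
To show what the padding step does, here is a minimal numpy sketch that left-pads toy index lists with zeros, mirroring the default behavior of pad_sequences (zeros are prepended, and sequences longer than the maximum keep their last indices). The function name pre_pad and the toy indices are hypothetical.

```python
import numpy

def pre_pad(idx_seqs, max_seq_len=None):
    '''Left-pad lists of word indices with zeros so they form one matrix'''
    if max_seq_len is None:
        max_seq_len = max(len(idx_seq) for idx_seq in idx_seqs)
    padded = numpy.zeros((len(idx_seqs), max_seq_len), dtype=int)
    for row, idx_seq in enumerate(idx_seqs):
        if len(idx_seq) > 0:
            idx_seq = idx_seq[-max_seq_len:] # keep the last max_seq_len indices if too long
            padded[row, -len(idx_seq):] = idx_seq
    return padded

toy_idx_seqs = [[5, 8, 2], [7], [3, 9, 4, 6, 1]]
toy_padded = pre_pad(toy_idx_seqs)
print(toy_padded)
```

Because the zeros come first, the last column of every row holds the final word of the review, which is where the RNN finishes reading.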



###  <font color='#6629b2'>Model Layers</font>

We'll use the same scheme as before (the Functional API) to assemble the RNN. The RNN will have four layers:

**1. Input**: The input layer takes in the matrix of word indices.

**2. Embedding**: A [layer](https://keras.io/layers/embeddings/) that converts integer word indices into distributed vector representations (embeddings), which were introduced above. The difference here is that rather than plugging in embeddings from a pretrained model as before, the word embeddings will be learned inside the model itself. Thus, the input to the model will be the word indices rather than their embeddings, and the embeddings will change as the model is trained. The mask_zero=True parameter in this layer indicates that values of 0 in the matrix (the padding) will be ignored by the model.

**3. GRU**: A [recurrent (GRU) hidden layer](https://keras.io/layers/recurrent/), the central component of the model. As it observes each word in the review, it integrates the word embedding representation with what it has observed so far to compute a representation (hidden state) of the review at that timepoint. There are a few architectures for this layer. I use the GRU variation, but Keras also provides LSTM and the simple vanilla recurrent layer (see the materials at the bottom for an explanation of the difference). This layer outputs the last hidden state of the sequence (i.e. the hidden representation of the review after its last word is observed).

**4. Dense**: An output [layer](https://keras.io/layers/core/#dense) that predicts the rating for the review based on its GRU representation given by the previous layer. This is the same output layer used in the MLP, so it has one dimension that contains a continuous value (the rating).
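
The four layers above can be sketched in plain numpy to show how the data flows through them. This is a simplified illustration with randomly initialized, untrained toy parameters, not the Keras implementation: the GRU equations follow a standard formulation, and details such as gate ordering and bias handling may differ inside Keras. All names here (toy_rnn_forward, toy_params, and so on) are hypothetical.

```python
import numpy

def sigmoid(x):
    return 1.0 / (1.0 + numpy.exp(-x))

def toy_rnn_forward(padded_idxs, params):
    '''Forward pass: embedding lookup -> GRU over time -> dense output (one rating per review)'''
    emb, W, U, b, w_out, b_out = params
    n_hidden = U.shape[-1]
    outputs = []
    for idx_seq in padded_idxs:
        h = numpy.zeros(n_hidden) # hidden state starts at zero
        for idx in idx_seq:
            if idx == 0: # padding: skip this position, which is the effect of masking
                continue
            x = emb[idx] # embedding layer = row lookup by word index
            z = sigmoid(x @ W[0] + h @ U[0] + b[0]) # update gate
            r = sigmoid(x @ W[1] + h @ U[1] + b[1]) # reset gate
            h_cand = numpy.tanh(x @ W[2] + (r * h) @ U[2] + b[2]) # candidate state
            h = z * h + (1.0 - z) * h_cand # blend old state and candidate
        outputs.append(h @ w_out + b_out) # dense layer with one unit, using the last hidden state
    return numpy.array(outputs)

# Randomly initialized toy parameters (untrained), with tiny sizes for readability
rng = numpy.random.RandomState(0)
n_words, n_emb, n_hid = 10, 4, 3
toy_params = (rng.normal(size=(n_words, n_emb)), # embedding matrix, one row per word index
              rng.normal(size=(3, n_emb, n_hid)), # input-to-hidden weights (update, reset, candidate)
              rng.normal(size=(3, n_hid, n_hid)), # hidden-to-hidden weights
              numpy.zeros((3, n_hid)), # biases
              rng.normal(size=n_hid), 0.0) # output weights and bias

toy_batch = numpy.array([[0, 0, 5, 8, 2], [0, 0, 0, 0, 7]])
toy_ratings = toy_rnn_forward(toy_batch, toy_params)
print(toy_ratings.shape)
```

Since padded positions are skipped, a review produces the same output no matter how many zeros are prepended to it.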

###  <font color='#6629b2'>Parameters</font>

Our function for creating the RNN takes the following parameters:

**n_input_nodes**: As with the standard bag-of-words MLP, this is the number of unique words in the lexicon, plus one to account for the padding represented by 0 values. This indicates the number of rows in the embedding layer, where each row corresponds to a word.

**n_embedding_nodes**: the number of dimensions (units) in the embedding layer, which can be freely defined. Here, it is set to 300.

**n_hidden_nodes**: the number of dimensions in the GRU hidden layer. Like the embedding layer, this can be freely chosen. Here, it is set to 500.
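
These three parameters determine how many weights the model has to learn. The sketch below gives a rough count per layer, assuming a GRU with three weight sets (update gate, reset gate, candidate state) and one bias vector each. The lexicon size of 20,000 is a hypothetical number chosen for illustration, and the function name count_rnn_params is my own.

```python
def count_rnn_params(n_input_nodes, n_embedding_nodes, n_hidden_nodes):
    '''Rough count of trainable weights in each layer of the RNN described above'''
    embedding = n_input_nodes * n_embedding_nodes # one row per word index
    # GRU: three weight sets (update gate, reset gate, candidate state), each with
    # input-to-hidden weights, hidden-to-hidden weights, and a bias vector
    gru = 3 * (n_embedding_nodes * n_hidden_nodes + n_hidden_nodes * n_hidden_nodes + n_hidden_nodes)
    dense = n_hidden_nodes + 1 # one weight per hidden unit, plus a bias
    return {'embedding': embedding, 'gru': gru, 'dense': dense}

# 20,000 is a hypothetical lexicon size; 300 and 500 are the values used in this notebook
toy_counts = count_rnn_params(n_input_nodes=20000 + 1, n_embedding_nodes=300, n_hidden_nodes=500)
print(toy_counts)
```

Some Keras versions add an extra bias term to the GRU, so the counts reported by model.summary() may be slightly higher; that output is the authoritative check for a given installation.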

In [33]:
'''Create the model'''

from keras.layers import Embedding, GRU

def create_rnn_model(n_input_nodes, n_embedding_nodes, n_hidden_nodes):
    
    # Layer 1 - Technically the shape of this layer is (batch_size, train_padded_idxs.shape[1]),
    # i.e. (number of reviews in a batch, number of word positions in each padded review).
    # The batch size is implicitly included in the shape of the input, so it does not need to
    # be specified as a dimension of the input. None can be given as a placeholder for the sequence length.
    # By defining it as None, the model is flexible in accepting inputs with different lengths.
    input_layer = Input(shape=(None,))
    
    # Layer 2
    embedding_layer = Embedding(input_dim=n_input_nodes,
                                output_dim=n_embedding_nodes,
                                mask_zero=True)(input_layer) #mask_zero tells the model to ignore 0 values (padding)
    #Output shape = (batch_size, input_matrix_length, n_embedding_nodes)
    
    # Layer 3
    gru_layer = GRU(units=n_hidden_nodes)(embedding_layer)
    #Output shape = (batch_size, n_hidden_nodes)
    
    #Layer 4
    output_layer = Dense(units=1)(gru_layer)
    #Output shape = (batch_size, 1)
    
    #Specify which layers are input and output, compile model with loss and optimization functions
    model = Model(inputs=[input_layer], outputs=output_layer)
    model.compile(loss="mean_squared_error", optimizer='adam')
    
    return model

In [34]:
rnn_model = create_rnn_model(n_input_nodes=len(lexicon) + 1, n_embedding_nodes=300, n_hidden_nodes=500)

###  <font color='#6629b2'>Training</font>

The training function is exactly the same for the RNN as above, just with the padded review matrix provided as the input.

In [35]:
'''Train the model'''

rnn_model.fit(x=train_padded_idxs, y=train_reviews['Rating'], batch_size=20, epochs=5)
rnn_model.save('example_model/rnn/model.h5') #save model 

#Save lexicon to new model folder - same lexicon as above
with open('example_model/rnn/lexicon.pkl', 'wb') as f:
    pickle.dump(lexicon, f)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


###  <font color='#6629b2'>Prediction</font>

In [37]:
'''Load saved model'''

# Load lexicon
with open('pretrained_model/rnn/lexicon.pkl', 'rb') as f:
    rnn_lexicon = pickle.load(f)

# Load RNN model
from keras.models import load_model
rnn_model = load_model('pretrained_model/rnn/model.h5')

In [38]:
'''Put test reviews in padded matrix'''

test_reviews['Review_Idxs'] = tokens_to_idxs(token_seqs=test_reviews['Tokenized_Review'],
                                             lexicon=rnn_lexicon)
test_padded_idxs = pad_idx_seqs(test_reviews['Review_Idxs'])

print("TEST INPUT:\n", test_padded_idxs)
print("SHAPE:", test_padded_idxs.shape, "\n")

TEST INPUT:
 [[    0     0     0 ..., 19451  7875 12041]
 [    0     0     0 ..., 12884  8579   111]
 [    0     0     0 ..., 10307 11756   111]
 ..., 
 [    0     0     0 ...,   111  6736 16601]
 [    0     0     0 ...,  6896  7739   111]
 [    0     0     0 ...,  7572 17347   111]]
SHAPE: (100, 1045) 



In [39]:
'''Show ratings predicted by RNN alongside the other models' ratings'''

#Since ratings are integers, need to round predicted rating to nearest integer
test_reviews['RNN_Pred_Rating'] = numpy.round(rnn_model.predict(test_padded_idxs)[:,0]).astype(int)
test_reviews[['Review', 'Rating', 'MLP_BOW_Pred_Rating', 'MLP_Cont_BOW_Pred_Rating', 'RNN_Pred_Rating']]

Unnamed: 0,Review,Rating,MLP_BOW_Pred_Rating,MLP_Cont_BOW_Pred_Rating,RNN_Pred_Rating
0,"First of all i'd like to say that this movie is the greatest thing that ever happened to mankind. It is the best out of all the excellent Muppet movies, and every oth...",10,10,7,10
1,"Terrible writing, highly contrived, from a ""do-gooder"" who knows absolutely nothing about race relations in L.A., or the USA in the present day. The gushing positive ...",1,2,4,1
2,"I didn't expect too much from this movie, but I was still disappointed. It's supposed to be a comedy, but there are only four or five scenes where I actually laughed,...",4,1,1,2
3,"Corey Haim is never going to be known as one of the great actors of his time, but at least in movies like ""Licensed To Drive"", he was more in his element... lowbrow h...",2,3,5,1
4,"Being a great fan of Disney, i was really disappointed when i watched this garbage.The animation was pretty,and the backgrounds were amazing,but i believe that good a...",3,3,1,2
5,Barbara Payton is the suppose-to-be sultry sexy young hot Chickie wife of the geezer plantation owner somewhere in a jungley back lot set at a cheap studio in Hollywo...,1,0,2,3
6,Three distinct and distant individuals' lives intersect with the brutal killing of one by another. The one-hour film only reveals the event that brings the three indi...,8,5,5,3
7,I never dreamed when I started watching this DVD that I would be totally mesmerized by it within minutes. The story was completely absorbing and entertaining. The act...,10,9,6,10
8,Forget Jimmy Stewart reliving his life and opt for this smart comedy of errors instead. I suppose only institutionalized sexism explains why this flick and Stanwyck's...,8,7,8,7
9,Have you ever wondered what its like to feel FREE? I am sure that each one of us know the meaning of freedom and never seriously think of using it to our advantage. H...,8,12,11,10


###  <font color='#6629b2'>Evaluation</font>

In [40]:
'''Evaluate the model with R^2'''

r2 = r2_score(y_true=test_reviews['Rating'], y_pred=test_reviews['RNN_Pred_Rating'])
print("COEFFICIENT OF DETERMINATION (R2): {:3f}".format(r2))

COEFFICIENT OF DETERMINATION (R2): 0.532671


On the full test dataset of 25,000 reviews, the $R^2$ for this model is 0.622525. So the RNN outperforms the continuous bag-of-words MLP as well as the standard bag-of-words approach. 

### <font color='#6629b2'>Visualizing data inside the model</font>

To help visualize the data representation inside the model, we can look at the output of each layer in a model individually. Keras' Functional API lets you derive a new model with the layers from an existing model, so you can define the output to be a layer below the output layer in the original model. Calling predict() on this new model will produce the output of that layer for a given input. Of course, glancing at the numbers by themselves doesn't provide any interpretation of what the model has learned (although there are opportunities to [interpret these values](https://www.civisanalytics.com/blog/interpreting-visualizing-neural-networks-text-processing/)), but seeing them verifies the model is just a series of transformations from one matrix to another. The model stores its layers as the list model.layers, and you can retrieve a specific layer by its position index in the model.

In [41]:
'''Show the output of the RNN embedding layer (second layer) for the test reviews'''

embedding_layer = Model(inputs=rnn_model.layers[0].input, 
                        outputs=rnn_model.layers[1].output) #embedding layer is 2nd layer (index 1)
embedding_output = embedding_layer.predict(test_padded_idxs)
print("EMBEDDING LAYER OUTPUT SHAPE:", embedding_output.shape)
print(embedding_output[0])

EMBEDDING LAYER OUTPUT SHAPE: (100, 1045, 300)
[[-0.04613248  0.03407229  0.01457988 ...,  0.04840017  0.02082825
  -0.0346043 ]
 [-0.04613248  0.03407229  0.01457988 ...,  0.04840017  0.02082825
  -0.0346043 ]
 [-0.04613248  0.03407229  0.01457988 ...,  0.04840017  0.02082825
  -0.0346043 ]
 ..., 
 [ 0.05621845  0.0821346  -0.05775728 ...,  0.02890298 -0.04882133
   0.026389  ]
 [ 0.11277185 -0.02899412 -0.01214423 ...,  0.0121905   0.00099667
   0.13167386]
 [ 0.00972118  0.02033608  0.0676541  ...,  0.00688954 -0.06753249
   0.03044227]]


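It is also easy to look at the weight matrices that connect the layers. The get_weights() function returns a list of the weight arrays for a particular layer. For a Dense layer, the first item is the matrix of incoming weights and the second is the bias vector.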

In [42]:
'''Show weights that connect the RNN hidden layer to the output layer (final layer)'''

hidden_to_output_weights = rnn_model.layers[-1].get_weights()[0]
print("HIDDEN-TO_OUTPUT WEIGHTS SHAPE:", hidden_to_output_weights.shape)
print(hidden_to_output_weights)

HIDDEN-TO_OUTPUT WEIGHTS SHAPE: (500, 1)
[[ -1.29779382e-02]
 [  1.46091029e-01]
 [ -6.11346513e-02]
 [  2.08659321e-02]
 [ -8.56614485e-02]
 [ -1.24008268e-01]
 [ -2.92715225e-02]
 [ -2.10262723e-02]
 [  6.79068640e-02]
 [  3.12057231e-02]
 [  5.59645519e-02]
 [ -2.96079479e-02]
 [ -1.11443818e-01]
 [  4.93971538e-03]
 [ -9.87519547e-02]
 [  6.62838593e-02]
 [ -7.68962577e-02]
 [  1.39914781e-01]
 [  9.17546526e-02]
 [  5.06903604e-02]
 [  9.02168602e-02]
 [ -1.69267729e-01]
 [ -1.01738617e-01]
 [ -9.78948735e-03]
 [ -3.36330496e-02]
 [  3.07646189e-02]
 [ -1.02849320e-01]
 [  7.84067661e-02]
 [  1.36644328e-02]
 [  5.30993799e-03]
 [  1.83690563e-02]
 [  3.70506607e-02]
 [ -9.64718238e-02]
 [ -3.60641219e-02]
 [  3.10728699e-02]
 [ -1.01482728e-02]
 [  1.46242306e-01]
 [  1.19888475e-02]
 [ -1.31578311e-01]
 [ -1.17450636e-02]
 [  1.23016298e-01]
 [  7.38845244e-02]
 [ -3.76717001e-02]
 [ -8.45564157e-02]
 [ -8.21571797e-02]
 [ -1.14506958e-02]
 [ -1.14080613e-03]
 [  2.89450251e-02]

## <font color='#6629b2'>Conclusion</font>

As mentioned above, the models shown here could be applied to any task where the goal is to predict a score for a particular sequence. For ratings prediction, this score is ordinal, but it could also be categorical with a few simple changes to the output layer of the model. My other notebooks in this repository for language modeling/generation and part-of-speech tagging demonstrate this type of prediction with categorical variables. They also show how to build an RNN in Keras when the output is a sequence of labels, rather than a single value as shown here.
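
As a rough illustration of the categorical case, the output layer would produce one score per class instead of a single value, and a softmax function would turn those scores into probabilities. In Keras this would typically mean a Dense layer with one unit per class and a softmax activation, trained with a cross-entropy loss; I'm stating that as an assumption here rather than reproducing code from the other notebooks. The scores below are made up, and treating each rating from 1 to 10 as its own class is also an assumption.

```python
import numpy

def softmax(scores):
    '''Convert a vector of raw scores into probabilities that sum to 1'''
    exp_scores = numpy.exp(scores - numpy.max(scores)) # subtract max for numerical stability
    return exp_scores / exp_scores.sum()

# Made-up output-layer scores for one review, one score per rating class (1 through 10)
toy_scores = numpy.array([0.1, -0.3, 0.0, 0.2, 0.5, 0.4, 1.1, 2.3, 1.7, 0.9])
toy_probs = softmax(toy_scores)
toy_pred_rating = int(numpy.argmax(toy_probs)) + 1 # class index 0 corresponds to a rating of 1
print("PREDICTED RATING:", toy_pred_rating)
```

The trade-off is that a categorical output ignores the ordering of the ratings, so predicting 9 for a review rated 10 is penalized the same as predicting 1.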

## <font color='#6629b2'>More resources</font>

Yoav Goldberg's book [Neural Network Methods for Natural Language Processing](http://www.morganclaypool.com/doi/abs/10.2200/S00762ED1V01Y201703HLT037) is a thorough introduction to neural networks for NLP tasks in general.

If you'd like to learn more about what Keras is doing under the hood, there is a [Theano tutorial](http://deeplearning.net/tutorial/lstm.html) that also applies an RNN to sentiment prediction, using the same dataset used here.

Andrej Karpathy's blog post [The Unreasonable Effectiveness of Recurrent Neural Networks](http://karpathy.github.io/2015/05/21/rnn-effectiveness/) is very helpful for understanding the mathematical details of an RNN, applied to the task of language modeling. It also provides raw Python code with an implementation of the backpropagation algorithm.

TensorFlow also has an RNN language model [tutorial](https://www.tensorflow.org/versions/r0.12/tutorials/recurrent/index.html) using the Penn Treebank dataset.

Chris Olah provides a good [explanation](http://colah.github.io/posts/2015-08-Understanding-LSTMs/) of how LSTM RNNs work (this explanation also applies to the GRU model used here).

Denny Britz's [tutorial](http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/) documents well both the technical details of RNNs and their implementation in Python.