# <font color='#6629b2'>**Predicting Sentiment Ratings Using RNN with Keras**</font>

---

### <font color='#b28cd9'>**Author**:</font>  
**Vincent Paul**  
📅 **Date**: *October 23, 2023*  
📧 **Contact**: [vincentpaul.vin1990@gmail.com](mailto:vincentpaul.vin1990@gmail.com)  

---

### <font color='#b28cd9'>**About This Project**</font>  

This project leverages Recurrent Neural Networks (RNN) to predict sentiment ratings using the Keras framework. The dataset and techniques used in this notebook are publicly available and demonstrate core expertise in deep learning for natural language processing (NLP).  

⚠️ *Note*: All sensitive or proprietary data and code from real-world projects have been excluded to maintain confidentiality and ensure compliance with ethical guidelines. The techniques demonstrated here reflect advanced skills in building, training, and evaluating RNN models for practical applications.

---

### <font color='#b28cd9'>**Project Highlights**</font>

- 🚀 **Deep Learning Approach**: Uses RNNs to tackle sentiment prediction with a focus on temporal data dependencies.  
- 📊 **Model Implementation**: Comprehensive model architecture with training and evaluation metrics.  


## <font color='#6629b2'>Overview</font>

I am going to show how to use the Keras library to build both a multilayer perceptron (MLP) model and a recurrent neural network (RNN) model that predict sentiment ratings for text sequences. Specifically, the models will predict the ratings associated with movie reviews.

### <font color='#6629b2'>Neural Networks for Language Data</font>

At a high level, neural networks encode some input variables via a set of parameters (weights) that are optimized to predict some output variable. The simplest type of neural network is a feed-forward multilayer perceptron (MLP), which operates on some feature representation of a linguistic input. Recurrent neural networks (RNNs) are an extension of this simple model that specifically model the sequential aspect of the input and thus are particularly useful for natural language processing tasks. The notebook demonstrates the code needed to assemble MLP and RNN models for an NLP task using the Keras library, as well as some data processing tools that facilitate building the model.

If you understand how to structure the input and output of the model, and know the fundamental concepts in machine learning, then a high-level understanding of how a neural network works is sufficient for using Keras. You'll see that most of the code here is actually just data manipulation, and I'll visualize each step in this process. The code used to assemble the models themselves is much shorter. It is of course useful to know the details of how neural networks work, so you can reason about the results and modify the model to improve it. For a better understanding of neural networks and RNNs in particular, see the resources at the bottom of the notebook.

Here a neural network will be used to encode the text of a movie review, and this representation will be used to predict the numerical rating assigned by the reviewer. The model shown here can be applied to any task where the goal is to predict a numerical score associated with a piece of text. Hopefully you can substitute your own datasets and/or modify the code to adapt it to other tasks.

### <font color='#6629b2'>Keras</font>

[Keras](https://keras.io/) is a Python deep learning framework that lets you quickly put together neural network models with a minimal amount of code. It can be run on top of the mathematical optimization libraries [Theano](http://deeplearning.net/software/theano/) or [TensorFlow](https://www.tensorflow.org/) without you needing to know either of these underlying frameworks. It provides implementations of many of the layer architectures, objective functions, and optimization algorithms you need for building a machine learning model.

## <font color='#6629b2'>Dataset</font>

The [Large Movie Review Dataset](http://ai.stanford.edu/~amaas/data/sentiment/) consists of 50,000 movie reviews from [IMDB](http://www.imdb.com/). The ratings are on a 1-10 scale, but the dataset specifically contains "polarized" reviews: positive reviews with a rating of 7 or higher, and negative reviews with a rating of 4 or lower. There are an equal number of positive and negative reviews. In the full dataset, the reviews are divided into train and test sets with 25,000 reviews each. Here I'm just going to load a sample training set of 100 reviews. You can download the full dataset at the above link.

In [1]:
from __future__ import print_function #Python 2/3 compatibility for print statements
import pandas
pandas.set_option('display.max_colwidth', 170) #widen pandas rows display

I'll load the datasets using the [pandas library](https://pandas.pydata.org/), which is extremely useful for any task involving data storage and manipulation. This library puts a dataset into a readable table format, and makes it easy to retrieve specific columns and rows.

In [3]:
'''Load the training dataset'''
train_reviews = pandas.read_csv('/content/sample_data/example_train_imdb_reviews.csv', encoding='utf-8')
train_reviews[:10]

Unnamed: 0,Rating,Review
0,2,this movie only gets a second star because i work downtown and liked seeing it destroyed. the effects were pretty good- i hear it was the most expensive Korean film e...
1,8,"As I watched this movie, and I began to see its' characters develop I could feel this would be an excellent picture. When you get that feeling, and the movie indeed f..."
2,4,"this seemed an odd combination of Withnail and I with A Room with a View.. sometimes it worked, other times it did not. tragedy that they changed the name for the US ..."
3,9,"When I saw the Exterminators of year 3000 at first time, I had no expectations for that movie. Although, it wasn't so bad as I was thought. It's kind of Italian versi..."
4,9,"This is a very entertaining flick, considering the budget and its length. The storyline is hardly ever touched on in the movie world so it also brought a sense of nov..."
5,1,"""Trigger Man"" is definitely the most boring and silliest movie I've ever seen in my life. My aunt's holiday videos are more fascinating. The actors seem to be recrui..."
6,10,If you havn't seen this movie I highly recommend you do.It's an excellent true story.I love Alison Lohman she is so talented side note: I also loved her in 7th heaven...
7,9,"I went to see Fever Pitch with my Mom, and I can say that we both loved it. It wasn't the typical romantic comedy where someone is pining for the other, and blah blah..."
8,9,"First ever viewing: July 21, 2008 Very impressive screenplay and comedic acting and timing in this film. Now 40 years old, it has lost none of it's power. Neil Simon..."
9,7,"Weak, fast and multicolor,this is the Valvoline's movie in fact you can see always this brand of oil in a lot of scene. The real protagonist are the cars,weak perform..."


## <font color='#6629b2'>Preparing the data</font>

###  <font color='#6629b2'>Tokenization</font>

The first preprocessing step is to tokenize each of the reviews into (lowercased) individual words, since the models will encode the reviews at the word level (rather than subword units like characters, for example). For this I'll use [spaCy](https://spacy.io/), which is a fast and extremely user-friendly library that performs various language processing tasks. Once you load a spaCy model for a particular language, you can provide any text as input to the model, e.g. encoder(text) and access its linguistic features.

In [6]:
'''Split texts into lists of words (tokens)'''
#!pip install spacy
#!python -m spacy download en_core_web_sm # Install the en_core_web_sm model
import spacy

encoder = spacy.load('en_core_web_sm') # Load the en_core_web_sm model

def text_to_tokens(text_seqs):
    token_seqs = [[word.lower_ for word in encoder(text_seq)] for text_seq in text_seqs]
    return token_seqs

train_reviews['Tokenized_Review'] = text_to_tokens(train_reviews['Review'])

train_reviews[['Review','Tokenized_Review']][:10]

Unnamed: 0,Review,Tokenized_Review
0,this movie only gets a second star because i work downtown and liked seeing it destroyed. the effects were pretty good- i hear it was the most expensive Korean film e...,"[this, movie, only, gets, a, second, star, because, i, work, downtown, and, liked, seeing, it, destroyed, ., the, effects, were, pretty, good-, i, hear, it, was, the,..."
1,"As I watched this movie, and I began to see its' characters develop I could feel this would be an excellent picture. When you get that feeling, and the movie indeed f...","[as, i, watched, this, movie, ,, and, i, began, to, see, its, ', characters, develop, i, could, feel, this, would, be, an, excellent, picture, ., when, you, get, that..."
2,"this seemed an odd combination of Withnail and I with A Room with a View.. sometimes it worked, other times it did not. tragedy that they changed the name for the US ...","[this, seemed, an, odd, combination, of, withnail, and, i, with, a, room, with, a, view, .., sometimes, it, worked, ,, other, times, it, did, not, ., tragedy, that, t..."
3,"When I saw the Exterminators of year 3000 at first time, I had no expectations for that movie. Although, it wasn't so bad as I was thought. It's kind of Italian versi...","[when, i, saw, the, exterminators, of, year, 3000, at, first, time, ,, i, had, no, expectations, for, that, movie, ., although, ,, it, was, n't, so, bad, as, i, was, ..."
4,"This is a very entertaining flick, considering the budget and its length. The storyline is hardly ever touched on in the movie world so it also brought a sense of nov...","[this, is, a, very, entertaining, flick, ,, considering, the, budget, and, its, length, ., the, storyline, is, hardly, ever, touched, on, in, the, movie, world, so, i..."
5,"""Trigger Man"" is definitely the most boring and silliest movie I've ever seen in my life. My aunt's holiday videos are more fascinating. The actors seem to be recrui...","["", trigger, man, "", is, definitely, the, most, boring, and, silliest, movie, i, 've, ever, seen, in, my, life, ., my, aunt, 's, holiday, videos, are, more, fascinati..."
6,If you havn't seen this movie I highly recommend you do.It's an excellent true story.I love Alison Lohman she is so talented side note: I also loved her in 7th heaven...,"[if, you, havn't, seen, this, movie, i, highly, recommend, you, do, ., it, 's, an, excellent, true, story, ., i, love, alison, lohman, she, is, so, talented, side, no..."
7,"I went to see Fever Pitch with my Mom, and I can say that we both loved it. It wasn't the typical romantic comedy where someone is pining for the other, and blah blah...","[i, went, to, see, fever, pitch, with, my, mom, ,, and, i, can, say, that, we, both, loved, it, ., it, was, n't, the, typical, romantic, comedy, where, someone, is, p..."
8,"First ever viewing: July 21, 2008 Very impressive screenplay and comedic acting and timing in this film. Now 40 years old, it has lost none of it's power. Neil Simon...","[first, ever, viewing, :, july, 21, ,, 2008, , very, impressive, screenplay, and, comedic, acting, and, timing, in, this, film, ., now, 40, years, old, ,, it, has, l..."
9,"Weak, fast and multicolor,this is the Valvoline's movie in fact you can see always this brand of oil in a lot of scene. The real protagonist are the cars,weak perform...","[weak, ,, fast, and, multicolor, ,, this, is, the, valvoline, 's, movie, in, fact, you, can, see, always, this, brand, of, oil, in, a, lot, of, scene, ., the, real, p..."


###  <font color='#6629b2'>Lexicon</font>

Then we need to assemble a lexicon (aka vocabulary) of words that the model needs to know. Each tokenized word in the reviews is added to the lexicon, and then each word is mapped to a numerical index that can be read by the model. Since large datasets may contain a huge number of unique words, it's common to filter out all words occurring fewer than a certain number of times, and replace them with a generic &lt;UNK&gt; token. The min_freq parameter in the function below defines this threshold. When assigning the indices, the number 1 will represent unknown words. The number 0 will represent "empty" word slots, which is explained below. Therefore "real" words will have indices of 2 or higher.

In [12]:
import os
import pickle

def make_lexicon(token_seqs, min_freq=1, use_padding=False):
    # First, count how often each word appears in the text.
    token_counts = {}
    for seq in token_seqs:
        for token in seq:
            if token in token_counts:
                token_counts[token] += 1
            else:
                token_counts[token] = 1
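
The counting loop above covers only the first of the steps described in this section. As a minimal sketch of how the remaining steps (filtering by min_freq and assigning indices) could look, here is a small self-contained version. The function name build_lexicon_sketch and the toy data are hypothetical and are not the notebook's own code; the index scheme (0 for padding, 1 for unknown words, 2 and up for real words) follows the description above.

```python
from collections import Counter

def build_lexicon_sketch(token_seqs, min_freq=1):
    """Hypothetical helper: map each sufficiently frequent token to an index >= 2."""
    # Count how often each token appears across all sequences.
    token_counts = Counter(token for seq in token_seqs for token in seq)

    # Keep tokens seen at least min_freq times, in order of first appearance.
    kept_tokens = [token for token, count in token_counts.items() if count >= min_freq]

    # Index 0 is reserved for padding and 1 for unknown words, so real words start at 2.
    lexicon = {token: idx for idx, token in enumerate(kept_tokens, start=2)}
    lexicon['<UNK>'] = 1
    return lexicon

# Toy data: 'great' and 'bad' each occur once, so min_freq=2 filters them out.
toy_seqs = [['this', 'movie', 'is', 'great'], ['this', 'movie', 'is', 'bad']]
toy_lexicon = build_lexicon_sketch(toy_seqs, min_freq=2)
print(toy_lexicon)
```

With min_freq=2, only the three words that occur in both toy reviews receive their own index; the rare words would later be mapped to the &lt;UNK&gt; index.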

###  <font color='#6629b2'>From strings to numbers</font>

Once the lexicon is built, we can use it to transform each review from a list of string tokens into a list of numerical indices.

In [13]:
'''Convert each text from a list of tokens to a list of numbers (indices)'''

def tokens_to_idxs(token_seqs, lexicon):
    idx_seqs = [[lexicon[token] if token in lexicon else lexicon['<UNK>'] for token in token_seq]
                                                                     for token_seq in token_seqs]
    return idx_seqs

train_reviews['Review_Idxs'] = tokens_to_idxs(token_seqs=train_reviews['Tokenized_Review'],
                                              lexicon=lexicon)

train_reviews[['Tokenized_Review', 'Review_Idxs']][:10]

Unnamed: 0,Tokenized_Review,Review_Idxs
0,"[this, movie, only, gets, a, second, star, because, i, work, downtown, and, liked, seeing, it, destroyed, ., the, effects, were, pretty, good-, i, hear, it, was, the,...","[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 10, 24, 16, 25, 19, 26, 27, 28, 29, 30, 31, 18, 32, 19, 26, 27, 13, 33, 34, 35, 36, 1..."
1,"[as, i, watched, this, movie, ,, and, i, began, to, see, its, ', characters, develop, i, could, feel, this, would, be, an, excellent, picture, ., when, you, get, that...","[112, 10, 113, 2, 3, 51, 13, 10, 114, 74, 115, 116, 117, 118, 119, 10, 120, 121, 2, 122, 123, 124, 125, 126, 18, 127, 128, 58, 55, 129, 51, 13, 19, 3, 130, 131, 132, ..."
2,"[this, seemed, an, odd, combination, of, withnail, and, i, with, a, room, with, a, view, .., sometimes, it, worked, ,, other, times, it, did, not, ., tragedy, that, t...","[2, 169, 124, 170, 171, 39, 172, 13, 10, 173, 6, 174, 173, 6, 175, 176, 177, 16, 178, 51, 179, 180, 16, 93, 69, 18, 181, 55, 102, 182, 19, 183, 159, 19, 184, 185, 186..."
3,"[when, i, saw, the, exterminators, of, year, 3000, at, first, time, ,, i, had, no, expectations, for, that, movie, ., although, ,, it, was, n't, so, bad, as, i, was, ...","[127, 10, 200, 19, 201, 39, 202, 203, 166, 204, 73, 51, 10, 136, 205, 133, 159, 55, 3, 18, 206, 51, 16, 25, 44, 42, 89, 112, 10, 25, 207, 18, 16, 208, 209, 39, 210, 2..."
4,"[this, is, a, very, entertaining, flick, ,, considering, the, budget, and, its, length, ., the, storyline, is, hardly, ever, touched, on, in, the, movie, world, so, i...","[2, 77, 6, 137, 263, 266, 51, 267, 19, 268, 13, 116, 269, 18, 19, 270, 77, 271, 30, 272, 106, 215, 19, 3, 273, 42, 16, 228, 274, 6, 275, 39, 276, 18, 19, 64, 25, 277,..."
5,"["", trigger, man, "", is, definitely, the, most, boring, and, silliest, movie, i, 've, ever, seen, in, my, life, ., my, aunt, 's, holiday, videos, are, more, fascinati...","[287, 288, 289, 287, 77, 290, 19, 26, 291, 13, 292, 3, 10, 293, 30, 294, 215, 295, 76, 18, 295, 296, 208, 297, 298, 227, 299, 300, 18, 301, 19, 302, 303, 74, 123, 304..."
6,"[if, you, havn't, seen, this, movie, i, highly, recommend, you, do, ., it, 's, an, excellent, true, story, ., i, love, alison, lohman, she, is, so, talented, side, no...","[79, 128, 340, 294, 2, 3, 10, 341, 70, 128, 68, 18, 16, 208, 124, 125, 342, 221, 18, 10, 343, 344, 345, 346, 77, 42, 347, 348, 349, 237, 10, 228, 350, 351, 215, 352, ..."
7,"[i, went, to, see, fever, pitch, with, my, mom, ,, and, i, can, say, that, we, both, loved, it, ., it, was, n't, the, typical, romantic, comedy, where, someone, is, p...","[10, 360, 74, 115, 361, 362, 173, 295, 363, 51, 13, 10, 161, 162, 55, 364, 150, 350, 16, 18, 16, 25, 44, 19, 365, 366, 367, 91, 368, 77, 369, 159, 19, 179, 51, 13, 37..."
8,"[first, ever, viewing, :, july, 21, ,, 2008, , very, impressive, screenplay, and, comedic, acting, and, timing, in, this, film, ., now, 40, years, old, ,, it, has, l...","[204, 30, 397, 237, 398, 399, 51, 400, 301, 137, 401, 402, 13, 403, 64, 13, 404, 215, 2, 29, 18, 405, 406, 110, 407, 51, 16, 408, 409, 410, 39, 16, 208, 411, 18, 412,..."
9,"[weak, ,, fast, and, multicolor, ,, this, is, the, valvoline, 's, movie, in, fact, you, can, see, always, this, brand, of, oil, in, a, lot, of, scene, ., the, real, p...","[450, 51, 451, 13, 452, 51, 2, 77, 19, 453, 208, 3, 215, 454, 128, 161, 115, 455, 2, 456, 39, 457, 215, 6, 458, 39, 459, 18, 19, 378, 460, 227, 19, 461, 51, 450, 426,..."
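
The training reviews above contain no unknown words, because the lexicon was built from those same reviews. To illustrate what happens with unseen text, here is a small sketch using a hypothetical toy lexicon (the names toy_lexicon and tokens_to_idxs_sketch are made up for this example). It uses dict.get as an equivalent way to express the same lookup-with-fallback logic.

```python
def tokens_to_idxs_sketch(token_seqs, lexicon):
    # dict.get falls back to the <UNK> index for tokens missing from the lexicon
    unk_idx = lexicon['<UNK>']
    return [[lexicon.get(token, unk_idx) for token in token_seq] for token_seq in token_seqs]

# Hypothetical lexicon: 'dreadful' is not in it, so it maps to 1 (<UNK>)
toy_lexicon = {'<UNK>': 1, 'this': 2, 'movie': 3, 'is': 4}
toy_idxs = tokens_to_idxs_sketch([['this', 'movie', 'is', 'dreadful']], toy_lexicon)
print(toy_idxs)  # [[2, 3, 4, 1]]
```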


##  <font color='#6629b2'>Building a Multilayer Perceptron</font>

Before I show how to build an RNN for this task, I'll demonstrate an even simpler model, a multilayer perceptron (MLP). Unlike an RNN, an MLP model is not a sequence model - it represents data as a flat matrix of features rather than a time-ordered sequence of features. For language data, this generally means that the word order of a sequence will not be explicitly encoded into a model. The importance of word order varies for different NLP tasks; in some cases, order-sensitive approaches do not necessarily perform better.

###  <font color='#6629b2'>Numerical lists to bag-of-words vectors</font>

The simplest and most common representation of a text in NLP is as a bag-of-words vector. A bag-of-words vector encodes a sequence as an array with a dimension for each word in the lexicon. The value for each dimension is the number of times the word corresponding to that dimension appears in the text. Thus a dataset of text sequences is encoded as a matrix where each row represents a sequence and each column represents a word whose value is the frequency of that word in the sequence (it is also common to apply some weighting function to these values such as [tf-idf](https://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html), but here we'll just use counts).

You may notice that the first dimension of the matrix has a count of 0 for all reviews, since there are no words represented by 0. This dimension is only relevant for the RNN model, where 0 will be used in a special way to indicate null words. This is explained fully below. For the MLP model, including this dimension won't make a difference because the model will learn to ignore it in making predictions.

In [14]:
'''Encode reviews as bag-of-words vectors'''

import numpy

def idx_seqs_to_bows(idx_seqs, matrix_length):
    bow_seqs = numpy.array([numpy.bincount(numpy.array(idx_seq), minlength=matrix_length)
                            for idx_seq in idx_seqs])
    return bow_seqs

bow_train_reviews = idx_seqs_to_bows(train_reviews['Review_Idxs'],
                                     matrix_length=len(lexicon) + 1) #add one to length for padding
print("TRAIN INPUT:\n", bow_train_reviews)
print("SHAPE:", bow_train_reviews.shape, "\n")

#Show an example mapping string words to counts
lexicon_lookup = {idx: lexicon_item for lexicon_item, idx in lexicon.items()}
lexicon_lookup[0] = ""
pandas.DataFrame([(lexicon_lookup[idx], count) for idx, count in enumerate(bow_train_reviews[0])],
                 columns=['Word', 'Count'])

TRAIN INPUT:
 [[0 0 4 ... 0 0 0]
 [0 0 4 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 ...
 [0 0 4 ... 0 0 0]
 [0 0 1 ... 0 0 0]
 [0 0 4 ... 1 1 1]]
SHAPE: (100, 2630) 



Unnamed: 0,Word,Count
0,,0
1,<UNK>,0
2,this,4
3,movie,2
4,only,1
...,...,...
2625,suck*d,0
2626,seriously,0
2627,overdose,0
2628,verdict,0
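
The full matrix is too large to inspect by eye, so here is the same transformation on a tiny, hypothetical example (the index sequences and lexicon size are made up). It shows how numpy.bincount turns each index sequence into a row of counts, and why column 0 stays at zero for every row.

```python
import numpy

toy_idx_seqs = [[2, 3, 2, 4], [3, 1]]   # hypothetical index sequences
toy_lexicon_size = 4                     # <UNK> plus three real words -> indices 1..4

# minlength pads every row to the same width: one column per index 0..4
toy_bows = numpy.array([numpy.bincount(numpy.array(seq), minlength=toy_lexicon_size + 1)
                        for seq in toy_idx_seqs])
print(toy_bows)
# [[0 0 2 1 1]
#  [0 1 0 1 0]]
```

In the first row, index 2 appears twice, so column 2 holds a count of 2. The second row contains an unknown word (index 1), which is counted in column 1.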


###  <font color='#6629b2'>Keras Model</font>

To assemble the model, we'll use Keras' [Functional API](https://keras.io/getting-started/functional-api-guide/), which is one of two ways to use Keras to assemble models (the alternative is the [Sequential API](https://keras.io/getting-started/sequential-model-guide/), which is a bit simpler but has more constraints). A model consists of a series of layers. As shown in the code below, we initialize instances for each layer. Each layer can be called with another layer as input, e.g. Dense()(input_layer). A model instance is initialized with the Model() object, which defines the initial input and final output layers for that model. Before the model can be trained, the compile() function must be called with the loss function and optimization algorithm specified (see below).

###  <font color='#6629b2'>Layers</font>

We'll build an MLP with three layers:

**1. Input**: The input layer takes in the matrix of sequence vectors.

**2. Dense (sigmoid activation)**: A hidden [layer](https://keras.io/layers/core/#dense), which is what defines the model as a multilayer perceptron. This layer transforms the input matrix by applying a nonlinear transformation function (here, the sigmoid function). Intuitively, this layer can be thought of as computing a "feature representation" of the input words matrix.

**3. Dense (linear activation)**: An output layer that predicts the rating for the review based on its hidden representation given by the previous layer. This output is continuous (i.e. ranging from 1-10) rather than categorical, which means it has linear activation rather than nonlinear like the hidden layer (by default, activation='linear' for the Dense layer in Keras). The model gets feedback during training about what the actual ratings for the reviews should be.

The term "layer" is just an abstraction, when really all these layers are just matrices. The "weights" that connect the layers are also matrices. The process of training a neural network is a series of matrix multiplications. The weight matrices are the values that are adjusted during training in order for the model to learn to predict ratings.
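
To make the "layers are just matrices" idea concrete, here is a rough NumPy sketch of the forward pass through a model shaped like the one described above. This is only an illustration under my own assumptions: the sizes are toy values, the weights are random rather than trained, and the names (W_hidden, mlp_forward, and so on) are hypothetical. It is not how Keras implements the computation internally.

```python
import numpy

rng = numpy.random.default_rng(0)

n_input_nodes, n_hidden_nodes, batch_size = 6, 4, 3   # toy sizes, chosen arbitrarily

# Randomly initialized weights and zero biases stand in for trained parameters
W_hidden = rng.normal(size=(n_input_nodes, n_hidden_nodes))
b_hidden = numpy.zeros(n_hidden_nodes)
W_output = rng.normal(size=(n_hidden_nodes, 1))
b_output = numpy.zeros(1)

def sigmoid(z):
    return 1.0 / (1.0 + numpy.exp(-z))

def mlp_forward(x):
    hidden = sigmoid(x @ W_hidden + b_hidden)   # shape (batch_size, n_hidden_nodes)
    return hidden @ W_output + b_output          # linear output, shape (batch_size, 1)

# A fake batch of bag-of-words count vectors
toy_bow_batch = rng.integers(0, 3, size=(batch_size, n_input_nodes))
toy_preds = mlp_forward(toy_bow_batch)
print(toy_preds.shape)  # (3, 1)
```

Training adjusts the values in W_hidden, b_hidden, W_output and b_output so that the outputs move closer to the actual ratings.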

###  <font color='#6629b2'>Parameters</font>

Our function for creating the model takes two parameters:

**n_input_nodes**: In the case of reviews encoded as bag-of-words vectors, this is the number of unique words in the lexicon, plus one to account for the padding represented by 0 values (which are only relevant for the RNN model, but this dimension can be included here without any cost to the model).

**n_hidden_nodes**: the number of dimensions in the hidden layer. This can be freely chosen; here, it is set to 500.

###  <font color='#6629b2'>Procedure</font>

The output of the model is a single continuous value (the predicted rating), making this a regression rather than a classification model. There is only one dimension in the output layer, which contains the predicted rating. All neural networks learn by updating the parameters (weights) to optimize an objective (loss) function. For this model, the objective is to minimize the mean squared error between the predicted ratings and the actual ratings for the training reviews, thus bringing the predicted ratings closer to the real ratings. The details of this process are extensive; see the resources at the bottom of the notebook if you want a deeper understanding. One huge benefit of Keras is that it implements many of these details for you. Not only does it already have implementations of the types of layer architectures, it also has many of the [loss functions](https://keras.io/losses/) and [optimization methods](https://keras.io/optimizers/) you need for training various models. The specific loss function and optimization method you use are specified when compiling the model with the compile() function.
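
As a quick illustration of the objective, here is mean squared error computed by hand on a few made-up ratings (the values and the function are hypothetical, written only for this example; Keras computes the loss for you during training).

```python
def mean_squared_error(actual, predicted):
    # Average of the squared differences between actual and predicted values
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

toy_actual = [10, 1, 4, 2]       # hypothetical ratings
toy_predicted = [8, 3, 4, 5]     # hypothetical model outputs
toy_mse = mean_squared_error(toy_actual, toy_predicted)
print(toy_mse)  # (4 + 4 + 0 + 9) / 4 = 4.25
```

Because the differences are squared, a prediction that is off by 3 contributes more than twice as much to the loss as one that is off by 2.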

In [15]:
'''Create the Multilayer Perceptron model'''

from keras.models import Model
from keras.layers import Input, Dense

def create_mlp_model(n_input_nodes, n_hidden_nodes):

    # Layer 1 - Technically the shape of this layer is (batch_size, n_input_nodes).
    # The batch size is implicitly included in the shape of the input, so it does not need to
    # be specified as a dimension of the input.
    input_layer = Input(shape=(n_input_nodes,))
    #Shape = (batch_size, n_input_nodes)

    #Layer 2
    hidden_layer = Dense(units=n_hidden_nodes, activation='sigmoid')(input_layer)
    #Output shape = (batch_size, n_hidden_nodes)

    #Layer 3
    output_layer = Dense(units=1)(hidden_layer)
    #Output shape = (batch_size, 1)

    #Specify which layers are input and output, compile model with loss and optimization functions
    model = Model(inputs=[input_layer], outputs=output_layer)
    model.compile(loss="mean_squared_error", optimizer='adam')

    return model

In [16]:
mlp_bow_model = create_mlp_model(n_input_nodes=len(lexicon) + 1, n_hidden_nodes=500)

###  <font color='#6629b2'>Training</font>

Now we can train an MLP model on the training reviews encoded as a bag-of-words matrix. Keras trains in batches by default. We didn't specify a batch size when creating the model because it is passed to the training function instead (here, batch_size=20). If a batch size isn't given, Keras will use its default (32). The training function also takes the number of times to iterate through the training data (epochs). Keras reports the mean squared error loss after each epoch - if the model is learning correctly, it should progressively decrease.
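
To see how batch size and epochs relate to the number of weight updates, here is a small back-of-the-envelope calculation using the values from this notebook (100 sample reviews, and the batch_size=20 and epochs=5 passed to fit()). The variable names are just for illustration.

```python
import math

n_train_reviews = 100   # size of the sample training set
batch_size = 20         # value passed to fit() in this notebook
epochs = 5

# Each batch triggers one weight update; the last batch may be smaller than the rest
steps_per_epoch = math.ceil(n_train_reviews / batch_size)
total_updates = steps_per_epoch * epochs
print(steps_per_epoch, total_updates)  # 5 25
```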

In [17]:
'''Train the MLP model with bag-of-words representation'''

mlp_bow_model.fit(x=bow_train_reviews, y=train_reviews['Rating'], batch_size=20, epochs=5)
os.makedirs('example_model/mlp_bow', exist_ok=True) #make sure the save directory exists
mlp_bow_model.save('example_model/mlp_bow/model.h5') #save model

Epoch 1/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 1s 19ms/step - loss: 33.3685
Epoch 2/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - loss: 11.7345
Epoch 3/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - loss: 11.2725
Epoch 4/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - loss: 8.0612
Epoch 5/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 16ms/step - loss: 5.0486




### <font color='#6629b2'>Predicting ratings for reviews</font>

Once the model is trained, we can use it to predict ratings for the reviews in the test set. To demonstrate this, I'll load a saved model previously trained (for 25 epochs) on all 25,000 reviews in the training set. I'll apply this model to an example test set of 100 reviews (again, this is a tiny subset of the 25,000 reviews in the full test set provided at the above link).

In [27]:
'''Load saved model'''

# Load lexicon
with open('/content/lexicon.pkl', 'rb') as f:
    mlp_bow_lexicon = pickle.load(f)

# Load MLP BOW model
from keras.models import load_model
# This path matches the one used when saving the model above; change it if your model is stored elsewhere.
mlp_bow_model = load_model('example_model/mlp_bow/model.h5')



In [28]:
'''Load the test dataset, tokenize, and transform to numerical indices'''

test_reviews = pandas.read_csv('/content/sample_data/example_test_imdb_reviews.csv', encoding='utf-8')
test_reviews['Tokenized_Review'] = text_to_tokens(test_reviews['Review'])
test_reviews['Review_Idxs'] = tokens_to_idxs(token_seqs=test_reviews['Tokenized_Review'],
                                             lexicon=mlp_bow_lexicon)

In [29]:
'''Transform test reviews to a bag-of-words matrix'''

bow_test_reviews = idx_seqs_to_bows(test_reviews['Review_Idxs'],
                                    matrix_length=len(mlp_bow_lexicon) + 1) #add one to length for padding

print("TEST INPUT:\n", bow_test_reviews)
print("SHAPE:", bow_test_reviews.shape, "\n")

TEST INPUT:
 [[ 0  2  0 ...  0  0  0]
 [ 0  4  0 ...  0  0  0]
 [ 0  1  0 ...  0  0  0]
 ...
 [ 0 21  2 ...  0  0  0]
 [ 0  5  2 ...  0  0  0]
 [ 0  8  0 ...  0  0  0]]
SHAPE: (100, 13409) 



In [37]:
'''Show predicted ratings for test reviews alongside actual ratings'''

# The loaded model expects mlp_bow_model.input_shape[1] input columns, but the test matrix has
# one column per entry in mlp_bow_lexicon. Truncating the extra columns makes the shapes match;
# the predictions are only meaningful if the model was trained with this same lexicon.
bow_test_reviews_reshaped = bow_test_reviews[:, :mlp_bow_model.input_shape[1]]

#Since ratings are integers, need to round predicted rating to nearest integer
test_reviews['MLP_BOW_Pred_Rating'] = numpy.round(mlp_bow_model.predict(bow_test_reviews_reshaped)[:,0]).astype(int)
test_reviews[['Review', 'Rating', 'MLP_BOW_Pred_Rating']]

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step


Unnamed: 0,Review,Rating,MLP_BOW_Pred_Rating
0,"First of all i'd like to say that this movie is the greatest thing that ever happened to mankind. It is the best out of all the excellent Muppet movies, and every oth...",10,5
1,"Terrible writing, highly contrived, from a ""do-gooder"" who knows absolutely nothing about race relations in L.A., or the USA in the present day. The gushing positive ...",1,4
2,"I didn't expect too much from this movie, but I was still disappointed. It's supposed to be a comedy, but there are only four or five scenes where I actually laughed,...",4,3
3,"Corey Haim is never going to be known as one of the great actors of his time, but at least in movies like ""Licensed To Drive"", he was more in his element... lowbrow h...",2,4
4,"Being a great fan of Disney, i was really disappointed when i watched this garbage.The animation was pretty,and the backgrounds were amazing,but i believe that good a...",3,5
...,...,...,...
95,"I recently picked up all three Robocop films in one box set, rather cheaply and the only reason I did this was for the special edition of the superb first one. I have...",3,5
96,"This film as it is now is far shorter than it was when released in 1918. In fact, it is now more available with two other medium sized silent Chaplin features (A DOG'...",8,6
97,"The MTV sci-fi animated series ""Æon Flux"" is brought to life with Charlize Theron playing the title character, a freedom fighter who fights oppression in the walled c...",3,6
98,"I thought the movie was sub-par. The acting was good but not great, the story was funny but did not come out that way. The director dropped the ball on this movie. It...",4,5
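
A note on the rounding step: numpy.round rounds exact halves to the nearest even integer, and a linear output layer can produce values outside the 1-10 rating scale. The sketch below uses made-up raw predictions to show both effects. The clipping step is my own optional addition for illustration and is not part of the code above.

```python
import numpy

# Hypothetical raw model outputs with shape (n_reviews, 1)
toy_raw_preds = numpy.array([[11.3], [0.2], [6.5], [7.5], [3.49]])

rounded = numpy.round(toy_raw_preds[:, 0]).astype(int)
clipped = numpy.clip(rounded, 1, 10)   # optional step, not part of the original notebook
print(rounded)  # [11  0  6  8  3]
print(clipped)  # [10  1  6  8  3]
```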


###  <font color='#6629b2'>Evaluation</font>

A common evaluation for regression models like this one is $R^2$, called the coefficient of determination. This metric indicates the proportion of variance in the output variable (the rating) that is predictable from the input variable (the review text). The best possible score is 1.0, which indicates the model always predicts the correct rating. The score can also be negative, which means the model does worse than always predicting the mean rating. The scikit-learn library provides several [evaluation metrics](http://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics) for machine learning models, including $R^2$.

In [38]:
'''Evaluate the model with R^2'''

from sklearn.metrics import r2_score

r2 = r2_score(y_true=test_reviews['Rating'], y_pred=test_reviews['MLP_BOW_Pred_Rating'])
print("COEFFICIENT OF DETERMINATION (R2): {:.6f}".format(r2)) # .6f sets six decimal places explicitly

COEFFICIENT OF DETERMINATION (R2): -0.315410


On the full test dataset of 25,000 reviews, the $R^2$ for this model is 0.545692.
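
For intuition, $R^2$ can be computed by hand as one minus the ratio of the model's squared error to the squared error of always predicting the mean. The sketch below uses made-up ratings (not taken from the dataset) and a hypothetical helper named r2_by_hand. It illustrates why a model that does worse than the mean baseline gets a negative score, as happens on the 100-review example above.

```python
import numpy

def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    y_true = numpy.asarray(y_true, dtype=float)
    y_pred = numpy.asarray(y_pred, dtype=float)
    ss_res = numpy.sum((y_true - y_pred) ** 2)         # squared error of the model
    ss_tot = numpy.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
    return 1.0 - ss_res / ss_tot

toy_ratings = [1, 4, 7, 10]  # made-up ratings
print(r2_by_hand(toy_ratings, [1, 4, 7, 10]))         # perfect predictions
print(r2_by_hand(toy_ratings, [5.5, 5.5, 5.5, 5.5]))  # always predict the mean
print(r2_by_hand(toy_ratings, [10, 7, 4, 1]))         # worse than the mean
```

The three calls give 1.0, 0.0 and -3.0 respectively, so a negative value is not an error in the metric, just a sign that the predictions are further from the true ratings than the mean is.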

###  <font color='#6629b2'>Alternative input to MLP: continuous bag-of-words vectors</font>

An alternative to the traditional bag-of-words representation is to encode sequences as a combination of their individual word embeddings. A word embedding is an n-dimensional vector of real values that together are intended to encode the "meaning" of a word. Word embedding models explicitly learn to represent words by trying to correctly predict other words that appear in the same context (or alternatively, trying to predict a word based on the context words). The result of these models are embedding vectors for words such that words with similar meanings (should) end up having similar vectors.

####  <font color='#6629b2'>spaCy word embeddings</font>

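The spaCy library provides [GloVe embeddings](https://spacy.io/usage/vectors-similarity) for each word, which can be accessed simply with word.vector after loading the text into spaCy. The number of dimensions depends on which spaCy model is loaded: the larger models provide 300-dimensional vectors, while the outputs shown in this notebook have 96 dimensions.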

In [39]:
emb_vector = encoder("creepy").vector
print(emb_vector)
print("SHAPE:", emb_vector.shape)

[-1.2771919  -0.5306119  -0.6723773   0.9399318   0.7014829  -0.06380877
 -0.17946607  1.2284752   0.3254117  -0.8401215   0.51544535 -0.47977853
 -1.8846526  -0.65224385 -0.5812104  -0.4413007  -0.5419064  -1.205464
  0.3714249   0.4398198  -2.0586605  -0.24578388 -0.64598906  0.7789111
  0.25806653  1.3895837  -0.07748687  2.6762967  -0.39110157  0.8439399
 -0.8933209  -0.2239567   1.1071227   0.6287627  -0.6351315  -0.8106915
  1.443773    1.5100088   1.3998082  -0.7222898  -0.43959406 -0.33109254
 -0.6244304   0.8159715   0.05237168 -0.09510709 -0.9770825  -0.30301273
  0.05190959  0.36668953  0.8388773   0.08583759 -0.24948218  0.79838246
 -0.01086752 -0.46395302  0.599818   -0.06811841  0.10504094  0.7174291
 -0.8300198  -1.0940253  -0.5625801  -0.9197243   0.3828925   0.28666058
  0.2914844   0.12213916 -0.30752128  0.6026769   0.95919716  0.828692
 -0.22078224 -0.11761147 -0.19814283  0.40180498 -0.46817616 -0.54180527
 -0.0052039  -1.2295475  -0.05248913 -0.09211217 -1.416068 

spaCy also has a built-in similarity function that returns the cosine similarity between the GloVe vectors for two words. For example, the vector for "creepy" is more similar to that of "scary" than "nice", as expected. See the link to the spaCy documentation for other functions that operate on the vectors.

In [40]:
print(encoder("creepy").similarity(encoder("scary")))
print(encoder("creepy").similarity(encoder("nice")))

0.59054191256348
0.4365483112105294
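
To show what the similarity score measures, here is a minimal sketch of cosine similarity computed directly with numpy. The three vectors are made-up 3-dimensional stand-ins rather than real spaCy vectors, and cosine_similarity is a hypothetical helper, not spaCy's own implementation.

```python
import numpy

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: (u . v) / (|u| |v|)."""
    u = numpy.asarray(u, dtype=float)
    v = numpy.asarray(v, dtype=float)
    return float(numpy.dot(u, v) / (numpy.linalg.norm(u) * numpy.linalg.norm(v)))

# Made-up 3-dimensional "embeddings", not real spaCy vectors
toy_creepy = [0.9, 0.1, -0.3]
toy_scary = [0.8, 0.2, -0.4]
toy_nice = [-0.2, 0.9, 0.5]

print(cosine_similarity(toy_creepy, toy_scary))  # vectors pointing in similar directions
print(cosine_similarity(toy_creepy, toy_nice))   # vectors pointing in different directions
```

Because the score depends only on the angle between the vectors, scaling a vector by a positive constant does not change its similarity to other vectors.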




####  <font color='#6629b2'>Combining embeddings</font>

We can use the embeddings as an alternative to the simple bag-of-words input to the model, by averaging the embeddings for all words in the review across each corresponding dimension (you could also sum them). So instead of having an input matrix with a column for each word in the lexicon, each column represents a word embedding dimension. This is referred to as a continuous bag-of-words vector. The advantage of this representation over the standard bag-of-words representation is that it more explicitly represents the meaning of the words in the review. For example, two reviews may express similar content (and have similar ratings) but vary in the exact words they use, so their continuous bag-of-words vectors may be more similar than their standard bag-of-words vectors. The model may then more readily observe that these reviews should receive similar ratings.

In [41]:
'''First encode reviews as sequences of word embeddings'''

def text_to_emb_seqs(seqs):
    emb_seqs = [numpy.array([word.vector for word in encoder(seq)]) for seq in seqs]
    return emb_seqs

emb_train_reviews = text_to_emb_seqs(train_reviews['Review'])

#Example of word embedding sequence for first review
pandas.DataFrame(list(zip(train_reviews['Tokenized_Review'][0], emb_train_reviews[0])),
                columns=['Word', 'Embedding'])

Unnamed: 0,Word,Embedding
0,this,"[1.0398248, -0.47319657, 0.4531712, 1.3767627, 0.13983172, -0.1368148, -1.165497, 0.8130773, -0.5901467, 1.5544857, 1.6845319, 2.4399178, 0.43748826, -0.6981137, -1.6..."
1,movie,"[0.2698257, -0.97961146, -0.7395812, 0.02972819, -0.064505786, -0.31669927, 1.8255165, 0.66397226, 0.34995073, -0.7414883, -0.47951263, -1.1280963, -0.41341066, 0.451..."
2,only,"[0.9798147, -1.197475, -0.23086295, 0.12655483, -0.37298548, -1.2982873, -1.5004623, 0.1415935, 0.43173927, 1.5707593, -0.72698927, -0.07494712, -1.2454133, -1.075552..."
3,gets,"[-0.20549291, 0.80465597, -0.35104805, 0.5999114, -0.1816248, -0.7037534, -0.20952076, 0.030374974, 0.31106865, 0.29921508, 1.3892739, -0.20258045, -0.05074957, -0.31..."
4,a,"[1.6478114, -0.046070993, 0.3715225, 1.2782764, 0.079325736, -0.3920662, -0.6950065, 0.41578826, 0.33306852, 1.9801154, -0.16126329, 0.3554897, 0.83267903, 1.1572309,..."
...,...,...
156,just,"[0.76390237, 0.05665025, 0.46596625, 1.5707626, 0.7354234, -1.0024043, 0.17120439, 0.87097025, -0.022323482, -1.5567553, -0.9469857, -0.24100772, -0.7424408, -0.43806..."
157,500,"[-0.29780307, 0.7686159, 0.03812629, 1.9336425, 1.8273361, -1.6827999, -0.97570264, 0.37755412, -0.5997787, -1.0221064, -1.1130649, 1.9261907, 0.24612485, 1.9928187, ..."
158,years,"[0.06757997, 0.9435554, -1.1780691, -0.5619186, 0.9982927, -2.4237146, 0.2672168, 0.9435766, -0.08155033, -1.553586, 0.5974688, -0.92620003, -0.28414994, 0.16658504, ..."
159,ago,"[3.2652576, -0.31532535, -0.65931463, -1.6294123, 0.2012563, -1.0684756, -0.43926695, 1.06173, 1.0890417, -1.3164845, -0.44607443, 0.1143742, -0.7743993, -0.94982773,..."


In [42]:
'''Encode reviews as continuous bag-of-words (mean of word embeddings)'''

def emb_seqs_to_cont_bows(emb_seqs):
    cont_bow_seqs =  numpy.array([numpy.mean(emb_seq, axis=0) for emb_seq in emb_seqs])
    return cont_bow_seqs

cont_bow_train_reviews = emb_seqs_to_cont_bows(emb_train_reviews)

print("TRAIN INPUT:\n", cont_bow_train_reviews)
print("SHAPE:", cont_bow_train_reviews.shape, "\n")

TRAIN INPUT:
 [[ 0.13532615 -0.23883443 -0.01957789 ...  0.41713232 -0.16439337
  -0.03107447]
 [ 0.04157898 -0.06338805 -0.0876648  ...  0.24281259 -0.2147435
   0.07698419]
 [ 0.18818249 -0.07859164 -0.12669191 ...  0.23263513 -0.24604952
   0.19342686]
 ...
 [ 0.20599732 -0.22576106 -0.05279769 ...  0.3049593  -0.42441097
   0.05485389]
 [ 0.05409485 -0.18928146 -0.0492447  ...  0.27937636 -0.3363585
   0.19121288]
 [-0.05725833 -0.29599538  0.03008783 ...  0.30035576 -0.15102664
   0.165662  ]]
SHAPE: (100, 96) 
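
To make the advantage of the continuous bag-of-words representation concrete, the sketch below compares two tiny reviews that share no words. The 2-dimensional embeddings are made up for illustration (they are not spaCy vectors), and bow_vector, cont_bow_vector and cosine are hypothetical helpers.

```python
import numpy

# Made-up 2-dimensional embeddings for a tiny vocabulary (not real spaCy vectors)
toy_embeddings = {
    "great": numpy.array([0.9, 0.1]),
    "superb": numpy.array([0.8, 0.2]),
    "film": numpy.array([0.1, 0.9]),
    "movie": numpy.array([0.2, 0.8]),
}
vocab = sorted(toy_embeddings)

def bow_vector(tokens):
    """Standard bag-of-words: one count per vocabulary word."""
    return numpy.array([tokens.count(word) for word in vocab], dtype=float)

def cont_bow_vector(tokens):
    """Continuous bag-of-words: mean of the word embeddings."""
    return numpy.mean([toy_embeddings[token] for token in tokens], axis=0)

def cosine(u, v):
    return float(numpy.dot(u, v) / (numpy.linalg.norm(u) * numpy.linalg.norm(v)))

review_a = ["great", "film"]
review_b = ["superb", "movie"]

print("BOW similarity:", cosine(bow_vector(review_a), bow_vector(review_b)))
print("Continuous BOW similarity:", cosine(cont_bow_vector(review_a), cont_bow_vector(review_b)))
```

With these toy values the standard bag-of-words vectors have no overlap at all, while the averaged embeddings are nearly identical, which is the effect described above.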



####  <font color='#6629b2'>Continuous bag-of-words MLP</font>

Now we can train the same MLP model to predict ratings from the reviews encoded as continuous bag-of-words vectors. The only difference between the parameters of this model compared to the previous model is that n_input_nodes is equal to the number of embedding dimensions instead of the number of words in the lexicon.

In [43]:
mlp_cont_bow_model = create_mlp_model(n_input_nodes=cont_bow_train_reviews.shape[-1], n_hidden_nodes=500)

####  <font color='#6629b2'>Training</font>

In [44]:
'''Train the model'''

mlp_cont_bow_model.fit(x=cont_bow_train_reviews, y=train_reviews['Rating'], batch_size=20, epochs=5)
mlp_cont_bow_model.save('example_model/mlp_cont_bow/model.h5') #save model

Epoch 1/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - loss: 32.6929
Epoch 2/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 22.6229
Epoch 3/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 12.4116
Epoch 4/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 11.0165
Epoch 5/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 0s 3ms/step - loss: 11.7283




####  <font color='#6629b2'>Prediction</font>

Again, I'll load this same model I previously trained on all 25,000 reviews in the training set and apply it to the example test set of 100 reviews.

In [48]:
'''Transform test reviews to a continuous bag-of-words matrix'''

cont_bow_test_reviews = emb_seqs_to_cont_bows(text_to_emb_seqs(test_reviews['Review']))

print("TEST INPUT:\n", cont_bow_test_reviews)
print("SHAPE:", cont_bow_test_reviews.shape, "\n")

TEST INPUT:
 [[-0.08643828 -0.30220205 -0.01058856 ...  0.19418764 -0.26124507
  -0.0298257 ]
 [ 0.14763236 -0.19857235 -0.12171064 ...  0.25155184 -0.22191487
   0.05345387]
 [ 0.09135521 -0.10806065 -0.01514891 ...  0.17516163 -0.20707077
   0.01969389]
 ...
 [ 0.19005351 -0.19684045 -0.08009551 ...  0.28915673 -0.3525466
   0.02061167]
 [ 0.14262317 -0.12892422  0.02113029 ...  0.3345954  -0.16054617
   0.12531607]
 [ 0.16125666 -0.3273875  -0.05189675 ...  0.28409258 -0.22093466
   0.10834946]]
SHAPE: (100, 96) 



In [49]:
'''Load saved model'''

# The path was changed from 'pretrained_model' to 'example_model', so this loads the model trained on the 100 example reviews above
mlp_cont_bow_model = load_model('example_model/mlp_cont_bow/model.h5')



In [50]:
'''Show ratings predicted by this model alongside previous model and actual ratings'''

#Since ratings are integers, need to round predicted rating to nearest integer
test_reviews['MLP_Cont_BOW_Pred_Rating'] = numpy.round(mlp_cont_bow_model.predict(cont_bow_test_reviews)[:,0]).astype(int)
test_reviews[['Review', 'Rating', 'MLP_BOW_Pred_Rating', 'MLP_Cont_BOW_Pred_Rating']]

4/4 ━━━━━━━━━━━━━━━━━━━━ 0s 12ms/step


Unnamed: 0,Review,Rating,MLP_BOW_Pred_Rating,MLP_Cont_BOW_Pred_Rating
0,"First of all i'd like to say that this movie is the greatest thing that ever happened to mankind. It is the best out of all the excellent Muppet movies, and every oth...",10,5,6
1,"Terrible writing, highly contrived, from a ""do-gooder"" who knows absolutely nothing about race relations in L.A., or the USA in the present day. The gushing positive ...",1,4,6
2,"I didn't expect too much from this movie, but I was still disappointed. It's supposed to be a comedy, but there are only four or five scenes where I actually laughed,...",4,3,6
3,"Corey Haim is never going to be known as one of the great actors of his time, but at least in movies like ""Licensed To Drive"", he was more in his element... lowbrow h...",2,4,6
4,"Being a great fan of Disney, i was really disappointed when i watched this garbage.The animation was pretty,and the backgrounds were amazing,but i believe that good a...",3,5,6
...,...,...,...,...
95,"I recently picked up all three Robocop films in one box set, rather cheaply and the only reason I did this was for the special edition of the superb first one. I have...",3,5,6
96,"This film as it is now is far shorter than it was when released in 1918. In fact, it is now more available with two other medium sized silent Chaplin features (A DOG'...",8,6,6
97,"The MTV sci-fi animated series ""Æon Flux"" is brought to life with Charlize Theron playing the title character, a freedom fighter who fights oppression in the walled c...",3,6,6
98,"I thought the movie was sub-par. The acting was good but not great, the story was funny but did not come out that way. The director dropped the ball on this movie. It...",4,5,6
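
Because the model outputs an unbounded continuous value, a rounded prediction can in principle fall outside the rating scale. The sketch below rounds and then clips predictions, assuming ratings range from 1 to 10 (an assumption based on the ratings shown here). The helper to_rating is hypothetical, and this post-processing step is optional rather than something the notebook applies.

```python
import numpy

def to_rating(raw_preds, min_rating=1, max_rating=10):
    """Round raw model outputs and clip them to the assumed rating range."""
    rounded = numpy.round(numpy.asarray(raw_preds, dtype=float)).astype(int)
    return numpy.clip(rounded, min_rating, max_rating)

# Made-up raw predictions, two of which fall outside the 1-10 range
print(to_rating([-0.7, 3.2, 6.8, 11.4]))
```

Here the out-of-range values are mapped to 1 and 10, while the in-range values are simply rounded to 3 and 7.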


####  <font color='#6629b2'>Evaluation</font>

In [51]:
'''Evaluate the model with R^2'''

r2 = r2_score(y_true=test_reviews['Rating'], y_pred=test_reviews['MLP_Cont_BOW_Pred_Rating'])
print("COEFFICIENT OF DETERMINATION (R2): {:.6f}".format(r2)) # .6f sets six decimal places explicitly

COEFFICIENT OF DETERMINATION (R2): -0.024958


On the full test dataset of 25,000 reviews, the $R^2$ for this model is 0.494190. So it turns out this model overall does not actually do better at predicting ratings than the standard bag-of-words model.

##  <font color='#6629b2'>Building a Recurrent Neural Network </font>

Now I'll show how this same task can be modeled with an RNN, which processes text sequentially.

###  <font color='#6629b2'>Numerical lists to matrices</font>

The input representation for the RNN is different from the MLP because it explicitly encodes the order of words in the review. We'll return to the lists of word indices contained in train_reviews['Review_Idxs']. The input to the model will be these number sequences themselves. We need to put all the reviews in the training set into a single matrix, where each row is a review and each column is a word position in that sequence. This enables the model to process multiple sequences in parallel (batches) as opposed to one at a time, which significantly speeds up training. However, each review has a different number of words, so we create a padded matrix whose number of columns equals the length of the longest review in the training set. For all reviews with fewer words, we prepend zeros to the row, each representing an empty word position. This is why the number 0 was not assigned as a word index in the lexicon. We can tell Keras to ignore these zeros during training.

In [52]:
'''Create a padded matrix of input reviews'''

from keras.preprocessing.sequence import pad_sequences

def pad_idx_seqs(idx_seqs):
    max_seq_len = max([len(idx_seq) for idx_seq in idx_seqs]) # Get length of longest sequence
    padded_idxs = pad_sequences(sequences=idx_seqs, maxlen=max_seq_len) # Keras provides a convenient padding function
    return padded_idxs

train_padded_idxs = pad_idx_seqs(train_reviews['Review_Idxs'])

print("TRAIN INPUT:\n", train_padded_idxs)
print("SHAPE:", train_padded_idxs.shape, "\n")

TRAIN INPUT:
 [[   0    0    0 ...  110  111   97]
 [   0    0    0 ...   69  168   18]
 [   0    0    0 ...  199   29  176]
 ...
 [   0    0    0 ... 2598 2336   18]
 [   0    0    0 ...   18  301 2608]
 [   0    0    0 ...   73  572   18]]
SHAPE: (100, 189) 
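
To make the padding scheme explicit, here is a numpy-only sketch that left-pads index lists with zeros up to the longest length. The helper pre_pad is hypothetical and only mimics the behaviour described above; it is not the Keras implementation, and the word indices are made up.

```python
import numpy

def pre_pad(idx_seqs, pad_value=0):
    """Left-pad each index list with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in idx_seqs)
    padded = numpy.full((len(idx_seqs), max_len), pad_value, dtype=int)
    for row, seq in enumerate(idx_seqs):
        if len(seq) > 0:
            padded[row, max_len - len(seq):] = seq  # real words go at the end of the row
    return padded

toy_idx_seqs = [[5, 2, 9], [7], [3, 8]]  # made-up word indices
print(pre_pad(toy_idx_seqs))
```

Each row ends with the review's actual word indices, and shorter reviews start with zeros, which matches the pattern visible in the training matrix printed above.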



###  <font color='#6629b2'>Model Layers</font>

We'll use the same scheme as before (the Functional API) to assemble the RNN. The RNN will have four layers:

**1. Input**: The input layer takes in the matrix of word indices.

**2. Embedding**: A [layer](https://keras.io/layers/embeddings/) that converts integer word indices into distributed vector representations (embeddings), which were introduced above. The difference here is that rather than plugging in embeddings from a pretrained model as before, the word embeddings will be learned inside the model itself. Thus, the input to the model will be the word indices rather than their embeddings, and the embedding values will change as the model is trained. The mask_zero=True parameter in this layer indicates that values of 0 in the matrix (the padding) will be ignored by the model.

**3. GRU**: A [recurrent (GRU) hidden layer](https://keras.io/layers/recurrent/), the central component of the model. As it observes each word in the review, it integrates the word embedding representation with what it has observed so far to compute a representation (hidden state) of the review at that timepoint. There are a few architectures for this layer. I use the GRU variation, but Keras also provides LSTM and the simple vanilla recurrent layer (see the materials at the bottom for an explanation of the difference). This layer outputs the last hidden state of the sequence (i.e. the hidden representation of the review after its last word is observed).

**4. Dense**: An output [layer](https://keras.io/layers/core/#dense) that predicts the rating for the review based on its GRU representation given by the previous layer. This is the same output layer used in the MLP, so it has one dimension that contains a continuous value (the rating).
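
To illustrate how the recurrent layer folds each word into a running hidden state, here is a numpy sketch of a textbook GRU applied to one padded index sequence. The weights are random and untrained, the sizes are toy values, and gru_last_state is a hypothetical helper. The Keras GRU layer differs in implementation details (for example, how biases are arranged), so this is a conceptual sketch rather than a reproduction of it.

```python
import numpy

def sigmoid(x):
    return 1.0 / (1.0 + numpy.exp(-x))

def gru_last_state(idx_seq, emb, W, U, b):
    """Run a textbook GRU over one index sequence and return the final hidden state.

    emb: (vocab_size, n_emb) embedding table; W: (3, n_emb, n_hid);
    U: (3, n_hid, n_hid); b: (3, n_hid). Gate order: update, reset, candidate.
    """
    h = numpy.zeros(U.shape[-1])
    for idx in idx_seq:
        if idx == 0:  # padding: leave the hidden state unchanged (masking)
            continue
        x = emb[idx]  # look up the embedding for this word
        z = sigmoid(x @ W[0] + h @ U[0] + b[0])                 # update gate
        r = sigmoid(x @ W[1] + h @ U[1] + b[1])                 # reset gate
        h_cand = numpy.tanh(x @ W[2] + (r * h) @ U[2] + b[2])   # candidate state
        h = z * h + (1.0 - z) * h_cand                          # blend old state and candidate
    return h

rng = numpy.random.default_rng(0)
vocab_size, n_emb, n_hid = 10, 4, 3  # toy sizes
emb = rng.normal(size=(vocab_size, n_emb))
W = rng.normal(size=(3, n_emb, n_hid))
U = rng.normal(size=(3, n_hid, n_hid))
b = numpy.zeros((3, n_hid))

h_padded = gru_last_state([0, 0, 5, 2, 9], emb, W, U, b)
h_plain = gru_last_state([5, 2, 9], emb, W, U, b)
print(h_padded.shape, numpy.allclose(h_padded, h_plain))
```

The padded and unpadded versions of the sequence produce the same final state, which is the effect that masking the zeros is meant to achieve. The final state is what the Dense layer would then map to a rating.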

###  <font color='#6629b2'>Parameters</font>

Our function for creating the RNN takes the following parameters:

**n_input_nodes**: As with the standard bag-of-words MLP, this is the number of unique words in the lexicon, plus one to account for the padding represented by 0 values. This indicates the number of rows in the embedding layer, where each row corresponds to a word.

**n_embedding_nodes**: the number of dimensions (units) in the embedding layer, which can be freely defined. Here, it is set to 300.

**n_hidden_nodes**: the number of dimensions in the GRU hidden layer. Like the embedding layer, this can be freely chosen. Here, it is set to 500.
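
These three parameters determine how many weights the model has to learn. The sketch below estimates the count under two assumptions that are not stated in the notebook: that the lexicon size plus one is 13409 (the width of the bag-of-words matrix printed earlier), and that the GRU layer keeps two bias vectors per gate, which I believe is the default (reset_after=True) in recent Keras versions. The helper rnn_param_count is hypothetical.

```python
def rnn_param_count(n_input_nodes, n_embedding_nodes, n_hidden_nodes, bias_sets_per_gate=2):
    """Estimate trainable parameters for an Embedding -> GRU -> Dense(1) model."""
    # One embedding row per word index (including the padding index 0)
    embedding = n_input_nodes * n_embedding_nodes
    # Three gates, each with input weights, recurrent weights and bias vectors
    gru = 3 * (n_embedding_nodes * n_hidden_nodes
               + n_hidden_nodes * n_hidden_nodes
               + bias_sets_per_gate * n_hidden_nodes)
    # One weight per hidden unit plus one bias for the single output
    dense = n_hidden_nodes + 1
    return {"embedding": embedding, "gru": gru, "dense": dense,
            "total": embedding + gru + dense}

counts = rnn_param_count(n_input_nodes=13409, n_embedding_nodes=300, n_hidden_nodes=500)
print(counts)
```

Under these assumptions the total is 5,226,201 parameters, most of them in the embedding layer. Comparing against the output of rnn_model.summary() would confirm whether the assumptions hold for the installed Keras version.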

In [56]:
'''Create the model'''

# Import from tensorflow.keras rather than the older keras.layers.embeddings / keras.layers.recurrent modules
from tensorflow.keras.layers import Dense, Embedding, GRU, Input
from tensorflow.keras.models import Model

def create_rnn_model(n_input_nodes, n_embedding_nodes, n_hidden_nodes):

    # Layer 1 - Technically the shape of this layer is (batch_size, train_padded_idxs.shape[1]),
    # i.e. one column per word position in the padded matrix.
    # The batch size is implicit in the input, so it is not specified as a dimension here.
    # None is given as a placeholder for the sequence length, so the model can accept
    # inputs padded to different lengths.
    input_layer = Input(shape=(None,))

    # Layer 2
    embedding_layer = Embedding(input_dim=n_input_nodes,
                                output_dim=n_embedding_nodes,
                                mask_zero=True)(input_layer) #mask_zero tells the model to ignore 0 values (padding)
    #Output shape = (batch_size, input_matrix_length, n_embedding_nodes)

    # Layer 3
    gru_layer = GRU(units=n_hidden_nodes)(embedding_layer)
    #Output shape = (batch_size, n_hidden_nodes)

    #Layer 4
    output_layer = Dense(units=1)(gru_layer)
    #Output shape = (batch_size, 1)

    #Specify which layers are input and output, compile model with loss and optimization functions
    model = Model(inputs=[input_layer], outputs=output_layer)
    model.compile(loss="mean_squared_error", optimizer='adam')

    return model

In [57]:
rnn_model = create_rnn_model(n_input_nodes=len(lexicon) + 1, n_embedding_nodes=300, n_hidden_nodes=500)

###  <font color='#6629b2'>Training</font>

The training function is exactly the same for the RNN as above, just with the padded review matrix now provided as the input.

In [58]:
'''Train the model'''

rnn_model.fit(x=train_padded_idxs, y=train_reviews['Rating'], batch_size=20, epochs=5)
rnn_model.save('example_model/rnn/model.h5') #save model

#Save lexicon to new model folder - same lexicon as above
with open('example_model/rnn/lexicon.pkl', 'wb') as f:
    pickle.dump(lexicon, f)

Epoch 1/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 47s 2s/step - loss: 35.4011
Epoch 2/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 9s 1s/step - loss: 55.4997
Epoch 3/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 10s 1s/step - loss: 23.1068
Epoch 4/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 10s 1s/step - loss: 20.7223
Epoch 5/5
5/5 ━━━━━━━━━━━━━━━━━━━━ 12s 1s/step - loss: 15.5255




###  <font color='#6629b2'>Prediction</font>

In [None]:
'''Load saved model'''

# Load lexicon
with open('pretrained_model/rnn/lexicon.pkl', 'rb') as f:
    rnn_lexicon = pickle.load(f)

# Load RNN model
from keras.models import load_model
rnn_model = load_model('pretrained_model/rnn/model.h5')

In [None]:
'''Put test reviews in padded matrix'''

test_reviews['Review_Idxs'] = tokens_to_idxs(token_seqs=test_reviews['Tokenized_Review'],
                                             lexicon=rnn_lexicon)
test_padded_idxs = pad_idx_seqs(test_reviews['Review_Idxs'])

print("TEST INPUT:\n", test_padded_idxs)
print("SHAPE:", test_padded_idxs.shape, "\n")

In [None]:
'''Show ratings predicted by RNN alongside the other models' ratings'''

#Since ratings are integers, need to round predicted rating to nearest integer
test_reviews['RNN_Pred_Rating'] = numpy.round(rnn_model.predict(test_padded_idxs)[:,0]).astype(int)
test_reviews[['Review', 'Rating', 'MLP_BOW_Pred_Rating', 'MLP_Cont_BOW_Pred_Rating', 'RNN_Pred_Rating']]

###  <font color='#6629b2'>Evaluation</font>

In [None]:
'''Evaluate the model with R^2'''

r2 = r2_score(y_true=test_reviews['Rating'], y_pred=test_reviews['RNN_Pred_Rating'])
print("COEFFICIENT OF DETERMINATION (R2): {:.6f}".format(r2)) # .6f sets six decimal places explicitly

On the full test dataset of 25,000 reviews, the $R^2$ for this model is 0.622525. So the RNN outperforms the continuous bag-of-words MLP as well as the standard bag-of-words approach.

### <font color='#6629b2'>Visualizing data inside the model</font>

To help visualize the data representation inside the model, we can look at the output of each layer individually. Keras' Functional API lets you derive a new model from the layers of an existing model, so you can define the output to be a layer below the output layer in the original model. Calling predict() on this new model will produce the output of that layer for a given input. Of course, glancing at the numbers by themselves doesn't provide any interpretation of what the model has learned (although there are opportunities to [interpret these values](https://www.civisanalytics.com/blog/interpreting-visualizing-neural-networks-text-processing/)), but seeing them verifies that the model is just a series of transformations from one matrix to another. The model stores its layers as the list model.layers, and you can retrieve a specific layer by its position index in that list.

In [None]:
'''Show the output of the RNN embedding layer (second layer) for the test reviews'''

embedding_layer = Model(inputs=rnn_model.layers[0].input,
                        outputs=rnn_model.layers[1].output) #embedding layer is 2nd layer (index 1)
embedding_output = embedding_layer.predict(test_padded_idxs)
print("EMBEDDING LAYER OUTPUT SHAPE:", embedding_output.shape)
print(embedding_output[0])

It is also easy to look at the weight matrices that connect the layers. The get_weights() function returns a list of a layer's weight arrays; for the Dense output layer, the first array holds the incoming weights and the second holds the bias.

In [None]:
'''Show weights that connect the RNN hidden layer to the output layer (final layer)'''

hidden_to_output_weights = rnn_model.layers[-1].get_weights()[0]
print("HIDDEN-TO-OUTPUT WEIGHTS SHAPE:", hidden_to_output_weights.shape)
print(hidden_to_output_weights)
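
As a sketch of what these weights do, the output layer's computation can be reproduced with a single matrix multiplication. The arrays below are random stand-ins with toy sizes (the names toy_hidden_states, toy_kernel and toy_bias are hypothetical); in the notebook the kernel and bias would come from get_weights() and the hidden states from the GRU layer.

```python
import numpy

rng = numpy.random.default_rng(0)
n_hidden_nodes = 5                                         # toy size; the notebook uses 500
toy_hidden_states = rng.normal(size=(2, n_hidden_nodes))   # stand-in for GRU outputs of 2 reviews
toy_kernel = rng.normal(size=(n_hidden_nodes, 1))          # stand-in for get_weights()[0]
toy_bias = numpy.array([0.5])                              # stand-in for get_weights()[1]

# A Dense layer without an activation computes a weighted sum plus a bias
toy_ratings = toy_hidden_states @ toy_kernel + toy_bias
print(toy_ratings.shape)
```

The result has one row per review and a single column, matching the (batch_size, 1) output shape noted in the model definition.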

## <font color='#6629b2'>Conclusion</font>

As mentioned above, the models shown here could be applied to any task where the goal is to predict a score for a particular sequence. For ratings prediction, this output is ordinal, but it could also be categorical with a few simple changes to the output layer of the model.
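
To sketch what the categorical variant involves, the numpy example below treats each rating as a class. In Keras this would roughly correspond to a Dense output layer with one unit per class and a softmax activation, trained with a categorical cross-entropy loss instead of mean squared error. The scores are made up, the 1-10 rating scale is an assumption, and none of this code comes from the notebook's models.

```python
import numpy

def softmax(logits):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = logits - numpy.max(logits)  # subtract the max for numerical stability
    exp = numpy.exp(shifted)
    return exp / exp.sum()

n_classes = 10  # assumed rating scale of 1-10
toy_logits = numpy.array([0.1, 0.0, 0.2, 0.1, 0.3, 0.2, 0.4, 2.5, 1.0, 0.6])  # made-up scores
probs = softmax(toy_logits)

true_rating = 8
true_class = true_rating - 1                 # ratings 1-10 map to class indices 0-9
loss = -numpy.log(probs[true_class])         # cross-entropy for a single example
pred_rating = int(numpy.argmax(probs)) + 1   # most probable class mapped back to a rating

print(pred_rating, round(float(loss), 3))
```

One trade-off of this formulation is that the classes are unordered, so predicting 7 for a true rating of 8 is penalized in the same way as predicting 1, whereas the regression setup used in this notebook penalizes larger errors more.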