# Project 3: Text classification

In this assignment, we will use neural networks to classify movie reviews as positive or negative, a task commonly referred to as "sentiment analysis". While the classification itself is specific to this task, the first few steps of setting up a recurrent neural network (RNN), like the steps for building a convolutional neural network, are common to most RNN architectures.

Goals:
- Understand recurrent neural networks
- Understand an end-to-end neural network pipeline
- Implement a recurrent neural network

In [1]:
%matplotlib inline
%load_ext autoreload
%autoreload 2

In [2]:
import os
import numpy as np
from keras.datasets import imdb
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

Using TensorFlow backend.


In [5]:
# Download the data if you don't already have it. 
# Otherwise, keras will load it in.
# The data will be saved in ~/.keras/datasets
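dim = 300  # dimensionality of the GloVe vectors loaded below

# Keep only the 5,000 most frequent words; the held-out test split is discarded here.
(x, y), _ = imdb.load_data(num_words=5000)
y_vectorized = to_categorical(y)  # one-hot labels, shape (n_reviews, 2)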

In [19]:
words = imdb.get_word_index()

Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json


## Word embeddings

Words cannot simply be passed into a neural network. There are several common approaches for turning words into a format neural networks can understand, the most useful of which is associating each word with an "embedding": representing the word as a vector of numbers (which can encode arbitrary features). You are free to train your own word embeddings, but it is common to simply reuse ones trained on huge text corpora.

We will be using the GloVe word embeddings in this exercise, as these vectors have been trained on roughly 6 billion English-language tokens (hence the 6B in the file name). They are free to download.
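
To make the idea concrete, here is a minimal sketch of an embedding lookup. The toy vocabulary, the 4-dimensional random vectors, and the embed() helper are illustrative assumptions, not part of exercise_3.py or of GloVe; with real GloVe vectors the random matrix would be replaced by the pretrained one (300 columns in this notebook).

```python
import numpy as np

# Toy vocabulary (hypothetical); index 0 is reserved for padding in this sketch.
vocab = {"<pad>": 0, "great": 1, "movie": 2, "boring": 3}

# One row per word index, one column per embedding dimension.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), 4))
embedding_matrix[0] = 0.0  # padding maps to the zero vector


def embed(token_ids, matrix):
    """Look up one row per token id; result has shape (len(token_ids), dim)."""
    return matrix[np.asarray(token_ids, dtype=int)]


sentence = [vocab["great"], vocab["movie"]]
embedded = embed(sentence, embedding_matrix)
print(embedded.shape)  # (2, 4)
```

A review is therefore just a list of integer word indices, and embedding it turns that list into a 2-D array with one row per word.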

In [20]:
from exercise_3 import create_word_embeddings
try:
    with np.load(
        os.path.join("glove.6B.{}d.trimmed.npz".format(dim))) as data:
        embeddings = data["embeddings"]
except FileNotFoundError:
    embeddings = create_word_embeddings("glove", words, save=True)

## Sequence Padding

Neural networks tend not to handle inputs of different sizes very well. So, for sequences of varying length such as sentences or time series, the common approach is to "pad" each sequence with 0's so that all inputs are the same length. Note that Keras's pad_sequences pads at the start of each sequence by default (padding="pre") rather than at the end.

In [21]:
from keras.preprocessing.sequence import pad_sequences

In [22]:
pad_sequences(x[:10]).shape

(10, 562)
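
The shape above is (10, 562) because, when no maximum length is given, every sequence is padded to the length of the longest one in the batch. The following pure-Python sketch mimics that behaviour on a toy input. The pad() helper is a hypothetical stand-in written for illustration, not the Keras implementation; it follows the Keras defaults of padding (and truncating) at the start of the sequence.

```python
def pad(sequences, maxlen=None, padding="pre", value=0):
    """Pad (or truncate) lists of ints to a common length."""
    if maxlen is None:
        # Default: use the longest sequence in the batch.
        maxlen = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        # Too-long sequences keep only their last maxlen items.
        seq = list(seq)[max(len(seq) - maxlen, 0):]
        fill = [value] * (maxlen - len(seq))
        padded.append(fill + seq if padding == "pre" else seq + fill)
    return padded


reviews = [[5, 8], [3], [9, 4, 7]]
print(pad(reviews))                  # [[0, 5, 8], [0, 0, 3], [9, 4, 7]]
print(pad(reviews, padding="post"))  # [[5, 8, 0], [3, 0, 0], [9, 4, 7]]
print(pad(reviews, maxlen=2))        # [[5, 8], [0, 3], [4, 7]]
```

Passing an explicit maximum length, as in the last call, keeps a handful of very long reviews from dictating the padded length of every input.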

## Exercise 1:

As always, we want to build a dense (fully-connected) model as our baseline. However, this is tricky because the input size of a fully connected layer is fixed when the model is built. So, to pass a sequence into it, every sequence must be padded to the same length and the embedded words reshaped (flattened) into a single vector of that fixed size. Practice this by implementing build_dense_model() in exercise_3.py.
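
The reshaping step can be sketched with plain NumPy arrays. This is only an illustration of the shapes involved, under toy sizes chosen for readability; it is not a solution to build_dense_model(), and the weight matrix here is a hypothetical stand-in for the first dense layer. With the notebook's real sizes (2494 words of 300 dimensions each) the flattened input would have 2494 * 300 = 748,200 features.

```python
import numpy as np

# Toy sizes for readability (the notebook uses seq_len=2494, dim=300).
batch_size, seq_len, dim = 10, 6, 4

# Embedded, padded batch: one dim-sized vector per word per review.
embedded = np.zeros((batch_size, seq_len, dim))

# Flatten each review into a single vector of seq_len * dim features.
flat = embedded.reshape(batch_size, seq_len * dim)
print(flat.shape)  # (10, 24)

# The first dense layer's weights are tied to that exact input size.
n_units = 8
weights = np.zeros((seq_len * dim, n_units))
hidden = flat @ weights
print(hidden.shape)  # (10, 8)

# A review that is one word longer no longer fits the same weights.
longer = np.zeros((batch_size, (seq_len + 1) * dim))
try:
    longer @ weights
except ValueError as err:
    print("shape mismatch:", err)
```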

In [23]:
from exercise_3 import build_dense_model

In [24]:
dense_max_sequence_length = max([len(xi) for xi in x])

In [25]:
dense_model = build_dense_model(embeddings)

In [26]:
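# make sure your model is predicting the correct dimension
# (dense_max_sequence_length, computed above, replaces the hard-coded 2494)
example_prediction = dense_model.predict(
    pad_sequences(x[:10], maxlen=dense_max_sequence_length))
assert example_prediction.shape == (10, 2)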

In [27]:
# sanity check: fit and validate on the same 10 reviews to confirm the model trains
dense_model.fit(pad_sequences(x[:10], maxlen=dense_max_sequence_length), y_vectorized[:10],
                validation_data=(pad_sequences(x[:10], maxlen=dense_max_sequence_length),
                                 y_vectorized[:10]),
                batch_size=2, epochs=10,
                verbose=1)

Train on 10 samples, validate on 10 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x124bc1b00>

## Exercise 2:

The fully connected approach is a poor fit for sequence modeling: a future review may be longer than the maximum length we allocated, and if a few reviews in the dataset are especially long while most are short, padding everything to the maximum leads to a lot of unnecessary computation.

Enter the recurrent neural network architecture (specifically, the LSTM). Complete build_lstm_model() in exercise_3.py and compare it with the dense network.
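
The reason a recurrent network copes with variable-length input is that it applies the same weights at every time step and carries a fixed-size state forward. The sketch below shows this with a plain tanh RNN in NumPy, which is deliberately simpler than an LSTM (no gates or cell state). The function name, sizes, and random weights are assumptions made for illustration and are unrelated to build_lstm_model().

```python
import numpy as np


def simple_rnn(inputs, w_x, w_h, b):
    """Run a plain tanh RNN over inputs of shape (seq_len, dim).

    Returns the final hidden state, whose size does not depend on seq_len.
    """
    h = np.zeros(w_h.shape[0])
    for x_t in inputs:
        # The same weights are reused at every time step.
        h = np.tanh(x_t @ w_x + h @ w_h + b)
    return h


rng = np.random.default_rng(0)
dim, units = 4, 3
w_x = rng.normal(size=(dim, units))
w_h = rng.normal(size=(units, units))
b = np.zeros(units)

short_review = rng.normal(size=(5, dim))
long_review = rng.normal(size=(50, dim))
print(simple_rnn(short_review, w_x, w_h, b).shape)  # (3,)
print(simple_rnn(long_review, w_x, w_h, b).shape)   # (3,)
```

Both reviews produce a state of the same size, so the classification layer on top never needs to know how long the input was.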

In [28]:
from exercise_3 import build_lstm_model

In [29]:
lstm_model = build_lstm_model(embeddings)

In [30]:
# sanity check: fit and validate on the same 10 reviews to confirm the model trains
lstm_model.fit(pad_sequences(x[:10]), y_vectorized[:10],
                validation_data=(pad_sequences(x[:10]), y_vectorized[:10]),
                batch_size=2, epochs=5, 
                verbose=1)

Train on 10 samples, validate on 10 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x110343f28>

## Exercise 3: Bidirectional LSTM
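
A bidirectional recurrent layer reads the sequence twice, once left to right and once right to left, each pass with its own weights, and then combines the two final states (Keras's Bidirectional wrapper concatenates them by default). The NumPy sketch below illustrates the idea with a plain tanh RNN rather than an LSTM. All names, sizes, and weights are hypothetical and chosen only for illustration.

```python
import numpy as np


def run_rnn(inputs, w_x, w_h):
    """Plain tanh RNN over inputs of shape (seq_len, dim); returns the final state."""
    h = np.zeros(w_h.shape[0])
    for x_t in inputs:
        h = np.tanh(x_t @ w_x + h @ w_h)
    return h


def bidirectional(inputs, fwd_weights, bwd_weights):
    """Run one RNN forward and another over the reversed input, then concatenate."""
    h_fwd = run_rnn(inputs, *fwd_weights)
    h_bwd = run_rnn(inputs[::-1], *bwd_weights)
    return np.concatenate([h_fwd, h_bwd])


rng = np.random.default_rng(0)
dim, units = 4, 3
# Each direction has its own independent weights.
fwd = (rng.normal(size=(dim, units)), rng.normal(size=(units, units)))
bwd = (rng.normal(size=(dim, units)), rng.normal(size=(units, units)))

review = rng.normal(size=(7, dim))
state = bidirectional(review, fwd, bwd)
print(state.shape)  # (6,)
```

The output is twice the size of a single direction's state, which is worth keeping in mind when sizing the layer that follows.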

In [31]:
from exercise_3 import build_bidirectional_lstm_model

In [32]:
bi_lstm_model = build_bidirectional_lstm_model(embeddings)
# sanity check: fit and validate on the same 10 reviews to confirm the model trains
bi_lstm_model.fit(pad_sequences(x[:10]), y_vectorized[:10],
                validation_data=(pad_sequences(x[:10]), y_vectorized[:10]),
                batch_size=2, epochs=5, 
                verbose=1)

Train on 10 samples, validate on 10 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x1278a94a8>

## Exercise 4:

You should notice that the bidirectional LSTM can perfectly overfit the 10 sample reviews in under 5 epochs. Let's now fit a model on the full dataset. Implement build_final_model() in exercise_3.py to get the best review classification you can, using any RNN techniques discussed in class.

In [33]:
x_train, x_dev, y_train, y_dev = train_test_split(x, y_vectorized)

In [34]:
from exercise_3 import build_final_model

In [None]:
final_model = build_final_model(embeddings)
# this will run model fitting and validation on the dev set
final_model.fit(pad_sequences(x_train), y_train,
                validation_data=(pad_sequences(x_dev), y_dev),
                batch_size=256, epochs=1, 
                verbose=1)

Train on 18750 samples, validate on 6250 samples
Epoch 1/1
 1792/18750 [=>............................] - ETA: 1:30:36 - loss: 0.7422 - acc: 0.4872

## Exercise 5:
Consider your experience implementing a simple RNN. What advantages and disadvantages do you notice?

(Double-click to edit.) Alternatively, submit a separate file with your write-up.