# Lab 7: Sentiment analysis with an LSTM network
This week, we'll build a model for sentiment analysis, the problem of taking a string of text and predicting how positive an opinion it expresses.
To do this, we'll use the last two big ideas in the course: vector embeddings and recurrent neural networks (with LSTM cells), trained on a dataset of [IMDB movie reviews](http://ai.stanford.edu/~amaas/data/sentiment/).

In [None]:
import numpy as np
import keras

## Section 0: Preprocess and understand the data
This dataset is built into Keras, so it's very easy to import.
I've written the preprocessing pipeline, but make sure to read it -- it'll be essential for understanding the data you're building a model for. 

### 0.1: Load the data
There are two hyperparameters here:
 - `maxlen`: The maximum number of words per review. Note that `imdb.load_data` drops reviews longer than this from the dataset rather than cutting them short, so lowering it also shrinks the dataset. Keeping this low makes training faster by reducing the number of steps needed per example, but in practice we'd probably increase it.
 - `num_words`: The number of distinct words the dataset will contain. The `num_words` most common words are assigned unique tokens, and the rest are grouped into a single token.
 
If training is taking forever, feel free to reduce `maxlen`.
You can also try changing `num_words` to investigate the tradeoff it controls: a smaller vocabulary groups more uncommon words into a single token, which improves statistical and computational efficiency, while a larger vocabulary lets the model recognize more words.

Each review is returned as a sequence of integer tokens, each of which represents a distinct word.
There are 3 special tokens:
 - 0 is a padding token (see below)
 - 1 is a token that represents the start of a review
 - 2 is a token that represents a word not in the model's vocabulary
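
To make the effect of `num_words` concrete, here's a minimal pure-Python sketch of the grouping step. `cap_vocabulary` is a hypothetical helper I made up for illustration (it is not part of Keras), and the toy tokens are invented; it assumes that any token at or above `num_words` is replaced by the out-of-vocabulary token.

```python
OOV_TOKEN = 2  # token for words outside the vocabulary

def cap_vocabulary(tokens, num_words):
    """Replace every token >= num_words with the out-of-vocabulary token."""
    return [t if t < num_words else OOV_TOKEN for t in tokens]

# Toy review: <START>, two common words, then one rare word
toy_review = [1, 14, 22, 7301]
capped = cap_vocabulary(toy_review, num_words=5000)
print(capped)  # [1, 14, 22, 2]
```

With a smaller `num_words`, more of the review collapses into the single `<OOV>` token, which is exactly the tradeoff described above.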

In [None]:
from keras.datasets import imdb

# Hyperparameters
maxlen = 256
num_words = 5000

(x_train, y_train), (x_test, y_test) = \
    imdb.load_data(maxlen=maxlen, num_words=num_words)

### 0.2: Pad all reviews to the same length
Training is much more efficient when we can stack an entire batch of reviews together in a single tensor, so Keras requires that every training sequence is of the same length.
To do this, we add padding tokens (the 0 token) to the beginning of every sequence to make them all of length `maxlen`.

We pad the beginning of the sequence rather than the end because padding the end would make the RNN run for many steps after it has read the last word of the review, causing the hidden state to lose information.
In the model, we'll also tell the recurrent layers to mask out 0 values, so that the hidden state of the network is the same every time it reaches the start token (1).
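
Here's a small pure-Python sketch of front-padding, so you can see what the padded batch looks like. `pad_front` is a hypothetical stand-in written for illustration; it mirrors what I understand to be the default behavior of `pad_sequences` (pad at the front, and keep the last `maxlen` tokens of anything too long), but it is not the Keras implementation.

```python
PAD_TOKEN = 0

def pad_front(tokens, maxlen):
    """Left-pad with PAD_TOKEN; keep only the last maxlen tokens if too long.

    Assumes maxlen is a positive integer.
    """
    tokens = list(tokens)[-maxlen:]
    return [PAD_TOKEN] * (maxlen - len(tokens)) + tokens

batch = [[1, 14, 22], [1, 9, 35, 6, 80]]
padded = [pad_front(seq, maxlen=6) for seq in batch]
print(padded)
# [[0, 0, 0, 1, 14, 22], [0, 1, 9, 35, 6, 80]]
```

Every row now has the same length, and the real words sit at the end of each row, right before the RNN produces its final output.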

In [None]:
from keras.preprocessing.sequence import pad_sequences
x_train = pad_sequences(x_train, maxlen=maxlen)
x_test = pad_sequences(x_test, maxlen=maxlen)

### 0.3: Build word-token dictionaries
In order to use the model with text outside of the dataset, we need to be able to convert words into tokens.
We build two dictionaries:
 - `word_index` maps words into tokens
 - `index_word` maps tokens into words
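
As a toy illustration of how the two dictionaries fit together, here's a sketch with a three-word vocabulary. Everything prefixed with `toy_`, plus the `encode` and `decode` helpers, is invented for this example; the offset of 3 reflects the three special tokens listed earlier.

```python
# Toy ranking (word -> frequency rank), standing in for the real word index
toy_raw_index = {'the': 1, 'movie': 2, 'great': 3}

OFFSET = 3  # tokens 0, 1, 2 are reserved for <PAD>, <START>, <OOV>
toy_index_word = {rank + OFFSET: word for word, rank in toy_raw_index.items()}
toy_index_word.update({0: '<PAD>', 1: '<START>', 2: '<OOV>'})
toy_word_index = {word: token for token, word in toy_index_word.items()}

def encode(text):
    """Convert text to tokens: <START> first, <OOV> for unknown words."""
    return [1] + [toy_word_index.get(w, 2) for w in text.lower().split()]

def decode(tokens):
    """Convert tokens back to a space-separated string."""
    return ' '.join(toy_index_word[t] for t in tokens)

tokens = encode('The movie was great')
print(tokens)          # [1, 4, 5, 2, 6]
print(decode(tokens))  # <START> the movie <OOV> great
```

Note that the round trip is lossy: "was" isn't in the toy vocabulary, so it comes back as `<OOV>`.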

In [None]:
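word_index = imdb.get_word_index()  # Maps word -> frequency rank (starting at 1)
# Shift every index by 3 to make room for the special tokens
index_word = {idx + 3: word for (word, idx) in word_index.items()}
index_word[0] = '<PAD>'   # Special padding token
index_word[1] = '<START>' # Special "start of review" token
index_word[2] = '<OOV>'   # Special "out of vocabulary" token
# Invert the mapping so that word_index uses the shifted tokens too
word_index = {word: idx for (idx, word) in index_word.items()}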

### 0.4: Using the dataset
Below we print some summary statistics of the dataset and show how to convert between text and tokenized form. 

In [None]:
# Print summary statistics
print(
'''
Training set size: {:}
Test set size: {:}
Number of tokens: {:}
Vocabulary size: {:}
Proportion of words that are out-of-vocabulary: {:.4f}%\n
'''.format(x_train.shape, 
           x_test.shape, 
           len(index_word.keys()),
           num_words,
           np.mean(x_train == 2) * 100)  # 2 is the <OOV> token
)

review_idx = 1
review_tokens = x_train[review_idx]
review_words = [index_word[idx] for idx in review_tokens]
print('Review converted from tokens:\n', ' '.join(review_words))
print('\nReview sentiment:', y_train[review_idx])

## Section 1: Build a model
The data and task really inform how we'll build the model here:
 - The input is variable-length sequences, so the feature extraction will be recurrent.
 - Each element of the input sequence is a word token, so the input is sparse and categorical. We'll deal with this by first computing embeddings.
 - The output is binary classification, so our model should produce a single probability independent of the length of the input sequence.
 
Since this model has a lot of components, including recurrent layers, we'll stick to building the model completely in Keras.
I used the functional API but the sequential API would also work here.

### 1.1: Input and embedding layers
Make an embedding layer that takes input of the correct shape and yields word embeddings.

Notes:
 - `mask_zero` should be set to True, which will mask off the padding tokens we added before.
 - I used 64-dimensional embeddings.
 - Each input in a batch is a sequence of scalars (integer tokens) of length `maxlen`.
 - If you want to pass variable-length sequences as input, use `None` as the dimension on the sequence length axis of the input and don't specify an `input_length` for the embedding layer.
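
If it helps to see what an embedding layer computes, here's a NumPy sketch of the idea: a lookup into a table of vectors, plus a mask marking the padding positions. This is a conceptual illustration only, not the Keras layer you need to write; the table here is random, whereas Keras learns it during training, and the toy sizes are my own choice.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4          # toy sizes (the lab uses num_words and 64)
embedding_table = rng.normal(size=(vocab_size, embed_dim))

batch_tokens = np.array([[0, 0, 1, 5, 7],   # left-padded review
                         [1, 3, 3, 8, 2]])  # full-length review

embedded = embedding_table[batch_tokens]    # shape: (batch, maxlen, embed_dim)
mask = batch_tokens != 0                    # False wherever the token is padding

print(embedded.shape)  # (2, 5, 4)
print(mask[0])         # [False False  True  True  True]
```

The same token always maps to the same vector (the two 3s in the second review get identical rows), and the mask is what lets later layers skip the padded steps.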

In [None]:
# Your code here

### 1.2: Recurrent feature-extraction layer
Make an LSTM layer to summarize the variable-length sequence of embedding vectors into a fixed-size feature vector.

Notes:
 - We're only interested in the last output of the LSTM layer.
 - I used 64 units.
 - You can add more layers if you like to make a deep LSTM network. If you do, the earlier layers should use `return_sequences` to yield an entire sequence of output vectors instead of just the last output.
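
To see the difference between taking only the last output and returning the whole sequence, here's a NumPy sketch using a plain tanh recurrence. I've deliberately swapped in a simple RNN instead of an LSTM to keep it short, and `run_simple_rnn` with its `return_sequences` argument is a hypothetical helper that only imitates the Keras option of the same name.

```python
import numpy as np

def run_simple_rnn(inputs, W_x, W_h, return_sequences=False):
    """Run a plain tanh RNN over inputs of shape (timesteps, input_dim)."""
    h = np.zeros(W_h.shape[0])
    outputs = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h)   # update the hidden state
        outputs.append(h)
    return np.stack(outputs) if return_sequences else outputs[-1]

rng = np.random.default_rng(0)
timesteps, input_dim, units = 5, 4, 3
inputs = rng.normal(size=(timesteps, input_dim))
W_x = rng.normal(size=(units, input_dim))
W_h = rng.normal(size=(units, units))

last = run_simple_rnn(inputs, W_x, W_h)
full = run_simple_rnn(inputs, W_x, W_h, return_sequences=True)
print(last.shape)  # (3,)
print(full.shape)  # (5, 3)
```

The last row of `full` equals `last`, which is why only a layer feeding another recurrent layer needs the full sequence.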

In [None]:
# Your code here

### 1.3: Output layer
Add a dense layer to perform the final classification from the summary vector output by the LSTM layer to the probability that the input sequence expresses positive sentiment.

Note that this is binary classification, so choose the layer's output size and activation function appropriately.

In [None]:
# Your code here

### 1.4: Compile and train model
Compile and train the model.

Notes:
 - RMSProp is usually a good choice for optimizing RNNs.
 - I used `clipnorm=1` in my optimizer to prevent exploding gradients.
 - I got about 90% accuracy after a couple of training epochs.
 - RNN training can take a while. Try training for a small number of epochs, or reducing `maxlen` if it takes too long.
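
For intuition about what `clipnorm` does, here's a pure-Python sketch of clipping a gradient by its L2 norm. `clip_by_norm` is a hypothetical helper written for this example; as I understand it, Keras applies this kind of rescaling to the gradient of each weight separately, but treat the details as an assumption.

```python
import math

def clip_by_norm(grad, clipnorm):
    """Rescale grad so that its L2 norm is at most clipnorm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= clipnorm:
        return list(grad)          # small gradients pass through unchanged
    scale = clipnorm / norm
    return [g * scale for g in grad]

exploding = [30.0, 40.0]           # L2 norm is 50
clipped = clip_by_norm(exploding, clipnorm=1.0)
print(clipped)                     # same direction, norm rescaled to about 1
```

Clipping changes only the size of the step, not its direction, so a single exploding gradient can't throw the weights far off course.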

In [None]:
# Your code here

## Section 2: Evaluate the model
Below, I've pasted a review from IMDB and tokenized it.
Add code to run your model over the review to predict whether it expresses positive or negative sentiment.

Hint: Your model should output a single probability here, but expects a batch.
You might need to use `np.expand_dims()`.
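
As a quick illustration of the hint, here's how `np.expand_dims()` turns one sequence into a batch of one. The token values are made up for the example.

```python
import numpy as np

single_review = np.array([0, 0, 1, 14, 22])            # shape: (maxlen,)
batch_of_one = np.expand_dims(single_review, axis=0)   # add a batch axis in front

print(single_review.shape)  # (5,)
print(batch_of_one.shape)   # (1, 5)
```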

In [None]:
review = \
'''
Pulp Fiction may be the single best film ever made, and quite appropriately
 it is by one of the most creative directors of all time, Quentin Tarantino.
 This movie is amazing from the beginning definition of pulp to the end 
 credits and boasts one of the best casts ever assembled with the likes of
 Bruce Willis, Samuel L. Jackson, John Travolta, Uma Thurman, Harvey Keitel,
 Tim Roth and Christopher Walken. The dialog is surprisingly humorous for
 this type of film, and I think that\'s what has made it so successful.
 Wrongfully denied the many Oscars it was nominated for, Pulp Fiction is by
 far the best film of the 90s and no Tarantino film has surpassed the 
 quality of this movie (although Kill Bill came close). As far as I\'m 
 concerned this is the top film of all-time and definitely deserves a 
 watch if you haven\'t seen it.
'''
review = ''.join(list(filter(lambda x: x not in '\',.()\n', review.lower())))

review_tokens = [1] # Begin with the <START> token
for word in review.split():
    token = word_index.get(word, 2)  # 2 is the <OOV> token
    review_tokens.append(token if token < num_words else 2)  # Valid tokens are below num_words

In [None]:
# Your code here