Tutorial 2: Sentiment analysis with a recurrent LSTM network
=============================================================

Train a recurrent neural network to read movie reviews from IMDB and decide whether each review is positive or negative.

Data preprocessing:

* Converting words to one-hot
  * Top 20,000 words
  * PAD, OOV, START tags
  * IDs based on frequency
* Pre-defined sentence length
* Targets binarized to positive (>=7), negative (<7)
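
The preprocessing steps above can be sketched in plain Python. This is a minimal illustration, not the code of the `prepare` script: the helper names `build_vocab`, `encode`, and `binarize` are hypothetical, and the tag values (PAD=0, START=1, OOV=2, word IDs starting at 3) are assumed from the constants used in the inference cell at the end of this tutorial.

```python
from collections import Counter

PAD, START, OOV = 0, 1, 2   # reserved tags (values assumed from the inference cell)
INDEX_FROM = 3              # first ID available for real words


def build_vocab(texts):
    """Map each word to its frequency rank (0 = most frequent)."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # sort by descending count, then alphabetically so ties are deterministic
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked)}


def encode(text, vocab, vocab_size):
    """Convert a review into a list of integer IDs, prefixed with START."""
    ids = [START]
    for word in text.lower().split():
        word_id = vocab[word] + INDEX_FROM if word in vocab else OOV
        # IDs at or beyond the vocabulary limit (rare words) collapse to OOV
        ids.append(word_id if word_id < vocab_size else OOV)
    return ids


def binarize(rating):
    """Ratings of 7 and above are positive (1); everything else is negative (0)."""
    return 1 if rating >= 7 else 0


texts = ["a great great movie", "a dull movie"]
vocab = build_vocab(texts)
encoded = encode("a great unseen movie", vocab, vocab_size=6)
```

Each integer ID is the index of the single non-zero entry in the corresponding one-hot vector, so the one-hot vectors never have to be stored explicitly.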

Model Architecture:

* Embedding layer: Learning to embed words from a sparse representation to a dense space
* LSTM layer: Recurrent layer that learns features in the time series
* Recurrent sum layer: Go from time series to single output
* Affine layer: Readout / classifier

An example of a review:

`"Okay, sorry, but I loved this movie. I just love the whole 80's genre of these kind of movies, because you don't see many like this ...
~*~CupidGrl~*~"`

We have 25,000 reviews like this.

1. Data preparation
-------------------
We have a script that reads the reviews from a text file and stores them as a one-hot encoded dataset.

In [None]:
# read reviews from text file and store as one-hot encoded dataset
import prepare
fname = 'labeledTrainData.tsv'
prepare.main(fname)

2. Building the model
---------------------
As in the convnet example, we need a dataset, layers, callbacks, and a backend.

In [None]:
# hyperparameters
hidden_size = 128
embedding_dim = 128
vocab_size = 20000
sentence_length = 128
batch_size = 128
num_epochs = 2

# setup backend
from neon.backends import gen_backend
be = gen_backend(backend='cpu',
                 batch_size=batch_size,
                 rng_seed=0)

In [None]:
# load the h5 datasets, print stats
import h5py
h5f = h5py.File(fname + '.h5', 'r')
reviews, h5train, h5valid = h5f['reviews'], h5f['train'], h5f['valid']
ntrain, nvalid, nclass = reviews.attrs['ntrain'], reviews.attrs['nvalid'], reviews.attrs['nclass']
print "# of train examples - {0}, valid examples - {1}".format(ntrain, nvalid)
print "# of classes - ", nclass
print "class distribution - ", reviews.attrs['class_distribution']
print "vocab size - {0}, sentence_length - {1}".format(vocab_size, sentence_length)


### Create datasets
* Split data into a training and validation set.
* Pad / truncate reviews to 128 words. 
* Finally wrap them into a DataIterator.
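
Padding and truncation can be sketched as follows. This is an illustration, not the implementation of `get_paddedXY`: the hypothetical helper `pad_or_truncate` assumes the same convention as the inference cell later in this tutorial, which keeps the last `sentence_length` tokens and left-pads shorter reviews with the PAD ID 0.

```python
def pad_or_truncate(ids, sentence_length, pad_id=0):
    """Return exactly `sentence_length` IDs: keep the tail, left-pad with pad_id."""
    tail = ids[-sentence_length:]                        # truncate: keep the last tokens
    padding = [pad_id] * (sentence_length - len(tail))   # empty if already long enough
    return padding + tail


short = pad_or_truncate([1, 7, 9], sentence_length=5)
long_ = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], sentence_length=5)
```

With this convention every review has the same length, so a batch of reviews can be stored in a single fixed-size array.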


In [None]:
# make train dataset
from preprocess_text import get_paddedXY
from neon.data import DataIterator
Xy = h5train[:ntrain]
X = [xy[1:] for xy in Xy]
y = [xy[0] for xy in Xy]
X_train, y_train = get_paddedXY(
    X, y, vocab_size=vocab_size, sentence_length=sentence_length)
train_set = DataIterator(X_train, y_train, nclass=nclass)

# make valid dataset
Xy = h5valid[:nvalid]
X = [xy[1:] for xy in Xy]
y = [xy[0] for xy in Xy]
X_valid, y_valid = get_paddedXY(
    X, y, vocab_size=vocab_size, sentence_length=sentence_length)
valid_set = DataIterator(X_valid, y_valid, nclass=nclass)


### Initializers

We use Xavier Glorot's initialization scheme to automatically scale the weights to preserve the variance of input activations on the output side.
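
As a rough illustration of the idea, the Glorot uniform scheme draws weights from a uniform range whose width shrinks as the layer's fan-in and fan-out grow. The NumPy sketch below is not neon's `GlorotUniform` implementation; the helper name `glorot_uniform` and its arguments are hypothetical.

```python
import numpy as np


def glorot_uniform(fan_in, fan_out, rng):
    """Sample a (fan_out, fan_in) weight matrix from U(-limit, limit)."""
    # the limit is chosen so that the weight variance is 2 / (fan_in + fan_out)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))


rng = np.random.RandomState(0)
W = glorot_uniform(fan_in=128, fan_out=128, rng=rng)
limit = np.sqrt(6.0 / (128 + 128))
```

Because the range depends only on the layer dimensions, the same initializer object can be reused for layers of different sizes.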

In [None]:
# initialization
from neon.initializers import GlorotUniform, Uniform
init_glorot = GlorotUniform()
init_emb = Uniform(-0.1 / embedding_dim, 0.1 / embedding_dim)

### Model layers
The network consists of a word embedding layer, an LSTM layer, a RecurrentSum layer, a Dropout layer, and an Affine layer.
* **LookupTable** is a word embedding that maps from a sparse one-hot representation to dense word vectors. The embedding is learned from the data.
* **LSTM** is a recurrent layer with "long short-term memory" units. LSTM layers tend to be easier to train than standard RNN layers and perform similarly.
* **RecurrentSum** is a recurrent output layer that collapses the time dimension of the sequence by summing the outputs of the individual steps.
* **Dropout** performs regularization by randomly zeroing out some of the units.
* **Affine** is a fully connected MLP layer that is used for the binary classification of the outputs.
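
To make the non-recurrent layers concrete, here is a small NumPy sketch of what the lookup, sum, and dropout steps compute. It is an illustration under simplifying assumptions rather than neon's implementation: it handles a single review, skips the LSTM entirely, and uses toy sizes; all variable names are hypothetical.

```python
import numpy as np

rng = np.random.RandomState(0)
vocab_size, embedding_dim, sentence_length = 10, 4, 6   # toy sizes

# LookupTable: selecting row i of the embedding matrix is equivalent to
# multiplying the one-hot vector for word i by that matrix.
embedding = rng.uniform(-0.1, 0.1, size=(vocab_size, embedding_dim))
word_ids = np.array([0, 0, 1, 5, 2, 7])     # one padded review
embedded = embedding[word_ids]              # shape (sentence_length, embedding_dim)

# RecurrentSum: collapse the time axis by summing the per-step outputs.
# (The real model runs the LSTM between the lookup and this sum.)
summed = embedded.sum(axis=0)               # shape (embedding_dim,)

# Dropout during training: keep each unit with probability `keep`.
# Only the zeroing is shown; any rescaling of activations is left out.
keep = 0.5
mask = rng.binomial(1, keep, size=summed.shape)
dropped = summed * mask
```

The sum turns a variable-position sequence of feature vectors into one fixed-size vector, which is what the Affine classifier needs as input.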

In [None]:
# define layers
from neon.layers import LookupTable, LSTM, RecurrentSum, Dropout, Affine
from neon.transforms import Softmax, Tanh, Logistic
layers = [
    LookupTable(vocab_size=vocab_size, embedding_dim=embedding_dim, init=init_emb),
    LSTM(hidden_size, init_glorot, activation=Tanh(),
         gate_activation=Logistic(), reset_cells=True),
    RecurrentSum(),
    Dropout(keep=0.5),
    Affine(nclass, init_glorot, bias=init_glorot, activation=Softmax())
]

### Cost and Optimizer
We use a cross-entropy cost function and an Adagrad optimizer.
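
For intuition, both pieces can be written in a few lines of NumPy. This is a sketch, not neon's code: the function names are hypothetical, `usebits=True` is assumed to mean that the cost is measured with a base-2 logarithm, and the placement and value of `epsilon` in the Adagrad update vary between implementations.

```python
import numpy as np


def cross_entropy_bits(probs, target_onehot):
    """Multiclass cross-entropy, reported in bits (log base 2)."""
    return float(-np.sum(target_onehot * np.log2(probs)))


def adagrad_step(param, grad, state, learning_rate=0.01, epsilon=1e-6):
    """One Adagrad update: step sizes shrink as squared gradients accumulate."""
    state = state + grad ** 2
    param = param - learning_rate * grad / (np.sqrt(state) + epsilon)
    return param, state


# a prediction of 25% for the true class costs 2 bits
loss = cross_entropy_bits(np.array([0.25, 0.75]), np.array([1.0, 0.0]))

param, state = np.array([1.0, -1.0]), np.zeros(2)
param, state = adagrad_step(param, np.array([0.5, -2.0]), state)
```

On the first step each parameter moves by roughly `learning_rate` regardless of the gradient's magnitude; later steps shrink for parameters that keep receiving large gradients.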

In [None]:
# set the cost, metrics, optimizer
from neon.layers import GeneralizedCost
from neon.transforms import CrossEntropyMulti, Accuracy
from neon.models import Model
from neon.optimizers import Adagrad
cost = GeneralizedCost(costfunc=CrossEntropyMulti(usebits=True))
metric = Accuracy()
model = Model(layers=layers)
optimizer = Adagrad(learning_rate=0.01)

### Callbacks
In addition to the default progress bar, we set up a callback that saves the model to a pickle file after every epoch.

In [None]:
# configure callbacks
from neon.callbacks import Callbacks
callbacks = Callbacks(model, train_set, eval_set=valid_set, 
                      epochs=num_epochs, serialize=1,
                      save_path=fname + '.pickle')

### Training the model
We now have all the parts in place to train the model. Two epochs are sufficient to obtain some interesting results. 

In [None]:
# train model
model.fit(train_set, optimizer=optimizer, num_epochs=num_epochs,
          cost=cost, callbacks=callbacks)

Evaluate the model on the validation set, which serves as the test set here, and on the training set.

In [None]:
test_pct = 100 * model.eval(valid_set, metric=metric)[0]
train_pct = 100 * model.eval(train_set, metric=metric)[0]

print "Test Accuracy: %2.1f%%" % test_pct
print "Train Accuracy: %2.1f%%" % train_pct

3. Inference
------------
The trained model can now be used to perform inference on new reviews. Set up a new model with a batch size of 1.

In [None]:
# setup backend
from neon.backends import gen_backend
be = gen_backend(batch_size=1)

Set up a new set of layers for batch size 1.

In [None]:
# define same model as in train. Layers need to be recreated with new batch size. 
layers = [
    LookupTable(vocab_size=vocab_size, embedding_dim=embedding_dim, init=init_emb),
    LSTM(hidden_size, init_glorot, activation=Tanh(),
         gate_activation=Logistic(), reset_cells=True),
    RecurrentSum(),
    Dropout(keep=0.5),
    Affine(nclass, init_glorot, bias=init_glorot, activation=Softmax())
]

Wrap the new layers into a new model and initialize it with the weights we just trained.

In [None]:
model_new = Model(layers=layers)

# load the weights
save_path = 'labeledTrainData.tsv' + '.pickle'
model_new.load_weights(save_path)
model_new.initialize(dataset=(sentence_length, batch_size))

Let's try it on some real reviews!

I went on [IMDB](http://www.imdb.com/title/tt2379713/reviews?ref_=tt_ov_rt) to get some reviews of the latest Bond movie.

*As a die hard fan of James Bond, I found this film to be simply nothing more than a classic. For any original James Bond fan, you will simply enjoy how the producers and Sam Mendes re-emerged the roots of James Bond. The roots of Spectre, Blofield and just the pure elements of James Bond that we all miss even from the gun barrel introduction. This film deserves higher ratings in my view. I don't want to spoil the film , but I am finally glad the writers brought back the roots of James Bond. A true fan nothing more nothing less. I don't know what else to expect from a James bond film and Spectre does just what I originally expected in a James Bond film. It opens a whole new extension to have many more films to come. The cast does a superb in their roles and many salutes to Christopher Waltz in his enemy role.*

And another one:

*The plot/writing is completely unrealistic and just dumb at times. Bond is dressed up in a white tux on an overnight train ride? eh, OK. But then they just show up at the villain's compound like nothing bad is going to happen to them. How stupid is this Bond? And then the villain just happens to booby trap this huge building in London (across from the intelligence building) and previously or very quickly had some bullet proof glass installed.*

*And so on and so on... give me a break. And then there was the terrible credit sequence at the beginning that was hell bent on turning Daniel Craig into a sex object. I don't mind that, but when you show him in the credit sequence with his shirt off there isn't even the pretense of something else going on. They were trying way too hard.*

*There was some of the same writers as the previous newer (Craig) Bonds too, and I enjoyed those. Someone must have come along and thought they had some great ideas. That person should be fired.*

In [None]:
import preprocess_text
import cPickle
import numpy as np

# setup buffers before accepting reviews
xbuf = np.zeros((sentence_length, 1), dtype=np.int32)  # host buffer
xdev = be.zeros((sentence_length, 1), dtype=np.int32)  # device buffer

# tags for text pre-processing
oov = 2
start = 1
index_from = 3
pad_char = 0

# load dictionary from file (generated by prepare script)
with open(fname + '.vocab', 'rb') as f:  # close the file once it is loaded
    vocab, rev_vocab = cPickle.load(f)

while True:
    line = raw_input('Enter a Review from testData.tsv file: \n')

    # clean the input
    tokens = preprocess_text.clean_string(line).strip().split()

    # convert strings to one-hot. Check for oov and add start
    sent = [len(vocab) + 1 if t not in vocab else vocab[t] for t in tokens]
    sent = [start] + [w + index_from for w in sent]
    sent = [oov if w >= vocab_size else w for w in sent]

    # pad sentences
    xbuf[:] = 0
    trunc = sent[-sentence_length:]
    xbuf[-len(trunc):, 0] = trunc  # load list into numpy array
    xdev[:] = xbuf  # load numpy array into device tensor
    
    # run the sentence through the model
    y_pred = model_new.fprop(xdev, inference=True)
    
    print '-' * 100
    print "Sentence encoding: {0}".format(xbuf.T)
    print "\nPrediction: {:.1%} negative, {:.1%} positive".format(y_pred.get()[0,0], y_pred.get()[1,0])
    print '-' * 100