## Sentiment analysis with Recurrent Neural Networks

For this particular dataset, a shallow method such as tf-idf features fed into logistic regression will outperform the RNN. What this notebook illustrates is how simple it is to implement an RNN for sentiment analysis with Keras or TF-Learn. The notebook was run with Keras; the equivalent TF-Learn code is left commented out.

### Load the packages

In [5]:
import numpy as np

from keras.preprocessing import sequence
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Embedding
from keras.layers import GRU
from keras.datasets import imdb

#import tflearn
#from tflearn.data_utils import to_categorical, pad_sequences
#from tflearn.datasets import imdb

### Load the data and define some parameters

- maxlen: how many words to keep in each review
- max_features: number of unique words to keep in our vocabulary
- batch_size: how many reviews to run through for each gradient update

Each word is assigned an integer index between 0 and max_features. We use each integer as a lookup index into the embedding matrix that we set up as the first layer of the network. These embeddings are learned during training and, like those in word2vec, end up placing related words near each other. Each word is thus converted into a 32-dimensional vector that encodes some of that word's semantic meaning.
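
To make the lookup concrete, here is a minimal NumPy sketch of what an embedding layer does with a sequence of word indices. The sizes, the random weights, and the index values are all made up for illustration; in the real model the matrix has max_features rows and 32 columns, and its values are learned.

```python
import numpy as np

# Toy sizes for illustration only (the notebook uses 20000 words and 32 dimensions).
vocab_size = 10
embedding_dim = 4

rng = np.random.RandomState(0)
# In the real model these weights are learned; here they are random placeholders.
embedding_matrix = rng.normal(size=(vocab_size, embedding_dim))

# A "review" encoded as word indices (hypothetical values).
review = np.array([3, 7, 7, 1])

# Looking up embeddings is plain row indexing into the matrix.
embedded = embedding_matrix[review]

print(embedded.shape)  # one vector per word in the review
```

Note that the two occurrences of index 7 map to the same vector: the embedding depends only on the word, not on its position.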

Loading the data through the Keras loader already takes care of indexing the words, so if we print what we loaded we won't be able to read it: each review is just a vector of integers.
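
As a rough illustration of what "indexing the words" means, the sketch below builds a frequency-ranked word index for two made-up reviews and encodes them as integers. This is an assumption-laden toy, not the Keras loader itself: the real loader ships a precomputed index and also reserves a few special indices, which this sketch ignores apart from keeping 0 free for padding.

```python
from collections import Counter

# Two made-up reviews standing in for the real dataset.
reviews = [
    "a great great movie",
    "a boring movie",
]

# Count how often each word occurs across all reviews.
counts = Counter(word for review in reviews for word in review.split())

# Rank words by frequency (ties broken alphabetically).
# Index 0 is left unused here so it can serve as the padding value.
ranked = sorted(counts, key=lambda w: (-counts[w], w))
word_index = {word: i for i, word in enumerate(ranked, start=1)}

# Replace every word with its integer index.
encoded = [[word_index[w] for w in review.split()] for review in reviews]

print(word_index)
print(encoded)
```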

Additionally, the reviews have different lengths. Shorter ones are padded with an integer that represents a no-word or blank token, and longer ones are cut down to maxlen words.
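
The padding step can be sketched in a few lines of NumPy. The helper name pad_to_length is hypothetical, and the choice to pad and truncate at the front of the sequence with 0 is meant to mirror what I understand to be the defaults of sequence.pad_sequences; treat that as an assumption and check the Keras documentation for your version.

```python
import numpy as np

def pad_to_length(seq, maxlen, pad_value=0):
    """Hypothetical helper: left-pad or left-truncate one sequence to maxlen (maxlen > 0)."""
    seq = list(seq)[-maxlen:]                    # keep at most the last maxlen items
    padding = [pad_value] * (maxlen - len(seq))  # empty when seq is already long enough
    return np.array(padding + seq, dtype="int32")

short = pad_to_length([5, 8, 2], maxlen=5)
long_ = pad_to_length([1, 2, 3, 4, 5, 6, 7], maxlen=5)

print(short)  # zeros are added in front of the three real indices
print(long_)  # only the last five indices are kept
```

After this step every review has exactly the same length, which is what lets the whole dataset be stored as a single (samples x time) array.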

In [6]:
max_features = 20000
maxlen = 80  # cut texts after this number of words (among top max_features most common words)
batch_size = 32

print('Loading data...')
(X_train, y_train), (X_test, y_test) = imdb.load_data(nb_words=max_features)
print(len(X_train), 'train sequences')
print(len(X_test), 'test sequences')

print('Pad sequences (samples x time)')
X_train = sequence.pad_sequences(X_train, maxlen=maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=maxlen)
print('X_train shape:', X_train.shape)
print('X_test shape:', X_test.shape)

Loading data...
(25000, 'train sequences')
(25000, 'test sequences')
Pad sequences (samples x time)
('X_train shape:', (25000, 80))
('X_test shape:', (25000, 80))


In [7]:
print('Build model...')
model = Sequential()
model.add(Embedding(max_features, 32, dropout=0.2))
model.add(GRU(16, dropout_W=0.2, dropout_U=0.2))  # recurrent layer: a GRU with 16 units
model.add(Dense(1))
model.add(Activation('sigmoid'))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

print('Train...')
model.fit(X_train, y_train, batch_size=batch_size, nb_epoch=10,
          validation_data=(X_test, y_test))
score, acc = model.evaluate(X_test, y_test,
                            batch_size=batch_size)
print('Test score:', score)
print('Test accuracy:', acc)

Build model...
Train...
Train on 25000 samples, validate on 25000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
('Test score:', 0.49611565612792968)
('Test accuracy:', 0.81784000000000001)
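
The test score printed above is the loss the model was compiled with (binary cross-entropy), and the test accuracy is the fraction of reviews classified correctly. The following NumPy sketch shows how both could be computed from sigmoid outputs. The logits and labels are made-up values, and the 0.5 threshold and the clipping constant are assumptions for illustration rather than a copy of the Keras internals.

```python
import numpy as np

def sigmoid(z):
    """Squash raw scores into the (0, 1) range."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw outputs of the final Dense(1) layer and the true labels.
logits = np.array([2.0, -1.0, 0.5, -3.0])
y_true = np.array([1, 0, 0, 0])

probs = sigmoid(logits)

# Binary cross-entropy, with clipping to avoid log(0).
eps = 1e-7
clipped = np.clip(probs, eps, 1 - eps)
loss = -np.mean(y_true * np.log(clipped) + (1 - y_true) * np.log(1 - clipped))

# Accuracy: predict the positive class when the probability exceeds 0.5.
preds = (probs > 0.5).astype(int)
accuracy = np.mean(preds == y_true)

print('loss:', loss)
print('accuracy:', accuracy)
```

In this toy case the third example has a positive logit but a negative label, so it is the one misclassified review and also contributes the largest share of the loss.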
