In [1]:
from theano.sandbox import cuda

 https://github.com/Theano/Theano/wiki/Converting-to-the-new-gpu-back-end%28gpuarray%29

Using gpu device 0: GeForce GTX 950 (CNMeM is enabled with initial size: 90.0% of memory, cuDNN 5110)


In [2]:
%matplotlib inline
import utils; reload(utils)
from utils import *
from __future__ import division, print_function

Using Theano backend.


In [3]:
model_path = 'data/imdb/models/'
%mkdir -p $model_path

## Setup data

We're going to look at the IMDB dataset, which contains movie reviews from IMDB, along with their sentiment. Keras comes with some helpers for this dataset.

In [4]:
from keras.datasets import imdb
idx = imdb.get_word_index()

This is the word list:

In [5]:
idx.items()[:5]

[('fawn', 34701),
 ('tsukino', 52006),
 ('nunnery', 52007),
 ('sonja', 16816),
 ('vani', 63951)]

In [6]:
idx_arr = sorted(idx, key=idx.get)
idx_arr[:10]

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i']

...and this is the mapping from id to word:

In [7]:
idx2word = {v: k for k, v in idx.iteritems()}

# items() vs iteritems()
# https://stackoverflow.com/questions/10458437/what-is-the-difference-between-dict-items-and-dict-iteritems
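
The inversion above relies on `iteritems()`, which exists only on Python 2. As a minimal sketch on a made-up three-word index (`toy_idx` is an illustrative stand-in, not the real IMDB index), the same two lookups can be built with `items()`, which works on both Python 2 and 3:

```python
# Toy stand-in for the IMDB word index: word -> id, where lower ids are more frequent words.
toy_idx = {'the': 1, 'and': 2, 'a': 3}

# Invert it to id -> word; items() is available on both Python 2 and 3.
toy_idx2word = {v: k for k, v in toy_idx.items()}

# Sorting the words by their id recovers the frequency-ranked word list.
toy_idx_arr = sorted(toy_idx, key=toy_idx.get)

print(toy_idx2word[1])
print(toy_idx_arr)
```

On the full index, `items()` builds a list on Python 2, which costs a little more memory than `iteritems()` but is harmless at this size.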

We download the reviews using code copied from keras.datasets:

In [8]:
path = get_file('imdb_full.pkl',
                origin='https://s3.amazonaws.com/text-datasets/imdb_full.pkl',
                md5_hash='d091312047c43cf9e4e38fef92437263')
# Use a context manager so the file is closed once the pickle is loaded
with open(path, 'rb') as f:
    (x_train, labels_train), (x_test, labels_test) = pickle.load(f)

In [9]:
len(x_train)

25000

Here's the 1st review. As you see, the words have been replaced by ids. The ids can be looked up in idx2word.

In [10]:
', '.join(map(str, x_train[0]))

'23022, 309, 6, 3, 1069, 209, 9, 2175, 30, 1, 169, 55, 14, 46, 82, 5869, 41, 393, 110, 138, 14, 5359, 58, 4477, 150, 8, 1, 5032, 5948, 482, 69, 5, 261, 12, 23022, 73935, 2003, 6, 73, 2436, 5, 632, 71, 6, 5359, 1, 25279, 5, 2004, 10471, 1, 5941, 1534, 34, 67, 64, 205, 140, 65, 1232, 63526, 21145, 1, 49265, 4, 1, 223, 901, 29, 3024, 69, 4, 1, 5863, 10, 694, 2, 65, 1534, 51, 10, 216, 1, 387, 8, 60, 3, 1472, 3724, 802, 5, 3521, 177, 1, 393, 10, 1238, 14030, 30, 309, 3, 353, 344, 2989, 143, 130, 5, 7804, 28, 4, 126, 5359, 1472, 2375, 5, 23022, 309, 10, 532, 12, 108, 1470, 4, 58, 556, 101, 12, 23022, 309, 6, 227, 4187, 48, 3, 2237, 12, 9, 215'

The first word of the first review is 23022. Let's see what that is.

In [11]:
idx2word[23022]

'bromwell'

Here's the whole review, mapped from ids to words.

In [12]:
' '.join([idx2word[o] for o in x_train[0]])

"bromwell high is a cartoon comedy it ran at the same time as some other programs about school life such as teachers my 35 years in the teaching profession lead me to believe that bromwell high's satire is much closer to reality than is teachers the scramble to survive financially the insightful students who can see right through their pathetic teachers' pomp the pettiness of the whole situation all remind me of the schools i knew and their students when i saw the episode in which a student repeatedly tried to burn down the school i immediately recalled at high a classic line inspector i'm here to sack one of your teachers student welcome to bromwell high i expect that many adults of my age think that bromwell high is far fetched what a pity that it isn't"

The labels are 1 for positive, 0 for negative.

In [13]:
labels_train[:10] # list of 0 or 1

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Reduce the vocabulary size by mapping every rare word (any id of `vocab_size - 1` or above) to the max index.

In [14]:
vocab_size = 5000

trn = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_train] # list of numpy array
test = [np.array([i if i<vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]
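
Because the ids are frequency ranks, capping them is the same as taking an element-wise minimum. A small sketch with a made-up review and a tiny hypothetical vocabulary size (`toy_vocab_size` and `toy_review` are illustrative names, not part of the notebook):

```python
import numpy as np

toy_vocab_size = 10                      # hypothetical; the notebook uses 5000
toy_review = [3, 42, 9, 7, 10, 1]        # made-up word ids

# Ids >= toy_vocab_size - 1 all collapse to the shared "rare word" id.
capped = np.minimum(toy_review, toy_vocab_size - 1)

# Same result as the list comprehension used in the notebook.
by_comprehension = np.array([i if i < toy_vocab_size - 1 else toy_vocab_size - 1
                             for i in toy_review])

print(capped)
print((capped == by_comprehension).all())
```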

Look at the distribution of review lengths.

In [15]:
lens = np.array(map(len, trn))
(lens.max(), lens.min(), lens.mean())

(2493, 10, 237.71364)

Pad (with zeros) or truncate each review to a consistent length.

In [16]:
# Find the indexes of the longest reviews
sorted([(i, len(x)) for i, x in enumerate(trn)], key=lambda x: x[1])[-5:]

[(4346, 1628), (5917, 1732), (6258, 1850), (49, 1853), (1954, 2493)]

In [17]:
trn[1954]

array([1011,  297, 4346, ...,    3,    7,    7])

In [18]:
seq_len = 500

trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0)

In [19]:
trn[1954]

array([  43,    4,    1, 1743,    2,    1,  738, 4999,   16, 2648, 2648,  517,    3,  198,    4,
       4999,   12,  559,  177,  738,   18,    1,   84,   28, 4999,   16,    3, 4999,   12, 1316,
          3,  104, 1579,    1,  545, 3498, 1002,    1, 1743,   16, 4999,  579,    5,  110,    2,
       4999, 2648,    2, 3724, 4999,    1,  738, 4999,    2,  738,  185,   80,    9,  142,   80,
          1, 1743,    2, 4999, 3441,    1,  738,   16,    3, 4999, 4999,    5,   76,    3,  104,
       1579,  738, 4999, 4999, 4999,   31,    1, 3678,    2, 4999,   87, 2648, 3092,   53,    1,
       4999,  586,   12, 1326,   59,   25,  345,    1,  738,    1,  422,    1,  738, 4999,   31,
       4999,   20,    1, 4999, 4999,    5, 3196, 2648, 2648, 4999,    1,  844,    2,  738, 4999,
          1, 2848, 4999,  512,  100,    1,  738, 4999,    4, 2648,    2,  566,    1, 2848, 4999,
         20,    1, 4999, 2648,  802,    5,  190, 3076,   31, 4999,    1,   84,   28, 1002,    1,
       1743,    2, 4999, 4999,

In [20]:
len(trn[1954])

500

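This results in nice rectangular matrices that can be passed to ML algorithms. Reviews shorter than 500 words are pre-padded with zeros; longer ones are truncated from the front, so only their last 500 words are kept (compare the two printouts of `trn[1954]` above).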
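
To make the padding rule concrete, here is a pure-Python sketch of it. It assumes the Keras defaults for `pad_sequences` (`padding='pre'`, `truncating='pre'`); `pad_or_truncate` is a hypothetical helper written for illustration, not part of Keras.

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Return a list of exactly maxlen ids (hypothetical helper)."""
    seq = list(seq)
    if len(seq) >= maxlen:
        # Truncate from the front: keep the last maxlen ids.
        return seq[len(seq) - maxlen:]
    # Pad at the front with the filler value.
    return [value] * (maxlen - len(seq)) + seq

print(pad_or_truncate([5, 6, 7], 5))
print(pad_or_truncate([1, 2, 3, 4, 5, 6, 7], 5))
```

Padding at the front keeps the real words next to the end of the sequence, which matters for models that read the review left to right.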

In [21]:
trn.shape # numpy.ndarray

(25000, 500)

## Create simple models

### Single hidden layer NN

The simplest model that tends to give reasonable results is a single hidden layer net. So let's try that. Note that we can't expect to get any useful results by feeding word ids directly into a neural net - so instead we use an embedding to replace them with a vector of 32 (initially random) floats for each word in the vocab.
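
Conceptually, an embedding is just a lookup table indexed by word id. The NumPy sketch below illustrates that idea with tiny hypothetical sizes (`toy_vocab_size`, `n_factors` and the review are made up); it is not how Keras implements the layer internally.

```python
import numpy as np

rng = np.random.RandomState(0)
toy_vocab_size, n_factors = 6, 4          # hypothetical sizes; the notebook uses 5000 and 32

# One row of (initially random) floats per word id.
emb_matrix = rng.normal(size=(toy_vocab_size, n_factors))

review = np.array([0, 0, 3, 1, 5])        # a padded, made-up review of word ids

# Integer-array indexing picks one row per position in the review.
embedded = emb_matrix[review]
print(embedded.shape)                     # sequence length x embedding size
```

During training the rows of the table are updated like any other weights, so words that behave similarly can end up with similar vectors.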

In [22]:
# input_length: Length of input sequences, when it is constant. This argument is required if you are going to
# connect Flatten then Dense layers upstream (without it, the shape of the dense outputs cannot be computed).

model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len), # seq_len = 500
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])

In [23]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy']) # Use binary_crossentropy because last activation is sigmoid
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (None, 500, 32)       160000      embedding_input_1[0][0]          
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 16000)         0           embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 100)           1600100     flatten_1[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 100)           0           dense_1[0][0]                    
___________________________________________________________________________________________
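
The figures in the summary can be checked by hand. The arithmetic below reproduces the embedding and dense parameter counts, and the flattened size, from the layer sizes used in this notebook:

```python
vocab_size, n_factors, seq_len, n_hidden = 5000, 32, 500, 100

embedding_params = vocab_size * n_factors            # one 32-float vector per word id
flatten_size = seq_len * n_factors                   # 500 positions x 32 floats each
dense_params = flatten_size * n_hidden + n_hidden    # weights plus one bias per unit

print(embedding_params)
print(flatten_size)
print(dense_params)
```

Almost all of the parameters sit in the first dense layer, which is one reason the heavy dropout after it is useful.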

In [24]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f7abbb53590>

The [Stanford paper](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf) that this dataset is from cites a state-of-the-art accuracy (without unlabelled data) of 0.883. So we're short of that, but on the right track.

### Single conv layer with max pooling

A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1D CNN, since a sequence of words is 1D.
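
To show what a 1D convolution followed by max pooling does to a sequence of word vectors, here is a plain NumPy sketch. It assumes stride 1, zero padding that preserves the length (the `'same'` mode) and a pooling length of 2; `conv1d_same` and `max_pool1d` are hypothetical helpers for illustration, not the Keras implementation, and the inputs are random.

```python
import numpy as np

def conv1d_same(x, kernels, bias):
    """x: (steps, in_dim); kernels: (width, in_dim, n_filters); bias: (n_filters,)."""
    width = kernels.shape[0]
    left = (width - 1) // 2
    right = width - 1 - left
    # Zero-pad along the time axis so the output has as many steps as the input.
    padded = np.pad(x, ((left, right), (0, 0)), mode='constant')
    out = np.empty((x.shape[0], kernels.shape[2]))
    for t in range(x.shape[0]):
        window = padded[t:t + width]                   # (width, in_dim)
        # Each filter is a weighted sum over the whole window.
        out[t] = np.tensordot(window, kernels, axes=([0, 1], [0, 1])) + bias
    return np.maximum(out, 0)                          # relu

def max_pool1d(x, pool=2):
    """Keep the maximum of each non-overlapping group of `pool` steps."""
    steps = (x.shape[0] // pool) * pool
    return x[:steps].reshape(-1, pool, x.shape[1]).max(axis=1)

rng = np.random.RandomState(0)
x = rng.normal(size=(500, 32))                         # one embedded review
conv = conv1d_same(x, rng.normal(size=(5, 32, 64)), np.zeros(64))
pooled = max_pool1d(conv)
print(conv.shape)
print(pooled.shape)
```

Each filter looks at five consecutive word vectors at a time, so it can respond to short phrases wherever they occur in the review.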

In [25]:
conv1 = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2),
    Dropout(0.2),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])

In [26]:
conv1.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [27]:
conv1.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=4, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
Epoch 2/4
Epoch 3/4
Epoch 4/4


<keras.callbacks.History at 0x7f7ab5b41850>

That's well past the Stanford paper's accuracy - another win for CNNs!

In [28]:
conv1.save_weights(model_path + 'conv1.h5')

In [24]:
conv1.load_weights(model_path + 'conv1.h5')

## Pre-trained vectors

You may want to look at wordvectors.ipynb before moving on.

In this section, we replicate the previous CNN, but using pre-trained embeddings.

In [29]:
def get_glove_dataset(dataset):
    """Download the requested glove dataset from files.fast.ai
    and return a location that can be passed to load_vectors.
    """
    # see wordvectors.ipynb for info on how these files were
    # generated from the original glove data.
    md5sums = {'6B.50d': '8e1557d1228decbda7db6dfd81cd9909',
               '6B.100d': 'c92dbbeacde2b0384a43014885a60b2c',
               '6B.200d': 'af271b46c04b0b2e41a84d8cd806178d',
               '6B.300d': '30290210376887dcc6d0a5a6374d8255'}
    glove_path = os.path.abspath('data/glove/results')
    %mkdir -p $glove_path
    return get_file(dataset,
                    'http://files.fast.ai/models/glove/' + dataset + '.tgz',
                    cache_subdir=glove_path,
                    md5_hash=md5sums.get(dataset, None),
                    untar=True)

In [30]:
def load_vectors(loc):
    return (load_array(loc+'.dat'),
        pickle.load(open(loc+'_words.pkl','rb')),
        pickle.load(open(loc+'_idx.pkl','rb')))

In [31]:
vecs, words, wordidx = load_vectors(get_glove_dataset('6B.50d'))

Untaring file...


The GloVe word ids and the IMDB word ids use different indexes. So we create a simple function that builds an embedding matrix using the indexes from IMDB and the embeddings from GloVe (where they exist).
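
The remapping idea can be seen on a toy example first. Everything below is made up for illustration (a three-word pre-trained table, a four-id target index, two-dimensional vectors); only the pattern of copying rows by word and randomly initializing the misses mirrors what the notebook does.

```python
import numpy as np

# Made-up stand-ins: a 3-word pre-trained table and a 4-id target index (id 0 is padding).
toy_vecs = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
toy_wordidx = {'the': 0, 'film': 1, 'good': 2}          # pre-trained word -> row
toy_idx2word = {1: 'the', 2: 'good', 3: 'bromwell'}     # target id -> word

rng = np.random.RandomState(0)
toy_emb = np.zeros((4, toy_vecs.shape[1]))
for i in range(1, len(toy_emb)):
    word = toy_idx2word[i]
    if word in toy_wordidx:
        # Copy the pre-trained vector into the row for the target id.
        toy_emb[i] = toy_vecs[toy_wordidx[word]]
    else:
        # Word missing from the pre-trained table: random initialization.
        toy_emb[i] = rng.normal(scale=0.6, size=toy_vecs.shape[1])

print(toy_emb.shape)
```

Row 0 stays all zeros because id 0 is the padding value rather than a word.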

In [32]:
vecs.shape

(400000, 50)

In [33]:
def create_emb():
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))

    for i in range(1,len(emb)):
        word = idx2word[i]
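        # Also check membership, so a word missing from glove falls through to the random init
        if word and re.match(r"^[a-zA-Z0-9\-]*$", word) and word in wordidx: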
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            # If we can't find the word in glove, randomly initialize
            emb[i] = normal(scale=0.6, size=(n_fact,))

    # This is our "rare word" id - we want to randomly initialize
    emb[-1] = normal(scale=0.6, size=(n_fact,)) # -1 means vocab_size - 1
    emb/=3 # http://forums.fast.ai/t/why-do-we-divide-the-embedding-by-3/241
    return emb

In [34]:
emb = create_emb()

We pass our embedding matrix to the Embedding constructor, and set it to non-trainable.

In [35]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2, 
              weights=[emb], trainable=False),
    Dropout(0.25),
    Convolution1D(64, 5, border_mode='same', activation='relu'),
    Dropout(0.25),
    MaxPooling1D(),
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])

In [36]:
model.layers

[<keras.layers.embeddings.Embedding at 0x7f7aa7779350>,
 <keras.layers.core.Dropout at 0x7f7aa7779390>,
 <keras.layers.convolutional.Convolution1D at 0x7f7aa77793d0>,
 <keras.layers.core.Dropout at 0x7f7aa7779450>,
 <keras.layers.pooling.MaxPooling1D at 0x7f7aa7779490>,
 <keras.layers.core.Flatten at 0x7f7aa7779510>,
 <keras.layers.core.Dense at 0x7f7aa7779590>,
 <keras.layers.core.Dropout at 0x7f7aa7779610>,
 <keras.layers.core.Dense at 0x7f7aa7779650>]

In [37]:
# http://forums.fast.ai/t/lesson-5-discussion/233/23
# try starting the embedding layer as being trainable, and then set it to non-trainable after a couple of epochs. 
# It improved my accuracy from 0.83 to 0.9. Nothing else did.
model.layers[0].trainable=True

In [38]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_3 (Embedding)          (None, 500, 50)       250000      embedding_input_3[0][0]          
____________________________________________________________________________________________________
dropout_5 (Dropout)              (None, 500, 50)       0           embedding_3[0][0]                
____________________________________________________________________________________________________
convolution1d_2 (Convolution1D)  (None, 500, 64)       16064       dropout_5[0][0]                  
____________________________________________________________________________________________________
dropout_6 (Dropout)              (None, 500, 64)       0           convolution1d_2[0][0]            
___________________________________________________________________________________________

In [39]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [40]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f7aa57461d0>

We have already beaten our previous model! But let's fine-tune the embedding weights - especially since the words we couldn't find in GloVe just have random embeddings.

In [41]:
# model.layers[0].trainable=True

# http://forums.fast.ai/t/lesson-5-discussion/233/23
# Try starting the embedding layer as trainable, and then set it to non-trainable after a couple of epochs.
# It improved my accuracy from 0.83 to 0.9. Nothing else did.
# Note: Keras only picks up a change to `trainable` when the model is compiled again.
model.layers[0].trainable=False

In [42]:
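# Update the shared variable in place; assigning a plain float to the attribute
# would not reach the training function that has already been compiled.
model.optimizer.lr.set_value(1e-4)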

In [43]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=1, batch_size=64)

# When I followed original steps, that is, start with trainable = False then set trainable = True,
# I ended up with poor accuracy 0.8167...
# 25000/25000 [==============================] - 8s - loss: 0.4843 - acc: 0.7733 - val_loss: 0.4406 - val_acc: 0.8167

Train on 25000 samples, validate on 25000 samples
Epoch 1/1


<keras.callbacks.History at 0x7f7aa72e9350>

As expected, that's given us a nice little boost. :)

In [44]:
model.save_weights(model_path+'glove50.h5')

## Multi-size CNN

This is an implementation of a multi-size CNN as shown in Ben Bowles' [excellent blog post](https://quid.com/feed/how-quid-uses-deep-learning-with-small-data).

In [45]:
from keras.layers import Merge

We use the functional API to create multiple conv layers of different sizes, and then concatenate them.

In [46]:
graph_in = Input((seq_len, 50)) # one embedded review: seq_len steps of 50-d vectors (not vocab_size)
convs = []
for fsz in range(3, 6):
    x = Convolution1D(64, fsz, border_mode='same', activation="relu")(graph_in)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    convs.append(x)
out = Merge(mode="concat")(convs) # A Merge layer can be used to merge a list of tensors into a single tensor, following some merge mode.
graph = Model(graph_in, out) # keras.engine.training.Model
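
As a rough check of what the concatenation produces, the shape arithmetic below follows one embedded review of 500 steps through the three branches. It assumes stride 1 with `border_mode='same'` (so each convolution keeps all 500 steps) and a pooling length of 2.

```python
seq_len, n_filters, pool_length = 500, 64, 2

branch_sizes = []
for fsz in range(3, 6):                             # filter widths 3, 4 and 5
    conv_steps = seq_len                            # 'same' padding keeps the sequence length
    pooled_steps = conv_steps // pool_length        # max pooling halves it
    branch_sizes.append(pooled_steps * n_filters)   # size after Flatten

merged_size = sum(branch_sizes)                     # concatenation adds the branch sizes
print(branch_sizes)
print(merged_size)
```

The filter width does not change the output size here; it only changes how many neighbouring words each filter sees.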

In [47]:
emb = create_emb()

We then replace the conv/max-pool layer in our original CNN with the concatenated conv layers.

In [48]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, dropout=0.2, weights=[emb]), # input_length is seq_len
    Dropout(0.2),
    graph, # You can insert a Model instance in Sequential, too.
    Dropout(0.5),
    Dense(100, activation="relu"),
    Dropout(0.7),
    Dense(1, activation='sigmoid')
    ])

In [49]:
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

In [50]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f7aa28e7e50>

Interestingly, I found that in this case I got best results when I started the embedding layer as being trainable, and then set it to non-trainable after a couple of epochs. I have no idea why!

In [51]:
model.layers[0].trainable=False

In [52]:
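# Update the shared variable in place; assigning a plain float to the attribute
# would not reach the training function that has already been compiled.
model.optimizer.lr.set_value(1e-5)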

In [53]:
model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=2, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
Epoch 2/2


<keras.callbacks.History at 0x7f7a9f39b2d0>

This more complex architecture has given us another boost in accuracy.

## LSTM

We haven't covered this bit yet!

In [23]:
# consume_less: 
#     one of "cpu", "mem", or "gpu" (LSTM/GRU only). If set to "cpu", 
#     the RNN will use an implementation that uses fewer, larger matrix products, 
#     thus running faster on CPU but consuming more memory. If set to "mem", 
#     the RNN will use more matrix products, but smaller ones, thus running slower (may actually be faster on GPU) 
#     while consuming less memory. If set to "gpu" (LSTM/GRU only), the RNN will combine the input gate, 
#     the forget gate and the output gate into a single matrix, enabling more time-efficient parallelization on the GPU. 
#     Note: RNN dropout must be shared for all gates, resulting in a slightly reduced regularization.

model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len, mask_zero=True, # seq_len=500
              W_regularizer=l2(1e-6), dropout=0.2),
    LSTM(100, consume_less='gpu'),
    Dense(1, activation='sigmoid')])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
embedding_1 (Embedding)          (None, 500, 32)       160000      embedding_input_1[0][0]          
____________________________________________________________________________________________________
lstm_1 (LSTM)                    (None, 100)           53200       embedding_1[0][0]                
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 1)             101         lstm_1[0][0]                     
Total params: 213,301
Trainable params: 213,301
Non-trainable params: 0
____________________________________________________________________________________________________
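
The parameter counts in this summary can also be reproduced by hand. The sketch below uses the standard LSTM layout of four gates (input, forget, cell candidate and output), each with input weights, recurrent weights and a bias:

```python
vocab_size, n_factors, n_hidden = 5000, 32, 100

embedding_params = vocab_size * n_factors
# Four gates, each with (inputs x units) + (units x units) weights and one bias per unit.
lstm_params = 4 * (n_factors * n_hidden + n_hidden * n_hidden + n_hidden)
dense_params = n_hidden * 1 + 1                    # one output unit plus its bias

total_params = embedding_params + lstm_params + dense_params
print(lstm_params)
print(total_params)
```

Note that the LSTM reads the sequence step by step and outputs only its final 100-unit state, so the dense layer here is tiny compared with the one after `Flatten` in the earlier models.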


In [27]:
# trn: (25000, 500)
# labels_train: (25000,)

model.fit(trn, labels_train, validation_data=(test, labels_test), nb_epoch=5, batch_size=64)

Train on 25000 samples, validate on 25000 samples
Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7fb068a18310>