In [1]:
%matplotlib inline

import utils_ted
from utils_ted import *

Using TensorFlow backend.


In [2]:
model_path = 'data/imdb/models/'
if not os.path.exists(model_path): os.makedirs(model_path)

In [3]:
batch_size = 64

## Setup data

We're going to look at the IMDB dataset, which contains movie reviews from IMDB, along with their sentiment. Keras comes with some helpers for this dataset.

In [4]:
from keras.datasets import imdb

[imdb.get_word_index](https://www.tensorflow.org/api_docs/python/tf/keras/datasets/imdb/get_word_index)

In [11]:
idx = imdb.get_word_index()
len(idx)

88584

This is the word list, sorted by index so that the most frequent words come first:

In [22]:
idx_arr = sorted(idx, key=idx.get)
idx_arr[:10]

['the', 'and', 'a', 'of', 'to', 'is', 'br', 'in', 'it', 'i']

...and this is the mapping from id to word

In [23]:
idx2word = {v: k for k, v in idx.items()}

We load the reviews with the imdb.load_data helper from keras.datasets:

In [8]:
(x_train, y_train), (x_test, y_test) = imdb.load_data()

In [9]:
print(len(x_train))
print(len(x_test))

25000
25000


Here's the first review. As you can see, the words have been replaced by ids, which can be looked up in idx2word.

In [30]:
#x_train[0][:3]
", ".join(map(str, x_train[0]))

'1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 22665, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 21631, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 31050, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32'

The first id in the first review is 1. Let's see what that maps to.

In [31]:
idx2word[1], idx2word[14]

('the', 'as')
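A caveat on this lookup, assuming imdb.load_data() was called with its Keras defaults (start_char=1, oov_char=2, index_from=3): the ids stored in the reviews are shifted by 3 relative to get_word_index(), and id 1 marks the start of a review rather than the word 'the'. Decoding without undoing that shift is probably why the joined reviews below read as word salad. Here is a minimal sketch of a shifted decoder on a toy vocabulary; decode_review and the placeholder tokens are illustrative names, not part of Keras.

```python
def decode_review(ids, word_index, index_from=3):
    """Map Keras IMDB ids back to words, undoing the default id shift."""
    # Ids 0, 1 and 2 are reserved under the assumed load_data() defaults.
    special = {0: "<pad>", 1: "<start>", 2: "<oov>"}
    # word_index maps word -> frequency rank; stored ids are rank + index_from.
    id_to_word = {rank + index_from: word for word, rank in word_index.items()}
    return " ".join(special.get(i) or id_to_word.get(i, "<oov>") for i in ids)


# Toy stand-in for imdb.get_word_index(); the real index has about 88k entries.
toy_index = {"the": 1, "and": 2, "film": 3}
print(decode_review([1, 4, 6, 5, 4, 6], toy_index))
# -> <start> the film and the film
```

With the real data the call would look like decode_review(x_train[0], idx), under the same assumption about the defaults.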

In [27]:
' '.join([idx2word[o] for o in x_train[0]])

"the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room titillate it so heart shows to years of every never going villaronga help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but pratfalls to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other tricky in of seen over landed for anyone of gilmore's br show's to whether from than out themselves history he name half some br of 'n odd was two most of mean for 1 any an boat she he should is thought frog but of script you not while history he heart to real at barrel but wh

In [32]:
# Ted's own implementation
def num2word(x): return idx2word[x]
" ".join(map(num2word, x_train[0]))

"the as you with out themselves powerful lets loves their becomes reaching had journalist of lot from anyone to have after out atmosphere never more room titillate it so heart shows to years of every never going villaronga help moments or of every chest visual movie except her was several of enough more with is now current film as you of mine potentially unfortunately of you than him that with out themselves her get for was camp of you movie sometimes movie that with scary but pratfalls to story wonderful that in seeing in character to of 70s musicians with heart had shadows they of here that with her serious to have does when from why what have critics they is you that isn't one will very to as itself with other tricky in of seen over landed for anyone of gilmore's br show's to whether from than out themselves history he name half some br of 'n odd was two most of mean for 1 any an boat she he should is thought frog but of script you not while history he heart to real at barrel but wh

The labels are 1 for positive, 0 for negative.

In [33]:
y_train[:10]

array([1, 0, 0, 1, 0, 0, 1, 0, 1, 0], dtype=int64)

Reduce the vocabulary size by mapping every rare word (any id of vocab_size-1 or above) to the maximum index, vocab_size-1.

In [34]:
vocab_size = 5000

trn = [np.array([i if i < vocab_size-1 else vocab_size-1 for i in s]) for s in x_train]
test = [np.array([i if i < vocab_size-1 else vocab_size-1 for i in s]) for s in x_test]
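To make the clipping rule concrete, here is a small sketch of the same logic on a plain list. clip_ids is a hypothetical helper name, and the sample ids are picked by hand to show both sides of the threshold.

```python
def clip_ids(review, vocab_size):
    """Replace every id at or above vocab_size - 1 with the shared rare-word id."""
    rare_id = vocab_size - 1
    return [i if i < rare_id else rare_id for i in review]


# 22665 and 4999 both collapse to the rare-word id 4999; 4998 is kept.
print(clip_ids([1, 14, 22665, 4999, 4998], vocab_size=5000))
# -> [1, 14, 4999, 4999, 4998]
```

One consequence is that id 4999 no longer stands for a single word: it becomes a shared bucket for everything outside the top of the vocabulary.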

Look at the distribution of review lengths.

In [52]:
lens = np.array([len(s) for s in trn])
(lens.max(), lens.min(), lens.mean())

(2494, 11, 238.71364)
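With a mean of about 239 words and a maximum of 2494, it helps to know how many reviews a given cut-off would truncate. A minimal sketch follows; share_longer_than is a hypothetical helper, and the toy lengths are made up for illustration, not taken from the dataset.

```python
def share_longer_than(lengths, seq_len):
    """Fraction of sequences that would be truncated at seq_len."""
    return sum(1 for n in lengths if n > seq_len) / len(lengths)


# Made-up lengths for illustration only.
toy_lengths = [11, 120, 238, 450, 2494]
print(share_longer_than(toy_lengths, 500))
# -> 0.2
```

On the notebook's data the equivalent call would be share_longer_than(lens, 500).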

Pad (with zeros) or truncate each review to a consistent length.

In [53]:
from keras.preprocessing import sequence

In [54]:
seq_len = 500

trn = sequence.pad_sequences(trn, maxlen=seq_len, value=0.)
test = sequence.pad_sequences(test, maxlen=seq_len, value=0.)

This results in nice rectangular matrices that can be passed to ML algorithms. Reviews shorter than 500 words are pre-padded with zeros; longer ones are truncated, which with the pad_sequences defaults means dropping words from the start of the review.

In [55]:
print(trn.shape)
print(test.shape)

(25000, 500)
(25000, 500)
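The padding behaviour can be sketched in plain Python. This is an illustration of what pad_sequences does under its default padding='pre' and truncating='pre' settings, not the Keras implementation; pad_or_truncate is a hypothetical name.

```python
def pad_or_truncate(ids, maxlen, value=0):
    """Left-pad short sequences with `value`; keep only the last `maxlen` ids of long ones."""
    ids = list(ids)
    if len(ids) >= maxlen:
        return ids[-maxlen:]
    return [value] * (maxlen - len(ids)) + ids


print(pad_or_truncate([5, 6, 7], maxlen=5))           # -> [0, 0, 5, 6, 7]
print(pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4))  # -> [3, 4, 5, 6]
```

Pre-padding keeps the real words at the end of the sequence, which matters for the LSTM later on because the last timesteps are the ones closest to its output.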


## Create simple models

### Single hidden layer NN

The simplest model that tends to give reasonable results is a single hidden layer net. So let's try that. Note that we can't expect to get any useful results by feeding word ids directly into a neural net - so instead we use an embedding to replace them with a vector of 32 (initially random) floats for each word in the vocab.

In [57]:
model = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),
    Flatten(),
    Dense(100, activation='relu'),
    #Dropout(0.7),
    Dense(1, activation='sigmoid')
])

In [58]:
model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 32)           160000    
_________________________________________________________________
flatten_2 (Flatten)          (None, 16000)             0         
_________________________________________________________________
dense_3 (Dense)              (None, 100)               1600100   
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 101       
Total params: 1,760,201
Trainable params: 1,760,201
Non-trainable params: 0
_________________________________________________________________
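The parameter counts in the summary can be checked by hand. The sketch below just redoes the arithmetic from the layer sizes used above; the variable names are illustrative.

```python
vocab_size, emb_dim, seq_len, hidden = 5000, 32, 500, 100

embedding = vocab_size * emb_dim       # one 32-float vector per word id
flat = seq_len * emb_dim               # Flatten: 500 positions x 32 floats
dense_hidden = flat * hidden + hidden  # weights plus one bias per hidden unit
dense_out = hidden * 1 + 1             # weights plus bias for the sigmoid unit
total = embedding + dense_hidden + dense_out

print(embedding, dense_hidden, dense_out, total)
# -> 160000 1600100 101 1760201
```

Almost all of the parameters sit in the first dense layer, because flattening turns every (position, embedding dimension) pair into its own input.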


In [59]:
model.fit(trn, y_train, validation_data=(test, y_test), batch_size=batch_size, epochs=2, verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
24s - loss: 0.4136 - acc: 0.7888 - val_loss: 0.2898 - val_acc: 0.8782
Epoch 2/2
48s - loss: 0.1484 - acc: 0.9468 - val_loss: 0.3402 - val_acc: 0.8648


<keras.callbacks.History at 0x22b69b8aa58>

The [Stanford paper](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf) that this dataset comes from cites a state-of-the-art accuracy (without unlabelled data) of 0.883. Our best validation accuracy here is 0.8782 (after the first epoch), so we're just short of that, but on the right track.

### Single conv layer with max pooling

A CNN is likely to work better, since it's designed to take advantage of ordered data. We'll need to use a 1D CNN, since a sequence of words is 1D.

In [124]:
conv1 = Sequential([
    Embedding(vocab_size, 32, input_length=seq_len),
    #Embedding(vocab_size, 32, input_length=seq_len, dropout=0.2),
    #Dropout(0.2),
    BatchNormalization(),
    Conv1D(64, 5, padding='same', activation='relu'),
    #Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    BatchNormalization(),
    Dense(100, activation='relu'),
    BatchNormalization(),
    #Dropout(0.7),
    Dense(1, activation='sigmoid')])
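To show what the Conv1D and MaxPooling1D layers compute, here is a toy single-channel sketch in plain Python. It assumes an odd kernel length, 'same' zero padding and a pool size of 2, and it is an illustration only, not how Keras implements these layers; all function names are hypothetical.

```python
def conv1d_same(seq, kernel):
    """Slide an odd-length kernel over seq, zero-padding so the output keeps seq's length."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(seq) + [0.0] * pad
    return [sum(padded[i + j] * w for j, w in enumerate(kernel))
            for i in range(len(seq))]


def relu(seq):
    return [max(0.0, v) for v in seq]


def max_pool(seq, size=2):
    """Keep the largest value in each non-overlapping window of `size`."""
    return [max(seq[i:i + size]) for i in range(0, len(seq) - size + 1, size)]


signal = [1.0, -2.0, 3.0, 4.0, -1.0, 0.5]
kernel = [1.0, 0.0, -1.0]
conv = conv1d_same(signal, kernel)
features = max_pool(relu(conv))
print(conv)      # -> [2.0, -2.0, -6.0, 4.0, 3.5, -1.0]
print(features)  # -> [2.0, 4.0, 3.5]
```

In the model above the same idea runs over 500 positions with 64 filters of width 5, each looking at all embedding channels, and pooling halves the sequence length.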

In [125]:
conv1.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])

In [126]:
conv1.fit(trn, y_train, validation_data=(test, y_test), batch_size=batch_size, epochs=2, verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/2
206s - loss: 0.4417 - acc: 0.7759 - val_loss: 1.0215 - val_acc: 0.5156
Epoch 2/2
173s - loss: 0.1965 - acc: 0.9234 - val_loss: 0.3347 - val_acc: 0.8601


<keras.callbacks.History at 0x22b0e81bcc0>

In [127]:
conv1.save_weights(model_path+'conv1.h5')

In [128]:
conv1.load_weights(model_path+'conv1.h5')

## Pre-trained vectors

You may want to look at wordvectors.ipynb before moving on.

In this section, we replicate the previous CNN, but using pre-trained embeddings.

In [69]:
from keras.utils import get_file

In [70]:
def get_glove_dataset(dataset):
    """Download the requested glove dataset from files.fast.ai
    and return a location that can be passed to load_vectors.
    """
    # see wordvectors.ipynb for info on how these files were
    # generated from the original glove data.
    md5sums = {'6B.50d': '8e1557d1228decbda7db6dfd81cd9909',
               '6B.100d': 'c92dbbeacde2b0384a43014885a60b2c',
               '6B.200d': 'af271b46c04b0b2e41a84d8cd806178d',
               '6B.300d': '30290210376887dcc6d0a5a6374d8255'}
    glove_path = os.path.abspath('data/glove/results')
    #%mkdir -p $glove_path
    if not os.path.exists(glove_path): os.makedirs(glove_path)
    return get_file(dataset,
                    'http://files.fast.ai/models/glove/' + dataset + '.tgz',
                    cache_subdir=glove_path,
                    md5_hash=md5sums.get(dataset, None),
                    untar=True)

In [71]:
def load_vectors(loc):
    return (load_array(loc+'.dat'),
        pickle.load(open(loc+'_words.pkl','rb'), encoding='utf8'),
        pickle.load(open(loc+'_idx.pkl','rb'), encoding='utf8'))

In [72]:
vecs, words, wordidx = load_vectors(get_glove_dataset('6B.50d'))

In [82]:
vecs.shape, len(words), len(wordidx)

((400000, 50), 400000, 400000)

In [78]:
wordidx['hello'], wordidx['this']

(13075, 37)

In [81]:
words[13075], words[37] 

('hello', 'this')

In [101]:
wordidx[',']

1

In [110]:
wordidx['it\'s']

KeyError: "it's"

The GloVe word ids and the IMDB word ids use different indexes, so we create a simple function that builds an embedding matrix using the indexes from IMDB and the embeddings from GloVe (where they exist).

In [132]:
def create_emb(vocab_size):
    n_fact = vecs.shape[1]
    emb = np.zeros((vocab_size, n_fact))
    
    for i in range(1, len(emb)):
        word = idx2word[i]
        #if word and re.match(r"^[a-zA-Z0-9\-]*$", word):
        if word and re.match(r"^[a-zA-Z0-9\-_,]*$", word):
            src_idx = wordidx[word]
            emb[i] = vecs[src_idx]
        else:
            # If we can't find the word in glove, randomly initialize
            emb[i] = normal(scale=0.6, size=(n_fact,))
            #emb[i] = normal(scale=1, size=(n_fact,))
    
    # This is our "rare word" id - we want to randomly initialize
    emb[-1] = normal(scale=0.6, size=(n_fact,))
    #emb[-1] = normal(scale=1, size=(n_fact,))
    emb = emb / 3  #??
    return emb

In [133]:
emb = create_emb(vocab_size=vocab_size)

In [134]:
emb.shape

(5000, 50)
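As a cross-check on the idea, here is a toy sketch of the same construction that tests membership in the GloVe index directly instead of filtering words with a regular expression, so a word missing from GloVe cannot raise a KeyError. It uses plain lists and the standard-library random module; build_embedding and the toy data are hypothetical, and the final division by 3 from create_emb is left out.

```python
import random


def build_embedding(idx2word, wordidx, vecs, vocab_size, scale=0.6, seed=0):
    """Copy pretrained vectors where a word is known; draw random vectors otherwise."""
    rng = random.Random(seed)
    n_fact = len(vecs[0])
    emb = [[0.0] * n_fact for _ in range(vocab_size)]  # row 0 (padding) stays zero
    for i in range(1, vocab_size):
        word = idx2word.get(i)
        if word in wordidx:
            emb[i] = list(vecs[wordidx[word]])
        else:
            emb[i] = [rng.gauss(0.0, scale) for _ in range(n_fact)]
    # The last row is the shared rare-word id, so it is always random.
    emb[-1] = [rng.gauss(0.0, scale) for _ in range(n_fact)]
    return emb


# Toy stand-ins for the IMDB index and the GloVe arrays.
toy_idx2word = {1: "the", 2: "it's", 3: "film"}
toy_wordidx = {"the": 0, "film": 1}
toy_vecs = [[0.1, 0.2], [0.3, 0.4]]
toy_emb = build_embedding(toy_idx2word, toy_wordidx, toy_vecs, vocab_size=4)
print(toy_emb[1])  # -> [0.1, 0.2]
```

Note that if the review ids carry the default Keras offset of 3, the index passed in would need the same shift for the rows to line up with the right words.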

We pass our embedding matrix to the Embedding constructor. In this run the layer is left trainable (the default); the commented-out line shows the variant that sets it to non-trainable.

In [135]:
model = Sequential([
    # Embedding(vocab_size, 50, input_length=seq_len, weights=[emb], trainable=False, dropout=0.2),
    Embedding(vocab_size, 50, input_length=seq_len, weights=[emb]),
    #Dropout(0.2),
    BatchNormalization(),
    Conv1D(64, 5, padding='same', activation='relu'),
    #Dropout(0.2),
    MaxPooling1D(),
    Flatten(),
    BatchNormalization(),
    Dense(100, activation='relu'),
    #Dropout(0.7),
    BatchNormalization(),
    Dense(1, activation='sigmoid')])

In [136]:
model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])

In [137]:
model.fit(trn, y_train, validation_data=(test, y_test), batch_size=batch_size, epochs=4, verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
252s - loss: 0.5359 - acc: 0.7024 - val_loss: 0.3439 - val_acc: 0.8554
Epoch 2/4
266s - loss: 0.2284 - acc: 0.9098 - val_loss: 0.3949 - val_acc: 0.8398
Epoch 3/4
262s - loss: 0.1249 - acc: 0.9520 - val_loss: 0.4080 - val_acc: 0.8568
Epoch 4/4
262s - loss: 0.0660 - acc: 0.9759 - val_loss: 0.4872 - val_acc: 0.8563


<keras.callbacks.History at 0x22b13a32c88>

In [138]:
model.save_weights(model_path+'emb-trainable1.h5')

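After four epochs the validation accuracy is 0.8563, roughly level with the earlier CNN (0.8601), while training accuracy has climbed to 0.9759, so the model is overfitting. Up to this point the embedding weights were trainable; next we freeze the embedding layer and continue training at a lower learning rate.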

In [139]:
model.layers[0].trainable = False #True

In [140]:
model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])

In [141]:
model.optimizer.lr=1e-4

In [142]:
model.fit(trn, y_train, validation_data=(test, y_test), batch_size=batch_size, epochs=4, verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
252s - loss: 0.0261 - acc: 0.9922 - val_loss: 0.5039 - val_acc: 0.8622
Epoch 2/4
230s - loss: 0.0168 - acc: 0.9954 - val_loss: 0.5468 - val_acc: 0.8590
Epoch 3/4
218s - loss: 0.0118 - acc: 0.9972 - val_loss: 0.5730 - val_acc: 0.8594
Epoch 4/4
234s - loss: 0.0083 - acc: 0.9984 - val_loss: 0.6111 - val_acc: 0.8589


<keras.callbacks.History at 0x22b1f9143c8>

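Validation accuracy ends at 0.8589, against 0.8563 before the change, so any gain is marginal; meanwhile validation loss keeps climbing, a sign of overfitting.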
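The two training logs above are easier to read when the best epoch is picked out explicitly. Below is a small helper sketch; best_epoch is a hypothetical name, and the lists are the val_loss and val_acc values copied from the eight epochs logged above.

```python
def best_epoch(values, mode="min"):
    """Return (1-based epoch, value) for the best entry in a per-epoch metric list."""
    pick = min if mode == "min" else max
    best = pick(range(len(values)), key=lambda i: values[i])
    return best + 1, values[best]


# Values copied from the two fit() logs above (4 + 4 epochs).
val_loss = [0.3439, 0.3949, 0.4080, 0.4872, 0.5039, 0.5468, 0.5730, 0.6111]
val_acc = [0.8554, 0.8398, 0.8568, 0.8563, 0.8622, 0.8590, 0.8594, 0.8589]

print(best_epoch(val_loss, "min"))  # -> (1, 0.3439)
print(best_epoch(val_acc, "max"))   # -> (5, 0.8622)
```

Validation loss is lowest after the very first epoch, which suggests that more regularisation (for example the commented-out Dropout layers) or earlier stopping might help more than extra epochs.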

In [143]:
model.save_weights(model_path+'glove50.h5')

## Multi-size CNN

This is an implementation of a multi-size CNN as shown in Ben Bowles' [excellent blog post](https://quid.com/feed/how-quid-uses-deep-learning-with-small-data).

In [144]:
from keras.layers import Add

In [145]:
graph_in = Input((seq_len, 50))  # one 50-d embedding vector per position in the padded review
convs = []
for filter_size in range(3, 6):
    x = BatchNormalization()(graph_in)
    x = Conv1D(64, filter_size, padding='same', activation='relu')(x)
    x = MaxPooling1D()(x)
    x = Flatten()(x)
    convs.append(x)
out = Add()(convs)
graph = Model(graph_in, out)
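The branches here are merged with Add, which sums them element by element. The other common choice for this kind of block is concatenation (Concatenate in keras.layers). A toy sketch of the difference, using plain lists in place of Keras tensors; both function names are hypothetical.

```python
def add_merge(branches):
    """Element-wise sum: output has the same length as each branch."""
    return [sum(vals) for vals in zip(*branches)]


def concat_merge(branches):
    """Concatenation: output length is the sum of the branch lengths."""
    return [v for branch in branches for v in branch]


# Stand-ins for the flattened outputs of the three filter sizes.
branches = [[1, 2], [3, 4], [5, 6]]
print(add_merge(branches))     # -> [9, 12]
print(concat_merge(branches))  # -> [1, 2, 3, 4, 5, 6]
```

Summing keeps the following Dense layer small but blends the three filter sizes into one vector, whereas concatenating lets the Dense layer weight each filter size separately at the cost of three times as many inputs.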

In [146]:
emb = create_emb(vocab_size=vocab_size)

We then replace the conv/max-pool layer in our original CNN with the multi-size conv block. Note that this version sums the three branch outputs with Add rather than concatenating them.

In [147]:
model = Sequential([
    # Embedding(vocab_size, 50, input_length=seq_len, weights=[emb], trainable=False, dropout=0.2),
    Embedding(vocab_size, 50, input_length=seq_len, weights=[emb]),
    #Dropout(0.2),
    graph,
    BatchNormalization(),
    Dense(100, activation='relu'),
    #Dropout(0.7),
    BatchNormalization(),
    Dense(1, activation='sigmoid')])

In [148]:
model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])

In [149]:
model.fit(trn, y_train, validation_data=(test, y_test), batch_size=batch_size, epochs=4, verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
504s - loss: 0.5199 - acc: 0.7123 - val_loss: 0.3355 - val_acc: 0.8607
Epoch 2/4
533s - loss: 0.2355 - acc: 0.9068 - val_loss: 0.3247 - val_acc: 0.8625
Epoch 3/4
514s - loss: 0.1227 - acc: 0.9546 - val_loss: 0.4028 - val_acc: 0.8583
Epoch 4/4
487s - loss: 0.0648 - acc: 0.9765 - val_loss: 0.5035 - val_acc: 0.8464


<keras.callbacks.History at 0x22b23d35ac8>

Interestingly, I found that in this case I got best results when I started the embedding layer as being trainable, and then set it to non-trainable after a couple of epochs. I have no idea why!

In [150]:
# Note: Keras only picks up a changed trainable flag once model.compile() is called again
model.layers[0].trainable=False

In [151]:
model.optimizer.lr=1e-5

In [152]:
model.fit(trn, y_train, validation_data=(test, y_test), batch_size=batch_size, epochs=4, verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
522s - loss: 0.0367 - acc: 0.9879 - val_loss: 1.1184 - val_acc: 0.7907
Epoch 2/4
518s - loss: 0.0264 - acc: 0.9907 - val_loss: 0.6310 - val_acc: 0.8528
Epoch 3/4
494s - loss: 0.0285 - acc: 0.9897 - val_loss: 0.6731 - val_acc: 0.8459
Epoch 4/4
507s - loss: 0.0186 - acc: 0.9937 - val_loss: 0.7627 - val_acc: 0.8479


<keras.callbacks.History at 0x22b24123f98>

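In this run the more complex architecture did not give a clear gain: validation accuracy peaked at 0.8625 (epoch 2 of the first fit), close to the single-size CNN, and continued training ended at 0.8479 as the model overfit.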

## LSTM

We haven't covered this bit yet!

In [154]:
model = Sequential([
    Embedding(vocab_size, 50, input_length=seq_len, 
              weights=[emb], mask_zero=True),
    BatchNormalization(),
    LSTM(100, implementation=1),
    BatchNormalization(),
    Dense(1, activation='sigmoid')
])
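For a rough sense of the size of this model, the LSTM layer's parameter count can be worked out by hand. The sketch assumes the standard LSTM layout (four gates, each with input weights, recurrent weights and a bias, no peephole connections); lstm_params is a hypothetical helper.

```python
def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias vector."""
    return 4 * (units * (input_dim + units) + units)


# 50-d GloVe inputs feeding LSTM(100), as in the model above.
print(lstm_params(50, 100))
# -> 60400
```

That is far fewer weights than the dense layer in the first model, yet the epochs take much longer, because the LSTM has to step through all 500 positions one after another.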

In [155]:
model.compile(Adam(), loss='binary_crossentropy', metrics=['accuracy'])

In [156]:
model.fit(trn, y_train, validation_data=(test, y_test), batch_size=batch_size, epochs=4, verbose=2)

Train on 25000 samples, validate on 25000 samples
Epoch 1/4
759s - loss: 0.5809 - acc: 0.6852 - val_loss: 0.5291 - val_acc: 0.7811
Epoch 2/4
956s - loss: 0.3971 - acc: 0.8130 - val_loss: 1.4765 - val_acc: 0.5222
Epoch 3/4
1366s - loss: 0.2845 - acc: 0.8806 - val_loss: 0.4289 - val_acc: 0.8357
Epoch 4/4
2230s - loss: 0.2010 - acc: 0.9207 - val_loss: 0.3053 - val_acc: 0.8788


<keras.callbacks.History at 0x22b24f25630>