# Demo 4 - Image caption autoencoder

In [1]:
import mycoco
mycoco.setmode("train")

loading annotations into memory...
Done (t=13.19s)
creating index...
index created!
loading annotations into memory...
Done (t=0.89s)
creating index...
index created!


## Preparation

Let's get all the captions from COCO.

In [2]:
allcaptions = mycoco.query([['']])[0]

In [3]:
len(allcaptions)

118287

In [4]:
captions = mycoco.get_captions_for_ids(allcaptions)

In [5]:
captions[50000:50020]

['A subway station above groundtrain with a small green building.',
 'A very long elevated train is making its way through the city.',
 'A photo of four different train tracks and a train on one of them.',
 'A big train riding along on the train tracks',
 'The train is going down the railroad tracks. ',
 'A row of surfboards that are lined up on a beach.',
 'A bunch of surfboards leaning on cement in the sand.',
 'A group of surfboards sitting on top of a sandy beach.',
 'Two people sitting between surfboards propped up at the beach.',
 'Surfboards are line up against a parking lot.',
 'SOMEONE IS SKIING AND JUMPING HIGH OFF A MOUNTAIN',
 'A skier is flying in the air in the middle of a mountainous area.',
 'A person going down a snowy hill with skis.',
 'A skier is taking a large jump on a slope',
 'A photograph of a skier performing a stunt.',
 'A mouse sitting on book, with a Microsoft logo on it.',
 'THIS IS A PHOTO OF A MOUSE ON TOP OF A BOOK',
 'A computer mouse is sitting on the

Here we tokenize the sentences and convert them to lists of word indices (as integers), which we need to train neural networks.

In [6]:
from keras.preprocessing.text import Tokenizer, one_hot
from keras.utils import to_categorical

Using TensorFlow backend.


In [7]:
tokenizer = Tokenizer(num_words=10000)

In [8]:
tokenizer.fit_on_texts(captions)

In [9]:
sequences = tokenizer.texts_to_sequences(captions)

In [13]:
sequences[50000:50020]

[[1, 751, 220, 234, 6, 1, 34, 63, 67],
 [1, 138, 238, 1950, 40, 8, 458, 150, 400, 104, 4, 81],
 [1, 166, 3, 227, 188, 40, 196, 7, 1, 40, 2, 101, 3, 246],
 [1, 163, 40, 44, 218, 2, 4, 40, 196],
 [4, 40, 8, 271, 29, 4, 642, 196],
 [1, 406, 3, 507, 24, 17, 417, 32, 2, 1, 72],
 [1, 168, 3, 507, 457, 2, 747, 5, 4, 408],
 [1, 31, 3, 507, 11, 2, 30, 3, 1, 563, 72],
 [13, 16, 11, 410, 507, 1587, 32, 14, 4, 72],
 [507, 17, 360, 32, 313, 1, 199, 194],
 [362, 8, 243, 7, 283, 355, 211, 1, 256],
 [1, 353, 8, 83, 5, 4, 120, 5, 4, 216, 3, 1, 2289, 99],
 [1, 27, 271, 29, 1, 244, 222, 6, 156],
 [1, 353, 8, 225, 1, 25, 607, 2, 1, 230],
 [1, 453, 3, 1, 353, 618, 1, 1213],
 [1, 426, 11, 2, 374, 6, 1, 7635, 2161, 2, 26],
 [137, 8, 1, 166, 3, 1, 426, 2, 30, 3, 1, 374],
 [1, 111, 426, 8, 11, 2, 4, 2541, 3, 1, 374],
 [137, 8, 1, 426, 2, 1, 2541, 2, 1, 22],
 [1, 426, 11, 2, 1, 374, 142, 12, 5312, 3, 1, 130]]
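To make the mapping from words to integers less mysterious, here is a rough pure-Python sketch of the idea behind the tokenizer: words are ranked by frequency, the most frequent word gets index 1, and index 0 is never assigned. The helper names `fit_word_index` and `texts_to_sequences` below are our own stand-ins, and the tokenization regex and alphabetical tie-breaking are simplifying assumptions, so the exact indices will not match what Keras produces.

```python
import re
from collections import Counter

WORD_RE = re.compile(r"[a-z0-9']+")

def fit_word_index(texts):
    """Rank words by frequency: index 1 is the most frequent, 0 stays unused."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    # Ties are broken alphabetically here, which is a simplification.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}

def texts_to_sequences(texts, word_index, num_words=None):
    """Replace each known word by its index; drop indices >= num_words."""
    seqs = []
    for text in texts:
        ids = [word_index[w] for w in WORD_RE.findall(text.lower()) if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        seqs.append(ids)
    return seqs

toy_captions = ["A train on the tracks.", "A big train"]
toy_index = fit_word_index(toy_captions)
toy_sequences = texts_to_sequences(toy_captions, toy_index)
print(toy_index)
print(toy_sequences)
```

This is also why so many of the sequences above start with 1: the most frequent word in the captions gets the smallest index.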

In [14]:
sequences2 = [[0] + x + [0] for x in sequences]

In [15]:
sequences2[50000:50020]

[[0, 1, 751, 220, 234, 6, 1, 34, 63, 67, 0],
 [0, 1, 138, 238, 1950, 40, 8, 458, 150, 400, 104, 4, 81, 0],
 [0, 1, 166, 3, 227, 188, 40, 196, 7, 1, 40, 2, 101, 3, 246, 0],
 [0, 1, 163, 40, 44, 218, 2, 4, 40, 196, 0],
 [0, 4, 40, 8, 271, 29, 4, 642, 196, 0],
 [0, 1, 406, 3, 507, 24, 17, 417, 32, 2, 1, 72, 0],
 [0, 1, 168, 3, 507, 457, 2, 747, 5, 4, 408, 0],
 [0, 1, 31, 3, 507, 11, 2, 30, 3, 1, 563, 72, 0],
 [0, 13, 16, 11, 410, 507, 1587, 32, 14, 4, 72, 0],
 [0, 507, 17, 360, 32, 313, 1, 199, 194, 0],
 [0, 362, 8, 243, 7, 283, 355, 211, 1, 256, 0],
 [0, 1, 353, 8, 83, 5, 4, 120, 5, 4, 216, 3, 1, 2289, 99, 0],
 [0, 1, 27, 271, 29, 1, 244, 222, 6, 156, 0],
 [0, 1, 353, 8, 225, 1, 25, 607, 2, 1, 230, 0],
 [0, 1, 453, 3, 1, 353, 618, 1, 1213, 0],
 [0, 1, 426, 11, 2, 374, 6, 1, 7635, 2161, 2, 26, 0],
 [0, 137, 8, 1, 166, 3, 1, 426, 2, 30, 3, 1, 374, 0],
 [0, 1, 111, 426, 8, 11, 2, 4, 2541, 3, 1, 374, 0],
 [0, 137, 8, 1, 426, 2, 1, 2541, 2, 1, 22, 0],
 [0, 1, 426, 11, 2, 1, 374, 142, 12, 5312

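Note that we put a 0 at the beginning and end of each sentence. The Keras tokenizer never assigns index 0 to a word, so 0 is free to act as a sentence-boundary marker.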

## A scaled-down word2vec-like autoencoder

We're going to create a feed-forward network that represents an autoencoder similar to the skip-gram model of word2vec --- the word in focus is going to predict its immediate neighbour before and after.  This window is narrower than word2vec, and we are not going to implement negative sampling.

So we need to split the text into training samples:

In [21]:
def create_training(seqs):
    collect = []
    for seq in seqs:
        for i in range(1, len(seq)-1):
            collect.append((seq[i], [seq[i-1],seq[i+1]]))
    return [x[0] for x in collect], [x[1] for x in collect]

In [22]:
numtrain_X, numtrain_y = create_training(sequences2[0:1000])
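As a quick sanity check, we can run the same function on a single toy sequence and look at the pairs it produces. The toy indices (5, 7, 9) are made up for illustration; the function body is copied from the cell above so that the snippet runs on its own.

```python
def create_training(seqs):
    collect = []
    for seq in seqs:
        # Skip the first and last positions: those are the padding zeros.
        for i in range(1, len(seq) - 1):
            collect.append((seq[i], [seq[i - 1], seq[i + 1]]))
    return [x[0] for x in collect], [x[1] for x in collect]

# One padded "sentence" with three words.
toy_X, toy_y = create_training([[0, 5, 7, 9, 0]])
print(toy_X)  # the focus words
print(toy_y)  # [word before, word after] for each focus word
```

Each word of the sentence becomes one training sample, and the first and last words get the padding index 0 as one of their neighbours.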

X is the word in question, y0 is the word before it, and y1 is the word after it. That's why we need to pad the sentences with zeros.  Now we can convert these into "one-hot" vectors:

In [23]:
train_X = [to_categorical(x, num_classes=10000) for x in numtrain_X]
train_y0 = [to_categorical(y[0], num_classes=10000) for y in numtrain_y]
train_y1 = [to_categorical(y[1], num_classes=10000) for y in numtrain_y]

In [24]:
len(train_X), len(numtrain_y)

(10515, 10515)

In [25]:
train_X[0].shape, len(train_y0), len(train_y1)

((10000,), 10515, 10515)

In [26]:
train_X[0], train_y0[0]

(array([0., 0., 0., ..., 0., 0., 0.], dtype=float32),
 array([1., 0., 0., ..., 0., 0., 0.], dtype=float32))
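The output above shows `train_y0[0]` with a 1 in position 0: the first focus word of a sentence has the padding index 0 as its left neighbour. A minimal NumPy sketch of what a one-hot conversion does, with a toy vocabulary of 8 instead of 10000 (the helper `one_hot` is our own stand-in, not the Keras function):

```python
import numpy as np

def one_hot(index, num_classes):
    """Return a float vector of zeros with a single 1 at position `index`."""
    vec = np.zeros(num_classes, dtype=np.float32)
    vec[index] = 1.0
    return vec

focus = one_hot(5, num_classes=8)   # some focus word with index 5
before = one_hot(0, num_classes=8)  # its left neighbour is the padding index
print(focus)
print(before)
```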

## Model design

This is an autoencoder, so we are going to compress the input representation to a smaller vector space, here 100-dimensional.  But we then need to expand the compressed representation back into predictors for words in our original vector space.  The prediction is done via softmaxes over dense layers of the right size.  In class we had one dense layer, but that does not make sense for two softmaxes, so here each softmax gets its own dense layer.

In [23]:
from keras import Model
from keras.layers import Input, Dense, Activation

In [28]:
inputlayer = Input(shape=(10000,))
encoder = Dense(100)(inputlayer)
decoder1 = Dense(10000)(encoder)
decoder2 = Dense(10000)(encoder)

activation1 = Activation('softmax')(decoder1)
activation2 = Activation('softmax')(decoder2)

model = Model(inputs=[inputlayer], outputs=[activation1, activation2])

In [29]:
model.compile('adam', loss='categorical_crossentropy', metrics=['accuracy'])
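To see what the layers above compute, here is a small NumPy sketch of the forward pass with toy sizes (12 and 4 standing in for 10000 and 100). The weights are random and all variable names are ours; it only illustrates the shapes and the two separate softmax heads, not the trained model.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
vocab, hidden = 12, 4  # toy stand-ins for 10000 and 100

# One shared encoder and two independent decoder heads.
W_enc, b_enc = rng.normal(size=(vocab, hidden)), np.zeros(hidden)
W_prev, b_prev = rng.normal(size=(hidden, vocab)), np.zeros(vocab)
W_next, b_next = rng.normal(size=(hidden, vocab)), np.zeros(vocab)

x = np.zeros((1, vocab))
x[0, 3] = 1.0  # one-hot focus word with index 3

code = x @ W_enc + b_enc                  # compressed representation
p_prev = softmax(code @ W_prev + b_prev)  # distribution over the word before
p_next = softmax(code @ W_next + b_next)  # distribution over the word after

n_params = sum(a.size for a in (W_enc, b_enc, W_prev, b_prev, W_next, b_next))
print(code.shape, p_prev.shape, p_next.shape, n_params)
```

With the real sizes the same counting gives 10000 * 100 + 100 = 1000100 parameters for the encoder and 100 * 10000 + 10000 = 1010000 for each decoder head.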

In [30]:
model.summary()

__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            (None, 10000)        0                                            
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 100)          1000100     input_2[0][0]                    
__________________________________________________________________________________________________
dense_5 (Dense)                 (None, 10000)        1010000     dense_4[0][0]                    
__________________________________________________________________________________________________
dense_6 (Dense)                 (None, 10000)        1010000     dense_4[0][0]                    
__________________________________________________________________________________________________
activation

## Model training and vector extraction

In [31]:
model.fit([train_X], [train_y0, train_y1], batch_size=40, epochs=40)

Epoch 1/40
Epoch 2/40
Epoch 3/40
Epoch 4/40
Epoch 5/40
Epoch 6/40
Epoch 7/40
Epoch 8/40
Epoch 9/40
Epoch 10/40
Epoch 11/40
Epoch 12/40
Epoch 13/40
Epoch 14/40
Epoch 15/40
Epoch 16/40
Epoch 17/40
Epoch 18/40
Epoch 19/40
Epoch 20/40
Epoch 21/40
Epoch 22/40
Epoch 23/40
Epoch 24/40
Epoch 25/40
Epoch 26/40
Epoch 27/40
Epoch 28/40
Epoch 29/40
Epoch 30/40
Epoch 31/40
Epoch 32/40
Epoch 33/40
Epoch 34/40
Epoch 35/40
Epoch 36/40
Epoch 37/40
Epoch 38/40
Epoch 39/40
Epoch 40/40


<keras.callbacks.History at 0x7fc98dd10b38>

In [32]:
model.get_weights()[1]

array([ 0.14314662,  0.16355611, -0.07532723,  0.07635795, -0.41619694,
       -0.14242716,  0.1496203 ,  0.07238261,  0.10004503,  0.18769394,
        0.12176848, -0.18580864,  0.20597501, -0.11215528,  0.32767844,
       -0.145422  , -0.14715248,  0.11516788, -0.1380871 , -0.04156268,
        0.12133057,  0.19974759,  0.26606095, -0.20170295, -0.20935528,
       -0.12681833,  0.03637717,  0.15068875, -0.24225724, -0.00805199,
        0.1123011 ,  0.06966335, -0.23864731,  0.18234386, -0.07011341,
        0.01986277, -0.15526071,  0.11711799,  0.05910657, -0.15329146,
        0.07250773,  0.24581744,  0.0720281 , -0.076203  , -0.09033831,
        0.19663027,  0.2895875 ,  0.10339213,  0.1663343 ,  0.20417655,
        0.16058923,  0.16953008,  0.1681371 , -0.26785317,  0.14404684,
       -0.23360181, -0.1335926 ,  0.1905494 ,  0.0417267 ,  0.13558243,
       -0.01929747, -0.15311107,  0.16830768,  0.2660324 , -0.13602369,
        0.17998797,  0.07145264,  0.15956768, -0.19569813,  0.19

These weights (index 1 holds the bias vector of the 100-dimensional encoder layer) are not meaningful by themselves. We need to apply the encoder to each word input to get the embedding we want.  So we copy the weights to a new network that just has the encoder layer.

In [33]:
inputlayer2 = Input(shape=(10000,))
encoder2 = Dense(100)(inputlayer2)

model2 = Model(inputs=[inputlayer2], outputs=[encoder2])

In [34]:
model2.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_3 (InputLayer)         (None, 10000)             0         
_________________________________________________________________
dense_7 (Dense)              (None, 100)               1000100   
Total params: 1,000,100
Trainable params: 1,000,100
Non-trainable params: 0
_________________________________________________________________


In [35]:
model2.set_weights(model.get_weights()[0:2])

In [36]:
len(train_X)

10515

In [37]:
import numpy as np
limited_vocab = np.unique(train_X, axis=0)

In [38]:
limited_vocab

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       [0., 1., 0., ..., 0., 0., 0.]], dtype=float32)

What we just did was get the input vocabulary by finding all unique vectors.
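One thing to keep in mind is that `np.unique` with `axis=0` returns the rows in sorted order, so row `i` of the result is not the word with index `i`. A toy sketch of how we could recover the word index for each row with `argmax` (the names `toy_ids`, `toy_vocab` and `toy_vocab_ids` are ours):

```python
import numpy as np

toy_ids = [2, 1, 2, 3]                       # focus-word indices, with a repeat
toy_onehots = np.eye(4, dtype=np.float32)[toy_ids]

toy_vocab = np.unique(toy_onehots, axis=0)   # unique rows, sorted lexicographically
toy_vocab_ids = toy_vocab.argmax(axis=1)     # word index of each unique row

print(toy_vocab)
print(toy_vocab_ids)
```

The rows come out with the highest word index first, which matches the output above, where the last row has its 1 in position 1.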

In [39]:
limited_vocab.shape

(1405, 10000)

In [40]:
limited_vocab[0].shape

(10000,)

In [41]:
predictions = model2.predict(limited_vocab)

In [42]:
predictions[1]

array([ 0.18191895,  0.7673938 , -0.42503676,  0.21641074, -0.43395153,
       -0.7462448 ,  0.5142629 ,  0.800788  ,  0.57447726,  0.70238596,
        0.3525356 ,  0.30519813,  0.4696656 , -0.31868902,  0.12342709,
       -0.12800355, -0.3328032 ,  0.50857294, -0.32948518, -0.01759833,
        0.70950234, -0.02983892,  0.32943922,  0.10977852, -0.4553824 ,
       -0.5262557 , -0.03285315,  0.75082207, -0.13913803,  0.14193931,
       -0.0392445 ,  0.3051605 , -0.24927881,  0.2694665 ,  0.28161168,
        0.59672505,  0.03770941,  0.7673499 , -0.31700903, -0.24245512,
        0.22706467,  0.12413344,  0.01418295, -0.04698663, -0.524775  ,
       -0.44773534, -0.29795176, -0.1553868 ,  0.8976282 ,  0.6257448 ,
        0.40681547, -0.08134015, -0.2900033 , -0.23900709, -0.00960328,
       -0.55789924,  0.0084601 ,  0.5626581 , -0.64054185,  0.52611536,
       -0.6000326 , -0.09984165,  0.0429983 ,  0.58272547,  0.08460243,
        0.33115682,  0.19802988,  0.4621813 , -0.34791082,  0.93

In [43]:
predictions[2]

array([-0.25502354,  0.04944277, -0.5670153 , -0.29917794, -0.69977474,
       -0.66997534,  0.28384385,  0.42144537,  0.04923905,  0.21359417,
       -0.21635136, -0.59145594, -0.1079127 , -0.548488  ,  0.74762267,
       -0.6050516 ,  0.17679739,  0.60143137, -0.63196814, -0.44900575,
        0.18229765,  0.36886823,  0.5242398 , -0.5135215 , -0.68832326,
        0.22165588, -0.4248374 , -0.31730098, -0.6920254 ,  0.0761136 ,
        0.5636907 ,  0.5056648 , -0.697595  , -0.28111702, -0.5325645 ,
       -0.38215536,  0.22438112,  0.08300776,  0.33157372, -0.6050271 ,
       -0.258968  ,  0.6488949 ,  0.58741826, -0.2831214 ,  0.30824584,
        0.61425304,  0.7610241 ,  0.5610622 , -0.16325665,  0.6506476 ,
        0.1443589 ,  0.6208735 ,  0.06706424, -0.67102075,  0.33918086,
        0.13869268, -0.55294687,  0.6399654 , -0.38942868,  0.28037673,
        0.49838746,  0.10069683, -0.11020742,  0.7295141 , -0.5611611 ,
        0.4756082 , -0.01054832, -0.17739466, -0.44328785,  0.61

Every vocabulary item has a different 100-dimensional vector output from the model now. They can be used as embeddings for other tasks, such as clustering.
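Because the encoder is a plain Dense layer without an activation, feeding it a one-hot vector just picks out one row of the weight matrix and adds the bias. The NumPy sketch below checks this on random toy weights. For the real model it would correspond to `model.get_weights()[0] + model.get_weights()[1]`, assuming the usual Keras ordering of kernel first and bias second.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab, hidden = 6, 3  # toy stand-ins for 10000 and 100

kernel = rng.normal(size=(vocab, hidden))  # stands in for the encoder kernel
bias = rng.normal(size=hidden)             # stands in for the encoder bias

onehots = np.eye(vocab)                    # one one-hot row per word index

via_predict = onehots @ kernel + bias      # what the encoder-only model computes
via_lookup = kernel + bias                 # row i is the embedding of word i

print(np.allclose(via_predict, via_lookup))
```

So the prediction step is convenient but not strictly required: the embeddings can also be read straight off the weights.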

## LSTM-based autoencoder for sequences

Just for illustration purposes, we take a small subset of the sequences.

In [10]:
smallseqs = sequences[0:1000]

In [11]:
from keras.preprocessing.sequence import pad_sequences

In [38]:
from keras.layers import Embedding, LSTM, TimeDistributed, Dropout

If we want to learn vectors from sequences, the sequences need to have the same length (because we can't multiply matrices with variable sizes, and all of this is just fancy matrix multiplication).

In [13]:
paddedseqs = pad_sequences(smallseqs)

In [14]:
paddedseqs.shape

(1000, 36)
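By default, `pad_sequences` pads with zeros at the front of each sequence up to the length of the longest one, which is why we get 36 columns. A minimal NumPy sketch of that default behaviour (the helper `pad_pre` is our own, and it ignores the truncation options of the real function):

```python
import numpy as np

def pad_pre(seqs, value=0):
    """Left-pad every sequence with `value` to the length of the longest one."""
    maxlen = max(len(s) for s in seqs)
    out = np.full((len(seqs), maxlen), value, dtype=np.int32)
    for row, s in enumerate(seqs):
        if s:
            out[row, maxlen - len(s):] = s
    return out

toy_padded = pad_pre([[1, 2, 3], [4], [5, 6]])
print(toy_padded)
```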

We then create the categorical one-hot vectors for each sequence, because this is what we predict.

In [15]:
catseqs = to_categorical(paddedseqs)

In [16]:
catseqs.shape

(1000, 36, 9425)

This is a small hack to make sure we got the right vocab size.  The dimensionality of the categorical vectors is always one more than the largest word index, because index 0 also needs a slot.

In [17]:
import numpy as np
dim = max(np.unique(paddedseqs))
dim

9424
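The relationship between the largest index and the last dimension can be checked on a toy array. This sketch uses `np.eye` indexing as a stand-in for the one-hot conversion, and the array values are made up:

```python
import numpy as np

toy_padded = np.array([[0, 2, 1],
                       [0, 0, 3]])          # two padded sequences of length 3

num_classes = int(toy_padded.max()) + 1     # largest index plus one slot for 0
toy_cat = np.eye(num_classes, dtype=np.float32)[toy_padded]

print(toy_cat.shape)                        # (sequences, positions, classes)
```

Taking `argmax` over the last axis gives back the original integer indices, so nothing is lost in the conversion.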

Now we build the model:

In [39]:
seqlayer = Input(shape=(36,))
emblayer = Embedding(9425, 100, input_length=36)(seqlayer)
lstm1 = LSTM(100, return_sequences=True)(emblayer)
dropoutlayer = Dropout(0.1)(lstm1)
lstm2 = LSTM(100, return_sequences=True)(dropoutlayer)
tdlayer = TimeDistributed(Dense(9425))(lstm2)
softmaxlayer = Activation('softmax')(tdlayer)

model = Model(inputs=[seqlayer], outputs=[softmaxlayer])

We are going to use the Embedding layer to learn word embeddings this time. It needs to know the vocabulary size (9425), the size of the vectors we want from it (100), and the maximum length of the sentences (36); it does the rest of the work.  Then we have a couple of LSTM layers with dropout between them (again, how much dropout works well is an empirical question and a matter of judgement).  We then need to predict the sequence, which is what the TimeDistributed layer does: it applies the same vocab-sized Dense layer at every position of the sequence.  And we predict the *current* word via softmax.
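The parameter counts that `model.summary()` reports for this model can be worked out by hand. The arithmetic below uses the standard LSTM count of four gates, each with input weights, recurrent weights and a bias; the variable names are ours.

```python
vocab_size, emb_dim, units = 9425, 100, 100

# Embedding: one emb_dim-sized vector per word index.
embedding_params = vocab_size * emb_dim

# LSTM: 4 gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (units * (emb_dim + units) + units)

# Dense inside TimeDistributed: shared across all 36 positions, so counted once.
dense_params = units * vocab_size + vocab_size

print(embedding_params, lstm_params, dense_params)
```

Note that the sequence length (36) does not appear anywhere in these counts, because the same weights are reused at every position.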

In [40]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 36)                0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 36, 100)           942500    
_________________________________________________________________
lstm_8 (LSTM)                (None, 36, 100)           80400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 36, 100)           0         
_________________________________________________________________
lstm_9 (LSTM)                (None, 36, 100)           80400     
_________________________________________________________________
time_distributed_3 (TimeDist (None, 36, 9425)          951925    
_________________________________________________________________
activation_3 (Activation)    (None, 36, 9425)          0         
Total para

In [41]:
model.compile('rmsprop', loss="categorical_crossentropy", metrics=["accuracy"])

Note that when we train the model, we give the padded integer indices as input for the Embedding layer, but as the target output we give it the categorical vectors.

In [42]:
model.fit([paddedseqs], [catseqs], epochs=30)

Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30


<keras.callbacks.History at 0x7feaa7bdc2e8>

Convergence happens very quickly because our data is super small...but wait, after we get past the 30th epoch it suddenly takes off! (Try this without dropout.) Sometimes it pays to wait a bit.  We can re-run fit on the current weights repeatedly and keep training it until it really converges.

## Extract weights from the embeddings layer

In [43]:
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_4 (InputLayer)         (None, 36)                0         
_________________________________________________________________
embedding_5 (Embedding)      (None, 36, 100)           942500    
_________________________________________________________________
lstm_8 (LSTM)                (None, 36, 100)           80400     
_________________________________________________________________
dropout_1 (Dropout)          (None, 36, 100)           0         
_________________________________________________________________
lstm_9 (LSTM)                (None, 36, 100)           80400     
_________________________________________________________________
time_distributed_3 (TimeDist (None, 36, 9425)          951925    
_________________________________________________________________
activation_3 (Activation)    (None, 36, 9425)          0         
Total para

In [46]:
model.get_weights()[0][1]

array([ 0.08595851, -0.1348605 , -0.05951789, -0.09311055, -0.04006767,
       -0.13085993, -0.17875338, -0.12385153,  0.1233968 ,  0.07179775,
       -0.04407612, -0.04639833,  0.16350667,  0.15804137, -0.16682605,
        0.07753159, -0.14385594,  0.04069513, -0.07771698,  0.1258528 ,
        0.05372755,  0.07482429, -0.04459634, -0.08102922,  0.09292468,
        0.18648   ,  0.15452106, -0.07762194,  0.12481114,  0.10383981,
       -0.04829976,  0.17528687,  0.05951104,  0.05654939, -0.08203832,
       -0.17533015,  0.08978003, -0.04999898, -0.16869149,  0.11418618,
        0.12412349, -0.07800698, -0.15448354, -0.05552231,  0.0884326 ,
       -0.13845249,  0.0678155 , -0.15825325, -0.08738711, -0.11881547,
        0.10172014,  0.10572896, -0.13061467,  0.04477144,  0.04484133,
       -0.0725379 ,  0.11765544,  0.12547421,  0.1331902 , -0.15013619,
       -0.13198432, -0.0692712 , -0.17415167, -0.08659714,  0.10983732,
        0.07686424,  0.03690993,  0.09399714,  0.14904135,  0.12

In [47]:
model.get_weights()[0][400]

array([ 0.00440912, -0.04126433,  0.02153069,  0.01764884,  0.02494913,
       -0.00458091, -0.02202054, -0.02562576, -0.01499826,  0.02869104,
       -0.02300212,  0.02751501, -0.03722743,  0.00067579, -0.03031388,
        0.03184034,  0.00276793, -0.0089811 , -0.04803849,  0.04574609,
       -0.01663978,  0.00677484,  0.02859687,  0.01374232,  0.01842347,
       -0.00318735, -0.01623718, -0.00460901, -0.02634054,  0.04690667,
       -0.01026788,  0.02768691, -0.02393703,  0.00526467, -0.02245242,
        0.01438989, -0.01939533,  0.02299419, -0.01693181, -0.02457434,
        0.02575289, -0.0399997 , -0.0349864 , -0.01022828, -0.01965418,
       -0.04237857,  0.02159353,  0.01240896,  0.02745398, -0.04243142,
        0.04358572,  0.03999844, -0.04961624,  0.04602572,  0.05444944,
       -0.02967173, -0.01196242,  0.03462901,  0.04673612,  0.01505011,
        0.0179581 , -0.0021507 ,  0.0106782 , -0.04382759, -0.00679799,
        0.02960026,  0.00406672,  0.02004926, -0.02685613,  0.03

The 0th entry of `model.get_weights()` contains the weights for the Embedding layer (the Input layer doesn't *have* weights).  Each integer word index selects the corresponding row of that matrix, which is the word's embedding.  This way, you can get the word vectors out and cluster, etc., as before, or use them to train another model.
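As one example of using the vectors, here is a small NumPy sketch that finds the nearest neighbours of a word by cosine similarity. The matrix `toy_emb` is a made-up stand-in for `model.get_weights()[0]`, and the helper `nearest` is our own. With real embeddings we would probably want to leave out row 0, since that is the padding index.

```python
import numpy as np

def nearest(emb, idx, k=3):
    """Return the k row indices most cosine-similar to row `idx`, excluding itself."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sims = unit @ unit[idx]
    order = np.argsort(-sims)  # most similar first
    return [int(j) for j in order if j != idx][:k]

# Four toy 2-dimensional "embeddings".
toy_emb = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0],
                    [-1.0, 0.0]])

print(nearest(toy_emb, 0, k=2))
```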