## Notes

Link: https://www.tensorflow.org/alpha/tutorials/text/word_embeddings

- We want a dense representation, so one-hot encoding (sparse, one dimension per vocabulary word) is a poor fit
- We want embeddings to capture similarity (similar words have similar embeddings), so an arbitrary integer index (e.g. a frequency rank) is a poor fit
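
To make the two points above concrete, here is a small NumPy sketch. The three-word vocabulary and the dense vectors are hand-picked for illustration (in practice the vectors are learned), so treat the numbers as assumptions rather than real embeddings.

```python
import numpy as np

vocab = ["good", "great", "terrible"]          # hypothetical toy vocabulary
index = {w: i for i, w in enumerate(vocab)}    # arbitrary integer index

# One-hot: one dimension per word, all entries zero except one.
one_hot = np.eye(len(vocab))

# Distinct one-hot vectors always have dot product 0, so "good" is
# no closer to "great" than it is to "terrible".
print(one_hot[index["good"]] @ one_hot[index["great"]])     # 0.0
print(one_hot[index["good"]] @ one_hot[index["terrible"]])  # 0.0

# Dense embedding: a short float vector per word (hand-picked values).
dense = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [-0.7, 0.3]])

def cosine(a, b):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# With dense vectors, similarity can differ between word pairs.
print(cosine(dense[0], dense[1]) > cosine(dense[0], dense[2]))  # True
```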

Embedding layer

- Takes (batch, sequence_length) as input, e.g. 32 sentences, each a vector of integer word indices; all sentences within a batch must have the same length (so convert words to indices and pad first)
- sequence_length can differ from batch to batch; the model below uses a GlobalAveragePooling1D layer to reduce any sequence length to a fixed-size vector
- Output: (batch, sequence_length, embedding_dimensionality), i.e. each word in each sentence becomes a floating-point vector of size embedding_dimensionality
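
The shapes above can be checked with a NumPy stand-in for the layer. This is only a sketch: `table` plays the role of the embedding weights and is filled with arbitrary random values, and the mean over axis 1 mirrors what an unmasked average-pooling step would compute.

```python
import numpy as np

vocab_size, embedding_dim = 1000, 32
batch, sequence_length = 4, 7

rng = np.random.RandomState(0)
# Stand-in for the embedding weights: one row per word index.
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# Integer-encoded sentences, all of the same length within the batch.
ids = rng.randint(0, vocab_size, size=(batch, sequence_length))

embedded = table[ids]           # lookup: one row of `table` per integer id
pooled = embedded.mean(axis=1)  # average over the sequence dimension

print(embedded.shape)  # (4, 7, 32)
print(pooled.shape)    # (4, 32)
```

Because the pooled shape does not depend on `sequence_length`, the layers after the pooling step see the same input size whatever the sentence length.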

Gotchas
- numpy has to be 1.16.2

Q
- So we always need to have labeled data to train embeddings...?

In [1]:
from __future__ import absolute_import, division, print_function, unicode_literals
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

In [2]:
import numpy as np
np.__version__

'1.16.2'

In [3]:
embedding_layer = layers.Embedding(1000, 32) # (vocabulary size, embedding output size)

In [6]:
# data
vocab_size = 10000
imdb = keras.datasets.imdb
# labels: 1 for positive; 0 for negative
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=vocab_size)
print(train_data.shape, train_labels.shape, test_data.shape)

(25000,) (25000,) (25000,)


In [19]:
print(train_labels[0:10])
print(train_data[0])

[1 0 0 1 0 0 1 0 1 0]
[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


In [26]:
word_index = imdb.get_word_index()

print(len(word_index)) # 88584

# The first indices are reserved
# Todo: should we increase vocab_size by 4 then? no -- see below
word_index = {k:(v+3) for k,v in word_index.items()}
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  
word_index["<UNUSED>"] = 3 # never used

reverse_word_index = {value: key for key, value in word_index.items()}

def decode_review(text):
    return ' '.join([reverse_word_index.get(i, '?') for i in text])

decode_review(train_data[2])

88584


"<START> this has to be one of the worst films of the 1990s when my friends i were watching this film being the target audience it was aimed at we just sat watched the first half an hour with our jaws touching the floor at how bad it really was the rest of the time everyone else in the theatre just started talking to each other leaving or generally crying into their popcorn that they actually paid money they had <UNK> working to watch this feeble excuse for a film it must have looked like a great idea on paper but on film it looks like no one in the film has a clue what is going on crap acting crap costumes i can't get across how <UNK> this is to watch save yourself an hour a bit of your life"

In [44]:
# does train_data contain the special tokens?
# result: <START> and <UNK> occur; <PAD> and <UNUSED> do not
def contains(i):
    # True if index i appears in at least one review
    return any(i in l for l in train_data)
for i in [0,1,2,3]:
    print(contains(i))

False
True
True
False


In [33]:
# check that the dataset has at most 10000 distinct word indices
# padding adds <PAD> (index 0), giving 9999 distinct indices in train data, so vocab_size = 10000 is safe
s = set()
for l in train_data:
    s.update(l)
print("train data vocab size:", len(s))
s = set()
for l in test_data:
    s.update(l)
print("test data vocab size:", len(s))

train data vocab size: 9998
test data vocab size: 9951


In [25]:
# max length of train_data and test_data
print("max length of train data", max(len(x) for x in train_data))
print("max length of test data", max(len(x) for x in test_data))
# the tutorial truncates to 500; todo: try a larger maxlen

max length of train data 2494
max length of test data 2315


In [45]:
# pad (or truncate) every review to maxlen tokens
maxlen = 500

train_data = keras.preprocessing.sequence.pad_sequences(train_data,
                                                        value=word_index["<PAD>"],
                                                        padding='post',
                                                        maxlen=maxlen)

test_data = keras.preprocessing.sequence.pad_sequences(test_data,
                                                       value=word_index["<PAD>"],
                                                       padding='post',
                                                       maxlen=maxlen)
len(train_data[0])

500
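
What the padding cell does to each review can be sketched in plain Python. `pad_post` is a hypothetical helper, not part of Keras. It assumes the Keras default `truncating='pre'`, which the cell above does not override, so a review longer than `maxlen` keeps its last `maxlen` tokens and loses its beginning.

```python
def pad_post(sequences, maxlen, value=0):
    """Sketch of pad_sequences(..., padding='post', truncating='pre')."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]             # 'pre' truncation: keep the last maxlen tokens
        pad = [value] * (maxlen - len(seq))   # 'post' padding: pad values go at the end
        padded.append(seq + pad)
    return padded

print(pad_post([[1, 14, 22], [1, 5, 6, 7, 8, 9]], maxlen=4))
# [[1, 14, 22, 0], [6, 7, 8, 9]]
```

In the second sequence the leading `1` (the `<START>` index) is dropped by the truncation, which is worth keeping in mind for the reviews longer than 500 tokens.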

## Model

In [76]:
embedding_dim=16

model = keras.Sequential([
  layers.Embedding(vocab_size, embedding_dim, input_length=maxlen), # zero initializer does not work
  # average over all words in a sentence; this is the simplest way to handle variable-length sentences
  layers.GlobalAveragePooling1D(),
  layers.Dense(16, activation='relu'), # accuracy ~85% without this layer, ~88% with it
  layers.Dense(1, activation='sigmoid')
])

model.summary()

Model: "sequential_7"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_8 (Embedding)      (None, 500, 16)           160000    
_________________________________________________________________
global_average_pooling1d_7 ( (None, 16)                0         
_________________________________________________________________
dense_11 (Dense)             (None, 16)                272       
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 17        
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
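
The parameter counts in the summary can be reproduced by hand. The variable names below are just for illustration; the sizes are the ones used in the model cell.

```python
vocab_size, embedding_dim, hidden_units = 10000, 16, 16

embedding_params = vocab_size * embedding_dim                # one 16-dim row per word index
hidden_params = embedding_dim * hidden_units + hidden_units  # Dense(16): weights + biases
output_params = hidden_units * 1 + 1                         # Dense(1): weights + bias

print(embedding_params, hidden_params, output_params)    # 160000 272 17
print(embedding_params + hidden_params + output_params)  # 160289
```

Almost all of the parameters sit in the embedding table, and the pooling layer contributes none.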


In [77]:
model.compile(optimizer='adam',
              loss='binary_crossentropy', # important: matches the single sigmoid output
              metrics=['accuracy'])

print("starting weights:")
unused_before = model.layers[0].get_weights()[0][3]
print(unused_before)
print(model.layers[0].get_weights()[0][10])

history = model.fit(
    train_data,
    train_labels,
    epochs=30,
    batch_size=512,
    validation_split=0.2)

print("weights after training")
unused_after = model.layers[0].get_weights()[0][3]
print(unused_after)
print(model.layers[0].get_weights()[0][10])

print("Is the <UNUSED> weight unchanged?") # yes
unused_before == unused_after

starting weights:
[-0.02960713  0.04779026  0.0245004   0.02629631  0.04664708  0.00642449
 -0.04180528 -0.00649091  0.02941719 -0.03295922  0.037051    0.03730576
 -0.02007076 -0.00246441  0.01029216  0.04189518]
[ 0.04955132 -0.04416816  0.03132464 -0.01327448 -0.01478958  0.04752263
 -0.04614146 -0.00361229 -0.0139159  -0.04050081 -0.02818963  0.04721266
 -0.04109221  0.04445548 -0.03769999 -0.02909354]
Train on 20000 samples, validate on 5000 samples
Epoch 1/30
Epoch 2/30
Epoch 3/30
Epoch 4/30
Epoch 5/30
Epoch 6/30
Epoch 7/30
Epoch 8/30
Epoch 9/30
Epoch 10/30
Epoch 11/30
Epoch 12/30
Epoch 13/30
Epoch 14/30
Epoch 15/30
Epoch 16/30
Epoch 17/30
Epoch 18/30
Epoch 19/30
Epoch 20/30
Epoch 21/30
Epoch 22/30
Epoch 23/30
Epoch 24/30
Epoch 25/30
Epoch 26/30
Epoch 27/30
Epoch 28/30
Epoch 29/30
Epoch 30/30
weights after training
[-0.02960713  0.04779026  0.0245004   0.02629631  0.04664708  0.00642449
 -0.04180528 -0.00649091  0.02941719 -0.03295922  0.037051    0.03730576
 -0.02007076 -0.00246

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True])
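
One way to see why the `<UNUSED>` row stays put: a lookup only sends gradient to the rows it actually read. The NumPy sketch below is an illustration under stated assumptions (a tiny hypothetical table, an all-ones upstream gradient, unmasked mean pooling), not a reproduction of the Keras training step.

```python
import numpy as np

vocab_size, embedding_dim = 6, 2
rng = np.random.RandomState(0)
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# Two padded "sentences"; index 3 never appears, index 0 is padding.
ids = np.array([[1, 4, 2, 0],
                [1, 5, 4, 0]])
seq_len = ids.shape[1]

# Hypothetical gradient arriving at the pooled (batch, embedding_dim) output.
upstream = np.ones((ids.shape[0], embedding_dim))

# Mean pooling gives every position an equal 1/seq_len share of the gradient.
per_position = np.broadcast_to((upstream / seq_len)[:, None, :],
                               ids.shape + (embedding_dim,))

# The lookup routes each position's share back to the row that was read.
grad = np.zeros_like(table)
np.add.at(grad, ids, per_position)

print(np.count_nonzero(grad, axis=1))  # [2 2 2 0 2 2]
```

Row 3 receives no gradient, so an optimizer that only moves weights in response to gradients leaves it at its initial value. Row 0 does receive gradient in this sketch, because the padding index is averaged along with the real words.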

In [78]:
# test accuracy
test_loss, test_accuracy = model.evaluate(test_data, test_labels, batch_size = 512)



In [79]:
# get the trained embeddings
e = model.layers[0]
embeddings = e.get_weights()[0]
embeddings.shape

(10000, 16)
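
A quick way to check whether similar words ended up with similar embeddings is a cosine-similarity nearest-neighbour lookup over the rows of the matrix. `nearest` is a hypothetical helper, and the small `toy` matrix stands in for the trained (10000, 16) array so the sketch runs on its own.

```python
import numpy as np

def nearest(embeddings, index, k=3):
    """Return the k row indices most cosine-similar to row `index`, excluding itself."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.maximum(norms, 1e-12)  # guard against all-zero rows
    sims = unit @ unit[index]
    sims[index] = -np.inf                         # exclude the query row
    return np.argsort(-sims)[:k]

# Stand-in matrix: row 1 is nearly parallel to row 0, row 3 points the opposite way.
toy = np.array([[1.0, 0.0],
                [0.9, 0.1],
                [0.0, 1.0],
                [-1.0, 0.0]])
print(nearest(toy, 0, k=2))  # [1 2]
```

With the notebook's variables, something like `[reverse_word_index.get(i, '?') for i in nearest(embeddings, word_index["good"])]` should list the neighbouring words, assuming the chosen word is within the first 10000 indices.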