##### Copyright 2018 The TensorFlow Authors.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Introduction to Word Embeddings

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href=""><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href=""><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href=""><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

This tutorial shows how to train a sentiment classifier on the IMDB dataset using learned word embeddings. As a bonus, we show how to visualize these embeddings in the [TensorFlow Embedding Projector](http://projector.tensorflow.org).

First, here's a bit of background. Before we can build a model to predict the sentiment of a review, we need a way to represent the words of the review as numbers, so they can be processed by our network. There are several strategies for converting words to numbers.

As a first attempt, we might one-hot encode each word. One problem with this approach is efficiency. A one-hot encoded vector is sparse (meaning most of its elements are zero). Imagine we have 10,000 words in our vocabulary. To one-hot encode each one, we would create a vector where 99.99% of the elements are zero!
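
To make the sparsity concrete, here is a minimal sketch of one-hot encoding in plain Python. The six-word vocabulary and the `one_hot` helper are assumptions made up for illustration, not part of the tutorial's model.

```python
# Toy vocabulary; the words and their order are assumptions for illustration.
vocab = ["the", "dog", "cat", "sat", "on", "mat"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a list with a single 1 at the word's index and 0 elsewhere."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector

dog_vector = one_hot("dog")
print(dog_vector)  # [0, 1, 0, 0, 0, 0]

# With a 10,000-word vocabulary, all but one element of each vector is zero.
vocab_size = 10000
zero_fraction = (vocab_size - 1) / vocab_size
print("{:.2%} of the elements are zero".format(zero_fraction))
```

Each vector grows with the vocabulary, yet it only ever stores a single non-zero value.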

Instead, we can encode each word using a unique number. For example, we might assign 1 to 'the', 42 to 'dog', 96 to 'cat', and so on. Using these numbers, we could encode a sentence like "The dog and cat sat on the mat" as \[1, 42, ..., 96, ...\]. One problem still remains. Although we know dogs and cats are related, our representation doesn't encode that information for the classifier (the numbers 42 and 96 were chosen arbitrarily).
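
The integer encoding can be sketched in a few lines. The mapping below reuses the numbers for 'the', 'dog', and 'cat' from the text; every other number, the `UNKNOWN_ID` constant, and the `encode` helper are hypothetical.

```python
# Hypothetical word-to-integer mapping; the numbers are arbitrary.
word_to_id = {"the": 1, "dog": 42, "cat": 96, "and": 7, "sat": 15, "on": 23, "mat": 58}
UNKNOWN_ID = 0  # assumed id for words missing from the mapping

def encode(sentence):
    """Lower-case the sentence, split on whitespace, and look up each word."""
    return [word_to_id.get(word, UNKNOWN_ID) for word in sentence.lower().split()]

encoded = encode("The dog and cat sat on the mat")
print(encoded)  # [1, 42, 7, 96, 15, 23, 1, 58]

# The numeric gap between 'dog' and 'cat' says nothing about how related they are.
gap = abs(word_to_id["dog"] - word_to_id["cat"])
print(gap)  # 54
```

This representation is compact, but the distance between two ids is an accident of how the numbers were assigned.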

Unlike the above methods, a word embedding is learned from data. An embedding represents each word as an n-dimensional vector of floating point values. These values are trainable parameters: weights learned while training the model. After training, we hope that similar words will be close together in the embedding space. We can visualize the learned embeddings by projecting them down to a 2- or 3-dimensional space.
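
Conceptually, an embedding is a lookup table with one row per word. The NumPy sketch below illustrates the idea under assumed toy sizes (10 words, 4 dimensions) and random, untrained values. It is not how Keras implements the layer internally.

```python
import numpy as np

# Assumed toy sizes: 10 words in the vocabulary, 4 dimensions per word.
vocab_size, embedding_dim = 10, 4

# An embedding starts from random values; training would adjust them.
rng = np.random.RandomState(0)
embedding_table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim))

# A "sentence" of three word ids.
word_ids = np.array([1, 4, 9])

# The lookup is plain row indexing: one row per word id.
sentence_vectors = embedding_table[word_ids]
print(sentence_vectors.shape)  # (3, 4)

# The same result as multiplying one-hot vectors by the table,
# without ever building the sparse one-hot vectors.
one_hot_rows = np.eye(vocab_size)[word_ids]
same = np.allclose(np.dot(one_hot_rows, embedding_table), sentence_vectors)
print(same)  # True
```

The last lines show why embeddings avoid the inefficiency of one-hot vectors: indexing a row gives the same result as the matrix product, at a fraction of the cost.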

There are two ways to obtain word embeddings:

* Learn word embeddings jointly with the main task you care about (e.g. sentiment classification). In this case, you would start with random word vectors, then learn your word vectors in the same way that you learn the weights of a neural network.

* Load word embeddings into your model that were pre-computed using a different machine learning task than the one you are trying to solve. These are called "pre-trained word embeddings".

Here, we will take the first approach.

In [2]:
import tensorflow as tf
from tensorflow import keras
print(tf.__version__)

1.8.0


# Download the IMDB dataset

The IMDB dataset comes packaged with TensorFlow. It has already been preprocessed such that the reviews (sequences of words) have been converted to sequences of integers, where each integer represents a specific word in a dictionary.

In [3]:
imdb = keras.datasets.imdb

# Number of words to consider as features
num_words = 20000

# load IMDB dataset as lists of integers
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=num_words)

The argument ```num_words=20000``` keeps the 20,000 most frequently occurring words in the training data.

In [4]:
print("Training examples: {}, labels: {}".format(len(train_data), len(train_labels)))

Training examples: 25000, labels: 25000


The text of each review has been converted to integers, where each integer represents a specific word in a dictionary. Here's what the first review looks like:

In [5]:
print(train_data[0])

[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 19193, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 10311, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 12118, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]


Movie reviews may have different lengths. The code below shows the number of words in the first and second reviews. Since inputs to a neural network must be the same length, we'll need to resolve this.

In [6]:
len(train_data[0]), len(train_data[1])

(218, 189)

# Preprocess Data
We will pad the arrays so they all have the same length, using the [pad_sequences](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences) function. In this case, it returns a new matrix of shape ```(num_examples, max_len)```:

In [7]:
# Pad or truncate every review to this many words
max_len = 500

# Convert our lists of integers into 2D tensors
train_data = keras.preprocessing.sequence.pad_sequences(train_data, 
                                                        maxlen=max_len)
test_data = keras.preprocessing.sequence.pad_sequences(test_data, 
                                                       maxlen=max_len)

print(train_data.shape)

(25000, 500)


Notice that ```pad_sequences``` worked by prepending zeros to the start of the sequence:

In [8]:
print(train_data[0])

[    0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0   
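
The padding behavior can be imitated in plain Python. This is only a sketch of what `pad_sequences` does with its default `padding='pre'` and `truncating='pre'` settings. The `pad_pre` helper is hypothetical and handles a single list rather than a batch.

```python
def pad_pre(sequence, maxlen, value=0):
    """Left-pad short sequences and keep the last `maxlen` items of long ones.

    Assumes maxlen is a positive integer.
    """
    truncated = sequence[-maxlen:]
    return [value] * (maxlen - len(truncated)) + truncated

short = pad_pre([5, 6, 7], maxlen=5)
long_ = pad_pre([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)  # [0, 0, 5, 6, 7]
print(long_)  # [3, 4, 5, 6]
```

Note the second case: with these defaults, a review longer than the maximum length loses words from its beginning, not its end.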

# Build a Multi-Layer Perceptron
We are now ready to build our model. We will use an [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer to map from an integer that corresponds to a word, to a vector of floating point weights (the embedding). These weights are learned when we train the model.

In [9]:
from keras.models import Sequential
from keras.layers import Dense, Embedding, GlobalAveragePooling1D

embedding_dimension = 16

model = Sequential()
model.add(Embedding(num_words, embedding_dimension, input_length=max_len))

# The output of the embedding layer is a 3D tensor of shape (samples, max_len, embedding_dimension).
# We use `GlobalAveragePooling1D` to average over the sequence before our fully connected layer.
# In the past, this was a Flatten layer.
model.add(GlobalAveragePooling1D())

# Add a classifier on top.
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])

model.summary()

history = model.fit(
    train_data,
    train_labels,
    epochs=10,
    batch_size=32,
    validation_split=0.2
)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_1 (Embedding)      (None, 500, 16)           320000    
_________________________________________________________________
global_average_pooling1d_1 ( (None, 16)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 17        
Total params: 320,017
Trainable params: 320,017
Non-trainable params: 0
_________________________________________________________________


Using TensorFlow backend.


Train on 20000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


Our classifier has a validation accuracy of about 89%. Note that we use at most the last 500 words of each review, because ```pad_sequences``` truncates longer sequences from the start by default. We also apply global average pooling to the embeddings before passing the result to a single Dense layer. Averaging treats each word separately and does not take the ordering of the words in the sequence into consideration. To reach higher accuracy, it would be helpful to use a recurrent layer or a 1D convolution, which take the order of the words into consideration.
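
The loss of word order is easy to check by hand. The sketch below averages hypothetical 2-dimensional word vectors in plain Python. `global_average_pool` is an illustrative stand-in for what the pooling layer computes on one review, not the Keras implementation.

```python
def global_average_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]

# Hypothetical 2-dimensional embeddings for three words.
vec_not, vec_good, vec_bad = [1.0, 0.0], [0.0, 2.0], [4.0, 4.0]

original = global_average_pool([vec_not, vec_good, vec_bad])
shuffled = global_average_pool([vec_bad, vec_not, vec_good])
print(original == shuffled)  # True
```

Any reordering of the same words produces the same pooled vector, so the Dense layer cannot tell the two reviews apart.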

# Visualize Embeddings with the Embedding Projector

Recall that the reviews are encoded as sequences of integers in our training data. Before we can visualize the learned embeddings, we need to determine which word corresponds to each number. The IMDB dataset includes a utility function, ```get_word_index()```, that returns a mapping from words to numbers. We will use this to build a reversed word index, which maps from numbers to words.

In [10]:
# A dictionary mapping words to an integer index
word_index = imdb.get_word_index()

# The first indices are reserved
word_index = {k:(v+3) for k,v in word_index.items()} 
word_index["<PAD>"] = 0
word_index["<START>"] = 1
word_index["<UNK>"] = 2  # unknown
word_index["<UNUSED>"] = 3

reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])

def decode_review(text):
    return ' '.join([reverse_word_index[i] for i in text])
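
To see why the indices are shifted by 3, here is a self-contained sketch using a tiny hypothetical index in place of the one returned by `imdb.get_word_index()`. The names `raw_index`, `toy_word_index`, `toy_reverse_index`, and `decode` are made up for this example.

```python
# Hypothetical raw index, standing in for imdb.get_word_index().
raw_index = {"the": 1, "and": 2, "film": 19}

# Shift every id by 3 to make room for the reserved tokens.
toy_word_index = {word: idx + 3 for word, idx in raw_index.items()}
toy_word_index.update({"<PAD>": 0, "<START>": 1, "<UNK>": 2, "<UNUSED>": 3})

# Invert the mapping: integer id -> word.
toy_reverse_index = {idx: word for word, idx in toy_word_index.items()}

def decode(ids):
    """Join the words for each id, using '?' for ids missing from the toy index."""
    return " ".join(toy_reverse_index.get(i, "?") for i in ids)

decoded = decode([0, 0, 1, 4, 22, 2])
print(decoded)  # <PAD> <PAD> <START> the film <UNK>
```

Without the shift, the ids of the most frequent words would collide with the padding, start, and unknown markers.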

Now we can use the decode_review function to display the text for the first review. You will see padding at the beginning, since this review was shorter than our 500 word maximum length.

In [11]:
decode_review(train_data[0])

"<PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PAD> <PA

Now that we have the number-to-word mapping, we are ready to retrieve the learned embeddings from the model. This gives us a matrix of weights. The index of each row is a word number, and the corresponding word can be found in ```reverse_word_index``` above.

We retrieve the weights by using ```model.layers``` and the layer's ```get_weights()``` method. In this case, the embedding layer is the first layer we added to the model.



In [12]:
e = model.layers[0]
weights = e.get_weights()[0]
print(weights.shape) # (num_words, embedding_dimension). Each word is mapped to an embedding vector.

(20000, 16)


Next, we will format these for visualization in the embedding projector. To do so, we will need to provide two files in tab separated format: a file of vectors (containing the embedding), and a file of meta data (containing the words).

In [13]:
# Write one row per word: vectors go to vecs.tsv, words go to meta.tsv
with open('vecs.tsv', 'w') as out_v, open('meta.tsv', 'w') as out_m:
  for word_num in range(num_words):
    word = reverse_word_index[word_num]
    embeddings = weights[word_num]
    out_m.write(word + "\n")
    out_v.write('\t'.join([str(x) for x in embeddings]) + "\n")
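
For reference, this is roughly what the two files contain. The sketch writes two hypothetical words with 3-dimensional vectors to in-memory buffers instead of files, so the toy values are assumptions rather than real output.

```python
import io

# Hypothetical 3-dimensional embeddings for two words.
toy_words = ["<PAD>", "film"]
toy_weights = [[0.0, 0.0, 0.0], [0.12, -0.4, 0.33]]

vecs, meta = io.StringIO(), io.StringIO()
for word, vector in zip(toy_words, toy_weights):
    meta.write(word + "\n")                                # one word per line
    vecs.write("\t".join(str(x) for x in vector) + "\n")   # tab-separated floats

print(repr(vecs.getvalue()))  # '0.0\t0.0\t0.0\n0.12\t-0.4\t0.33\n'
print(repr(meta.getvalue()))  # '<PAD>\nfilm\n'
```

Line *i* of the metadata file labels line *i* of the vector file, so both files must have the same number of rows in the same order.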

If you are running this tutorial in [Colaboratory](https://colab.research.google.com), you can use the following snippet to download these files to your local machine.

In [14]:
from google.colab import files
files.download('vecs.tsv')
files.download('meta.tsv')

ModuleNotFoundError: No module named 'google.colab'

Now, you can open the [Embedding Projector](http://projector.tensorflow.org/) in a new window, and click on 'Load data'. Upload the ```vecs.tsv``` and ```meta.tsv``` files from above. Next, click 'Search', and type in a word to find its closest neighbors. With this small dataset, not all of the learned embeddings will be interpretable, though some will be! 

For example, try searching for "beautiful". The learned embeddings you see may be different, since they depend on the random weight initialization used by the model. When the author of this tutorial ran it, "loved" and "wonderful" were the closest neighbors. Likewise, the closest neighbors for "lame" were "awful" and "poorly".
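
A similar neighbor search can be sketched locally with NumPy, assuming cosine similarity as the measure of closeness. The `nearest_neighbors` helper and the hand-picked 2-dimensional vectors below are hypothetical. They are not the embeddings learned above.

```python
import numpy as np

def nearest_neighbors(query, words, vectors, k=2):
    """Return the k words whose vectors have the highest cosine similarity to the query."""
    # Normalize each row to unit length so a dot product equals cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarities = unit.dot(unit[words.index(query)])
    ranked = np.argsort(-similarities)  # most similar first
    return [words[i] for i in ranked if words[i] != query][:k]

# Hypothetical 2-dimensional embeddings, hand-picked for illustration.
toy_words = ["beautiful", "wonderful", "loved", "lame", "awful"]
toy_vectors = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [-0.8, 0.1],
    [-0.9, 0.2],
])

print(nearest_neighbors("beautiful", toy_words, toy_vectors))  # ['wonderful', 'loved']
print(nearest_neighbors("lame", toy_words, toy_vectors, k=1))  # ['awful']
```

To try this on the learned `weights` matrix, rows with zero norm would need special handling, since the normalization divides by each row's length.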



# A More Advanced Model
We will implement a more advanced model that demonstrates two things:
1. The use of pre-trained embeddings.
2. The use of a 1D CNN. 

We will implement a depthwise separable convolutional neural network (sepCNN), a type of CNN described in a paper by François Chollet: https://arxiv.org/abs/1610.02357. Because a CNN slides a window over its input, it takes the order of the words in our text into consideration.

In [23]:
# download pretrained GloVe embeddings
!wget http://nlp.stanford.edu/data/glove.6B.zip

In [None]:
!unzip glove.6B.zip

In [27]:
import os
import numpy as np

In [28]:
glove_dir = './'

embeddings_index = {} #initialize dictionary
f = open(os.path.join(glove_dir, 'glove.6B.100d.txt'))
for line in f:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype='float32')
    embeddings_index[word] = coefs
f.close()

print('Found %s word vectors.' % len(embeddings_index))

Found 400000 word vectors.


In [31]:
embedding_dim = 100

embedding_matrix = np.zeros((num_words, embedding_dim)) # array of zeros with num_words rows and embedding_dim columns
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < num_words:
        if embedding_vector is not None:
            # Words not found in embedding index will be all-zeros.
            embedding_matrix[i] = embedding_vector

In [15]:
from keras import initializers, models, regularizers
from keras.layers import Dense, Dropout, Embedding, SeparableConv1D, MaxPooling1D, GlobalAveragePooling1D

In [66]:
# let's define a Sequential model
cnnModel = models.Sequential()

# let's add an embedding layer, initialized with the pre-trained GloVe vectors and frozen (trainable=False)
cnnModel.add(Embedding(num_words, 
                    embedding_dim, 
                    input_length=max_len,
                      weights=[embedding_matrix],
                      trainable=False))

Our sepCNN is defined as a stack of blocks. Each block is made up of the following layers:
* Dropout
* SeparableConv1D
* SeparableConv1D
* MaxPooling1D

The number of blocks in the network is a hyperparameter.

In [67]:
blocks = 4 # int, number of sepCNN blocks (dropout, two separable convolutions, pooling) in the model.
dropout_rate = 0.3 # float, fraction of the input to drop at Dropout layers. Recommended range 0.2-0.5
filters = 50 # int, output dimension of the layers. Recommended range 50-300
kernel_size = 5 # int, length of the convolution window. Recommended values 3 or 5
pool_size = 1 # int, factor by which to downscale input at MaxPooling layer (1 leaves the length unchanged).

for _ in range(blocks):
    cnnModel.add(Dropout(rate=dropout_rate))
    cnnModel.add(SeparableConv1D(filters=filters,
                             kernel_size= kernel_size,
                             activation= 'relu',
                             bias_initializer= 'random_uniform',
                             depthwise_initializer= 'random_uniform',
                             padding= 'same'))
    cnnModel.add(SeparableConv1D(filters=filters,
                             kernel_size= kernel_size,
                             activation= 'relu',
                             bias_initializer= 'random_uniform',
                             depthwise_initializer= 'random_uniform',
                             padding= 'same'))
    cnnModel.add(MaxPooling1D(pool_size=pool_size))

We will complete our model architecture by adding the following layers:
* SeparableConv1D
* SeparableConv1D
* GlobalAveragePooling1D
* Dropout
* Dense

In [68]:
cnnModel.add(SeparableConv1D(filters=filters * 2,
                         kernel_size= kernel_size,
                         activation= 'relu',
                         bias_initializer= 'random_normal',
                         depthwise_initializer= 'random_normal',
                         padding='same'))
cnnModel.add(SeparableConv1D(filters=filters * 2,
                         kernel_size= kernel_size,
                         activation= 'relu',
                         bias_initializer= 'random_normal',
                         depthwise_initializer= 'random_normal',
                         padding='same'))
cnnModel.add(GlobalAveragePooling1D())
cnnModel.add(Dropout(rate= dropout_rate))
cnnModel.add(Dense(1, activation='sigmoid'))

At this point, we have defined a deep convolutional neural network. We can see the architecture in its entirety below.

In [69]:
cnnModel.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_7 (Embedding)      (None, 500, 100)          2000000   
_________________________________________________________________
dropout_24 (Dropout)         (None, 500, 100)          0         
_________________________________________________________________
separable_conv1d_47 (Separab (None, 500, 50)           5550      
_________________________________________________________________
separable_conv1d_48 (Separab (None, 500, 50)           2800      
_________________________________________________________________
max_pooling1d_19 (MaxPooling (None, 500, 50)           0         
_________________________________________________________________
dropout_25 (Dropout)         (None, 500, 50)           0         
_________________________________________________________________
separable_conv1d_49 (Separab (None, 500, 50)           2800      
__________
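
The parameter counts in the summary can be checked by hand. The sketch below assumes a depth multiplier of 1 and a bias on the pointwise step. Both helper functions are hypothetical, written only to reproduce the arithmetic.

```python
def separable_conv1d_params(in_channels, filters, kernel_size, use_bias=True):
    """Parameter count of a depthwise separable 1D convolution (depth multiplier 1)."""
    depthwise = kernel_size * in_channels  # one kernel per input channel
    pointwise = in_channels * filters      # 1x1 convolution that mixes channels
    bias = filters if use_bias else 0
    return depthwise + pointwise + bias

def conv1d_params(in_channels, filters, kernel_size, use_bias=True):
    """Parameter count of a regular 1D convolution, for comparison."""
    return kernel_size * in_channels * filters + (filters if use_bias else 0)

print(separable_conv1d_params(100, 50, 5))  # 5550, input is the 100-d embedding
print(separable_conv1d_params(50, 50, 5))   # 2800, input is a 50-channel feature map
print(conv1d_params(100, 50, 5))            # 25050
```

The first two values match the `Param #` column above. A regular convolution with the same shape would need several times as many weights, which is the main appeal of the separable variant.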

Let's compile and train the model.

In [70]:
from keras.optimizers import Adam

In [71]:
cnnModel.compile(optimizer=Adam(0.1), loss='binary_crossentropy', metrics=['acc'])

In [72]:
history = cnnModel.fit(
    train_data,
    train_labels,
    epochs=3,
    batch_size=512,
    validation_split=0.2
)

Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


# Next steps
* To learn more about Word Embeddings, we recommend browsing [this](https://www.tensorflow.org/tutorials/representation/word2vec) older tutorial (the code is out of date, and we recommend skipping it in favor of the newer version here, but the explanation and diagrams are useful).

* [TensorFlow Hub](https://www.tensorflow.org/hub/) contains large databases of pretrained word embeddings you can download and reuse in your projects (although at the time of writing, these use a different programming style than the one in this tutorial).