##### Copyright 2018 The TensorFlow Authors.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

In [2]:
#@title MIT License
#
# Copyright (c) 2017 François Chollet
#
# Permission is hereby granted, free of charge, to any person obtaining a
# copy of this software and associated documentation files (the "Software"),
# to deal in the Software without restriction, including without limitation
# the rights to use, copy, modify, merge, publish, distribute, sublicense,
# and/or sell copies of the Software, and to permit persons to whom the
# Software is furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
# DEALINGS IN THE SOFTWARE.

# Introduction to word embeddings


Word embeddings are a way of representing word tokens numerically. Given a sequence of words, we need a numeric representation of those words before they can be fed into our network. A common (though optional) first step is to stem the words, that is, reduce each one to its root form, and then build a dictionary from the stemmed words. This dictionary assigns a unique integer to each word. Each word can then be represented as a 1-D tensor or array whose length equals the vocabulary size, with a 1 at the index assigned to the word and 0 everywhere else. This sparse representation is called a one-hot encoding. Word embeddings replace this sparse representation with a dense one: a short vector of floating-point numbers for each word.
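To make the dictionary and one-hot steps concrete, here is a minimal sketch in plain Python. The toy corpus is made up for illustration, stemming is skipped for brevity, and the names `word_to_index` and `one_hot` are our own rather than part of any library.

```python
# Toy corpus; the sentences are invented purely for illustration.
corpus = ["the movie was great", "the plot was thin"]

# Build a dictionary that assigns a unique integer to each distinct word,
# in order of first appearance.
word_to_index = {}
for sentence in corpus:
    for word in sentence.split():
        if word not in word_to_index:
            word_to_index[word] = len(word_to_index)


def one_hot(word, vocab):
    """Return a list with 1 at the word's index and 0 everywhere else."""
    vector = [0] * len(vocab)
    vector[vocab[word]] = 1
    return vector


movie_vector = one_hot("movie", word_to_index)
print(word_to_index)
print(movie_vector)
```

With a realistic vocabulary of tens of thousands of words, each of these vectors would be almost entirely zeros, which is the sparsity that embeddings are meant to avoid.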

There are two ways of obtaining word embeddings:
* Learn embeddings with respect to the task that you would like to carry out. With this approach, you start with random vectors and then train the vectors in a similar manner to how the weights of a network layer are trained.
* Use pre-trained embeddings in the manner of transfer learning. With this approach, you would make use of an embedding that was pre-trained for a task that might be similar to yours, or completely different.

In this tutorial, we will look at both approaches with the goal of helping you implement either one.
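As a rough sketch of the second approach, pre-trained vectors are usually arranged into a matrix with one row per word index of our own vocabulary. The vectors and vocabulary below are hypothetical toy values, not taken from any real embedding file, and reserving index 0 for padding is an assumption of this sketch.

```python
import numpy as np

# Hypothetical pre-trained vectors; real ones would be loaded from a file.
pretrained_vectors = {
    "good": np.array([0.9, 0.1, 0.0]),
    "bad": np.array([-0.8, 0.2, 0.1]),
}

# Hypothetical task vocabulary; index 0 is reserved for padding in this sketch.
word_index = {"good": 1, "bad": 2, "meh": 3}

embed_dim = 3
embedding_matrix = np.zeros((len(word_index) + 1, embed_dim))
for word, idx in word_index.items():
    vector = pretrained_vectors.get(word)
    if vector is not None:
        # Words without a pre-trained vector keep an all-zero row.
        embedding_matrix[idx] = vector

print(embedding_matrix)
```

A matrix built this way is typically loaded into an embedding layer as its initial weights, and is often frozen so that training does not overwrite it.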

There are different types of word embeddings, all of which are generated from a large body of text (called a corpus). A corpus could come from Wikipedia (or some other encyclopedia) or from a body of literature. Some of these embeddings are:
* GloVe: Global Vectors for Word Representation, an unsupervised approach to learning word vectors. You can find additional information here: https://nlp.stanford.edu/projects/glove/
* Word2Vec: A two-layer neural network that learns word embeddings.
* ELMo: deep contextualized word representations. You can find additional information here: https://allennlp.org/elmo
* FastText: an open-source library for text representations from Facebook. You can find additional information here: https://fasttext.cc

# Embeddings Tutorial

In [3]:
from tensorflow import keras

from keras.layers import Embedding

Using TensorFlow backend.


We will learn an embedding using an `Embedding` layer, which takes two required arguments: the first is the maximum number of tokens (our vocabulary size), and the second is the number of dimensions of the embeddings. An example of creating an embedding layer follows below.

In [4]:
VOCAB_SIZE = 1000
EMBED_SIZE = 64

# create an embedding layer
embedding_layer = Embedding(VOCAB_SIZE, EMBED_SIZE)

The Embedding layer maps each integer word index to a dense vector, which is equivalent to multiplying that word's one-hot vector by the layer's weight matrix. It essentially serves as a dictionary lookup.
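The dictionary-lookup view can be checked with a small NumPy sketch. The random `weights` matrix below is only a stand-in for the layer's weights, not the Keras implementation itself.

```python
import numpy as np

vocab_size, embed_dim = 5, 3

# Stand-in for the layer's weight matrix: one row per word index.
rng = np.random.default_rng(0)
weights = rng.normal(size=(vocab_size, embed_dim))

word_id = 2
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

via_matmul = one_hot @ weights   # multiply the one-hot vector by the matrix
via_lookup = weights[word_id]    # pick out the matching row directly

print(np.allclose(via_matmul, via_lookup))
```

Both routes give the same vector, but the lookup avoids building the one-hot vector and multiplying by a mostly-zero input.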

The input to the Embedding layer is a 2D tensor of integers, of shape (samples, sequence_length). All the sequences in a batch must be of the same length. Sequences that are shorter than `sequence_length` should be padded with zeros, while sequences that are longer should be truncated.

The output of the Embedding layer is a 3D tensor of floating point numbers, of shape (samples, sequence_length, embedding_dimensionality). This output can be processed by an RNN layer or a 1D convolution layer.
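The shape change from 2D integers to 3D floats can be sketched with NumPy indexing. The sizes below are arbitrary, and `weights` is again a stand-in for a trained embedding matrix.

```python
import numpy as np

samples, sequence_length = 4, 6
vocab_size, embed_dim = 50, 8

rng = np.random.default_rng(0)
weights = rng.normal(size=(vocab_size, embed_dim))

# A batch of integer sequences, shape (samples, sequence_length).
batch = rng.integers(0, vocab_size, size=(samples, sequence_length))

# Indexing with a 2D integer array looks up one row per integer,
# giving shape (samples, sequence_length, embed_dim).
embedded = weights[batch]

print(batch.shape, embedded.shape, embedded.dtype)
```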

When an Embedding layer is instantiated, its weights are randomly assigned. During training, these word vectors are gradually adjusted through backpropagation.

We will make use of the IMDB movie reviews dataset to train a classifier. We will restrict the movie reviews to the 10,000 most common words and truncate each review to 20 words. Our network will learn an 8-dimensional embedding for each of the 10,000 words, turn the input integer sequences (2D integer tensor) into embedded sequences (3D float tensor), flatten the tensor to 2D, and train a single Dense layer on top for classification.

In [5]:
# Let's import the sample dataset from keras
from keras.datasets import imdb
from keras import preprocessing

# Number of words to consider as features
VOCAB_SIZE = 10000

# Maximum number of words to keep per review
# (a sequence length, not an embedding dimension, despite the name)
EMBED_SIZE = 20

# load IMDB dataset as lists of integers
(X_train, y_train), (X_valid, y_valid) = imdb.load_data(num_words=VOCAB_SIZE)


`imdb.load_data()` returns two tuples, one for training and one for validation, each holding an array of reviews and an array of labels. `y_train` and `y_valid` are 1-D arrays with one integer label per review (0 for a negative review, 1 for a positive one), while `X_train` and `X_valid` are 1-D arrays of Python lists whose lengths vary with the number of words in each review.

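`X_train` and `X_valid` contain numeric representations of our words. Each review is stored as a sequence of integers in which every word has been replaced by an index based on how frequently it occurs in the dataset. Because we passed `num_words=VOCAB_SIZE`, only the 10,000 most frequent words keep their own index.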
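The idea of frequency-based indexing with a vocabulary cap can be sketched in plain Python. This is a simplified illustration on invented reviews: the reserved indices (0 for padding, 1 for out-of-vocabulary words) are choices of this sketch and do not reproduce the exact offsets that Keras uses for the IMDB data.

```python
from collections import Counter

# Toy reviews, invented for illustration.
reviews = [
    "great movie great cast",
    "bad movie",
    "great fun",
]

counts = Counter(word for review in reviews for word in review.split())

# Rank words by descending frequency. Index 0 is kept free for padding and
# index 1 stands in for out-of-vocabulary words (choices of this sketch).
OOV_INDEX = 1
ranked = [word for word, _ in counts.most_common()]
word_to_index = {word: rank + 2 for rank, word in enumerate(ranked)}

NUM_WORDS = 4  # keep only indices below this value, like a vocabulary cap


def encode(review):
    """Replace each word with its index, or OOV_INDEX if it falls outside the cap."""
    ids = []
    for word in review.split():
        idx = word_to_index.get(word, OOV_INDEX)
        ids.append(idx if idx < NUM_WORDS else OOV_INDEX)
    return ids


encoded = [encode(review) for review in reviews]
print(encoded)
```

Here only the two most frequent words keep their own index, and every rarer word collapses to the out-of-vocabulary index.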

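In order to work with `X_train` and `X_valid`, we will truncate each sequence to `EMBED_SIZE` elements and pad any sequence that is shorter than that. With its default settings, `pad_sequences` truncates and pads at the start of a sequence, so the last `EMBED_SIZE` words of a long review are kept. That is what we do below.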

In [6]:
# convert our lists of integers into 2D tensors
X_train = preprocessing.sequence.pad_sequences(X_train, maxlen=EMBED_SIZE)
X_valid = preprocessing.sequence.pad_sequences(X_valid, maxlen=EMBED_SIZE)

At this point, `X_train` and `X_valid` are 2D integer arrays with one row per review and 20 elements in each row. Note that some of the elements could be zero padding.
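The padding and truncation step can be mimicked in plain Python. This is a sketch of the behaviour we understand `pad_sequences` to have by default (padding and truncating at the start of each sequence); `pad_or_truncate` is a hypothetical helper, not a Keras function.

```python
def pad_or_truncate(sequence, maxlen, value=0):
    """Keep the last `maxlen` items, left-padding with `value` if too short."""
    trimmed = sequence[-maxlen:] if maxlen > 0 else []
    return [value] * (maxlen - len(trimmed)) + trimmed


short_review = [12, 7, 45]
long_review = [3, 8, 15, 16, 23, 42, 4]

padded = pad_or_truncate(short_review, maxlen=5)
truncated = pad_or_truncate(long_review, maxlen=5)

print(padded)     # zeros are added in front of the short review
print(truncated)  # the earliest items of the long review are dropped
```

Every row ends up with the same length, which is what allows the reviews to be stacked into a single 2D array.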

In [7]:
from keras.models import Sequential
from keras.layers import Flatten, Dense

DIM = 8

# create a Sequential model
model = Sequential()
# let's add our Embedding layer
model.add( Embedding(VOCAB_SIZE, DIM, input_length= EMBED_SIZE) )

# our output is a 3D tensor of shape (samples, EMBED_SIZE, DIM)
# we will flatten it into a 2D tensor of shape (samples, EMBED_SIZE * DIM)
model.add( Flatten() )

# Let's add a classifier.
model.add( Dense(1, activation='sigmoid') )
model.compile( optimizer='adam', loss='binary_crossentropy', metrics=['acc'] )

model.summary()

history = model.fit(
    X_train,
    y_train,
    epochs=3,
    batch_size=32,
    validation_split=0.2
)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 20, 8)             80000     
_________________________________________________________________
flatten_1 (Flatten)          (None, 160)               0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 161       
Total params: 80,161
Trainable params: 80,161
Non-trainable params: 0
_________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


Our classifier has a validation accuracy of about 75%. Note that we make use of only 20 words from each review. We are also flattening our embedding and passing it to a single Dense layer, which treats each word separately without taking into consideration the relationships between words or the structure of the sentence.

It would be much better to use a recurrent layer or a 1D convolution layer, which takes the order of the words into consideration.
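The parameter counts reported by `model.summary()` for the embedding model can be checked by hand. The arithmetic below uses the same sizes as the tutorial; `SEQ_LEN` is our own name for the value stored in `EMBED_SIZE`.

```python
VOCAB_SIZE = 10000   # number of distinct word indices
SEQ_LEN = 20         # words per review (EMBED_SIZE in the tutorial code)
DIM = 8              # embedding dimensions

embedding_params = VOCAB_SIZE * DIM      # one DIM-sized vector per word
flattened_size = SEQ_LEN * DIM           # values per review after Flatten
dense_params = flattened_size * 1 + 1    # one weight per input plus one bias

total_params = embedding_params + dense_params
print(embedding_params, flattened_size, dense_params, total_params)
```

Nearly all of the trainable parameters sit in the embedding table, while the classifier on top is very small.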

# Without Embeddings
It is always good to have a control experiment. What if we fed the integer word indices directly into the network, without an embedding? In the following model, we replace the embedding with a `Dense` layer and set its number of units to the embedding dimension, just to keep things uniform.

In [8]:
from keras.models import Sequential
from keras.layers import Flatten, Dense

DIM = 8

# create a Sequential model
model = Sequential()
# let's add our Dense layer
model.add( Dense(DIM, activation='relu', input_shape=(EMBED_SIZE,)) )


# Let's add a classifier.
model.add( Dense(1, activation='sigmoid') )
model.compile( optimizer='adam', loss='binary_crossentropy', metrics=['acc'] )

model.summary()

history = model.fit(
    X_train,
    y_train,
    epochs=3,
    batch_size=32,
    validation_split=0.2
)

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_2 (Dense)              (None, 8)                 168       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 9         
Total params: 177
Trainable params: 177
Non-trainable params: 0
_________________________________________________________________
Train on 20000 samples, validate on 5000 samples
Epoch 1/3
Epoch 2/3
Epoch 3/3


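In this run, our validation accuracy was just under 50%, which is no better than random guessing on a two-class problem. This is not surprising: the integer indices are identifiers rather than measurements, so their numeric magnitudes carry little meaning that a `Dense` layer can exploit.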