From 5d5406ffcfcc8578391032ad4f1f93dcd5678bb1 Mon Sep 17 00:00:00 2001 From: Josh Gordon Date: Mon, 15 Oct 2018 18:33:54 -0400 Subject: [PATCH] Updates Thank you for starting this! Here's a round of edits. I think it's almost ready to go. Could you take a look and see if there's anything we can improve in this version? When you're happy with it, please submit a PR to the TF Docs repo, and we can continue refining there. I'd like to add some graphics of the embedding projector before we publish, add more references to educational resources, and improve the intro to embeddings before we publish, but we can work on those changes in the docs PR whenever you're ready. Thanks again! --- .../keras/intro_word_embeddings.ipynb | 1055 +++++++++++------ 1 file changed, 670 insertions(+), 385 deletions(-) diff --git a/site/en/tutorials/keras/intro_word_embeddings.ipynb b/site/en/tutorials/keras/intro_word_embeddings.ipynb index 39a09e00918..9cd754b2717 100644 --- a/site/en/tutorials/keras/intro_word_embeddings.ipynb +++ b/site/en/tutorials/keras/intro_word_embeddings.ipynb @@ -1,390 +1,675 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "##### Copyright 2018 The TensorFlow Authors." - ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# https://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "#@title MIT License\n", - "#\n", - "# Copyright (c) 2017 François Chollet\n", - "#\n", - "# Permission is hereby granted, free of charge, to any person obtaining a\n", - "# copy of this software and associated documentation files (the \"Software\"),\n", - "# to deal in the Software without restriction, including without limitation\n", - "# the rights to use, copy, modify, merge, publish, distribute, sublicense,\n", - "# and/or sell copies of the Software, and to permit persons to whom the\n", - "# Software is furnished to do so, subject to the following conditions:\n", - "#\n", - "# The above copyright notice and this permission notice shall be included in\n", - "# all copies or substantial portions of the Software.\n", - "#\n", - "# THE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\n", - "# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\n", - "# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL\n", - "# THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\n", - "# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING\n", - "# FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER\n", - "# DEALINGS IN THE SOFTWARE." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Introduction to word embeddings" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "\n", - " \n", - " \n", - " \n", - "
\n", - " View on TensorFlow.org\n", - " \n", - " Run in Google Colab\n", - " \n", - " View source on GitHub\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Introduction to Word Embeddings" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Word embeddings are a way of numerically representing word tokens. When given a sequence of words, it is important to get a numeric representation of those words so that they can be fed into our network. It is normally the case that we start by stemming our words (getting the root word), and then creating a dictionary out of the stemmed words. This dictionary will let us assign a unique integer to each word. The representation is then a 1-D tensor or array with the integer 1 at the index representing our word, and the integer 0 elsewhere. This creates a sparse representation called a one-hot encoding. Word embeddings convert a sparse representation (one-hot encoding) into a dense representation (vectors). \n", - "\n", - "There are two ways of obtaining word embeddings:\n", - "* Learn embeddings with respect to the task that you would like to carry out. With the approach, you start with random vectors and then train the vectors in a similar manner to how the weights of a network layer are trained.\n", - "* Use pre-trained embeddings in the manner of transfer learning. With this approach, you would make use of an embedding that was pre-trained for a task that might be similar to yours, or completely different.\n", - "\n", - "In this tutorial, we will look at both approaches with the goal of helping you implement either one.\n", - "\n", - "There are different types of word embeddings, all of which are generated from a large body of text (called a corpus). A corpus could be from wikipedia (or some other encyclopedia), or from a body of literature. Some of these embeddings are:\n", - "* GloVe: Global Vectors for Word Representation, an unsupervised approach to learning word vectors. You can find additional information here: https://nlp.stanford.edu/projects/glove/\n", - "* Word2Vec: A two-layer neural network that learns word embeddings.\n", - "* ELMo: deep contextualized word representations. You can find additional information here: https://allennlp.org/elmo\n", - "* FastText: an open-source library for text representations from Facebook. You can find additional information here: https://fasttext.cc" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Embeddings Tutorial" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "Using TensorFlow backend.\n" - ] + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "intro_word_embeddings.ipynb", + "version": "0.3.2", + "provenance": [], + "collapsed_sections": [] + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" } - ], - "source": [ - "from tensorflow import keras\n", - "\n", - "from keras.layers import Embedding" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We will learn an embedding using an Embedding layer which takes in two parameters, the first being the maximum number of tokens (our vocubulary size), and the second being the number of dimensions of the embeddings. An example of creating an embedding layer follows below." 
- ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [], - "source": [ - "VOCAB_SIZE = 1000\n", - "EMBED_SIZE = 64\n", - "\n", - "# create an embedding layer\n", - "embedding_layer = Embedding(VOCAB_SIZE, EMBED_SIZE)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The Embedding layer provides a mapping from a one-hot vector to a dense vector. It essentially serves as a dictionary lookup.\n", - "\n", - "The input to the Embedding layer is a 2D tensor of integers, of shape (samples, sequence_length). All the sequences in a batch must be of the same length. Sequences that are shorter than `sequence_length` should be padded with zeros, while sequences that are longer should be truncated.\n", - "\n", - "The output of the Embedding layer is a 3D tensor of floating point numbers, of shape (samples, sequence_length, embedding_dimensionality). This output can be processed by an RNN layer or a 1D convolution layer.\n", - "\n", - "When an Embedding layer is instantiated, its weights are randomly assigned. During training, these word vectors are gradually adjusted through backpropagation.\n", - "\n", - "We will make use of the IMDB movie reviews dataset to train a classifier. We will restrict the movie reviews to the top 10,000 most common words, and cut the reviews after only 20 words. Our network will learn an 8-dimensional embedding for each of the 10,000 words, turn the input integer sequences (2D integer tensor) into embedded sequences (3D float tensor), flatten the tensor to 2D, and train a single Dense layer on top for classification." - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "#Let's import the sample dataset from keras\n", - "from keras.datasets import imdb\n", - "from keras import preprocessing\n", - "\n", - "# Number of words to consider as features\n", - "VOCAB_SIZE = 10000\n", - "\n", - "# maximum number of words to use in a sequence\n", - "EMBED_SIZE = 20\n", - "\n", - "# load IMDB dataset as lists of integers\n", - "(X_train, y_train), (X_valid, y_valid) = imdb.load_data(num_words= VOCAB_SIZE)\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When we use `imdb.load_data()`, we get two sets of tuples. Each tuple is a 2-D ndarray. `y_train` and `y_valid` have only one element per row (that is, only one column), while `X_train` and `X_valid` have a varying number of rows (as a result of the words in the sentences that they represent).\n", - "\n", - "`X_train` and `X_valid` contain numeric representations of our words. The words are first of all converted into a Bag-of-Words representation in which they are assigned numbers.\n", - "\n", - "In order to work with `X_train` and `X_valid`, we will truncate them to our `EMBED_SIZE` and also pad any sequences that are less than that size. That is what we do below." - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [], - "source": [ - "# convert our lists of integers into 2D tensors\n", - "X_train = preprocessing.sequence.pad_sequences(X_train, maxlen=EMBED_SIZE)\n", - "X_valid = preprocessing.sequence.pad_sequences(X_valid, maxlen=EMBED_SIZE)" - ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "At this point, `X_train` and `X_valid` both have 20 columns and 20 elements in each row. Note that some of the elements could be null padding." 
- ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "_________________________________________________________________\n", - "Layer (type) Output Shape Param # \n", - "=================================================================\n", - "embedding_2 (Embedding) (None, 20, 8) 80000 \n", - "_________________________________________________________________\n", - "flatten_1 (Flatten) (None, 160) 0 \n", - "_________________________________________________________________\n", - "dense_1 (Dense) (None, 1) 161 \n", - "=================================================================\n", - "Total params: 80,161\n", - "Trainable params: 80,161\n", - "Non-trainable params: 0\n", - "_________________________________________________________________\n", - "Train on 20000 samples, validate on 5000 samples\n", - "Epoch 1/3\n", - "20000/20000 [==============================] - 2s 100us/step - loss: 0.6651 - acc: 0.6229 - val_loss: 0.6016 - val_acc: 0.7054\n", - "Epoch 2/3\n", - "20000/20000 [==============================] - 2s 89us/step - loss: 0.5010 - acc: 0.7819 - val_loss: 0.5094 - val_acc: 0.7436\n", - "Epoch 3/3\n", - "20000/20000 [==============================] - 2s 89us/step - loss: 0.3895 - acc: 0.8387 - val_loss: 0.4942 - val_acc: 0.7492\n" - ] - } - ], - "source": [ - "from keras.models import Sequential\n", - "from keras.layers import Flatten, Dense\n", - "\n", - "DIM = 8\n", - "\n", - "# create a Sequential model\n", - "model = Sequential()\n", - "# lets add our Embedding layer\n", - "model.add( Embedding(VOCAB_SIZE, DIM, input_length= EMBED_SIZE) )\n", - "\n", - "# our output is a 3D tensor of shape (samples, VOCAB_SIZE, EMBED_SIZE)\n", - "# we will flatten it into a 2D tensor of shape (samples, VOCAB_SIZE * EMBED_SIZE)\n", - "model.add( Flatten() )\n", - "\n", - "# Let's add a classifier.\n", - "model.add( Dense(1, activation='sigmoid') )\n", - "model.compile( optimizer='adam', loss='binary_crossentropy', metrics=['acc'] )\n", - "\n", - "model.summary()\n", - "\n", - "history = model.fit(\n", - " X_train,\n", - " y_train,\n", - " epochs=3,\n", - " batch_size=32,\n", - " validation_split=0.2\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Our classifier has a validation accuracy of about 75%. Note that we make use of only the first 20 words in each review. We are also flattening our embedding and passing it to a single Dense layer, which treats each word separately without taking into consideration the ordering of the words in the sequence.\n", - "\n", - "It would be much better to use a recurrent layer or 1D convolution which will take the sequence of the words into consideration." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Without Embeddings\n", - "It is always good to have a control experiment. What if we used our Bag-of-Words without an embedding? In the following model, we will replace our embedding with a `Dense` layer, and we will set our output neurons to the same as the embedding size just to keep things uniform." 
- ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "_________________________________________________________________\n", - "Layer (type) Output Shape Param # \n", - "=================================================================\n", - "dense_2 (Dense) (None, 8) 168 \n", - "_________________________________________________________________\n", - "dense_3 (Dense) (None, 1) 9 \n", - "=================================================================\n", - "Total params: 177\n", - "Trainable params: 177\n", - "Non-trainable params: 0\n", - "_________________________________________________________________\n", - "Train on 20000 samples, validate on 5000 samples\n", - "Epoch 1/3\n", - "20000/20000 [==============================] - 2s 75us/step - loss: 7.9536 - acc: 0.4986 - val_loss: 8.0226 - val_acc: 0.4944\n", - "Epoch 2/3\n", - "20000/20000 [==============================] - 1s 62us/step - loss: 7.9545 - acc: 0.4977 - val_loss: 7.9422 - val_acc: 0.4974\n", - "Epoch 3/3\n", - "20000/20000 [==============================] - 1s 65us/step - loss: 7.8356 - acc: 0.5011 - val_loss: 7.8839 - val_acc: 0.4958\n" - ] + "cells": [ + { + "metadata": { + "id": "xy8HXbLYe-zc", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "##### Copyright 2018 The TensorFlow Authors." + ] + }, + { + "metadata": { + "id": "Q0_10N6De-zh", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "jiR0ETqae-zt", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "# Introduction to Word Embeddings" + ] + }, + { + "metadata": { + "id": "SkppDVzme-zu", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "\n", + " \n", + " \n", + " \n", + "
\n", + " View on TensorFlow.org\n", + " \n", + " Run in Google Colab\n", + " \n", + " View source on GitHub\n", + "
" + ] + }, + { + "metadata": { + "id": "Aj5d6ZHae-zz", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "This tutorial shows how to train a sentiment classifer on the IMDB dataset using learned word embeddings. As a bonus, we show how to visualize these embeddings in the [TensorFlow Embedding Projector](http://projector.tensorflow.org). \n", + "\n", + "First, here's a bit of background. Before we can build a model to predict the sentiment of a review, first we will need a way to represent the words of the review as numbers, so they can be processed by our network. There are several strategies to convert words to numbers.\n", + "\n", + "As a first attempt, we might one-hot encode each word. One problem with this approach is efficiency. A one-hot encoded vector is sparse (meaning, most indicices are zero). Imagine we have 10,000 words in our vocabulary. To one-hot encode each one, we would create a vector where 99.99% of the elements are zero!\n", + "\n", + "Instead, we can encode each word using a unique number. For example, we might assign 1 to 'the', 42 to 'dog', and 96 to 'cat', and so on. Using these numbers, we could encode a sentence like \"The dog and cat sat on the mat\" as \\[1, 42, 96, ...\\]. One problem still remains. Although we know dogs and cats are related, our representation doesn't encode that information for the classifier (the numbers 42 and 96 were arbitrarily chosen). \n", + "\n", + "Unlike the above methods, a word embedding is learned from data. An embedding represents each word as a n-dimensional vector of floating point values. These values are traininable parameters, weights learned while training the model. After training, we hope that similar words will be close together in the embedding space. We can visualize the learned embeddings by projecting them down to a 2- or 3-dimensional space.\n", + "\n", + "There are two ways to obtain word embeddings:\n", + "\n", + "* Learn word embeddings jointly with the main task you care about (e.g. sentiment classification). In this case, you would start with random word vectors, then learn your word vectors in the same way that you learn the weights of a neural network.\n", + "\n", + "* Load word embeddings into your model that were pre-computed using a different machine learning task than the one you are trying to solve. These are called \"pre-trained word embeddings\".\n", + "\n", + "Here, we will take the first approach." + ] + }, + { + "metadata": { + "id": "7zQoILT9e-z2", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "fb52a330-ffe1-4754-f198-3ff3547091ea" + }, + "cell_type": "code", + "source": [ + "import tensorflow as tf\n", + "from tensorflow import keras\n", + "print(tf.__version__)" + ], + "execution_count": 3, + "outputs": [ + { + "output_type": "stream", + "text": [ + "1.11.0\n" + ], + "name": "stdout" + } + ] + }, + { + "metadata": { + "id": "X6MReNwJlOHJ", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "# Download the IMDB dataset\n", + "\n", + "The IMDB dataset comes packaged with TensorFlow. It has already been preprocessed such that the reviews (sequences of words) have been converted to sequences of integers, where each integer represents a specific word in a dictionary." 
+ ] + }, + { + "metadata": { + "id": "pxf4Qu3xe-0E", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "imdb = keras.datasets.imdb\n", + "\n", + "# Number of words to consider as features\n", + "num_words = 1000\n", + "\n", + "# load IMDB dataset as lists of integers\n", + "(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=num_words)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "BGBirQQ2l1h7", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "The argument num_words=1000 keeps the top 1,000 most frequently occurring words in the training data." + ] + }, + { + "metadata": { + "id": "GDRePL1flhhw", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "64ba8f56-847f-47fc-ee17-0825e51ae24a" + }, + "cell_type": "code", + "source": [ + "print(\"Training examples: {}, labels: {}\".format(len(train_data), len(train_labels)))" + ], + "execution_count": 5, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Training examples: 25000, labels: 25000\n" + ], + "name": "stdout" + } + ] + }, + { + "metadata": { + "id": "RdVN8hYGloP7", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "The text of reviews have been converted to integers, where each integer represents a specific word in a dictionary. Here's what the first review looks like:" + ] + }, + { + "metadata": { + "id": "gO0nNEZklltp", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 54 + }, + "outputId": "7f9b25f2-6ba1-409d-8727-3a27423d8fe8" + }, + "cell_type": "code", + "source": [ + "print(train_data[0])" + ], + "execution_count": 6, + "outputs": [ + { + "output_type": "stream", + "text": [ + "[1, 14, 22, 16, 43, 530, 973, 2, 2, 65, 458, 2, 66, 2, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 2, 2, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2, 19, 14, 22, 4, 2, 2, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 2, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2, 2, 16, 480, 66, 2, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 2, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 2, 15, 256, 4, 2, 7, 2, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 2, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2, 56, 26, 141, 6, 194, 2, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 2, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 2, 88, 12, 16, 283, 5, 16, 2, 113, 103, 32, 15, 16, 2, 19, 178, 32]\n" + ], + "name": "stdout" + } + ] + }, + { + "metadata": { + "id": "bxGWS8mWlrGh", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Movie reviews may be different lengths. The below code shows the number of words in the first and second reviews. Since inputs to a neural network must be the same length, we'll need to resolve this." 
+ ] + }, + { + "metadata": { + "id": "x6Bbi27-ltzb", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "bed21543-b621-465b-9a2a-9854db01d021" + }, + "cell_type": "code", + "source": [ + "len(train_data[0]), len(train_data[1])" + ], + "execution_count": 7, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "(218, 189)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 7 + } + ] + }, + { + "metadata": { + "id": "aLcQxHtmmHBb", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "We will pad the arrays so they all have the same length, using the [https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/sequence/pad_sequences](pad_sequences) method. In this case, TensorFlow will create new matrix of shape ```max_len * num_examples```:" + ] + }, + { + "metadata": { + "id": "A2JwvoSve-0H", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "393a478f-1624-4dc2-a844-bfa09ffacb40" + }, + "cell_type": "code", + "source": [ + "# Cut texts after this number of words \n", + "max_len = 250\n", + "\n", + "# Convert our lists of integers into 2D tensors\n", + "train_data = keras.preprocessing.sequence.pad_sequences(train_data, \n", + " maxlen=max_len)\n", + "test_data = keras.preprocessing.sequence.pad_sequences(test_data, \n", + " maxlen=max_len)\n", + "\n", + "print(train_data.shape)" + ], + "execution_count": 8, + "outputs": [ + { + "output_type": "stream", + "text": [ + "(25000, 250)\n" + ], + "name": "stdout" + } + ] + }, + { + "metadata": { + "id": "I2oJlJoQe-0L", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Notice the pad sequences method worked by prepending '0's to the start of the sequence:" + ] + }, + { + "metadata": { + "id": "MvAGi0aPmsRH", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 255 + }, + "outputId": "6a05df18-f10e-48e9-ce9a-01d1e3732fe1" + }, + "cell_type": "code", + "source": [ + "print(train_data[0])" + ], + "execution_count": 9, + "outputs": [ + { + "output_type": "stream", + "text": [ + "[ 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0\n", + " 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 14 22 16\n", + " 43 530 973 2 2 65 458 2 66 2 4 173 36 256 5 25 100 43\n", + " 838 112 50 670 2 9 35 480 284 5 150 4 172 112 167 2 336 385\n", + " 39 4 172 2 2 17 546 38 13 447 4 192 50 16 6 147 2 19\n", + " 14 22 4 2 2 469 4 22 71 87 12 16 43 530 38 76 15 13\n", + " 2 4 22 17 515 17 12 16 626 18 2 5 62 386 12 8 316 8\n", + " 106 5 4 2 2 16 480 66 2 33 4 130 12 16 38 619 5 25\n", + " 124 51 36 135 48 25 2 33 6 22 12 215 28 77 52 5 14 407\n", + " 16 82 2 8 4 107 117 2 15 256 4 2 7 2 5 723 36 71\n", + " 43 530 476 26 400 317 46 7 4 2 2 13 104 88 4 381 15 297\n", + " 98 32 2 56 26 141 6 194 2 18 4 226 22 21 134 476 26 480\n", + " 5 144 30 2 18 51 36 28 224 92 25 104 4 226 65 16 38 2\n", + " 88 12 16 283 5 16 2 113 103 32 15 16 2 19 178 32]\n" + ], + "name": "stdout" + } + ] + }, + { + "metadata": { + "id": "UA34t8p1mu7-", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "We are now ready to build our model. We will use an [Embedding](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding) layer to map from an integer that corresponds to a word, to a vector of floating point weights (the embedding). These weights are learned when we train the model." 
+ ]
+ },
+ {
+ "metadata": {
+ "id": "rl_sY4nFe-0N",
+ "colab_type": "code",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 649
+ },
+ "outputId": "b6d174e9-0b82-4d4a-c377-87df511cd7a4"
+ },
+ "cell_type": "code",
+ "source": [
+ "from tensorflow.keras.models import Sequential\n",
+ "from tensorflow.keras.layers import Dense, Embedding, Flatten\n",
+ "\n",
+ "embedding_dimension = 16\n",
+ "\n",
+ "model = Sequential()\n",
+ "model.add(Embedding(num_words, embedding_dimension, input_length=max_len))\n",
+ "\n",
+ "# The Embedding layer outputs a 3D tensor of shape (samples, max_len, embedding_dimension).\n",
+ "# We flatten it into a 2D tensor of shape (samples, max_len * embedding_dimension).\n",
+ "model.add(Flatten())\n",
+ "\n",
+ "# Add a classifier on top.\n",
+ "model.add(Dense(1, activation='sigmoid'))\n",
+ "model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])\n",
+ "\n",
+ "model.summary()\n",
+ "\n",
+ "history = model.fit(\n",
+ "    train_data,\n",
+ "    train_labels,\n",
+ "    epochs=10,\n",
+ "    batch_size=32,\n",
+ "    validation_split=0.2\n",
+ ")"
+ ],
+ "execution_count": 10,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "_________________________________________________________________\n",
+ "Layer (type) Output Shape Param # \n",
+ "=================================================================\n",
+ "embedding (Embedding) (None, 250, 16) 16000 \n",
+ "_________________________________________________________________\n",
+ "flatten (Flatten) (None, 4000) 0 \n",
+ "_________________________________________________________________\n",
+ "dense (Dense) (None, 1) 4001 \n",
+ "=================================================================\n",
+ "Total params: 20,001\n",
+ "Trainable params: 20,001\n",
+ "Non-trainable params: 0\n",
+ "_________________________________________________________________\n"
+ ],
+ "name": "stdout"
+ },
+ {
+ "output_type": "stream",
+ "text": [
+ "/usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/gradients_impl.py:108: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.\n",
+ "  \"Converting sparse IndexedSlices to a dense Tensor of unknown shape. \"\n"
+ ],
+ "name": "stderr"
+ },
+ {
+ "output_type": "stream",
+ "text": [
+ "Train on 20000 samples, validate on 5000 samples\n",
+ "Epoch 1/10\n",
+ "20000/20000 [==============================] - 1s 70us/step - loss: 0.5390 - acc: 0.7149 - val_loss: 0.3705 - val_acc: 0.8420\n",
+ "Epoch 2/10\n",
+ "20000/20000 [==============================] - 1s 59us/step - loss: 0.3294 - acc: 0.8575 - val_loss: 0.3370 - val_acc: 0.8596\n",
+ "Epoch 3/10\n",
+ "20000/20000 [==============================] - 1s 60us/step - loss: 0.2908 - acc: 0.8793 - val_loss: 0.3351 - val_acc: 0.8606\n",
+ "Epoch 4/10\n",
+ "20000/20000 [==============================] - 1s 60us/step - loss: 0.2653 - acc: 0.8927 - val_loss: 0.3388 - val_acc: 0.8590\n",
+ "Epoch 5/10\n",
+ "20000/20000 [==============================] - 1s 60us/step - loss: 0.2364 - acc: 0.9082 - val_loss: 0.3434 - val_acc: 0.8624\n",
+ "Epoch 6/10\n",
+ "20000/20000 [==============================] - 1s 61us/step - loss: 0.2064 - acc: 0.9223 - val_loss: 0.3606 - val_acc: 0.8538\n",
+ "Epoch 7/10\n",
+ "20000/20000 [==============================] - 1s 59us/step - loss: 0.1784 - acc: 0.9385 - val_loss: 0.3734 - val_acc: 0.8458\n",
+ "Epoch 8/10\n",
+ "20000/20000 [==============================] - 1s 60us/step - loss: 0.1512 - acc: 0.9528 - val_loss: 0.4006 - val_acc: 0.8394\n",
+ "Epoch 9/10\n",
+ "20000/20000 [==============================] - 1s 60us/step - loss: 0.1259 - acc: 0.9645 - val_loss: 0.4164 - val_acc: 0.8396\n",
+ "Epoch 10/10\n",
+ "20000/20000 [==============================] - 1s 60us/step - loss: 0.1035 - acc: 0.9748 - val_loss: 0.4409 - val_acc: 0.8366\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "id": "qyXxmsDGe-0V",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Our classifier has a validation accuracy of about 83%. Note that we make use of at most 250 words from each review. We are also flattening the embedded sequence and passing it to a single Dense layer, which treats each word separately and ignores the order of the words in the sequence. To reach higher accuracy, it would be helpful to use a recurrent layer or a 1D convolution, which take word order into account."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "yNJEYzWCTopy",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "# Visualize Embeddings with the Embedding Projector\n",
+ "\n",
+ "Recall that the reviews are encoded as sequences of integers in our training data. Before we can visualize the learned embeddings, we first need to determine which word corresponds to each number. The IMDB dataset includes a utility method, ```get_word_index()```, that returns a mapping from words to numbers. We will use it to build a reversed word index, which maps from numbers to words."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "96uw4Szxe-0b",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "# A dictionary mapping words to an integer index\n",
+ "word_index = imdb.get_word_index()\n",
+ "\n",
+ "# The first indices are reserved for special tokens\n",
+ "word_index = {k:(v+3) for k,v in word_index.items()} \n",
+ "word_index[\"<PAD>\"] = 0\n",
+ "word_index[\"<START>\"] = 1\n",
+ "word_index[\"<UNK>\"] = 2  # unknown\n",
+ "word_index[\"<UNUSED>\"] = 3\n",
+ "\n",
+ "reverse_word_index = dict([(value, key) for (key, value) in word_index.items()])\n",
+ "\n",
+ "def decode_review(text):\n",
+ "    return ' '.join([reverse_word_index[i] for i in text])"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "uZ6qkJOdfmfj",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Now we can use the decode_review function to display the text of the first review. You will see padding at the beginning, since this review was shorter than our 250-word maximum length."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "FZaYLC13flhp",
+ "colab_type": "code",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 54
+ },
+ "outputId": "4d520690-780f-4785-8c36-10275bb33890"
+ },
+ "cell_type": "code",
+ "source": [
+ "decode_review(train_data[0])"
+ ],
+ "execution_count": 12,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "\" this film was just brilliant casting story direction really the part they played and you could just imagine being there robert is an amazing actor and now the same being director father came from the same as myself so i loved the fact there was a real with this film the throughout the film were great it was just brilliant so much that i the film as soon as it was released for and would recommend it to everyone to watch and the was amazing really at the end it was so sad and you know what they say if you at a film it must have been good and this definitely was also to the two little that played the of and paul they were just brilliant children are often left out of the i think because the stars that play them all up are such a big for the whole film but these children are amazing and should be for what they have done don't you think the whole story was so because it was true and was life after all that was with us all\""
+ ]
+ },
+ "metadata": {
+ "tags": []
+ },
+ "execution_count": 12
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "id": "2ngIjD8HhHsS",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Now that we have the number-to-word mapping, we are ready to retrieve the learned embedding from the model. This gives us a matrix of weights. Each row corresponds to one integer index, and the matching word can be found with our ```reverse_word_index``` mapping above.\n",
+ "\n",
+ "We retrieve the weights using ```model.layers``` to select the layer, then calling its ```get_weights()``` method. In this case, the embedding layer is the first layer we added to the model.\n",
+ "\n"
+ ]
+ },
+ {
+ "metadata": {
+ "id": "N7o16O-aUlzv",
+ "colab_type": "code",
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ },
+ "outputId": "1654fa8f-dc2a-42a1-a2c6-c5a9fb1bfe5d"
+ },
+ "cell_type": "code",
+ "source": [
+ "e = model.layers[0]\n",
+ "weights = e.get_weights()[0]\n",
+ "print(weights.shape) # (1000, 16): each word is mapped to a 16-dimensional embedding vector.\n",
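+ "\n",
+ "# Each row of `weights` is the embedding for the word encoded as that row's index.\n",
+ "# For example, weights[word_index['great']] would be the learned vector for 'great',\n",
+ "# assuming that word falls within the num_words most frequent words we kept."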
+ ],
+ "execution_count": 13,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "text": [
+ "(1000, 16)\n"
+ ],
+ "name": "stdout"
+ }
+ ]
+ },
+ {
+ "metadata": {
+ "id": "hbOD5Rv3hV1m",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Next, we will format these weights for visualization in the Embedding Projector. To do so, we need to provide two files in tab-separated format: a file of vectors (containing the embeddings), and a file of metadata (containing the words)."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "U2q09l-8WB0j",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "out_v = open('vecs.tsv', 'w')\n",
+ "out_m = open('meta.tsv', 'w')\n",
+ "for word_num in range(num_words):\n",
+ "    word = reverse_word_index[word_num]\n",
+ "    embeddings = weights[word_num]\n",
+ "    out_m.write(word + \"\\n\")\n",
+ "    out_v.write('\\t'.join([str(x) for x in embeddings]) + \"\\n\")\n",
+ "out_v.close()\n",
+ "out_m.close()"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "JGYPLGwqhrH6",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "If you are running this tutorial in [Colaboratory](https://colab.research.google.com), you can use the following snippet to download these files to your local machine."
+ ]
+ },
+ {
+ "metadata": {
+ "id": "0yuZjaYRWG2A",
+ "colab_type": "code",
+ "colab": {}
+ },
+ "cell_type": "code",
+ "source": [
+ "from google.colab import files\n",
+ "files.download('vecs.tsv')\n",
+ "files.download('meta.tsv')"
+ ],
+ "execution_count": 0,
+ "outputs": []
+ },
+ {
+ "metadata": {
+ "id": "SM0_cG11deiq",
+ "colab_type": "text"
+ },
+ "cell_type": "markdown",
+ "source": [
+ "Now, you can open the [Embedding Projector](http://projector.tensorflow.org/) in a new window, and click on 'Load data'. Upload the ```vecs.tsv``` and ```meta.tsv``` files from above. Next, click 'Search', and type in a word to find its closest neighbors. With this small dataset, not all of the learned embeddings will be interpretable, though some will be! \n",
+ "\n",
+ "For example, try searching for 'beautiful'. The learned embeddings you see may be different, since they depend on the random weight initialization used by the model. When the author of this tutorial ran it, they saw that \"loved\" and \"wonderful\" were the closest neighbors. Likewise, the closest neighbors for \"lame\" were \"awful\" and \"poorly\".\n",
+ "\n",
+ "# Next steps\n",
+ "* To learn more about word embeddings, we recommend browsing [this](https://www.tensorflow.org/tutorials/representation/word2vec) older tutorial (the code is out of date, and we recommend skipping it in favor of the newer version here, but the explanation and diagrams are useful).\n",
+ "\n",
+ "* [TensorFlow Hub](https://www.tensorflow.org/hub/) contains large databases of pretrained word embeddings you can download and reuse in your projects (although at the time of writing, these use a different programming style than the one in this tutorial); a rough sketch of what that looks like follows below.\n",
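+ "\n",
+ "As a rough sketch of that second option (this snippet is an illustration, not part of the tutorial's code: it assumes the separate `tensorflow_hub` package and a TensorFlow 1.x session, and the module URL is just one example from [tfhub.dev](https://tfhub.dev)):\n",
+ "\n",
+ "```python\n",
+ "import tensorflow as tf\n",
+ "import tensorflow_hub as hub\n",
+ "\n",
+ "# Load a pre-trained text embedding module and embed two example sentences.\n",
+ "embed = hub.Module(\"https://tfhub.dev/google/nnlm-en-dim128/1\")\n",
+ "sentence_embeddings = embed([\"this movie was great\", \"this movie was terrible\"])\n",
+ "\n",
+ "with tf.Session() as sess:\n",
+ "    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])\n",
+ "    print(sess.run(sentence_embeddings).shape)  # one 128-dimensional vector per sentence\n",
+ "```"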
+ ] } - ], - "source": [ - "from keras.models import Sequential\n", - "from keras.layers import Flatten, Dense\n", - "\n", - "DIM = 8\n", - "\n", - "# create a Sequential model\n", - "model = Sequential()\n", - "# lets add our Dense layer\n", - "model.add( Dense(DIM, activation='relu', input_shape=(EMBED_SIZE,)) )\n", - "\n", - "\n", - "# Let's add a classifier.\n", - "model.add( Dense(1, activation='sigmoid') )\n", - "model.compile( optimizer='adam', loss='binary_crossentropy', metrics=['acc'] )\n", - "\n", - "model.summary()\n", - "\n", - "history = model.fit(\n", - " X_train,\n", - " y_train,\n", - " epochs=3,\n", - " batch_size=32,\n", - " validation_split=0.2\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this run, our validation accuracy was just under 50%!" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.6.5" - } - }, - "nbformat": 4, - "nbformat_minor": 2 + ] }