## Word embeddings
https://www.tensorflow.org/alpha/tutorials/text/word_embeddings

### Ways of representing text as numbers
We can encode each word in a text as one of the following:
- **A one-hot vector**: inefficient, because the vector has as many elements as the vocabulary and all but one of them are zero.
- **A unique number**: the integer assigned to each word is arbitrary, so the encoding does not reflect any relationship between words.
- **An embedding vector**: a fixed-length dense vector of learnable values that represents a word.
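
To make the comparison concrete, here is a minimal sketch of the three encodings on a toy vocabulary. The vocabulary, the helper names (`oneHot`, `integerCode`, `embed`), and the embedding size of 3 are assumptions for illustration only; the embedding values here are random, whereas in a real model they are learned.

```python
import random

# Toy vocabulary, made up for illustration.
vocab = ["the", "cat", "sat", "on", "mat"]
wordToIndex = {word: i for i, word in enumerate(vocab)}

def oneHot(word):
    """Sparse encoding: a single 1 at the word's index, 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[wordToIndex[word]] = 1
    return vec

def integerCode(word):
    """Compact encoding: an arbitrary integer per word."""
    return wordToIndex[word]

# Dense encoding: one short float vector per word.
# Random here; a model would learn these values during training.
random.seed(0)
toyEmbeddingDim = 3
embeddingTable = [[random.uniform(-1, 1) for _ in range(toyEmbeddingDim)]
                  for _ in vocab]

def embed(word):
    return embeddingTable[wordToIndex[word]]

print(oneHot("cat"))       # [0, 1, 0, 0, 0]
print(integerCode("cat"))  # 1
print(len(embed("cat")))   # 3
```

The one-hot vector grows with the vocabulary, while the embedding vector keeps the same length however many words there are.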

### Embedding layer

In [1]:
import tensorflow as tf
from tensorflow import keras

An embedding layer can be created as follows:

In [2]:
embeddingLayer = keras.layers.Embedding(1000, 32)

The first argument of `Embedding`, `input_dim`, is the number of distinct indices the layer can look up, that is, the maximum word index plus 1 (here 1000). The second argument, `output_dim`, is the dimensionality of the embedding (here 32).

The embedding layer maps word indices to dense vectors. The input is a tensor of `int` elements with shape `(samples, sequenceLength)`, and the output is a tensor of `float` elements with shape `(samples, sequenceLength, embeddingDimensionality)`.
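
Conceptually, the layer behaves like a lookup table indexed by word index. The sketch below imitates that lookup in plain Python to show how the shapes relate. It is not the Keras implementation, and `table` and `embeddingLookup` are hypothetical names introduced for illustration.

```python
import random

random.seed(0)
inputDim, outputDim = 1000, 32

# Stand-in for the layer's weight matrix: one row of outputDim floats per index.
table = [[random.uniform(-0.05, 0.05) for _ in range(outputDim)]
         for _ in range(inputDim)]

def embeddingLookup(batch):
    """Replace every integer index in the batch with its row of the table."""
    return [[table[idx] for idx in sequence] for sequence in batch]

batch = [[0, 1, 2], [3, 4, 5]]   # shape (samples=2, sequenceLength=3)
output = embeddingLookup(batch)

# Output shape is (samples, sequenceLength, embeddingDimensionality).
print(len(output), len(output[0]), len(output[0][0]))  # 2 3 32
```

Every occurrence of the same index maps to the same row, which is why the layer has exactly `inputDim * outputDim` weights.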

### Data preprocessing
We will use the IMDB dataset and preprocess it as we did previously.

In [6]:
vocabularySize = 10000
(trainData, trainLabels), (testData, testLabels) = keras.datasets.imdb.load_data(num_words=vocabularySize)

wordToIdx = {word:(idx + 3)
             for word,idx in keras.datasets.imdb.get_word_index().items()}
wordToIdx["<PAD>"] = 0
wordToIdx["<START>"] = 1
wordToIdx["<UNK>"] = 2  # unknown
wordToIdx["<UNUSED>"] = 3
idxToWord = {value:key for key,value in wordToIdx.items()}

# Padding
maxLength = 500
trainData = keras.preprocessing.sequence.pad_sequences(trainData,
                                                       value=wordToIdx["<PAD>"],
                                                       padding='post',
                                                       maxlen=maxLength)
testData = keras.preprocessing.sequence.pad_sequences(testData,
                                                      value=wordToIdx["<PAD>"],
                                                      padding='post',
                                                      maxlen=maxLength)

# Helper function for decoding integer encodings.
def decodeReview(wordIdxs):
    return ' '.join(idxToWord.get(idx, '?') for idx in wordIdxs)

In [10]:
print(trainData.shape)
print(trainLabels.shape)
print(testData.shape)
print(testLabels.shape)

(25000, 500)
(25000,)
(25000, 500)
(25000,)
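
Every review ends up with exactly `maxLength` entries because `pad_sequences` both pads and truncates. The following is a rough pure-Python sketch of what happens to a single sequence with `padding='post'` and the default `truncating='pre'`, under which overly long sequences lose entries from the front. `padPost` is a hypothetical helper, not part of Keras.

```python
def padPost(sequence, maxLength, value=0):
    """Keep the last maxLength entries, then pad on the right with value.

    Assumes maxLength > 0.
    """
    truncated = sequence[-maxLength:]
    return truncated + [value] * (maxLength - len(truncated))

print(padPost([5, 6, 7], 5))           # [5, 6, 7, 0, 0]
print(padPost([1, 2, 3, 4, 5, 6], 5))  # [2, 3, 4, 5, 6]
```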


### Model

In [12]:
embeddingDim = 16

model = keras.Sequential([
    keras.layers.Embedding(vocabularySize, embeddingDim, input_length=maxLength),
    keras.layers.GlobalAvgPool1D(),
    keras.layers.Dense(16, 'relu'),
    keras.layers.Dense(1, 'sigmoid')
])

model.summary()

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_2 (Embedding)      (None, 500, 16)           160000    
_________________________________________________________________
global_average_pooling1d (Gl (None, 16)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 16)                272       
_________________________________________________________________
dense_3 (Dense)              (None, 1)                 17        
Total params: 160,289
Trainable params: 160,289
Non-trainable params: 0
_________________________________________________________________
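
The parameter counts in the summary can be checked by hand. The sketch below redoes the arithmetic and also shows, on a tiny made-up input, what global average pooling does to one sequence of vectors. `globalAveragePool` is a hypothetical helper written for illustration, not the Keras layer.

```python
vocabularySize, embeddingDim = 10000, 16

embeddingParams = vocabularySize * embeddingDim  # one 16-d vector per index
hiddenParams = embeddingDim * 16 + 16            # weights + biases of Dense(16)
outputParams = 16 * 1 + 1                        # weights + bias of Dense(1)
totalParams = embeddingParams + hiddenParams + outputParams

print(embeddingParams, hiddenParams, outputParams, totalParams)
# 160000 272 17 160289

def globalAveragePool(sequenceVectors):
    """Average over the sequence axis: (steps, features) -> (features,)."""
    steps = len(sequenceVectors)
    return [sum(column) / steps for column in zip(*sequenceVectors)]

print(globalAveragePool([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

The pooling layer has no weights of its own, which matches the `0` in its `Param #` column.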


### Training

In [13]:
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy'])

history = model.fit(trainData,
                    trainLabels,
                    epochs=30,
                    batch_size=512,
                    validation_split=0.2)

Train on 20000 samples, validate on 5000 samples
Epoch 1/30
...
Epoch 30/30
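
The "Train on 20000 samples, validate on 5000 samples" line follows from `validation_split=0.2`: Keras holds out the last 20% of the provided samples for validation. A small sketch of that arithmetic, with a hypothetical helper `splitForValidation`:

```python
import math

def splitForValidation(numSamples, validationSplit):
    """Return (training count, validation count) for a fractional split."""
    numValidation = round(numSamples * validationSplit)
    return numSamples - numValidation, numValidation

numTrain, numValidation = splitForValidation(25000, 0.2)
print(numTrain, numValidation)  # 20000 5000

# With batch_size=512, each epoch takes this many gradient updates.
stepsPerEpoch = math.ceil(numTrain / 512)
print(stepsPerEpoch)  # 40
```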


### Retrieving the learned embeddings
The learned embeddings form a matrix of shape `(vocabularySize, embeddingDim)`:

In [18]:
embeddingW = model.layers[0].get_weights()[0]
print(embeddingW.shape)

(10000, 16)


We will plot the embeddings using [**Embedding Projector**](http://projector.tensorflow.org/). Let's prepare tab-separated files for the embedding vectors and the metadata.

In [19]:
with open('vecs.tsv', 'w', encoding='utf-8') as outV:
    with open('meta.tsv', 'w', encoding='utf-8') as outM:
        for idx in range(vocabularySize):
            word = idxToWord[idx]
            vector = embeddingW[idx]
            outM.write(word + '\n')
            outV.write('\t'.join(str(x) for x in vector) + '\n')

### Visualizing the embeddings
By uploading the prepared files to [Embedding Projector](http://projector.tensorflow.org/) through `Load`, we can see a 2D or 3D projection of the embeddings. From the projection, we can inspect which words are neighbors of a word of interest. The example below shows the neighbors of "beautiful".
![](https://raw.githubusercontent.com/tensorflow/docs/master/site/en/r2/tutorials/text/images/embedding.jpg)
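
A neighbor list of this kind can also be approximated offline by ranking words by cosine similarity to the query word. The sketch below uses made-up 3-dimensional vectors; `cosineSimilarity`, `nearestNeighbors`, and `toyVectors` are hypothetical names. With the trained model, one could instead build the dictionary from `idxToWord` and the rows of `embeddingW`.

```python
import math

def cosineSimilarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    normA = math.sqrt(sum(x * x for x in a))
    normB = math.sqrt(sum(y * y for y in b))
    return dot / (normA * normB)

def nearestNeighbors(word, vectors, topK=3):
    """Return the topK (word, similarity) pairs most similar to `word`."""
    query = vectors[word]
    scored = [(other, cosineSimilarity(query, vec))
              for other, vec in vectors.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topK]

# Made-up vectors for illustration; real ones would come from embeddingW.
toyVectors = {
    "beautiful": [0.9, 0.8, 0.1],
    "wonderful": [0.8, 0.9, 0.2],
    "awful": [-0.9, -0.7, 0.1],
    "boring": [-0.6, -0.8, 0.3],
}

neighbors = nearestNeighbors("beautiful", toyVectors, topK=2)
print(neighbors[0][0])  # wonderful
```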