In [1]:
import numpy as np
import pandas as pd  # For data handling

from datetime import datetime

from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers.embeddings import Embedding

import warnings
warnings.filterwarnings('ignore')

Using TensorFlow backend.


In [2]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 500)
pd.set_option("display.max_rows", 999)
pd.set_option('display.max_colwidth', 500)

In [3]:
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

Word embeddings provide a dense representation of words and their relative meanings.

Here we will consider:

* Word embeddings via the Embedding layer.
* How to learn a word embedding while fitting a neural network.

<hr>

* Word Embedding.
* Keras Embedding Layer.
* Example of Learning an Embedding.

A word embedding is a class of approaches for representing words and documents using a dense vector representation.

It is an improvement over the more traditional bag-of-words model, which represents each document as a large, sparse vector of word counts.

The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.
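To make the contrast between sparse and dense representations concrete, here is a minimal sketch. The toy vocabulary, the 3-dimensional vectors and their values are all made up for illustration; in a real embedding the dense vectors are learned from text, not hand-picked.

```python
import numpy as np

# Hypothetical toy vocabulary; the indices are arbitrary.
vocab = {"good": 0, "great": 1, "poor": 2}

# Sparse representation: one dimension per vocabulary word.
one_hot_vectors = np.eye(len(vocab))

# Dense representation: hand-picked illustrative values, NOT learned.
dense_vectors = np.array([
    [0.9, 0.1, 0.3],    # good
    [0.8, 0.2, 0.4],    # great
    [-0.7, 0.1, -0.2],  # poor
])

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Any two distinct one-hot vectors are orthogonal, so they carry no notion of similarity.
sparse_sim = cosine(one_hot_vectors[vocab["good"]], one_hot_vectors[vocab["great"]])

# Dense vectors can place related words close together and unrelated words far apart.
dense_sim_close = cosine(dense_vectors[vocab["good"]], dense_vectors[vocab["great"]])
dense_sim_far = cosine(dense_vectors[vocab["good"]], dense_vectors[vocab["poor"]])

print(sparse_sim, dense_sim_close, dense_sim_far)
```

With one-hot vectors every pair of different words looks equally unrelated, whereas a dense space leaves room for "good" and "great" to end up near each other.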


### Keras Embedding Layer
Keras provides an Embedding layer that can be used in neural networks on text data.

It requires the input data to be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed with the Tokenizer API provided with Keras, or with tools from the sklearn or nltk packages.
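As a rough illustration of what integer encoding means, the sketch below builds a word-to-integer mapping by hand. It is a simplified stand-in, not the Keras Tokenizer: the helper names `build_word_index` and `encode` are hypothetical, words are split on whitespace only, and punctuation is not stripped.

```python
def build_word_index(documents):
    """Map each distinct lowercase word to an integer, starting at 1 (0 is left free)."""
    word_index = {}
    for doc in documents:
        for word in doc.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1
    return word_index

def encode(documents, word_index):
    """Replace every word by its integer."""
    return [[word_index[w] for w in doc.lower().split()] for doc in documents]

sample_docs = ["good work", "poor work"]
word_index = build_word_index(sample_docs)
encoded = encode(sample_docs, word_index)

print(word_index)  # {'good': 1, 'work': 2, 'poor': 3}
print(encoded)     # [[1, 2], [3, 2]]
```

Leaving index 0 unused is a common convention, because 0 is often used as a padding value.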

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

It is a flexible layer that can be used:

* Alone to learn a word embedding that can be saved and used in another model later.
* As part of a deep learning model where the embedding is learned along with the model itself.
* To load a pre-trained word embedding model.
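For the pre-trained case, the usual preparation step is to build a weight matrix whose row `i` holds the vector of the word encoded as integer `i`. The sketch below shows only that step with NumPy. The `pretrained` dictionary and the `word_index` mapping are hypothetical toy data; real vectors would be read from a file.

```python
import numpy as np

# Hypothetical pre-trained vectors (word -> vector).
pretrained = {
    "good": np.array([0.5, 0.1]),
    "work": np.array([-0.2, 0.7]),
}

# Hypothetical word-to-integer mapping; index 0 is left unused.
word_index = {"good": 1, "work": 2, "effort": 3}

embedding_dim = 2
embedding_matrix = np.zeros((len(word_index) + 1, embedding_dim))
for word, idx in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        embedding_matrix[idx] = vector  # words without a pre-trained vector stay all-zero

print(embedding_matrix.shape)  # (4, 2)
```

In the Keras version used in this notebook, such a matrix is typically handed to the layer through its `weights` argument, together with `trainable=False` if the vectors should stay fixed. Treat that as an assumption to check against the documentation of your installed version.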

#### Keras Embedding Layer arguments:

**input_dim**: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values from 0 to 10, then the size of the vocabulary would be 11 words.

**output_dim**: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.

**input_length**: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.

For example, below we define an Embedding layer with a vocabulary of 200 (e.g. integer encoded words from 0 to 199, inclusive), a vector space of 32 dimensions in which words will be embedded, and input documents that have 50 words each.

e = Embedding(200, 32, input_length=50)
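Conceptually, such a layer is a lookup table. The NumPy sketch below imitates it under stated assumptions: the weight matrix is filled with small random values as a stand-in for the layer's random initialization, and the batch of 3 documents is made up.

```python
import numpy as np

vocab_size, embed_dim, input_length = 200, 32, 50
rng = np.random.default_rng(0)

# Stand-in for the layer's weight matrix: one row per vocabulary index.
weights = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

# A hypothetical batch of 3 integer-encoded documents, 50 words each.
batch = rng.integers(0, vocab_size, size=(3, input_length))

# The lookup replaces every integer with its row of the weight matrix.
output = weights[batch]

print(weights.size)  # 6400 values to learn (200 * 32)
print(output.shape)  # (3, 50, 32)
```

Each document therefore becomes a 50 x 32 matrix, one 32-dimensional vector per word.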

### Example of Learning an Embedding

In [4]:
# define documents
docs = ['Well done',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = np.array([1,1,1,1,1,0,0,0,0,0])

In [5]:
# Integer encode the documents with one_hot, which hashes each word to an integer.
# We choose a vocabulary size of 50, much larger than needed, to reduce (but not eliminate) the probability of hash collisions.
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)

[[28, 31], [24, 39], [1, 37], [38, 39], [34], [24], [28, 37], [22, 24], [28, 39], [46, 4, 31, 7]]
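Note that the output above still contains collisions: "Good" and "Weak" both map to 24, and "Well" and "Poor" both map to 28. The sketch below illustrates the hashing idea with a deterministic MD5 hash. It is an approximation, not the Keras implementation: the helper names are hypothetical, and punctuation is not filtered here.

```python
import hashlib

def hash_encode(text, n):
    """Map each lowercase word to an integer in [1, n - 1] using an MD5 hash."""
    indices = []
    for word in text.lower().split():
        digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        indices.append(digest % (n - 1) + 1)
    return indices

def count_collisions(words, n):
    """Count distinct words that share an index with another distinct word."""
    groups = {}
    for word in set(w.lower() for w in words):
        groups.setdefault(hash_encode(word, n)[0], []).append(word)
    return sum(len(group) - 1 for group in groups.values())

words = ["well", "done", "good", "work", "great", "effort", "nice", "weak", "poor", "not"]
for n in (5, 50, 5000):
    print(n, count_collisions(words, n))
```

A larger `n` makes collisions less likely, but only an explicit word-to-integer mapping guarantees that every word gets its own index.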


In [6]:
# The sequences have different lengths, but Keras expects all inputs in a batch to have the same length.
# We pad every sequence with zeros to the length of the longest document, which is 4 words.
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)

[[28 31  0  0]
 [24 39  0  0]
 [ 1 37  0  0]
 [38 39  0  0]
 [34  0  0  0]
 [24  0  0  0]
 [28 37  0  0]
 [22 24  0  0]
 [28 39  0  0]
 [46  4 31  7]]
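The padding step itself is simple. Here is a plain-Python sketch of post-padding, assuming `maxlen` is at least 1. The helper `pad_post` is hypothetical, and it drops items from the front of over-long sequences, which is my understanding of the default truncation in `pad_sequences`.

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; over-long sequences lose their earliest items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep at most the last maxlen items
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

print(pad_post([[28, 31], [34], [46, 4, 31, 7]], maxlen=4))
# [[28, 31, 0, 0], [34, 0, 0, 0], [46, 4, 31, 7]]
```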


We are now ready to define our Embedding layer.

The Embedding has a vocabulary of 50 and an input length of 4. We will choose a small embedding space of 8 dimensions.

The model is a binary classification model.

Output from the Embedding layer will be 4 vectors of 8 dimensions each, one for each word. 

We flatten this to a single 32-element vector to pass on to the Dense output layer.
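The shapes and parameter counts described here can be traced with a small NumPy sketch of the forward pass. The weights are random placeholders, not trained values, so the resulting probability is meaningless; only the shapes and counts matter.

```python
import numpy as np

vocab_size, embed_dim, max_length = 50, 8, 4
rng = np.random.default_rng(0)

embedding = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))  # 50 * 8 = 400 weights
dense_w = rng.normal(size=(max_length * embed_dim, 1))               # 32 weights
dense_b = np.zeros(1)                                                # 1 bias

doc = np.array([28, 31, 0, 0])   # one padded, integer-encoded document
vectors = embedding[doc]         # shape (4, 8): one 8-dimensional vector per position
flat = vectors.reshape(-1)       # shape (32,)
prob = 1.0 / (1.0 + np.exp(-(flat @ dense_w + dense_b)))  # sigmoid output, shape (1,)

n_params = embedding.size + dense_w.size + dense_b.size
print(vectors.shape, flat.shape, n_params)  # (4, 8) (32,) 433
```

The total of 433 parameters splits into 400 for the embedding table and 33 for the output unit.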

In [7]:
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 4, 8)              400       
_________________________________________________________________
flatten_1 (Flatten)          (None, 32)                0         
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 33        
=================================================================
Total params: 433
Trainable params: 433
Non-trainable params: 0
_________________________________________________________________
None


In [8]:
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
print('Loss: %f' % (loss))

Accuracy: 69.999999
Loss: 0.638469
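To put the loss value in context, the sketch below computes binary cross-entropy by hand. A model that always outputs 0.5 scores ln(2) ≈ 0.693, so a loss of 0.638 is only slightly better than an uninformed guess. The function is a plain-Python illustration with a hypothetical clipping constant, not the Keras implementation.

```python
import math

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy; predictions are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# A model that always outputs 0.5 scores ln(2), whatever the labels are.
uninformed = binary_crossentropy([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
print(uninformed)
```

Note also that accuracy and loss here are measured on the same 10 documents the model was trained on, so they say nothing about performance on unseen text.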


In [9]:
def transform_sentences(sentences, vocab_size, max_length):
    """Hash-encode and post-pad sentences the same way as the training documents."""
    v = [one_hot(d, vocab_size) for d in sentences]
    v = pad_sequences(v, maxlen=max_length, padding='post')
    return v

sentences = ["Well done"]
k = transform_sentences(sentences=sentences, vocab_size=vocab_size, max_length=max_length)

pred = model.predict(k)
print("Word: " + str(sentences) + " Pred: " + str(pred[0]))

Word: ['Well done'] Pred: [0.4927784]
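The sigmoid output can be read as the estimated probability of the positive class. A minimal sketch of turning it into a label follows; the helper `to_label` and the cut-off of 0.5 are assumptions for illustration.

```python
def to_label(probability, threshold=0.5):
    """Return 1 (positive) if the sigmoid output reaches the threshold, else 0."""
    return 1 if probability >= threshold else 0

print(to_label(0.4927784))  # 0
```

With this cut-off, "Well done" falls just on the negative side even though its training label is 1, which is consistent with the model classifying only 70% of the training documents correctly.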
