In [29]:
import numpy as np
import pandas as pd  # For data handling

from datetime import datetime

from keras.models import Model
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Input
from keras.layers import Embedding
from keras.preprocessing.text import Tokenizer

import warnings
warnings.filterwarnings('ignore')

In [2]:
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 500)
pd.set_option("display.max_rows", 999)
pd.set_option('display.max_colwidth', 500)

In [3]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))

Word embeddings provide a dense representation of words and their relative meanings.

Here we will consider:

* Word embeddings via the Keras Embedding layer.
* How to learn a word embedding while fitting a neural network.

<hr>

* Word Embedding.
* Keras Embedding Layer.
* Example of Learning an Embedding.

A word embedding is a class of approaches for representing words and documents using a dense vector representation.

It is an improvement over the more traditional bag-of-words model, which represents each document as a large, sparse vector of word counts.

The position of a word within the vector space is learned from text and is based on the words that surround the word when it is used.
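To make the contrast concrete, here is a minimal sketch comparing a bag-of-words count vector with a dense embedding lookup. The five-word vocabulary and the 2-dimensional vectors in `toy_embedding` are invented for illustration; they are not learned values.

```python
# Toy vocabulary (hypothetical, for illustration only).
vocab = ["good", "work", "poor", "effort", "great"]

def bag_of_words(doc):
    """Sparse representation: one count per vocabulary word."""
    tokens = doc.lower().split()
    return [tokens.count(w) for w in vocab]

# Dense representation: each word maps to a short real-valued vector.
# These numbers are made up; a real embedding would be learned from text.
toy_embedding = {
    "good":   [0.8, 0.1],
    "great":  [0.9, 0.2],
    "poor":   [-0.7, 0.1],
    "work":   [0.0, 0.9],
    "effort": [0.1, 0.8],
}

def embed(doc):
    """Look up one dense vector per word in the document."""
    return [toy_embedding[t] for t in doc.lower().split()]

bow = bag_of_words("Good work")
dense = embed("Good work")
print(bow)    # one slot per vocabulary word, mostly zeros
print(dense)  # one short vector per word in the document
```

The bag-of-words vector grows with the vocabulary, while the dense vectors keep a fixed, small size regardless of how many words exist.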


### Keras Embedding Layer
Keras provides an Embedding layer that can be used for neural networks on text data.

The input data must be integer encoded, so that each word is represented by a unique integer. This data preparation step can be performed with the Tokenizer API provided with Keras, or with tools from the sklearn or nltk packages.

The Embedding layer is initialized with random weights and will learn an embedding for all of the words in the training dataset.

It is a flexible layer that can be used:

* Alone, to learn a word embedding that can be saved and used in another model later.
* As part of a deep learning model, where the embedding is learned along with the model itself.
* To load a pre-trained word embedding model.

#### Keras Embedding Layer arguments:

**input_dim**: This is the size of the vocabulary in the text data. For example, if your data is integer encoded to values between 0-10, then the size of the vocabulary would be 11 words.

**output_dim**: This is the size of the vector space in which words will be embedded. It defines the size of the output vectors from this layer for each word. For example, it could be 32 or 100 or even larger. Test different values for your problem.

**input_length**: This is the length of input sequences, as you would define for any input layer of a Keras model. For example, if all of your input documents are comprised of 1000 words, this would be 1000.

For example, below we define an Embedding layer with a vocabulary of 200 (e.g. integer encoded words from 0 to 199, inclusive), a vector space of 32 dimensions in which words will be embedded, and input documents that have 50 words each.

e = Embedding(200, 32, input_length=50)
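Conceptually, the layer is a trainable lookup table with `input_dim` rows and `output_dim` columns. The NumPy sketch below mimics only the forward lookup for the configuration above. The random table and the batch of 3 documents are assumptions made for illustration, and this is not a description of how Keras implements the layer internally.

```python
import numpy as np

input_dim, output_dim, input_length = 200, 32, 50

rng = np.random.default_rng(0)
# One row per vocabulary index; Keras would initialise and then train this table.
table = rng.uniform(-0.05, 0.05, size=(input_dim, output_dim))

# A hypothetical batch of 3 integer-encoded documents, 50 words each.
batch = rng.integers(0, input_dim, size=(3, input_length))

# Integer array indexing replaces every word index with its row of the table.
embedded = table[batch]

print(table.size)      # number of weights in the table: 200 * 32
print(embedded.shape)  # (batch, input_length, output_dim)
```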

### Example of Learning an Embedding

In [123]:
# define documents
docs = ['Well done',
        'Good work',
        'Great effort',
        'nice work',
        'Excellent!',
        'Weak',
        'Poor effort',
        'not good',
        'poor work',
        'Could have done better.']
# define class labels
labels = np.array([1,1,1,1,1,0,0,0,0,0])

### Integer encode document 

Using Tokenizer

In [153]:
### Integer encode document using Tokenizer
vocab_size = 20
tokenizer = Tokenizer(vocab_size)
tokenizer.fit_on_texts(docs)

sequences = tokenizer.texts_to_sequences(docs)
word_index = tokenizer.word_index
input_dim = len(word_index) + 1

print('Found %s unique tokens.' % len(word_index))

max_length = 4
padded_docs = pad_sequences(sequences, max_length, padding='post')
print('Shape of data tensor:', padded_docs.shape)
print(padded_docs)

Found 14 unique tokens.
Shape of data tensor: (10, 4)
[[ 6  2  0  0]
 [ 3  1  0  0]
 [ 7  4  0  0]
 [ 8  1  0  0]
 [ 9  0  0  0]
 [10  0  0  0]
 [ 5  4  0  0]
 [11  3  0  0]
 [ 5  1  0  0]
 [12 13  2 14]]
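The indices above follow word frequency: the text is lowercased, punctuation is stripped, the most frequent word ("work") gets index 1, and 0 is left for padding. The pure-Python sketch below reproduces that scheme for these documents. The regular expression is a simplification of the Tokenizer's punctuation filter, and `tokenize`, `encode` and `pad_post` are hypothetical helper names, not Keras functions.

```python
import re
from collections import Counter

docs = ['Well done', 'Good work', 'Great effort', 'nice work', 'Excellent!',
        'Weak', 'Poor effort', 'not good', 'poor work', 'Could have done better.']

def tokenize(text):
    # Lowercase and keep alphabetic runs only (simplified punctuation filter).
    return re.findall(r"[a-z]+", text.lower())

# Count every word across all documents.
counts = Counter()
for doc in docs:
    counts.update(tokenize(doc))

# Most frequent word gets index 1; ties keep first-seen order. Index 0 is left for padding.
word_index = {w: i for i, (w, _) in enumerate(counts.most_common(), start=1)}

def encode(text):
    return [word_index[w] for w in tokenize(text)]

def pad_post(seq, max_length):
    # Keep the last max_length items, then append zeros on the right.
    seq = seq[-max_length:]
    return seq + [0] * (max_length - len(seq))

padded = [pad_post(encode(d), 4) for d in docs]
for row in padded:
    print(row)
```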


### Integer encode document 

Using Keras one_hot()

In [155]:
### Integer encode document using Keras one_hot()
# Keras provides the one_hot() function, which hashes each word to an integer as an efficient encoding.
# Despite its name, it returns integer indices, not one-hot vectors.
# We set the vocabulary size to 20, larger than the 14 unique words, to reduce the probability of collisions from the hash function.
vocab_size = 20
encoded_docs = [one_hot(d, vocab_size) for d in docs]

# The sequences have different lengths, and Keras expects inputs to be vectorized, with all inputs the same length.
# We pad all input sequences to length 4, the length of the longest document.
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print('Shape of data tensor:', padded_docs.shape)
print(padded_docs)

Shape of data tensor: (10, 4)
[[17  5  0  0]
 [ 3 10  0  0]
 [14 18  0  0]
 [ 8 10  0  0]
 [ 1  0  0  0]
 [ 9  0  0  0]
 [ 5 18  0  0]
 [17  3  0  0]
 [ 5 10  0  0]
 [14  1  5 11]]
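Because `one_hot()` hashes words rather than building a dictionary, different words can receive the same integer. The sketch below pairs each word with the integer printed above and lists the collisions. The `encoded` rows are copied from that output; the hashed values may differ between Python sessions, so treat them as one particular run.

```python
from collections import defaultdict

# Tokens of each document, lowercased with punctuation removed.
tokens = [['well', 'done'], ['good', 'work'], ['great', 'effort'], ['nice', 'work'],
          ['excellent'], ['weak'], ['poor', 'effort'], ['not', 'good'],
          ['poor', 'work'], ['could', 'have', 'done', 'better']]

# Integer codes copied from the one_hot() output above (padding zeros dropped).
encoded = [[17, 5], [3, 10], [14, 18], [8, 10], [1], [9],
           [5, 18], [17, 3], [5, 10], [14, 1, 5, 11]]

# Group words by the integer they were hashed to.
buckets = defaultdict(set)
for words, codes in zip(tokens, encoded):
    for word, code in zip(words, codes):
        buckets[code].add(word)

# Any bucket holding more than one word is a hash collision.
collisions = {code: sorted(words) for code, words in buckets.items() if len(words) > 1}
for code in sorted(collisions):
    print(code, collisions[code])
```

In this run the 14 words share only 10 distinct codes, so the model cannot distinguish "well" from "not" or "done" from "poor". A larger `vocab_size` makes such collisions less likely.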


We are now ready to define our Embedding layer.

The Embedding layer has a vocabulary size of 20 and an input length of 4. We choose a small embedding space of 4 dimensions.

The model is a binary classification model.

The output of the Embedding layer is 4 vectors of 4 dimensions each, one for each word.

We flatten this into a single 16-element vector to pass on to the Dense output layer.

In [199]:
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 4, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# summarize the model
print(model.summary())

Model: "sequential_19"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding_29 (Embedding)     (None, 4, 4)              80        
_________________________________________________________________
flatten_24 (Flatten)         (None, 16)                0         
_________________________________________________________________
dense_15 (Dense)             (None, 1)                 17        
Total params: 97
Trainable params: 97
Non-trainable params: 0
_________________________________________________________________
None
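The parameter counts in the summary can be checked by hand. This short sketch repeats the arithmetic with the values used in the model definition above.

```python
vocab_size, output_dim, max_length = 20, 4, 4

embedding_params = vocab_size * output_dim   # one vector per vocabulary index
flatten_units = max_length * output_dim      # Flatten has no weights of its own
dense_params = flatten_units * 1 + 1         # one weight per input plus a bias

total = embedding_params + dense_params
print(embedding_params, flatten_units, dense_params, total)
```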


In [215]:
history = model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
print('Loss: %f' % (loss))

Accuracy: 100.000000
Loss: 0.162495


In [216]:
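# The Embedding layer's only weight array is the lookup table:
# one row per vocabulary index, one column per embedding dimension.
wEmb = model.layers[0].get_weights()[0]
print("Shape of embeddings : ", wEmb.shape) 
print("\nEmbeddings:\n")
print(wEmb)

Shape of embeddings :  (20, 4)

Embeddings: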

[[-1.99764460e-01  2.29839504e-01  9.99004692e-02  2.30218112e-01]
 [-5.64353406e-01  5.74807465e-01  5.72019935e-01 -3.64473052e-02]
 [-1.15214810e-02 -4.02787551e-02 -2.76461728e-02 -4.48251888e-03]
 [-5.80309570e-01  5.91365814e-01  6.23466253e-01 -4.25502479e-01]
 [-4.70023640e-02 -7.97265768e-03  4.11068909e-02 -3.20850238e-02]
 [ 4.74460661e-01 -5.13310313e-01 -7.26960421e-01 -6.04268193e-01]
 [-2.04268694e-02  8.55582952e-03  4.96452712e-02  5.87750226e-04]
 [-1.81722641e-03 -4.38152552e-02  3.40045616e-03 -3.73443738e-02]
 [-4.37325537e-01  4.21925187e-01  2.19033450e-01  4.69510376e-01]
 [ 5.12585521e-01 -4.97423023e-01 -3.36204201e-01 -5.36920726e-01]
 [ 2.94415832e-01  1.05284825e-01 -1.90992221e-01  2.26864070e-01]
 [-5.49828410e-01 -4.33199018e-01  4.41136420e-01 -4.30477470e-01]
 [-2.78783962e-03  1.16887689e-02 -1.26105435e-02 -2.61036046e-02]
 [-1.07128546e-03  6.29330799e-03  1.53057836e-02  6.87025860e-03]
 [-6.82420611e-01 

In [None]:
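def transforSentence(sentences, vocab_size, max_length):
    # Encode new sentences exactly as the training data was encoded:
    # hash each word with one_hot(), then pad to max_length.
    v = [one_hot(d, vocab_size) for d in sentences]
    v = pad_sequences(v, maxlen=max_length, padding='post')
    return v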

sentences = ["Well done"]
k = transforSentence(sentences=sentences, vocab_size=vocab_size, max_length=max_length)

pred = model.predict(k)
print("Sentence: " + str(sentences) + "  Prediction: " + str(pred[0]))

In [172]:
sentences = ["Poor work"]
k = transforSentence(sentences=sentences, vocab_size=vocab_size, max_length=max_length)

pred = model.predict(k)
print("Sentence: " + str(sentences) + "  Prediction: " + str(pred[0]))

Sentence: ['Poor work']  Prediction: [0.4878812]
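For intuition, the prediction is an embedding lookup, a flatten, and a single sigmoid unit. A NumPy sketch of that forward pass follows. The weights are random placeholders (in the notebook they would come from `model.get_weights()`), and the encoded sentence is an arbitrary example, so the resulting probability is not meaningful.

```python
import numpy as np

vocab_size, output_dim, max_length = 20, 4, 4
rng = np.random.default_rng(1)

# Placeholder weights with the same shapes as the trained model's.
E = rng.normal(size=(vocab_size, output_dim))        # embedding table
W = rng.normal(size=(max_length * output_dim, 1))    # dense kernel
b = np.zeros(1)                                      # dense bias

def predict(x):
    """x: integer array of shape (batch, max_length)."""
    h = E[x].reshape(len(x), -1)       # lookup, then flatten to (batch, 16)
    z = h @ W + b                      # dense layer
    return 1.0 / (1.0 + np.exp(-z))    # sigmoid

x = np.array([[5, 10, 0, 0]])          # an arbitrary padded, integer-encoded sentence
p = predict(x)
print(p.shape)
```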


<hr>

## What do word embeddings look like?

Extracting word vectors from the Keras Embedding layer

In [220]:
### What do word embeddings look like?

output_dim = 2
model2 = Sequential()
model2.add(Embedding(vocab_size, output_dim, input_length=max_length))
model2.add(Flatten())
# compile the model
model2.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

embeddings = model2.predict(padded_docs) # model2 is never fitted, so these are the randomly initialised embeddings.
print("Shape of embeddings : ", embeddings.shape)

Shape of embeddings :  (10, 8)


In [221]:
embeddings2 = embeddings.reshape(-1, max_length, output_dim)
print("Shape of embeddings : ", embeddings2.shape) 
print("\nEmbeddings:\n")
print(embeddings)

Shape of embeddings :  (10, 4, 2)

Embeddings:

[[-0.00379983  0.0309325  -0.03431519 -0.01433668  0.03334451 -0.01893507
   0.03334451 -0.01893507]
 [ 0.04388897  0.00719521 -0.03590513  0.03512326  0.03334451 -0.01893507
   0.03334451 -0.01893507]
 [ 0.02390401  0.00492169 -0.01700457  0.00464555  0.03334451 -0.01893507
   0.03334451 -0.01893507]
 [ 0.03399228 -0.0267233  -0.03590513  0.03512326  0.03334451 -0.01893507
   0.03334451 -0.01893507]
 [-0.01990628  0.03737198  0.03334451 -0.01893507  0.03334451 -0.01893507
   0.03334451 -0.01893507]
 [-0.01431173  0.0475159   0.03334451 -0.01893507  0.03334451 -0.01893507
   0.03334451 -0.01893507]
 [-0.03431519 -0.01433668 -0.01700457  0.00464555  0.03334451 -0.01893507
   0.03334451 -0.01893507]
 [-0.00379983  0.0309325   0.04388897  0.00719521  0.03334451 -0.01893507
   0.03334451 -0.01893507]
 [-0.03431519 -0.01433668 -0.03590513  0.03512326  0.03334451 -0.01893507
   0.03334451 -0.01893507]
 [ 0.02390401  0.00492169 -0.01990628  0.037

10 ---> number of documents

4  ---> each document is padded to 4 words, the maximum length of any document.

2  ---> each word is represented by a 2-dimensional vector.
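The relationship between the flat `(10, 8)` output and the `(10, 4, 2)` view can be sketched in NumPy. The `weights` table below is a random stand-in for the layer's weights, and `padded_docs` is copied from the `one_hot()` output earlier.

```python
import numpy as np

max_length, output_dim, vocab_size = 4, 2, 20

# Integer-encoded documents, copied from the one_hot() output earlier.
padded_docs = np.array([[17, 5, 0, 0], [3, 10, 0, 0], [14, 18, 0, 0], [8, 10, 0, 0],
                        [1, 0, 0, 0], [9, 0, 0, 0], [5, 18, 0, 0], [17, 3, 0, 0],
                        [5, 10, 0, 0], [14, 1, 5, 11]])

rng = np.random.default_rng(2)
weights = rng.uniform(-0.05, 0.05, size=(vocab_size, output_dim))  # stand-in table

# Embedding + Flatten: look up each index, then lay the vectors side by side.
flat = weights[padded_docs].reshape(len(padded_docs), -1)
print(flat.shape)       # (10, 8)

# Undo the flatten to recover one vector per word position.
per_word = flat.reshape(-1, max_length, output_dim)
print(per_word.shape)   # (10, 4, 2)

# Every padded position (index 0) carries the same vector, weights[0].
print(np.array_equal(per_word[0, 2], weights[0]))  # True
```

This is why the pair `0.03334451 -0.01893507` repeats in the printed rows: it is the vector stored for the padding index 0.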

In [222]:
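# word_index comes from the Tokenizer, so this lookup is only valid if padded_docs
# was built with the Tokenizer encoding rather than with one_hot() hashes.
embeddings = model2.layers[0].get_weights()[0]
words_embeddings = {w:embeddings[idx] for w, idx in word_index.items()}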

In [223]:
words_embeddings

{'work': array([-0.01990628,  0.03737198], dtype=float32),
 'done': array([-0.00952191, -0.00654057], dtype=float32),
 'good': array([0.04388897, 0.00719521], dtype=float32),
 'effort': array([0.02222503, 0.02809996], dtype=float32),
 'poor': array([-0.03431519, -0.01433668], dtype=float32),
 'well': array([ 0.04503355, -0.04771587], dtype=float32),
 'great': array([ 3.0855570e-02, -2.4296343e-05], dtype=float32),
 'nice': array([ 0.03399228, -0.0267233 ], dtype=float32),
 'excellent': array([-0.01431173,  0.0475159 ], dtype=float32),
 'weak': array([-0.03590513,  0.03512326], dtype=float32),
 'not': array([-0.01014149, -0.02332916], dtype=float32),
 'could': array([-0.03128161, -0.00627512], dtype=float32),
 'have': array([ 0.01552765, -0.03073349], dtype=float32),
 'better': array([0.02390401, 0.00492169], dtype=float32)}
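Once word vectors are available, a common way to compare them is cosine similarity. The sketch below applies it to three vectors copied from the dictionary above. Since `model2` was never fitted, these are randomly initialised weights, so any apparent closeness between words here is coincidental; the point is only to show the computation.

```python
import math

# Vectors copied from the dictionary printed above (untrained weights).
good = [0.04388897, 0.00719521]
great = [3.0855570e-02, -2.4296343e-05]
poor = [-0.03431519, -0.01433668]

def cosine(u, v):
    """Cosine of the angle between u and v: 1 = same direction, -1 = opposite."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(round(cosine(good, great), 3))
print(round(cosine(good, poor), 3))
```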