# A Detailed Explanation of Keras Embedding Layer

[https://www.kaggle.com/rajmehra03/a-detailed-explanation-of-keras-embedding-layer](https://www.kaggle.com/rajmehra03/a-detailed-explanation-of-keras-embedding-layer)

<br>

## A Detailed Guide to Understanding Word Embeddings and the Embedding Layer in Keras

<br>

In this kernel I explain the Keras embedding layer. To do so, I have created a sample corpus of just 3 documents, which should be sufficient to show how the layer works.

Embeddings are useful in a variety of machine learning applications. For that reason I have attached to the kernel several data sources where I feel that embeddings and the Keras embedding layer may prove useful.

Before diving in, let us skim through some applications of embeddings:

**1) The first application that strikes me is collaborative filtering (CF) based recommender systems, where we create user embeddings and movie embeddings by decomposing the utility matrix that contains the user-item ratings.**

To see a complete tutorial on CF based recommender systems using embeddings in Keras you can follow **[this](https://www.kaggle.com/rajmehra03/cf-based-recsys-by-low-rank-matrix-factorization)** kernel of mine.

**2) The second use is in natural language processing (NLP) and related applications, where we create word embeddings for all the words present in the documents of our corpus.**

This is the terminology that I shall use in this kernel.

**Thus the embedding layer in Keras can be used whenever we want to embed higher-dimensional data into a lower-dimensional vector space.**
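To make the idea concrete, here is a minimal NumPy sketch (not Keras code) of what an embedding does. A word can be seen as a sparse one-hot vector with as many entries as the vocabulary, and the embedding maps it to a short dense vector by selecting one row of a weight table. The sizes, the random table, and the word code 30 are assumptions chosen only for illustration.

```python
import numpy as np

vocab_size = 50      # number of distinct integer codes (assumed for illustration)
embed_dim = 8        # size of each dense vector

rng = np.random.default_rng(0)
# the embedding is a (vocab_size, embed_dim) table of weights
embedding_matrix = rng.uniform(-0.05, 0.05, size=(vocab_size, embed_dim))

word_index = 30      # integer code of some word

# high-dimensional view: a one-hot vector with vocab_size entries
one_hot_vec = np.zeros(vocab_size)
one_hot_vec[word_index] = 1.0

# multiplying the one-hot vector by the table selects a single row ...
via_matmul = one_hot_vec @ embedding_matrix
# ... which is the same as indexing the table directly
via_lookup = embedding_matrix[word_index]

print(via_matmul.shape)                      # (8,)
print(np.allclose(via_matmul, via_lookup))   # True
```

In a trained model the table holds learned values instead of random ones, but the lookup itself works the same way.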

<br>

## IMPORTING MODULES

In [1]:
# Ignore the warnings
import warnings
warnings.filterwarnings('ignore')

# data visualization and manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns

# configure
# sets matplotlib to inline and displays graphs below the corresponding cell.
%matplotlib inline

style.use('fivethirtyeight')
sns.set(style='whitegrid', color_codes=True)

# nltk
import nltk

# stop-words
from nltk.corpus import stopwords
stop_words = set(nltk.corpus.stopwords.words('english'))

# tokenizing
from nltk import word_tokenize, sent_tokenize

# keras
import keras
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding, Input
from keras.models import Model

<br>

## CREATING A SAMPLE CORPUS OF DOCUMENTS (i.e. TEXTS)

In [2]:
sample_text_1="bitty bought a bit of butter"
sample_text_2="but the bit of butter was a bit bitter"
sample_text_3="so she bought some better butter to make the bitter butter better"

corp = [sample_text_1,sample_text_2,sample_text_3]
no_docs = len(corp)

<br>

## INTEGER ENCODING ALL THE DOCUMENTS

After this step every word is represented by an integer. For this we use the **`one_hot`** function from Keras. Note that **`one_hot`** assigns integers by hashing, so a large **`vocab_size`** makes collisions less likely but does not guarantee a **unique integer encoding** for every word. In the output below, for example, 'but' and 'butter' are both encoded as 30.

**Note one important thing: within a run, the integer encoding for a word stays the same across documents, e.g. 'butter' is denoted by 30 in each and every document.**
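Because the codes come from hashing, distinct words can end up sharing an integer. The sketch below illustrates this with a hypothetical helper, `hashed_code`, which uses an MD5-based hash so the result is reproducible. It is an illustration of the hashing idea under that assumption, not the Keras implementation.

```python
import hashlib

def hashed_code(word, n):
    """Map a word to an integer in [1, n-1] by hashing (illustrative helper)."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % (n - 1) + 1

doc = "so she bought some better butter to make the bitter butter better"
unique_words = sorted(set(doc.split()))

# the same word always receives the same code
assert hashed_code("butter", 50) == hashed_code("butter", 50)

# with only 4 available codes (1..4) and more than 4 distinct words,
# at least two words must share a code
small_n = 5
small_codes = {w: hashed_code(w, small_n) for w in unique_words}
print(len(unique_words), "distinct words ->",
      len(set(small_codes.values())), "distinct codes")

# a larger n makes collisions less likely, but does not rule them out
large_n = 50
large_codes = {w: hashed_code(w, large_n) for w in unique_words}
print(len(unique_words), "distinct words ->",
      len(set(large_codes.values())), "distinct codes")
```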

In [5]:
vocab_size = 50
encod_corp = []

for i, doc in enumerate(corp):
    # hash every word of the document to an integer in [1, vocab_size - 1]
    encoded_doc = one_hot(doc, vocab_size)
    encod_corp.append(encoded_doc)
    print("The encoding for document", i+1, " is : ", encoded_doc)

The encoding for document 1  is :  [20, 1, 45, 25, 2, 30]
The encoding for document 2  is :  [30, 12, 25, 2, 30, 28, 45, 25, 48]
The encoding for document 3  is :  [26, 28, 1, 24, 13, 30, 25, 35, 12, 48, 30, 13]


In [12]:
encod_corp

[[20, 1, 45, 25, 2, 30],
 [30, 12, 25, 2, 30, 28, 45, 25, 48],
 [26, 28, 1, 24, 13, 30, 25, 35, 12, 48, 30, 13]]
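In the encodings above, 'but' and 'butter' both map to 30, which shows that hashed codes can collide. When a strictly unique code per word is needed, one option is to build the vocabulary explicitly. The following is a minimal pure-Python sketch of that alternative; `word_to_index` is a name introduced here for illustration (the Keras `Tokenizer` imported earlier offers a similar word-to-index mapping).

```python
corp = [
    "bitty bought a bit of butter",
    "but the bit of butter was a bit bitter",
    "so she bought some better butter to make the bitter butter better",
]

word_to_index = {}
for doc in corp:
    for word in doc.split():
        if word not in word_to_index:
            # index 0 is left free so it can be used for padding later
            word_to_index[word] = len(word_to_index) + 1

encoded = [[word_to_index[w] for w in doc.split()] for doc in corp]

print(len(word_to_index))        # 16 distinct words
print(encoded[0])                # [1, 2, 3, 4, 5, 6]
print(word_to_index["butter"])   # 6, the same in every document
```

With this scheme the embedding layer would need `input_dim = len(word_to_index) + 1`, since index 0 is reserved.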

<br>

## PADDING THE DOCS (to make every doc the same length)

**The Keras Embedding layer requires all documents in the input to be of the same length.** Hence we pad the shorter documents with 0 for now. The **`input_length`** of the Keras Embedding layer will therefore equal the length (i.e. the number of words) of the longest document.

To pad the shorter documents I am using the **`pad_sequences`** function from the Keras library.
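As a rough picture of what post-padding does, here is a small pure-Python sketch. `pad_post` is a hypothetical helper written for illustration, not the Keras function, and its handling of over-long sequences is a simplification.

```python
def pad_post(sequences, maxlen, value=0):
    """Right-pad each sequence with `value` up to maxlen (illustrative helper)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[:maxlen]   # simplification: over-long sequences are cut at the end
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded

# the integer encodings printed in the previous section
encod_corp = [
    [20, 1, 45, 25, 2, 30],
    [30, 12, 25, 2, 30, 28, 45, 25, 48],
    [26, 28, 1, 24, 13, 30, 25, 35, 12, 48, 30, 13],
]

maxlen = max(len(seq) for seq in encod_corp)   # 12
padded = pad_post(encod_corp, maxlen)

for row in padded:
    print(row)
```

Every row now has 12 entries, with zeros filling the end of the two shorter documents.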

In [7]:
# length of the longest document. needed whenever we create embeddings
# for the words.
maxlen = -1

for doc in corp :
    tokens = nltk.word_tokenize(doc)
    
    if (maxlen < len(tokens)) :
        maxlen = len(tokens)
        
print("The maximum number of words in any document is : ", maxlen)

The maximum number of words in any document is :  12


In [10]:
# now to create embeddings all of our docs need to be of same length.
# hence we can pad the docs with zeros.
pad_corp = pad_sequences(encod_corp,
                         maxlen=maxlen,
                         padding='post',
                         value=0.0)

print("No of padded documents ", len(pad_corp))

No of padded documents  3


In [11]:
pad_corp

array([[20,  1, 45, 25,  2, 30,  0,  0,  0,  0,  0,  0],
       [30, 12, 25,  2, 30, 28, 45, 25, 48,  0,  0,  0],
       [26, 28,  1, 24, 13, 30, 25, 35, 12, 48, 30, 13]])

<br>

## ACTUALLY CREATING THE EMBEDDINGS using KERAS EMBEDDING LAYER

Now all the documents are of the same length (after padding), so we are ready to create and use the embeddings.

**I will embed the words into vectors of 8 dimensions.**

In [14]:
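# specifying the input shape.
# note: this tensor is not used by the model built in the next cell, and the
# name avoids shadowing Python's built-in `input`.
corpus_input = Input(shape=(no_docs, maxlen), dtype='float64')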

In [15]:
'''
shape of input.
each document has 12 elements (words), which is the value of our maxlen variable.
'''

word_input = Input(shape=(maxlen,), dtype='float64')

# creating the embedding
word_embedding = Embedding(input_dim=vocab_size, 
                           output_dim=8, 
                           input_length=maxlen)(word_input)

word_vec = Flatten()(word_embedding) # flatten
embed_model = Model([word_input], word_vec) # combining all into a Keras model



<br>

### PARAMETERS OF THE EMBEDDING LAYER

`input_dim`
- **the vocab size that we will choose**.
- In other words, it is the number of distinct integer codes the layer can look up, i.e. the largest integer index + 1.

`output_dim`
- **the number of dimensions we wish to embed into**.
- Each word will be represented by a vector with this many dimensions.

`input_length`
- **the length of the longest document**.
- This is stored in the `maxlen` variable in our case.
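The three parameters above fix the size of the layer's weight table and the shape of its output. The NumPy sketch below mimics that bookkeeping with a zero-filled stand-in for the weights. It is only an illustration of the shapes under the values used in this kernel, not Keras code.

```python
import numpy as np

input_dim, output_dim, input_length = 50, 8, 12
no_docs = 3

# one row of output_dim weights for each of the input_dim integer codes
weights = np.zeros((input_dim, output_dim))
print(weights.size)                           # 400 parameters

# a batch of padded documents, each holding input_length integer codes
batch = np.zeros((no_docs, input_length), dtype=int)

# looking up every code yields one vector per word position
embedded = weights[batch]
print(embedded.shape)                         # (3, 12, 8)

# flattening joins the 12 word vectors of each document into one row
print(embedded.reshape(no_docs, -1).shape)    # (3, 96)
```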

In [16]:
embed_model.compile(optimizer=keras.optimizers.Adam(lr=1e-3),
                    loss='binary_crossentropy',
                    metrics=['acc'])
# compiling the model. parameters can be tuned as always.

In [17]:
print(type(word_embedding))
print(word_embedding)

<class 'tensorflow.python.framework.ops.Tensor'>
Tensor("embedding_1/embedding_lookup/Identity:0", shape=(?, 12, 8), dtype=float32)


In [18]:
print(embed_model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         (None, 12)                0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 12, 8)             400       
_________________________________________________________________
flatten_1 (Flatten)          (None, 96)                0         
Total params: 400
Trainable params: 400
Non-trainable params: 0
_________________________________________________________________
None


In [19]:
embeddings = embed_model.predict(pad_corp) # finally getting the embeddings.

In [20]:
print("Shape of embeddings : ", embeddings.shape)
print(embeddings)

Shape of embeddings :  (3, 96)
[[-0.0240765   0.02443459 -0.03537592 -0.0094936  -0.01362468 -0.0450674
  -0.00685731  0.02262599  0.02322966 -0.00913255  0.04162628 -0.02427381
   0.02361368 -0.04316616  0.03342781 -0.01610591  0.03424484  0.00468548
  -0.02316825 -0.02190601 -0.02104993  0.03283853  0.00412433 -0.02419662
   0.04895766 -0.01155044 -0.03116281 -0.01587792  0.0036968  -0.00192527
  -0.0471268   0.03254895 -0.02099657 -0.00372682  0.03763645  0.02339592
  -0.04064574 -0.04777142  0.04741727  0.04608115 -0.02878731 -0.0265355
   0.01230639  0.02184333  0.01229494 -0.03252731  0.02623424 -0.03275543
   0.03774544  0.03224077 -0.00760424 -0.02256087 -0.02685989  0.00226678
  -0.04385168  0.00119857  0.03774544  0.03224077 -0.00760424 -0.02256087
  -0.02685989  0.00226678 -0.04385168  0.00119857  0.03774544  0.03224077
  -0.00760424 -0.02256087 -0.02685989  0.00226678 -0.04385168  0.00119857
   0.03774544  0.03224077 -0.00760424 -0.02256087 -0.02685989  0.00226678
  -0.0438

In [21]:
embeddings = embeddings.reshape(-1, maxlen, 8)
print("Shape of embeddings : ", embeddings.shape)
print(embeddings)

Shape of embeddings :  (3, 12, 8)
[[[-0.0240765   0.02443459 -0.03537592 -0.0094936  -0.01362468
   -0.0450674  -0.00685731  0.02262599]
  [ 0.02322966 -0.00913255  0.04162628 -0.02427381  0.02361368
   -0.04316616  0.03342781 -0.01610591]
  [ 0.03424484  0.00468548 -0.02316825 -0.02190601 -0.02104993
    0.03283853  0.00412433 -0.02419662]
  [ 0.04895766 -0.01155044 -0.03116281 -0.01587792  0.0036968
   -0.00192527 -0.0471268   0.03254895]
  [-0.02099657 -0.00372682  0.03763645  0.02339592 -0.04064574
   -0.04777142  0.04741727  0.04608115]
  [-0.02878731 -0.0265355   0.01230639  0.02184333  0.01229494
   -0.03252731  0.02623424 -0.03275543]
  [ 0.03774544  0.03224077 -0.00760424 -0.02256087 -0.02685989
    0.00226678 -0.04385168  0.00119857]
  [ 0.03774544  0.03224077 -0.00760424 -0.02256087 -0.02685989
    0.00226678 -0.04385168  0.00119857]
  [ 0.03774544  0.03224077 -0.00760424 -0.02256087 -0.02685989
    0.00226678 -0.04385168  0.00119857]
  [ 0.03774544  0.03224077 -0.00760424 -

<br>

The resulting shape is (3, 12, 8).

**3 $\rightarrow$ number of documents**

**12 $\rightarrow$ each document is made of 12 words, which was the maximum length of any document.**

**8 $\rightarrow$ each word is 8-dimensional.**
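In the printed array above, the rows for the padded positions of the first document repeat the same vector. That is expected: every position holding code 0 looks up the same row of the weight table, just as every occurrence of a word looks up that word's row. The NumPy sketch below reproduces this with a random table standing in for the layer's weights (an assumption for illustration).

```python
import numpy as np

pad_corp = np.array([
    [20,  1, 45, 25,  2, 30,  0,  0,  0,  0,  0,  0],
    [30, 12, 25,  2, 30, 28, 45, 25, 48,  0,  0,  0],
    [26, 28,  1, 24, 13, 30, 25, 35, 12, 48, 30, 13],
])

rng = np.random.default_rng(0)
table = rng.uniform(-0.05, 0.05, size=(50, 8))   # stand-in for the layer weights

embeddings = table[pad_corp]                     # shape (3, 12, 8)

# code 30 sits at position 5 of document 1 and position 0 of document 2
print(np.array_equal(embeddings[0, 5], embeddings[1, 0]))    # True

# every padded position (code 0) maps to the same row of the table
print(np.array_equal(embeddings[0, 6], embeddings[0, 11]))   # True

# flattening and reshaping only rearranges the same numbers
flat = embeddings.reshape(3, -1)                 # shape (3, 96)
print(np.array_equal(flat.reshape(-1, 12, 8), embeddings))   # True
```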

<br>

## GETTING ENCODING FOR A PARTICULAR WORD IN A SPECIFIC DOCUMENT

In [26]:
for i, doc in enumerate(embeddings) :
    for j, word in enumerate(doc) :
        print("The encoding for ", j+1, "th word", "in", i+1, "th document is : \n\n", word, '\n')

The encoding for  1 th word in 1 th document is : 

 [-0.0240765   0.02443459 -0.03537592 -0.0094936  -0.01362468 -0.0450674
 -0.00685731  0.02262599] 

The encoding for  2 th word in 1 th document is : 

 [ 0.02322966 -0.00913255  0.04162628 -0.02427381  0.02361368 -0.04316616
  0.03342781 -0.01610591] 

The encoding for  3 th word in 1 th document is : 

 [ 0.03424484  0.00468548 -0.02316825 -0.02190601 -0.02104993  0.03283853
  0.00412433 -0.02419662] 

The encoding for  4 th word in 1 th document is : 

 [ 0.04895766 -0.01155044 -0.03116281 -0.01587792  0.0036968  -0.00192527
 -0.0471268   0.03254895] 

The encoding for  5 th word in 1 th document is : 

 [-0.02099657 -0.00372682  0.03763645  0.02339592 -0.04064574 -0.04777142
  0.04741727  0.04608115] 

The encoding for  6 th word in 1 th document is : 

 [-0.02878731 -0.0265355   0.01230639  0.02184333  0.01229494 -0.03252731
  0.02623424 -0.03275543] 

The encoding for  7 th word in 1 th document is : 

 [ 0.03774544  0.03224077

<br>

Now this makes it easier to see that we have 3 (the size of `corp`) documents, each consisting of 12 (`maxlen`) words, with each word mapped to an 8-dimensional vector.

<br>

## HOW TO WORK WITH A REAL PIECE OF TEXT

Just as above, we can now use any other document. We can `sent_tokenize` the document into sentences.

Each sentence is a list of words, which we integer encode using the `one_hot` function as shown above.

Each sentence will have a different number of words, so we need to pad the sequences to the length of the longest sentence.

**At this point we are ready to feed the input to Keras Embedding layer as shown above.**

**`input_dim`** = **the vocab size that we will choose**

**`output_dim`** = **the number of dimensions we wish to embed into**

**`input_length`** = **the length of the longest document**
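The steps for a real piece of text can be sketched end to end in plain Python. This sketch makes several assumptions for the sake of being self-contained: the sample text is made up, a naive regular expression stands in for `sent_tokenize`, and an explicit vocabulary is used in place of `one_hot`. The final values are the ones that would be passed to the embedding layer.

```python
import re

text = ("Bitty bought a bit of butter. But the bit of butter was a bit bitter! "
        "So she bought some better butter.")

# naive sentence splitter on '.', '!' or '?' (assumption; sent_tokenize is more robust)
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

# lowercase word tokens for every sentence
tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]

# explicit vocabulary with index 0 reserved for padding
word_to_index = {}
for tokens in tokenized:
    for tok in tokens:
        word_to_index.setdefault(tok, len(word_to_index) + 1)

# integer encode, then pad every sentence to the length of the longest one
encoded = [[word_to_index[t] for t in tokens] for tokens in tokenized]
maxlen = max(len(seq) for seq in encoded)
padded = [seq + [0] * (maxlen - len(seq)) for seq in encoded]

input_dim = len(word_to_index) + 1    # vocabulary size including the padding index
input_length = maxlen                 # length of the longest sentence

print(len(sentences), input_length, input_dim)   # 3 9 15
```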

<br>

**If you want to see the Keras embedding layer applied to a real task, e.g. text classification, please check out [this](https://github.com/mrc03/IMDB-Movie-Review-Sentiment-Analysis) GitHub repo of mine, in which I use embeddings to perform sentiment analysis on the IMDb movie review dataset.**