# Word Embedding Techniques using Embedding Layer in Keras

Libraries used: TensorFlow (version 2.0 or later) and Keras.

In [None]:
!pip install tensorflow  # Optional: skip this cell if TensorFlow is already installed.

## Importing the TensorFlow Library

In [2]:
import tensorflow as tf

print(tf.__version__) # checking the version of tensorflow

2.18.0


## Importing `one_hot` (OHE) from tf.keras

In [3]:
# TensorFlow >= 2.0
from tensorflow.keras.preprocessing.text import one_hot

## Declaring Sentences

In [None]:
### sentences
sent = ['the glass of milk',
        'the glass of juice',
        'the cup of tea',
        'I am a good boy',
        'I am a good developer',
        'understand the meaning of words',
        'your videos are good']

In [None]:
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

## Explicitly specifying Vocabulary

In [None]:
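### Vocabulary size: the larger it is, the more processing is required

# It is a hyperparameter. one_hot hashes every word to an index between 1 and voc_size - 1,
# so a larger vocabulary size lowers the chance that two different words share an index.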

voc_size = 500
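
To get a feel for why the vocabulary size matters, the sketch below estimates how likely it is that a set of distinct words all receive different indexes. It assumes the words are spread uniformly over the indexes 1 to `vocab_size - 1`, which is an idealisation. The helper name `prob_no_collision` is hypothetical and not part of Keras.

```python
def prob_no_collision(num_words, vocab_size):
    """Chance that `num_words` distinct words all get different indexes.

    Assumes each word lands uniformly at random on one of the
    `vocab_size - 1` usable indexes (index 0 is left for padding).
    """
    buckets = vocab_size - 1
    prob = 1.0
    for already_placed in range(num_words):
        # The next word must avoid every index that is already taken.
        prob *= max(buckets - already_placed, 0) / buckets
    return prob


# Compare a few vocabulary sizes for a small corpus of 20 distinct words.
for size in (50, 500, 10000):
    print(size, round(prob_no_collision(20, size), 3))
```

Under this assumption the probability of a clash shrinks as the vocabulary size grows, at the cost of a larger embedding table.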

## One Hot Representation

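Converting the given sentences into One Hot Encoding. Despite its name, `one_hot` does not return vectors of 0s and 1s; it returns one integer index per word.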

In [None]:
onehot_repr = [one_hot(sentence, voc_size) for sentence in sent]

print(onehot_repr)

[[277, 303, 96, 188], [277, 303, 96, 300], [277, 377, 96, 136], [279, 465, 443, 287, 151], [279, 465, 443, 287, 457], [436, 277, 200, 96, 66], [31, 194, 161, 287]]


### Understanding above output

We specified the vocabulary size explicitly as 500, so every word is mapped to an integer index below 500. Iterating over the list of sentences and passing each one to `one_hot(sentence, voc_size)` returns, for every word in that sentence, the index assigned to it in the vocabulary.

For example: [277, 303, 96, 188]

Here 277 is the index for the word 'the'. It appears at the start of the first three lists because the first three sentences all begin with 'the'.
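
The mapping from words to indexes can be sketched in plain Python. This is only an illustration of the hashing idea, not the Keras implementation: the helper names `hashed_index` and `encode_sentence` are hypothetical, and MD5 is used here so that the result is repeatable, which means its numbers will not match the output printed above.

```python
import hashlib


def hashed_index(word, vocab_size):
    """Map a word to an integer in [1, vocab_size - 1]; 0 stays free for padding."""
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % (vocab_size - 1) + 1


def encode_sentence(sentence, vocab_size):
    """Encode a sentence as one index per whitespace-separated word."""
    return [hashed_index(word, vocab_size) for word in sentence.split()]


demo = encode_sentence("the glass of milk", 500)
demo_repeat = encode_sentence("The glass of juice", 500)

print(demo)
print(demo_repeat)  # shares its first three indexes with `demo`
```

The same word always receives the same index, which is why sentences that start with the same words start with the same numbers.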

## Word Embedding Representation

## Importing embedding layer, padding and model

In [None]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential

In [None]:
import numpy as np

## Padding

All the sentences must have the same length before they can be passed to the ANN.

Hence, we specify a sentence length at least as long as the longest sentence (here we use 8) and then apply **padding**, which adds zeroes at the front or at the end depending on whether "pre" or "post" padding is specified.

**This step is common to most NLP applications.**

## Specifying the sentence length

Based on the length specified here, all the sentences get converted to the same size. If a sentence has fewer words than the required length, zeroes (0) are added automatically.

In [None]:
sent_length = 8

In [None]:
## pre padding
embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)

print(embedded_docs)

[[  0   0   0   0 277 303  96 188]
 [  0   0   0   0 277 303  96 300]
 [  0   0   0   0 277 377  96 136]
 [  0   0   0 279 465 443 287 151]
 [  0   0   0 279 465 443 287 457]
 [  0   0   0 436 277 200  96  66]
 [  0   0   0   0  31 194 161 287]]
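
The effect of "pre" and "post" padding can be sketched without Keras. The function `pad_sequence` below is a hypothetical helper for a single sentence; cutting an over-long sentence from the front is an assumption made for this sketch.

```python
def pad_sequence(seq, maxlen, padding="pre", value=0):
    """Pad (or cut) one list of indexes to exactly `maxlen` entries."""
    if len(seq) >= maxlen:
        # Too long: keep only the last `maxlen` entries (assumed truncation rule).
        return list(seq[len(seq) - maxlen:])
    filler = [value] * (maxlen - len(seq))
    return filler + list(seq) if padding == "pre" else list(seq) + filler


print(pad_sequence([277, 303, 96, 188], 8, padding="pre"))   # [0, 0, 0, 0, 277, 303, 96, 188]
print(pad_sequence([277, 303, 96, 188], 8, padding="post"))  # [277, 303, 96, 188, 0, 0, 0, 0]
```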


## Declaring Feature Dimensions

In [None]:
## 10 feature dimensions

dim = 10


## Building the ANN model

In [None]:
model = Sequential()  # creating an object of the Sequential class

# The Embedding layer is a trainable lookup table with one vector per vocabulary index.
# Like Word2Vec it produces dense word vectors, but here they start out as random values.
model.add(Embedding(voc_size,  # vocabulary size
                    dim,  # embedding dimension: every index is represented by a vector of 10 values
                    input_length=sent_length))  # sentence length (recent Keras versions may warn that this argument is deprecated)

model.compile('adam', 'mse')  # compiling with the Adam optimizer and the MSE loss function

In [None]:
model.summary()

## Predicting

In [None]:
embedded_docs[0] # After padding, the first sentence: the glass of milk

array([  0,   0,   0,   0, 277, 303,  96, 188], dtype=int32)

In [None]:
model.predict(embedded_docs[0])

1/1 ━━━━━━━━━━━━━━━━━━━━ 1s 858ms/step


array([[-0.01417743,  0.02342245, -0.00028545, -0.04773829,  0.0269583 ,
        -0.03326305,  0.04963494,  0.0115789 , -0.03794973, -0.01383566],
       [-0.01417743,  0.02342245, -0.00028545, -0.04773829,  0.0269583 ,
        -0.03326305,  0.04963494,  0.0115789 , -0.03794973, -0.01383566],
       [-0.01417743,  0.02342245, -0.00028545, -0.04773829,  0.0269583 ,
        -0.03326305,  0.04963494,  0.0115789 , -0.03794973, -0.01383566],
       [-0.01417743,  0.02342245, -0.00028545, -0.04773829,  0.0269583 ,
        -0.03326305,  0.04963494,  0.0115789 , -0.03794973, -0.01383566],
       [ 0.03374443, -0.04046713,  0.01686255,  0.03337267,  0.03554653,
         0.01253268,  0.01704322, -0.02261758, -0.0099019 , -0.0373638 ],
       [ 0.03374443, -0.04046713,  0.01686255,  0.03337267,  0.03554653,
         0.01253268,  0.01704322, -0.02261758, -0.0099019 , -0.0373638 ],
       [ 0.03374443, -0.04046713,  0.01686255,  0.03337267,  0.03554653,
         0.01253268,  0.01704322, -0.02261758

In [None]:
model.predict(embedded_docs[1])

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 32ms/step


array([[-0.01417743,  0.02342245, -0.00028545, -0.04773829,  0.0269583 ,
        -0.03326305,  0.04963494,  0.0115789 , -0.03794973, -0.01383566],
       [-0.01417743,  0.02342245, -0.00028545, -0.04773829,  0.0269583 ,
        -0.03326305,  0.04963494,  0.0115789 , -0.03794973, -0.01383566],
       [-0.01417743,  0.02342245, -0.00028545, -0.04773829,  0.0269583 ,
        -0.03326305,  0.04963494,  0.0115789 , -0.03794973, -0.01383566],
       [-0.01417743,  0.02342245, -0.00028545, -0.04773829,  0.0269583 ,
        -0.03326305,  0.04963494,  0.0115789 , -0.03794973, -0.01383566],
       [ 0.03374443, -0.04046713,  0.01686255,  0.03337267,  0.03554653,
         0.01253268,  0.01704322, -0.02261758, -0.0099019 , -0.0373638 ],
       [ 0.03374443, -0.04046713,  0.01686255,  0.03337267,  0.03554653,
         0.01253268,  0.01704322, -0.02261758, -0.0099019 , -0.0373638 ],
       [ 0.03374443, -0.04046713,  0.01686255,  0.03337267,  0.03554653,
         0.01253268,  0.01704322, -0.02261758

# Assignment

## Step 1: Declaring Sentences

In [8]:
#Assignment

sentences = ["The world is a better place",
      "Marvel series is my favourite movie",
      "I like DC movies",
      "the cat is eating the food",
      "Tom and Jerry is my favourite movie",
      "Python is my favourite programming language"
      ]

## Step 2: Importing tensorflow library

In [5]:
import tensorflow as tf

print(tf.__version__)


2.18.0


## Step 3: Declare Vocabulary Size

We need to specify the vocabulary size now, because one hot encoding uses it to assign an index to each word during the conversion.

Remember: the larger the vocabulary size, the less likely it is that two different words are assigned the same index, which gives a better feature representation.

In [11]:
voc_size = 500

## Step 3 (continued): Convert the sentences into OHE

### Importing library from TF Keras to perform OHE

In [6]:
from tensorflow.keras.preprocessing.text import one_hot

### Representing each sentence in form of OHE

In [14]:
onehot_embedded_docs = [ one_hot(sentence, voc_size) for sentence in sentences] # converting the sentences into one hot encoding

In [15]:
print(onehot_embedded_docs)

[[483, 422, 232, 32, 33, 352], [109, 157, 232, 99, 313, 476], [134, 223, 201, 336], [483, 361, 232, 230, 483, 246], [230, 462, 476, 232, 99, 313, 476], [123, 232, 99, 313, 484, 150]]


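In the above output, since we specified the vocabulary size as 500, all the indexes lie between 1 and 499.

483 is the index for the word "The", 422 is the index for the word "world", and so on. Because the indexes come from hashing, two different words can receive the same index: here "eating" and "Tom" are both mapped to 230, and "Jerry" and "movie" are both mapped to 476. The exact values can also differ between sessions; the word "the" was mapped to 277 in the earlier example.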
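
A quick way to check whether two different words were given the same index is to group the words by index. The sketch below does this for the output printed above, copied in as literals under the hypothetical names `corpus` and `recorded_indexes`. It assumes that `one_hot` lowercases the text and splits it on spaces, which holds for these punctuation-free sentences.

```python
from collections import defaultdict

corpus = ["The world is a better place",
          "Marvel series is my favourite movie",
          "I like DC movies",
          "the cat is eating the food",
          "Tom and Jerry is my favourite movie",
          "Python is my favourite programming language"]

recorded_indexes = [[483, 422, 232, 32, 33, 352],
                    [109, 157, 232, 99, 313, 476],
                    [134, 223, 201, 336],
                    [483, 361, 232, 230, 483, 246],
                    [230, 462, 476, 232, 99, 313, 476],
                    [123, 232, 99, 313, 484, 150]]

# Collect every distinct word that was assigned to each index.
words_by_index = defaultdict(set)
for sentence, indexes in zip(corpus, recorded_indexes):
    for word, index in zip(sentence.lower().split(), indexes):
        words_by_index[index].add(word)

# Keep only the indexes shared by more than one word.
collisions = {index: sorted(words)
              for index, words in words_by_index.items()
              if len(words) > 1}
print(collisions)
```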

## Step 4: Applying padding to make sentences length equal

### Importing library to perform padding from keras

In [16]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

The maximum sentence length in the given corpus is 7 words ("Tom and Jerry is my favourite movie"). Let us use a sentence length of 10.

In [17]:
sent_length = 10
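
The longest sentence can also be found programmatically rather than by eye. The sketch below counts whitespace-separated words, which is an assumption that matches these simple sentences; the name `corpus` is a hypothetical copy of the sentence list.

```python
corpus = ["The world is a better place",
          "Marvel series is my favourite movie",
          "I like DC movies",
          "the cat is eating the food",
          "Tom and Jerry is my favourite movie",
          "Python is my favourite programming language"]

# Number of words in each sentence, then the largest of those counts.
lengths = [len(sentence.split()) for sentence in corpus]
longest = max(lengths)

print(lengths)  # [6, 6, 4, 6, 7, 6]
print(longest)  # 7
```

Any sentence length of at least `longest` avoids cutting words off during padding.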

### Performing pre-padding operation

In [18]:
onehot_embedded_docs_padded = pad_sequences(onehot_embedded_docs, # passing the documents which are one hot encoded
                                            padding='pre', # choosing prepadding
                                            maxlen=sent_length) # specifying the maximum sentence length

In [19]:
onehot_embedded_docs_padded

array([[  0,   0,   0,   0, 483, 422, 232,  32,  33, 352],
       [  0,   0,   0,   0, 109, 157, 232,  99, 313, 476],
       [  0,   0,   0,   0,   0,   0, 134, 223, 201, 336],
       [  0,   0,   0,   0, 483, 361, 232, 230, 483, 246],
       [  0,   0,   0, 230, 462, 476, 232,  99, 313, 476],
       [  0,   0,   0,   0, 123, 232,  99, 313, 484, 150]], dtype=int32)

As we can observe in the output, all the sentences are converted to a fixed length of 10. At this point, the data is ready to be passed to our ANN.

## Step 5: Finding the Embeddings

### Importing sequential and embedding from keras

In [20]:
from tensorflow.keras.models import Sequential # Sequential model
from tensorflow.keras.layers import Embedding # Embedding layer

### Creating object of Sequential model

In [25]:
model = Sequential()

### Adding the embedding layer

In [26]:
model.add(Embedding(input_dim=voc_size, # vocabulary size
                    output_dim=5, # output dimension of 5: after padding each sentence holds 10 indexes, and each index is represented by a vector of 5 values
                    input_length=sent_length)) # sentence length of 10



### Viewing the model summary

In [27]:
model.summary()

## Step 6: Prediction

Let's first display the first sentence of our corpus after padding.

In [28]:
onehot_embedded_docs_padded[0]

array([  0,   0,   0,   0, 483, 422, 232,  32,  33, 352], dtype=int32)

In [33]:
onehot_embedded_docs_padded[0].shape

(10,)

In [31]:
model.predict(onehot_embedded_docs_padded[0])

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 29ms/step


array([[ 3.0929152e-02,  6.5941699e-03, -1.5458070e-02,  4.1523203e-03,
        -4.8915006e-02],
       [ 3.0929152e-02,  6.5941699e-03, -1.5458070e-02,  4.1523203e-03,
        -4.8915006e-02],
       [ 3.0929152e-02,  6.5941699e-03, -1.5458070e-02,  4.1523203e-03,
        -4.8915006e-02],
       [ 3.0929152e-02,  6.5941699e-03, -1.5458070e-02,  4.1523203e-03,
        -4.8915006e-02],
       [ 2.0910725e-03, -4.4374395e-02,  4.9491551e-02,  1.6825352e-02,
         1.8227100e-04],
       [ 9.2148185e-03,  1.1051595e-02,  3.4328811e-03,  1.8353675e-02,
         1.4357869e-02],
       [-1.7058171e-02, -9.2722066e-03, -1.9316785e-03, -3.0861974e-02,
         3.6418941e-02],
       [ 7.5140819e-03, -3.3537269e-02,  5.3764097e-03, -3.8585819e-02,
         8.7975748e-03],
       [-2.8642654e-02, -4.1455485e-02, -2.8335845e-02, -1.5461337e-02,
        -3.0452037e-02],
       [-1.1373311e-05,  1.5189733e-02,  1.4620211e-02,  4.4613812e-02,
         1.5127230e-02]], dtype=float32)

In [32]:
model.predict(onehot_embedded_docs_padded[0]).shape

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step


(10, 5)

### Understanding output

If we look closely at the shape of our padded document, it is (10,): a one-dimensional array of 10 indexes. It means the first sentence, "The world is a better place", is represented by 10 values after one hot encoding and padding.

Once this array is passed to the embedding layer, each word index is converted into a vector of 5 values. Hence the output shape is (10, 5).
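
The same reasoning extends to a batch of sentences: a batch of shape (number of sentences, 10) becomes (number of sentences, 10, 5) after the lookup. The sketch below checks this with plain lists. The table holds placeholder vectors rather than trained weights, and the helpers `embed_batch` and `shape` are hypothetical names used only for this illustration.

```python
def embed_batch(batch, table):
    """Replace every index in a batch of padded sentences by its vector."""
    return [[table[i] for i in sentence] for sentence in batch]


def shape(nested):
    """Return the dimensions of a regular nested list as a tuple."""
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)


output_dim = 5
# Placeholder vectors: row i simply repeats the value i five times.
table = [[float(i)] * output_dim for i in range(500)]

padded = [[0, 0, 0, 0, 483, 422, 232, 32, 33, 352],
          [0, 0, 0, 0, 0, 0, 134, 223, 201, 336]]

print(shape(padded))                      # (2, 10)
print(shape(embed_batch(padded, table)))  # (2, 10, 5)
```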