# Neural Network Based Embeddings 

Because relationships in natural language are complex and nonlinear, deep learning models have quickly emerged as an alternative to count-based techniques for generating embeddings.

## Word2Vec

Word2Vec is one such neural-network-based method; it generates embeddings from tokenized, processed text. Word2Vec implementations can use different network architectures, but two common ones are the **Continuous Bag of Words (CBOW)** architecture and the **skip-gram** architecture.

![NN](images/nn.png)

## CBOW vs. Skip Gram


CBOW, a feed-forward neural network, selects a target word and uses distributed representations of the surrounding context words to predict it. The skip-gram architecture also uses a single hidden layer, but conceptually reverses the input and output of the CBOW approach: the current word is the input to the model, which attempts to predict the context words around it.
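To make the difference concrete, here is a minimal sketch of how training examples could be built from one tokenized sentence. The helper `context_pairs` and the fixed `window=2` are illustrative assumptions, not part of gensim; real implementations add details such as subsampling and a randomly shrunk window.

```python
def context_pairs(tokens, window=2):
    """Yield (context_words, target_word) for every position in tokens."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

tokens = ["point", "me", "in", "the", "direction"]

# CBOW: the whole context is the input, the target word is the output
cbow_examples = [(context, target) for context, target in context_pairs(tokens)]

# Skip-gram: the target word is the input, each context word is an output
skip_gram_examples = [
    (target, word)
    for context, target in context_pairs(tokens)
    for word in context
]

print(cbow_examples[2])        # (['point', 'me', 'the', 'direction'], 'in')
print(skip_gram_examples[:2])  # [('point', 'me'), ('point', 'in')]
```

Note that skip-gram produces one example per (word, context word) pair, so it yields more training examples than CBOW from the same text.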

In [None]:
pip install gensim

In [None]:
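import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from gensim.models import Word2Vec

# The tokenizers need NLTK's Punkt data (named "punkt_tab" in newer NLTK releases)
nltk.download("punkt")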

In [None]:
corpus = "Hey! I'm new in town. Can you please point me in the direction of the grocery store"

In order to pre-process our data for the `Word2Vec` model, we'll first need to tokenize the corpus by sentence then by word. 

In [None]:
# Tokenize by sentence, then by word, lowercasing every token
data = [
    [token.lower() for token in word_tokenize(sentence)]
    for sentence in sent_tokenize(corpus)
]

After we've tokenized, we can train a `Word2Vec` model using the small corpus above. Training this model will allow for similarity calculations downstream.

In [None]:
# Train Word2Vec using CBOW
cbow = Word2Vec(data, min_count=1, vector_size=100, window=5, sg=0)

The `Word2Vec` class has several parameters.
- The `min_count` parameter ignores all words whose total frequency is lower than this value. With `min_count=1`, every word in the corpus is kept.
- The `vector_size` parameter sets the dimensionality of the word vectors.
- The `window` parameter sets the maximum distance between the current and predicted word within a sentence.
- The `sg` parameter selects the training algorithm. Setting `sg=0` means CBOW is used, while `sg=1` means skip-gram is used.

Read more about the various parameters [here](https://tedboy.github.io/nlps/_modules/gensim/models/word2vec.html#Word2Vec).
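As a rough illustration of what `min_count` does, the sketch below prunes a vocabulary by frequency in plain Python. The helper `build_vocab` and the sample sentences are assumptions made for this example; this is not gensim's internal code, only the same thresholding idea.

```python
from collections import Counter

# A few pre-tokenized sentences, similar in shape to `data` above
sentences = [
    ["hey", "!"],
    ["i", "'m", "new", "in", "town", "."],
    ["point", "me", "in", "the", "direction"],
]

def build_vocab(sentences, min_count):
    """Keep only tokens that occur at least min_count times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, count in counts.items() if count >= min_count}

vocab_all = build_vocab(sentences, min_count=1)
vocab_frequent = build_vocab(sentences, min_count=2)

print(len(vocab_all))   # 12
print(vocab_frequent)   # {'in'}
```

On a corpus this small, anything above `min_count=1` would discard almost every word, which is why the cells here keep it at 1.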

In [None]:
# Train Word2Vec using Skip Gram
skip_gram = Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)

In [None]:
# Calculate similarities 
cbow_similarity = cbow.wv.similarity("town", "direction")
skip_gram_similarity = skip_gram.wv.similarity("town", "direction")

# Print results
print(f"Cosine similarity between `town` and `direction` using CBOW Model: {cbow_similarity}")
print(f"Cosine similarity between `town` and `direction` using Skip Gram Model: {skip_gram_similarity}")


Since our corpus is so small, the vectors barely move from their initial values during training, so it's likely that these two values will be very similar and not very meaningful. As the data grows, the `Word2Vec` model is able to pick up on nuances between words.
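The similarity score used above is a cosine similarity: the dot product of two vectors divided by the product of their lengths. A minimal NumPy sketch, where the function name `cosine_similarity` and the toy vectors are assumptions for illustration:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, in the range [-1, 1]."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same = cosine_similarity([1.0, 0.0], [1.0, 0.0])       # identical direction
diagonal = cosine_similarity([1.0, 0.0], [1.0, 1.0])   # 45 degrees apart
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])  # opposite direction

print(same, diagonal, opposite)
```

Values near 1 mean the vectors point in nearly the same direction, values near 0 mean they are close to orthogonal, and values near -1 mean they point in opposite directions.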

# Word Embedding using Keras
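 - Define the vocabulary size as a parameter
 - Convert each word to an integer index with `one_hot` (despite its name, this function hashes each word to an integer index rather than building a one-hot vector)
 - Define the feature size as a parameter
 - Convert each word index to a dense vector whose length is the feature size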

In [2]:
import tensorflow

In [7]:
from tensorflow.keras.preprocessing.text import one_hot

In [8]:
sent = ['the glass of milk',
        'the glass of juice',
        'understanding the meaning of words',
        'My name is GOuorav Sen',
        'your videos are good',
        'who are you']

In [9]:
voc_size=10000

In [11]:
indexed_rep=[one_hot(sentence,voc_size) for sentence in sent]

In [12]:
indexed_rep

[[7103, 2242, 808, 349],
 [7103, 2242, 808, 8023],
 [1394, 7103, 2223, 808, 4603],
 [2276, 49, 9182, 1084, 1719],
 [96, 5729, 8002, 3680],
 [2505, 8002, 10]]
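Each word is mapped to an integer by hashing it into the range of the vocabulary size, which is why repeated words such as "the" and "glass" get the same index in every sentence. The sketch below mimics that idea with a hypothetical helper, `hash_encode`, that uses MD5 so the result is reproducible. It is an assumption-laden stand-in, not Keras's implementation: it skips punctuation filtering, and its indices will not match the output above.

```python
import hashlib

def hash_encode(text, n):
    """Map each whitespace-separated word to an integer in [1, n - 1]."""
    indices = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is left free so it can be used for padding later
        indices.append(int(digest, 16) % (n - 1) + 1)
    return indices

first = hash_encode("the glass of milk", 10000)
second = hash_encode("the glass of juice", 10000)

# The shared prefix "the glass of" maps to the same three indices
print(first[:3] == second[:3])  # True
```

Because this is hashing rather than a true vocabulary lookup, two different words can collide on the same index; a larger vocabulary size makes that less likely.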

The sentences have different lengths, so we use `pad_sequences`, which pads the integer representation of each sentence to a common length.

In [13]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
import numpy as np

In [15]:
max_sent_length=8
embedded_docs=pad_sequences(indexed_rep,padding='pre',maxlen=max_sent_length)
embedded_docs

array([[   0,    0,    0,    0, 7103, 2242,  808,  349],
       [   0,    0,    0,    0, 7103, 2242,  808, 8023],
       [   0,    0,    0, 1394, 7103, 2223,  808, 4603],
       [   0,    0,    0, 2276,   49, 9182, 1084, 1719],
       [   0,    0,    0,    0,   96, 5729, 8002, 3680],
       [   0,    0,    0,    0,    0, 2505, 8002,   10]])

Each word is now represented by its index in a vocabulary of size 10,000, and each sentence is a vector of length 8. Zeros are added at the start of each vector because `padding='pre'` was passed.
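The pre-padding behaviour can be sketched in a few lines of plain Python. The helper `pad_pre` is hypothetical and only covers the case used here (pad and truncate at the front); `pad_sequences` itself supports more options.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to maxlen items."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # if too long, keep only the last maxlen items
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

result = pad_pre([[7103, 2242, 808, 349], [2505, 8002, 10]], maxlen=8)
print(result)
# [[0, 0, 0, 0, 7103, 2242, 808, 349], [0, 0, 0, 0, 0, 2505, 8002, 10]]
```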

In [20]:
feature_length = 10
# Sequential expects a list of layers
model = Sequential([
    Embedding(voc_size, feature_length, mask_zero=True)
])
model.compile('adam', 'mse')

In [21]:
print(model.predict(embedded_docs))

[[[-0.01128607 -0.00849074 -0.02349017 -0.04639615  0.04637443
   -0.02005479 -0.04243258  0.02892577  0.00104243  0.03111192]
  [-0.01128607 -0.00849074 -0.02349017 -0.04639615  0.04637443
   -0.02005479 -0.04243258  0.02892577  0.00104243  0.03111192]
  [-0.01128607 -0.00849074 -0.02349017 -0.04639615  0.04637443
   -0.02005479 -0.04243258  0.02892577  0.00104243  0.03111192]
  [-0.01128607 -0.00849074 -0.02349017 -0.04639615  0.04637443
   -0.02005479 -0.04243258  0.02892577  0.00104243  0.03111192]
  [ 0.04734812  0.03247236  0.02644009 -0.03757442  0.02390346
    0.02729023  0.03885617 -0.04646749  0.01124547 -0.0367664 ]
  [-0.04382465  0.02960933  0.01386192 -0.03950764 -0.0191502
   -0.03138149  0.04424322  0.04320553 -0.01413291  0.0482845 ]
  [-0.03808812 -0.04608783 -0.02595259  0.03688293  0.04998564
   -0.04328847 -0.02208418 -0.02825901 -0.02601417  0.04060811]
  [-0.03251044 -0.00560902 -0.0339759   0.03864704  0.04305229
    0.02481123 -0.01905105 -0.04114872 -0.0010180

In [23]:
embedded_docs[0]

array([   0,    0,    0,    0, 7103, 2242,  808,  349])

`embedded_docs[0]` is a vector of length 8 in which each integer represents a word, except 0, which is used as padding. The `Embedding` layer converts each integer to a dense vector of length `feature_length`, the parameter we chose above.
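Conceptually, an embedding layer is a lookup table with one row per vocabulary index. The NumPy sketch below illustrates that idea under stated assumptions: the table is filled with small random values from a fixed seed, so the numbers will not match the Keras output, and names such as `embedding_table` are made up for this example.

```python
import numpy as np

voc_size, feature_length = 10000, 10

# One row of `feature_length` values per vocabulary index (random, untrained)
rng = np.random.default_rng(0)
embedding_table = rng.uniform(-0.05, 0.05, size=(voc_size, feature_length))

# The padded first sentence from above
sentence = np.array([0, 0, 0, 0, 7103, 2242, 808, 349])

# Integer-array indexing picks one row per token
vectors = embedding_table[sentence]

print(vectors.shape)                           # (8, 10)
print(np.array_equal(vectors[0], vectors[1]))  # True
```

The four padding positions all share index 0, so they all receive the same row, which matches the repeated first rows in the model output.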

In [22]:
# Slice with [:1] to keep the batch dimension, then take the single result
print(model.predict(embedded_docs[:1])[0])

[[-0.01128607 -0.00849074 -0.02349017 -0.04639615  0.04637443 -0.02005479
  -0.04243258  0.02892577  0.00104243  0.03111192]
 [-0.01128607 -0.00849074 -0.02349017 -0.04639615  0.04637443 -0.02005479
  -0.04243258  0.02892577  0.00104243  0.03111192]
 [-0.01128607 -0.00849074 -0.02349017 -0.04639615  0.04637443 -0.02005479
  -0.04243258  0.02892577  0.00104243  0.03111192]
 [-0.01128607 -0.00849074 -0.02349017 -0.04639615  0.04637443 -0.02005479
  -0.04243258  0.02892577  0.00104243  0.03111192]
 [ 0.04734812  0.03247236  0.02644009 -0.03757442  0.02390346  0.02729023
   0.03885617 -0.04646749  0.01124547 -0.0367664 ]
 [-0.04382465  0.02960933  0.01386192 -0.03950764 -0.0191502  -0.03138149
   0.04424322  0.04320553 -0.01413291  0.0482845 ]
 [-0.03808812 -0.04608783 -0.02595259  0.03688293  0.04998564 -0.04328847
  -0.02208418 -0.02825901 -0.02601417  0.04060811]
 [-0.03251044 -0.00560902 -0.0339759   0.03864704  0.04305229  0.02481123
  -0.01905105 -0.04114872 -0.00101805  0.00302534]]

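This is how we obtain the embedding of a sentence: one vector of length `feature_length` for each of its 8 token positions. Because the model has not been trained, these vectors are only the layer's initial weights.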