# What are word embeddings?
A word embedding is a learned representation of text in which words with similar meanings have similar representations. This approach to representing words and documents may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.
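To make "similar representation" concrete, here is a minimal sketch that compares vectors with cosine similarity. The three-dimensional vectors are made up by hand for illustration; they are not learned embeddings, and `cosine_similarity` is a helper defined only for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical vectors chosen by hand; a trained model would learn these values.
toy_embeddings = {
    "tea":    [0.9, 0.8, 0.1],
    "coffee": [0.8, 0.9, 0.2],
    "boy":    [0.1, 0.2, 0.9],
}

sim_related = cosine_similarity(toy_embeddings["tea"], toy_embeddings["coffee"])
sim_unrelated = cosine_similarity(toy_embeddings["tea"], toy_embeddings["boy"])
print(sim_related > sim_unrelated)  # related words score higher in this toy setup
```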

# Word2vec Representation
Word2vec is a distributed word representation learning technique, introduced in 2013, that is widely used as a feature engineering technique for many NLP tasks.

### Advantages 
 * The Word2vec approach does not depend on human knowledge of language, as the WordNet-based approach does.
 * The size of a Word2vec representation vector is independent of the vocabulary size, unlike a one-hot encoded representation or a word co-occurrence matrix.
 * Word2vec is a distributed representation. In a localist representation (for example, one-hot encoding), the representation depends on the activation of a single element of the vector. A distributed representation depends on the activation pattern of all the elements in the vector, which gives Word2vec more expressive power than a one-hot encoded representation.
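The contrast with one-hot encoding can be sketched in a few lines. The tiny vocabulary and the `one_hot_vector` helper below are assumptions made for this example only.

```python
# Hypothetical six-word vocabulary, used only for this illustration.
vocabulary = ["the", "glass", "of", "milk", "cup", "tea"]

def one_hot_vector(word, vocab):
    """Localist encoding: a single 1 at the word's position, 0 elsewhere."""
    vec = [0] * len(vocab)
    vec[vocab.index(word)] = 1
    return vec

milk = one_hot_vector("milk", vocabulary)
tea = one_hot_vector("tea", vocabulary)
overlap = sum(a * b for a, b in zip(milk, tea))

print(len(milk))  # 6: the vector grows with the vocabulary
print(overlap)    # 0: two different one-hot vectors never overlap
```

Because any two distinct one-hot vectors have a dot product of zero, they carry no notion of similarity, whereas a dense vector of fixed size can place related words close together.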
 
### Main idea 
Word2vec learns the meaning of a given word by looking at its context and representing it numerically. By context, we mean a fixed number of words before and after the word of interest.

# The continuous bag-of-words model
The CBOW model architecture tries to predict the current target word (the center word) based on the source context words (surrounding words).

Consider a simple sentence, “the quick brown fox jumps over the lazy dog”. It can be split into (context_window, target_word) pairs. With a context window of size 2 (one word on each side of the target), we get examples like ([quick, fox], brown), ([the, brown], quick), ([the, dog], lazy) and so on. The model then tries to predict the target_word from the context_window words.
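The pair construction can be sketched as follows. The helper name `cbow_pairs` and its `half_window` parameter are assumptions for this example; words at the sentence edges are skipped here because they lack a full context on both sides.

```python
def cbow_pairs(tokens, half_window=1):
    """Return (context_words, target_word) pairs with full context on both sides."""
    pairs = []
    for i in range(half_window, len(tokens) - half_window):
        left = tokens[i - half_window:i]
        right = tokens[i + 1:i + 1 + half_window]
        pairs.append((left + right, tokens[i]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog"
pairs = cbow_pairs(sentence.split())
for context, target in pairs:
    print(context, "->", target)
```

With `half_window=1` the context holds two words in total, which matches the window of size 2 used in the examples.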

<img src = 'https://www.researchgate.net/profile/Daniel-Braun-6/publication/326588219/figure/fig1/AS:652185784295425@1532504616288/Continuous-Bag-of-words-CBOW-CB-and-Skip-gram-SG-training-model-illustrations.png'>
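The CBOW side of the figure can be read as a small forward pass: look up the context word vectors, average them, and score every vocabulary word with a softmax. The sketch below is a plain-Python illustration under assumed settings (random untrained weights, an arbitrary embedding size `D = 4`, a full softmax); it is not the training code of any particular library.

```python
import math
import random

random.seed(0)
vocab = ["the", "quick", "brown", "fox", "jumps", "over", "lazy", "dog"]
word_to_id = {w: i for i, w in enumerate(vocab)}
V, D = len(vocab), 4  # D is an arbitrary small embedding size for illustration

# Randomly initialised weights; a real model would learn these during training.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(D)] for _ in range(V)]   # V x D
W_out = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(D)]  # D x V

def cbow_forward(context_words):
    """Return a probability for each vocabulary word being the target."""
    ids = [word_to_id[w] for w in context_words]
    # Hidden layer: average of the context word vectors.
    hidden = [sum(W_in[i][d] for i in ids) / len(ids) for d in range(D)]
    # Output layer: one score per vocabulary word.
    scores = [sum(hidden[d] * W_out[d][j] for d in range(D)) for j in range(V)]
    # Softmax, shifted by the maximum score for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = cbow_forward(["quick", "fox"])
print(len(probs), round(sum(probs), 6))
```

Training would adjust `W_in` and `W_out` so that the probability of the true target word ("brown" for this context) goes up; the rows of `W_in` are then used as the word embeddings.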

In [7]:
from keras.preprocessing.text import one_hot
import tensorflow
import numpy as np 

In [8]:
tensorflow.__version__

'2.6.0'

In [9]:
corpus =[
    'the glass of milk',
    'the cup of tea',
    'I am a good boy',
    'I am a good developer',
    'Understand the meaning of words'   
]

In [10]:
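voc_size = 1000  # size of the hashing space for word indices
labels = np.array([1, 1, 0, 0, 2])  # one class label per sentence (three classes: 0, 1, 2)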

In [11]:
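# Despite its name, one_hot returns hashed integer indices in [1, voc_size), not one-hot vectors
onehot = [one_hot(words, voc_size) for words in corpus]
onehot  # index representation of each sentence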

[[660, 768, 298, 252],
 [660, 637, 298, 953],
 [198, 910, 854, 56, 955],
 [198, 910, 854, 56, 679],
 [87, 660, 321, 298, 133]]
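Note that the same word always maps to the same index in the output above (for example, "the" is 660 and "of" is 298 in every sentence). Keras documents `one_hot` as a wrapper around a hashing function, so uniqueness of the word-to-index mapping is not guaranteed. The sketch below illustrates the general idea with an md5-based hash; the helper names are hypothetical and the resulting numbers are not expected to match the Keras output.

```python
import hashlib

def stable_index(word, voc_size):
    """Map a word to an index in [1, voc_size - 1] using an md5 hash (assumed choice)."""
    digest = hashlib.md5(word.lower().encode("utf-8")).hexdigest()
    return int(digest, 16) % (voc_size - 1) + 1  # index 0 stays free for padding

def hashed_indices(sentence, voc_size):
    """Hash every whitespace-separated word of a sentence."""
    return [stable_index(w, voc_size) for w in sentence.split()]

voc_size = 1000
first = hashed_indices("the glass of milk", voc_size)
second = hashed_indices("the cup of tea", voc_size)
print(first)
print(second)
```

Two different words can land on the same index (a collision); a larger `voc_size` makes that less likely.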

# Word embedding representation 

In [12]:
from keras.layers import Embedding
from keras.preprocessing.sequence import pad_sequences 
from keras.models import Sequential

In [13]:
import numpy as np 

In [14]:
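sent_length = 8  # every sentence is padded to this many indices
# padding='pre' adds zeros at the front of each sequence
embedded_docs = pad_sequences(onehot, padding='pre', maxlen=sent_length)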
print(embedded_docs)

[[  0   0   0   0 660 768 298 252]
 [  0   0   0   0 660 637 298 953]
 [  0   0   0 198 910 854  56 955]
 [  0   0   0 198 910 854  56 679]
 [  0   0   0  87 660 321 298 133]]
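What pre-padding does to each row can be sketched in plain Python. The `pad_pre` helper is a hypothetical stand-in written for this example. It mimics only the zero-padding at the front and, for sequences that are too long, keeps the last `maxlen` items.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen` (assumes maxlen > 0)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen items if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

indices = [[660, 768, 298, 252], [198, 910, 854, 56, 955]]
for row in pad_pre(indices, maxlen=8):
    print(row)
```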


In [15]:
from keras.layers import Flatten
from keras.layers import Dense

dim = 10  # number of dimensions of each word embedding vector

In [16]:
model = Sequential()
# Map each of the voc_size indices to a trainable dim-dimensional vector
model.add(Embedding(voc_size, dim, input_length=sent_length))
model.add(Flatten())  # (sent_length, dim) -> sent_length * dim features
# labels holds three classes (0, 1, 2), so the output layer needs three units
model.add(Dense(3, activation='softmax'))
# Integer class labels call for the sparse variant of categorical cross-entropy
model.compile('adam', 'sparse_categorical_crossentropy', metrics=['accuracy'])

In [17]:
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 8, 10)             10000     
_________________________________________________________________
flatten (Flatten)            (None, 80)                0         
_________________________________________________________________
dense (Dense)                (None, 3)                 243       
Total params: 10,243
Trainable params: 10,243
Non-trainable params: 0
_________________________________________________________________
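The parameter counts in the summary follow directly from the layer shapes: the embedding layer stores one `dim`-sized vector per index, and a dense layer has one weight per input-output pair plus one bias per output unit. The helper functions below are written for this example only.

```python
def embedding_params(vocab_size, embedding_dim):
    """One trainable vector of size embedding_dim per vocabulary index."""
    return vocab_size * embedding_dim

def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units

voc_size, dim, sent_length = 1000, 10, 8
flattened = sent_length * dim  # Flatten turns (8, 10) into 80 features

print(embedding_params(voc_size, dim))  # 10000
print(flattened)                        # 80
for n_units in (2, 3):  # candidate output sizes, for comparison with the summary
    print(n_units, dense_params(flattened, n_units))
```

The Flatten layer only reshapes its input, so it contributes no parameters.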


In [18]:
embedded_docs[0]

array([  0,   0,   0,   0, 660, 768, 298, 252])
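Conceptually, the Embedding layer treats each of these indices as a row number in a `voc_size` x `dim` table. The sketch below imitates that lookup in plain Python; the table is filled with small random numbers as an assumption, since the real values are set by the layer's initializer and then updated during training.

```python
import random

random.seed(0)
voc_size, dim = 1000, 10

# Hypothetical, randomly initialised table: one row of `dim` floats per index.
embedding_table = [
    [random.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(voc_size)
]

padded_doc = [0, 0, 0, 0, 660, 768, 298, 252]
embedded = [embedding_table[i] for i in padded_doc]  # one row per position

print(len(embedded), len(embedded[0]))  # 8 10
```

The result has one `dim`-sized vector per position, which matches the `(None, 8, 10)` output shape of the embedding layer; all padding positions share the row for index 0.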