# Word Embedding
##### 1. Word2Vec
##### 2. GloVe

`A word embedding is a vector representation of a particular word.`

```
sent1 : Have a good day
sent2 : Have a great day

vocabulary v = {Have, a, good, great, day}

The one-hot representation of the vocabulary is then

Have  = [1,0,0,0,0]
a     = [0,1,0,0,0]
good  = [0,0,1,0,0]
great = [0,0,0,1,0]
day   = [0,0,0,0,1]
```

If we try to visualize these encodings, we can think of a 5-dimensional space in which each word occupies its own axis and has no projection along any other axis. This means ‘good’ and ‘great’ are exactly as different as ‘day’ and ‘have’, which does not match their meanings.

One-hot vectors therefore cannot express similarity between words.
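To make the limitation of one-hot encoding concrete, here is a minimal sketch in plain Python. The helper names `one_hot_vector` and `dot` are illustrative, not taken from any library. It builds the one-hot vectors for the vocabulary above and shows that the dot product of any two different words is 0, so 'good' is no closer to 'great' than it is to 'day'.

```python
vocabulary = ["Have", "a", "good", "great", "day"]

def one_hot_vector(word, vocab):
    """Return a list with 1 at the word's position and 0 everywhere else."""
    return [1 if w == word else 0 for w in vocab]

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

vectors = {w: one_hot_vector(w, vocabulary) for w in vocabulary}

print(vectors["good"])                          # [0, 0, 1, 0, 0]
print(dot(vectors["good"], vectors["great"]))   # 0
print(dot(vectors["good"], vectors["day"]))     # 0
print(dot(vectors["good"], vectors["good"]))    # 1
```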

<img src='https://3qeqpr26caki16dnhd19sv6by6v-wpengine.netdna-ssl.com/wp-content/uploads/2017/08/Word2Vec-Training-Models.png' style="width: 800px;">

### CBOW model: This method takes the context words around a position as the input and tries to predict the word at that position.
### Skip-Gram model: We use the target word (whose representation we want to generate) to predict its context words, and the representations are produced as a by-product of that training.
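The difference between the two models is easiest to see in the training pairs they consume. The sketch below is only an illustration in plain Python, not the Word2Vec training code; the function names and the `window` parameter are assumptions made for the example. CBOW gets (context words, target) pairs, while Skip-Gram gets one (target, context word) pair per neighbour.

```python
def cbow_pairs(tokens, window=1):
    """Build (context words, target word) pairs: CBOW predicts the target."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((left + right, target))
    return pairs

def skipgram_pairs(tokens, window=1):
    """Build (target word, context word) pairs: Skip-Gram predicts the context."""
    return [(target, c)
            for context, target in cbow_pairs(tokens, window)
            for c in context]

tokens = "Have a good day".split()

print(cbow_pairs(tokens))
# [(['a'], 'Have'), (['Have', 'good'], 'a'), (['a', 'day'], 'good'), (['good'], 'day')]

print(skipgram_pairs(tokens))
# [('Have', 'a'), ('a', 'Have'), ('a', 'good'), ('good', 'a'), ('good', 'day'), ('day', 'good')]
```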


<img src='https://miro.medium.com/max/2598/1*sAJdxEsDjsPMioHyzlN3_A.png' style="width: 800px;">

### Cosine Similarity

##### Cosine similarity measures the similarity between two vectors of an inner product space. It is the cosine of the angle between the two vectors and indicates whether they point in roughly the same direction. It is often used to measure document similarity in text analysis.
##### Cosine similarity ranges from -1 to +1: +1 for vectors pointing in the same direction, 0 for orthogonal vectors, and -1 for vectors pointing in opposite directions. When every component is non-negative (for example, word counts), it lies between 0 and 1.
<img src='https://neo4j.com/docs/graph-algorithms/current/images/cosine-similarity.png'>
<img src='https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/assets/2b4a7a82-ad4c-4b2a-b808-e423a334de6f.png' style="width:500px;">

<img src='https://www.pyimagesearch.com/wp-content/uploads/2014/02/sim_metric_cosine.png' style="width:800px;">
<img src='https://datascience-enthusiast.com/figures/cosine_sim.png' style="width:800px;">
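As a small worked example, cosine similarity can be computed as the dot product divided by the product of the two vector lengths, cos θ = (u · v) / (‖u‖ ‖v‖). The function below is a minimal sketch in plain Python (the name `cosine_similarity` is chosen for this example), and it assumes neither vector is all zeros.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (both must be non-zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1, 2], [2, 4])     # close to +1
orthogonal = cosine_similarity([1, 0], [0, 1])         # 0.0
opposite = cosine_similarity([1, 2], [-1, -2])         # close to -1

print(same_direction, orthogonal, opposite)
```

Only the direction matters: `[1, 2]` and `[2, 4]` have different lengths but point the same way, so their similarity is (up to floating-point rounding) 1.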

# Keras
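    1. Define the vocabulary size
    2. Convert each sentence to its one-hot representation (with Keras one_hot, one integer index per word)
    3. Pass the representation to the Embedding layer, parameters = {dimensions, representation}, which converts it into a feature representation (a matrix of vectors)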
    4. Embedding matrix
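The steps above come down to a table lookup: the embedding matrix has one row per word index, and embedding a sentence means picking the rows for its indices. The following is a rough sketch in plain Python, not the Keras implementation; the tiny sizes, the initialisation range, and the helper names are assumptions for illustration. It also shows that the lookup gives the same result as multiplying an explicit one-hot vector by the matrix.

```python
import random

random.seed(0)
voc_size, dim = 6, 3  # tiny illustrative sizes; the notebook uses 10000 and 10

# Embedding matrix: one row of `dim` small random numbers per word index.
embedding_matrix = [[random.uniform(-0.05, 0.05) for _ in range(dim)]
                    for _ in range(voc_size)]

def embed(indices):
    """Look up the embedding row for each word index."""
    return [embedding_matrix[i] for i in indices]

def one_hot_times_matrix(index):
    """Same vector, computed as (one-hot row vector) x (embedding matrix)."""
    one_hot_vec = [1.0 if j == index else 0.0 for j in range(voc_size)]
    return [sum(one_hot_vec[j] * embedding_matrix[j][k] for j in range(voc_size))
            for k in range(dim)]

padded_sentence = [0, 0, 4, 2]  # word indices, left-padded with 0
embedded = embed(padded_sentence)

print(len(embedded), len(embedded[0]))  # 4 3
print(embedded[0] == embedded[1])       # True: both padding positions share row 0

same = all(abs(a - b) < 1e-12
           for a, b in zip(embedded[2], one_hot_times_matrix(4)))
print(same)                             # True
```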
    

In [1]:
from tensorflow.keras.preprocessing.text import one_hot

In [2]:
### sentences
sent = [
    'the glass of milk',
    'the glass of juice',
    'the cup of tea',
    'I am a good boy',
    'I am a good developer',
    'understand the meaning of words',
    'your videos are good',
]

In [3]:
sent

['the glass of milk',
 'the glass of juice',
 'the cup of tea',
 'I am a good boy',
 'I am a good developer',
 'understand the meaning of words',
 'your videos are good']

In [4]:
### Vocabulary size
voc_size=10000

In [5]:
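# one_hot returns one integer index per word for each sentence (indices are not guaranteed to be unique per word)
onehot_repr=[one_hot(words,voc_size) for words in sent]
print(onehot_repr)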

[[7597, 6419, 3355, 852], [7597, 6419, 3355, 8393], [7597, 1396, 3355, 5682], [2100, 6217, 2314, 4512, 7194], [2100, 6217, 2314, 4512, 1245], [9385, 7597, 7958, 3355, 8590], [4164, 344, 3824, 4512]]
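Note that the output is a list of integer indices rather than literal one-hot vectors, and that a repeated word always gets the same index within a run (7597 for 'the' above). The sketch below imitates that behaviour with a stable md5 hash. It is a hypothetical stand-in written for this example, not the Keras code, so its indices will not match the ones printed above.

```python
import hashlib

def hashed_indices(sentence, voc_size):
    """Map each word to an integer in [1, voc_size - 1] using a stable hash."""
    indices = []
    for word in sentence.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        # Index 0 is left unused so it can serve as the padding value.
        indices.append(int(digest, 16) % (voc_size - 1) + 1)
    return indices

a = hashed_indices("the glass of milk", 10000)
b = hashed_indices("the glass of juice", 10000)

print(a)
print(b)
print(a[:3] == b[:3])  # True: shared words map to the same indices
```

Because the index is a hash taken modulo the vocabulary size, two different words can land on the same index; a larger vocabulary size makes such collisions less likely.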


In [7]:
from tensorflow.keras.layers import Embedding
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
import numpy as np

In [8]:
# left-pad every sequence with zeros so that all of them have length sent_length
sent_length=8
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[   0    0    0    0 7597 6419 3355  852]
 [   0    0    0    0 7597 6419 3355 8393]
 [   0    0    0    0 7597 1396 3355 5682]
 [   0    0    0 2100 6217 2314 4512 7194]
 [   0    0    0 2100 6217 2314 4512 1245]
 [   0    0    0 9385 7597 7958 3355 8590]
 [   0    0    0    0 4164  344 3824 4512]]
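What `padding='pre'` does can be sketched in a few lines of plain Python. The helper `pad_pre` below is hypothetical and only mirrors the behaviour visible in the output above: zeros are added on the left until each sequence reaches the requested length. The truncation of longer sequences from the front is an assumption that does not come into play for these sentences.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen` (maxlen must be > 0)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep the last maxlen items if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

docs = [[7597, 6419, 3355, 852], [2100, 6217, 2314, 4512, 7194]]
result = pad_pre(docs, maxlen=8)

print(result)
# [[0, 0, 0, 0, 7597, 6419, 3355, 852], [0, 0, 0, 2100, 6217, 2314, 4512, 7194]]
```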


In [9]:
dim=10

In [10]:
model=Sequential()
# map each of the voc_size word indices to a trainable vector of length dim
model.add(Embedding(voc_size,dim,input_length=sent_length))
model.compile('adam','mse')
model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 8, 10)             100000    
Total params: 100,000
Trainable params: 100,000
Non-trainable params: 0
_________________________________________________________________
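The numbers in the summary can be checked by hand: the embedding layer stores one vector of length `dim` per vocabulary index, and it outputs one such vector for each of the `sent_length` positions in every sentence. A short arithmetic sketch, with the values copied from the cells above:

```python
voc_size = 10000    # number of rows in the embedding matrix
dim = 10            # length of each embedding vector
sent_length = 8     # padded sentence length
num_sentences = 7   # sentences in `sent`

params = voc_size * dim
output_shape = (num_sentences, sent_length, dim)

print(params)        # 100000
print(output_shape)  # (7, 8, 10)
```

The `None` in the summary's output shape stands for the batch dimension, which is 7 when all the sentences are passed at once.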


In [11]:
# the model has not been trained, so these are the randomly initialised embedding vectors
print(model.predict(embedded_docs))

[[[ 1.00296028e-02 -4.02883515e-02 -3.77208106e-02  2.59280205e-05
   -2.83508897e-02  6.68331236e-03  9.52942297e-03  4.72874977e-02
   -1.52490027e-02 -4.92345579e-02]
  [ 1.00296028e-02 -4.02883515e-02 -3.77208106e-02  2.59280205e-05
   -2.83508897e-02  6.68331236e-03  9.52942297e-03  4.72874977e-02
   -1.52490027e-02 -4.92345579e-02]
  [ 1.00296028e-02 -4.02883515e-02 -3.77208106e-02  2.59280205e-05
   -2.83508897e-02  6.68331236e-03  9.52942297e-03  4.72874977e-02
   -1.52490027e-02 -4.92345579e-02]
  [ 1.00296028e-02 -4.02883515e-02 -3.77208106e-02  2.59280205e-05
   -2.83508897e-02  6.68331236e-03  9.52942297e-03  4.72874977e-02
   -1.52490027e-02 -4.92345579e-02]
  [ 2.78482772e-02  4.25400175e-02  1.95365213e-02  2.26448812e-02
   -3.69555354e-02 -3.05938013e-02 -4.47949544e-02  3.90998237e-02
    1.41094364e-02 -3.72097269e-02]
  [-1.88705567e-02  2.44389512e-02  2.40766741e-02  1.58469938e-02
   -4.97491620e-02 -1.49353743e-02  1.14225373e-02  7.02540949e-03
    1.06448419e-

In [12]:
embedded_docs[0]

array([   0,    0,    0,    0, 7597, 6419, 3355,  852])

In [13]:
print(model.predict(embedded_docs)[0])

[[ 1.00296028e-02 -4.02883515e-02 -3.77208106e-02  2.59280205e-05
  -2.83508897e-02  6.68331236e-03  9.52942297e-03  4.72874977e-02
  -1.52490027e-02 -4.92345579e-02]
 [ 1.00296028e-02 -4.02883515e-02 -3.77208106e-02  2.59280205e-05
  -2.83508897e-02  6.68331236e-03  9.52942297e-03  4.72874977e-02
  -1.52490027e-02 -4.92345579e-02]
 [ 1.00296028e-02 -4.02883515e-02 -3.77208106e-02  2.59280205e-05
  -2.83508897e-02  6.68331236e-03  9.52942297e-03  4.72874977e-02
  -1.52490027e-02 -4.92345579e-02]
 [ 1.00296028e-02 -4.02883515e-02 -3.77208106e-02  2.59280205e-05
  -2.83508897e-02  6.68331236e-03  9.52942297e-03  4.72874977e-02
  -1.52490027e-02 -4.92345579e-02]
 [ 2.78482772e-02  4.25400175e-02  1.95365213e-02  2.26448812e-02
  -3.69555354e-02 -3.05938013e-02 -4.47949544e-02  3.90998237e-02
   1.41094364e-02 -3.72097269e-02]
 [-1.88705567e-02  2.44389512e-02  2.40766741e-02  1.58469938e-02
  -4.97491620e-02 -1.49353743e-02  1.14225373e-02  7.02540949e-03
   1.06448419e-02 -3.90231386e-02