# **Text to Vector conversion** (Part 2)

## **Word2vec**
Word2vec is a popular machine learning technique developed at Google and used in natural language processing (NLP) to map words to dense numerical vectors in a high-dimensional space (300 dimensions in the pretrained Google News model used below). It learns semantic relationships by training a shallow, two-layer neural network to predict a word from its surrounding context (or the context from the word), which enables applications like word analogies (e.g., "king - man + woman ≈ queen").

![](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcRUBt6Ov0yCKNtbPAFpIjPV6Mx7y3QorsfSPg&s)

## **Avg Word2Vec**
Avg Word2vec (or Averaging Word Embeddings) is a simple yet effective technique in NLP used to create a single vector representation for a whole sentence, paragraph, or document. 

Suppose a sentence has 7 words. With 300 dimensions per word (as in Google's pretrained word2vec model), the sentence is represented by a 7 × 300 matrix, and its size changes with the sentence length. Training on such inputs is heavy in terms of computational power. Averaging the word vectors element-wise instead gives a single 300-dimensional vector for the whole sentence, whatever its length.
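
The averaging step can be written compactly as a formula. The symbols here are notation introduced for illustration only: $n$ is the number of words kept from the sentence, $\mathbf{v}_{w_i}$ is the 300-dimensional vector of the $i$-th word, and $\mathbf{s}$ is the resulting sentence vector.

$$
\mathbf{s} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{v}_{w_i}, \qquad \mathbf{v}_{w_i} \in \mathbb{R}^{300} \;\Rightarrow\; \mathbf{s} \in \mathbb{R}^{300}
$$

Each component of $\mathbf{s}$ is the mean of the corresponding component across the word vectors, so word order is not preserved.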

### Implementation

In [None]:
# import gensim.downloader as api
# wv = api.load('word2vec-google-news-300')     # download the Google News word2vec model
# wv.save("word2vec.kv")                        # save it as a .kv file so it loads faster next time

The cell above is commented out to avoid accidentally re-running the download.

In [23]:
from gensim.models import KeyedVectors
wv = KeyedVectors.load("word2vec.kv", mmap='r')

In [25]:
wv['Queen']

memmap([-0.22070312, -0.17480469, -0.10498047,  0.2578125 ,  0.16210938,
        -0.13085938, -0.16699219,  0.07373047, -0.07226562,  0.02404785,
        -0.13964844,  0.02197266,  0.17675781, -0.19140625,  0.0378418 ,
        -0.01782227, -0.03710938, -0.03735352,  0.15625   ,  0.08837891,
         0.0534668 , -0.02392578, -0.2734375 , -0.2578125 , -0.00720215,
         0.06933594, -0.21777344, -0.10058594,  0.2421875 ,  0.03417969,
        -0.12890625, -0.1171875 , -0.18261719,  0.04321289, -0.125     ,
        -0.09960938,  0.26367188,  0.375     , -0.32421875, -0.1328125 ,
        -0.13378906, -0.50390625, -0.05908203,  0.04077148,  0.23730469,
        -0.03393555, -0.01495361, -0.09765625, -0.06445312,  0.02087402,
        -0.10302734,  0.10449219,  0.20019531, -0.16503906, -0.01196289,
         0.30859375, -0.41015625, -0.22070312,  0.08056641, -0.12792969,
         0.13085938,  0.28515625, -0.07275391,  0.02612305,  0.01916504,
        -0.16992188,  0.01745605,  0.13085938, -0.1

In [49]:
wv.most_similar('happy')

[('glad', 0.7408890724182129),
 ('pleased', 0.6632170677185059),
 ('ecstatic', 0.6626912355422974),
 ('overjoyed', 0.6599286794662476),
 ('thrilled', 0.6514049172401428),
 ('satisfied', 0.6437949538230896),
 ('proud', 0.636042058467865),
 ('delighted', 0.627237856388092),
 ('disappointed', 0.6269949674606323),
 ('excited', 0.6247665286064148)]
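
The scores returned by `most_similar` are cosine similarities between the query vector and the other vectors in the vocabulary. As a minimal sketch of what that number means, the helper below (`cosine_similarity` is a name chosen here, and the 3-dimensional vectors are made up, not real word2vec values) computes it with NumPy.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors for illustration only (assumed values, not taken from the model)
u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])   # same direction as u, twice the length
w = np.array([3.0, 0.0, -1.0])  # orthogonal to u (dot product is 0)

print(cosine_similarity(u, v))  # very close to 1: length does not matter
print(cosine_similarity(u, w))  # very close to 0
```

With the loaded model, `wv.similarity('happy', 'glad')` should give the same kind of score for a pair of words. Note that a high score means two words occur in similar contexts, which is probably why an antonym like `disappointed` shows up in the list above.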

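Checking the "King - Man + Woman = Queen" concept. Because a raw vector is passed to `most_similar`, the query word `king` itself is not filtered out and ranks first, with `queen` right behind it.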

In [29]:
vec = wv['king'] - wv['man'] + wv['woman']
wv.most_similar(vec)

[('king', 0.8449392318725586),
 ('queen', 0.7300517559051514),
 ('monarch', 0.645466148853302),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676352500916),
 ('prince', 0.5777117609977722),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376775860786438),
 ('Queen_Consort', 0.5344247817993164),
 ('queens', 0.5289887189865112)]
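
Analogy lookups usually skip the query words themselves, so that the answer is the first result. The sketch below illustrates that idea on a hypothetical 2-dimensional vocabulary (the `analogy` helper and the `toy` vectors are assumptions made for this example, not part of gensim or of the real model).

```python
import numpy as np

def analogy(vectors, a, b, c, topn=3):
    """Rank words by cosine similarity to vectors[a] - vectors[b] + vectors[c],
    skipping the three query words."""
    target = vectors[a] - vectors[b] + vectors[c]
    target = target / np.linalg.norm(target)
    scores = []
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the query words themselves are not useful answers
        scores.append((word, float(np.dot(target, vec) / np.linalg.norm(vec))))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

# Hypothetical embeddings: axis 0 ~ "royalty", axis 1 ~ "gender"
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}

print(analogy(toy, "king", "man", "woman", topn=1))  # "queen" ranks first
```

On the real model, gensim offers the same kind of lookup through `wv.most_similar(positive=['king', 'woman'], negative=['man'])`, which leaves the query words out of the results. Its internal normalization differs from this sketch, so the scores will not match exactly.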

Checking other relationships, such as country-capital pairs:

`France - Paris = India - Delhi`\
=> `India = France - Paris + Delhi`

In [48]:
cap_vec = wv['France'] - wv['Paris'] + wv['Delhi']
wv.most_similar(cap_vec)

[('Delhi', 0.7827259302139282),
 ('India', 0.7255375981330872),
 ('Haryana', 0.6402689218521118),
 ('NEW_DELHI', 0.6274688243865967),
 ('Delhi_Oct.##_ANI', 0.6266664266586304),
 ('Delhi_Aug.##_ANI', 0.6259540915489197),
 ('Maharashtra', 0.6175909638404846),
 ('Uttar_Pradesh', 0.609610378742218),
 ('Karnataka', 0.6041072607040405),
 ('Delhi_Sep', 0.6027783751487732)]

## **Implementing Avg Word2Vec**

In [52]:
sentence = "This Man is a doctor and works in India"
sentence = sentence.lower().split()  # lowercase, then tokenize on whitespace
sentence

['this', 'man', 'is', 'a', 'doctor', 'and', 'works', 'in', 'india']

In [54]:
# removing stopwords
sentence = [word for word in sentence if word not in ['this', 'is', 'a', 'and', 'in']]
sentence

['man', 'doctor', 'works', 'india']

In [56]:
import numpy as np

vectors = []

for word in sentence:
    vectors.append(wv[word])

vectors = np.array(vectors)
vectors

array([[ 0.32617188,  0.13085938,  0.03466797, ..., -0.30273438,
        -0.08007812,  0.02099609],
       [-0.09326172,  0.02734375,  0.07958984, ..., -0.02661133,
         0.15429688, -0.08691406],
       [ 0.04418945,  0.10693359,  0.09716797, ...,  0.01300049,
         0.13476562,  0.06689453],
       [-0.234375  , -0.07177734,  0.01055908, ..., -0.09521484,
        -0.11621094, -0.11230469]], dtype=float32)

In [57]:
vectors_avg = np.mean(vectors, axis=0)
vectors_avg

array([ 1.06811523e-02,  4.83398438e-02,  5.54962158e-02,  5.60607910e-02,
       -4.07714844e-02,  8.22143555e-02, -3.62548828e-02, -1.50619507e-01,
       -2.56347656e-02, -9.21039581e-02, -2.70996094e-02, -2.02880859e-01,
       -5.13916016e-02, -9.89990234e-02, -1.33152008e-02,  2.16003418e-01,
       -1.59606934e-02,  1.70349121e-01,  8.05053711e-02, -2.03323364e-02,
       -3.50284576e-02,  2.09541321e-02,  2.44140625e-03, -8.72802734e-02,
       -8.12988281e-02, -4.61425781e-02, -1.91345215e-01,  1.19873047e-01,
        3.35693359e-04, -1.33422852e-01,  4.88281250e-04,  5.75561523e-02,
       -1.41113281e-01, -2.97851562e-02, -9.81445312e-02,  9.02771950e-02,
        3.67431641e-02,  6.46972656e-03,  1.39389038e-01, -3.82690430e-02,
        1.24435425e-02, -2.19726562e-03,  1.76147461e-01,  1.11511230e-01,
        1.28662109e-01, -6.81152344e-02, -1.20849609e-01, -2.74658203e-02,
       -6.39038086e-02, -3.24707031e-02, -1.27136230e-01,  8.64257812e-02,
        6.23168945e-02, -

In [59]:
print(f"Normal word2vec: {vectors.shape}\nAverage word2vec: {vectors_avg.shape}")

Normal word2vec: (4, 300)
Average word2vec: (300,)
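
The steps above can be wrapped in one reusable function. This is a minimal sketch under a few assumptions: `average_word2vec` is a name chosen here, words missing from the vocabulary are skipped (indexing `wv` with an unknown word raises a `KeyError`), and a zero vector is returned when no word is found. A plain dict stands in for the model so the example runs without the large download.

```python
import numpy as np

def average_word2vec(tokens, wv, dim):
    """Mean of the vectors of all tokens found in wv; a zero vector if none are found."""
    found = [wv[token] for token in tokens if token in wv]
    if not found:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(found, axis=0)

# Toy stand-in for the model: a dict supports the same `in` and `[]` lookups
toy_wv = {
    "man":    np.array([1.0, 0.0, 2.0], dtype=np.float32),
    "doctor": np.array([3.0, 4.0, 0.0], dtype=np.float32),
}

# "xyzzy" is not in the toy vocabulary and is skipped
sentence_vec = average_word2vec(["man", "doctor", "xyzzy"], toy_wv, dim=3)
print(sentence_vec)        # [2. 2. 1.]
print(sentence_vec.shape)  # (3,)
```

With the real model the call would look like `average_word2vec(sentence, wv, dim=wv.vector_size)`. Skipping unknown words is only one possible policy; a zero vector or a dedicated "unknown" vector are alternatives.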
