# WORD2VEC

Word2Vec is a popular technique used in Natural Language Processing (NLP) to represent words as vectors in a continuous vector space. It is based on the idea that words appearing in similar contexts have similar meanings. Word2Vec models are trained using two main architectures:

### CBOW (Continuous Bag of Words)

CBOW predicts the target word from its surrounding context words. It is several times faster to train than Skip-gram and tends to be slightly more accurate for frequent words.

### Skip-gram

Skip-gram predicts the surrounding context words from a given target word. It is slower to train, but it works well with small amounts of training data and represents rare words well.
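
To make the difference between the two architectures concrete, here is a minimal sketch of how training examples could be derived from one sentence. The helper names (`context_windows`, `cbow_examples`, `skipgram_examples`) and the fixed window size are assumptions made for illustration. Real Word2Vec training adds further details, such as negative sampling, that are left out here.

```python
def context_windows(tokens, window=2):
    """Yield (target, context_words) for every position in the sentence."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield target, left + right


def cbow_examples(tokens, window=2):
    # CBOW: all context words together are the input, the target is the label.
    return [(context, target) for target, context in context_windows(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-gram: the target is the input, each context word is a separate label.
    return [
        (target, context_word)
        for target, context in context_windows(tokens, window)
        for context_word in context
    ]


sentence = "the king rules the land".split()
print(cbow_examples(sentence, window=1)[1])       # (['the', 'rules'], 'king')
print(skipgram_examples(sentence, window=1)[:3])  # [('the', 'king'), ('king', 'the'), ('king', 'rules')]
```

Skip-gram produces one example per (target, context word) pair, so it sees many more updates per sentence than CBOW, which is one reason it trains more slowly.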

### Uses of Gensim

Gensim is a Python library widely used for NLP tasks. Some of its uses include:

- Training Word2Vec models (CBOW and Skip-gram).
- Topic modeling using algorithms like LDA (Latent Dirichlet Allocation).
- Document similarity analysis.
- Text preprocessing and vectorization.


In [32]:
%pip install gensim

Note: you may need to restart the kernel to use updated packages.


In [38]:
import gensim

In [39]:
from gensim.models import Word2Vec, KeyedVectors

In [40]:
import gensim.downloader as api
wv = api.load('word2vec-google-news-300')
vec_king = wv['king']

MemoryError: Unable to allocate 3.35 GiB for an array with shape (3000000, 300) and data type float32
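
The size in the error message follows directly from the array shape: 3,000,000 vectors × 300 dimensions × 4 bytes per float32 value. One possible workaround, sketched below, is to read only part of the file. `embedding_size_gib` and `load_limited_vectors` are hypothetical helpers introduced for this sketch, and the default of 500,000 vectors is an arbitrary assumption. The sketch relies on the `return_path` argument of `api.load` and the `limit` argument of `KeyedVectors.load_word2vec_format`.

```python
def embedding_size_gib(n_words, dim, bytes_per_value=4):
    """Memory needed for an (n_words, dim) array, in GiB (float32 by default)."""
    return n_words * dim * bytes_per_value / 2**30


def load_limited_vectors(limit=500_000):
    """Load only the first `limit` vectors of the Google News model (hypothetical helper)."""
    import gensim.downloader as api
    from gensim.models import KeyedVectors

    # return_path=True downloads the file if needed and returns its location
    # instead of loading the full model into memory.
    path = api.load('word2vec-google-news-300', return_path=True)
    return KeyedVectors.load_word2vec_format(path, binary=True, limit=limit)


print(round(embedding_size_gib(3_000_000, 300), 2))  # 3.35, matching the error message
print(round(embedding_size_gib(500_000, 300), 2))    # 0.56
```

Calling `wv = load_limited_vectors()` would then stand in for the `api.load(...)` call, at the cost of a smaller vocabulary.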

In [41]:
vec_king

array([ 1.25976562e-01,  2.97851562e-02,  8.60595703e-03,  1.39648438e-01,
       -2.56347656e-02, -3.61328125e-02,  1.11816406e-01, -1.98242188e-01,
        5.12695312e-02,  3.63281250e-01, -2.42187500e-01, -3.02734375e-01,
       -1.77734375e-01, -2.49023438e-02, -1.67968750e-01, -1.69921875e-01,
        3.46679688e-02,  5.21850586e-03,  4.63867188e-02,  1.28906250e-01,
        1.36718750e-01,  1.12792969e-01,  5.95703125e-02,  1.36718750e-01,
        1.01074219e-01, -1.76757812e-01, -2.51953125e-01,  5.98144531e-02,
        3.41796875e-01, -3.11279297e-02,  1.04492188e-01,  6.17675781e-02,
        1.24511719e-01,  4.00390625e-01, -3.22265625e-01,  8.39843750e-02,
        3.90625000e-02,  5.85937500e-03,  7.03125000e-02,  1.72851562e-01,
        1.38671875e-01, -2.31445312e-01,  2.83203125e-01,  1.42578125e-01,
        3.41796875e-01, -2.39257812e-02, -1.09863281e-01,  3.32031250e-02,
       -5.46875000e-02,  1.53198242e-02, -1.62109375e-01,  1.58203125e-01,
       -2.59765625e-01,  

In [42]:
vec_king.shape

(300,)

In [43]:
type(vec_king)

numpy.ndarray

In [44]:
import numpy as np

max_position = np.argmax(vec_king)
max_position

61

In [45]:
print(wv.similarity('king', 'queen'), wv.similarity('king', 'apple'))

0.6510957 0.10826096


In [46]:
# The higher the score, the more similar the two words
wv.similarity('mango', 'apple')

0.57518554
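
The scores returned by `wv.similarity` are cosine similarities between the two word vectors. The following sketch shows the computation on small hand-picked vectors; the function name `cosine_similarity` and the example vectors are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # close to 1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])      # 0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])       # -1
print(same_direction, orthogonal, opposite)
```

Because only the angle matters, scaling a vector does not change its score.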

In [58]:
q = wv['king']- wv['man'] + wv['woman']

In [59]:
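# q is a vector, not a vocabulary key, so compute the cosine similarity directly
np.dot(q, wv['queen']) / (np.linalg.norm(q) * np.linalg.norm(wv['queen']))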

0.73005176

In [60]:
# Search by the vector q itself; the string 'q' would be looked up as a word
wv.similar_by_vector(q, topn=10)

[('king', 0.8449392318725586),
 ('queen', 0.7300517559051514),
 ('monarch', 0.645466148853302),
 ('princess', 0.6156251430511475),
 ('crown_prince', 0.5818676352500916),
 ('prince', 0.577711820602417),
 ('kings', 0.5613663792610168),
 ('sultan', 0.5376775860786438),
 ('Queen_Consort', 0.5344247817993164),
 ('queens', 0.5289887189865112)]
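
In the list above, 'king' ranks first because the query vector stays close to the word it started from. A common remedy is to exclude the input words from the candidates, which Gensim does when the analogy is written as `wv.most_similar(positive=['king', 'woman'], negative=['man'])`. The sketch below illustrates the idea on hand-made 3-dimensional vectors. The vectors, the `analogy` helper, and the plain sum of raw vectors are all assumptions for illustration and do not reproduce Gensim's exact weighting.

```python
import math

# Hand-made vectors with axes (royalty, gender, fruit); purely illustrative values.
toy_vectors = {
    "king":     [0.9,  0.8, 0.0],
    "queen":    [0.9, -0.8, 0.0],
    "man":      [0.1,  0.8, 0.0],
    "woman":    [0.1, -0.8, 0.0],
    "princess": [0.7, -0.8, 0.0],
    "apple":    [0.0,  0.0, 1.0],
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(positive, negative, vectors, topn=3):
    """Rank words by similarity to sum(positive) - sum(negative), skipping the inputs."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for word in positive:
        query = [q + x for q, x in zip(query, vectors[word])]
    for word in negative:
        query = [q - x for q, x in zip(query, vectors[word])]
    excluded = set(positive) | set(negative)
    scored = [(w, cosine(query, v)) for w, v in vectors.items() if w not in excluded]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


result = analogy(positive=["king", "woman"], negative=["man"], vectors=toy_vectors)
print(result)  # 'queen' ranks first; 'king', 'man' and 'woman' are excluded
```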

In [None]:
wv.most_similar('cricket')

[('cricketing', 0.8372225761413574),
 ('cricketers', 0.8165745735168457),
 ('Test_cricket', 0.8094819188117981),
 ('Twenty##_cricket', 0.8068488240242004),
 ('Twenty##', 0.7624265551567078),
 ('Cricket', 0.75413978099823),
 ('cricketer', 0.7372578382492065),
 ('twenty##', 0.7316356897354126),
 ('T##_cricket', 0.7304614186286926),
 ('West_Indies_cricket', 0.6987985968589783)]

In [61]:
wv.most_similar('hockey', topn=20)

[('Hockey', 0.7227486968040466),
 ('Ice_Hockey', 0.6408185362815857),
 ('lacrosse', 0.6390798091888428),
 ('peewee_hockey', 0.6332175731658936),
 ('soccer', 0.6270937919616699),
 ('Hockey_League', 0.6250644326210022),
 ('pee_wee_hockey', 0.6238211393356323),
 ('basketball', 0.6131464242935181),
 ('midget_hockey', 0.6043297052383423),
 ('NHL', 0.5965192914009094),
 ('rink', 0.5942675471305847),
 ('TheHockeyNews.com', 0.5942544937133789),
 ('Edmonton_Chimos', 0.5888263583183289),
 ('inline_hockey', 0.5887413620948792),
 ('hockey_rink', 0.5869693756103516),
 ('NWHL', 0.5857946872711182),
 ('Junior_Hockey', 0.583908200263977),
 ('skating', 0.5832276940345764),
 ('sledge_hockey', 0.5821781754493713),
 ('NHLer', 0.5803652405738831)]

In [62]:
wv.similarity('hockey', 'sport')

0.47289255