Open Kaggle notebook for data: https://www.kaggle.com/code/sharooqfarzeenak/nlp-vectorization-word2vec

**Word2vec** is a technique in natural language processing (NLP) for obtaining vector representations of words. These vectors capture information about the meaning of the word based on the surrounding words. The word2vec algorithm estimates these representations by modeling text in a large corpus.
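
To make "based on the surrounding words" concrete, here is a minimal sketch of how a skip-gram style model (one of the word2vec architectures) could turn a sentence into (center word, context word) training pairs. The function name `context_pairs`, the `window` parameter, and the toy sentence are illustrative assumptions, not part of gensim or of this notebook's data.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)                 # left edge of the context window
        hi = min(len(tokens), i + window + 1)   # right edge (exclusive)
        for j in range(lo, hi):
            if j != i:                          # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical toy sentence, used only for illustration
sentence = "the quick brown fox jumps".split()
pairs = context_pairs(sentence, window=1)
print(pairs)
```

Words that keep appearing in similar contexts produce similar training pairs, which is roughly why their learned vectors end up close together.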

The goal of this notebook is to get familiar with the vector representations of words in the 'word2vec-google-news-300' model, a set of pre-trained vectors trained on a portion of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for about 3 million words and phrases.

In [1]:
import gensim
from gensim.models import Word2Vec, KeyedVectors

In [2]:
# Load the pre-trained word2vec vectors (binary format) as a KeyedVectors object
word2vec_path = '/kaggle/input/google-word2vec/GoogleNews-vectors-negative300.bin'
word2vec = gensim.models.KeyedVectors.load_word2vec_format(word2vec_path, binary=True)

In [4]:
type(word2vec)

gensim.models.keyedvectors.KeyedVectors

In [5]:
w2v = word2vec

In [45]:
# Look up the 300-dimensional vector for a word
example = w2v['example']
example.shape

(300,)

In [42]:
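# Find the words most similar to 'example'.
# Querying with a raw vector does not exclude the word itself, so 'example' comes back first with similarity 1.0.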
w2v.most_similar([w2v['example']])

[('example', 1.0),
 ('instance', 0.7873880863189697),
 ('examples', 0.6041036248207092),
 ('illustration', 0.5342040657997131),
 ('exemplar', 0.49639376997947693),
 ('shining_example', 0.4890441298484802),
 ('reason', 0.4628060758113861),
 ('Example', 0.44629594683647156),
 ('counterexample', 0.4366326332092285),
 ('analogy', 0.4335395395755768)]
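
The scores returned by `most_similar` are cosine similarities between the query vector and each word vector. As a rough illustration of that ranking, the sketch below computes cosine similarity by hand on a few made-up 3-dimensional vectors. The `toy` dictionary and its values are invented for this example and are not taken from the Google News model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors (assumption: real word2vec vectors have 300 dimensions)
toy = {
    "example": [1.0, 0.0, 1.0],
    "instance": [0.9, 0.1, 0.8],
    "banana": [0.0, 1.0, 0.0],
}

query = toy["example"]
# Rank every word, including the query word itself, by similarity to the query
ranked = sorted(
    ((word, cosine_similarity(query, vec)) for word, vec in toy.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for word, score in ranked:
    print(word, round(score, 4))
```

Because the query word is compared with its own vector, it ranks first with a similarity of (approximately) 1, just like `('example', 1.0)` in the output above.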

# 'x' is to 'y' what 'z' is to?

In [50]:
# king is to man what queen is to?
w2v.most_similar([w2v['man'] - w2v['king'] + w2v['queen']])

[('woman', 0.7186801433563232),
 ('man', 0.6557512283325195),
 ('girl', 0.5882835388183594),
 ('lady', 0.5754351615905762),
 ('teenage_girl', 0.5700528025627136),
 ('teenager', 0.5378326177597046),
 ('schoolgirl', 0.497780978679657),
 ('policewoman', 0.49065014719963074),
 ('blonde', 0.4870774745941162),
 ('redhead', 0.4778464436531067)]

In [48]:
# newborn is to adult what puppy is to?
w2v.most_similar([w2v['adult'] - w2v['newborn'] + w2v['puppy']])

[('puppy', 0.6128201484680176),
 ('adult', 0.6070362329483032),
 ('dog', 0.5691561698913574),
 ('dogs', 0.5043929815292358),
 ('beagle', 0.49217966198921204),
 ('poodle', 0.4801988899707794),
 ('puppies', 0.4798838794231415),
 ('Puppy', 0.47203266620635986),
 ('pitbull', 0.467395544052124),
 ('basset_hound', 0.46064868569374084)]

In [64]:
# India is to Asia what Germany is to?
w2v.most_similar([w2v['asia'] - w2v['india'] + w2v['germany']])

[('asia', 0.751755952835083),
 ('germany', 0.6635499000549316),
 ('europe', 0.511908233165741),
 ('austria', 0.4750858247280121),
 ('sweden', 0.4703856110572815),
 ('korea', 0.4643610417842865),
 ('russia', 0.4627966284751892),
 ('european', 0.45522257685661316),
 ('east_asia', 0.4503966271877289),
 ('spain', 0.44895800948143005)]
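
In the three analogy results above, input words such as 'man', 'puppy' and 'asia' show up near the top, because a query built from raw vectors is compared against every word, including the ones used to build it. The sketch below shows the same vector arithmetic on invented toy vectors, with the input words filtered out of the ranking. The helper `analogy`, the `toy` dictionary and its three made-up dimensions are assumptions for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(vectors, a, b, c, topn=3):
    """'a is to b what c is to ?': rank words by similarity to b - a + c,
    skipping the three input words."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [
        (word, cosine_similarity(target, vec))
        for word, vec in vectors.items()
        if word not in (a, b, c)            # exclude the query words themselves
    ]
    return sorted(candidates, key=lambda pair: pair[1], reverse=True)[:topn]


# Made-up toy vectors with hypothetical dimensions (royalty, male, female)
toy = {
    "king": [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man": [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "child": [0.0, 0.5, 0.5],
}

# king is to man what queen is to?
result = analogy(toy, "king", "man", "queen")
print(result)
```

With the real model, gensim's `most_similar` also accepts lists of keys, for example `w2v.most_similar(positive=['man', 'queen'], negative=['king'])`, and in that form the input words are left out of the results. The scores may differ somewhat from the raw-vector queries above, since that form combines normalized vectors.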