<a href="https://colab.research.google.com/github/rahiakela/nlp-research-and-practice/blob/main/practical-natural-language-processing/3-text-representation/5_using_pre_trained_word2vec_model.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Word Embeddings

What does it mean when we say a text representation should capture “distributional similarities between words”?

Let’s consider some examples. If we’re given the word “USA,” distributionally similar words could be other countries (e.g., Canada, Germany, India, etc.) or cities in the USA. If we’re given the word “beautiful,” words that share some relationship with this word (e.g., synonyms, antonyms) could be considered distributionally similar words. These are words that are likely to occur in similar contexts.

In 2013, a seminal work by Mikolov et al. showed that their neural network–based word representation model known as “Word2vec,” based on “distributional similarity,” can capture word analogy relationships such as:

`King – Man + Woman ≈ Queen`
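
The analogy can be read as vector arithmetic followed by a nearest-neighbor search. The sketch below uses hand-picked three-dimensional toy vectors (hypothetical values chosen for illustration, not taken from Word2vec) and assumes cosine similarity as the closeness measure; `toy_vectors`, `cosine`, and `solve_analogy` are names made up for this example.

```python
import math

# Hand-picked toy vectors (hypothetical, NOT real Word2vec output).
# The dimensions loosely mean: [royalty, male, female].
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def solve_analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding a, b and c."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


answer = solve_analogy("king", "man", "woman", toy_vectors)
print(answer)  # queen
```

With a loaded gensim model, a comparable query would be `most_similar(positive=['king', 'woman'], negative=['man'])`; the toy version only shows the arithmetic behind it.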

While learning such semantically rich relationships, Word2vec ensures that the learned word representations are low dimensional (vectors with 50–500 dimensions, instead of several thousand) and dense (that is, most values in these vectors are non-zero).

Such representations make ML tasks more tractable and efficient. Word2vec led to a lot of work (both pure and applied) in the direction of learning text representations using neural networks. These representations are also called “embeddings.”

To “derive” the meaning of a word, Word2vec relies on the distributional hypothesis. That is, it derives the meaning of a word from its context: the words that appear in its neighborhood in the text. So, if two different words often occur in similar contexts, then it’s highly likely that their meanings are also similar.
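
The idea that context reveals meaning can be illustrated without any neural network. The sketch below is a simple count-based illustration, not Word2vec's training procedure: it collects the neighbors of each word in a tiny made-up corpus and compares the resulting count vectors. The corpus, the window size of 2, and the helper names are all assumptions for this example.

```python
import math
from collections import Counter, defaultdict

# Tiny made-up corpus (hypothetical); real models train on far more text.
corpus = [
    "i drink hot coffee every morning",
    "i drink hot tea every morning",
    "the dog chased the ball",
]


def context_counts(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def cosine(c1, c2):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(c1[k] * c2[k] for k in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)


contexts = context_counts(corpus)
sim_coffee_tea = cosine(contexts["coffee"], contexts["tea"])
sim_coffee_dog = cosine(contexts["coffee"], contexts["dog"])
print(sim_coffee_tea > sim_coffee_dog)  # True
```

Here “coffee” and “tea” share every neighbor, while “coffee” and “dog” share none, so the first pair is judged more similar.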

Word2vec operationalizes this by projecting the meaning of the words in a vector space where words with similar meanings will tend to cluster together, and words with very different meanings are far from one another.

Conceptually, Word2vec takes a large corpus of text as input and “learns” to represent the words in a common vector space based on the contexts in which they appear in the corpus.

## Pre-trained word embeddings

Some of the most popular pre-trained embeddings are Word2vec by Google, GloVe by Stanford, and fastText by Facebook, to name a few. Further, they’re available in various dimensions, such as d = 25, 50, 100, 200, 300, 600.

Let us take a pre-trained Word2vec model as an example and see how we can use it to look for the most similar words. We will use the Google News vector embeddings. https://drive.google.com/file/d/0B7XkCwpI5KDYNlNUTTlSS21pQmM

In [19]:
#This module ignores the various types of warnings generated
import warnings
warnings.filterwarnings("ignore")

#This module provides a way of using operating system dependent functionality
import os

from gensim.models import Word2Vec, KeyedVectors

#This module helps in retrieving information on running processes and system resource utilization
import psutil
process = psutil.Process(os.getpid())
from psutil import virtual_memory
mem = virtual_memory()

#This module is used to calculate the time
import time

In [5]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [6]:
# /content/gdrive/MyDrive/kaggle-keys is the Google Drive folder where kaggle.json is present
os.environ['KAGGLE_CONFIG_DIR'] = "/content/gdrive/MyDrive/kaggle-keys"

In [18]:
%%shell

# Download the dataset from Kaggle. URL: https://www.kaggle.com/datasets/leadbest/googlenewsvectorsnegative300
kaggle datasets download -d leadbest/googlenewsvectorsnegative300

unzip -qq googlenewsvectorsnegative300.zip
rm -rf googlenewsvectorsnegative300.zip

Downloading googlenewsvectorsnegative300.zip to /content
100% 3.17G/3.17G [00:38<00:00, 106MB/s] 
100% 3.17G/3.17G [00:38<00:00, 88.8MB/s]




In [21]:
pretrained_path = 'GoogleNews-vectors-negative300.bin.gz'

#Load W2V model. This will take some time, but it is a one time effort!
pre = process.memory_info().rss
print("Memory used in GB before Loading the Model: %0.2f"%float(pre/(10**9))) #Check memory usage before loading the model
print('-'*10)

start_time = time.time() #Start the timer
ttl = mem.total #Total memory available

#load the model
w2v_model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True)
print("%0.2f seconds taken to load"%float(time.time() - start_time)) #Calculate the total time elapsed since starting the timer
print('-'*10)

print('Finished loading Word2Vec')
print('-'*10)

post = process.memory_info().rss
print("Memory used in GB after Loading the Model: {:.2f}".format(float(post/(10**9)))) #Calculate the memory used after loading the model
print('-'*10)

print("Percentage increase in memory usage: {:.2f}% ".format(float(((post - pre)/pre)*100))) #Percentage increase in memory after loading the model
print('-'*10)

print("Number of words in vocabulary: ",len(w2v_model.key_to_index)) #Number of words in the vocabulary.

Memory used in GB before Loading the Model: 4.30
----------
65.45 seconds taken to load
----------
Finished loading Word2Vec
----------
Memory used in GB after Loading the Model: 8.46
----------
Percentage increase in memory usage: 96.95% 
----------
Number of words in vocabulary:  3000000


Let us examine the model by looking up the words most similar to a given word. In the cells below, we first find the words that are semantically most similar to “beautiful” (and then to “toronto”), and finally retrieve the embedding vector of the word “beautiful”.

In [22]:
w2v_model.most_similar('beautiful')

[('gorgeous', 0.8353005051612854),
 ('lovely', 0.8106936812400818),
 ('stunningly_beautiful', 0.7329413294792175),
 ('breathtakingly_beautiful', 0.7231340408325195),
 ('wonderful', 0.6854086518287659),
 ('fabulous', 0.6700063943862915),
 ('loveliest', 0.6612576246261597),
 ('prettiest', 0.6595001816749573),
 ('beatiful', 0.6593326330184937),
 ('magnificent', 0.6591402888298035)]

In [23]:
# Let us try with another word!
w2v_model.most_similar('toronto')

[('montreal', 0.6984112858772278),
 ('vancouver', 0.6587257385253906),
 ('nyc', 0.6248832941055298),
 ('alberta', 0.6179691553115845),
 ('boston', 0.611499547958374),
 ('calgary', 0.61032634973526),
 ('edmonton', 0.6100260615348816),
 ('canadian', 0.5944076776504517),
 ('chicago', 0.5911980271339417),
 ('springfield', 0.5888351798057556)]

In [24]:
# What is the vector representation for a word?
w2v_model['beautiful']

array([-0.01831055,  0.05566406, -0.01153564,  0.07275391,  0.15136719,
       -0.06176758,  0.20605469, -0.15332031, -0.05908203,  0.22851562,
       -0.06445312, -0.22851562, -0.09472656, -0.03344727,  0.24707031,
        0.05541992, -0.00921631,  0.1328125 , -0.15429688,  0.08105469,
       -0.07373047,  0.24316406,  0.12353516, -0.09277344,  0.08203125,
        0.06494141,  0.15722656,  0.11279297, -0.0612793 , -0.296875  ,
       -0.13378906,  0.234375  ,  0.09765625,  0.17773438,  0.06689453,
       -0.27539062,  0.06445312, -0.13867188, -0.08886719,  0.171875  ,
        0.07861328, -0.10058594,  0.23925781,  0.03808594,  0.18652344,
       -0.11279297,  0.22558594,  0.10986328, -0.11865234,  0.02026367,
        0.11376953,  0.09570312,  0.29492188,  0.08251953, -0.05444336,
       -0.0090332 , -0.0625    , -0.17578125, -0.08154297,  0.01062012,
       -0.04736328, -0.08544922, -0.19042969, -0.30273438,  0.07617188,
        0.125     , -0.05932617,  0.03833008, -0.03564453,  0.24

Note that if we search for a word that is not present in the Word2vec model, we’ll see a “key not found” error (a `KeyError`).

Hence, as a good coding practice, always check whether the word is present in the model’s vocabulary before attempting to retrieve its vector.
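
A minimal sketch of such a check, assuming the model exposes a `key_to_index` mapping as `w2v_model` does above. `ToyVectors` and `get_vector_or_none` are hypothetical names introduced only so the example runs without downloading the real model.

```python
class ToyVectors:
    """Minimal stand-in for a loaded model (hypothetical, for illustration only)."""

    def __init__(self, table):
        self._table = table
        # Maps each word to its row index, like the model's key_to_index.
        self.key_to_index = {word: i for i, word in enumerate(table)}

    def __getitem__(self, word):
        return self._table[word]  # raises KeyError for unknown words


def get_vector_or_none(model, word):
    """Return the vector for `word`, or None when it is out of vocabulary."""
    if word in model.key_to_index:
        return model[word]
    return None


toy_model = ToyVectors({"beautiful": [0.1, 0.2], "toronto": [0.3, 0.4]})
print(get_vector_or_none(toy_model, "beautiful"))     # [0.1, 0.2]
print(get_vector_or_none(toy_model, "practicalnlp"))  # None
```

Returning `None` is one possible convention; callers could equally skip the word or substitute a default vector.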

In [25]:
# What if I am looking for a word that is not in this vocabulary?
w2v_model.most_similar('practicalnlp')

KeyError: ignored

In [26]:
w2v_model['practicalnlp']

KeyError: ignored

In [27]:
w2v_model.most_similar('Ryaan')

KeyError: ignored

Two things to note while using pre-trained models:

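1. Token lookup is exact and case-sensitive. Some pre-trained models store only lowercased tokens, while others keep the original casing, so normalize queries to match the model. If a word is not in the vocabulary, the model throws an exception.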
2. So, it is always a good idea to encapsulate those statements in try/except blocks.
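
One way to wrap a lookup in try/except is sketched below. It retries with the lowercased form, since casing is a common reason for a miss. The helper name `lookup_with_fallback` is hypothetical, and a plain dict stands in for the model because it raises `KeyError` on a missing key in the same way.

```python
def lookup_with_fallback(vectors, word):
    """Try `word` as given, then its lowercased form; return (key, vector) or None."""
    for candidate in (word, word.lower()):
        try:
            return candidate, vectors[candidate]
        except KeyError:
            continue  # not in the vocabulary, try the next spelling
    return None


# A plain dict stands in for the model here (hypothetical toy values).
toy_vectors = {"toronto": [0.3, 0.4]}
print(lookup_with_fallback(toy_vectors, "Toronto"))       # ('toronto', [0.3, 0.4])
print(lookup_with_fallback(toy_vectors, "practicalnlp"))  # None
```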

## Getting the embedding representation for full text

We have seen how to get embedding vectors for single words. How do we use them to get such a representation for a full text? A simple way is to just sum or average the embeddings for individual words.
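
The averaging approach can be sketched in a few lines. The example below uses a toy dictionary of two-dimensional vectors (hypothetical values) in place of a real model and skips tokens that are not in the vocabulary, which is one possible way to handle them. The function name `average_embedding` and its `dim` parameter are assumptions made for this illustration.

```python
def average_embedding(tokens, vectors, dim):
    """Average the vectors of in-vocabulary tokens; all zeros if none are known."""
    total = [0.0] * dim
    known = 0
    for token in tokens:
        if token in vectors:
            for i, value in enumerate(vectors[token]):
                total[i] += value
            known += 1
    if known == 0:
        return total
    return [value / known for value in total]


# Toy vocabulary (hypothetical values); "is" and "a" are deliberately missing.
toy_vectors = {"canada": [1.0, 0.0], "large": [0.0, 1.0], "country": [1.0, 1.0]}
tokens = "canada is a large country".split()
sentence_vector = average_embedding(tokens, toy_vectors, dim=2)
print(sentence_vector)
```

Only the three known words contribute to the average here. Summing instead of averaging would simply drop the final division.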

Let us see a small example using another NLP library, spaCy.

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm

In [29]:
import spacy

# Load the spaCy model installed in the previous cell (the same one used in Chapter 2). This takes a few seconds.
%time nlp = spacy.load('en_core_web_sm')

CPU times: user 676 ms, sys: 57.6 ms, total: 733 ms
Wall time: 861 ms


In [30]:
# process a sentence using the model
mydoc = nlp("Canada is a large country")

In [31]:
# Get a vector for individual words
print(mydoc[0].vector)  # vector for 'Canada', the first word in the text

[-1.74234    -0.90920454  0.41536316  0.15736246  1.2859436   0.24543142
  1.2570572   0.35663185 -0.8244102  -0.0674134   1.4712349   0.5119143
 -1.3309681  -0.5264146  -1.0188745  -0.8524463   1.2472408   0.2747297
 -0.0436547  -0.4842371  -1.2904495   0.42295414 -0.03794765 -0.22511679
 -0.4816206   0.36949652  1.2843533   1.4024066  -0.6087295   0.7147388
 -0.14381114 -0.9796721   0.452798    0.7162336  -0.5708136  -0.08537036
 -0.63481605  0.9896861  -0.474687    3.4676626  -0.9343261   0.29444414
 -0.02503309  1.285727   -1.7670362   0.39907005 -0.03138383  2.235859
  1.233593   -0.06988642 -0.48538476  1.0872145  -0.8912538  -1.4635974
 -0.76645774 -0.4039675   0.86213416 -0.55711997  0.77631915 -0.13158414
 -0.3540035  -0.22625872  0.38927513 -0.54100454  0.40940216 -0.5324899
 -0.55475163 -0.6075223   0.3275603  -1.6374564   0.7500537  -0.6747781
  1.2150496  -0.35457557 -0.85388327 -0.69132215 -0.6772988  -1.405904
 -0.5053379  -0.21676248 -0.219181    0.7379973  -0.24607135 

In [32]:
# Averaged vector for the entire sentence
print(mydoc.vector)

[-2.44864374e-01 -1.56845257e-01 -5.19747622e-02  5.86494267e-01
  8.10811967e-02 -1.65754989e-01  7.57052720e-01  2.63185889e-01
  1.40734492e-02  2.51211464e-01  2.43307427e-01 -2.79111534e-01
 -3.70179832e-01  5.22314429e-01 -5.23915410e-01  4.84695425e-03
  4.30857569e-01 -2.19760254e-01 -3.72532457e-01  1.71566337e-01
 -2.67529279e-01  2.24802848e-02 -3.03287357e-01 -1.04288436e-01
  1.51315406e-01 -5.31261384e-01  4.36048269e-01  2.97305524e-01
  4.72418487e-01  3.90211403e-01  2.69951403e-01  2.36672014e-01
  4.59462464e-01 -4.97865111e-01 -1.82451054e-01 -1.67997599e-01
  1.93978697e-01  5.16766071e-01 -2.88335413e-01  3.74710053e-01
 -1.11499667e-01  3.33659947e-01  5.49611822e-02  2.53970414e-01
 -5.02043903e-01  3.85194987e-01 -1.86397389e-01  8.60191345e-01
  2.11835742e-01 -1.24764726e-01 -7.09948778e-01  7.70933092e-01
 -1.79754198e-01 -6.63751960e-01 -4.01271343e-01  1.83464423e-01
 -2.96254933e-01  7.63848484e-01 -3.35624158e-01 -1.81755573e-01
  5.62856086e-02 -5.20981

In [33]:
# What happens when I give a sentence with strange words (and stop words), and try to get its word vector in spaCy?
temp = nlp("practicalnlp is a newword")
temp[0].vector

array([-0.7808644 , -0.13927388,  1.1979539 ,  0.02954794,  0.10888022,
       -0.08408853,  1.0671581 ,  1.0224844 , -0.21108732, -0.873439  ,
        1.2589307 , -0.03803551, -0.5621327 , -0.68604934, -0.9219377 ,
       -0.34106416,  0.41339433, -0.55588746,  0.01959878,  0.9607241 ,
       -0.7844026 , -1.2117573 ,  0.10303535, -0.35093412, -1.3106965 ,
        0.82981586,  0.53083956,  0.7383765 ,  0.29475784,  0.32589513,
        0.12443371, -1.0673028 ,  0.62938726, -0.62013155,  0.33982378,
       -0.74396217, -0.37885252,  0.27242422, -0.8541051 ,  1.5684996 ,
       -1.3312532 ,  0.22952707, -0.1200439 ,  0.77720463, -0.79096997,
        0.82063174,  0.4348258 ,  0.50177073,  1.6081035 , -0.23500574,
       -0.66881514,  0.604759  , -0.36700648, -0.46789208, -0.05904178,
       -0.12296978,  0.5446122 , -0.21588881, -0.57844615, -0.32954615,
       -0.19385432,  0.09608626,  0.08408612, -0.18028721,  0.54459274,
        0.20719814,  0.07226813, -0.14609273,  1.1348839 , -1.18

In [34]:
temp = nlp("Ryaan is a king")
temp[0].vector

array([-1.49338603e+00, -1.76442647e+00,  6.31928861e-01,  4.76052493e-01,
       -2.92037308e-01,  9.35610294e-01,  1.47564816e+00,  4.49107736e-01,
       -1.45390689e+00, -9.73136008e-01,  1.00833142e+00,  3.00820947e-01,
       -1.19312632e+00, -6.35515332e-01, -1.21202350e+00, -2.44572610e-01,
        9.66186762e-01, -4.70576018e-01, -4.17391241e-01,  1.19598836e-01,
       -1.52088606e+00,  3.90184253e-01, -2.57461518e-02, -5.72054982e-01,
        5.73398411e-01, -2.02877939e-01,  1.43802655e+00,  1.48042428e+00,
       -4.09596801e-01,  5.95115066e-01, -2.40366697e-01, -8.03723097e-01,
        4.20367390e-01, -3.01447839e-01, -1.63274455e+00,  1.11928654e+00,
       -1.14143562e+00,  6.56367779e-01,  2.25021333e-01,  2.86314917e+00,
       -1.07065856e+00, -5.20799041e-01, -3.20751011e-01,  1.17073393e+00,
       -1.43994677e+00,  3.85589719e-01, -3.21990907e-01,  2.65340900e+00,
        1.14150953e+00,  9.87993479e-02, -2.21533269e-01,  1.54868090e+00,
       -3.14203709e-01, -