# <center> <font size = 24 color = 'steelblue'> <b> Pre-trained word embedding model from gensim

<div class="alert alert-block alert-info">
    
<font size = 4>
    
**This notebook demonstrates representation of text using pre-trained word embedding models.**

# <a id= 'w0'>
<font size = 4>
    
**Table of contents:**<br>
[1. Installation and import of necessary packages](#w1)<br>
[2. Model implementation](#w2)<br>
[3. Load the embedding model](#w3)<br>

    

<font size =5 color = 'seagreen'>
    
Use a pre-trained word2vec model to find the most similar words.
    
<b>For this demonstration, the `Google News` word2vec embeddings are used.

##### <a id = 'w1'>
<font size = 10 color = 'midnightblue'> <b>Installation and import of necessary packages

In [1]:
%pip uninstall thinc blis -y
%pip uninstall spacy -y
%pip uninstall gensim -y
%pip uninstall numpy -y


Found existing installation: thinc 8.2.5
Uninstalling thinc-8.2.5:
  Successfully uninstalled thinc-8.2.5
Found existing installation: blis 0.7.9
Uninstalling blis-0.7.9:
  Successfully uninstalled blis-0.7.9
Note: you may need to restart the kernel to use updated packages.
Found existing installation: spacy 3.7.4
Uninstalling spacy-3.7.4:
  Successfully uninstalled spacy-3.7.4
Note: you may need to restart the kernel to use updated packages.
Found existing installation: gensim 4.3.3
Uninstalling gensim-4.3.3:
  Successfully uninstalled gensim-4.3.3
Note: you may need to restart the kernel to use updated packages.
Found existing installation: numpy 1.26.4
Uninstalling numpy-1.26.4:
  Successfully uninstalled numpy-1.26.4
Note: you may need to restart the kernel to use updated packages.


In [2]:
# !pip install -U numpy
# !pip install scikit-learn
# !pip install -U gensim
# !pip install -U spacy
!pip install spacy==3.7.4 thinc==8.2.5 blis==0.7.9 numpy==1.26.4
!pip install gensim numpy==1.26.4
!python -m spacy download en_core_web_md

Collecting spacy==3.7.4
  Using cached spacy-3.7.4-cp312-cp312-macosx_11_0_arm64.whl.metadata (27 kB)
Collecting thinc==8.2.5
  Using cached thinc-8.2.5-cp312-cp312-macosx_11_0_arm64.whl.metadata (15 kB)
Collecting blis==0.7.9
  Using cached blis-0.7.9-cp312-cp312-macosx_11_0_arm64.whl
Collecting numpy==1.26.4
  Using cached numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl.metadata (61 kB)
Using cached spacy-3.7.4-cp312-cp312-macosx_11_0_arm64.whl (6.1 MB)
Using cached thinc-8.2.5-cp312-cp312-macosx_11_0_arm64.whl (760 kB)
Using cached numpy-1.26.4-cp312-cp312-macosx_11_0_arm64.whl (13.7 MB)
Installing collected packages: numpy, blis, thinc, spacy
Successfully installed blis-0.7.9 numpy-1.26.4 spacy-3.7.4 thinc-8.2.5
Collecting gensim
  Using cached gensim-4.3.3-cp312-cp312-macosx_11_0_arm64.whl.metadata (8.1 kB)
Using cached gensim-4.3.3-cp312-cp312-macosx_11_0_arm64.whl (24.0 MB)
Installing collected packages: gensim
Successfully installed gensim-4.3.3
Collecting en-core-web-md==3.7.1


 <font size =5 color = 'seagreen'> <b> Import packages

In [3]:
import os
from gensim.models import Word2Vec, KeyedVectors

# To suppress warnings
import warnings  # Standard-library module for controlling warning messages
warnings.filterwarnings("ignore")  # Silence all warnings for the rest of the session

import spacy

[top](#w0)

In [4]:
# https://radimrehurek.com/gensim/downloader.html
import gensim.downloader as api
dir(api)

['BASE_DIR',
 'DATA_LIST_URL',
 'DOWNLOAD_BASE_URL',
 '_DEFAULT_BASE_DIR',
 '_PARENT_DIR',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_calculate_md5_checksum',
 '_create_base_dir',
 '_download',
 '_get_checksum',
 '_get_filename',
 '_get_parts',
 '_load_info',
 '_progress',
 'absolute_import',
 'argparse',
 'base_dir',
 'errno',
 'hashlib',
 'info',
 'io',
 'json',
 'load',
 'logger',
 'logging',
 'math',
 'os',
 'partial',
 'shutil',
 'sys',
 'tempfile',
 'urllib',
 'urlopen']

In [5]:
api.load('glove-twitter-25')

<gensim.models.keyedvectors.KeyedVectors at 0x147c14ce0>

In [6]:
api.info()

{'corpora': {'semeval-2016-2017-task3-subtaskBC': {'num_records': -1,
   'record_format': 'dict',
   'file_size': 6344358,
   'reader_code': 'https://github.com/RaRe-Technologies/gensim-data/releases/download/semeval-2016-2017-task3-subtaskB-eng/__init__.py',
   'license': 'All files released for the task are free for general research use',
   'fields': {'2016-train': ['...'],
    '2016-dev': ['...'],
    '2017-test': ['...'],
    '2016-test': ['...']},
   'description': 'SemEval 2016 / 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collected data is given in sections 3 and 4.1 of the task paper http://alt.qcri.org/semeval2016/task3/data/uploads/semeval2016-task3-report.pdf linked in section “Papers” of https://github.com/RaRe-Technologies/gensim-data/issues/18.',
   'checksum': '701ea67acd82e75f95e1d8e62fb0ad29',
   'file_name': 'se

In [7]:
model = api.load("glove-twitter-25")

In [8]:
model['message']

array([ 0.29771 ,  0.39549 , -0.4852  ,  0.75132 ,  0.49235 , -1.122   ,
        1.1417  ,  0.061686, -0.18265 , -0.039129, -0.024698,  0.62861 ,
       -3.2669  ,  0.12984 ,  0.096197,  1.2485  ,  0.38842 , -0.4175  ,
       -1.8549  , -0.22553 ,  0.29188 , -1.2671  , -0.18226 ,  0.023759,
       -1.5955  ], dtype=float32)

In [9]:
model.save('model1.bin')

In [None]:
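# model.save() writes gensim's native format, not the word2vec format, so the call below raises
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
# KeyedVectors.load_word2vec_format('model1.bin', binary=True)
# Use KeyedVectors.load('model1.bin') to read a file written by save().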

In [11]:
from gensim.test.utils import datapath

wv_from_text = KeyedVectors.load_word2vec_format(datapath('word2vec_pre_kv_c'), binary=False)  # C text format
wv_from_bin = KeyedVectors.load_word2vec_format(datapath("euclidean_vectors.bin"), binary=True)
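
For orientation, the C text format read with `binary=False` is plain text: a header line giving the vocabulary size and the vector dimensionality, followed by one line per word with its space-separated components. The sketch below parses a tiny hand-written sample of that layout using only the standard library. `SAMPLE` and `parse_word2vec_text` are hypothetical names chosen for illustration, the numbers are made up, and this is not how gensim itself implements loading.

```python
# Minimal sketch: read the word2vec C text layout from a string.
# SAMPLE and parse_word2vec_text are illustrative names, not gensim APIs.
SAMPLE = """3 4
king 0.5 -0.25 0.125 1.0
queen 0.5 -0.5 0.25 0.75
apple -1.0 0.0 0.5 0.25
"""


def parse_word2vec_text(text):
    """Return a dict mapping each word to its list of float components."""
    lines = text.strip().splitlines()
    # Header: "<vocabulary size> <vector size>"
    vocab_size, vector_size = (int(n) for n in lines[0].split())
    vectors = {}
    for line in lines[1:]:
        word, *values = line.rstrip().split(" ")
        if len(values) != vector_size:
            raise ValueError(f"expected {vector_size} values for {word!r}")
        vectors[word] = [float(v) for v in values]
    if len(vectors) != vocab_size:
        raise ValueError("header vocabulary size does not match the data")
    return vectors


toy_vectors = parse_word2vec_text(SAMPLE)
print(toy_vectors["king"])  # [0.5, -0.25, 0.125, 1.0]
```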

In [12]:
wv_from_bin['king']

array([ 0.5307419 , -0.29500753,  0.22712094,  0.55379623,  0.34295803,
       -0.50292325,  0.69789636, -0.06378646, -1.4219495 ,  0.13709101],
      dtype=float32)

 ##### <a id = 'w2'>
<font size = 10 color = 'midnightblue'> <b>  Model implementation

<font size = 5 color = pwdrblue> <b> Get the word embeddings

In [None]:
%pip install -U kagglehub

In [13]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("leadbest/googlenewsvectorsnegative300")

print("Path to dataset files:", path)

Path to dataset files: /Users/toddwalters/.cache/kagglehub/datasets/leadbest/googlenewsvectorsnegative300/versions/2


In [14]:
# path = os.getcwd()
file_name = 'GoogleNews-vectors-negative300.bin.gz'
pretrained_path = os.path.join(path, file_name)

<font size = 5 color = pwdrblue> <b> Load the model

In [15]:
w2v_model = KeyedVectors.load_word2vec_format(pretrained_path, binary=True) #load the model

<font size = 5 color = pwdrblue> <b> Check number of words in vocabulary

In [16]:
print("Number of words in vocabulary: ",len(w2v_model.index_to_key)) #Number of words in the vocabulary.

Number of words in vocabulary:  3000000


In [17]:
print(f"First few words of the vocabulary :\n{ w2v_model.index_to_key[:20]}")

First few words of the vocabulary :
['</s>', 'in', 'for', 'that', 'is', 'on', '##', 'The', 'with', 'said', 'was', 'the', 'at', 'not', 'as', 'it', 'be', 'from', 'by', 'are']


<font size = 5 color = pwdrblue> <b> Examine the model to extract the most similar words for a given word, such as `joyful` or `Travel`

In [18]:
w2v_model.most_similar('joyful')

[('joyous', 0.818248987197876),
 ('Besnik_Berisha_Pristina', 0.684850811958313),
 ('joy', 0.6633967757225037),
 ('joyousness', 0.6440029740333557),
 ('exuberant', 0.6130944490432739),
 ('uplifting', 0.593187153339386),
 ('old_demonstrator_Juliya', 0.592427134513855),
 ('sorrowful', 0.5822992324829102),
 ('cheerful', 0.5811518430709839),
 ('indescribable_joy', 0.5810405611991882)]
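
The scores returned by `most_similar` are cosine similarities between word vectors. As a rough illustration of that computation, the sketch below ranks the words of a made-up three-dimensional vocabulary. `tiny_vectors`, `cosine`, and `toy_most_similar` are hypothetical names and the vector values are invented, so only the mechanics carry over to the 300-dimensional Google News model.

```python
import math

# Invented 3-dimensional vectors, for illustration only.
tiny_vectors = {
    "joyful": [0.9, 0.1, 0.0],
    "joyous": [0.8, 0.2, 0.1],
    "cheerful": [0.6, 0.4, 0.2],
    "solid": [0.0, 0.1, 0.9],
}


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def toy_most_similar(word, vectors, topn=10):
    """Rank every other word by cosine similarity to `word`."""
    query = vectors[word]  # raises KeyError for an unknown word
    scores = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


ranked = toy_most_similar("joyful", tiny_vectors)
for other, score in ranked:
    print(f"{other}: {score:.3f}")
```

Vectors pointing in nearly the same direction score close to 1, while unrelated directions score close to 0, which is why `joyous` ranks above `solid` in this toy vocabulary.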

In [19]:
w2v_model.most_similar('Travel')

[('travel', 0.6577277183532715),
 ('Adventure_Travel', 0.636359453201294),
 ('Travel_Agent', 0.608870804309845),
 ('Aol_Autos', 0.5961860418319702),
 ('Destinations', 0.5935966968536377),
 ('Vacations', 0.5759146213531494),
 ('Vantage_Deluxe', 0.573976993560791),
 ('Travel_Destinations', 0.5731626152992249),
 ('Travel_Agents', 0.5724309086799622),
 ('Escorted_Tours', 0.5571459531784058)]

In [None]:
w2v_model.most_similar('travel')

<div class="alert alert-block alert-success">
<font size = 4>
    
<center><b> A KeyError is raised when the queried word is not present in the vocabulary.</b>


<font size = 5 color = seagreen> <b> The snippet below handles the error and checks similarity for multiple words:

In [None]:
inp = "y"
while inp.lower() == 'y':
    word = input("Enter a word to get similar words: ")
    try:
        similar_words = w2v_model.most_similar(word)
    except KeyError:  # raised when the word is not in the vocabulary
        print('Word does not exist in vocabulary!')
    else:
        print(f"Most similar words to '{word}' :\n")
        for t in similar_words:
            print(t)
        print('\n')
    inp = input("Do you want to continue? (Y/N) : ")
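
Because the vocabulary is case-sensitive (`Travel` and `travel` are separate entries in the output above), one non-interactive option is to try a few case variants before giving up. The helper below is a sketch under that assumption and is not part of gensim: `resolve_key` and `demo_vocab` are hypothetical names, and the function relies only on the `in` operator, so a plain dict stands in for `w2v_model` here.

```python
def resolve_key(word, vocab):
    """Return the first case variant of `word` found in `vocab`, else None."""
    for candidate in (word, word.lower(), word.capitalize(), word.upper()):
        if candidate in vocab:
            return candidate
    return None


# A plain dict stands in for the model's vocabulary in this sketch.
demo_vocab = {"Travel": 0, "travel": 1, "NASA": 2}

print(resolve_key("TRAVEL", demo_vocab))  # travel
print(resolve_key("nasa", demo_vocab))    # NASA
print(resolve_key("qwzx", demo_vocab))    # None
```

With the real model, calling `resolve_key(word, w2v_model)` first and querying `most_similar` only when the result is not `None` avoids the exception altogether.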


<font size = 5 color = pwdrblue> <b>  Get the word vector of any term

In [20]:
w2v_model['beautiful']

array([-0.01831055,  0.05566406, -0.01153564,  0.07275391,  0.15136719,
       -0.06176758,  0.20605469, -0.15332031, -0.05908203,  0.22851562,
       -0.06445312, -0.22851562, -0.09472656, -0.03344727,  0.24707031,
        0.05541992, -0.00921631,  0.1328125 , -0.15429688,  0.08105469,
       -0.07373047,  0.24316406,  0.12353516, -0.09277344,  0.08203125,
        0.06494141,  0.15722656,  0.11279297, -0.0612793 , -0.296875  ,
       -0.13378906,  0.234375  ,  0.09765625,  0.17773438,  0.06689453,
       -0.27539062,  0.06445312, -0.13867188, -0.08886719,  0.171875  ,
        0.07861328, -0.10058594,  0.23925781,  0.03808594,  0.18652344,
       -0.11279297,  0.22558594,  0.10986328, -0.11865234,  0.02026367,
        0.11376953,  0.09570312,  0.29492188,  0.08251953, -0.05444336,
       -0.0090332 , -0.0625    , -0.17578125, -0.08154297,  0.01062012,
       -0.04736328, -0.08544922, -0.19042969, -0.30273438,  0.07617188,
        0.125     , -0.05932617,  0.03833008, -0.03564453,  0.24

<font size = 5 color = pwdrblue> <b>  Get the embeddings for a complete text

<div class="alert alert-block alert-success">
<font size = 4>
    
- A simple way is to sum or average the embeddings of the individual words.
- Let us see a small example using another NLP library, spaCy.
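
As a minimal sketch of the averaging approach in the first bullet, the function below takes the element-wise mean of the vectors of all in-vocabulary tokens and skips unknown ones. The names `average_embedding` and `mini_vectors` are hypothetical, whitespace tokenization is a simplifying assumption, and plain lists with invented values stand in for real embeddings so the example runs without the model.

```python
def average_embedding(text, vectors):
    """Element-wise mean of the vectors of all known tokens, or None."""
    tokens = text.split()  # naive whitespace tokenization (assumption)
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None  # no token of the text is in the vocabulary
    # zip(*known) groups the i-th component of every vector together.
    return [sum(column) / len(known) for column in zip(*known)]


# Invented 3-dimensional vectors, for illustration only.
mini_vectors = {
    "word": [1.0, 2.0, 3.0],
    "embeddings": [3.0, 4.0, 5.0],
}

sentence_vector = average_embedding("word embeddings rock", mini_vectors)
print(sentence_vector)  # [2.0, 3.0, 4.0]
```

The token `rock` is absent from `mini_vectors`, so it is ignored and the mean is taken over the two known words only.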

[top](#w0)

 ##### <a id = 'w3'>
<font size = 10 color = 'midnightblue'> <b> Load the embedding model

In [21]:
# Load the English spaCy pipeline, which includes word vectors
nlp = spacy.load('en_core_web_md')

In [24]:
# Process the sentence into a spaCy Doc object
mydoc = nlp("Artificial intelligence revolutionizes industries by enhancing automation and decision-making.")

# Check the length of the vector
print(f'Length of the mydoc vector: {len(mydoc.vector)}\n')

# Get the averaged vector for the entire sentence
print(mydoc.vector)

Length of the mydoc vector: 300

[-2.88833928e+00 -1.73949754e+00 -1.43964255e+00  1.20811331e+00
  5.13780832e+00 -1.30674643e-02  7.21087217e-01  3.04970646e+00
 -1.84638822e+00 -2.22266102e+00  4.68991232e+00  2.61828923e+00
 -4.52669907e+00  1.26645672e+00 -2.34535038e-01  2.95858502e+00
  2.32375741e+00  3.09238362e+00 -2.10459971e+00 -2.37954974e+00
  1.66396189e+00  1.19737089e+00 -2.69254231e+00 -6.14912510e-01
 -1.08377755e+00 -2.53180766e+00 -8.91990006e-01 -2.05081487e+00
  1.18648148e+00 -1.64304212e-01 -6.91387177e-01 -3.24216664e-01
  9.49795187e-01 -4.35991287e-02 -1.66022158e+00 -1.46512663e+00
  1.34241199e+00  2.41495681e+00 -1.01387584e+00 -7.52867520e-01
  1.03934491e+00 -4.46677536e-01 -1.66344082e+00  1.26216695e-01
 -1.90512753e+00 -7.97698975e-01 -2.97492743e-03 -1.90645266e+00
  1.40009150e-01  3.15768272e-01 -2.23950672e+00  2.10762620e+00
  2.52927512e-01 -4.42820024e+00  2.02550087e-02  1.65891302e+00
 -1.52912760e+00  1.03578997e+00  1.68434843e-01 -1.34021