# Gensim - Working with Word2vec pretrained Model

Subject: Download and work with pretrained Work2vec embeddings.

Data: google-news model, trained with Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. 

Procedure:
- Print list of pretrained models from Gensim
- Download model trained with Google-News dataset (to be executed in console)
- Save that model locally
- Load that model from local file, preferably with limited vocab
- Load the model's vocab in Pytorch as torchtext.vocab.Vocab
- Load the model's word vectors into a torch.nn.Embedding layer for usage in torch model

Others:
- Support for Colab for saving/loading from Google Drive

Sources:
- https://code.google.com/archive/p/word2vec/
- https://saturncloud.io/blog/pytorch-gensim-how-do-i-load-pretrained-word-embeddings/

In [11]:
if IN_COLAB := 'google.colab' in str(get_ipython()):
    from google.colab import drive
    drive.mount('/content/drive')
    BASE_PATH = './drive/MyDrive/Colab/'

else:
    BASE_PATH = '../'

In [None]:
from pprint import pprint
import locale
locale.setlocale(locale.LC_ALL, locale='')  # for thousands separator via ... print(f'{value:n}')

## gensim's Pretrained Models

In [None]:
import gensim
import gensim.downloader as downloader

### List gensim pretrained models

In [None]:
print('\n'.join(gensim.downloader.info()['models'].keys()))

In [None]:
pprint(gensim.downloader.info()['models']['word2vec-google-news-300'])

### Download pretrained word vectors and save locally
better execute in <b>console</b>

In [None]:
# download might crash jupyter lab
# might help
# -> "jupyter-lab --NotebookApp.iopub_data_rate_limit=1e10"
# -> "jupyter-lab --NotebookApp.iopub_msg_rate_limit=1e10"
# but better download and save in console, then load from local file in notebook

wv = downloader.load('word2vec-google-news-300')  # ~1.7 GB
wv.save_word2vec_format(BASE_PATH + 'pretrained/word2vec-google-news-300.bin', binary=True)  # >3.5gb; takes few sec.  # see https://code.google.com/archive/p/word2vec/

### Load saved pretrained word vectors

In [12]:
# load saved 
wv = gensim.models.KeyedVectors.load_word2vec_format(BASE_PATH + 'pretrained/word2vec-google-news-300.bin', binary=True)  # takes a few sec.

### Quick Test

In [13]:
wv.most_similar(positive=['woman', 'king'], negative=['man'])

[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('Queen_Consort', 0.5235945582389832),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454)]

### Vocab

In [14]:
print(f'Vocab size: {len(wv.key_to_index):n} words.')

Vocab size: 3.000.000 words.


In [18]:
# With over 3GB in RAM, the Google-News-embeddings are too large to be convenient to work with. 
# Load only the first 100.000 (of 3.000.000) words:
wv_limited = gensim.models.KeyedVectors.load_word2vec_format(BASE_PATH + 'pretrained/word2vec-google-news-300.bin',
                                                     binary=True,
                                                     limit=100000)
print(f'Vocab size: {len(wv_limited.key_to_index):n} words.')

wv_limited.most_similar(positive=['woman', 'king'], negative=['man'])  # Notice that 'Queen_Consort' is missing

Vocab size: 100.000 words.


[('queen', 0.7118193507194519),
 ('monarch', 0.6189674139022827),
 ('princess', 0.5902431011199951),
 ('crown_prince', 0.5499460697174072),
 ('prince', 0.5377321839332581),
 ('kings', 0.5236844420433044),
 ('queens', 0.5181134343147278),
 ('sultan', 0.5098593831062317),
 ('monarchy', 0.5087411999702454),
 ('royal_palace', 0.5087166428565979)]

In [19]:
wv_more_limited = gensim.models.KeyedVectors.load_word2vec_format(BASE_PATH + 'pretrained/word2vec-google-news-300.bin',
                                                     binary=True,
                                                     limit=20000)
print(f'Vocab size: {len(wv_more_limited.key_to_index):n} words.')

wv_more_limited.most_similar(positive=['woman', 'king'], negative=['man'])  # With more exotic words missing, the results seem more plausible

Vocab size: 20.000 words.


[('queen', 0.7118193507194519),
 ('princess', 0.5902431011199951),
 ('prince', 0.5377321839332581),
 ('monarchy', 0.5087411999702454),
 ('throne', 0.5005807876586914),
 ('royal', 0.493820458650589),
 ('ruler', 0.4909275770187378),
 ('kingdom', 0.4550568163394928),
 ('palace', 0.45176562666893005),
 ('Princess', 0.4375406801700592)]

## Use gensim embeddings in PyTorch🔥

### Vocabulary

In [20]:
import torchtext.vocab as vocab

In [21]:
v = vocab.vocab(wv_more_limited.key_to_index)

In [22]:
v['king']

6146

### Embeddings from Word vectors

In [23]:
import torch.nn as nn
import torch

In [24]:
# the vectors are simply an ndarray of size...
print(f'.vectors is a {type(wv_more_limited.vectors)} of size {wv_more_limited.vectors.shape} with dtype {wv_more_limited.vectors.dtype}.')

.vectors is a <class 'numpy.ndarray'> of size (20000, 300) with dtype float32.


In [25]:
# convert to tensor
pretrained_vectors = torch.tensor(wv_more_limited.vectors)
print(f'Converted to {type(pretrained_vectors)} of size {pretrained_vectors.shape} with dtype {pretrained_vectors.dtype}.')

Converted to <class 'torch.Tensor'> of size torch.Size([20000, 300]) with dtype torch.float32.


In [26]:
embedding_layer = nn.Embedding.from_pretrained(pretrained_vectors)

You can then use this layer in your PyTorch model, for example:

In [27]:
class MyModel(nn.Module):
    def __init__(self):
        super(MyModel, self).__init__()
        self.embedding = nn.Embedding.from_pretrained(pretrained_vectors)
        self.fc = nn.Linear(300, 10)
    
    def forward(self, x):
        embedded = self.embedding(x)
        output = self.fc(embedded.mean(dim=1))
        return output