# Using Pre-trained Word Embeddings

In this notebook we will explore a few operations on pre-trained word embeddings to build an intuition for how they behave.

We will be using the pre-trained GloVe embeddings that can be found on the [official website](https://nlp.stanford.edu/projects/glove/). In particular, we will use the file `glove.6B.300d.txt` contained in this [zip file](https://nlp.stanford.edu/data/glove.6B.zip).

We will first load the GloVe embeddings using [Gensim](https://radimrehurek.com/gensim/). Specifically, we will use [`KeyedVectors`](https://radimrehurek.com/gensim/models/keyedvectors.html)'s [`load_word2vec_format()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.load_word2vec_format) classmethod, which supports the original word2vec file format.
However, the two formats differ in one respect: a word2vec file starts with a header line stating the number of embeddings and their dimensionality, whereas a GloVe file has no such header. We account for this by passing `no_header=True` when loading the embeddings.
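
To make the difference concrete, the sketch below builds a tiny GloVe-style listing in memory and prepends the header line that the word2vec text format expects. The helper name `add_word2vec_header` and the toy vectors are assumptions made for illustration; with a real file, passing `no_header=True` in the loading cell makes this step unnecessary.

```python
def add_word2vec_header(glove_lines):
    """Prepend a '<count> <dims>' header to GloVe-style lines (hypothetical helper)."""
    rows = [line.rstrip("\n") for line in glove_lines if line.strip()]
    # every row is a word followed by its vector components, separated by spaces
    dims = len(rows[0].split(" ")) - 1
    return [f"{len(rows)} {dims}"] + rows

# toy GloVe-style content: no header, one word per line
glove_lines = [
    "the 0.1 0.2 0.3",
    "cat 0.4 0.5 0.6",
]
word2vec_lines = add_word2vec_header(glove_lines)
print(word2vec_lines[0])  # header line: "2 3"
```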

Loading the embeddings may take a little while, so hang in there!

In [1]:
!pip install gensim
from gensim.models import KeyedVectors

fname = "glove.6B.300d.txt"
glove = KeyedVectors.load_word2vec_format(fname, no_header=True)
glove.vectors.shape

(400000, 300)

## Word similarity

One attribute of word embeddings that makes them useful is the ability to compare them using cosine similarity to find how similar they are. [`KeyedVectors`](https://radimrehurek.com/gensim/models/keyedvectors.html) objects provide a method called [`most_similar()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar) that we can use to find the closest words to a particular word of interest. By default, [`most_similar()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar) returns the 10 most similar words, but this can be changed using the `topn` parameter.
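
For reference, the cosine similarity of two vectors $\vec{u}$ and $\vec{v}$ is their dot product divided by the product of their lengths: $\cos(\vec{u}, \vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\| \, \|\vec{v}\|}$. A minimal sketch with hand-made vectors (the values and the helper name `cosine` are made up for illustration and are not taken from GloVe or Gensim):

```python
import math

def cosine(u, v):
    # dot product divided by the product of the Euclidean norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # parallel vectors
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # perpendicular vectors
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # anti-parallel vectors
print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0, and -1: the score depends only on the angle between the vectors, not on their lengths.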

Below we test `most_similar()` using a few different words.

In [2]:
# common noun
glove.most_similar("cactus")


In [3]:
# common noun
glove.most_similar("cake")


In [4]:
# adjective
glove.most_similar("angry")


In [5]:
# adverb
glove.most_similar("quickly")


In [6]:
# preposition
glove.most_similar("between")


In [7]:
# determiner
glove.most_similar("the")


## Word analogies

Another characteristic of word embeddings is their ability to solve analogy problems.
The same [`most_similar()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar) method can be used for this task, by passing two lists of words:
a `positive` list with the words that should be added and a `negative` list with the words that should be subtracted. Using these arguments, the famous example $\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$ can be executed as follows:

In [8]:
# king - man + woman
glove.most_similar(positive=["king", "woman"], negative=["man"])


Here are a few other interesting analogies:

In [9]:
# car - drive + fly
glove.most_similar(positive=["car", "fly"], negative=["drive"])


In [10]:
# berlin - germany + australia
glove.most_similar(positive=["berlin", "australia"], negative=["germany"])


In [11]:
# england - london + baghdad
glove.most_similar(positive=["england", "baghdad"], negative=["london"])


In [12]:
# japan - yen + peso
glove.most_similar(positive=["japan", "peso"], negative=["yen"])


In [13]:
# best - good + tall
glove.most_similar(positive=["best", "tall"], negative=["good"])


## Looking under the hood

Now that we are more familiar with the [`most_similar()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar) method, it is time to implement its functionality ourselves.
But first, we need to take a look at the different parts of the [`KeyedVectors`](https://radimrehurek.com/gensim/models/keyedvectors.html) object that we will need.
Obviously, we will need the vectors themselves. They are stored in the `vectors` attribute.

In [15]:
glove.vectors.shape

(400000, 300)

As we can see above, `vectors` is a 2-dimensional matrix with 400,000 rows and 300 columns.
Each row corresponds to a 300-dimensional word embedding. These embeddings are not normalized, but normalized embeddings can be obtained using the [`get_normed_vectors()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.get_normed_vectors) method.
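
As a rough sketch of what normalization means here, using a small made-up matrix rather than the GloVe vectors: each row is divided by its Euclidean length, after which the dot product of two rows equals their cosine similarity. This illustrates the idea only and is not necessarily how Gensim implements it internally.

```python
import numpy as np

# toy "embedding matrix": 3 words, 2 dimensions (made-up values)
toy_matrix = np.array([[3.0, 4.0],
                       [1.0, 0.0],
                       [0.0, 2.0]])

# divide each row by its L2 norm; keepdims lets the division broadcast row-wise
row_norms = np.linalg.norm(toy_matrix, axis=1, keepdims=True)
toy_normed = toy_matrix / row_norms

# every row now has unit length
print(np.linalg.norm(toy_normed, axis=1))

# dot product of two unit rows = cosine similarity of the original rows
print(toy_normed[0] @ toy_normed[1])
```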

In [16]:
normed_vectors = glove.get_normed_vectors()
normed_vectors.shape

(400000, 300)

Now we need to map the words in the vocabulary to rows in the `vectors` matrix, and vice versa.
The [`KeyedVectors`](https://radimrehurek.com/gensim/models/keyedvectors.html) object has the attributes `index_to_key` and `key_to_index`, which are a list of words and a dictionary mapping words to indices, respectively.
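
The relationship between the two attributes can be sketched with a toy vocabulary (the three words and the names `toy_index_to_key` and `toy_key_to_index` are made up for illustration): the dictionary is simply the inverse of the list.

```python
# position in the list = row index in the vectors matrix
toy_index_to_key = ["the", "cat", "sat"]

# invert the list: word -> row index
toy_key_to_index = {word: i for i, word in enumerate(toy_index_to_key)}

print(toy_key_to_index["cat"])  # 1
print(toy_index_to_key[1])      # cat
```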

In [17]:
#glove.index_to_key

In [18]:
#glove.key_to_index

## Word similarity from scratch

Now we have everything we need to implement a `most_similar_words()` function that takes a word, the vector matrix, the `index_to_key` list, and the `key_to_index` dictionary. This function will return the 10 most similar words to the provided word, along with their similarity scores.

In [None]:
import numpy as np

def most_similar_words(word, vectors, index_to_key, key_to_index, topn=10):
    # retrieve word_id corresponding to given word
    word_id = key_to_index[word]
    # retrieve embedding for given word
    emb = vectors[word_id]
    # calculate similarities to all words in our vocabulary
    # (a dot product equals cosine similarity when the vectors are normalized)
    similarities = vectors @ emb
    # get word_ids in ascending order with respect to similarity score
    ids_ascending = similarities.argsort()
    # reverse word_ids
    ids_descending = ids_ascending[::-1]
    # get boolean array with element corresponding to word_id set to false
    mask = ids_descending != word_id
    # obtain new array of indices that doesn't contain word_id
    ids_descending = ids_descending[mask]
    # get topn word_ids
    top_ids = ids_descending[:topn]
    # retrieve topn words with their corresponding similarity score
    top_words = [(index_to_key[i], similarities[i]) for i in top_ids]
    # return results
    return top_words

Now let's try the same example that we used above: the most similar words to "cactus".

In [None]:
vectors = glove.get_normed_vectors()
index_to_key = glove.index_to_key
key_to_index = glove.key_to_index
most_similar_words("cactus", vectors, index_to_key, key_to_index)

## Analogies from scratch

The `most_similar_words()` function behaves as expected. Now let's implement a function to perform the analogy task. We will give it the very creative name `analogy`. This function takes two lists of words (one for positive words and one for negative words), just like the [`most_similar()`](https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.most_similar) method we discussed above.

In [None]:
from numpy.linalg import norm

def analogy(positive, negative, vectors, index_to_key, key_to_index, topn=10):
    # find ids for positive and negative words
    pos_ids = [key_to_index[w] for w in positive]
    neg_ids = [key_to_index[w] for w in negative]
    given_word_ids = pos_ids + neg_ids
    # get embeddings for positive and negative words
    pos_emb = vectors[pos_ids].sum(axis=0)
    neg_emb = vectors[neg_ids].sum(axis=0)
    # get embedding for analogy
    emb = pos_emb - neg_emb
    # normalize embedding
    emb = emb / norm(emb)
    # calculate similarities to all words in our vocabulary
    similarities = vectors @ emb
    # get word_ids in ascending order with respect to similarity score
    ids_ascending = similarities.argsort()
    # reverse word_ids
    ids_descending = ids_ascending[::-1]
    # get boolean array with element corresponding to any of given_word_ids set to false
    given_words_mask = np.isin(ids_descending, given_word_ids, invert=True)
    # obtain new array of indices that doesn't contain any of the given_word_ids
    ids_descending = ids_descending[given_words_mask]
    # get topn word_ids
    top_ids = ids_descending[:topn]
    # retrieve topn words with their corresponding similarity score
    top_words = [(index_to_key[i], similarities[i]) for i in top_ids]
    # return results
    return top_words

Let's try this function with the $\vec{king} - \vec{man} + \vec{woman} \approx \vec{queen}$ example we discussed above.

In [None]:
positive = ["king", "woman"]
negative = ["man"]
vectors = glove.get_normed_vectors()
index_to_key = glove.index_to_key
key_to_index = glove.key_to_index
analogy(positive, negative, vectors, index_to_key, key_to_index)
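
To see the arithmetic in isolation, here is a small sketch on a made-up 2-dimensional vocabulary (the vectors are hand-picked for illustration and have nothing to do with GloVe). One axis loosely encodes royalty and the other gender, so $\vec{king} - \vec{man} + \vec{woman}$ lands exactly on the toy vector for "queen".

```python
import numpy as np

# hand-picked toy vectors: (royalty, gender); not real embeddings
toy_words = ["king", "queen", "man", "woman", "apple"]
toy_emb = np.array([
    [1.0,  1.0],   # king:  royal, male
    [1.0, -1.0],   # queen: royal, female
    [0.0,  1.0],   # man:   not royal, male
    [0.0, -1.0],   # woman: not royal, female
    [-1.0, 0.0],   # apple: unrelated filler
])
toy_index = {w: i for i, w in enumerate(toy_words)}

# king - man + woman
target = (toy_emb[toy_index["king"]]
          - toy_emb[toy_index["man"]]
          + toy_emb[toy_index["woman"]])

# cosine similarity of the target against every row
unit_rows = toy_emb / np.linalg.norm(toy_emb, axis=1, keepdims=True)
scores = unit_rows @ (target / np.linalg.norm(target))

# skip the query words, as analogy() does, then take the best remaining word
query_ids = {toy_index[w] for w in ("king", "man", "woman")}
candidates = [i for i in np.argsort(scores)[::-1] if i not in query_ids]
best_word = toy_words[candidates[0]]
print(best_word)  # queen
```

Real embeddings are far noisier than this toy setup, which is why the analogy holds only approximately and why the query words are excluded from the results.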