# GPT-2 - Analogies with Learned Embeddings
---

This document explores using the learned embeddings layer of the GPT-2 language model to perform analogy analysis, similar to the approach used with Word2Vec. GPT-2 was chosen over other models with learned embeddings, such as BERT, because of the growing prevalence of ChatGPT.

In [1]:
from transformers import GPT2Tokenizer, GPT2Model

# Load pre-trained model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2Model.from_pretrained('gpt2')

# Set the model to evaluation mode
model.eval()



GPT2Model(
  (wte): Embedding(50257, 768)
  (wpe): Embedding(1024, 768)
  (drop): Dropout(p=0.1, inplace=False)
  (h): ModuleList(
    (0-11): 12 x GPT2Block(
      (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): GPT2Attention(
        (c_attn): Conv1D()
        (c_proj): Conv1D()
        (attn_dropout): Dropout(p=0.1, inplace=False)
        (resid_dropout): Dropout(p=0.1, inplace=False)
      )
      (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): GPT2MLP(
        (c_fc): Conv1D()
        (c_proj): Conv1D()
        (act): NewGELUActivation()
        (dropout): Dropout(p=0.1, inplace=False)
      )
    )
  )
  (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
)

## 1. Getting the Embeddings of a Token

The first step is to get the embedding of a token. We do this by first isolating the embeddings layer (`wte`) from the loaded GPT-2 model. We then use the weights from the embeddings layer to look up the row for each token id that the tokenizer produces for the word.

In [2]:
import torch

# Isolate the embeddings layer from the GPT 2 model
embeddings_layer = model.wte

def get_embedding(word: str, embeddings_layer: torch.nn.Embedding) -> torch.Tensor:
    """
    Given a word, return its vector embedding from the embedding layer in GPT-2.
    
    Parameters
    ----------
    word : str
        The word to be embedded.
    embeddings_layer : torch.nn.Embedding
        The embeddings layer of the pre-trained GPT-2 model.

    Returns
    -------
    torch.Tensor
        The tensor representation of the word's vector embedding.
    """
    tokens = tokenizer.encode(word)
    with torch.no_grad():
        word_embedding = embeddings_layer.weight[tokens, :]
    return word_embedding
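
As a quick sanity check on the indexing used in `get_embedding`, the sketch below builds a tiny stand-in embedding layer and compares two ways of reading it. The sizes (10 tokens, 4 dimensions) and the token ids are made up for illustration and are not taken from GPT-2. It shows that indexing the weight matrix by token ids returns the same rows as calling the layer, with one row per token id.

```python
import torch

# Toy stand-in for model.wte: 10 "tokens", 4 dimensions (hypothetical sizes).
torch.manual_seed(0)
toy_layer = torch.nn.Embedding(10, 4)

# Pretend the tokenizer split a word into two token ids (made-up values).
token_ids = [3, 7]

with torch.no_grad():
    # Direct row lookup in the weight matrix, as get_embedding does.
    by_index = toy_layer.weight[token_ids, :]
    # Forward pass of the embedding layer on the same ids.
    by_call = toy_layer(torch.tensor(token_ids))

print(tuple(by_index.shape))           # one row per token id
print(torch.equal(by_index, by_call))  # both paths return the same rows
```

A word that the tokenizer splits into several tokens therefore yields several rows, which is why the `analogy` function in Section 3 checks that each word produces exactly one row.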


## 2. Nearest Tokens

The next step is to be able to determine the nearest token to a given arbitrary embedding. 

### 2.1 Euclidean Distance for Similarity

An intuitive metric for the similarity of token embeddings is Euclidean distance. The `closest_token` function takes a single embedding, computes the Euclidean distance between that vector and every embedding in the embeddings layer, and returns the token id with the smallest distance.

In [3]:

def closest_token(embedding: torch.Tensor, embeddings_layer: torch.nn.Embedding) -> int:
    """
    Given an embedding, return the token id with the most similar embedding from the GPT-2 model.
    
    Parameters
    ----------
    embedding : torch.Tensor
        The tensor representation of a word's vector embedding.
    embeddings_layer : torch.nn.Embedding
        The embeddings layer of the pre-trained GPT-2 model.

    Returns
    -------
    int
        The token id with the most similar embedding.
    """
    embeddings = embeddings_layer.weight
    # Calculate the Euclidean distance
    distances = torch.norm(embeddings - embedding, dim=1)  
    # Get the token id of the smallest distance
    closest_token_id = distances.argmin().item()  
    return closest_token_id


### 2.2 Cosine Similarity

A common alternative metric is cosine similarity, which depends only on the angle between two embeddings (their direction) and ignores their magnitude.

In [4]:

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

def closest_token_cosine(embedding: torch.Tensor, embeddings_layer: torch.nn.Embedding) -> int:
    """
    Given an embedding, return the token id with the most similar embedding from the GPT-2 model,
    measured by cosine similarity.

    Parameters
    ----------
    embedding : torch.Tensor
        The tensor representation of a word's vector embedding.
    embeddings_layer : torch.nn.Embedding
        The embeddings layer of the pre-trained GPT-2 model.

    Returns
    -------
    int
        The token id with the most similar embedding.
    """
    embeddings = embeddings_layer.weight
    # cosine_similarity from sklearn expects 2D NumPy arrays, so convert and reshape accordingly.
    embedding = embedding.detach().numpy().reshape(1, -1)
    embeddings = embeddings.detach().numpy()
    similarities = cosine_similarity(embedding, embeddings)
    # Get the token id of the largest similarity (cast from np.int64 to match the return type)
    closest_token_id = int(np.argmax(similarities))
    return closest_token_id
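
To make the difference between the two metrics concrete, here is a small worked example on made-up 2-D vectors (the names and values are purely illustrative and have nothing to do with GPT-2's actual embeddings). The query points in the same direction as `far_same_dir` but lies closer in space to `near_other_dir`, so the two metrics pick different neighbours.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two vectors."""
    return math.dist(a, b)

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

query = (1.0, 1.0)
toy_vocab = {
    "far_same_dir": (5.0, 5.0),    # same direction as the query, but far away
    "near_other_dir": (2.0, 0.5),  # close to the query, but a different direction
}

# Smallest distance wins for Euclidean; largest similarity wins for cosine.
closest_by_distance = min(toy_vocab, key=lambda name: euclidean(query, toy_vocab[name]))
closest_by_cosine = max(toy_vocab, key=lambda name: cosine(query, toy_vocab[name]))

print("Euclidean:", closest_by_distance)
print("Cosine:", closest_by_cosine)
```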

## 3. The Analogies

In [5]:
def analogy(w1: str, w2: str, w3: str, embeddings_layer: torch.nn.Embedding, cosine_sim: bool = False) -> str:
    """
    Given three words, find a word that is related to the third word in the same way the second word is 
    related to the first by manipulating word embeddings.

    Parameters
    ----------
    w1 : str
        The first word.
    w2 : str
        The second word.
    w3 : str
        The third word.
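    embeddings_layer : torch.nn.Embedding
        The embeddings layer of the pre-trained GPT-2 model.
    cosine_sim : bool, optional
        If True, use cosine similarity instead of Euclidean distance. Defaults to False.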

    Returns
    -------
    str
        The token that completes the analogy.
    """
    # Look up the embedding of each word
    embed_1 = get_embedding(w1, embeddings_layer)
    embed_2 = get_embedding(w2, embeddings_layer)
    embed_3 = get_embedding(w3, embeddings_layer)
    
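    # Ensure that each word maps to a single token (one embedding row)
    assert embed_1.shape[0] == 1, f"'{w1}' is not a single token"
    assert embed_2.shape[0] == 1, f"'{w2}' is not a single token"
    assert embed_3.shape[0] == 1, f"'{w3}' is not a single token"

    # embed_2 - embed_1 is the offset from w1 to w2; apply that offset to w3
    target = embed_2 - embed_1 + embed_3
    if cosine_sim:
        closest_token_id = closest_token_cosine(target, embeddings_layer)
    else:
        closest_token_id = closest_token(target, embeddings_layer)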

    return tokenizer.decode([closest_token_id])

In [6]:
print("Euclidean:",analogy("bank", "money", "bank", embeddings_layer, False))
print("Cosine:"   ,analogy("bank", "money", "bank", embeddings_layer, True))

Euclidean: money
Cosine: money


In [7]:
print("Euclidean:",analogy("bank", "money", "library", embeddings_layer, False))
print("Cosine:"   ,analogy("bank", "money", "library", embeddings_layer, True))

Euclidean: library
Cosine: library


In [8]:
print("Euclidean:",analogy("bank", "money", "school", embeddings_layer, False))
print("Cosine:"   ,analogy("bank", "money", "school", embeddings_layer, True))

Euclidean: school
Cosine: school


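We notice that, in most cases, the linear relationships that the `analogy` function assumes between the embeddings learned by GPT-2 do not actually hold. Of the examples above, the first is trivial: because `w1` and `w3` are both "bank", the offset cancels and the target is exactly the embedding of "money". In the other two, both metrics simply return the third input word, which suggests that the offset between "money" and "bank" is too small to move the result away from `w3`.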
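
Because the nearest token can be one of the input words themselves (as with "library" and "school" above), Word2Vec-style analogy evaluations commonly exclude the input words from the candidates before taking the nearest neighbour. The sketch below shows one way that might look for the Euclidean variant. The helper name `closest_token_excluding` and the toy 2-D embedding table are hypothetical and not part of the notebook above, and whether this produces meaningful analogies on GPT-2's embeddings is not tested here.

```python
import torch

def closest_token_excluding(embedding, weight, exclude_ids):
    """Return the id of the row in `weight` nearest to `embedding`,
    skipping every id listed in `exclude_ids` (hypothetical helper)."""
    with torch.no_grad():
        distances = torch.norm(weight - embedding, dim=1)
        # Push excluded tokens infinitely far away so argmin never selects them.
        for token_id in exclude_ids:
            distances[token_id] = float("inf")
        return int(distances.argmin().item())

# Toy embedding table (made-up values): ids 0, 1, 2 stand in for w1, w2, w3.
toy_weight = torch.tensor([
    [0.0, 0.0],  # id 0: w1
    [0.1, 0.0],  # id 1: w2, only a small offset away from w1
    [5.0, 5.0],  # id 2: w3
    [5.5, 5.0],  # id 3: some other token near w3
])

# Same arithmetic as in `analogy`: w2 - w1 + w3.
target = toy_weight[1] - toy_weight[0] + toy_weight[2]

without_exclusion = closest_token_excluding(target, toy_weight, [])
with_exclusion = closest_token_excluding(target, toy_weight, [0, 1, 2])

print("No exclusion:", without_exclusion)   # lands back on w3
print("Inputs excluded:", with_exclusion)   # nearest remaining token
```

In this toy setup the small offset leaves the target next to `w3`, mirroring the behaviour seen above; excluding the inputs forces the search to return the nearest remaining token instead.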