# Contextual Word Embeddings with ModernBERT

*Parts of this notebook are based on the DutchSimLex code of Lizzy Brans: https://github.com/lizzybrans/Simlex999-Dutch. Thanks to her!*

We will run the same type of word embedding evaluation as last week, but now with ModernBERT, a recent bidirectional encoder language model that updates the original BERT. ModernBERT requires a recent version of the Transformers library (4.48.0 or newer), so you may have to install or upgrade Transformers and PyTorch first:

In [None]:
# %pip install torch
# %pip install transformers

In [2]:
import transformers
import torch
import numpy as np
import scipy
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt  # needed for the plots further down
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

In [None]:
from transformers import AutoTokenizer, ModernBertModel
import torch

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = ModernBertModel.from_pretrained("answerdotai/ModernBERT-base")

### Other models

See here for pre-trained models available via Huggingface: https://huggingface.co/models

Here is a list of fill-mask models similar to BERT: https://huggingface.co/models?pipeline_tag=fill-mask&sort=trending

### Training a model

We aren't going to attempt training a BERT-based model from scratch this time, as this requires a lot of compute and data. Typically you start from a pre-trained model and, if needed, fine-tune it. You can find a variety of models for most major languages on Huggingface and use them with the Transformers library.

## Embedding single words in a contextual embedding model

Let's define a function that gets an embedding for a specific word from a specific layer (or from several layers, which are then averaged). The function splits the word into subtokens and takes the average of the subtoken vectors as the embedding for the word. We also define a function to calculate the similarity between a word pair.

In [None]:
def get_word_embedding(word, layer_nums):
    # Tokenize the word into subtokens and add special tokens [CLS] and [SEP]
    subtokens = [tokenizer.cls_token] + tokenizer.tokenize(word) + [tokenizer.sep_token]
    # Convert subtokens to input IDs
    input_ids = tokenizer.convert_tokens_to_ids(subtokens)
    # Wrap it in a tensor and add an extra batch dimension
    input_ids = torch.tensor(input_ids).unsqueeze(0)
    # Make sure the model does not compute gradients
    with torch.no_grad():
        # Get the model outputs
        outputs = model(input_ids, output_hidden_states=True)
    # Check if layer_nums is a list or a single integer
    if isinstance(layer_nums, int):
        layer_nums = [layer_nums]
    # Use the hidden state from the specified layers as word embedding
    embeddings = [outputs.hidden_states[i] for i in layer_nums]
    # Average the embeddings from the specified layers
    averaged_embedding = torch.mean(torch.stack(embeddings), dim=0)
    # Ignore the first and the last token ([CLS] and [SEP])
    averaged_embedding = averaged_embedding[0, 1:-1]
    # Get the mean of the subtoken vectors to get the word vector
    word_embedding = torch.mean(averaged_embedding, dim=0)
    # Convert tensor to a numpy array
    word_embedding = word_embedding.numpy()
    return word_embedding
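
To make the pooling steps in `get_word_embedding` easier to follow, here is a toy version with plain Python lists instead of tensors. The numbers and the helper `mean_vectors` are made up for illustration; real hidden states come from the model and have many more dimensions.

```python
# Toy illustration of the pooling in get_word_embedding, using plain lists.
# All numbers are invented; mean_vectors is a hypothetical helper.

def mean_vectors(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical hidden states for one word that was split into two subtokens.
# Positions: [CLS], subtoken 1, subtoken 2, [SEP]; 3 dimensions each.
layer_a = [[9.0, 9.0, 9.0], [1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [7.0, 7.0, 7.0]]
layer_b = [[8.0, 8.0, 8.0], [3.0, 2.0, 1.0], [5.0, 4.0, 3.0], [6.0, 6.0, 6.0]]

# Step 1: average the selected layers, position by position.
layer_mean = [mean_vectors([a, b]) for a, b in zip(layer_a, layer_b)]

# Step 2: drop the special tokens at the first and last position.
subtoken_vectors = layer_mean[1:-1]

# Step 3: average the remaining subtoken vectors into one word vector.
word_vector = mean_vectors(subtoken_vectors)
print(word_vector)
```

The special-token vectors never reach the final average, which is why their (deliberately large) values have no effect on the result.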

In [None]:
def calculate_similarity(word1, word2, layer_nums):
    word1_embedding = get_word_embedding(word1, layer_nums)
    word2_embedding = get_word_embedding(word2, layer_nums)
    similarity = 1 - cosine(word1_embedding, word2_embedding)
    return similarity
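
SciPy's `cosine` returns the cosine *distance*, which is why `calculate_similarity` subtracts it from 1. As a rough sketch of what that similarity measures, here is a standard-library version; `cosine_similarity` is a name chosen for this illustration, and the vectors are toy values.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Same direction, different length: similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Opposite direction: similarity close to -1.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Note that the value only depends on the direction of the vectors, not on their length, and that it can be negative.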


Now, we can perform the same kinds of similarity queries as last time:

In [7]:
# similarity queries (cosine similarity: higher means more similar, with 1 for vectors pointing in the same direction)
pairs = [
    ('car', 'minivan'),   # a minivan is a kind of car
    ('car', 'bicycle'),   # still a wheeled vehicle
    ('car', 'airplane'),  # ok, no wheels, but still a vehicle
    ('car', 'cereal'),    # ... and so on
    ('car', 'communism'),
]
for w1, w2 in pairs:
    print('%r\t%r\t%.2f' % (w1, w2, calculate_similarity(w1, w2, 0)))

'car'	'minivan'	0.67
'car'	'bicycle'	0.69
'car'	'airplane'	0.65
'car'	'cereal'	0.12
'car'	'communism'	0.04


We choose to get similarities from layer 0, which is the output of the embedding layer before any Transformer block, as this typically works best for words without context.

## Evaluating the model

Let's once again evaluate this model using the wordSim-353 benchmark, as we did in Notebook 4. This time we don't need to restrict ourselves to words for which we have enough data: the model should be able to embed every word in the benchmark.

This may take a while, as every word in the benchmark has to be passed through this large, modern model.

In [None]:
wordSim353 = dict()

with open("data/wordSim353.csv","r") as infile:
    next(infile) #skip header
    for line in infile:
        w1, w2, rating = line.strip().split(",")
        # Unlike last week, we do not append a POS suffix such as "-NOUN":
        # ModernBERT's tokenizer should receive the plain word.
        wordSim353[(w1, w2)] = float(rating)

In [None]:
rhos = []
measures = []

layer_num = 0

print("ModernBERT-based space vs. wordSim353 -> spearman's rho:\t")

wordSim_ratings = []
vsm_sims_l0 = []

for idx, ((w1, w2), r) in enumerate(wordSim353.items()):
    wordSim_ratings.append(r)
    vsm_sims_l0.append(calculate_similarity(w1, w2, layer_num))
    
    #Let's add a progress counter as this takes a while.
    if idx % 50 == 0 and idx > 0:
        print(f"Embedded {idx} word pairs...")

rho, pval = scipy.stats.spearmanr(wordSim_ratings, vsm_sims_l0)

print(rho)
rhos.append(rho)
measures.append("ModernBERT")
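
As a reminder of what `spearmanr` computes, here is a minimal standard-library sketch: rank both lists (ties share the mean rank), then take the Pearson correlation of the ranks. The function names are chosen for this illustration, and the `human` ratings are invented; only the `model` values are borrowed from the similarity queries above.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

human = [9.0, 7.5, 6.0, 2.0, 0.5]       # hypothetical human ratings
model = [0.67, 0.69, 0.65, 0.12, 0.04]  # similarities from the queries above
rho = spearman_rho(human, model)
print(round(rho, 3))
```

Because only the ranks matter, multiplying all model similarities by a positive constant leaves rho unchanged.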

Plot this result.

In [None]:
plt.bar(range(0, len(rhos)), rhos)
plt.xticks(range(0, len(rhos)), measures, size='small', rotation='vertical')
plt.show()

### Layer-wise evaluation

We can also evaluate each layer of the model, and visualize the results. Note that this will take roughly 23 times as long as the previous part, because the whole evaluation is repeated for each of the 23 hidden states (the embedding output plus 22 Transformer layers). If you don't want to wait for that, you can skip to the "Error analysis" part, which only uses the previous result from layer 0.

This is why we don't do many exercises with modern LLMs in this course...

In [None]:
# Calculate the Spearman correlation for each layer
rho_layers = []

for layer_num in range(23):  # ModernBERT-base returns 23 hidden states: the embedding output plus 22 Transformer layers
    wordSim_ratings = []  # Reset both lists for each layer
    vsm_sims = []

    for (w1, w2), r in wordSim353.items():
        similarity = calculate_similarity(w1, w2, layer_num)
        vsm_sims.append(similarity)
        wordSim_ratings.append(r)

    rho, pval = scipy.stats.spearmanr(wordSim_ratings, vsm_sims)
    rho_layers.append(rho)

    print(f'Layer {layer_num} - Spearman correlation: {rho}')

Plot the results:

In [None]:
# Plot the per-layer correlations with a blue color palette
sns.set(style="whitegrid")
fig, ax = plt.subplots(figsize=(10, 6))
ax = sns.barplot(x=list(range(23)), y=rho_layers, palette="Blues_d", ax=ax, edgecolor='black')
ax.set_title('Spearman Correlation Across Transformer Layers', fontsize=14, fontweight='bold')
ax.set_xlabel('Transformer Layers', fontsize=12, fontweight='bold')
ax.set_ylabel('Spearman Correlation', fontsize=12, fontweight='bold')
ax.grid(True)
sns.despine()
plt.show()

### Error analysis

The model does worse than we might expect. Let's perform an error analysis of the layer 0 results to see where the largest differences between the model and human ratings are.

In [None]:
#Let's use a dataframe
wordsim = pd.DataFrame(wordSim353.items(), columns=['pair', 'rating-wordsim'])
wordsim['predicted-ModernBERT'] = vsm_sims_l0
wordsim['predicted-ModernBERT'] = wordsim['predicted-ModernBERT']*10 #rescale the predictions to be 0-10
wordsim['abs_diff-ModernBERT'] = abs(wordsim['rating-wordsim'] - wordsim['predicted-ModernBERT'])
largest_diff = wordsim.nlargest(10, 'abs_diff-ModernBERT')
print(largest_diff)
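
Multiplying the cosine similarity by 10 puts the predictions in the same numeric range as the ratings, but the two scales are not calibrated against each other. One possible alternative, sketched here as an assumption and not as the method used in this notebook, is to compare rank positions, which is closer to what Spearman's rho measures. The word pairs, ratings and similarities below are invented, and `ranks` is a hypothetical helper.

```python
# Hypothetical data, for illustration only.
pairs = ["tiger-cat", "stock-market", "king-cabbage", "cup-coffee"]
human = [7.0, 8.0, 0.5, 6.5]           # invented human ratings (0-10)
predicted = [0.40, 0.15, 0.10, 0.30]   # invented cosine similarities

def ranks(values):
    """Rank positions from 1 (lowest) to n (highest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

human_ranks = ranks(human)
predicted_ranks = ranks(predicted)

# Absolute difference in rank position per word pair.
rank_gaps = {
    pair: abs(h - p)
    for pair, h, p in zip(pairs, human_ranks, predicted_ranks)
}

# The pair whose position in the model's ordering deviates most from the human ordering.
worst = max(rank_gaps, key=rank_gaps.get)
print(worst, rank_gaps[worst])
```

With the real data, the same idea could be applied to the `rating-wordsim` and `predicted-ModernBERT` columns of the dataframe.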

---

### Exercise

1. Sometimes, it is possible to get better results by combining the embeddings of multiple layers. Note that `get_word_embedding` already accepts a list of layer numbers and averages those layers. Can you improve on this result by finding some layer combination that correlates better with the human benchmark?
2. Try to evaluate some other fill-mask models from https://huggingface.co/models?pipeline_tag=fill-mask&sort=trending, such as BERT (the classic), RoBERTa or XLNet. Are any of them better at this task than the more recent ModernBERT? They'll certainly be faster.
3. Try to evaluate using the SimLex-999 benchmark, instead of wordSim-353.

In [23]:
# your code here