# Similarity Evaluation

> **Note**: This notebook is compatible with both Google Colab and local Jupyter environments. Colab-specific sections are clearly marked.

This notebook evaluates and compares two embedding models trained on biomedical text:

- **Word-level Skip-Gram** (trained on whole, lemmatized words)
- **Subword-level Skip-Gram** (trained on Byte Pair Encoded tokens)

Using the UMNSRS similarity dataset, we assess how well each model captures semantic similarity between biomedical terms. We evaluate on both the full dataset and a filtered subset to ensure a fair comparison.

In [13]:
!pip install -U nltk datasets gensim swifter tokenizers --no-cache-dir



In [14]:
import sys
import os
if 'google.colab' in sys.modules:
    from google.colab import drive
    drive.mount('/content/drive')

    project_path = '/content/drive/MyDrive/NLP_Projects/Week_3/word-embeddings-playground'
    if os.path.exists(project_path):
        os.chdir(project_path)
        print(f"Changed working directory to: {project_path}")
    else:
        raise FileNotFoundError(f"Project path not found: {project_path}")
else:
    print("Not running in Colab — skipping Drive mount.")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
Changed working directory to: /content/drive/MyDrive/NLP_Projects/Week_3/word-embeddings-playground


## Import Libraries

In [15]:
# import libraries
import pandas as pd
import numpy as np

from gensim.models import Word2Vec, KeyedVectors

from datasets import load_dataset

from sklearn.metrics.pairwise import cosine_similarity

from tokenizers import Tokenizer

from scipy.stats import spearmanr, pearsonr

### Load Embeddings, Tokenizer, and Evaluation Dataset

This section loads the pretrained word embedding models, the BPE tokenizer, and the UMNSRS biomedical similarity dataset for evaluation.

- **Embeddings**:
  - `skip_gram_embeddings`: Word-level Skip-Gram vectors
  - `bpe_embeddings`: Subword-level Skip-Gram vectors trained on BPE tokens

- **Tokenizer**:
  - Trained BPE tokenizer (used for segmenting terms into subwords)

- **Evaluation Dataset**:
  - [UMNSRS (University of Minnesota Semantic Relatedness Standard)](https://huggingface.co/datasets/bigbio/umnsrs)
  - Contains pairs of biomedical terms with human-annotated similarity scores

In [17]:
bpe_embeddings = KeyedVectors.load('./results/bpe_skipgram/bpe_embeddings.embeddings')
skip_gram_embeddings = KeyedVectors.load('./results/skipgram/skipgram.embeddings')

bpe_tokenizer = Tokenizer.from_file('./data/bpe_skipgram/bpe_tokenizer.json')

if os.path.exists('./data/umnsrs_data.csv'):
    data = pd.read_csv('./data/umnsrs_data.csv')
else:
    data = load_dataset('bigbio/umnsrs')
    data = pd.DataFrame(data['train'])
    data = data[['text_1', 'text_2', 'mean_score']]

    data.to_csv('./data/umnsrs_data.csv', header = True, index = False)

### Evaluation Functions

This section defines utility functions used to compute model-based similarity scores and evaluate embedding quality against the UMNSRS biomedical dataset.

- **`get_bpe_embedding()`**  
  Computes the embedding for a word using BPE subword tokens. If the full word is not in the vocabulary, it averages the embeddings of its BPE tokens.

- **`get_similarity()`**  
  Calculates the cosine similarity between two terms using either word-level or BPE-based embeddings, depending on the `bpe` flag.

- **`evaluate()`**  
  Iterates over all term pairs in the dataset, computes similarity scores using the chosen embedding model, and compares them with human-annotated similarity scores using Spearman and Pearson correlation.

These functions form the backbone of the quantitative evaluation used to compare the two embedding approaches.
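
To make the averaging step concrete, here is a minimal, standalone sketch of mean-pooling subword vectors and comparing the results with cosine similarity. The three-dimensional `toy_vectors`, the token splits, and the helper names `mean_pool` and `cosine` are all hypothetical and exist only for illustration; the notebook itself uses the trained BPE vectors and `sklearn`'s `cosine_similarity`.

```python
import math

# Hypothetical 3-dimensional subword vectors (assumption: real ones come from the trained model).
toy_vectors = {
    "cardio": [1.0, 0.0, 1.0],
    "myo": [0.0, 1.0, 1.0],
    "pathy": [1.0, 1.0, 0.0],
}

def mean_pool(tokens, vectors):
    """Average the vectors of the tokens found in the vocabulary; None if none are found."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical token splits; a real BPE tokenizer decides these.
cardiomyopathy = mean_pool(["cardio", "myo", "pathy"], toy_vectors)
myopathy = mean_pool(["myo", "pathy"], toy_vectors)
unknown = mean_pool(["zzz"], toy_vectors)  # no token in the vocabulary

similarity = cosine(cardiomyopathy, myopathy)
print(similarity, unknown)
```

Because the pooled vector is a plain average, two terms that share most of their subwords end up close together regardless of meaning, which is one reason the averaged representation can be noisy.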

In [18]:
def get_bpe_embedding(word, tokenizer, embeddings):
  """
  Retrieves the embedding for a given word using BPE subword embeddings.

  If the word exists in the embedding vocabulary, its embedding is returned directly.
  Otherwise, the word is tokenized into subwords using the BPE tokenizer, and the
  average of the available subword embeddings is returned.

  Args:
      word (str): The word to retrieve an embedding for.
      tokenizer (tokenizers.Tokenizer): A pretrained Hugging Face BPE tokenizer.
      embeddings (gensim KeyedVectors): Trained BPE word vector model.

  Returns:
      np.ndarray or None: The embedding vector for the word, or None if no tokens are found in the vocabulary.
  """
  if word in embeddings:
    return embeddings[word]
  else:
    tokens = tokenizer.encode(word).tokens
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if vectors:
      return np.mean(vectors, axis = 0)
    else:
      return None

In [19]:
def get_similarity(w1, w2, embeddings, bpe = False):
  """
  Computes the cosine similarity between two biomedical terms using either word-level
  or subword-level (BPE) embeddings.

  - If `bpe=True`, each word is converted to its BPE-based embedding using the average
    of its subword vectors (via `get_bpe_embedding`).
  - If `bpe=False`, the words must exist in the embedding vocabulary directly.

  Args:
      w1 (str): First word or term.
      w2 (str): Second word or term.
      embeddings (gensim KeyedVectors): Word or BPE embedding model.
      bpe (bool, optional): Whether to use BPE embeddings. Defaults to False.

  Returns:
      float or None: Cosine similarity between the two terms, or None if one or both embeddings are unavailable.
  """
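  if bpe:
    # Out-of-vocabulary terms fall back to averaged subword vectors.
    # Note: this relies on the notebook-level `bpe_tokenizer` loaded above.
    w1 = get_bpe_embedding(w1, bpe_tokenizer, embeddings)
    w2 = get_bpe_embedding(w2, bpe_tokenizer, embeddings)
    # Return None (as documented) instead of failing inside cosine_similarity.
    if w1 is None or w2 is None:
      return None
  else:
    if w1 in embeddings and w2 in embeddings:
      w1 = embeddings[w1]
      w2 = embeddings[w2]
    else:
      return None
  return cosine_similarity([w1], [w2])[0][0]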

In [20]:
def evaluate(data, embeddings, bpe = False):
  """
  Evaluates a word embedding model by computing similarity correlations against
  human-annotated biomedical term pairs.

  - For each term pair in the dataset, the model-based cosine similarity is computed.
  - If `bpe=True`, subword-level embeddings are used via `get_bpe_embedding`.
  - Only pairs where both terms have valid embeddings are included.

  Evaluation metrics:
  - **Spearman correlation**: Measures rank agreement with human scores
  - **Pearson correlation**: Measures linear correlation with human scores
  - **Number Evaluated**: Number of pairs included in the evaluation

  Args:
      data (pd.DataFrame): Evaluation dataset with columns `text_1`, `text_2`, and `mean_score`.
      embeddings (gensim KeyedVectors): Word or subword embedding model.
      bpe (bool, optional): Whether to use BPE-based subword embeddings. Defaults to False.

  Returns:
      inds (List[int]): Index labels of the rows that were successfully evaluated.
      results (pd.Series): Spearman correlation, Pearson correlation, and number of evaluated pairs, in that order.
  """
  similarities = []
  human_scores = []
  inds = []
  for i, row in data.iterrows():
    sim = get_similarity(row['text_1'].lower(), row['text_2'].lower(), embeddings, bpe)
    if sim is not None:
      inds.append(i)
      similarities.append(sim)
      human_scores.append(row['mean_score'])

  spearman_score = spearmanr(similarities, human_scores)[0]
  pearson_score = pearsonr(similarities, human_scores)[0]
  number_evaluated = len(similarities)

  print('Spearman:', spearman_score)
  print('Pearson:', pearson_score)
  print('Number Evaluated:', number_evaluated)
  return inds, pd.Series([spearman_score, pearson_score, number_evaluated])
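
The difference between the two metrics can be seen on a small made-up example. The sketch below reimplements both in plain Python (Spearman as Pearson computed on average ranks, which is the standard definition); the helper names and the `human` and `model` scores are hypothetical, and the notebook itself relies on `scipy.stats.spearmanr` and `pearsonr`.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(x), average_ranks(y))

# Made-up scores: the model orders the pairs perfectly, but not on a linear scale.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
model = [0.10, 0.12, 0.15, 0.40, 0.90]

rho = spearman(model, human)
r = pearson(model, human)
print(rho, r)
```

Here the ranking is perfect, so Spearman is 1, while Pearson is lower because the relationship is not linear. Cosine similarities and human ratings live on different scales, which is why rank agreement is usually the more informative of the two for this kind of evaluation.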

## Quantitative Evaluation on UMNSRS Dataset

We evaluate both the word-level Skip-Gram model and the subword-level BPE model on the UMNSRS biomedical similarity dataset.

- **Skip-Gram**: Only evaluates term pairs where both full words exist in the vocabulary.
- **BPE**: Can evaluate far more pairs by averaging subword embeddings when full words are missing.
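
The coverage rule for the word-level model can be sketched in isolation: a pair counts only when both lowercased terms are in the vocabulary. The helper `pair_coverage`, the `toy_vocab` set, and the `toy_pairs` list are hypothetical; in the notebook the same membership test is written as `w1 in embeddings and w2 in embeddings`.

```python
def pair_coverage(pairs, vocabulary):
    """Return (covered, total); a pair is covered only if both terms are in the vocabulary."""
    covered = sum(
        1 for a, b in pairs
        if a.lower() in vocabulary and b.lower() in vocabulary
    )
    return covered, len(pairs)

# Made-up vocabulary and term pairs, for illustration only.
toy_vocab = {"aspirin", "ibuprofen", "fever"}
toy_pairs = [
    ("Aspirin", "Ibuprofen"),       # both in vocabulary
    ("aspirin", "acetaminophen"),   # second term missing
    ("fever", "pyrexia"),           # second term missing
    ("fever", "aspirin"),           # both in vocabulary
]

covered, total = pair_coverage(toy_pairs, toy_vocab)
print(f"{covered}/{total} pairs covered")
```

A single missing term removes the whole pair, so pair-level coverage drops faster than term-level coverage.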

### Full Dataset Evaluation

- **Skip-Gram**
  - Spearman: **0.361**
  - Pearson: **0.452**
  - Term Pairs Evaluated: 33

- **BPE**
  - Spearman: **0.090**
  - Pearson: **0.153**
  - Term Pairs Evaluated: 566

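BPE covers far more pairs (566 versus 33), but its correlation with human judgments over those pairs is much lower. Because the two scores are computed on different sets of pairs, they are not directly comparable.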

In [21]:
skip_gram_inds, skip_gram_results = evaluate(data, skip_gram_embeddings, bpe = False)

Spearman: 0.36115985753236607
Pearson: 0.4515940312593645
Number Evaluated: 33


In [22]:
bpe_inds, bpe_results = evaluate(data, bpe_embeddings, bpe = True)

Spearman: 0.0899903339261131
Pearson: 0.15292074875644843
Number Evaluated: 566


### ⚖️ Filtered Evaluation (Skip-Gram-Compatible Pairs Only)

To isolate model quality from vocabulary coverage, we compare both models on the **exact 33 term pairs** that the Skip-Gram model was able to process.

- **Skip-Gram**
  - Spearman: **0.361**
  - Pearson: **0.452**

- **BPE**
  - Spearman: **0.333**
  - Pearson: **0.451**

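On this shared subset, BPE performs comparably to Skip-Gram (Spearman 0.333 vs. 0.361; Pearson 0.451 vs. 0.452). This suggests that BPE's lower score on the full dataset is driven largely by the pairs that Skip-Gram cannot score at all. With only 33 pairs in the subset, small differences between the two models should not be over-interpreted.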
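
One rough way to gauge how much a correlation over 33 pairs can vary by chance is a Fisher z-transform confidence interval. The sketch below is an illustration rather than part of the original evaluation: the helper `fisher_ci` is hypothetical, the method assumes roughly bivariate-normal data, and it applies to Pearson's r (for Spearman it is only a looser approximation).

```python
import math

def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a Pearson correlation r over n pairs.

    Uses the Fisher z-transform: atanh(r) is roughly normal with
    standard error 1 / sqrt(n - 3). z_crit=1.96 gives a ~95% interval.
    """
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)

# Skip-Gram Pearson score on the shared subset (rounded value from the results above).
low, high = fisher_ci(0.452, 33)
print(f"95% CI: [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval runs from roughly 0.13 to 0.69, which is far wider than the gap between the two models on this subset. The two correlations are computed on the same pairs and are therefore not independent, so this is only a coarse sanity check, not a formal test of the difference.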

In [23]:
# `skip_gram_inds` holds index labels from `iterrows()`, so select by label with `.loc`.
filtered_data = data.loc[skip_gram_inds, :]

In [24]:
_, skip_gram_results_filtered = evaluate(filtered_data, skip_gram_embeddings, bpe = False)

Spearman: 0.36115985753236607
Pearson: 0.4515940312593645
Number Evaluated: 33


In [25]:
_, bpe_results_filtered = evaluate(filtered_data, bpe_embeddings, bpe = True)

Spearman: 0.3327483925714673
Pearson: 0.4509520721906824
Number Evaluated: 33


In [None]:
results_df = pd.DataFrame([
    ['Skip-Gram (Full Eval)', skip_gram_results[0], skip_gram_results[1], skip_gram_results[2]],
    ['BPE (Full Eval)', bpe_results[0], bpe_results[1], bpe_results[2]],
    ['Skip-Gram (Filtered)', skip_gram_results_filtered[0], skip_gram_results_filtered[1], skip_gram_results_filtered[2]],
    ['BPE (Filtered)', bpe_results_filtered[0], bpe_results_filtered[1], bpe_results_filtered[2]],
], columns=['Model', 'Spearman', 'Pearson', 'Num Evaluated'])

results_df.to_csv('./results/evaluation_summary.csv', index = False, header = True)

## Takeaways

- The **Skip-Gram model** correlated slightly better with human similarity judgments on the shared subset (Spearman 0.361 vs. 0.333), but vocabulary constraints limited it to 33 term pairs.
- The **BPE model** scored 566 term pairs, demonstrating much better coverage, and performed competitively on the shared subset, although its correlation over all 566 pairs was low (Spearman 0.090).
- Subword-based embeddings offer a **scalable and flexible alternative** for biomedical text, particularly when dealing with rare or morphologically complex terms.