## Language Model Representations of Ambiguous (Spanish) Nouns in Context

Here, we load a monolingual, Spanish-trained large language model (LLM) called [BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) and use it to compute vector representations of ambiguous Spanish target nouns. To do so, we also load a dataframe of Spanish sentence pairs, where each pair contains an ambiguous target noun whose sense is disambiguated by either an adjective or a verb (termed the context cue). The context cue is the only difference between the two sentences of a pair. Context cues were chosen so that some sentence pairs evoke the same sense of the target word, while others evoke different (homonymous or polysemous) senses.

We tokenize each sentence, run it through BETO, and extract the vector representation, or embedding, of the target noun from each of BETO's layers. We then compute and store the cosine distance between the target word embeddings from the first and second sentences of each pair.
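
As a minimal sketch of the distance measure: cosine distance is one minus the cosine similarity of two vectors, so it is near 0 for vectors that point in the same direction and 1 for orthogonal vectors. The toy 3-dimensional vectors below are made up for illustration, and `cosine_distance` is a hypothetical helper; the notebook itself uses `scipy.spatial.distance.cosine`, which computes the same quantity.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Toy "embeddings" (made up for illustration)
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])

print(same_direction)  # approximately 0: the vectors point the same way
print(orthogonal)      # 1.0: the vectors are orthogonal
```

A small distance between the two target-word embeddings of a pair therefore means the model represents the word similarly in both sentences.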

In [2]:
%reset
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # makes figs nicer!

import functools
import itertools
import os
import torch
import transformers

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


from scipy.spatial.distance import cosine
from tqdm.notebook import tqdm
from transformers import AutoTokenizer


sns.set(style='whitegrid',font_scale=1.2)



### define useful custom functions

In [3]:
### Define useful custom functions to ...

### ... find the target tokens within tokenized sequence
def find_sublist_index(mylist, sublist):
    """Find the first occurrence of sublist in mylist.
    Return the start and end indices of sublist in mylist, or None if it is absent"""

    for i in range(len(mylist)):
        if mylist[i] == sublist[0] and mylist[i:i+len(sublist)] == sublist:
            return i, i+len(sublist)
    return None


### ... grab the embeddings for your target tokens
@functools.lru_cache(maxsize=None)  # This will cache results, handy later...
def get_embedding(model, tokenizer, sentence, target, layer, device):
    """Get a token embedding for target in sentence"""
    
    # Tokenize sentence
    inputs = tokenizer(sentence, return_tensors="pt").to(device)
    
    # Tokenize target
    target_enc = tokenizer.encode(target, return_tensors="pt",
                                  add_special_tokens=False).to(device)
    
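    # Get indices of target in input tokens
    target_inds = find_sublist_index(
        inputs["input_ids"][0].tolist(),
        target_enc[0].tolist()
    )
    # Fail with a clear message if the target's tokens are not in the sentence
    if target_inds is None:
        raise ValueError(f"Target '{target}' not found in sentence: '{sentence}'")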

    # Run model
    with torch.no_grad():
        output = model(**inputs)
        hidden_states = output.hidden_states

    # Get layer
    selected_layer = hidden_states[layer][0]

    #grab just the embeddings for your target word's token(s)
    token_embeddings = selected_layer[target_inds[0]:target_inds[1]]

    #if a word is represented by >1 tokens, take mean
    #across the multiple tokens' embeddings
    embedding = torch.mean(token_embeddings, dim=0)
    
    return embedding
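
To make the target-location step concrete, here is a small sketch with made-up token IDs (not real BETO vocabulary IDs). It re-declares a simplified `find_sublist_index` so the snippet is self-contained, locates a two-token target inside a toy sequence, and averages the corresponding rows of a toy hidden-state matrix, mirroring how `get_embedding` pools a word that is split into several tokens. Plain Python lists stand in for tensors here.

```python
def find_sublist_index(mylist, sublist):
    """Return (start, end) of the first occurrence of sublist in mylist, else None."""
    for i in range(len(mylist)):
        if mylist[i:i + len(sublist)] == sublist:
            return i, i + len(sublist)
    return None

# Made-up token IDs: start marker, three words (the last split into two tokens), end marker
input_ids = [4, 1510, 2033, 877, 912, 5]
target_ids = [877, 912]

span = find_sublist_index(input_ids, target_ids)
print(span)  # (3, 5)

# Toy hidden states: one 2-dimensional vector per token
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [1.0, 3.0], [3.0, 5.0], [9.0, 9.0]]
target_vectors = hidden[span[0]:span[1]]

# Mean across the target's tokens, dimension by dimension
embedding = [sum(dim) / len(target_vectors) for dim in zip(*target_vectors)]
print(embedding)  # [2.0, 4.0]

# A target that is absent from the sequence yields None
print(find_sublist_index(input_ids, [42]))  # None
```

Note that the end index is exclusive, so it can be used directly in a slice.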

### load your model and tokenizer

In [None]:
### Models
MODELS = ["dccuchile/bert-base-spanish-wwm-cased",
         "bert-base-multilingual-cased",
         "xlm-roberta-base"]

In [20]:
### Define the Hugging Face model ID for BETO

mpath = "dccuchile/bert-base-spanish-wwm-cased"
## Use "bert-base-multilingual-cased" for the multilingual BERT comparison


### Decide which device you want the models to run in
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

### Load your model & tokenizer

# BETO is a masked language model, so load it with a masked-LM head
# (loading it as a causal LM triggers a `BertLMHeadModel` `is_decoder` warning)
model = transformers.AutoModelForMaskedLM.from_pretrained(mpath,
                                                          output_hidden_states=True)
model.to(device) # allocate model to desired device

tokenizer = transformers.AutoTokenizer.from_pretrained(mpath)


### load the dataframe of sentence pairs

In [8]:
df = pd.read_csv("../data/raw/items/sawc_sentence_pairs.csv")

In [9]:
df.shape[0] # number of sentence pairs

812

### compute cosine distances

We compute the distance for each target word within a pair of sentences, at each model layer.

In [None]:
### Get the number of layers directly from the model specifications

if mpath == "xlm-roberta-base":
    n_layers = len(model.base_model.encoder.layer)
else:
    n_layers = len(model.bert.encoder.layer)
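
The loop that follows iterates over `n_layers + 1` indices because, with `output_hidden_states=True`, a BERT-style model returns the embedding-layer output followed by one tensor per encoder layer. A minimal sketch that checks this on a tiny, randomly initialized `BertModel` (the configuration values are arbitrary assumptions chosen to keep it fast, and no pretrained weights are downloaded):

```python
import torch
from transformers import BertConfig, BertModel

# Tiny, randomly initialized model; these sizes are arbitrary, not BETO's
config = BertConfig(vocab_size=100, hidden_size=32, num_hidden_layers=2,
                    num_attention_heads=2, intermediate_size=64)
tiny_model = BertModel(config)
tiny_model.eval()

input_ids = torch.tensor([[1, 2, 3, 4]])  # one made-up sequence of 4 token IDs
with torch.no_grad():
    output = tiny_model(input_ids=input_ids, output_hidden_states=True)

n_layers_tiny = len(tiny_model.encoder.layer)
print(n_layers_tiny)                  # 2
print(len(output.hidden_states))      # 3: embedding output + 2 encoder layers
print(output.hidden_states[0].shape)  # torch.Size([1, 4, 32])
```

Index 0 therefore corresponds to the embedding layer, before any self-attention, and index `n_layers` to the final encoder layer.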

In [14]:
results = []

for layer in range(n_layers+1): # hidden_states has n_layers+1 entries (embedding output + one per layer); `range` excludes its end value
    for (ix, row) in tqdm(df.iterrows(), total=df.shape[0]):

        ### Get embeddings for S1 and S2
        s1 = get_embedding(model, tokenizer, row['Sentence_1'], row['Word'],layer, device)
        s2 = get_embedding(model, tokenizer, row['Sentence_2'], row['Word'],layer, device)

        ### Now calculate cosine distance 
        #.  note, tensors need to be copied to cpu to make this run
        #.  still faster to do this copy than to just have everything
        #.  running on the cpu
        if device.type == "mps":  
            model_cosine = cosine(s1.cpu(), s2.cpu())
            
        else: 
            model_cosine = cosine(s1, s2)


        ### Figure out how many tokens you're
        ### comparing across sentences
        n_tokens_s1 = len(tokenizer.encode(row['Sentence_1']))
        n_tokens_s2 = len(tokenizer.encode(row['Sentence_2']))

        ### Add to results list
        results.append({
            'Sentence_1': row['Sentence_1'],
            'Sentence_2': row['Sentence_2'],
            'Word': row['Word'],
            'Same_sense': row['Same_sense'],
            'Distance': model_cosine,
            'Layer': layer,
            'S1_ntokens': n_tokens_s1,
            'S2_ntokens': n_tokens_s2
        })
        
df_results = pd.DataFrame(results)
df_results['token_diffs'] = np.abs(df_results['S1_ntokens'].values-df_results['S2_ntokens'].values)



### hurray! save your cosine distances to load in the next notebook

In [15]:
### Save your cosine distance results 

savepath = "../data/processed/models/"
os.makedirs(savepath, exist_ok=True)  # also creates any missing parent directories


### Name the output file after the model loaded in `mpath` (e.g., "beto" or "xlm")
df_results.to_csv(os.path.join(savepath,"xlm_sawc_distances.csv"), index=False)
# df_results.to_csv(os.path.join(savepath,"beto_sawc_distances.csv"), index=False)