## Language Model Representations of Ambiguous (Spanish) Nouns in Context

Here, we load monolingual (Spanish-trained) and multilingual large language models (LLMs) from the [BERT/BETO](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) family, and use them to compute vector representations of ambiguous Spanish target nouns. To do so, we also load a dataframe of Spanish sentence pairs, in which each sentence contains an ambiguous target noun whose sense is disambiguated by either an adjective or a verb (termed the context cue). The context cue is the only difference between the two sentences of a pair. Context cues were chosen so that a sentence pair evokes either the same sense of the target word **or** different (homonymous or polysemous) senses of the target word.

We run the tokenized version of each sentence through BETO and its variants, and extract the vector representation (embedding) of the target noun from each of the model's layers. We then compute and store the cosine distance between the target-word embeddings from the first and second sentences of each pair.
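
As a minimal sketch of the distance measure, the snippet below computes cosine distance (one minus cosine similarity) for two toy vectors in plain Python. The function name `cosine_distance` and the example vectors are illustrative assumptions; the notebook itself applies `scipy.spatial.distance.cosine` to the model embeddings.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_u * norm_v)

# Toy "embeddings" standing in for the target noun in two sentences
same_direction = cosine_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
print(round(abs(same_direction), 6), round(orthogonal, 6))
```

Vectors that point in the same direction give a distance near 0 regardless of their length, while orthogonal vectors give a distance of 1.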

Here is a list of models we examined: 


### <font color='hotpink'>TODO: add list of models and their corresponding paper citations</font> 

* **BETO-cased** .... Cañete et al. (2020) *ICLR* - [hugging-face](https://huggingface.co/dccuchile/bert-base-spanish-wwm-cased) - [paper](https://arxiv.org/pdf/2308.02976.pdf)
* **BETO-uncased** .... Cañete et al. (2020) *ICLR* - [hugging-face](https://huggingface.co/dccuchile/bert-base-spanish-wwm-uncased) - [paper](https://arxiv.org/pdf/2308.02976.pdf)
* **BERT-base-multilingual-cased** .... Devlin et al. (2019) *arXiv* - [hugging-face](https://huggingface.co/google-bert/bert-base-multilingual-cased) - [training-details](https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages) - [paper](https://arxiv.org/pdf/1810.04805.pdf)
* **RoBERTa-BNE-base (MarIA)** .... Gutiérrez-Fandiño et al. (2022) *Procesamiento del Lenguaje Natural* - [hugging-face](https://huggingface.co/PlanTL-GOB-ES/roberta-base-bne) - [paper](https://arxiv.org/pdf/2107.07253.pdf)
* **RoBERTa-BNE-large (MarIA)** .... Gutiérrez-Fandiño et al. (2022) *Procesamiento del Lenguaje Natural* - [hugging-face](https://huggingface.co/PlanTL-GOB-ES/roberta-large-bne) - [training-details](https://github.com/PlanTL-GOB-ES/lm-spanish?tab=readme-ov-file) - [paper](https://arxiv.org/pdf/2107.07253.pdf)
* **XLM-RoBERTa-base** .... Conneau et al. (2020) *ACL* - [hugging-face](https://huggingface.co/FacebookAI/xlm-roberta-base) - [training-details](https://github.com/facebookresearch/fairseq/tree/main/examples/xlmr) - [paper](https://arxiv.org/pdf/1911.02116.pdf)
* **ALBETO-tiny** .... Cañete et al. (2023) *arXiv* - [hugging-face](https://huggingface.co/dccuchile/albert-tiny-spanish) - [paper](https://arxiv.org/pdf/2204.09145.pdf)
* **ALBETO-base** .... Cañete et al. (2023) *arXiv* - [hugging-face](https://huggingface.co/dccuchile/albert-base-spanish) - [paper](https://arxiv.org/pdf/2204.09145.pdf)
* **ALBETO-large** .... Cañete et al. (2023) *arXiv* - [hugging-face](https://huggingface.co/dccuchile/albert-large-spanish) - [paper](https://arxiv.org/pdf/2204.09145.pdf)
* **ALBETO-xlarge** .... Cañete et al. (2023) *arXiv* - [hugging-face](https://huggingface.co/dccuchile/albert-xlarge-spanish) - [paper](https://arxiv.org/pdf/2204.09145.pdf)
* **ALBETO-xxlarge** .... Cañete et al. (2023) *arXiv* -  [hugging-face](https://huggingface.co/dccuchile/albert-xxlarge-spanish) - [paper](https://arxiv.org/pdf/2204.09145.pdf)
* **DistilBETO-uncased** .... Cañete et al. (2023) *arXiv* -  [hugging-face](https://huggingface.co/dccuchile/distilbert-base-spanish-uncased/tree/main) - [paper](https://arxiv.org/pdf/2204.09145.pdf)


In [1]:
%reset
%matplotlib inline
%config InlineBackend.figure_format = 'retina'  # makes figs nicer!

import functools
import itertools
import os
import torch
import transformers

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


from scipy.spatial.distance import cosine
from tqdm.notebook import tqdm
from transformers import AutoTokenizer


sns.set(style='whitegrid',font_scale=1.2)

Once deleted, variables cannot be recovered. Proceed (y/[n])? y


### define useful custom functions

In [15]:
### Define useful custom functions to ...

### ... find the target tokens within tokenized sequence
def find_sublist_index(mylist, sublist):
    """Find the first occurrence of sublist in mylist.
    Return the start and end indices of sublist in mylist, or None if absent."""

    for i in range(len(mylist)):
        if mylist[i] == sublist[0] and mylist[i:i+len(sublist)] == sublist:
            return i, i+len(sublist)
    return None

### ... grab the embeddings for your target tokens
@functools.lru_cache(maxsize=None)  # This will cache results, handy later...
def get_embedding(model, tokenizer, sentence, target, layer, device):
    """Get a token embedding for target in sentence"""
    
    # Tokenize sentence
    inputs = tokenizer(sentence, return_tensors="pt").to(device)
    
    # Tokenize target
    target_enc = tokenizer.encode(target, return_tensors="pt",
                                  add_special_tokens=False).to(device)
    
    # Get indices of target in input tokens
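    target_inds = find_sublist_index(
        inputs["input_ids"][0].tolist(),
        target_enc[0].tolist()
    )

    # Fail with a clear message if the target's tokens are not in the sentence
    if target_inds is None:
        raise ValueError(f"Target {target!r} not found in tokenized sentence: {sentence!r}")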

    # Run model
    with torch.no_grad():
        output = model(**inputs)
        hidden_states = output.hidden_states

    # Get layer
    selected_layer = hidden_states[layer][0]

    #grab just the embeddings for your target word's token(s)
    token_embeddings = selected_layer[target_inds[0]:target_inds[1]]

    #if a word is represented by >1 tokens, take mean
    #across the multiple tokens' embeddings
    embedding = torch.mean(token_embeddings, dim=0)
    
    return embedding

### ... grab the number of trainable parameters in the model

def count_parameters(model):
    """credit: https://stackoverflow.com/questions/49201236/check-the-total-number-of-parameters-in-a-pytorch-model"""
    
    total_params = 0
    for name, parameter in model.named_parameters():
        
        # if the param is not trainable, skip it
        if not parameter.requires_grad:
            continue
        
        # otherwise, count it towards your number of params
        params = parameter.numel()
        total_params += params
    print(f"Total Trainable Params: {total_params}")
    
    return total_params
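
To illustrate the token-matching and averaging steps above on toy data, the sketch below locates a multi-token target inside a list of token IDs and averages the corresponding vectors. All token IDs and vectors are made up for illustration, `locate_and_pool` is a hypothetical helper, and plain lists stand in for tensors. It also shows one way to fail loudly when the target is not found, which can happen when a tokenizer encodes a word differently in isolation than in context.

```python
def find_sublist_index(mylist, sublist):
    """Return (start, end) of the first occurrence of sublist in mylist, else None."""
    if not sublist:
        return None
    for i in range(len(mylist) - len(sublist) + 1):
        if mylist[i:i + len(sublist)] == sublist:
            return i, i + len(sublist)
    return None

def locate_and_pool(input_ids, target_ids, vectors):
    """Average the vectors of the target's tokens (hypothetical helper)."""
    span = find_sublist_index(input_ids, target_ids)
    if span is None:
        raise ValueError("target tokens not found in the tokenized sentence")
    start, end = span
    selected = vectors[start:end]
    dim = len(selected[0])
    # Mean over the target's tokens, one dimension at a time
    return [sum(vec[d] for vec in selected) / len(selected) for d in range(dim)]

# Made-up token IDs: the target word is split into two tokens (7, 8)
input_ids = [101, 5, 7, 8, 9, 102]
target_ids = [7, 8]
# One 2-dimensional toy vector per input token
vectors = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [4.0, 8.0], [5.0, 5.0], [0.0, 0.0]]

pooled = locate_and_pool(input_ids, target_ids, vectors)
print(pooled)  # mean of [2.0, 4.0] and [4.0, 8.0]
```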
    

### load the dataframe of sentence pairs

In [23]:
stimpath = "../data/raw/items/"
df = pd.read_csv(os.path.join(stimpath,"sawc_sentence_pairs.csv"))

df.shape[0] # number of sentence pairs

812

### load your models and tokenizers

In [41]:
### Define the url paths to download your desired models
#.  from Hugging Face

MODELS = ["dccuchile/bert-base-spanish-wwm-cased",
          "google-bert/bert-base-multilingual-cased",
          "FacebookAI/xlm-roberta-base",
          "dccuchile/albert-tiny-spanish",
          "dccuchile/albert-base-spanish",
          "dccuchile/albert-large-spanish",
          "dccuchile/albert-xlarge-spanish",
          "dccuchile/albert-xxlarge-spanish",
          "PlanTL-GOB-ES/roberta-base-bne",
          "PlanTL-GOB-ES/roberta-large-bne",
          "dccuchile/bert-base-spanish-wwm-uncased", 
          "dccuchile/distilbert-base-spanish-uncased"]



### compute cosine distances

We compute one cosine distance for each target word (that is, for each pair of sentences), at each model layer, for each model specified in the `MODELS` list.
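
The per-layer comparison described above can be sketched on toy data as follows. With `output_hidden_states=True`, Hugging Face models return the embedding output plus one hidden state per layer, so a model with `n_layers` layers yields `n_layers + 1` representations of the target word. In this sketch, nested lists stand in for those tensors, and `layerwise_distances` and the toy vectors are hypothetical names and values chosen for illustration.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def layerwise_distances(target_vecs_s1, target_vecs_s2):
    """Return one record per layer, comparing the target's vector across sentences.

    Each argument holds one target-word vector per layer (index 0 is the
    embedding output), so both must have the same length.
    """
    if len(target_vecs_s1) != len(target_vecs_s2):
        raise ValueError("both sentences must provide the same number of layers")
    return [
        {"Layer": layer, "Distance": cosine_distance(v1, v2)}
        for layer, (v1, v2) in enumerate(zip(target_vecs_s1, target_vecs_s2))
    ]

# Toy example: a 2-layer model gives 3 vectors per sentence (embeddings + 2 layers)
s1_by_layer = [[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]
s2_by_layer = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

records = layerwise_distances(s1_by_layer, s2_by_layer)
for record in records:
    print(record["Layer"], round(record["Distance"], 3))
```

Each record mirrors the `Layer` and `Distance` fields stored per sentence pair in the results table.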

In [None]:
### Iterate over models and do the work! 

for mpath in tqdm(MODELS[3:],colour="cornflowerblue"):

    ### Decide which device you want the models to run in
    
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

    ### Load your model & tokenizer
    
    model = transformers.AutoModel.from_pretrained(mpath,output_hidden_states=True)
    model.to(device) # allocate model to desired device

    tokenizer = transformers.AutoTokenizer.from_pretrained(mpath)  
    
    
    ### Get the number of layers & params directly from the model specifications
    
    # TODO: Double-check for all configurations
    
    n_layers = model.config.num_hidden_layers
    print("number of layers:", n_layers)

    n_params = count_parameters(model)

    results = []

    for layer in range(n_layers+1): # hidden_states has n_layers+1 entries: the embedding output plus one per layer
        for (ix, row) in tqdm(df.iterrows(), total=df.shape[0]):

            ### Get embeddings for S1 and S2

            # note: account for tokenization differences in RoBERTa Spanish monolinguals  by
            #.      adding a whitespace in front of the target word (otherwise, the function
            #.      `find_sublist_index` will not be able to identify the target token-s within
            #.      the tokenized sentence)
            
            if mpath in ["PlanTL-GOB-ES/roberta-base-bne", "PlanTL-GOB-ES/roberta-large-bne"]:
                target = " {w}".format(w = row['Word'])
            else:
                target = row['Word']

            s1 = get_embedding(model, tokenizer, row['Sentence_1'], target,layer, device)
            s2 = get_embedding(model, tokenizer, row['Sentence_2'], target,layer, device)

            ### Now calculate cosine distance 
            #.  note, tensors need to be copied to cpu to make this run;
            #.  still faster to do this copy than to just have everything
            #.  running on the cpu
            model_cosine = cosine(s1.cpu().numpy(), s2.cpu().numpy())


            ### Figure out how many tokens you're
            ### comparing across sentences
            n_tokens_s1 = len(tokenizer.encode(row['Sentence_1']))
            n_tokens_s2 = len(tokenizer.encode(row['Sentence_2']))

            ### Add this pair's results (as a dictionary) to the results list
            results.append({
                'Sentence_1': row['Sentence_1'],
                'Sentence_2': row['Sentence_2'],
                'Word': row['Word'],
                'Same_sense': row['Same_sense'],
                'Distance': model_cosine,
                'Layer': layer,
                'S1_ntokens': n_tokens_s1,
                'S2_ntokens': n_tokens_s2
            })

    df_results = pd.DataFrame(results)
    df_results['token_diffs'] = np.abs(df_results['S1_ntokens'].values-df_results['S2_ntokens'].values)
    df_results['n_params'] = np.repeat(n_params,df_results.shape[0])
    
    
    ### Hurray! Save your cosine distance results to load into R
    #.  for analysis

    savepath = "../data/processed/models/"
    # create the output directory (and any missing parents) if needed
    os.makedirs(savepath, exist_ok=True)

    filename = "sawc-distances_model-" + mpath.split("/")[1] + ".csv"

    df_results.to_csv(os.path.join(savepath,filename), index=False)




number of layers: 4
Total Trainable Params: 5344136

number of layers: 12
Total Trainable Params: 11811584

number of layers: 24
Total Trainable Params: 17811968

number of layers: 24
Total Trainable Params: 58852864

number of layers: 12
Total Trainable Params: 222723584

In [19]:
# """
# if mpath == "xlm-roberta-base":
#     n_layers = len(model.base_model.encoder.layer)
# elif mpath =="dccuchile/distilbert-base-spanish-uncased":
#     n_layers = len(model.base_model.transformer.layer)
# elif mpath == "dccuchile/albert-tiny-spanish":
#     n_layers = model.config.num_hidden_layers
# else:
#     n_layers = len(model.bert.encoder.layer)
# """
