## Compositionality Experiment with BERT
- In compositionality tasks, we want to see how well the model understands the structure of the input sentence and the underlying semantic relationships between the words.

### Encoder Models
Encoder models are one type of language model backbone in which the model sees the entire input sentence and is trained to predict missing (masked) words. Their main applications are discriminative tasks such as sentiment analysis and classification.

It's believed that encoders have a much stronger understanding of input sentences because of their bidirectional attention (the self-attention matrix is fully visible). Under this assumption, **should we expect better compositional performance from encoder models? And what does it mean to have good compositional performance?**

- The evaluation methods are based on the embedding space of the chosen language models. Each language model is pretrained with bidirectional attention or a closely related, modified attention mechanism.
- Each original sentence pair used to test compositionality is taken from Winoground. Each pair consists of two sentences with different meanings but very similar structure, e.g. "dog bites man" and "man bites dog". The first sentence is referred to as the positive text and the second as the negative text. These are only naming conventions and carry no additional connotation.
- Then 2-3 sentences with similar structure and meaning to the positive text are written or generated.
- We compute the embeddings of all of these sentences and compare how close the positive text is to its similar sentences versus how close it is to the negative text.
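
The comparison described in the list above can be sketched as a single number per sentence pair: the mean cosine similarity between the positive text and its similar sentences, minus the cosine similarity between the positive text and the negative text. The function name `compositional_gap` and the toy 2-D vectors are assumptions made for illustration. They are not model outputs.

```python
import torch
import torch.nn.functional as F

def compositional_gap(pos_vec, similar_vecs, neg_vec):
    """Mean cosine similarity to the similar sentences minus
    cosine similarity to the negative sentence.

    pos_vec: (hidden_dim,), similar_vecs: (num_similar, hidden_dim), neg_vec: (hidden_dim,)
    """
    sim_similar = F.cosine_similarity(pos_vec.unsqueeze(0), similar_vecs, dim=1)
    sim_negative = F.cosine_similarity(pos_vec, neg_vec, dim=0)
    return (sim_similar.mean() - sim_negative).item()

# Toy vectors, made up for illustration only.
pos = torch.tensor([1.0, 0.0])
similar = torch.tensor([[1.0, 0.0], [1.0, 0.0]])
neg = torch.tensor([0.0, 1.0])

gap = compositional_gap(pos, similar, neg)
print(gap)
```

A gap above zero means the positive text sits closer to its similar sentences. A gap below zero means it sits closer to the negative text.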


### Hypothesis
- Given the encoder's strong ability to understand text, the embedding of the positive sentence should be closer to its similar sentences than to the negative sentence.
- Given how simple and common the sentences selected from Winoground are, the model should get this right almost 100% of the time, since data like this was most likely seen during training.
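
One way to make "getting this correct" concrete is to count a pair as correct when the positive text is, on average, more similar to its similar sentences than to the negative text, that is, when the similarity difference is strictly positive. The helper `pair_accuracy` and the example values below are hypothetical and are not measured results.

```python
def pair_accuracy(gaps):
    """Fraction of sentence pairs whose similarity difference is strictly positive."""
    if not gaps:
        raise ValueError("gaps must not be empty")
    return sum(1 for g in gaps if g > 0) / len(gaps)

# Hypothetical differences, one per sentence pair (not measured results).
example_gaps = [0.12, -0.03, 0.40, -0.10]
accuracy = pair_accuracy(example_gaps)
print(accuracy)
```

Under the hypothesis above, this fraction would be expected to be close to 1.0.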


### Interesting Questions
- Does the model get more pairs right with simpler texts? And more wrong with more complicated texts?
- Does the embedding space correctly represent model performance? Is it possible that the model starts with a better representation, but information is lost or the model gets confused as the data passes through the transformer layers?
- How much "semantic smearing" happens, where the model defaults to the statistically most likely scenario/output?
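
The question about information being lost across layers could be probed by computing the same similarity difference at every layer instead of at a single one. The sketch below assumes the sentence ordering used in this notebook (positive first, similar sentences in between, negative last) and that all sentences have the same token length, so a plain mean over tokens is valid. The function name `layerwise_gaps` is an assumption, and random tensors stand in for `model(**inputs, output_hidden_states=True).hidden_states`.

```python
import torch
import torch.nn.functional as F

def layerwise_gaps(hidden_states):
    """hidden_states: tuple of tensors, each (num_sentences, seq_len, hidden_dim).

    Row 0 is the positive sentence, rows 1..-2 are the similar sentences,
    and the last row is the negative sentence. Returns one difference per layer.
    """
    gaps = []
    for layer in hidden_states:
        pooled = layer.mean(dim=1)  # (num_sentences, hidden_dim)
        pos, similar, neg = pooled[0], pooled[1:-1], pooled[-1]
        sim_similar = F.cosine_similarity(pos.unsqueeze(0), similar, dim=1).mean()
        sim_negative = F.cosine_similarity(pos, neg, dim=0)
        gaps.append((sim_similar - sim_negative).item())
    return gaps

# Random stand-in for real hidden states: 13 layers, 4 sentences, 7 tokens, 16 dims.
torch.manual_seed(0)
fake_hidden_states = tuple(torch.randn(4, 7, 16) for _ in range(13))
gaps = layerwise_gaps(fake_hidden_states)
print(len(gaps))
```

Plotting the returned list against the layer index would show whether the difference grows, shrinks, or changes sign as depth increases.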



#### Testing datasets

In [1]:
import numpy as np
from collections import defaultdict
from transformers import AutoModel, AutoTokenizer
from transformers import BertTokenizer, BertModel
import torch
import torch.nn as nn
import torch.nn.functional as F

In [2]:
# the positive and negative texts are stored as two parallel lists that share the same indices
# similar sentences are stored in a dictionary keyed by the positive sentence
positive_text = []
negative_text = []
similar_text = defaultdict(list)

In [3]:
# all data pairs go here (positive sentence -> negative sentence)
all_text = {
    "an old person kisses a young person" : "a young person kisses an old person",
    "the taller person hugs the shorter person" : "the shorter person hugs the taller person",
    "the sail rests below the water" : "the water rests below the sail",
    "there is a mug in some grass" : "there is some grass in a mug",
}

In [4]:
# construct pairs
for positive, negative in all_text.items():
  positive_text.append(positive)
  negative_text.append(negative)

In [5]:
# construct similar pairs
similar_text["an old person kisses a young person"]= [
    "a senior person kisses a junior student",
    "An old man kisses a young girl"]

similar_text["the taller person hugs the shorter person"]= [
    "A taller man hug the shorter girl",
    "A tall person hugs the smaller one"
]

similar_text["the sail rests below the water"] = [
    "The sail lies below the water",
    "The sail sits below the water",
    #"The sail sits beneath the water"
]

similar_text["there is a mug in some grass"]= [
    "There is a mug lying in grass",
    "The mug rests inside a bush grass",
    #"A mug lies among blades of grass"
]

#### Model configurations

In [None]:
# DeBERTa model, SOTA encoder model with bidirectional attention
deberta_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
deberta_model = AutoModel.from_pretrained("microsoft/deberta-v3-base", dtype="auto")


# RoBERTa (robustly optimized BERT)
roberta_tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
roberta_model = AutoModel.from_pretrained(
    "FacebookAI/roberta-base",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)


# regular BERT model (the uncased variant lowercases its input)
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")



#### Similarity and helper functions for experiment

In [11]:
# retrieve embedding values of each model variant
def compute_embedding(input_sentence, tokenizer, model):
  '''
  @param input_sentence: list of sentences. [0] is the positive sentence, [len-1] is the negative sentence.
                         Everything in between is the similar sentences.
                         All sentences must tokenize to the same length because no padding is applied.

  return: three tensors of shape Batch x hidden_dim, each obtained by taking the mean over the token axis:
          (second-to-last layer output, raw word embeddings, embedding-layer output)
  '''

  try:
    sentence = tokenizer(input_sentence, return_tensors='pt')
  except Exception as e:
    raise ValueError("Failed to tokenize. Check for input dimension mismatch") from e

  word_embeddings_layer = model.embeddings.word_embeddings
  with torch.no_grad():
    output = model(**sentence, output_hidden_states=True)
    # token embeddings only, taken directly from the lookup table
    raw_embeddings = word_embeddings_layer(sentence['input_ids'])

  hidden_states = output.hidden_states
  # note: index -2 selects the second-to-last layer, not the very last one
  final_states = hidden_states[-2]

  mean_final_states = torch.mean(final_states, dim=1)
  raw_embeddings = torch.mean(raw_embeddings, dim=1)
  # hidden_states[0] is the output of the embedding layer
  embeddings = torch.mean(hidden_states[0], dim=1)

  if not (mean_final_states.shape == raw_embeddings.shape == embeddings.shape):
      print(f"Dimension mismatch: Final_embedding: {mean_final_states.shape} | Raw embedding: {raw_embeddings.shape} | embedding: {embeddings.shape}")
      #raise ValueError("Final output shape dimension mismatch")
      return None, None, None
  return mean_final_states, raw_embeddings, embeddings
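
The tokenizer call above uses no padding, so it only works when every sentence in the batch tokenizes to the same length. If sentences of different lengths were needed, one common option is to call the tokenizer with `padding=True` and average only over real tokens using the returned `attention_mask`. The helper below is a minimal sketch of that idea on hand-made tensors. The name `masked_mean_pool` is an assumption and is not part of the experiment above.

```python
import torch

def masked_mean_pool(token_states, attention_mask):
    """Mean over the token axis that ignores padded positions.

    token_states: (batch, seq_len, hidden_dim)
    attention_mask: (batch, seq_len), 1 for real tokens and 0 for padding
    """
    mask = attention_mask.unsqueeze(-1).to(token_states.dtype)  # (batch, seq_len, 1)
    summed = (token_states * mask).sum(dim=1)                   # (batch, hidden_dim)
    counts = mask.sum(dim=1).clamp(min=1.0)                     # avoid division by zero
    return summed / counts

# Hand-made example: the third token of the first sentence is padding.
states = torch.tensor([[[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]],
                       [[2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])
pooled = masked_mean_pool(states, mask)
print(pooled)
```

A plain `torch.mean(..., dim=1)` on padded input would include the padding vectors in the average, which would distort the cosine similarities.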

In [8]:
# compute similarity
def similarity (p_text, similar_pairs, n_text):
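  '''
  @param p_text: embedding of the positive text, shape (hidden_dim,)
  @param similar_pairs: embeddings of all similar texts, shape (num_similar, hidden_dim)
  @param n_text: embedding of the negative text, shape (hidden_dim,)
  return: (cosine similarity to each similar text, cosine similarity to the negative text)
  '''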
  p_text = p_text.unsqueeze(0)
  n_text = n_text.unsqueeze(0)

  sim_pairs = F.cosine_similarity(p_text, similar_pairs)
  sim_negative = F.cosine_similarity(p_text, n_text)
  return sim_pairs, sim_negative

#### Running the experiment

In [12]:
def run_experiment(tokenizer, model):
  '''
  "Mean diff" is the mean similarity to the similar sentences minus the similarity to the negative sentence.
  a negative difference means the positive sentence is closer to the negative sentence
  a positive difference means the positive sentence is closer to the similar sentences
  '''
  for i in range(len(positive_text)):
    # construct all sentences: positive first, similar sentences in between, negative last
    p_text, n_text = positive_text[i], negative_text[i]
    sentences = [p_text] + similar_text[p_text] + [n_text]
    length = len(sentences)
    # compute embeddings
    final_embedding, raw_embedding, embedding = compute_embedding(sentences, tokenizer, model)
    if final_embedding is None or raw_embedding is None or embedding is None:
      continue

    # compute similarity
    print(f"{p_text}")
    print(f"Final Embeddings:")
    sim_pairs, sim_negative = similarity(final_embedding[0], final_embedding[1:length-1], final_embedding[length-1])
    for j, s in enumerate(similar_text[p_text]):  # j avoids shadowing the outer loop index i
      print(f"  {s}: {sim_pairs[j].item()}")
    print(f"  Negative: {sim_negative.item()}")
    diff= torch.mean(sim_pairs, dim=0) - sim_negative
    print(f"Mean diff: {diff.item()}")

    print(f"Raw Embeddings (before positional encoding):")
    sim_pairs, sim_negative = similarity(raw_embedding[0], raw_embedding[1:length-1], raw_embedding[length-1])
    for j, s in enumerate(similar_text[p_text]):  # j avoids shadowing the outer loop index i
      print(f"  {s}: {sim_pairs[j].item()}")
    print(f"  Negative: {sim_negative.item()}")
    diff= torch.mean(sim_pairs, dim=0) - sim_negative
    print(f"Mean diff: {diff.item()}")

    print(f"Embeddings (after positional encoding):")
    sim_pairs, sim_negative = similarity(embedding[0], embedding[1:length-1], embedding[length-1])
    for j, s in enumerate(similar_text[p_text]):  # j avoids shadowing the outer loop index i
      print(f"  {s}: {sim_pairs[j].item()}")
    print(f"  Negative: {sim_negative.item()}")
    diff= torch.mean(sim_pairs, dim=0) - sim_negative
    print(f"Mean diff: {diff.item()}")

    print("\n")

In [13]:
print("BERT EXPERIMENTS\n")
run_experiment(bert_tokenizer, bert_model)

BERT EXPERIMENTS

an old person kisses a young person
Final Embeddings:
  a senior person kisses a junior student: 0.9150984883308411
  An old man kisses a young girl: 0.874237060546875
  Negative: 0.9952633380889893
  Mean diff: -0.10059559345245361
Raw Embeddings (before positional encoding):
  a senior person kisses a junior student: 0.880440354347229
  An old man kisses a young girl: 0.9306422472000122
  Negative: 1.0000001192092896
  Mean diff: -0.09445881843566895
Embeddings (after positional encoding):
  a senior person kisses a junior student: 0.7819817662239075
  An old man kisses a young girl: 0.8738178610801697
  Negative: 0.9999940395355225
  Mean diff: -0.1720942258834839


the taller person hugs the shorter person
Final Embeddings:
  A taller man hug the shorter girl: 0.8669983148574829
  A tall person hugs the smaller one: 0.9315971732139587
  Negative: 0.9984343647956848
  Mean diff: -0.09913665056228638
Raw Embeddings (before positional encoding):
  A taller man hug th

In [14]:
print("ROBERTA EXPERIMENTS\n")
run_experiment(roberta_tokenizer,roberta_model)

ROBERTA EXPERIMENTS

an old person kisses a young person
Final Embeddings:
  a senior person kisses a junior student: 0.982421875
  An old man kisses a young girl: 0.986328125
  Negative: 0.99853515625
  Mean diff: -0.01416015625
Raw Embeddings (before positional encoding):
  a senior person kisses a junior student: 0.685546875
  An old man kisses a young girl: 0.802734375
  Negative: 0.94287109375
  Mean diff: -0.19873046875
Embeddings (after positional encoding):
  a senior person kisses a junior student: 0.8427734375
  An old man kisses a young girl: 0.9013671875
  Negative: 0.97265625
  Mean diff: -0.1005859375


the taller person hugs the shorter person
Final Embeddings:
  A taller man hug the shorter girl: 0.98193359375
  A tall person hugs the smaller one: 0.99072265625
  Negative: 0.99951171875
  Mean diff: -0.01318359375
Raw Embeddings (before positional encoding):
  A taller man hug the shorter girl: 0.77880859375
  A tall person hugs the smaller one: 0.82177734375
  Negative

In [15]:
print("DEBERTA EXPERIMENTS\n")
run_experiment(deberta_tokenizer,deberta_model)

DEBERTA EXPERIMENTS

an old person kisses a young person
Final Embeddings:
  a senior person kisses a junior student: 0.74609375
  An old man kisses a young girl: 0.82275390625
  Negative: 0.9638671875
  Mean diff: -0.1796875
Raw Embeddings (before positional encoding):
  a senior person kisses a junior student: 0.935546875
  An old man kisses a young girl: 0.95703125
  Negative: 1.0009765625
  Mean diff: -0.0546875
Embeddings (after positional encoding):
  a senior person kisses a junior student: 0.6396484375
  An old man kisses a young girl: 0.7568359375
  Negative: 0.99951171875
  Mean diff: -0.30126953125


the taller person hugs the shorter person
Final Embeddings:
  A taller man hug the shorter girl: 0.54150390625
  A tall person hugs the smaller one: 0.44189453125
  Negative: 0.85888671875
  Mean diff: -0.3671875
Raw Embeddings (before positional encoding):
  A taller man hug the shorter girl: 0.939453125
  A tall person hugs the smaller one: 0.9482421875
  Negative: 1.0
  Mean 

#### Analysis
- In general, the embedding similarity with the negative sentence is higher than the similarity with the similar sentences, though the values are generally close.
- Later models such as RoBERTa and DeBERTa did not show a significant improvement over BERT.
- Across all models, only DeBERTa's final embedding for the "there is a mug in some grass" sentence was closer to the similar sentences. For every other sentence pair, the positive sentence is more similar to the negative sentence.
- But the difference is so small that it does not indicate a meaningful difference in compositionality.
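- Positional encodings change the similarities noticeably, but in the output shown above the direction is not consistent. For the first sentence pair, the mean diff moves closer to zero for RoBERTa (-0.199 to -0.101) but further from zero for BERT (-0.094 to -0.172) and DeBERTa (-0.055 to -0.301). Where they do help, this could be attributed to the model learning the structure of the sentence.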
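- The transformer layers also move the mean diff closer to zero relative to the embedding-layer output (for the first sentence pair: BERT -0.172 to -0.101, RoBERTa -0.101 to -0.014, DeBERTa -0.301 to -0.180). So it's hard to argue that information/semantics is lost in the intermediate layers.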
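
One possible reading of the "Raw Embeddings" rows, where the negative similarity is almost exactly 1.0 for BERT, is that mean pooling over word embeddings ignores word order. If the positive and negative sentences tokenize to the same set of tokens in a different order, their averaged vectors are identical up to floating-point error. The sketch below illustrates this with a random embedding table. The three-word vocabulary and the table are stand-ins for illustration, not real model weights.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab = {"dog": 0, "bites": 1, "man": 2}
# Random stand-in for a learned word-embedding table (not real model weights).
embedding_table = torch.nn.Embedding(len(vocab), 8)

def mean_word_embedding(sentence):
    """Average the word vectors of a whitespace-tokenized sentence."""
    ids = torch.tensor([vocab[word] for word in sentence.split()])
    with torch.no_grad():
        return embedding_table(ids).mean(dim=0)

a = mean_word_embedding("dog bites man")
b = mean_word_embedding("man bites dog")
cos = F.cosine_similarity(a, b, dim=0).item()
print(cos)
```

Under this reading, the raw-embedding comparison cannot separate such pairs by construction, so any order sensitivity has to come from position information and the transformer layers.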

### Why do we not study decoder models?
