## Compositionality Experiment with BERT
- In compositionality tasks, we want to see how well the model understands the structure of the input sentence and the underlying semantic relationships between its words.

### Encoder Models
Encoder models are one type of language model backbone: the model sees the entire input sentence and is trained to predict missing (masked) words. Their main applications are discriminative tasks such as sentiment analysis and classification.

It's believed that encoders have a much stronger understanding of the input sentence because of their bidirectional attention (the self-attention matrix is fully visible). Under this assumption, **should we expect better compositional performance from encoder models? And what does it mean to have good compositional performance?**

- The evaluation methods are based on the embedding space of the chosen language models. Each language model is pretrained with bidirectional attention or a related/modified attention mechanism.
- Each original sentence pair used to test compositionality is taken from Winoground. A pair consists of two sentences with different meanings but very similar structure, e.g. "dog bites man" and "man bites dog". The first sentence is referred to as the positive text and the second as the negative text; the names are only a convention and carry no further connotation.
- Then 2-3 sentences with structure and meaning similar to the positive text are written/generated.
- We embed the positive text, its similar sentences, and the negative text, and compare how close the positive text is to each of them.


### Hypothesis
- Given the encoder's strong ability to understand text, the embedding of the positive sentence should be close to those of its 3 similar sentences, and the negative sentence should be further away from the positive/similar sentences.
- Given how simple and common the selected Winoground sentences are, the model should get this right almost 100% of the time, because data like this was most likely seen during training.
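
The hypothesis can be stated in terms of the quantity that `run_experiment` later prints as `Mean diff`. The symbols below ($e_p$, $e_{s_i}$, $e_n$ for the embeddings of the positive, similar, and negative texts, and $K$ for the number of similar sentences) are notation introduced here for illustration, not taken from the notebook:

$$
\Delta = \frac{1}{K}\sum_{i=1}^{K}\cos\left(e_p, e_{s_i}\right) - \cos\left(e_p, e_n\right),
\qquad
\cos(u, v) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
$$

Under the hypothesis we would expect $\Delta > 0$ for almost every sentence pair; $\Delta < 0$ means the positive text sits closer to the negative text than to its paraphrases.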


### Interesting Questions
- Does the model get more pairs right with simpler texts, and more wrong with more complicated texts?
- Does the embedding space correctly represent model performance? Is it possible that the model starts with a better representation, but information is lost (or the model gets confused) as the data passes through the transformer layers?
- How much "semantic smearing" happens, where the model defaults to the statistically most likely scenario/output?



#### Testing datasets

In [None]:
import numpy as np
from collections import defaultdict
from transformers import AutoModel, AutoTokenizer
from transformers import BertTokenizer, BertModel
import torch
import torch.nn as nn
import torch.nn.functional as F

In [None]:
# positive and negative texts are stored as two parallel lists sharing the same indices
# similar sentences are stored in a dictionary keyed by the positive sentence
positive_text= list()
negative_text= list()
similar_text = defaultdict(list)

In [None]:
# all data pairs go here (positive text -> negative text)
all_text ={
    "an old person kisses a young person" : "a young person kisses an old person",
    "the taller person hugs the shorter person" : "the shorter person hugs the taller person",
    "the sail rests below the water" : "the water rests below the sail",
    "there is a mug in some grass" : "there is some grass in a mug",
}

In [None]:
# construct pairs
for key in (all_text):
  positive_text.append(key)
  negative_text.append(all_text[key])

In [None]:
# construct similar pairs
similar_text["an old person kisses a young person"]= [
    "a senior person kisses a junior student",
    "An old man kisses a young girl",
    "An aged man plants a kiss on a young boy"]

similar_text["the taller person hugs the shorter person"]= [
    "A taller man hug the shorter girl",
    "A tall person hugs the smaller one",
    "One who is taller hugs their shorter companion"
]

similar_text["the sail rests below the water"] = [
    "The sail lies below the water",
    "The sail sits below the water",
    "The sail sits beneath the water"
]

similar_text["there is a mug in some grass"]= [
    "There is a mug lying in grass",
    "The mug rests inside a bush grass",
    "A mug lies among blades of grass"
]

#### Model configurations
- All models here are embedding models: they output hidden-state embeddings and have no output head, so there are no logits.

In [None]:
'''
encoder/embedding models, no MLP/linear output head
'''

# DeBERTa model, SOTA encoder model with bidirectional attention
deberta_tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
deberta_model = AutoModel.from_pretrained("microsoft/deberta-v3-base", dtype="auto")


# RoBERTa (robustly optimized BERT)
roberta_tokenizer = AutoTokenizer.from_pretrained("FacebookAI/roberta-base")
roberta_model = AutoModel.from_pretrained(
    "FacebookAI/roberta-base",
    dtype=torch.float16,
    device_map="auto",
    attn_implementation="sdpa"
)


# regular BERT model, uncased (input text is lowercased)
bert_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")




DebertaV2Model LOAD REPORT from: microsoft/deberta-v3-base
Key                                     | Status     |  | 
----------------------------------------+------------+--+-
lm_predictions.lm_head.bias             | UNEXPECTED |  | 
lm_predictions.lm_head.LayerNorm.bias   | UNEXPECTED |  | 
mask_predictions.LayerNorm.weight       | UNEXPECTED |  | 
mask_predictions.dense.bias             | UNEXPECTED |  | 
lm_predictions.lm_head.LayerNorm.weight | UNEXPECTED |  | 
mask_predictions.dense.weight           | UNEXPECTED |  | 
lm_predictions.lm_head.dense.bias       | UNEXPECTED |  | 
mask_predictions.classifier.weight      | UNEXPECTED |  | 
lm_predictions.lm_head.dense.weight     | UNEXPECTED |  | 
mask_predictions.classifier.bias        | UNEXPECTED |  | 
mask_predictions.LayerNorm.bias         | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



RobertaModel LOAD REPORT from: FacebookAI/roberta-base
Key                             | Status     | 
--------------------------------+------------+-
roberta.embeddings.position_ids | UNEXPECTED | 
lm_head.layer_norm.weight       | UNEXPECTED | 
lm_head.dense.bias              | UNEXPECTED | 
lm_head.dense.weight            | UNEXPECTED | 
lm_head.layer_norm.bias         | UNEXPECTED | 
lm_head.bias                    | UNEXPECTED | 
pooler.dense.weight             | MISSING    | 
pooler.dense.bias               | MISSING    | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING	:those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.



BertModel LOAD REPORT from: bert-base-uncased
Key                                        | Status     |  | 
-------------------------------------------+------------+--+-
cls.seq_relationship.weight                | UNEXPECTED |  | 
cls.predictions.transform.dense.weight     | UNEXPECTED |  | 
cls.seq_relationship.bias                  | UNEXPECTED |  | 
cls.predictions.bias                       | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.bias   | UNEXPECTED |  | 
cls.predictions.transform.dense.bias       | UNEXPECTED |  | 
cls.predictions.transform.LayerNorm.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


In [None]:
'''
CLIP model from OpenAI
Mainly use text encoder to grab textual embeddings
'''
from transformers import AutoProcessor, AutoModel, CLIPTextModel

clip_text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
clip_tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")




CLIPTextModel LOAD REPORT from: openai/clip-vit-base-patch32
Key                                                            | Status     |  | 
---------------------------------------------------------------+------------+--+-
vision_model.encoder.layers.{0...11}.self_attn.v_proj.bias     | UNEXPECTED |  | 
vision_model.encoder.layers.{0...11}.mlp.fc2.bias              | UNEXPECTED |  | 
vision_model.encoder.layers.{0...11}.layer_norm2.weight        | UNEXPECTED |  | 
vision_model.encoder.layers.{0...11}.layer_norm2.bias          | UNEXPECTED |  | 
vision_model.encoder.layers.{0...11}.layer_norm1.weight        | UNEXPECTED |  | 
vision_model.encoder.layers.{0...11}.self_attn.k_proj.weight   | UNEXPECTED |  | 
vision_model.encoder.layers.{0...11}.layer_norm1.bias          | UNEXPECTED |  | 
vision_model.encoder.layers.{0...11}.self_attn.k_proj.bias     | UNEXPECTED |  | 
vision_model.encoder.layers.{0...11}.self_attn.out_proj.bias   | UNEXPECTED |  | 
vision_model.encoder.layers.{0...11}.


In [None]:
'''
Qwen3 Embedding model, created and optimized for producing embeddings and similarity search (another way of doing CLIP)
'''

from sentence_transformers import SentenceTransformer
qwen_model = SentenceTransformer("Qwen/Qwen3-Embedding-4B")


In [None]:
'''
flanT5 is an encoder + decoder model
we only look at the encoder output, future test on decoder output is also doable
how does the decoder training improve the encoder's ability to grab semantic information?
'''
from transformers import AutoTokenizer, T5EncoderModel
t5_base_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
t5_base_model = T5EncoderModel.from_pretrained("google/flan-t5-base", device_map="auto")

t5_large_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
t5_large_model = T5EncoderModel.from_pretrained("google/flan-t5-large", device_map="auto")


t5_xl_tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")
t5_xl_model = T5EncoderModel.from_pretrained("google/flan-t5-xl", device_map="auto")



T5EncoderModel LOAD REPORT from: google/flan-t5-base
Key            | Status     |  | 
---------------+------------+--+-
lm_head.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



T5EncoderModel LOAD REPORT from: google/flan-t5-large
Key            | Status     |  | 
---------------+------------+--+-
lm_head.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.



T5EncoderModel LOAD REPORT from: google/flan-t5-xl
Key            | Status     |  | 
---------------+------------+--+-
lm_head.weight | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


## Similarity and helper functions for experiment

#### Compute embedding function
- Each token in the sentence is turned into an n-dimensional vector.

- We take the mean of these token vectors, dimension by dimension, to get one vector per sentence. This is done at the 3 different stages of the model.
- Ex: token IDs [3 5 10] become [[1,2,3], [4,5,6], [7,8,9]] after embedding; taking the mean down each column gives [4, 5, 6] as the vector for the sentence.
- Averaging over the tokens (rows) in this way is mean pooling: it summarizes all sentence information into one summary vector.
- The final output has shape [num_sentences, hidden_dim].
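
The `[4, 5, 6]` example can be checked by hand. The sketch below uses plain Python lists so it runs without any dependencies (the notebook itself does this on `torch` tensors in `mean_pooling`); `mean_pool` is a name made up for this illustration.

```python
def mean_pool(token_vectors):
    """Average a list of per-token vectors into one sentence vector."""
    num_tokens = len(token_vectors)
    # zip(*rows) walks the matrix column by column (one hidden dimension at a time)
    return [sum(column) / num_tokens for column in zip(*token_vectors)]

# three tokens, hidden_dim = 3, same numbers as the example in the list above
token_vectors = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sentence_vector = mean_pool(token_vectors)
print(sentence_vector)  # [4.0, 5.0, 6.0]
```

Stacking one such vector per sentence gives the `[num_sentences, hidden_dim]` shape mentioned above.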

#### Technical Details
- All sentences for one test case are tokenized together as a single batch for faster (parallel) processing.
- After the embeddings are calculated, we score them with cosine similarity, i.e. the cosine of the angle between two vectors: 1 means the same direction, 0 means orthogonal.
- Padding of input tokens:
  - Not all sentences have the same length, so shorter ones are padded with the tokenizer's padding token, and the attention mask is 0 at those positions.
  - The attention mask is applied automatically in the transformer layers: after Q*K^T, the scores at padded positions are set to -inf, so they become 0 after softmax and the corresponding rows of V are ignored.
  - The same attention mask is applied before taking the mean.
  - All padded positions are zeroed out, the token vectors are summed, and the sum is divided by the number of real tokens to get the mean.
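
The two uses of the attention mask described above can be sketched on toy numbers. This is a dependency-free illustration, not the code the models run internally; `masked_softmax` and `masked_mean` are hypothetical helper names, and the scores and vectors are invented.

```python
import math

def masked_softmax(scores, mask):
    """Softmax over one row of attention scores; positions with mask 0 get weight 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, mask)]
    peak = max(masked)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

def masked_mean(token_vectors, mask):
    """Mean over real tokens only; padded rows are zeroed and not counted."""
    count = max(sum(mask), 1e-9)  # guard against division by zero
    kept = [[x * m for x in vec] for vec, m in zip(token_vectors, mask)]
    return [sum(column) / count for column in zip(*kept)]

mask = [1, 1, 0]  # the third position is padding

weights = masked_softmax([2.0, 2.0, 5.0], mask)
print(weights)  # [0.5, 0.5, 0.0]

pooled = masked_mean([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], mask)
print(pooled)  # [2.0, 3.0]
```

Note that the padded position has the largest raw score (5.0) and the largest vector, yet contributes nothing to either result.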

#### Which embedding to take?
- We want the output embeddings whose values and similarity scores reflect the model's true ability to understand/embed the input sentences.
- Candidates: the layer right after the token embedding, the embedding after positional encoding, the last transformer layer, and the logits.

In [None]:
def mean_pooling(token_embeddings, attention_mask):
  '''
  token_embedding: [Batch, Seq, Hidden_dim]
  '''

  # Expand the mask to match the embedding dimensions
  input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()

  # Multiply embeddings by mask (zeros out padding), then sum
  sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)


  # Count only non-padded tokens
  sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)


  # Divide to get the true mean of ONLY the real words
  return sum_embeddings / sum_mask

In [None]:
# retrieve embedding values of each model variant
def compute_embedding(input_sentence, tokenizer, model, eos=False):
  '''
  return: (mean_final_states, raw_embeddings, embeddings, pooled_output)
          the first three are mean-pooled to [Batch, hidden_dim]; entries can be None

  @param input_sentence: list of sentences. [0] is the positive sentence, [len-1] is the negative sentence. Everything in between is the similar sentences
  @param eos: if True, also return the model's pooler_output

  '''
  device = "cuda" if torch.cuda.is_available() else "cpu"
  model = model.to(device)
  try:
    sentence = tokenizer(input_sentence, padding=True, truncation=True, return_tensors='pt').to(device)
  except Exception as err:
    raise ValueError("Failed to tokenize. Check for input dimension mismatch") from err

  attention_mask = sentence["attention_mask"]

  # not every architecture exposes model.embeddings.word_embeddings
  raw_embeddings = None
  try:
    word_embeddings_layer = model.embeddings.word_embeddings
  except AttributeError:
    print("No raw word embedding layer")
    word_embeddings_layer = None

  with torch.no_grad():
    try:
      output = model(**sentence, output_hidden_states=True)
    except:
      output = model(sentence.input_ids, output_hidden_states=True)

  hidden_states = output.hidden_states
  # retrieve the last state
  final_states = output.last_hidden_state


  # take the mean
  mean_final_states =mean_pooling(final_states, attention_mask)


  # raw embedding without positional encoding
  if (word_embeddings_layer is not None):
    raw_embeddings = word_embeddings_layer(sentence['input_ids'])
    raw_embeddings = mean_pooling(raw_embeddings, attention_mask)

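  # output of the embedding layer (hidden_states[0]); this includes positional
  # information only for models that use absolute position embeddings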
  embeddings = hidden_states[0]
  embeddings= mean_pooling(embeddings, attention_mask)

  pooled_output = None
  if (eos):
    pooled_output = output.pooler_output


  return mean_final_states, raw_embeddings, embeddings, pooled_output

In [None]:
# compute similarity
def similarity (p_text, similar_pairs, n_text):
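  '''
  @param p_text: embedding of the positive text, shape [hidden_dim]
  @param similar_pairs: embeddings of all similar texts, shape [num_similar, hidden_dim]
  @param n_text: embedding of the negative text, shape [hidden_dim]
  @return: (similarity to each similar text, similarity to the negative text)
  '''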

  # need to make p_text into (1,hidden_dim)
  p_text = p_text.unsqueeze(0)
  n_text = n_text.unsqueeze(0)

  sim_pairs = F.cosine_similarity(p_text, similar_pairs)
  sim_negative = F.cosine_similarity(p_text, n_text)
  return sim_pairs, sim_negative
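
To make the scoring concrete, here is a small worked example of cosine similarity and of the "mean diff" used in the experiment below. It is a dependency-free sketch: `cosine_similarity` is written out by hand rather than calling `F.cosine_similarity`, and the 2-d vectors are invented stand-ins for sentence embeddings.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| * |v|)"""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

positive = [1.0, 0.0]
similar = [[2.0, 0.0], [1.0, 1.0]]  # hypothetical "similar sentence" embeddings
negative = [0.0, 3.0]               # hypothetical "negative sentence" embedding

sim_pairs = [cosine_similarity(positive, s) for s in similar]
sim_negative = cosine_similarity(positive, negative)

# same quantity the experiment reports as "Mean diff"
mean_diff = sum(sim_pairs) / len(sim_pairs) - sim_negative
print(sim_pairs, sim_negative, mean_diff)
```

Here `[2.0, 0.0]` points in the same direction as the positive vector (similarity 1), `[0.0, 3.0]` is orthogonal to it (similarity 0), and vector length plays no role. A positive `mean_diff` is the outcome the hypothesis predicts.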

#### Running the experiment

In [None]:
def run_experiment(tokenizer, model, eos=False):
  '''
  For every positive sentence, print its cosine similarity to the similar
  sentences and to the negative sentence at each embedding stage.

  Mean diff = mean(similarity to similar sentences) - similarity to negative
  negative diff means the positive text is closer to the negative text
  positive diff means the positive text is closer to the similar sentences

  '''
  def report(title, emb, p_text):
    # row 0 is the positive text, the last row is the negative text,
    # everything in between is the similar sentences
    print(title)
    sim_pairs, sim_negative = similarity(emb[0], emb[1:-1], emb[-1])
    for j, s in enumerate(similar_text[p_text]):
      print(f"  {s}: {sim_pairs[j].item()}")
    print(f"  Negative: {sim_negative.item()}")
    diff = torch.mean(sim_pairs, dim=0) - sim_negative
    print(f"Mean diff: {diff.item()}")

  for p_text, n_text in zip(positive_text, negative_text):
    # construct all sentences for this test case
    sentences = [p_text] + similar_text[p_text] + [n_text]

    # compute embeddings
    final_embedding, raw_embedding, embedding, pooled = compute_embedding(sentences, tokenizer, model, eos)

    # compute similarity at each stage
    print(f"{p_text}")
    if final_embedding is not None:
      report("Final Embeddings:", final_embedding, p_text)

    if eos and pooled is not None:
      report("\nEOS embedding output:", pooled, p_text)

    if raw_embedding is not None:
      report("\nRaw Embeddings (before positional encoding):", raw_embedding, p_text)

    if embedding is not None:
      report("\nEmbeddings (after positional encoding):", embedding, p_text)

    print("\n")

In [None]:
def qwen_experiment(model):
  '''
  experiment just for the Qwen3 embedding model
  '''
  for i in range(0,len(positive_text)):
    # construct all sentences pair
    p_text, n_text = positive_text[i], negative_text[i]
    sentences = [p_text] + similar_text[p_text] + [n_text]
    length = len(sentences)

    # compute embeddings
    final_embedding= model.encode(sentences)

    final_embedding= torch.from_numpy(final_embedding)
    final_embedding= final_embedding.squeeze(0)

    # compute similarity
    print(f"{p_text}")
    if (final_embedding is not None):
      print(f"Final Embeddings:")
      sim_pairs, sim_negative = similarity(final_embedding[0], final_embedding[1:length-1], final_embedding[length-1])
      for i, s in enumerate(similar_text[p_text]):
        print(f"  {s}: {sim_pairs[i].item()}")
      print(f"  Negative: {sim_negative.item()}")
      diff= torch.mean(sim_pairs, dim=0) - sim_negative
      print(f"Mean diff: {diff.item()}")

    print("\n")

In [None]:
print("BERT EXPERIMENTS\n")
run_experiment(bert_tokenizer, bert_model)

BERT EXPERIMENTS

an old person kisses a young person
Final Embeddings:
  a senior person kisses a junior student: 0.8767836093902588
  An old man kisses a young girl: 0.8696824908256531
  An aged man plants a kiss on a young boy: 0.7986450791358948
  Negative: 0.9926751852035522
Mean diff: -0.14430475234985352

Raw Embeddings (before positional encoding):
  a senior person kisses a junior student: 0.8804402351379395
  An old man kisses a young girl: 0.9306421279907227
  An aged man plants a kiss on a young boy: 0.8763240575790405
  Negative: 0.9999998807907104
Mean diff: -0.1041976809501648

Embeddings (after positional encoding):
  a senior person kisses a junior student: 0.7819818258285522
  An old man kisses a young girl: 0.8738179802894592
  An aged man plants a kiss on a young boy: 0.773882269859314
  Negative: 0.999994158744812
Mean diff: -0.19010013341903687


the taller person hugs the shorter person
Final Embeddings:
  A taller man hug the shorter girl: 0.8049286603927612
  A

In [None]:
print("ROBERTA EXPERIMENTS\n")
run_experiment(roberta_tokenizer,roberta_model)

ROBERTA EXPERIMENTS

an old person kisses a young person
Final Embeddings:
  a senior person kisses a junior student: 0.9821566343307495
  An old man kisses a young girl: 0.9849722385406494
  An aged man plants a kiss on a young boy: 0.9812824130058289
  Negative: 0.9989384412765503
Mean diff: -0.016134679317474365

Raw Embeddings (before positional encoding):
  a senior person kisses a junior student: 0.6853720545768738
  An old man kisses a young girl: 0.8029927015304565
  An aged man plants a kiss on a young boy: 0.6952104568481445
  Negative: 0.9432175159454346
Mean diff: -0.21535909175872803

Embeddings (after positional encoding):
  a senior person kisses a junior student: 0.8426203727722168
  An old man kisses a young girl: 0.9013096690177917
  An aged man plants a kiss on a young boy: 0.8418176174163818
  Negative: 0.972170889377594
Mean diff: -0.11025494337081909


the taller person hugs the shorter person
Final Embeddings:
  A taller man hug the shorter girl: 0.97932529449462

In [None]:
print("DEBERTA EXPERIMENTS\n")
run_experiment(deberta_tokenizer,deberta_model)

DEBERTA EXPERIMENTS

an old person kisses a young person
Final Embeddings:
  a senior person kisses a junior student: 0.7652999758720398
  An old man kisses a young girl: 0.7461739778518677
  An aged man plants a kiss on a young boy: 0.6826192140579224
  Negative: 0.9569706320762634
Mean diff: -0.22560620307922363

Raw Embeddings (before positional encoding):
  a senior person kisses a junior student: 0.9353735446929932
  An old man kisses a young girl: 0.9563971757888794
  An aged man plants a kiss on a young boy: 0.9407403469085693
  Negative: 0.9999998807907104
Mean diff: -0.055829524993896484

Embeddings (after positional encoding):
  a senior person kisses a junior student: 0.6401230096817017
  An old man kisses a young girl: 0.7571167349815369
  An aged man plants a kiss on a young boy: 0.635161280632019
  Negative: 0.9999999403953552
Mean diff: -0.32253289222717285


the taller person hugs the shorter person
Final Embeddings:
  A taller man hug the shorter girl: 0.58320963382720

In [None]:
print("CLIP EXPERIMENTS\n")
run_experiment(clip_tokenizer,clip_text_encoder, eos=True)

CLIP EXPERIMENTS

No raw word embedding layer
an old person kisses a young person
Final Embeddings:
  a senior person kisses a junior student: 0.8533490896224976
  An old man kisses a young girl: 0.9525268077850342
  An aged man plants a kiss on a young boy: 0.8868050575256348
  Negative: 0.9483888149261475
Mean diff: -0.050828516483306885

EOS embedding output:
  a senior person kisses a junior student: 0.8757849335670471
  An old man kisses a young girl: 0.9231071472167969
  An aged man plants a kiss on a young boy: 0.8676836490631104
  Negative: 0.952176570892334
Mean diff: -0.06331801414489746

Embeddings (after positional encoding):
  a senior person kisses a junior student: 0.8756532073020935
  An old man kisses a young girl: 0.9243560433387756
  An aged man plants a kiss on a young boy: 0.8413453698158264
  Negative: 0.9999999403953552
Mean diff: -0.11954838037490845


No raw word embedding layer
the taller person hugs the shorter person
Final Embeddings:
  A taller man hug the 

In [None]:
print("Qwen3 Experiments\n")
'''
inference time of the qwen model is a lot longer than the other models
'''
qwen_experiment(qwen_model)

Qwen3 Experiments

an old person kisses a young person
Final Embeddings:
  a senior person kisses a junior student: 0.846871018409729
  An old man kisses a young girl: 0.8874929547309875
  An aged man plants a kiss on a young boy: 0.8290497660636902
  Negative: 0.9459046125411987
Mean diff: -0.09143334627151489


the taller person hugs the shorter person
Final Embeddings:
  A taller man hug the shorter girl: 0.8819760680198669
  A tall person hugs the smaller one: 0.9549587368965149
  One who is taller hugs their shorter companion: 0.9367029666900635
  Negative: 0.9478232264518738
Mean diff: -0.02327728271484375


the sail rests below the water
Final Embeddings:
  The sail lies below the water: 0.9869880080223083
  The sail sits below the water: 0.9876495003700256
  The sail sits beneath the water: 0.9789270758628845
  Negative: 0.9339830279350281
Mean diff: 0.05053853988647461


there is a mug in some grass
Final Embeddings:
  There is a mug lying in grass: 0.9713650941848755
  The mu

In [None]:
print("Flan T5 Base Experiments\n")
run_experiment(t5_base_tokenizer, t5_base_model)

Flan T5 Experiments

No raw word embedding layer
an old person kisses a young person
Final Embeddings:
  a senior person kisses a junior student: 0.8969503045082092
  An old man kisses a young girl: 0.9256932139396667
  An aged man plants a kiss on a young boy: 0.8601130247116089
  Negative: 0.9696336984634399
Mean diff: -0.07538145780563354

Embeddings (after positional encoding):
  a senior person kisses a junior student: 0.7410325407981873
  An old man kisses a young girl: 0.8538724780082703
  An aged man plants a kiss on a young boy: 0.7282896041870117
  Negative: 1.0
Mean diff: -0.22560173273086548


No raw word embedding layer
the taller person hugs the shorter person
Final Embeddings:
  A taller man hug the shorter girl: 0.818711519241333
  A tall person hugs the smaller one: 0.8222359418869019
  One who is taller hugs their shorter companion: 0.7576948404312134
  Negative: 0.9735816717147827
Mean diff: -0.1740342378616333

Embeddings (after positional encoding):
  A taller man 

In [None]:
print("Flan T5 Large Experiments\n")
run_experiment(t5_large_tokenizer, t5_large_model)

Flan T5 Large Experiments

No raw word embedding layer
an old person kisses a young person
Final Embeddings:
  a senior person kisses a junior student: 0.8457169532775879
  An old man kisses a young girl: 0.9139134883880615
  An aged man plants a kiss on a young boy: 0.8278751373291016
  Negative: 0.9713282585144043
Mean diff: -0.1088263988494873

Embeddings (after positional encoding):
  a senior person kisses a junior student: 0.7291390895843506
  An old man kisses a young girl: 0.824893593788147
  An aged man plants a kiss on a young boy: 0.7017545104026794
  Negative: 1.0
Mean diff: -0.2480708360671997


No raw word embedding layer
the taller person hugs the shorter person
Final Embeddings:
  A taller man hug the shorter girl: 0.8236874341964722
  A tall person hugs the smaller one: 0.8436329364776611
  One who is taller hugs their shorter companion: 0.8363988399505615
  Negative: 0.9692001938819885
Mean diff: -0.13462704420089722

Embeddings (after positional encoding):
  A taller

In [None]:
print("Flan T5 XL Experiments\n")
run_experiment(t5_xl_tokenizer, t5_xl_model)

Flan T5 XL Experiments

No raw word embedding layer
an old person kisses a young person
Final Embeddings:
  a senior person kisses a junior student: 0.8657392859458923
  An old man kisses a young girl: 0.8601407408714294
  An aged man plants a kiss on a young boy: 0.8063251376152039
  Negative: 0.9574733972549438
Mean diff: -0.11340498924255371

Embeddings (after positional encoding):
  a senior person kisses a junior student: 0.7270920276641846
  An old man kisses a young girl: 0.8266953825950623
  An aged man plants a kiss on a young boy: 0.6772215962409973
  Negative: 0.9999998807907104
Mean diff: -0.2563301920890808


No raw word embedding layer
the taller person hugs the shorter person
Final Embeddings:
  A taller man hug the shorter girl: 0.8018569946289062
  A tall person hugs the smaller one: 0.9073591828346252
  One who is taller hugs their shorter companion: 0.801571249961853
  Negative: 0.9684673547744751
Mean diff: -0.13153815269470215

Embeddings (after positional encoding

#### Analysis
- In general, the positive sentence's embedding is more similar to the negative sentence than to the similar sentences, though the values are generally close.
- Later models such as RoBERTa and DeBERTa did not show a significant improvement over BERT.
- Across all models, positive mean diffs are rare: DeBERTa's final embedding for the "there is a mug in some grass" sentence is closer to its similar sentences, and in the output above Qwen3 does the same for "the sail rests below the water" (mean diff 0.0505). For nearly every other sentence pair, the positive text is more similar to the negative text.
- Even then, the margin is so small that it doesn't indicate a significant difference in compositionality.
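- Positional encodings change the similarity scores, but not consistently: for the first sentence pair, RoBERTa's mean diff improves once positions are added (-0.215 to -0.110), while BERT's and DeBERTa's get worse (-0.104 to -0.190 and -0.056 to -0.323). Any benefit from learning the structure of the sentence therefore looks model-dependent.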
- The transformer layers improve the scores: the final embeddings generally have a mean diff closer to zero than the embedding-layer output. So it's hard to argue that information/semantics is lost in the middle layers.
- In encoder + decoder models such as Flan-T5, the encoder-side scores are also not that great, and scale doesn't seem to improve them. Looking at the encoder output alone suggests that the encoder portion of Flan-T5 is not good at this task.
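
One possible reading of the near-1.0 "Negative" scores at the raw-embedding stage is that they follow from mean pooling itself rather than from anything the model learned. Assuming the positive and negative texts tokenize to the same multiset of tokens (which this notebook does not verify), averaging position-free word vectors gives the same result in any word order. The toy vocabulary below is invented for illustration.

```python
def mean_pool(token_vectors):
    """Average per-token vectors into one sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]

# hypothetical 2-d "word embeddings", for illustration only
toy_embedding = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

positive = ["dog", "bites", "man"]
negative = ["man", "bites", "dog"]  # same words, different order

pooled_positive = mean_pool([toy_embedding[w] for w in positive])
pooled_negative = mean_pool([toy_embedding[w] for w in negative])
print(pooled_positive == pooled_negative)  # True
```

If this reading applies, the raw-embedding stage cannot separate a sentence from its word-order swap under mean pooling, so only the later stages are informative about compositionality.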


[Results Table](https://docs.google.com/document/d/17KIbLTMhTp6EmgEjUXwuOLTjkUtmJz1Tt16ASPpIYpk/edit?usp=sharing)

### Why do we not study decoder models?
