First, let's load a pre-trained `SentenceTransformer` model. We'll use `all-MiniLM-L6-v2`, a compact general-purpose model that maps each sentence to a 384-dimensional vector.

In [None]:
from sentence_transformers import SentenceTransformer, util

# Load a pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model loaded successfully!")

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.


Model loaded successfully!


In [None]:
import torch
import spacy
import numpy as np
from transformers import AutoTokenizer, AutoModel
from scipy.spatial.distance import cosine

Now, let's define some test sentences. Each pair contrasts a sentence containing a proper noun with a close variant in which the proper noun is swapped for a different one, replaced by a common noun, or shortened.

We'll use three pairs of sentences:
1.  **Different proper nouns**: "Paris is the capital of France." vs "London is the capital of England."
2.  **Proper noun vs. common noun**: "Apple released a new iPhone." vs "The company released a new phone."
3.  **Variant proper nouns**: "Barack Obama was the President." vs "Obama was the President."

In [None]:
sentences = [
    "Paris is the capital of France.",
    "London is the capital of England.",
    "Apple released a new iPhone.",
    "The company released a new phone.",
    "Barack Obama was the President.",
    "Obama was the President."
]

print("Sentences to analyze:")
for i, s in enumerate(sentences):
    print(f"{i+1}. {s}")

Sentences to analyze:
1. Paris is the capital of France.
2. London is the capital of England.
3. Apple released a new iPhone.
4. The company released a new phone.
5. Barack Obama was the President.
6. Obama was the President.


Next, we compute the embeddings for all these sentences. An embedding is a numerical vector representation of the sentence.

In [None]:
embeddings = model.encode(sentences, convert_to_tensor=True)
print(f"Embeddings created. Shape: {embeddings.shape}")

Embeddings created. Shape: torch.Size([6, 384])


Finally, we calculate the cosine similarity between pairs of sentences to see how similar their meanings are according to the model. Cosine similarity ranges from -1 (vectors pointing in opposite directions) through 0 (orthogonal vectors) to 1 (vectors pointing in the same direction).

We will compare:
*   Sentence 1 and 2 (different proper nouns)
*   Sentence 3 and 4 (proper noun vs. common noun)
*   Sentence 5 and 6 (full proper noun vs. shortened proper noun)
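
To make the score concrete, here is a minimal sketch of the computation on toy vectors. The helper name `toy_cos_sim` and the 2-D values are illustrative assumptions, not part of the notebook; `util.cos_sim` computes the same quantity on tensors.

```python
import math

def toy_cos_sim(u, v):
    """Cosine of the angle between two equal-length vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-D vectors (illustrative values, not real embeddings)
same_direction = toy_cos_sim([1.0, 2.0], [2.0, 4.0])   # close to 1
orthogonal = toy_cos_sim([1.0, 0.0], [0.0, 3.0])       # close to 0
opposite = toy_cos_sim([1.0, 2.0], [-1.0, -2.0])       # close to -1
print(same_direction, orthogonal, opposite)
```

Because the score depends only on direction, scaling a vector (as in the first pair) leaves the similarity unchanged.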

In [None]:
import torch

# Calculate cosine similarities
similarity_1_2 = util.cos_sim(embeddings[0], embeddings[1])
similarity_3_4 = util.cos_sim(embeddings[2], embeddings[3])
similarity_5_6 = util.cos_sim(embeddings[4], embeddings[5])

print(f"Similarity between '{sentences[0]}' and '{sentences[1]}': {similarity_1_2.item():.4f}")
print(f"Similarity between '{sentences[2]}' and '{sentences[3]}': {similarity_3_4.item():.4f}")
print(f"Similarity between '{sentences[4]}' and '{sentences[5]}': {similarity_5_6.item():.4f}")

print("\nInterpretation:")
print("A higher similarity score indicates that the model perceives the sentences to be more semantically similar.")
print("By comparing these scores, you can observe how the model treats proper nouns in terms of their semantic contribution.")

Similarity between 'Paris is the capital of France.' and 'London is the capital of England.': 0.3513
Similarity between 'Apple released a new iPhone.' and 'The company released a new phone.': 0.7241
Similarity between 'Barack Obama was the President.' and 'Obama was the President.': 0.9539

Interpretation:
A higher similarity score indicates that the model perceives the sentences to be more semantically similar.
By comparing these scores, you can observe how the model treats proper nouns in terms of their semantic contribution.


In [None]:
company_embedding = model.encode("Apple makes iPhones", convert_to_tensor=True)

In [None]:
fruit_embedding = model.encode("apples grow in Washington", convert_to_tensor=True)

In [None]:
# Compare "Apple" the company with "apples" the fruit
util.cos_sim(company_embedding, fruit_embedding)

tensor([[0.4004]])

# Incorporating POS Tagging in Sentence Transformers

In [None]:
# 1. Setup Models
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
nlp = spacy.load("en_core_web_sm")

# Define our sentences
anchor_text = "Explorer’s termination in arctic boat accident."
distractor_text = "The explorer is in the arctic boat."

# Define Weights: Heavy penalty for function words, boost for nouns/verbs
POS_WEIGHTS = {
    "NOUN": 2.0, "PROPN": 2.0, "VERB": 2.0, "ADJ": 1.0,
    "ADV": 1.0, "DET": 0.05, "ADP": 0.05, "CCONJ": 0.05,
    "PRON": 0.1, "PART": 0.05, "AUX": 0.05
}


In [None]:
def get_embedding(text, use_syntax_weights=False):
    # Tokenize
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

    with torch.no_grad():
        outputs = model(**inputs)
        # Shape: [1, seq_len, hidden_dim]
        token_embeddings = outputs.last_hidden_state[0]

    # Get word_ids to map sub-tokens to original words
    word_ids = inputs.word_ids()

    # Generate Weights vector
    weights = torch.ones(token_embeddings.shape[0])

    if use_syntax_weights:
        doc = nlp(text)
        # Map spaCy POS tags onto BERT sub-tokens.
        # Caveat: word_ids() indexes the tokenizer's own pre-tokenized words, not
        # spaCy tokens. The two can split text differently (e.g. around "’s"), so
        # doc[word_id] may point at a different word than the sub-token belongs to.
        for i, word_id in enumerate(word_ids):
            # Special tokens ([CLS], [SEP]) have word_id None and keep weight 1.0
            if word_id is not None and word_id < len(doc):
                pos = doc[word_id].pos_
                weights[i] = POS_WEIGHTS.get(pos, 1.0)  # Default to 1.0 for unlisted tags

    # Apply Weighted Pooling
    # Expand weights to [seq_len, hidden_dim]
    weights_expanded = weights.unsqueeze(-1).expand(token_embeddings.size())

    # Multiply
    weighted_embeddings = token_embeddings * weights_expanded

    # Sum and Normalize
    sum_embeddings = torch.sum(weighted_embeddings, dim=0)
    sum_weights = torch.sum(weights_expanded, dim=0)

    # Avoid div by zero
    sentence_embedding = sum_embeddings / torch.clamp(sum_weights, min=1e-9)

    return sentence_embedding.numpy()
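
The pooling step in `get_embedding` is a weighted average: each token vector is scaled by its weight, the results are summed, and the sum is divided by the total weight. The sketch below reproduces that arithmetic on toy data in plain Python. The helper name `weighted_mean_pool` and the 2-D vectors are assumptions made for illustration; only the two weights are taken from `POS_WEIGHTS`.

```python
def weighted_mean_pool(token_vectors, weights):
    """Weighted average of token vectors: sum(w_i * v_i) / sum(w_i)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    dim = len(token_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, token_vectors)) / total_weight
        for d in range(dim)
    ]

# Hypothetical 2-D vectors standing in for the tokens "the" and "boat"
token_vectors = [[1.0, 0.0], [0.0, 1.0]]

plain = weighted_mean_pool(token_vectors, [1.0, 1.0])      # ordinary mean pooling
weighted = weighted_mean_pool(token_vectors, [0.05, 2.0])  # DET=0.05, NOUN=2.0
print(plain)
print(weighted)
```

With equal weights both tokens contribute equally; with the POS weights the pooled vector lies almost entirely along the noun's direction.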

In [None]:
# --- RUN THE COMPARISON ---

# 1. Standard Embeddings (Mean Pooling)
vec_anchor_std = get_embedding(anchor_text, use_syntax_weights=False)
vec_dist_std = get_embedding(distractor_text, use_syntax_weights=False)

# 2. Syntax-Weighted Embeddings
vec_anchor_syn = get_embedding(anchor_text, use_syntax_weights=True)
vec_dist_syn = get_embedding(distractor_text, use_syntax_weights=True)

# Calculate Similarity (1 - Cosine Distance)
sim_std = 1 - cosine(vec_anchor_std, vec_dist_std)
sim_syn = 1 - cosine(vec_anchor_syn, vec_dist_syn)

print(f"Sentence A: {anchor_text}")
print(f"Sentence B: {distractor_text}")
print("-" * 30)
print(f"Standard Similarity:       {sim_std:.4f}")
print(f"Syntax-Weighted Similarity: {sim_syn:.4f}")
print("-" * 30)
print(f"Difference: {sim_std - sim_syn:.4f}")

Sentence A: Explorer’s termination in arctic boat accident.
Sentence B: The explorer is in the arctic boat.
------------------------------
Standard Similarity:       0.6547
Syntax-Weighted Similarity: 0.7956
------------------------------
Difference: -0.1409


## Comparing POS-weighted and standard embeddings

In [None]:
import torch
import spacy
import pandas as pd
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity

# 1. Setup
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
nlp = spacy.load("en_core_web_sm")

sentence = "Explorer’s termination in arctic boat accident"

# Weights Configuration
POS_WEIGHTS = {
    "NOUN": 3.0, "PROPN": 3.0, "VERB": 3.0, # Increased for dramatic effect
    "ADJ": 1.0, "ADV": 1.0,
    "DET": 0.05, "ADP": 0.05, "PART": 0.05
}

def analyze_vector_shift(text):
    # Tokenize and get raw embeddings
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
        token_embeddings = outputs.last_hidden_state[0] # [seq_len, 384]

    # Get word mapping and POS tags
    word_ids = inputs.word_ids()
    doc = nlp(text)

    # --- 1. Calculate Standard Mean Pooling ---
    # (Exclude CLS/SEP for fair comparison)
    mask = torch.tensor([1 if w is not None else 0 for w in word_ids]).unsqueeze(1)
    sum_std = torch.sum(token_embeddings * mask, dim=0)
    sum_mask = torch.sum(mask)
    vec_std = sum_std / sum_mask if sum_mask > 0 else torch.zeros_like(sum_std)

    # --- 2. Calculate POS Weighted Pooling ---
    weights = []
    tokens = []

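    # Build weight vector aligned with tokens.
    # Caveat: word_ids() counts the tokenizer's own words, which can differ from
    # spaCy's tokens (e.g. around "’s"), so doc[word_id] may be a different word.
    for i, word_id in enumerate(word_ids):
        # Ensure word_id is not None and is within the bounds of spaCy's doc tokens
        if word_id is not None and word_id < len(doc):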
            span = doc[word_id]
            w = POS_WEIGHTS.get(span.pos_, 1.0) # Default to 1.0 if POS not in weights
            weights.append(w)
            tokens.append(tokenizer.decode([inputs['input_ids'][0][i]]))
        else:
            # For special tokens (word_id is None) or words not aligned with spaCy doc
            weights.append(0.0) # Assign zero weight
            tokens.append(tokenizer.decode([inputs['input_ids'][0][i]]) if word_id is not None else "[SPL]")

    weights_tensor = torch.tensor(weights).unsqueeze(1)

    # Weighted Average
    weighted_emb = token_embeddings * weights_tensor
    sum_weighted = torch.sum(weighted_emb, dim=0)
    sum_weights = torch.sum(weights_tensor)
    vec_weighted = sum_weighted / sum_weights if sum_weights > 0 else torch.zeros_like(sum_weighted)

    # --- 3. Compare Contributions ---
    # We measure: Cosine Sim(Word_i, Sentence_Vector)
    # This tells us: "How much does the sentence look like this specific word?"

    data = []
    for i, word_id in enumerate(word_ids):
        # Only process if it's a valid word and within doc bounds
        if word_id is not None and word_id < len(doc):
            token_vec = token_embeddings[i].reshape(1, -1)

            # Similarity to Standard Sentence Vector
            sim_std = cosine_similarity(token_vec, vec_std.reshape(1, -1))[0][0]

            # Similarity to Weighted Sentence Vector
            sim_wgt = cosine_similarity(token_vec, vec_weighted.reshape(1, -1))[0][0]

            data.append({
                "Token": tokens[i],
                "POS": doc[word_id].pos_,
                "Std Impact": sim_std,
                "Wgt Impact": sim_wgt,
                "Shift": sim_wgt - sim_std
            })

    return pd.DataFrame(data)

# Run Analysis
df = analyze_vector_shift(sentence)

# formatting for display
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
print(df[["Token", "POS", "Std Impact", "Wgt Impact", "Shift"]])


         Token   POS  Std Impact  Wgt Impact     Shift
0     explorer  NOUN    0.645442    0.750619  0.105177
1            ’  PART    0.872635    0.782999 -0.089636
2            s  NOUN    0.871457    0.738461 -0.132996
3  termination   ADP    0.466393    0.150323 -0.316069
4           in   ADJ    0.875690    0.730932 -0.144758
5       arctic  NOUN    0.564866    0.716202  0.151336
6         boat  NOUN    0.615948    0.761304  0.145356


In [None]:
sentence = nlp("Explorer’s termination in arctic boat accident")

for token in sentence:
  print(token, token.pos_)

Explorer NOUN
’s PART
termination NOUN
in ADP
arctic ADJ
boat NOUN
accident NOUN


In [None]:
sentence = nlp("Explorer’s termination in Arctic boat accident")

for token in sentence:
  print(token, token.pos_)

Explorer NOUN
’s PART
termination NOUN
in ADP
Arctic PROPN
boat NOUN
accident NOUN
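
The tags spaCy prints here differ from the POS column in the per-token table, where the tokenizer split "’s" into two pieces and every later tag slipped by one position. One possible way to keep the two tokenizations in step is to match on character offsets instead of word indices. The sketch below illustrates the idea in plain Python. The function name `align_by_offsets` and the hard-coded spans are assumptions for illustration; in the notebook the spans could come from the tokenizer's `return_offsets_mapping=True` option and from spaCy's `token.idx`.

```python
def align_by_offsets(subtoken_spans, tagged_spans):
    """Give each sub-token the tag of the tagged span that contains its start.

    Both inputs use (start, end) character offsets into the same text.
    Returns None for sub-tokens that fall inside no tagged span.
    """
    tags = []
    for sub_start, _sub_end in subtoken_spans:
        tag = None
        for (tok_start, tok_end), tok_tag in tagged_spans:
            if tok_start <= sub_start < tok_end:
                tag = tok_tag
                break
        tags.append(tag)
    return tags

text = "Explorer’s termination"
# Hypothetical sub-token offsets for: "explorer", "’", "s", "termination"
subtoken_spans = [(0, 8), (8, 9), (9, 10), (11, 22)]
# Hypothetical tagger output for: "Explorer", "’s", "termination"
tagged_spans = [((0, 8), "NOUN"), ((8, 10), "PART"), ((11, 22), "NOUN")]

aligned = align_by_offsets(subtoken_spans, tagged_spans)
for (start, end), tag in zip(subtoken_spans, aligned):
    print(text[start:end], tag)
```

Under this scheme both pieces of "’s" inherit PART and "termination" keeps NOUN, regardless of how many sub-tokens precede it.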
