<a href="https://colab.research.google.com/github/vriadi/CS614-Gen-AI-with-LLMs/blob/main/Topic2_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import transformers
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)


In [None]:
model = transformers.BertModel.from_pretrained("bert-base-cased")

In [None]:
embedding_layer = model.embeddings

In [None]:
word_embeddings = embedding_layer.word_embeddings

In [None]:
word1 = 'red'
print(tokenizer(word1))

{'input_ids': [101, 1894, 102], 'token_type_ids': [0, 0, 0], 'attention_mask': [1, 1, 1]}


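Token ID 101 is the `[CLS]` (classification) token, which the tokenizer adds at the start of every sequence. <br>
Token ID 102 is the `[SEP]` (separator) token, which it adds at the end. The ID in between (1894) is the token for 'red'.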
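
To make the structure of that output concrete, here is a minimal sketch that rebuilds the same dictionary by hand. `wrap_single_sequence` is a hypothetical helper written only for illustration and is not part of the `transformers` API; the IDs 101, 102, and 1894 are copied from the output above.

```python
CLS_ID = 101  # [CLS], as seen in the tokenizer output above
SEP_ID = 102  # [SEP], as seen in the tokenizer output above

def wrap_single_sequence(token_ids):
    """Hypothetical helper: mimic the dict returned for one unpadded sequence."""
    input_ids = [CLS_ID] + list(token_ids) + [SEP_ID]
    return {
        "input_ids": input_ids,
        "token_type_ids": [0] * len(input_ids),  # a single segment, so all zeros
        "attention_mask": [1] * len(input_ids),  # no padding, so all ones
    }

encoded_red = wrap_single_sequence([1894])  # 1894 is the ID shown for 'red'
print(encoded_red)
```

The real tokenizer does much more (subword splitting, vocabulary lookup); this only shows where the two special tokens sit.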

In [None]:
word2 = 'pink'
print(tokenizer(word2))

{'input_ids': [101, 5325, 102], 'token_type_ids': [0, 0, 0], 'attention_mask': [1, 1, 1]}


Now let's compute the cosine similarity between the static (input) embeddings of the two words.
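
For reference, the quantity computed in the following cells is the cosine of the angle between two embedding vectors. The symbols $u$, $v$, and $d$ are chosen here for illustration; $d$ is the embedding dimension (768 for BERT-base).

```latex
\cos(u, v) \;=\; \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
           \;=\; \frac{\sum_{i=1}^{d} u_i v_i}
                      {\sqrt{\sum_{i=1}^{d} u_i^{2}} \; \sqrt{\sum_{i=1}^{d} v_i^{2}}}
```

The value lies between -1 and 1, and it depends only on the direction of the vectors, not on their lengths.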

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

In [None]:
token_ids1 = tokenizer.encode(word1, add_special_tokens=False)
token_ids2 = tokenizer.encode(word2, add_special_tokens=False)

In [None]:
# Re-select embeddings from embeddings layer without the extra tokens this time
embedding1 = embedding_layer.word_embeddings.weight[token_ids1]
embedding2 = embedding_layer.word_embeddings.weight[token_ids2]
# Convert tensors to NumPy arrays and squeeze result to remove extra dimensions
embedding1 = embedding1.squeeze().detach().numpy()
embedding2 = embedding2.squeeze().detach().numpy()

The `[CLS]` and `[SEP]` tokens are excluded from this computation, which is why the words were encoded with `add_special_tokens=False` above.

In [None]:
# Norms of the embeddings
magnitude1 = np.linalg.norm(embedding1)
magnitude2 = np.linalg.norm(embedding2)

# Cosine similarity by hand: dot product divided by the product of the norms
cosine_sim = np.dot(embedding1, embedding2) / (magnitude1 * magnitude2)

In [None]:
cosine_sim = cosine_similarity([embedding1], [embedding2])
print(f"Cosine sim of '{word1}' and '{word2}': {cosine_sim}")

Cosine sim of 'red' and 'pink': [[0.54829156]]
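
The double brackets in the printed result appear because scikit-learn's `cosine_similarity` returns a 2D array with one row per vector in the first argument and one column per vector in the second. The sketch below reproduces that shape with plain NumPy on toy vectors (not real BERT embeddings) and shows one way to pull out a scalar; `pairwise_cosine` is a hypothetical name used only here.

```python
import numpy as np

def pairwise_cosine(X, Y):
    """Return an (n, m) matrix of cosine similarities between rows of X and rows of Y."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # Scale every row to unit length, then a matrix product gives all pairwise cosines
    X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)
    Y_unit = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return X_unit @ Y_unit.T

# Toy 3-dimensional vectors, chosen so the angle between them is 45 degrees
a = np.array([1.0, 0.0, 0.0])
b = np.array([1.0, 1.0, 0.0])

sim_matrix = pairwise_cosine([a], [b])  # shape (1, 1), like the printed [[...]]
sim_scalar = float(sim_matrix[0, 0])    # plain Python float
print(sim_matrix.shape, round(sim_scalar, 4))
```

Indexing with `[0, 0]` in the same way on the notebook's `cosine_sim` would give a bare number for printing.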


Now let's move from single words to full sentences and see the benefit of the attention mechanism.

In [None]:
import torch, transformers
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel

model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)  # resolves to a BertModel for this checkpoint
embedding_layer = model.embeddings

In [None]:
sent1 = "I love animals, generally all of them."
sent2 = "But I don't really like creepy crawlies."

In [None]:
# encode() adds [CLS] and [SEP] by default, so both are included in the averages below
tokens_sent1 = tokenizer.encode(sent1)
tokens_sent2 = tokenizer.encode(sent2)

In [None]:
# Static sentence vectors: average the input-embedding rows of all tokens (no attention involved)
s_embedding1 = embedding_layer.word_embeddings.weight[tokens_sent1].mean(dim=0).squeeze().detach().numpy()
s_embedding2 = embedding_layer.word_embeddings.weight[tokens_sent2].mean(dim=0).squeeze().detach().numpy()

In [None]:
cosine_sim_static = cosine_similarity([s_embedding1], [s_embedding2])

print(f"Cosine sim of \n'{sent1}'\n and '{sent2}'\n is: {cosine_sim_static}")

Cosine sim of 
'I love animals, generally all of them.'
 and 'But I don't really like creepy crawlies.'
 is: [[0.58911043]]
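
This static score comes from averaging rows of a lookup table, so each token contributes the same vector whatever surrounds it, and the average ignores word order. The sketch below illustrates that on a small random table; `toy_table` and `static_sentence_vector` are hypothetical names, and the table stands in for `word_embeddings.weight` only in shape and indexing behaviour.

```python
import numpy as np

rng = np.random.default_rng(0)
toy_table = rng.normal(size=(10, 4))  # hypothetical vocabulary: 10 tokens, 4 dimensions

def static_sentence_vector(token_ids, table):
    """Average the rows of the lookup table selected by token_ids."""
    return table[token_ids].mean(axis=0)

sent_a = [1, 5, 7]
sent_b = [7, 5, 1]  # same tokens in a different order

vec_a = static_sentence_vector(sent_a, toy_table)
vec_b = static_sentence_vector(sent_b, toy_table)

# Both orderings give the same vector: the lookup knows nothing about context
print(np.allclose(vec_a, vec_b))
```

A forward pass through the full model removes this limitation, because attention lets every token's output vector depend on the other tokens in the sentence.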


In [None]:
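# Tokenize each sentence into PyTorch tensors (padding=True has no effect on a single sentence)
sent_ids1 = tokenizer(sent1, return_tensors="pt", padding=True)
sent_ids2 = tokenizer(sent2, return_tensors="pt", padding=True)
with torch.no_grad():  # inference only, so no gradients are needed
    outputs1 = model(**sent_ids1)
    outputs2 = model(**sent_ids2)
# Contextual sentence vectors: average the last-layer hidden states over all tokens
s_embedding3 = outputs1.last_hidden_state.mean(dim=1).squeeze().numpy()
s_embedding4 = outputs2.last_hidden_state.mean(dim=1).squeeze().numpy()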

cosine_sim_mean = cosine_similarity([s_embedding3], [s_embedding4])

print(f"Cosine sim with BERT forward is {cosine_sim_mean} \n"
      f"With STATIC embeddings it was {cosine_sim_static}")

Cosine sim with BERT forward is [[0.8429634]] 
With STATIC embeddings it was [[0.58911043]]
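
The cell above encodes each sentence on its own, so no padding tokens are present and a plain mean over the token axis is safe. If the two sentences were instead tokenized together as one padded batch, the shorter one would gain padding positions, and a plain mean would average those in as well. One common remedy is masked mean pooling, sketched below in NumPy on toy arrays; `masked_mean` is a hypothetical helper, and the arrays only imitate the shapes of `last_hidden_state` and `attention_mask`.

```python
import numpy as np

def masked_mean(hidden_states, attention_mask):
    """Mean over the token axis, counting only positions where attention_mask is 1.

    hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    """
    mask = np.asarray(attention_mask, dtype=float)[:, :, None]  # (batch, seq_len, 1)
    summed = (np.asarray(hidden_states, dtype=float) * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against an all-zero mask
    return summed / counts

# Toy batch: 2 sequences, 3 positions, 2 hidden dimensions
hidden = np.array([
    [[1.0, 1.0], [3.0, 3.0], [9.0, 9.0]],  # last position is padding
    [[2.0, 0.0], [4.0, 2.0], [6.0, 4.0]],  # no padding
])
mask = np.array([[1, 1, 0],
                 [1, 1, 1]])

pooled = masked_mean(hidden, mask)
print(pooled)  # first row averages only the two real tokens
```

With the padded position excluded, the first sequence pools to `[2.0, 2.0]` rather than the `[4.33, 4.33]` a plain mean would give.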
