Example 8

Write a function that computes sentence embeddings with a pre-trained BERT model from Hugging Face, then use it to calculate the similarity between the two sentences below:
"Generative AI or generative artificial intelligence refers to the use of AI to create new content, like text, images, music, audio, and videos."
"Generative AI is a type of artificial intelligence technology that can produce various types of content, including text, imagery, audio and synthetic data."

In [2]:
pip install transformers torch scipy



In [4]:
from transformers import BertTokenizer, BertModel
import torch
from scipy.spatial.distance import cosine

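def get_sentence_embedding(sentence, model, tokenizer):
    """Return the [CLS] token embedding of `sentence` as a 1-D NumPy array."""
    # Tokenize into PyTorch tensors; inputs longer than BERT's 512-token limit are truncated.
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, padding=True, max_length=512)

    # Inference only, so skip gradient tracking.
    with torch.no_grad():
        outputs = model(**inputs)

    # Position 0 of the last hidden layer is the [CLS] token, used here as the sentence vector.
    sentence_embedding = outputs.last_hidden_state[:, 0, :]
    return sentence_embedding.squeeze().numpy()

def cosine_similarity(embedding1, embedding2):
    # scipy's cosine() returns the cosine distance, so 1 - distance gives the similarity.
    return 1 - cosine(embedding1, embedding2)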

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

sentence1 = "Generative AI or generative artificial intelligence refers to the use of AI to create new content, like text, images, music, audio, and videos."
sentence2 = "Generative AI is a type of artificial intelligence technology that can produce various types of content, including text, imagery, audio and synthetic data."

embedding1 = get_sentence_embedding(sentence1, model, tokenizer)
embedding2 = get_sentence_embedding(sentence2, model, tokenizer)

similarity = cosine_similarity(embedding1, embedding2)
print(f"Cosine Similarity between the two sentences: {similarity:.4f}")

Cosine Similarity between the two sentences: 0.8916
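
To make the `1 - cosine(...)` step less opaque, here is a minimal sketch of the same quantity computed by hand with only the standard library. The function name `cosine_similarity_manual` and the toy vectors are assumptions made up for illustration; they are not BERT outputs and not part of the original notebook.

```python
import math

def cosine_similarity_manual(u, v):
    """Cosine similarity of two equal-length vectors: dot(u, v) / (|u| * |v|)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Toy 2-D vectors (hypothetical, chosen only to show the range of the score).
same_direction = cosine_similarity_manual([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity_manual([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity_manual([1.0, 2.0], [-1.0, -2.0])
print(same_direction, orthogonal, opposite)
```

The score depends only on the angle between the vectors, not on their lengths: it is close to 1 for vectors pointing the same way, 0 for orthogonal vectors, and close to -1 for opposite ones. Passing the 768-dimensional embeddings as lists (for example via `embedding1.tolist()`) should give the same value as the SciPy-based function up to floating-point error.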


Easy

In [3]:
from transformers import BertTokenizer, BertModel
import torch
from scipy.spatial.distance import cosine

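def compute_sentence_embedding(sentence, model, tokenizer):
    """Return the [CLS] token embedding of `sentence` as a 1-D tensor."""
    inputs = tokenizer(sentence, return_tensors='pt', truncation=True, padding=True)
    # Inference only, so disable gradient tracking to save memory.
    with torch.no_grad():
        outputs = model(**inputs)
    # Position 0 of the last hidden layer is the [CLS] token.
    cls_embedding = outputs.last_hidden_state[:, 0, :].squeeze()
    return cls_embedding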

def calculate_similarity(sentence1, sentence2):
    model_name = 'bert-base-uncased'
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = BertModel.from_pretrained(model_name)

    embedding1 = compute_sentence_embedding(sentence1, model, tokenizer)
    embedding2 = compute_sentence_embedding(sentence2, model, tokenizer)

    similarity = 1 - cosine(embedding1.detach().numpy(), embedding2.detach().numpy())
    return similarity


sentence1 = "Generative AI or generative artificial intelligence refers to the use of AI to create new content, like text, images, music, audio, and videos."
sentence2 = "Generative AI is a type of artificial intelligence technology that can produce various types of content, including text, imagery, audio and synthetic data."

similarity_score = calculate_similarity(sentence1, sentence2)
print(f"Similarity score: {similarity_score}")

Similarity score: 0.8915842421848057


END