# Calculating Embeddings
Let's look at three methods to calculate embeddings. We use cosine similarity as the running example, but the same embeddings can be used in many other applications.

We try two models with each method; either could be replaced with another embedding model from Hugging Face (for example [here](https://huggingface.co/models?pipeline_tag=sentence-similarity&sort=trending)). We will compare `sentences_1` and `sentences_2`:

In [20]:
model_baai = "BAAI/bge-small-en-v1.5"
model_multi = "sentence-transformers/distiluse-base-multilingual-cased-v2"

sentences_1 = [
    "Assessing the functionality of different embedding models.",
    "Two approaches for creating embeddings are provided here."
]
# To demonstrate how the similarity metric works, we use sentences with similar meanings:
sentences_2 = [
    "Evaluating the performance of various types of embeddings.",
    "Here, we are presented with two methods for generating embeddings."
]

### Method 1: Using Sentence Transformers


In [21]:
# Requires the sentence_transformers package (see the pip install cell under Method 2)
from sentence_transformers import SentenceTransformer

# Define a function to compute cosine similarity between embeddings of two sets of sentences


def compute_similarity_st(sentences_1, sentences_2, model_name):
    # Initialize the SentenceTransformer model with the specified model name
    model = SentenceTransformer(model_name)

    # Encode the first set of sentences to get their embeddings, normalizing the results
    embeddings_1 = model.encode(sentences_1, normalize_embeddings=True)

    # Encode the second set of sentences to get their embeddings, normalizing the results
    embeddings_2 = model.encode(sentences_2, normalize_embeddings=True)

    # Compute the cosine similarity between the two sets of embeddings
    # The @ operator performs matrix multiplication which is equivalent to cosine similarity
    # when the embeddings are normalized.
    return embeddings_1 @ embeddings_2.T


# Compute the similarity matrices using the BAAI model and the multilingual model
similarity_baai_st = compute_similarity_st(
    sentences_1, sentences_2, model_baai)
similarity_multi_st = compute_similarity_st(
    sentences_1, sentences_2, model_multi)

# Print out the cosine similarity matrices for comparison
print("Compare similarity calculated for BAAI embeddings: ", similarity_baai_st)
print("and for multilingual embeddings: ", similarity_multi_st)

Compare similarity calculated for BAAI embeddings:  [[0.9387083  0.83786726]
 [0.8362595  0.95773757]]
and for multilingual embeddings:  [[0.7264205  0.06265946]
 [0.09025779 0.83387554]]
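
The comment in `compute_similarity_st` says that a matrix product of normalized embeddings is equivalent to cosine similarity. A minimal sketch of that equivalence follows, using made-up toy vectors rather than real model outputs; the names `a`, `b`, `cos_explicit`, and `cos_normalized` are illustrative only.

```python
import numpy as np

# Toy "embeddings": two sets of two 3-dimensional vectors (made-up values).
a = np.array([[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]])
b = np.array([[2.0, 1.0, 0.0], [1.0, 2.0, 3.5]])

# Explicit cosine similarity: dot product divided by the product of the norms.
cos_explicit = (a @ b.T) / (
    np.linalg.norm(a, axis=1, keepdims=True) * np.linalg.norm(b, axis=1)
)

# Normalize each row to unit length, then take a plain matrix product.
a_unit = a / np.linalg.norm(a, axis=1, keepdims=True)
b_unit = b / np.linalg.norm(b, axis=1, keepdims=True)
cos_normalized = a_unit @ b_unit.T

# Both routes give the same 2x2 similarity matrix.
print(np.allclose(cos_explicit, cos_normalized))
```

Entry `[i, j]` of the result compares sentence `i` of the first set with sentence `j` of the second, which is why the diagonal of the matrices above holds the scores for the paraphrased pairs.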


### Method 2: Using LangChain

In [36]:
!pip install langchain==0.1.12
!pip install sentence_transformers==2.5.1

In [22]:
from langchain.embeddings import HuggingFaceEmbeddings
import numpy as np

# Define a function to compute cosine similarity using the HuggingFaceEmbeddings


def compute_similarity_langchain(sentences_1, sentences_2, model_name):
    # Initialize the HuggingFaceEmbeddings with the specified model and normalization
    model = HuggingFaceEmbeddings(model_name=model_name, encode_kwargs={
                                  'normalize_embeddings': True})

    # Generate embeddings for the first set of sentences and convert to a numpy array
    embeddings_1 = np.array(model.embed_documents(sentences_1))

    # Generate embeddings for the second set of sentences and convert to a numpy array
    embeddings_2 = np.array(model.embed_documents(sentences_2))

    # Compute and return the cosine similarity matrix between the two sets of embeddings
    return embeddings_1 @ embeddings_2.T


# Compute the similarity matrices for the two models
similarity_baai_langchain = compute_similarity_langchain(
    sentences_1, sentences_2, model_baai)
similarity_multi_langchain = compute_similarity_langchain(
    sentences_1, sentences_2, model_multi)

# Print out the cosine similarity matrices for comparison
print("Compare similarity calculated for BAAI embeddings: ",
      similarity_baai_langchain)
print("and for multilingual embeddings: ", similarity_multi_langchain)

Compare similarity calculated for BAAI embeddings:  [[0.93870836 0.83786722]
 [0.83625949 0.95773762]]
and for multilingual embeddings:  [[0.72642046 0.06265945]
 [0.09025782 0.83387566]]


### Method 3: Loading the model directly

In [27]:
from transformers import AutoTokenizer, AutoModel
import torch

# Define a function to compute embeddings and their cosine similarity


def compute_similarity_direct(sentences_1, sentences_2, model_name):
    # Initialize the tokenizer and model with the specified model name
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Tokenize and encode the first set of sentences
    encoded_input_1 = tokenizer(
        sentences_1, padding=True, truncation=True, return_tensors='pt')
    # Tokenize and encode the second set of sentences
    encoded_input_2 = tokenizer(
        sentences_2, padding=True, truncation=True, return_tensors='pt')

    # Generate embeddings for the first set of sentences without gradient calculation for efficiency
    with torch.no_grad():
        model_output_1 = model(**encoded_input_1)
        # Extract the [CLS] token's embeddings as sentence embeddings
        sentence_embeddings_1 = model_output_1[0][:, 0]

    # Generate embeddings for the second set of sentences
    with torch.no_grad():
        model_output_2 = model(**encoded_input_2)
        # Extract the [CLS] token's embeddings as sentence embeddings
        sentence_embeddings_2 = model_output_2[0][:, 0]

    # Normalize the embeddings to unit length
    embeddings_1 = torch.nn.functional.normalize(
        sentence_embeddings_1, p=2, dim=1)
    embeddings_2 = torch.nn.functional.normalize(
        sentence_embeddings_2, p=2, dim=1)

    # Calculate and return the cosine similarity matrix
    return embeddings_1 @ embeddings_2.T


# Compute the similarity using the BAAI model
similarity_baai_direct = compute_similarity_direct(
    sentences_1, sentences_2, model_baai)
# Compute the similarity using the multilingual model
similarity_multi_direct = compute_similarity_direct(
    sentences_1, sentences_2, model_multi)

# Print the similarity results for comparison
print("Compare similarity calculated for BAAI embeddings: ", similarity_baai_direct)
print("and for multilingual embeddings: ", similarity_multi_direct)

Compare similarity calculated for BAAI embeddings:  tensor([[0.9387, 0.8379],
        [0.8363, 0.9577]])
and for multilingual embeddings:  tensor([[0.7713, 0.3157],
        [0.3659, 0.8487]])


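For the multilingual model (`"sentence-transformers/distiluse-base-multilingual-cased-v2"`), the similarity scores from the third method differ from those of the first two. The discrepancy arises from differences in how the embeddings are generated, even though the same underlying model is used. The BAAI results, by contrast, agree across all three methods, so for that model taking the first token's output and normalizing it evidently matches what the libraries do.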
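
The size of the gap can be checked from the printed outputs. The sketch below copies the multilingual-model matrices from the cell outputs above; the tolerance `atol=1e-6` is an assumption chosen to absorb floating-point noise, not a value taken from any library default.

```python
import numpy as np

# Multilingual-model similarity matrices, copied from the outputs printed above.
sim_st = np.array([[0.7264205, 0.06265946], [0.09025779, 0.83387554]])
sim_langchain = np.array([[0.72642046, 0.06265945], [0.09025782, 0.83387566]])
sim_direct = np.array([[0.7713, 0.3157], [0.3659, 0.8487]])

# Methods 1 and 2 agree up to floating-point noise (assumed tolerance: 1e-6).
methods_1_2_agree = np.allclose(sim_st, sim_langchain, atol=1e-6)

# Method 3 differs by far more than rounding error.
max_gap_direct = np.abs(sim_st - sim_direct).max()

print(methods_1_2_agree)
print(max_gap_direct)
```

The largest gap is in the off-diagonal entries, that is, in the scores for the sentence pairs that are not paraphrases of each other.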

The first two functions use libraries that are specifically designed to produce **sentence embeddings**: Sentence Transformers, and LangChain's `HuggingFaceEmbeddings`, which relies on Sentence Transformers under the hood. These libraries apply additional processing steps to the raw model outputs to produce sentence-level embeddings.

When we load a model directly with the Hugging Face `transformers` library, by contrast, we handle the model's output ourselves and select the first token's output (`[0][:, 0]`) as the sentence representation; in BERT-like models this is typically the `[CLS]` token. This approach doesn't necessarily match how a given Sentence Transformers model is designed to produce sentence embeddings, which may involve a different pooling strategy or additional layers.
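
To make the pooling choice concrete, here is a minimal sketch contrasting first-token pooling (what Method 3 does) with masked mean pooling, a common alternative. It uses a made-up array in place of a real model's hidden states, and it is an assumption, not something verified here, that the multilingual model's library pipeline uses a strategy other than first-token pooling. The names `token_embeddings`, `attention_mask`, `cls_pooled`, and `mean_pooled` are illustrative.

```python
import numpy as np

# Toy stand-in for a transformer's last hidden state:
# shape (batch=2, tokens=4, hidden=3). Values are made up for illustration.
token_embeddings = np.array([
    [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [9.0, 9.0, 9.0]],
    [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [4.0, 4.0, 4.0], [1.0, 1.0, 1.0]],
])
# Attention mask: 1 for real tokens, 0 for padding
# (the last token of the first sentence is padding).
attention_mask = np.array([[1, 1, 1, 0], [1, 1, 1, 1]])

# First-token pooling, as in Method 3: keep only the first token's vector.
cls_pooled = token_embeddings[:, 0]

# Masked mean pooling: average the vectors of all non-padding tokens.
mask = attention_mask[:, :, None].astype(float)
mean_pooled = (token_embeddings * mask).sum(axis=1) / mask.sum(axis=1)

print(cls_pooled)
print(mean_pooled)
```

The two strategies produce different vectors from the same hidden states, so similarity scores computed from them generally differ as well. The same arithmetic applies to the `torch` tensors returned in Method 3.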

To keep embeddings consistent across different parts of your code or application, *it's crucial to use the same method for embedding generation* everywhere.