### Embedding Model Evaluation
> Embedding models can be evaluated on custom data chosen to match the use case or purpose  
> The custom data can represent the industry / domain, as well as the task (classification / clustering / similarity etc)  
> Depending on need, a system can use several embedding models for different parts

#### Import Required Libraries

In [None]:
# Sentence transformers to use the embedding models locally
from sentence_transformers import SentenceTransformer, util
import pandas as pd

# Google AI library
from google import genai
from google.genai import types

# Load Environment variables from file
from dotenv import load_dotenv

# Initialise a client object; the API key is read from the loaded environment
load_dotenv()
client = genai.Client()

# Import library for making the tests
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Functions from SciPy for check / test
from scipy import spatial
from scipy.stats import pearsonr

import csv

#### Utility Functions
> Helper functions for basic comparisons and tests  
> They can be reused in other modules

In [None]:
def cosine_similarity(vec1, vec2):

    """
    Return the cosine similarity between two vectors.
    Vectors are given as lists of numbers.
    The result lies between -1 and 1; 1 is the highest value and means the vectors point in the same direction.
    """

    # spatial.distance.cosine returns the cosine distance,
    # so similarity = 1 - distance (the vectors do not need to be unit length)
    
    return 1 - spatial.distance.cosine(vec1, vec2)
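
As a cross-check of the relation used in `cosine_similarity`, the same value can be computed directly from the definition: the dot product divided by the product of the vector lengths. The sketch below uses only the standard library; the name `cosine_similarity_manual` is hypothetical and not part of the notebook.

```python
import math


def cosine_similarity_manual(vec1, vec2):
    """Cosine similarity from its definition: dot(v1, v2) / (|v1| * |v2|)."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm1 = math.sqrt(sum(a * a for a in vec1))
    norm2 = math.sqrt(sum(b * b for b in vec2))
    return dot / (norm1 * norm2)


print(cosine_similarity_manual([1, 0], [1, 0]))   # same direction: 1.0
print(cosine_similarity_manual([1, 0], [0, 1]))   # orthogonal: 0.0
print(cosine_similarity_manual([1, 0], [-1, 0]))  # opposite direction: -1.0
```

Scaling either vector leaves the result unchanged, which is why no normalisation step is needed before calling the SciPy-based helper.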

In [None]:
def get_cosine_similarities_HF(model, pairs):
    
    """
    Compute the cosine similarity of each pair in a given list of text pairs.
    model : a model from HF, loaded as a SentenceTransformer.
    Both texts of each pair are embedded, and the cosine similarities are returned as a list of numbers.
    """

    similarities = []
    
    for s1, s2 in pairs:
                
        # Embed the texts
        emb1 = model.encode(s1, convert_to_tensor=True)
        emb2 = model.encode(s2, convert_to_tensor=True)
        
        # Identify the cosine similarity
        sim = util.cos_sim(emb1, emb2).item()
        similarities.append(sim)
    
    return similarities

In [None]:
def get_cosine_similarities_gemini(model, pairs):
    
    """
    Compute the cosine similarity of each pair in a given list of text pairs.
    model : name of the Gemini embedding model, as a string.
    Both texts of each pair are embedded, and the cosine similarities are returned as a list of numbers.
    """

    similarities = []
    
    for s1, s2 in pairs:
        
        result = client.models.embed_content(
                model=model,        
                contents=[s1, s2],
                config=types.EmbedContentConfig(output_dimensionality=768)
                )

        sim = cosine_similarity (result.embeddings[0].values, result.embeddings[1].values)

        similarities.append(sim)

    return similarities

#### Make a Similarity Test
> Pick sentence pairs with a known level of similarity  
> Calculate the similarity score for each pair with each model  
> Compare the scores from the different models for the same text pair

In [None]:
# Choose 2 embedders from HF to compare
a = "sentence-transformers/all-MiniLM-L6-v2"
b = "BAAI/bge-m3"
model_a = SentenceTransformer(a)
model_b = SentenceTransformer(b)

# Choose an embedding model from Gemini
g = "gemini-embedding-001"

In [None]:
# Set of known texts with expected similarity.
sentence_pairs = [
    ("A man is eating pizza.", "A man is eating pizza."), # Same
    ("The weather was rainy last week.", "Currently I am working under hot sun"), # opposite
    ("AI helps techies", "Techies actually help AI"), # opposite ..?
    ("He is playing flute.", "A person plays an instrument."),   # related
    ("I love dogs.", "She has a dog as pet."),      # somewhat related
    ("My car is red.", "My vehicle is painted Red"),        # similar
]

# Compute similarity list from the chosen HF models for same set of texts
sims_a = get_cosine_similarities_HF(model_a, sentence_pairs)
sims_b = get_cosine_similarities_HF(model_b, sentence_pairs)

# Compute similarity list from the Gemini model as well
sims_g = get_cosine_similarities_gemini (g, sentence_pairs)

# Result comparison
Result = pd.DataFrame({
    'Sentence 1': [p[0] for p in sentence_pairs],
    'Sentence 2': [p[1] for p in sentence_pairs],
    'Model A': sims_a,
    'Model B': sims_b,
    'Model G' : sims_g,
})

# Show which model each column refers to, then display the scores side by side
print ("Model A : "+a+"\nModel B : "+b+"\nModel G : "+g)

Result

**Similarity Test : Data Set**  
> Perform the similarity test with a predefined data set, organised as text pairs with a corresponding annotated score  
> It is a sample taken from an existing HF dataset  
> Similarly, a custom data set can be created, with an annotated score or an enumerated range of scores  
> Finally, the model scores are compared with the annotated scores using a correlation score

In [None]:
# Load the Test data from CSV file
Similarity_Sample = pd.read_csv ('Similarity_Data_sample.csv')

# Text pairs that are going to be used for similarity test
Test_Pairs = list (zip (Similarity_Sample['sentence1'], Similarity_Sample['sentence2']))

# Similarity score from Test Data as a reference
Test_Sim = Similarity_Sample['score'].to_list()

# Calculate pair wise similarity from 2 different models
sims_a = get_cosine_similarities_HF(model_a, Test_Pairs)
sims_b = get_cosine_similarities_HF(model_b, Test_Pairs)

**Pearson Correlation**  
It indicates how well the test result (the similarity scores from a model) and the reference scores correlate linearly  
The coefficient ranges from -1 to 1. The higher the number, the better the model's similarity scores align with the reference data  
Instead of checking absolute values, the similarity scores can be enumerated based on range ( > 0.95 : Same; 0.8 .. 0.95 : Similar etc)
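
The range-based comparison mentioned above could look roughly like the sketch below. It is an illustration under assumptions: the thresholds follow the ranges given in the text, the `"Different"` label for scores below 0.8 is hypothetical, and the reference scores are assumed to be on the same 0 to 1 scale as the cosine similarities. The function names are not part of the notebook.

```python
def bucket_similarity(score):
    """Map a similarity score to a coarse label using the ranges from the text."""
    if score > 0.95:
        return "Same"
    if score >= 0.8:
        return "Similar"
    return "Different"  # hypothetical label for everything below 0.8


def bucket_agreement(reference_scores, model_scores):
    """Fraction of pairs where the model and the reference land in the same bucket."""
    ref = [bucket_similarity(s) for s in reference_scores]
    got = [bucket_similarity(s) for s in model_scores]
    matches = sum(r == g for r, g in zip(ref, got))
    return matches / len(ref)


# Made-up scores, only to show the call
reference = [1.0, 0.9, 0.3]
predicted = [0.97, 0.7, 0.2]
print(bucket_agreement(reference, predicted))
```

With real data, `Test_Sim` and `sims_a` (or `sims_b`) would take the place of the made-up lists, provided the annotated scores are first rescaled to the 0 to 1 range if they use a different scale.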

In [None]:
# Compute Pearson correlation : Model A
corr_a, p_value_a = pearsonr(Test_Sim, sims_a)
print(f"Model A : Pearson correlation: {corr_a:.4f}")

# Compute Pearson correlation : Model B
corr_b, p_value_b = pearsonr(Test_Sim, sims_b)
print(f"Model B : Pearson correlation: {corr_b:.4f}")

#### Make a Classification Test
> When texts are vectorised by an embedding model, the resulting vectors (number arrays) can be treated as classification features  
> If predefined classification labels exist for the texts, the corresponding vectors from the model can be used to train a classification model and test it  
> The result indicates how cohesively the vectors are grouped in the vector space for the given set of classes

In [None]:
# Define a test data set with texts and their corresponding class labels
# The labels can represent various tags, depending on the use case
texts = [
        "I love this movie!",
        "This was terrible.",
        "Amazing performance.",
        "Not worth watching.",
        "It was okay, not great.",
        "Absolutely fantastic!"
]
labels = ['Positive', 'Negative','Positive', 'Negative','Negative','Positive']

# Split the data set for training and testing
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.3, random_state=42)

> For each model, create the embeddings for both the training data and the test data  
> Build a simple classification model from the training vectors and their labels  
> Once the model is built, predict the class for the test vectors  
> Compare the predicted classes with the class labels from the test data set to get the accuracy

In [None]:
# The models for comparison
models = [
    "sentence-transformers/all-MiniLM-L6-v2",
    "sentence-transformers/all-mpnet-base-v2"
]

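for name in models:
    # Load the model and embed the training and test texts
    model = SentenceTransformer(name)
    train_emb = model.encode(X_train)
    test_emb = model.encode(X_test)

    # Train a simple classifier on the training vectors and predict the test set
    cl_model = LogisticRegression(max_iter=1000)
    cl_model.fit(train_emb, y_train)
    preds = cl_model.predict(test_emb)

    # Compare the predicted classes with the test labels
    acc = accuracy_score(y_test, preds)
    print(f"{name} → Accuracy: {acc:.3f}")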

**Data Set : Industry operation**  
Pick a data set that represents industry data.  
Each text describes a process step / operation / requirement from an industry.  
The classification label gives the department in the organisation that is responsible for, or relevant to, the process.  
As before, vectorise the texts, then train and test the classification model.  
The score indicates how well the vectors from the 2 models capture the nuance of the industry / organisation, when used in **classification**

In [None]:
# Read Sample data from CSV
Data = pd.read_csv ('Process_n_Departments.csv')

# Take the texts and labels from the data frame
texts = Data['Process'].to_list ()
labels = Data['Department'].to_list ()

# Split
X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.4, random_state=42)
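
The cell above stops at the split. One way to complete the train-and-test step for this data set is sketched below, mirroring the earlier classification loop. The helper name `evaluate_encoder` and the stand-in `toy_encode` are hypothetical; the toy texts and labels exist only so the sketch runs without downloading a model or reading the CSV.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def evaluate_encoder(encode, texts, labels, test_size=0.4, random_state=42):
    """Split the data, fit a logistic regression on the embeddings, return test accuracy.

    `encode` is any callable that maps a list of strings to a 2-D array,
    for example SentenceTransformer(name).encode.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, random_state=random_state
    )
    clf = LogisticRegression(max_iter=1000)
    clf.fit(encode(X_train), y_train)
    preds = clf.predict(encode(X_test))
    return accuracy_score(y_test, preds)


def toy_encode(batch):
    """Stand-in encoder (assumption): two hand-made features instead of real embeddings."""
    return np.array([[float("invoice" in t), float(len(t))] for t in batch])


# Toy data, only to show the call
toy_texts = [f"invoice {i}" for i in range(10)] + [f"weld {i}" for i in range(10)]
toy_labels = ["Finance"] * 10 + ["Production"] * 10

acc = evaluate_encoder(toy_encode, toy_texts, toy_labels)
print(f"Toy encoder accuracy: {acc:.3f}")
```

With the real data, a call such as `evaluate_encoder(SentenceTransformer(name).encode, texts, labels)` for each entry in `models` would give one accuracy per embedding model on the same split.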


**Visualise**  
> The vectors produced by an embedding model, together with their labels, can be used for visualisation  
> High-dimensional vectors cannot be plotted directly; they must first be reduced to 2 or 3 dimensions  
> Store them as tab-separated files, which can then be loaded in https://projector.tensorflow.org/

In [None]:
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
Vectors = model.encode (texts, convert_to_numpy=True)

# Write the vectors to a tab-separated file, one vector per row

with open('vectors.tsv', "w", newline="", encoding="utf-8") as f:
    
    writer = csv.writer(f, delimiter='\t')
    
    for V in Vectors:
        writer.writerow(V.tolist())

# Write the labels to a second file, one label per row, in the same order

with open('labels.tsv', "w", newline="", encoding="utf-8") as f:
    
    writer = csv.writer(f, delimiter='\t')
    
    for L in labels:
        writer.writerow([L])
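
For a quick local look without the projector, the dimension reduction can also be done in code. The sketch below uses PCA from scikit-learn as one possible reduction method; this is an assumption, not the notebook's method. The helper name `reduce_to_2d` is hypothetical, and random vectors stand in for the output of `model.encode(texts)`.

```python
import numpy as np
from sklearn.decomposition import PCA


def reduce_to_2d(vectors, random_state=0):
    """Project embedding vectors onto their first two principal components."""
    pca = PCA(n_components=2, random_state=random_state)
    return pca.fit_transform(np.asarray(vectors))


# Random stand-in for real embeddings: 10 vectors with 384 dimensions each
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(10, 384))

points = reduce_to_2d(demo_vectors)
print(points.shape)  # (10, 2)
```

The resulting 2-D points can be drawn as a scatter plot coloured by label, which gives a rough picture of how well the classes separate.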