# Cosine Similarity Evaluation for RAG Responses

This notebook evaluates RAG model responses by computing the cosine similarity between sentence embeddings of three text pairs: generated answer vs. true answer, generated answer vs. retrieved context, and question vs. retrieved context. The resulting scores are stored in the same table as the existing TruLens scores.
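
For reference, the cosine similarity of two embedding vectors is their dot product divided by the product of their norms. The sketch below spells that definition out in NumPy; the helper name `cosine` is illustrative and is not used elsewhere in this notebook, which relies on scikit-learn's `cosine_similarity` instead.

```python
import numpy as np


def cosine(a, b):
    """Cosine similarity of two 1-D vectors: dot product over the product of norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return float("nan")  # undefined when either vector is all zeros
    return float(a @ b / denom)
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, and opposite vectors score -1, regardless of their lengths.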

## 1. Import libraries and dependencies

In [1]:
# ! pip install sentence-transformers

In [2]:
import pandas as pd

from datetime import datetime

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity


In [3]:
# Load the sentence transformer model for embeddings
# reference: https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2
# model = SentenceTransformer("all-MiniLM-L6-v2")  # smaller 6-layer alternative
model = SentenceTransformer("all-MiniLM-L12-v2")

In [4]:
# Calculate the cosine similarity between two text fields
def compute_cosine_similarity(text1, text2):
    # Missing values (NaN/None) cannot be embedded
    if pd.isna(text1) or pd.isna(text2):
        return None
    # Cast to str so non-string cells (e.g. numbers) do not break .strip()
    text1, text2 = str(text1), str(text2)
    if text1.strip() == "" or text2.strip() == "":
        return None  # treat blank text as missing
    emb1 = model.encode(text1).reshape(1, -1)
    emb2 = model.encode(text2).reshape(1, -1)
    return cosine_similarity(emb1, emb2)[0][0]
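
Encoding one pair per call can be slow on a large results file. As a possible alternative, the sketch below encodes each column in a single batch and scores matching rows. The helper names `rowwise_cosine` and `batched_similarity` are hypothetical, and `encode` is assumed to be any callable that maps a list of strings to an `(n, d)` array, such as `model.encode`.

```python
import numpy as np


def rowwise_cosine(emb_a, emb_b):
    """Cosine similarity between matching rows of two (n, d) arrays."""
    emb_a = np.asarray(emb_a, dtype=float)
    emb_b = np.asarray(emb_b, dtype=float)
    num = np.sum(emb_a * emb_b, axis=1)
    denom = np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
    return num / denom


def batched_similarity(texts_a, texts_b, encode):
    """Score each (a, b) pair; pairs with a missing or blank text get None.

    texts_a and texts_b are plain lists of equal length.
    """
    def usable(text):
        return isinstance(text, str) and text.strip() != ""

    # Indices of the pairs where both texts can be embedded
    keep = [i for i, (a, b) in enumerate(zip(texts_a, texts_b)) if usable(a) and usable(b)]
    scores = [None] * len(texts_a)
    if keep:
        emb_a = encode([texts_a[i] for i in keep])
        emb_b = encode([texts_b[i] for i in keep])
        for i, score in zip(keep, rowwise_cosine(emb_a, emb_b)):
            scores[i] = float(score)
    return scores
```

With the notebook's objects this could be called as `batched_similarity(df["question"].tolist(), df["context"].tolist(), model.encode)`, where `df` stands for the loaded results table.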

## 2. Load the RAG results

In [5]:
rag_results_df = pd.read_csv("data/results/trulens_results_2025-03-15_21-55-15.csv")
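
The similarity cells below read the columns `question`, `true_answer`, `context`, and `RAG_generated_answer`. A small check such as the following sketch can surface a mismatched input file before any embeddings are computed; the helper name `missing_columns` is hypothetical.

```python
import pandas as pd

# Columns the similarity computation reads from the results file
REQUIRED_COLUMNS = ["question", "true_answer", "context", "RAG_generated_answer"]


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df, in order."""
    return [col for col in required if col not in df.columns]


# Example with a small in-memory frame that lacks two of the columns
example_df = pd.DataFrame({"question": ["q"], "context": ["c"]})
print(missing_columns(example_df))
```

An empty list means every required column is present.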

## 3. Compute Cosine Similarity

In [6]:
# Answer Relevance (RAG_generated_answer vs. true_answer)
rag_results_df["cosine_Answer_Relevance"] = rag_results_df.apply(lambda row: compute_cosine_similarity(row["RAG_generated_answer"], row["true_answer"]), axis=1)

# Groundedness (RAG_generated_answer vs. context)
rag_results_df["cosine_Groundedness"] = rag_results_df.apply(lambda row: compute_cosine_similarity(row["RAG_generated_answer"], row["context"]), axis=1)

# Context Relevance (question vs. context)
rag_results_df["cosine_Context_Relevance"] = rag_results_df.apply(lambda row: compute_cosine_similarity(row["question"], row["context"]), axis=1)

# Answer Accuracy (RAG_generated_answer vs. true_answer)
# Same text pair as cosine_Answer_Relevance, so reuse that column instead of re-encoding
rag_results_df["similarity_to_ground_truth"] = rag_results_df["cosine_Answer_Relevance"]

In [7]:
rag_results_df.head()

Unnamed: 0,question,true_answer,context,RAG_generated_answer,trulens_Answer_Relevance,trulens_Groundedness,trulens_Context_Relevance,cosine_Answer_Relevance,cosine_Groundedness,cosine_Context_Relevance,similarity_to_ground_truth
0,"Hi, \nI’m following this tutorial: The LSST S...",Quick comment on the code: \n \n \n \n petarz...,Draft\nLVV-P106: Data Management Acceptance Te...,```go\n\nAnswer:\n\nThe behavior you observe i...,0.0,0.0,0.5,0.159147,0.314478,0.602906,0.159147
1,I have the following C++ class : \n class CcdI...,After several iteration with @ktl and @rowe...,"In most cases, the SWIG files from the current...",\nAnswer:\nTo make the CcdImageList iterable i...,0.666667,0.0,0.5,0.545343,0.417874,0.446407,0.545343
2,Question on how forced photometry will be run ...,I take this to mean that a DIASource which is ...,DPDD | LSE-163 | Latest Revision 2023-07-10\n1...,\nAnswer: Forced photometry measurements with ...,1.0,0.555556,0.666667,0.810223,0.661697,0.751487,0.810223
3,"Hi there, \n Is there some way I find out what...",Hi James \nmaybe \n dafButler.Butler.get_known...,3 Overview\nThe Butler is implemented as thr...,\nAnswer:\nThe 'butler' object created in line...,1.0,0.5,0.5,0.446853,0.696487,0.606108,0.446853
4,I’m having trouble building FFTW with texinfo ...,This has now been fixed and 3.3.4 is the curre...,LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use ...,\nAnswer: The known issue with Texinfo and FFT...,,,,0.297717,0.279426,0.356063,0.297717
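
Since the table holds both `trulens_*` and `cosine_*` scores for the same three metrics, one way to compare them is a per-metric Pearson correlation. This is a sketch only and is not part of the original analysis; the helper name `score_agreement` is hypothetical, and it assumes the column naming shown in the output above.

```python
import pandas as pd


def score_agreement(df, metrics=("Answer_Relevance", "Groundedness", "Context_Relevance")):
    """Pearson correlation between trulens_<metric> and cosine_<metric> columns.

    Series.corr skips rows where either score is missing.
    """
    return {m: df[f"trulens_{m}"].corr(df[f"cosine_{m}"]) for m in metrics}
```

A correlation computed on only a handful of rows is not informative, so this is best run on the full results table.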


## 4. Save the results

In [8]:
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")
file_name = f"data/results/cosine_similarity_results_{timestamp}.csv"
rag_results_df.to_csv(file_name, index=False)