# DeepEval Evaluation for RAG Responses

This notebook evaluates RAG responses with DeepEval. It takes astrophysics-related questions, their true answers, and the RAG-generated answers, and scores the quality of each response with DeepEval metrics.

## 1. Import libraries and dependencies

In [9]:
# !pip install deepeval
# !pip install ragas

In [10]:
import os
import pandas as pd
from datetime import datetime

from deepeval.test_case import LLMTestCase
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import FaithfulnessMetric
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.metrics import ContextualRecallMetric
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.metrics import BiasMetric
from deepeval.metrics import ToxicityMetric
from deepeval.metrics import HallucinationMetric

from deepeval.metrics.ragas import RagasMetric
from deepeval.metrics.ragas import RAGASAnswerRelevancyMetric
from deepeval.metrics.ragas import RAGASFaithfulnessMetric
from deepeval.metrics.ragas import RAGASContextualRecallMetric
from deepeval.metrics.ragas import RAGASContextualPrecisionMetric

## 2. Load the RAG results

In [11]:
# load the dataset that has the question, true answer and RAG generated answer
trulens_cosinesimilarity_results_df = pd.read_csv("data/results/cosine_similarity_results_2025-03-15_22-29-45.csv")
trulens_cosinesimilarity_results_df.head()

Unnamed: 0,question,true_answer,context,RAG_generated_answer,trulens_Answer_Relevance,trulens_Groundedness,trulens_Context_Relevance,cosine_Answer_Relevance,cosine_Groundedness,cosine_Context_Relevance,similarity_to_ground_truth
0,"Hi, \nI’m following this tutorial: The LSST S...",Quick comment on the code: \n \n \n \n petarz...,Draft\nLVV-P106: Data Management Acceptance Te...,```go\n\nAnswer:\n\nThe behavior you observe i...,0.0,0.0,0.5,0.159146,0.314478,0.602906,0.159146
1,I have the following C++ class : \n class CcdI...,After several iteration with @ktl and @rowe...,"In most cases, the SWIG files from the current...",\nAnswer:\nTo make the CcdImageList iterable i...,0.666667,0.0,0.5,0.545343,0.417874,0.446407,0.545343
2,Question on how forced photometry will be run ...,I take this to mean that a DIASource which is ...,DPDD | LSE-163 | Latest Revision 2023-07-10\n1...,\nAnswer: Forced photometry measurements with ...,1.0,0.555556,0.666667,0.810223,0.661697,0.751487,0.810223
3,"Hi there, \n Is there some way I find out what...",Hi James \nmaybe \n dafButler.Butler.get_known...,3 Overview\nThe Butler is implemented as thr...,\nAnswer:\nThe 'butler' object created in line...,1.0,0.5,0.5,0.446853,0.696487,0.606108,0.446853
4,I’m having trouble building FFTW with texinfo ...,This has now been fixed and 3.3.4 is the curre...,LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use ...,\nAnswer: The known issue with Texinfo and FFT...,,,,0.297717,0.279427,0.356063,0.297717


## 3. Get the DeepEval Metrics

First, define the metrics.

In [12]:
threshold07 = 0.7
threshold05 = 0.5
model = "gpt-3.5-turbo"
include_reason = True

In [13]:
# how relevant the RAG's response is compared to the provided question
answer_relevancy_metric = AnswerRelevancyMetric(
    threshold=threshold07,
    model=model,
    include_reason=include_reason
)

# whether the RAG's response factually aligns with the contents of the context retrieved
faithfulness_metric = FaithfulnessMetric(
    threshold=threshold07,
    model=model,
    include_reason=include_reason
)

# checks whether the parts of the retrieved context that are relevant to the question
# are ranked higher than the irrelevant ones
contextual_precision_metric = ContextualPrecisionMetric(
    threshold=threshold07,
    model=model,
    include_reason=include_reason
)

# extent to which the retrieval context aligns with the true answer
contextual_recall_metric = ContextualRecallMetric(
    threshold=threshold07,
    model=model,
    include_reason=include_reason
)

# evaluates the overall relevance of the information presented in retrieval context for a given question
contextual_relevancy_metric = ContextualRelevancyMetric(
    threshold=threshold07,
    model=model,
    include_reason=include_reason
)

# determines whether the RAG output contains gender, racial, or political bias (lower scores are better)
bias_metric = BiasMetric(
    threshold=threshold05,
    model=model,
    include_reason=include_reason
)

# evaluates toxicity in the RAG output (lower scores are better). This is particularly useful for a fine-tuning use case.
toxicity_metric = ToxicityMetric(
    threshold=threshold05,
    model=model,
    include_reason=include_reason
)

# checks whether the RAG's response contradicts the provided context (lower scores are better)
hallucination_metric = HallucinationMetric(
    threshold=threshold05,
    model=model,
    include_reason=include_reason
)

# The RAGAS metric is the average of four distinct metrics:
#   RAGASAnswerRelevancyMetric
#   RAGASFaithfulnessMetric
#   RAGASContextualPrecisionMetric
#   RAGASContextualRecallMetric
# This metric provides a single score to holistically evaluate our RAG pipeline's generator and retriever
RAGAS_metric = RagasMetric(
    threshold=threshold05,
    model=model
)

# RAGAS  - Answer Relevancy Metric - how well the generated answer is semantically relevant to the question
RAGAS_answer_relevancy_metric = RAGASAnswerRelevancyMetric(
    threshold=threshold07,
    model=model
)

# RAGAS  - Faithfulness Metric - if the generated answer is truthful and grounded in the retrieved context
RAGAS_faithfulness_metric = RAGASFaithfulnessMetric(
    threshold=threshold07,
    model=model
)

# RAGAS  - Contextual Precision Metric -  whether the retrieved context contains only relevant information for answering the question
RAGAS_contextual_precision_metric = RAGASContextualPrecisionMetric(
    threshold=threshold07,
    model=model
)

# RAGAS  - Contextual Recall Metric - whether the retrieved context provides enough details to answer the question completely
RAGAS_contextual_recall_metric = RAGASContextualRecallMetric(
    threshold=threshold07,
    model=model
)

metrics=[answer_relevancy_metric, faithfulness_metric, 
         contextual_precision_metric, contextual_recall_metric, contextual_relevancy_metric,
         bias_metric, toxicity_metric, hallucination_metric, 
         RAGAS_answer_relevancy_metric, RAGAS_faithfulness_metric, 
         RAGAS_contextual_precision_metric, RAGAS_contextual_recall_metric, RAGAS_metric]
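
The two thresholds defined above are applied in opposite directions. In the results later in this notebook, Bias, Toxicity, and Hallucination pass when the score is at or below the threshold (a Hallucination score of 1.0 fails, 0.5 passes), while the remaining metrics pass when the score is at or above it. A minimal sketch of that rule, using a hypothetical helper `passes_threshold` that is not part of DeepEval:

```python
# Hypothetical helper, not part of DeepEval: it mirrors the pass/fail
# pattern observed in this notebook's results.
LOWER_IS_BETTER = {"Bias", "Toxicity", "Hallucination"}


def passes_threshold(metric_name: str, score: float, threshold: float) -> bool:
    """Return True when the score is on the passing side of the threshold."""
    if metric_name in LOWER_IS_BETTER:
        return score <= threshold  # threshold acts as a maximum
    return score >= threshold      # threshold acts as a minimum


print(passes_threshold("Answer Relevancy", 0.8333, 0.7))
print(passes_threshold("Hallucination", 1.0, 0.5))
print(passes_threshold("Hallucination", 0.5, 0.5))
```

Keeping the direction in mind matters when comparing columns side by side: a high Faithfulness score and a high Hallucination score point in opposite directions.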

## 4. Perform evaluation on all rows

In [14]:
test_results = []
for idx, row in trulens_cosinesimilarity_results_df.iterrows():
    test_case = LLMTestCase(
        input=row["question"],
        actual_output=row["RAG_generated_answer"],
        expected_output=row["true_answer"],
        retrieval_context=[row["context"]],
        context=[row["context"]]
    )

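    # get the test result; skip this row if evaluation fails so that
    # `results` is never undefined or reused from a previous iteration
    try:
        results = evaluate(test_cases=[test_case], metrics=metrics)
    except Exception as e:
        print(f"Error processing row {idx}: {e}")
        continue

    # iterate through the test results
    for test in results.test_results: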
        test_data = {
            "test_case": test.name,
            "success": test.success,
            "question": test.input,
            "RAG_generated_answer": test.actual_output,
            "true_answer": test.expected_output,
            "context": test.retrieval_context
        }

        # extract metrics
        for metric in test.metrics_data:
            test_data[f"{metric.name}_score"] = metric.score
            test_data[f"{metric.name}_reason"] = metric.reason
            test_data[f"{metric.name}_success"] = metric.success

        # Append the structured test result
        test_results.append(test_data)
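
The inner loop above is what produces column names such as `Answer Relevancy_score` and `Faithfulness (ragas)_success` in the later tables. The sketch below isolates that flattening step so it can be run without calling an LLM. `MetricData` is a hypothetical stand-in for one entry of `test.metrics_data`, carrying only the four attributes the loop reads, and the scores and reasons are illustrative values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MetricData:
    """Hypothetical stand-in for one entry of `test.metrics_data`."""
    name: str
    score: Optional[float]
    reason: Optional[str]
    success: bool


def flatten_metrics(metrics_data):
    """Build the `<name>_score`, `<name>_reason`, `<name>_success` keys."""
    flat = {}
    for metric in metrics_data:
        flat[f"{metric.name}_score"] = metric.score
        flat[f"{metric.name}_reason"] = metric.reason
        flat[f"{metric.name}_success"] = metric.success
    return flat


# Illustrative values only
row = flatten_metrics([
    MetricData("Answer Relevancy", 0.875, "mostly relevant", True),
    MetricData("Faithfulness (ragas)", 0.25, None, False),
])
print(sorted(row))
```

Because the metric name becomes part of the key, two metrics with the same name would overwrite each other. The RAGAS variants avoid this here because their names carry a `(ragas)` suffix.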

Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:23, 23.80s/test case]



Metrics Summary

  - ✅ Answer Relevancy (score: 0.8333333333333334, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.83 because the statement talks about ensuring results are not affected by processing order, not about why results differ each time., error: None)
  - ✅ Faithfulness (score: 0.875, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.88 because there is a contradiction regarding the guaranteed minimum sky coverage in the actual output., error: None)
  - ✅ Contextual Precision (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 1.00 because the retrieval context directly addresses Petar's issue by providing a solution to understand the varying results., error: None)
  - ✅ Contextual Recall (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 1.00 because all sentences in the expected output directly relate to s




Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:17, 17.99s/test case]



Metrics Summary

  - ❌ Answer Relevancy (score: 0.6666666666666666, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.67 because while the actual output addresses the transmission of Swig object to Python, it fails to provide a solution on how to make the CcdImageList iterable in Python, which is a crucial aspect of the input., error: None)
  - ❌ Faithfulness (score: 0.6666666666666666, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.67 because the actual output contains incorrect information about using auto& in the iteration process, which was not mentioned in the retrieval context., error: None)
  - ❌ Contextual Precision (score: 0.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.00 because the only retrieval context present did not provide relevant information for modifying the SWIG interface for iteration in Python., error: None)
  - ❌ Contextual Recall (score: 0.




Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:23, 23.07s/test case]



Metrics Summary

  - ✅ Answer Relevancy (score: 0.875, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.88 because the statement about the availability timescale depending on resources is not directly addressing where data with S/N < 5 will be stored, which is relevant to the input, but overall the answer is highly relevant., error: None)
  - ❌ Faithfulness (score: 0.3333333333333333, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.33 because the actual output contains contradictions such as Alerts being issued with precovery photometry information, PPDB data being publicly available, forced photometry data availability being dependent on computational resources, and the system being able to detect sources below the nominal threshold without additional criteria., error: None)
  - ✅ Contextual Precision (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 1.00 beca




Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Exception raised in Job[0]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)


Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:45, 45.24s/test case]



Metrics Summary

  - ✅ Answer Relevancy (score: 0.875, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.88 because the statement is directly addressing the availability of 'dp85' for the 'instrument' parameter, hence it is relevant., error: None)
  - ❌ Faithfulness (score: 0.375, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.38 because the 'actual output' contains several contradictions such as the mention of unmentioned classes like 'dafButler' and strategies like creating multiple butler objects, which are not in line with the 'retrieval context'., error: None)
  - ❌ Contextual Precision (score: 0.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.00 because the retrieval context provided does not directly align with the specific method dafButler.Butler.get_known_repos() mentioned in the input., error: None)
  - ❌ Contextual Recall (score: 0.42857142857142855, thres




Event loop is already running. Applying nest_asyncio patch to allow async execution...


Evaluating 1 test case(s) in parallel: |          |  0% (0/1) [Time Taken: 00:00, ?test case/s]

Evaluating:   0%|          | 0/1 [00:00<?, ?it/s]

Evaluating 1 test case(s) in parallel: |██████████|100% (1/1) [Time Taken: 00:24, 24.95s/test case]



Metrics Summary

  - ✅ Answer Relevancy (score: 0.9, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.90 because the irrelevant statements include information about software installation methods, which do not directly address the issue with texinfo and FFTW mentioned in the input., error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: Great job! There are no contradictions found in the actual output., error: None)
  - ❌ Contextual Precision (score: 0.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.00 because irrelevant nodes with 'no' verdicts are not providing information related to the fix of FFTW or the update to version 3.3.4., error: None)
  - ❌ Contextual Recall (score: 0.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.00 because the sentence does not correspond to any part of the retrieval conte
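
One of the runs above logged an `LLMDidNotFinishException`, and a metric that fails this way can leave a row without a score. Below is a minimal sketch for spotting such rows before saving, assuming the flat dictionaries collected in `test_results`. The helper name `rows_with_missing_scores` and the example data are hypothetical.

```python
import math


def rows_with_missing_scores(test_results):
    """Map each row index to the `*_score` keys whose value is None or NaN."""
    missing = {}
    for idx, row in enumerate(test_results):
        bad = [
            key for key, value in row.items()
            if key.endswith("_score")
            and (value is None or (isinstance(value, float) and math.isnan(value)))
        ]
        if bad:
            missing[idx] = bad
    return missing


# Illustrative data only; in the notebook, pass the `test_results` list.
example = [
    {"question": "q0", "RAGAS_score": 0.13, "Faithfulness_score": 0.875},
    {"question": "q1", "RAGAS_score": None, "Faithfulness_score": 0.375},
]
print(rows_with_missing_scores(example))
```

Rows reported this way are candidates for a re-run, possibly with a larger token limit as the exception message suggests.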




In [15]:
# save the results to a dataframe
deepeval_results_df = pd.DataFrame(test_results)

# merge the results with the trulens and cosine similarity results
rag_results_df = trulens_cosinesimilarity_results_df.merge(deepeval_results_df, 
                                                           on="question", 
                                                           how="inner")
rag_results_df.drop(columns=["test_case", "success", 
                             "true_answer_x", "context_x", "RAG_generated_answer_x",
                             "Answer Relevancy (ragas)_reason", "Faithfulness (ragas)_reason",
                             'Contextual Precision (ragas)_reason', 'Contextual Recall (ragas)_reason'], inplace=True)

rag_results_df.rename(columns={"true_answer_y": "true_answer", 
                               "context_y": "context", 
                               "RAG_generated_answer_y": "RAG_generated_answer"}, inplace=True)
rag_results_df.head()

Unnamed: 0,question,trulens_Answer_Relevance,trulens_Groundedness,trulens_Context_Relevance,cosine_Answer_Relevance,cosine_Groundedness,cosine_Context_Relevance,similarity_to_ground_truth,RAG_generated_answer,true_answer,...,Answer Relevancy (ragas)_success,Faithfulness (ragas)_score,Faithfulness (ragas)_success,Contextual Precision (ragas)_score,Contextual Precision (ragas)_success,Contextual Recall (ragas)_score,Contextual Recall (ragas)_success,RAGAS_score,RAGAS_reason,RAGAS_success
0,"Hi, \nI’m following this tutorial: The LSST S...",0.0,0.0,0.5,0.159146,0.314478,0.602906,0.159146,```go\n\nAnswer:\n\nThe behavior you observe i...,Quick comment on the code: \n \n \n \n petarz...,...,False,0.4,False,0.0,False,0.0,False,0.133333,,False
1,I have the following C++ class : \n class CcdI...,0.666667,0.0,0.5,0.545343,0.417874,0.446407,0.545343,\nAnswer:\nTo make the CcdImageList iterable i...,After several iteration with @ktl and @rowe...,...,True,0.0,False,1.0,True,0.333333,False,0.504217,,True
2,Question on how forced photometry will be run ...,1.0,0.555556,0.666667,0.810223,0.661697,0.751487,0.810223,\nAnswer: Forced photometry measurements with ...,I take this to mean that a DIASource which is ...,...,True,0.0,False,1.0,True,0.833333,True,0.516458,,True
3,"Hi there, \n Is there some way I find out what...",1.0,0.5,0.5,0.446853,0.696487,0.606108,0.446853,\nAnswer:\nThe 'butler' object created in line...,Hi James \nmaybe \n dafButler.Butler.get_known...,...,True,0.25,False,1.0,True,0.0,False,,,False
4,I’m having trouble building FFTW with texinfo ...,,,,0.297717,0.279427,0.356063,0.297717,\nAnswer: The known issue with Texinfo and FFT...,This has now been fixed and 3.3.4 is the curre...,...,True,0.0,False,0.0,False,0.0,False,0.178994,,False
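
The composite `RAGAS_score` is described earlier as the average of the four RAGAS metrics, yet in the rows shown it does not equal the plain mean of the individually reported scores. For row 1, the components are 0.854417 (answer relevancy), 0.0, 1.0, and 0.333333, against a composite of 0.504217. One possible explanation, stated here only as an assumption, is that the composite metric makes its own LLM calls rather than reusing the individual results. The sketch below, with a hypothetical helper `component_mean`, shows how to check the gap for a row.

```python
from statistics import mean


def component_mean(scores):
    """Average the component scores, ignoring missing (None) values."""
    present = [s for s in scores if s is not None]
    return mean(present) if present else None


# Row 1 of the results: answer relevancy, faithfulness,
# contextual precision, contextual recall (all RAGAS variants)
row1_components = [0.854417, 0.0, 1.0, 0.333333]
row1_composite = 0.504217

avg = component_mean(row1_components)
print(f"mean of components: {avg:.4f}")
print(f"reported composite: {row1_composite:.4f}")
print(f"difference:         {avg - row1_composite:+.4f}")
```

With LLM-judged metrics, some run-to-run variation is expected, so small gaps are not necessarily a sign of a bug.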


## 5. Save the dataframe for reference

In [16]:
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

In [17]:
answer_relevancy_metric_cols = ['question', 'context', 'RAG_generated_answer', 'true_answer', 
                                'trulens_Answer_Relevance', 'cosine_Answer_Relevance', 
                                'Answer Relevancy (ragas)_score', 'Answer Relevancy (ragas)_success',
                                'Answer Relevancy_score', 'Answer Relevancy_reason', 'Answer Relevancy_success']
# rename() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
answer_relevancy_rag_results_df = rag_results_df[answer_relevancy_metric_cols].rename(
    columns={"trulens_Answer_Relevance": "TruLens",
             "cosine_Answer_Relevance": "Cosine_Similarity",
             "Answer Relevancy (ragas)_score": "RAGAS_score",
             "Answer Relevancy (ragas)_success": "is_RAGAS_threshold_success",
             "Answer Relevancy_score": "DeepEval_score",
             "Answer Relevancy_reason": "DeepEval_reason",
             "Answer Relevancy_success": "is_DeepEval_threshold_success"})

filename = f"data/results/full/RAG_results_answer_relevancy_{timestamp}.csv"
answer_relevancy_rag_results_df.to_csv(filename, index=False)
answer_relevancy_rag_results_df.head()

Unnamed: 0,question,context,RAG_generated_answer,true_answer,TruLens,Cosine_Similarity,RAGAS_score,is_RAGAS_threshold_success,DeepEval_score,DeepEval_reason,is_DeepEval_threshold_success
0,"Hi, \nI’m following this tutorial: The LSST S...",[Draft\nLVV-P106: Data Management Acceptance T...,```go\n\nAnswer:\n\nThe behavior you observe i...,Quick comment on the code: \n \n \n \n petarz...,0.0,0.159146,0.0,False,0.833333,The score is 0.83 because the statement talks ...,True
1,I have the following C++ class : \n class CcdI...,"[In most cases, the SWIG files from the curren...",\nAnswer:\nTo make the CcdImageList iterable i...,After several iteration with @ktl and @rowe...,0.666667,0.545343,0.854417,True,0.666667,The score is 0.67 because while the actual out...,False
2,Question on how forced photometry will be run ...,[DPDD | LSE-163 | Latest Revision 2023-07-10\n...,\nAnswer: Forced photometry measurements with ...,I take this to mean that a DIASource which is ...,1.0,0.810223,0.871607,True,0.875,The score is 0.88 because the statement about ...,True
3,"Hi there, \n Is there some way I find out what...",[3 Overview\nThe Butler is implemented as th...,\nAnswer:\nThe 'butler' object created in line...,Hi James \nmaybe \n dafButler.Butler.get_known...,1.0,0.446853,0.770608,True,0.875,The score is 0.88 because the statement is dir...,True
4,I’m having trouble building FFTW with texinfo ...,[LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use...,\nAnswer: The known issue with Texinfo and FFT...,This has now been fixed and 3.3.4 is the curre...,,0.297717,0.894968,True,0.9,The score is 0.90 because the irrelevant state...,True


In [18]:
groundedness_metric_cols = ['question', 'context', 'RAG_generated_answer', 'true_answer', 
                           'trulens_Groundedness', 'cosine_Groundedness', 
                           'Faithfulness_score', 'Faithfulness_reason','Faithfulness_success',
                           'Faithfulness (ragas)_score', 'Faithfulness (ragas)_success',]
# rename() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
groundedness_metric_rag_results_df = rag_results_df[groundedness_metric_cols].rename(
    columns={"trulens_Groundedness": "TruLens",
             "cosine_Groundedness": "Cosine_Similarity",
             "Faithfulness_score": "DeepEval_score",
             "Faithfulness_reason": "DeepEval_reason",
             "Faithfulness_success": "is_DeepEval_threshold_success",
             "Faithfulness (ragas)_score": "RAGAS_score",
             "Faithfulness (ragas)_success": "is_RAGAS_threshold_success"})

filename = f"data/results/full/RAG_results_groundedness_{timestamp}.csv"
groundedness_metric_rag_results_df.to_csv(filename, index=False)
groundedness_metric_rag_results_df.head()

Unnamed: 0,question,context,RAG_generated_answer,true_answer,TruLens,Cosine_Similarity,DeepEval_score,DeepEval_reason,is_DeepEval_threshold_success,RAGAS_score,is_RAGAS_threshold_success
0,"Hi, \nI’m following this tutorial: The LSST S...",[Draft\nLVV-P106: Data Management Acceptance T...,```go\n\nAnswer:\n\nThe behavior you observe i...,Quick comment on the code: \n \n \n \n petarz...,0.0,0.314478,0.875,The score is 0.88 because there is a contradic...,True,0.4,False
1,I have the following C++ class : \n class CcdI...,"[In most cases, the SWIG files from the curren...",\nAnswer:\nTo make the CcdImageList iterable i...,After several iteration with @ktl and @rowe...,0.0,0.417874,0.666667,The score is 0.67 because the actual output co...,False,0.0,False
2,Question on how forced photometry will be run ...,[DPDD | LSE-163 | Latest Revision 2023-07-10\n...,\nAnswer: Forced photometry measurements with ...,I take this to mean that a DIASource which is ...,0.555556,0.661697,0.333333,The score is 0.33 because the actual output co...,False,0.0,False
3,"Hi there, \n Is there some way I find out what...",[3 Overview\nThe Butler is implemented as th...,\nAnswer:\nThe 'butler' object created in line...,Hi James \nmaybe \n dafButler.Butler.get_known...,0.5,0.696487,0.375,The score is 0.38 because the 'actual output' ...,False,0.25,False
4,I’m having trouble building FFTW with texinfo ...,[LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use...,\nAnswer: The known issue with Texinfo and FFT...,This has now been fixed and 3.3.4 is the curre...,,0.279427,1.0,Great job! There are no contradictions found i...,True,0.0,False


In [19]:
contextual_metric_cols = ['question', 'context', 'RAG_generated_answer', 'true_answer', 
                           'trulens_Context_Relevance',  'cosine_Context_Relevance',
                           'Contextual Relevancy_score', 'Contextual Relevancy_reason', 'Contextual Relevancy_success',
                           'Contextual Precision (ragas)_score', 'Contextual Precision (ragas)_success',
                           'Contextual Precision_score', 'Contextual Precision_reason', 'Contextual Precision_success',
                           'Contextual Recall_score', 'Contextual Recall_reason', 'Contextual Recall_success', 
                           'Contextual Recall (ragas)_score', 'Contextual Recall (ragas)_success']
# rename() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
contextual_metric_rag_results_df = rag_results_df[contextual_metric_cols].rename(
    columns={"trulens_Context_Relevance": "TruLens_Context_Relevance",
             "cosine_Context_Relevance": "Cosine_Similarity_Context_Relevance"})

filename = f"data/results/full/RAG_results_contextual_metrics_{timestamp}.csv"
contextual_metric_rag_results_df.to_csv(filename, index=False)
contextual_metric_rag_results_df.head()

Unnamed: 0,question,context,RAG_generated_answer,true_answer,TruLens_Context_Relevance,Cosine_Similarity_Context_Relevance,Contextual Relevancy_score,Contextual Relevancy_reason,Contextual Relevancy_success,Contextual Precision (ragas)_score,Contextual Precision (ragas)_success,Contextual Precision_score,Contextual Precision_reason,Contextual Precision_success,Contextual Recall_score,Contextual Recall_reason,Contextual Recall_success,Contextual Recall (ragas)_score,Contextual Recall (ragas)_success
0,"Hi, \nI’m following this tutorial: The LSST S...",[Draft\nLVV-P106: Data Management Acceptance T...,```go\n\nAnswer:\n\nThe behavior you observe i...,Quick comment on the code: \n \n \n \n petarz...,0.5,0.602906,0.0,The score is 0.00 because the input provided d...,False,0.0,False,1.0,The score is 1.00 because the retrieval contex...,True,1.0,The score is 1.00 because all sentences in the...,True,0.0,False
1,I have the following C++ class : \n class CcdI...,"[In most cases, the SWIG files from the curren...",\nAnswer:\nTo make the CcdImageList iterable i...,After several iteration with @ktl and @rowe...,0.5,0.446407,0.333333,The score is 0.33 because the high-level infor...,False,1.0,True,0.0,The score is 0.00 because the only retrieval c...,False,0.2,The score is 0.20 because the supportive reaso...,False,0.333333,False
2,Question on how forced photometry will be run ...,[DPDD | LSE-163 | Latest Revision 2023-07-10\n...,\nAnswer: Forced photometry measurements with ...,I take this to mean that a DIASource which is ...,0.666667,0.751487,0.75,The score is 0.75 because the input addresses ...,True,1.0,True,1.0,The score is 1.00 because all relevant retriev...,True,0.833333,The score is 0.83 because the majority of the ...,True,0.833333,True
3,"Hi there, \n Is there some way I find out what...",[3 Overview\nThe Butler is implemented as th...,\nAnswer:\nThe 'butler' object created in line...,Hi James \nmaybe \n dafButler.Butler.get_known...,0.5,0.606108,0.0,The score is 0.00 because the input is focused...,False,1.0,True,0.0,The score is 0.00 because the retrieval contex...,False,0.428571,The score is 0.43 because the sentence in the ...,False,0.0,False
4,I’m having trouble building FFTW with texinfo ...,[LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use...,\nAnswer: The known issue with Texinfo and FFT...,This has now been fixed and 3.3.4 is the curre...,,0.356063,0.333333,The score is 0.33 because while there is relev...,False,0.0,False,0.0,The score is 0.00 because irrelevant nodes wit...,False,0.0,The score is 0.00 because the sentence does no...,False,0.0,False


In [20]:
cosine_similarity_ground_truth_cols = ['question', 'context', 'RAG_generated_answer', 'true_answer', 'similarity_to_ground_truth']
# the original rename key had stray quotes, so the rename silently did nothing;
# rename() returns a new DataFrame, which also avoids pandas' SettingWithCopyWarning
cosine_similarity_ground_truth_df = rag_results_df[cosine_similarity_ground_truth_cols].rename(
    columns={"similarity_to_ground_truth": "cosine_similarity_true_answer_ground_truth"})

filename = f"data/results/full/RAG_results_cosine_similarity_ground_truth_{timestamp}.csv"
cosine_similarity_ground_truth_df.to_csv(filename, index=False)
cosine_similarity_ground_truth_df.head()

Unnamed: 0,question,context,RAG_generated_answer,true_answer,cosine_similarity_true_answer_ground_truth
0,"Hi, \nI’m following this tutorial: The LSST S...",[Draft\nLVV-P106: Data Management Acceptance T...,```go\n\nAnswer:\n\nThe behavior you observe i...,Quick comment on the code: \n \n \n \n petarz...,0.159146
1,I have the following C++ class : \n class CcdI...,"[In most cases, the SWIG files from the curren...",\nAnswer:\nTo make the CcdImageList iterable i...,After several iteration with @ktl and @rowe...,0.545343
2,Question on how forced photometry will be run ...,[DPDD | LSE-163 | Latest Revision 2023-07-10\n...,\nAnswer: Forced photometry measurements with ...,I take this to mean that a DIASource which is ...,0.810223
3,"Hi there, \n Is there some way I find out what...",[3 Overview\nThe Butler is implemented as th...,\nAnswer:\nThe 'butler' object created in line...,Hi James \nmaybe \n dafButler.Butler.get_known...,0.446853
4,I’m having trouble building FFTW with texinfo ...,[LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use...,\nAnswer: The known issue with Texinfo and FFT...,This has now been fixed and 3.3.4 is the curre...,0.297717


In [21]:
other_deepeval_metric_cols = ['question', 'context', 'RAG_generated_answer', 'true_answer', 
                            'Bias_score', 'Bias_reason', 'Bias_success', 
                            'Toxicity_score', 'Toxicity_reason', 'Toxicity_success', 
                            'Hallucination_score', 'Hallucination_reason', 'Hallucination_success']
other_deepeval_metric_rag_results_df = rag_results_df[other_deepeval_metric_cols]

filename = f"data/results/full/RAG_results_other_deepeval_metrics_{timestamp}.csv"
other_deepeval_metric_rag_results_df.to_csv(filename, index=False)
other_deepeval_metric_rag_results_df.head()

Unnamed: 0,question,context,RAG_generated_answer,true_answer,Bias_score,Bias_reason,Bias_success,Toxicity_score,Toxicity_reason,Toxicity_success,Hallucination_score,Hallucination_reason,Hallucination_success
0,"Hi, \nI’m following this tutorial: The LSST S...",[Draft\nLVV-P106: Data Management Acceptance T...,```go\n\nAnswer:\n\nThe behavior you observe i...,Quick comment on the code: \n \n \n \n petarz...,0.0,The score is 0.00 because there are no reasons...,True,0.0,The score is 0.00 because there are no reasons...,True,1.0,The score is 1.00 because there are no factual...,False
1,I have the following C++ class : \n class CcdI...,"[In most cases, the SWIG files from the curren...",\nAnswer:\nTo make the CcdImageList iterable i...,After several iteration with @ktl and @rowe...,0.0,The score is 0.00 because there are no reasons...,True,0.0,The score is 0.00 because there are no reasons...,True,1.0,The score is 1.00 because there are contradict...,False
2,Question on how forced photometry will be run ...,[DPDD | LSE-163 | Latest Revision 2023-07-10\n...,\nAnswer: Forced photometry measurements with ...,I take this to mean that a DIASource which is ...,0.0,The score is 0.00 because there are no reasons...,True,0.0,The score is 0.00 because there are no reasons...,True,0.5,The score is 0.50 because the actual output al...,True
3,"Hi there, \n Is there some way I find out what...",[3 Overview\nThe Butler is implemented as th...,\nAnswer:\nThe 'butler' object created in line...,Hi James \nmaybe \n dafButler.Butler.get_known...,0.0,The score is 0.00 because there are no reasons...,True,0.0,The score is 0.00 because there are no reasons...,True,1.0,The score is 1.00 because the actual output do...,False
4,I’m having trouble building FFTW with texinfo ...,[LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use...,\nAnswer: The known issue with Texinfo and FFT...,This has now been fixed and 3.3.4 is the curre...,0.0,The score is 0.00 because there are no reasons...,True,0.0,The score is 0.00 because there are no instanc...,True,1.0,The score is 1.00 because the actual output gr...,False
