# DeepEval Evaluation for RAG Responses

This notebook evaluates RAG responses with DeepEval. It takes the astrophysics-related questions, their true answers, and the RAG-generated answers, and scores the quality of each response.

## 1. Import libraries and dependencies

In [2]:
# !pip install deepeval
# !pip install ragas

In [3]:
import os
import pandas as pd
from datetime import datetime

from deepeval.test_case import LLMTestCase
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.metrics import FaithfulnessMetric
from deepeval.metrics import ContextualPrecisionMetric
from deepeval.metrics import ContextualRecallMetric
from deepeval.metrics import ContextualRelevancyMetric
from deepeval.metrics import BiasMetric
from deepeval.metrics import ToxicityMetric
from deepeval.metrics import HallucinationMetric

from deepeval.metrics.ragas import RagasMetric
from deepeval.metrics.ragas import RAGASAnswerRelevancyMetric
from deepeval.metrics.ragas import RAGASFaithfulnessMetric
from deepeval.metrics.ragas import RAGASContextualRecallMetric
from deepeval.metrics.ragas import RAGASContextualPrecisionMetric

## 2. Load the RAG results

In [5]:
# load the dataset that has the question, true answer and RAG generated answer
trulens_cosinesimilarity_results_df = pd.read_csv("data/results/cosine_similarity_results_2025-03-12_22-55-53.csv")
trulens_cosinesimilarity_results_df.head()

Unnamed: 0,app_id,question,true_answer,context,RAG_generated_answer,trulens_Answer_Relevance,trulens_Groundedness,trulens_Context_Relevance,cosine_Answer_Relevance,cosine_Groundedness,cosine_Context_Relevance,similarity_to_ground_truth
0,app_hash_249cdbf218ae5cad989db0fbaf09493b,"Hi, \nI’m following this tutorial: The LSST S...",Quick comment on the code: \n \n \n \n petarz...,Draft\nLVV-P106: Data Management Acceptance Te...,\nAnswer: Hi Petar! This behavior is expected ...,1.0,0.121212,0.333333,0.130791,0.266586,0.602906,0.130791
1,app_hash_249cdbf218ae5cad989db0fbaf09493b,I have the following C++ class : \n class CcdI...,After several iteration with @ktl and @rowe...,"In most cases, the SWIG files from the current...",\nAnswer:\nTo make `CcdImageList` iterable in ...,1.0,0.0,0.5,0.593059,0.469574,0.455618,0.593059
2,app_hash_249cdbf218ae5cad989db0fbaf09493b,Question on how forced photometry will be run ...,I take this to mean that a DIASource which is ...,DPDD | LSE-163 | Latest Revision 2023-07-10\n1...,\nAnswer: The system will store DIAForcedSourc...,1.0,0.111111,0.666667,0.772692,0.690596,0.751487,0.772692
3,app_hash_249cdbf218ae5cad989db0fbaf09493b,"Hi there, \n Is there some way I find out what...",Hi James \nmaybe \n dafButler.Butler.get_known...,3   Overview\nThe Butler is implemented as thr...,\nAnswer: To list the available values for X w...,1.0,0.0,0.333333,0.559476,0.547306,0.606108,0.559476
4,app_hash_249cdbf218ae5cad989db0fbaf09493b,I’m having trouble building FFTW with texinfo ...,This has now been fixed and 3.3.4 is the curre...,LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use ...,\nAnswer: The known issue you are experiencing...,,,,0.26236,0.319938,0.356063,0.26236
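
The preview above already shows missing values in the TruLens columns for row 4, so it may be worth checking the text columns before spending LLM calls on them. The sketch below is one way to do that; `check_inputs` and `REQUIRED_COLUMNS` are hypothetical names, and the toy frame only stands in for the real CSV.

```python
import pandas as pd

# Columns the evaluation loop reads from each row (names taken from the CSV above).
REQUIRED_COLUMNS = ["question", "true_answer", "context", "RAG_generated_answer"]


def check_inputs(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows that can form a full test case; raise if a column is missing."""
    missing = [col for col in REQUIRED_COLUMNS if col not in df.columns]
    if missing:
        raise KeyError(f"Missing required columns: {missing}")
    # Rows with a NaN in any text field cannot be turned into a complete test case.
    return df.dropna(subset=REQUIRED_COLUMNS)


# Toy frame standing in for the real results file (assumption, not real data).
toy_df = pd.DataFrame({
    "question": ["q1", "q2"],
    "true_answer": ["a1", None],
    "context": ["c1", "c2"],
    "RAG_generated_answer": ["r1", "r2"],
})
usable_df = check_inputs(toy_df)
print(len(usable_df))  # 1
```

Dropping incomplete rows is only one option; logging them and evaluating with a reduced metric set would be another.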


## 3. Get the DeepEval Metrics

First, define the shared settings (thresholds, evaluation model) and the metrics.

In [6]:
threshold07 = 0.7
threshold05 = 0.5
model = "gpt-3.5-turbo"
include_reason = True

In [7]:
# how relevant the RAG's response is compared to the provided question
answer_relevancy_metric = AnswerRelevancyMetric(
    threshold=threshold07,
    model=model,
    include_reason=include_reason
)

# whether the RAG's response factually aligns with the contents of the context retrieved
faithfulness_metric = FaithfulnessMetric(
    threshold=threshold07,
    model=model,
    include_reason=include_reason
)

# whether relevant nodes in the retrieved context are ranked higher than irrelevant ones,
# judged against the question and the true answer
contextual_precision_metric = ContextualPrecisionMetric(
    threshold=threshold07,
    model=model,
    include_reason=include_reason
)

# extent to which the retrieval context aligns with the true answer
contextual_recall_metric = ContextualRecallMetric(
    threshold=threshold07,
    model=model,
    include_reason=include_reason
)

# evaluates the overall relevance of the information presented in retrieval context for a given question
contextual_relevancy_metric = ContextualRelevancyMetric(
    threshold=threshold07,
    model=model,
    include_reason=include_reason
)

# determines whether the RAG output contains gender, racial, or political bias
bias_metric = BiasMetric(
    threshold=threshold05,
    model=model,
    include_reason=include_reason
)

# evaluates toxicity in the RAG output; particularly useful for fine-tuning use cases
toxicity_metric = ToxicityMetric(
    threshold=threshold05,
    model=model,
    include_reason=include_reason
)

# whether the RAG's response contradicts the supplied context (the `context` field of the test case)
hallucination_metric = HallucinationMetric(
    threshold=threshold05,
    model=model,
    include_reason=include_reason
)

# The RAGAS metric is the average of four distinct metrics:
#   RAGASAnswerRelevancyMetric
#   RAGASFaithfulnessMetric
#   RAGASContextualPrecisionMetric
#   RAGASContextualRecallMetric
# This metric provides a single score to holistically evaluate our RAG pipeline's generator and retriever
RAGAS_metric = RagasMetric(
    threshold=threshold05,
    model=model
)

# RAGAS  - Answer Relevancy Metric - how well the generated answer is semantically relevant to the question
RAGAS_answer_relevancy_metric = RAGASAnswerRelevancyMetric(
    threshold=threshold07,
    model=model
)

# RAGAS  - Faithfulness Metric - if the generated answer is truthful and grounded in the retrieved context
RAGAS_faithfulness_metric = RAGASFaithfulnessMetric(
    threshold=threshold07,
    model=model
)

# RAGAS  - Contextual Precision Metric -  whether the retrieved context contains only relevant information for answering the question
RAGAS_contextual_precision_metric = RAGASContextualPrecisionMetric(
    threshold=threshold07,
    model=model
)

# RAGAS  - Contextual Recall Metric - whether the retrieved context provides enough details to answer the question completely
RAGAS_contextual_recall_metric = RAGASContextualRecallMetric(
    threshold=threshold07,
    model=model
)

metrics=[answer_relevancy_metric, faithfulness_metric, 
         contextual_precision_metric, contextual_recall_metric, contextual_relevancy_metric,
         bias_metric, toxicity_metric, hallucination_metric, 
         RAGAS_answer_relevancy_metric, RAGAS_faithfulness_metric, 
         RAGAS_contextual_precision_metric, RAGAS_contextual_recall_metric, RAGAS_metric]
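
The two threshold values are not interpreted the same way for every metric. As I read the DeepEval documentation, the relevancy, faithfulness, and contextual metrics treat the threshold as a minimum, while Bias, Toxicity, and Hallucination treat it as a maximum because a lower score is better there. The sketch below illustrates that rule with a hypothetical `passes` helper; it is not DeepEval code, and the grouping should be checked against the installed version.

```python
# Metrics where a lower score is better, so the threshold acts as a maximum.
# Assumption: this grouping follows the DeepEval docs; verify for your version.
LOWER_IS_BETTER = {"Bias", "Toxicity", "Hallucination"}


def passes(metric_name: str, score: float, threshold: float) -> bool:
    """Hypothetical helper: turn a score and a threshold into pass/fail."""
    if metric_name in LOWER_IS_BETTER:
        return score <= threshold
    return score >= threshold


print(passes("Faithfulness", 0.875, 0.7))        # True
print(passes("Contextual Precision", 0.5, 0.7))  # False
print(passes("Bias", 0.0, 0.5))                  # True
```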

## 4. Perform evaluation on all rows

In [8]:
test_results = []
for idx, row in trulens_cosinesimilarity_results_df.iterrows():
    test_case = LLMTestCase(
        input=row["question"],
        actual_output=row["RAG_generated_answer"],
        expected_output=row["true_answer"],
        retrieval_context=[row["context"]],
        context=[row["context"]]
    )

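    # get the test result; skip this row if the evaluation fails,
    # otherwise `results` would be undefined or left over from the previous row
    try:
        results = evaluate(test_cases=[test_case], metrics=metrics)
    except Exception as e:
        print(f"Error processing row {idx}: {e}")
        continue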

    # iterate through the test results
    for test in results.test_results:
        test_data = {
            "test_case": test.name,
            "success": test.success,
            "question": test.input,
            "RAG_generated_answer": test.actual_output,
            "true_answer": test.expected_output,
            "context": test.retrieval_context
        }

        # extract metrics
        for metric in test.metrics_data:
            test_data[f"{metric.name}_score"] = metric.score
            test_data[f"{metric.name}_reason"] = metric.reason
            test_data[f"{metric.name}_success"] = metric.success

        # Append the structured test result
        test_results.append(test_data)

Event loop is already running. Applying nest_asyncio patch to allow async execution...





Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: Great job on providing a very relevant and concise response! The score is 1.00 because the output directly addresses the question asked in the input., error: None)
  - ✅ Faithfulness (score: 0.875, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.88 because the contradictions point out that the claim about the task modifying the original data is inaccurate., error: None)
  - ❌ Contextual Precision (score: 0.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.00 because the only retrieval context provided is irrelevant to addressing the behavior of the code in question., error: None)
  - ❌ Contextual Recall (score: 0.3333333333333333, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.33 because the expected output includes instructions that al







Metrics Summary

  - ✅ Answer Relevancy (score: 0.8571428571428571, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.86 because although the response provided relevant information on making 'CcdImageList' iterable, there was an irrelevant comment about adding a custom iterator to 'swig_main.i'., error: None)
  - ❌ Faithfulness (score: 0.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.00 because the actual output contains contradictory information not mentioned in the retrieval context., error: None)
  - ❌ Contextual Precision (score: 0.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.00 because the one retrieval context present is a 'no' verdict for a node, and it ranks the information as irrelevant to the input., error: None)
  - ❌ Contextual Recall (score: 0.2, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.20 becau







Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 1.00 because the response addresses the specific questions asked in the input with relevant information., error: None)
  - ✅ Faithfulness (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: Great job! The actual output perfectly aligns with the retrieval context with no contradictions., error: None)
  - ❌ Contextual Precision (score: 0.5, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.50 because the 'no' verdicts (2nd, 3rd, 4th nodes) are not directly relevant to the specific question asked about alert issuance for DIASources with S/N<5, while the 'yes' verdict (1st node) addresses the prompt availability of forced photometry in PPDB., error: None)
  - ❌ Contextual Recall (score: 0.6666666666666666, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason




Exception raised in Job[0]: LLMDidNotFinishException(The LLM generation was not completed. Please increase the max_tokens and try again.)



Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 1.00 because the output directly addresses the question asked with relevant information., error: None)
  - ❌ Faithfulness (score: 0.625, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.62 because the actual output includes information that is not supported by the retrieval context, such as mentioning LSST DAF package, claiming Zurich is a city in London, and providing a code example without relevant details., error: None)
  - ❌ Contextual Precision (score: 0.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.00 because the single 'no' verdict indicates that the context provided does not directly address the inquiry in a manner that would demonstrate contextual precision., error: None)
  - ❌ Contextual Recall (score: 0.2857142857142857, threshold: 0.7, strict: Fals







Metrics Summary

  - ✅ Answer Relevancy (score: 0.9090909090909091, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.91 because the output addresses the main issue of building FFTW with texinfo installed, but contains some irrelevant statements about a bug in texinfo-5., error: None)
  - ❌ Faithfulness (score: 0.2, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.20 because the actual output contradicts the fact that texinfo-6 is the latest version and that FFTW 3.3.4 includes fixes for texinfo, as well as the information about the 'info' directory containing all necessary .texi files for building FFTW manuals, as per the context., error: None)
  - ❌ Contextual Precision (score: 0.0, threshold: 0.7, strict: False, evaluation model: gpt-3.5-turbo, reason: The score is 0.00 because all the retrieved contexts were deemed irrelevant to the issue at hand., error: None)
  - ❌ Contextual Recall (score: 0.0, 
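
The evaluation loop turns each metric into three flat columns (`<name>_score`, `<name>_reason`, `<name>_success`), which is what makes the later DataFrame merge possible. The sketch below isolates that flattening step so it can be tried without calling an LLM. `FakeMetricData` is a hypothetical stand-in that carries only the four attributes the loop reads; it is not a DeepEval class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeMetricData:
    """Stand-in for one entry of `test.metrics_data` (assumption, not DeepEval's type)."""
    name: str
    score: Optional[float]
    reason: Optional[str]
    success: bool


def flatten_metrics(metrics_data) -> dict:
    """Spread each metric into <name>_score, <name>_reason and <name>_success keys."""
    row = {}
    for metric in metrics_data:
        row[f"{metric.name}_score"] = metric.score
        row[f"{metric.name}_reason"] = metric.reason
        row[f"{metric.name}_success"] = metric.success
    return row


row = flatten_metrics([
    FakeMetricData("Answer Relevancy", 1.0, "Directly addresses the question.", True),
    FakeMetricData("Contextual Precision", 0.0, "Context is irrelevant.", False),
])
print(sorted(row))
```

Because the metric name becomes part of the column name, names containing spaces or parentheses, such as `Answer Relevancy (ragas)`, end up verbatim in the DataFrame headers.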




In [9]:
# save the results to a dataframe
deepeval_results_df = pd.DataFrame(test_results)

# merge the results with the trulens and cosine similarity results
rag_results_df = trulens_cosinesimilarity_results_df.merge(deepeval_results_df, 
                                                           on="question", 
                                                           how="inner")
rag_results_df.drop(columns=["app_id", "test_case", "success", 
                             "true_answer_x", "context_x", "RAG_generated_answer_x",
                             "Answer Relevancy (ragas)_reason", "Faithfulness (ragas)_reason",
                             'Contextual Precision (ragas)_reason', 'Contextual Recall (ragas)_reason'], inplace=True)

rag_results_df.rename(columns={"true_answer_y": "true_answer", 
                               "context_y": "context", 
                               "RAG_generated_answer_y": "RAG_generated_answer"}, inplace=True)
rag_results_df.head()

Unnamed: 0,question,trulens_Answer_Relevance,trulens_Groundedness,trulens_Context_Relevance,cosine_Answer_Relevance,cosine_Groundedness,cosine_Context_Relevance,similarity_to_ground_truth,RAG_generated_answer,true_answer,...,Answer Relevancy (ragas)_success,Faithfulness (ragas)_score,Faithfulness (ragas)_success,Contextual Precision (ragas)_score,Contextual Precision (ragas)_success,Contextual Recall (ragas)_score,Contextual Recall (ragas)_success,RAGAS_score,RAGAS_reason,RAGAS_success
0,"Hi, \nI’m following this tutorial: The LSST S...",1.0,0.121212,0.333333,0.130791,0.266586,0.602906,0.130791,\nAnswer: Hi Petar! This behavior is expected ...,Quick comment on the code: \n \n \n \n petarz...,...,True,0.222222,False,0.0,False,0.0,False,0.268265,,False
1,I have the following C++ class : \n class CcdI...,1.0,0.0,0.5,0.593059,0.469574,0.455618,0.593059,\nAnswer:\nTo make `CcdImageList` iterable in ...,After several iteration with @ktl and @rowe...,...,True,0.666667,False,0.0,False,0.0,False,0.174075,,False
2,Question on how forced photometry will be run ...,1.0,0.111111,0.666667,0.772692,0.690596,0.751487,0.772692,\nAnswer: The system will store DIAForcedSourc...,I take this to mean that a DIASource which is ...,...,True,0.0,False,1.0,True,0.5,False,0.496016,,False
3,"Hi there, \n Is there some way I find out what...",1.0,0.0,0.333333,0.559476,0.547306,0.606108,0.559476,\nAnswer: To list the available values for X w...,Hi James \nmaybe \n dafButler.Butler.get_known...,...,True,0.111111,False,0.0,False,0.0,False,,,False
4,I’m having trouble building FFTW with texinfo ...,,,,0.26236,0.319938,0.356063,0.26236,\nAnswer: The known issue you are experiencing...,This has now been fixed and 3.3.4 is the curre...,...,True,0.785714,True,0.0,False,0.0,False,0.329934,,False
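
The merge joins on `question`, so any non-key column present in both frames (`true_answer`, `context`, `RAG_generated_answer`) comes back with `_x` and `_y` suffixes, and a repeated question would multiply rows. The sketch below shows both effects on toy data. Passing `validate="one_to_one"` is an optional safeguard that the notebook itself does not use.

```python
import pandas as pd

# Toy frames standing in for the TruLens/cosine results and the DeepEval results.
left = pd.DataFrame({"question": ["q1", "q2"], "context": ["c1", "c2"]})
right = pd.DataFrame({
    "question": ["q1", "q2"],
    "context": [["c1"], ["c2"]],
    "Bias_score": [0.0, 0.1],
})

# Overlapping non-key columns receive the default _x (left) and _y (right) suffixes.
merged = left.merge(right, on="question", how="inner", validate="one_to_one")
print(list(merged.columns))  # ['question', 'context_x', 'context_y', 'Bias_score']

# With a duplicated question, validate="one_to_one" raises instead of duplicating rows.
dupes = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    left.merge(dupes, on="question", how="inner", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError as err:
    duplicate_detected = True
    print(f"Duplicate questions detected: {err}")
```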


## 5. Save the dataframe for reference

In [10]:
timestamp = datetime.now().strftime("%Y-%m-%d_%H-%M-%S")

In [11]:
answer_relevancy_metric_cols = ['question', 'context', 'RAG_generated_answer', 'true_answer', 
                                'trulens_Answer_Relevance', 'cosine_Answer_Relevance', 
                                'Answer Relevancy (ragas)_score', 'Answer Relevancy (ragas)_success',
                                'Answer Relevancy_score', 'Answer Relevancy_reason', 'Answer Relevancy_success']
# rename on a fresh frame (no inplace) so pandas does not raise SettingWithCopyWarning
answer_relevancy_rag_results_df = rag_results_df[answer_relevancy_metric_cols].rename(
    columns={"trulens_Answer_Relevance": "TruLens",
             "cosine_Answer_Relevance": "Cosine_Similarity",
             "Answer Relevancy (ragas)_score": "RAGAS_score",
             "Answer Relevancy (ragas)_success": "is_RAGAS_threshold_success",
             "Answer Relevancy_score": "DeepEval_score",
             "Answer Relevancy_reason": "DeepEval_reason",
             "Answer Relevancy_success": "is_DeepEval_threshold_success"})

filename = f"data/results/full/RAG_results_answer_relevancy_{timestamp}.csv"
answer_relevancy_rag_results_df.to_csv(filename, index=False)
answer_relevancy_rag_results_df.head()

Unnamed: 0,question,context,RAG_generated_answer,true_answer,TruLens,Cosine_Similarity,RAGAS_score,is_RAGAS_threshold_success,DeepEval_score,DeepEval_reason,is_DeepEval_threshold_success
0,"Hi, \nI’m following this tutorial: The LSST S...",[Draft\nLVV-P106: Data Management Acceptance T...,\nAnswer: Hi Petar! This behavior is expected ...,Quick comment on the code: \n \n \n \n petarz...,1.0,0.130791,0.754025,True,1.0,Great job on providing a very relevant and con...,True
1,I have the following C++ class : \n class CcdI...,"[In most cases, the SWIG files from the curren...",\nAnswer:\nTo make `CcdImageList` iterable in ...,After several iteration with @ktl and @rowe...,1.0,0.593059,0.870377,True,0.857143,The score is 0.86 because although the respons...,True
2,Question on how forced photometry will be run ...,[DPDD | LSE-163 | Latest Revision 2023-07-10\n...,\nAnswer: The system will store DIAForcedSourc...,I take this to mean that a DIASource which is ...,1.0,0.772692,0.769487,True,1.0,The score is 1.00 because the response address...,True
3,"Hi there, \n Is there some way I find out what...",[3   Overview\nThe Butler is implemented as th...,\nAnswer: To list the available values for X w...,Hi James \nmaybe \n dafButler.Butler.get_known...,1.0,0.559476,0.847545,True,1.0,The score is 1.00 because the output directly ...,True
4,I’m having trouble building FFTW with texinfo ...,[LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use...,\nAnswer: The known issue you are experiencing...,This has now been fixed and 3.3.4 is the curre...,,0.26236,0.834894,True,0.909091,The score is 0.91 because the output addresses...,True
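
The select, rename, and save steps repeat for each metric family in this section. A small helper is one way to factor that out. The sketch below uses a hypothetical `export_metric_view` function and a toy frame, and writes to a temporary directory rather than `data/results/full/`.

```python
import os
import tempfile

import pandas as pd


def export_metric_view(df, columns, rename_map, path):
    """Select columns, rename them on a fresh frame, and write the result to CSV."""
    # rename() without inplace=True returns a new DataFrame, so there is no
    # ambiguous view/copy for pandas to warn about (SettingWithCopyWarning).
    view = df[columns].rename(columns=rename_map)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    view.to_csv(path, index=False)
    return view


# Toy frame standing in for rag_results_df (assumption, not real data).
toy_df = pd.DataFrame({
    "question": ["q1"],
    "trulens_Answer_Relevance": [1.0],
    "Answer Relevancy_score": [0.9],
    "unused": ["x"],
})
out_path = os.path.join(tempfile.mkdtemp(), "full", "answer_relevancy.csv")
view = export_metric_view(
    toy_df,
    ["question", "trulens_Answer_Relevance", "Answer Relevancy_score"],
    {"trulens_Answer_Relevance": "TruLens", "Answer Relevancy_score": "DeepEval_score"},
    out_path,
)
print(list(view.columns))  # ['question', 'TruLens', 'DeepEval_score']
```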


In [12]:
groundedness_metric_cols = ['question', 'context', 'RAG_generated_answer', 'true_answer', 
                           'trulens_Groundedness', 'cosine_Groundedness', 
                           'Faithfulness_score', 'Faithfulness_reason','Faithfulness_success',
                           'Faithfulness (ragas)_score', 'Faithfulness (ragas)_success',]
# rename on a fresh frame (no inplace) so pandas does not raise SettingWithCopyWarning
groundedness_metric_rag_results_df = rag_results_df[groundedness_metric_cols].rename(
    columns={"trulens_Groundedness": "TruLens",
             "cosine_Groundedness": "Cosine_Similarity",
             "Faithfulness_score": "DeepEval_score",
             "Faithfulness_reason": "DeepEval_reason",
             "Faithfulness_success": "is_DeepEval_threshold_success",
             "Faithfulness (ragas)_score": "RAGAS_score",
             "Faithfulness (ragas)_success": "is_RAGAS_threshold_success"})

filename = f"data/results/full/RAG_results_groundedness_{timestamp}.csv"
groundedness_metric_rag_results_df.to_csv(filename, index=False)
groundedness_metric_rag_results_df.head()

Unnamed: 0,question,context,RAG_generated_answer,true_answer,TruLens,Cosine_Similarity,DeepEval_score,DeepEval_reason,is_DeepEval_threshold_success,RAGAS_score,is_RAGAS_threshold_success
0,"Hi, \nI’m following this tutorial: The LSST S...",[Draft\nLVV-P106: Data Management Acceptance T...,\nAnswer: Hi Petar! This behavior is expected ...,Quick comment on the code: \n \n \n \n petarz...,0.121212,0.266586,0.875,The score is 0.88 because the contradictions p...,True,0.222222,False
1,I have the following C++ class : \n class CcdI...,"[In most cases, the SWIG files from the curren...",\nAnswer:\nTo make `CcdImageList` iterable in ...,After several iteration with @ktl and @rowe...,0.0,0.469574,0.0,The score is 0.00 because the actual output co...,False,0.666667,False
2,Question on how forced photometry will be run ...,[DPDD | LSE-163 | Latest Revision 2023-07-10\n...,\nAnswer: The system will store DIAForcedSourc...,I take this to mean that a DIASource which is ...,0.111111,0.690596,1.0,Great job! The actual output perfectly aligns ...,True,0.0,False
3,"Hi there, \n Is there some way I find out what...",[3   Overview\nThe Butler is implemented as th...,\nAnswer: To list the available values for X w...,Hi James \nmaybe \n dafButler.Butler.get_known...,0.0,0.547306,0.625,The score is 0.62 because the actual output in...,False,0.111111,False
4,I’m having trouble building FFTW with texinfo ...,[LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use...,\nAnswer: The known issue you are experiencing...,This has now been fixed and 3.3.4 is the curre...,,0.319938,0.2,The score is 0.20 because the actual output co...,False,0.785714,True


In [13]:
contextual_metric_cols = ['question', 'context', 'RAG_generated_answer', 'true_answer', 
                           'trulens_Context_Relevance',  'cosine_Context_Relevance',
                           'Contextual Relevancy_score', 'Contextual Relevancy_reason', 'Contextual Relevancy_success',
                           'Contextual Precision (ragas)_score', 'Contextual Precision (ragas)_success',
                           'Contextual Precision_score', 'Contextual Precision_reason', 'Contextual Precision_success',
                           'Contextual Recall_score', 'Contextual Recall_reason', 'Contextual Recall_success', 
                           'Contextual Recall (ragas)_score', 'Contextual Recall (ragas)_success']
# rename on a fresh frame (no inplace) so pandas does not raise SettingWithCopyWarning
contextual_metric_rag_results_df = rag_results_df[contextual_metric_cols].rename(
    columns={"trulens_Context_Relevance": "TruLens_Context_Relevance",
             "cosine_Context_Relevance": "Cosine_Similarity_Context_Relevance"})

filename = f"data/results/full/RAG_results_contextual_metrics_{timestamp}.csv"
contextual_metric_rag_results_df.to_csv(filename, index=False)
contextual_metric_rag_results_df.head()

Unnamed: 0,question,context,RAG_generated_answer,true_answer,TruLens_Context_Relevance,Cosine_Similarity_Context_Relevance,Contextual Relevancy_score,Contextual Relevancy_reason,Contextual Relevancy_success,Contextual Precision (ragas)_score,Contextual Precision (ragas)_success,Contextual Precision_score,Contextual Precision_reason,Contextual Precision_success,Contextual Recall_score,Contextual Recall_reason,Contextual Recall_success,Contextual Recall (ragas)_score,Contextual Recall (ragas)_success
0,"Hi, \nI’m following this tutorial: The LSST S...",[Draft\nLVV-P106: Data Management Acceptance T...,\nAnswer: Hi Petar! This behavior is expected ...,Quick comment on the code: \n \n \n \n petarz...,0.333333,0.602906,0.0,The score is 0.00 because there are no relevan...,False,0.0,False,0.0,The score is 0.00 because the only retrieval c...,False,0.333333,The score is 0.33 because the expected output ...,False,0.0,False
1,I have the following C++ class : \n class CcdI...,"[In most cases, the SWIG files from the curren...",\nAnswer:\nTo make `CcdImageList` iterable in ...,After several iteration with @ktl and @rowe...,0.5,0.455618,0.166667,The score is 0.17 because the statements provi...,False,0.0,False,0.0,The score is 0.00 because the one retrieval co...,False,0.2,The score is 0.20 because although the sentenc...,False,0.0,False
2,Question on how forced photometry will be run ...,[DPDD | LSE-163 | Latest Revision 2023-07-10\n...,\nAnswer: The system will store DIAForcedSourc...,I take this to mean that a DIASource which is ...,0.666667,0.751487,0.75,The score is 0.75 because the statement in the...,True,1.0,True,0.5,The score is 0.50 because the 'no' verdicts (2...,False,0.666667,The score is 0.67 because the sentences in the...,False,0.5,False
3,"Hi there, \n Is there some way I find out what...",[3   Overview\nThe Butler is implemented as th...,\nAnswer: To list the available values for X w...,Hi James \nmaybe \n dafButler.Butler.get_known...,0.333333,0.606108,0.0,The score is 0.00 because the input is not dir...,False,0.0,False,0.0,The score is 0.00 because the single 'no' verd...,False,0.285714,The score is 0.29 because the sentence matches...,False,0.0,False
4,I’m having trouble building FFTW with texinfo ...,[LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use...,\nAnswer: The known issue you are experiencing...,This has now been fixed and 3.3.4 is the curre...,,0.356063,0.0,The score is 0.00 because the context provided...,False,0.0,False,0.0,The score is 0.00 because all the retrieved co...,False,0.0,The score is 0.00 because the sentence cannot ...,False,0.0,False
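
The table above carries one `_success` flag per metric, and most of them are `False`. A per-metric pass rate is a quick way to see which part of the pipeline fails most often. The sketch below computes it over plain dictionaries using a hypothetical `pass_rates` helper; the toy records are illustrative, and the real ones could come from `rag_results_df.to_dict("records")`.

```python
def pass_rates(records):
    """Fraction of rows where each `<metric>_success` flag is truthy."""
    totals, passed = {}, {}
    for record in records:
        for key, value in record.items():
            if not key.endswith("_success"):
                continue
            metric = key[: -len("_success")]
            totals[metric] = totals.get(metric, 0) + 1
            passed[metric] = passed.get(metric, 0) + (1 if value else 0)
    return {metric: passed[metric] / totals[metric] for metric in totals}


# Toy records (assumption, not the notebook's results).
records = [
    {"question": "q1", "Contextual Relevancy_success": False, "Contextual Recall_success": False},
    {"question": "q2", "Contextual Relevancy_success": True, "Contextual Recall_success": False},
]
rates = pass_rates(records)
print(rates)  # {'Contextual Relevancy': 0.5, 'Contextual Recall': 0.0}
```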


In [14]:
cosine_similarity_ground_truth_cols = ['question', 'context', 'RAG_generated_answer', 'true_answer', 'similarity_to_ground_truth']
# rename on a fresh frame (no inplace); the original key carried stray quotes, so the rename never matched
cosine_similarity_ground_truth_df = rag_results_df[cosine_similarity_ground_truth_cols].rename(
    columns={"similarity_to_ground_truth": "cosine_similarity_true_answer_ground_truth"})

filename = f"data/results/full/RAG_results_cosine_similarity_ground_truth_{timestamp}.csv"
cosine_similarity_ground_truth_df.to_csv(filename, index=False)
cosine_similarity_ground_truth_df.head()

Unnamed: 0,question,context,RAG_generated_answer,true_answer,cosine_similarity_true_answer_ground_truth
0,"Hi, \nI’m following this tutorial: The LSST S...",[Draft\nLVV-P106: Data Management Acceptance T...,\nAnswer: Hi Petar! This behavior is expected ...,Quick comment on the code: \n \n \n \n petarz...,0.130791
1,I have the following C++ class : \n class CcdI...,"[In most cases, the SWIG files from the curren...",\nAnswer:\nTo make `CcdImageList` iterable in ...,After several iteration with @ktl and @rowe...,0.593059
2,Question on how forced photometry will be run ...,[DPDD | LSE-163 | Latest Revision 2023-07-10\n...,\nAnswer: The system will store DIAForcedSourc...,I take this to mean that a DIASource which is ...,0.772692
3,"Hi there, \n Is there some way I find out what...",[3   Overview\nThe Butler is implemented as th...,\nAnswer: To list the available values for X w...,Hi James \nmaybe \n dafButler.Butler.get_known...,0.559476
4,I’m having trouble building FFTW with texinfo ...,[LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use...,\nAnswer: The known issue you are experiencing...,This has now been fixed and 3.3.4 is the curre...,0.26236
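
When a column subset like the one above is renamed, it can help to take an explicit copy first and to pass `errors="raise"`, so that a mistyped key fails loudly instead of being skipped. The following is a minimal sketch on a made-up two-row frame; `toy_df`, its values, and the target name `cosine_similarity_ground_truth` are placeholders for illustration and are not taken from the notebook's data.

```python
import pandas as pd

# Toy stand-in for rag_results_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "question": ["q1", "q2"],
    "similarity_to_ground_truth": [0.13, 0.59],
})

# .copy() returns an independent frame, so later changes to the subset
# neither touch the parent frame nor trigger SettingWithCopyWarning.
subset_df = toy_df[["question", "similarity_to_ground_truth"]].copy()

# By default, rename() silently skips keys that match no column.
# Here the key contains stray quote characters, so nothing is renamed.
unchanged = subset_df.rename(columns={"'similarity_to_ground_truth'": "new_name"})

# errors="raise" turns the same typo into a KeyError.
try:
    subset_df.rename(columns={"'similarity_to_ground_truth'": "new_name"}, errors="raise")
    typo_detected = False
except KeyError:
    typo_detected = True

# With the correct key the rename goes through.
renamed = subset_df.rename(
    columns={"similarity_to_ground_truth": "cosine_similarity_ground_truth"},
    errors="raise",
)
print(list(renamed.columns))
```

Returning a new frame from `rename` (rather than using `inplace=True`) also keeps the original subset available for comparison.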


In [15]:
other_deepeval_metric_cols = ['question', 'context', 'RAG_generated_answer', 'true_answer', 
                            'Bias_score', 'Bias_reason', 'Bias_success', 
                            'Toxicity_score', 'Toxicity_reason', 'Toxicity_success', 
                            'Hallucination_score', 'Hallucination_reason', 'Hallucination_success']
other_deepeval_metric_rag_results_df = rag_results_df[other_deepeval_metric_cols]

filename = f"data/results/full/RAG_results_other_deepeval_metrics_{timestamp}.csv"
other_deepeval_metric_rag_results_df.to_csv(filename, index=False)
other_deepeval_metric_rag_results_df.head()

Unnamed: 0,question,context,RAG_generated_answer,true_answer,Bias_score,Bias_reason,Bias_success,Toxicity_score,Toxicity_reason,Toxicity_success,Hallucination_score,Hallucination_reason,Hallucination_success
0,"Hi, \nI’m following this tutorial: The LSST S...",[Draft\nLVV-P106: Data Management Acceptance T...,\nAnswer: Hi Petar! This behavior is expected ...,Quick comment on the code: \n \n \n \n petarz...,0.0,The score is 0.00 because there are no identif...,True,0.0,The score is 0.00 because there are no reasons...,True,1.0,The score is 1.00 because the actual output do...,False
1,I have the following C++ class : \n class CcdI...,"[In most cases, the SWIG files from the curren...",\nAnswer:\nTo make `CcdImageList` iterable in ...,After several iteration with @ktl and @rowe...,0.0,The score is 0.00 because there are no reasons...,True,0.0,The score is 0.00 because there are no reasons...,True,1.0,The score is 1.00 because there are contradict...,False
2,Question on how forced photometry will be run ...,[DPDD | LSE-163 | Latest Revision 2023-07-10\n...,\nAnswer: The system will store DIAForcedSourc...,I take this to mean that a DIASource which is ...,0.0,The score is 0.00 because there are no reasons...,True,0.0,The score is 0.00 because there are no reasons...,True,0.5,The score is 0.50 because the actual output al...,True
3,"Hi there, \n Is there some way I find out what...",[3   Overview\nThe Butler is implemented as th...,\nAnswer: To list the available values for X w...,Hi James \nmaybe \n dafButler.Butler.get_known...,0.0,The score is 0.00 because there are no reasons...,True,0.0,The score is 0.00 because there are no reasons...,True,1.0,The score is 1.00 because the actual output pr...,False
4,I’m having trouble building FFTW with texinfo ...,[LARGE SYNOPTIC SURVEY TELESCOPE\nNotes on use...,\nAnswer: The known issue you are experiencing...,This has now been fixed and 3.3.4 is the curre...,0.0,The score is 0.00 because there are no reasons...,True,0.0,The score is 0.00 because there are no reasons...,True,1.0,The score is 1.00 because the actual output is...,False
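
A quick way to read a table like this is to aggregate the `*_score` and `*_success` columns. The sketch below re-enters only the five rows displayed above (not the full results file), so the numbers describe this preview alone. The frame name `shown_rows` is a placeholder. In these rows a Hallucination score of 0.5 passes and 1.0 fails, which is consistent with lower scores being better for these three metrics.

```python
import pandas as pd

# The five preview rows shown above, reduced to score and success columns.
shown_rows = pd.DataFrame({
    "Bias_score": [0.0, 0.0, 0.0, 0.0, 0.0],
    "Bias_success": [True, True, True, True, True],
    "Toxicity_score": [0.0, 0.0, 0.0, 0.0, 0.0],
    "Toxicity_success": [True, True, True, True, True],
    "Hallucination_score": [1.0, 1.0, 0.5, 1.0, 1.0],
    "Hallucination_success": [False, False, True, False, False],
})

score_cols = [c for c in shown_rows.columns if c.endswith("_score")]
success_cols = [c for c in shown_rows.columns if c.endswith("_success")]

# Mean score per metric over the preview rows.
mean_scores = shown_rows[score_cols].mean()

# Fraction of preview rows that passed each metric (True counts as 1).
pass_rates = shown_rows[success_cols].mean()

print(mean_scores)
print(pass_rates)
```

On the full `other_deepeval_metric_rag_results_df`, the same two `.mean()` calls would give dataset-level averages, assuming the success columns are stored as booleans.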
