Evaluation Prompt Templates

In [49]:
context_relevance_prompt="""
You are an expert evaluator for a Retrieval Augmented Generation system. Your task is to assess the relevance of the provided context for answering a given question.

**Question**:
{question}

Retrieved Context:
---
{doc_1_content}
---

Instructions:
Consider each document in the retrieved context. Evaluate how relevant each document is to the user's **Question**.
Then, provide an overall score for the entire set of retrieved context. The score should be a single floating-point number between 0.0 and 1.0, where 0.0 means the context is completely irrelevant or unhelpful, and 1.0 means the context is perfectly relevant and provides all necessary information to answer the question.

Briefly explain your reasoning for the score, noting which document is most or least relevant and why.

Please provide a response in a structured JSON format that matches the following format:
{{
  "reasoning": <Your brief explanation>,
  "score": <Your Score as a float, e.g., 0.85>
}}
"""

In [51]:
faithfulness_prompt="""
You are an expert evaluator for a Retrieval Augmented Generation system. Your task is to assess the faithfulness of a generated answer to its provided context. An answer is faithful if all claims made in the answer are supported by the information present in the retrieved context. The answer should not make up information or contradict the context.

Retrieved Context:
---
{doc_1_content}
---

Question (for reference):
{question}

Generated Answer:
{generated_answer}

Carefully compare the **Generated Answer** with the **Retrieved Context**. Determine if all factual statements in the answer can be verified from the context.
Provide a score between 0.0 and 1.0, where:
* 0.0 means the answer is completely unfaithful (e.g., contradicts the context or is entirely based on information outside the context).
* 1.0 means the answer is perfectly faithful and all its claims are fully supported by the context.

Identify any specific claims in the answer that are not supported by the context or contradict it. If the answer is fully faithful, state that.

Please provide a response in a structured JSON format that matches the following format:
{{
  "reasoning": <Your brief explanation>,
  "score": <Your Score as a float, e.g., 0.85>
}}
"""

In [53]:
answer_relevance_to_question_prompt="""
You are an expert evaluator for a Retrieval Augmented Generation system. Your task is to assess how relevant and complete a generated answer is with respect to the user's question.

**Question:**
{question}

**Generated Answer:**
{generated_answer}

Evaluate the **Generated Answer** based on how well it addresses the **Question**. Consider the following:
* **Directness:** Does the answer directly address the main intent of the question?
* **Completeness:** Does the answer provide a reasonably complete response to the question, or does it miss key aspects?
* **Focus:** Is the answer focused on the question, or does it include irrelevant information?

Provide a score between 0.0 and 1.0, where:
* 0.0 means the answer is completely irrelevant to the question or fails to address it at all.
* 1.0 means the answer is perfectly relevant, directly addresses the question, and is comprehensive.

Explain why the answer is or isn't relevant, noting any aspects of the question that were well-addressed or missed.

Please provide a response in a structured JSON format that matches the following format:

{{
  "reasoning": <Your brief explanation>,
  "score": <Your Score as a float, e.g., 0.85>
}}

"""

In [55]:
answer_correctness_score_prompt="""
You are an expert evaluator for a Retrieval Augmented Generation system. Your task is to assess the factual correctness and completeness of a **Generated Answer** by comparing it against a **Ground Truth Answer**.

**Question:**
{question}

**Generated Answer:**
{generated_answer}

**Ground Truth Answer:**
{ground_truth_answer}

**Instructions:**
Carefully compare the **Generated Answer** with the **Ground Truth Answer**.
Evaluate the **Generated Answer** based on the following criteria:
* **Factual Accuracy:** Are all facts presented in the **Generated Answer** accurate when compared to the **Ground Truth Answer**? Does it introduce any inaccuracies or contradictions?
* **Completeness:** Does the **Generated Answer** cover all the key information present in the **Ground Truth Answer** relevant to the question? Does it omit any critical details?
* **Conciseness (Optional, if important):** Does the **Generated Answer** provide the information without unnecessary verbosity compared to the ground truth? (Consider if this is a primary evaluation goal).

Provide an overall score for correctness between 0.0 and 1.0, where:
* 0.0 means the **Generated Answer** is completely incorrect, factually inaccurate, or entirely misses the information present in the **Ground Truth Answer**.
* 0.5 means the **Generated Answer** has some correct elements but also contains significant inaccuracies or omissions when compared to the **Ground Truth Answer**.
* 1.0 means the **Generated Answer** is perfectly correct, factually accurate, and complete with respect to the **Ground Truth Answer**.

**Reasoning (Optional but Recommended):**
Explain your score by highlighting any factual inaccuracies, omissions, or (if evaluating conciseness) unnecessary information in the **Generated Answer** when compared to the **Ground Truth Answer**.

Please provide a response in a structured JSON format that matches the following format:
{{
  "reasoning": <Your brief explanation>,
  "score": <Your Score as a float, e.g., 0.5>
}}

"""

In [33]:
import pandas as pd
ansvals=pd.read_parquet("datafiles/output_files/rag_generated_answers.parquet")

In [35]:
ansvals.head()

Unnamed: 0,question,answer,source_category,retrieved_context,rag_answer
0,What percentage of 16 to 20-year-olds in the U...,81%,politics,UK youth 'interested' in politics\n\nThe major...,81%
1,When did Portishead win the Mercury Music Prize?,1995,entertainment,"""We've just had our heads down really, we've n...",The provided text does not state when Portishe...
2,What animals are covered by Texas hunting laws...,"State laws on hunting only covered ""regulated ...",tech,". ""Animals hit but not killed would without do...",Texas hunting laws cover regulated animals suc...
3,What is the slogan for the Conservative Party'...,"""It's not racist to impose limits on immigration""",politics,The Tories have promised an upper limit on the...,"""It's not racist to impose limits on immigration"""
4,Who is the chairman of Wada?,Dick Pound,sport,Wada will appeal against ruling\n\nThe World A...,The provided text does not name the chairman o...


In [45]:
import mlflow
import datetime

In [47]:
retriever_model_name = "gemini_text-embedding-004"
generator_model_name = "gemini-1.5-flash"
retriever_top_k_config = "1"

In [None]:
evaluation_dataset=ansvals[:2]

Defining Metric Call Functions

In [105]:
from langchain_google_genai import ChatGoogleGenerativeAI
import os
from pydantic import BaseModel

os.environ["GOOGLE_API_KEY"] = "<gemini api key>"
judge_llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash")

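class JudgeResponse(BaseModel):
    """Structured verdict returned by the judge model."""
    reasoning: str
    # The score is kept as a string (e.g. '1.0'); callers convert it with float().
    score: str

# Ask the judge model to return output that matches the JudgeResponse schema.
structured_llm = judge_llm.with_structured_output(JudgeResponse)

def call_llm(prompt):
    """Send a formatted prompt to the judge and return its JudgeResponse."""
    return structured_llm.invoke(prompt)
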
def calculate_context_relevance(question,context):
    promptval=context_relevance_prompt.format(question=question,doc_1_content=context)
    response=call_llm(promptval)
    return response

def calculate_faithfulness(question,context,generated_answer):
    promptval=faithfulness_prompt.format(question=question,doc_1_content=context,generated_answer=generated_answer)
    response=call_llm(promptval)
    return response

def calculate_answer_relevance_to_question(question,generated_answer):
    promptval=answer_relevance_to_question_prompt.format(question=question,generated_answer=generated_answer)
    response=call_llm(promptval)
    return response

def calculate_answer_correctness_score(question,generated_answer,ground_truth_answer):
    promptval=answer_correctness_score_prompt.format(question=question,generated_answer=generated_answer,ground_truth_answer=ground_truth_answer)
    response=call_llm(promptval)
    return response
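
Because `JudgeResponse.score` is declared as a string, a later `float(...)` call will raise if the judge returns something non-numeric. The following is a minimal sketch of a defensive conversion, assuming out-of-range values should be clamped to the 0.0 to 1.0 range the prompts ask for. `parse_score` is a hypothetical helper, not part of the notebook.

```python
import math

def parse_score(raw_score):
    """Convert a judge score to a float in [0.0, 1.0], or None if unparseable."""
    try:
        value = float(str(raw_score).strip())
    except ValueError:
        return None
    if math.isnan(value):
        return None
    # Assumption: clamp rather than reject scores outside the requested range.
    return min(1.0, max(0.0, value))

print(parse_score("1.0"))
print(parse_score("high"))
```

Returning `None` for unparseable scores lets the caller decide whether to skip the item or retry the judge call instead of aborting the whole run.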

In [81]:
ansvals.iloc[0]

question             What percentage of 16 to 20-year-olds in the U...
answer                                                             81%
source_category                                               politics
retrieved_context    UK youth 'interested' in politics\n\nThe major...
rag_answer                                                         81%
Name: 0, dtype: object

In [69]:
calculate_context_relevance(ansvals.iloc[0]["question"],ansvals.iloc[0]["retrieved_context"])

JudgeResponse(reasoning='The retrieved context directly answers the question. It states that 81% of 16 to 20-year-olds in the UK feel strongly about issues like crime and education. Therefore, the context is highly relevant.', score='1.0')

In [73]:
calculate_faithfulness(ansvals.iloc[0]["question"],ansvals.iloc[0]["retrieved_context"],ansvals.iloc[0]["rag_answer"])

JudgeResponse(reasoning='The answer states that 81% of 16 to 20-year-olds in the UK feel strongly about issues like crime and education. The context states that research suggests 81% of 16 to 20-year-olds feel strongly about issues like crime and education. Therefore, the answer is faithful to the context.', score='1.0')

In [77]:
calculate_answer_relevance_to_question(ansvals.iloc[0]["question"],ansvals.iloc[0]["rag_answer"])

JudgeResponse(reasoning="The answer provides a percentage that directly answers the question about the proportion of 16 to 20-year-olds in the UK who feel strongly about issues like crime and education. It's concise and focused.", score='1.0')

In [85]:
calculate_answer_correctness_score(ansvals.iloc[0]["question"],ansvals.iloc[0]["rag_answer"],ansvals.iloc[0]["answer"])

JudgeResponse(reasoning='The generated answer is factually accurate and complete when compared to the ground truth answer. It provides the correct percentage.', score='1.0')

In [129]:
evaluation_dataset=ansvals

MLflow Evaluation Run

In [131]:
@mlflow.trace(name="RAG.evaluate_question_pipeline")
def get_metrics(question,retrieved_context,generated_answer,ground_truth_answer):
    context_relevance = calculate_context_relevance(question, retrieved_context)
    faithfulness = calculate_faithfulness(question, retrieved_context, generated_answer)
    answer_relevance_q = calculate_answer_relevance_to_question(question,generated_answer)
    correctness = calculate_answer_correctness_score(question, generated_answer, ground_truth_answer)
    return {"context_relevance":context_relevance,
            "faithfulness":faithfulness,
            "answer_relevance_q":answer_relevance_q,
            "correctness":correctness}

In [133]:
mlflow.set_tracking_uri("http://localhost:5000/")
run_name = f"RAG_Eval_{datetime.datetime.now().strftime('%Y%m%d_%H%M%S')}"
with mlflow.start_run(run_name=run_name) as run:
    print(f"MLflow Run ID: {run.info.run_id}")
    print(f"MLflow Experiment ID: {run.info.experiment_id}")

    # Log parameters
    mlflow.log_param("retriever_model", retriever_model_name)
    mlflow.log_param("generator_model", generator_model_name)
    mlflow.log_param("retriever_top_k", retriever_top_k_config)
    mlflow.log_param("evaluation_dataset_size", len(evaluation_dataset))

    # Running sums of each judge score across the dataset
    context_relevance_score = 0
    faithfulness_score = 0
    answer_relevance_q_score = 0
    correctness_score = 0
    for i, item in evaluation_dataset.iterrows():
        question = item["question"]
        ground_truth_answer = item["answer"]
        retrieved_context = item["retrieved_context"]
        generated_answer = item["rag_answer"]

        # Calculate metrics
        metrics_response = get_metrics(question, retrieved_context, generated_answer, ground_truth_answer)
        context_relevance = metrics_response["context_relevance"]
        faithfulness = metrics_response["faithfulness"]
        answer_relevance_q = metrics_response["answer_relevance_q"]
        correctness = metrics_response["correctness"]

        # Scores come back as strings, so convert them before summing
        context_relevance_score += float(context_relevance.score)
        faithfulness_score += float(faithfulness.score)
        answer_relevance_q_score += float(answer_relevance_q.score)
        correctness_score += float(correctness.score)

    # Log dataset-level averages; skip logging if the dataset is empty
    total_items = len(evaluation_dataset)
    if total_items > 0:
        mlflow.log_metric("avg_context_relevance", context_relevance_score / total_items)
        mlflow.log_metric("avg_faithfulness", faithfulness_score / total_items)
        mlflow.log_metric("avg_answer_relevance_q", answer_relevance_q_score / total_items)
        # Correctness against the ground truth answer
        mlflow.log_metric("avg_similarity_gt", correctness_score / total_items)

MLflow Run ID: 63d4820eee494ad1a73f2910ac3401be
MLflow Experiment ID: 0
🏃 View run RAG_Eval_20250606_234705 at: http://localhost:5000/#/experiments/0/runs/63d4820eee494ad1a73f2910ac3401be
🧪 View experiment at: http://localhost:5000/#/experiments/0


In [121]:
len(evaluation_dataset)

5

In [None]:
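# Assumption: metric_totals holds the running sums from the evaluation loop above.
metric_totals = {"context_relevance": context_relevance_score, "faithfulness": faithfulness_score,
                 "answer_relevance_q": answer_relevance_q_score, "correctness": correctness_score}
num_items = len(evaluation_dataset)
avg_metrics = {k: v / num_items if num_items > 0 else 0 for k, v in metric_totals.items()}
for metric_name, avg_value in avg_metrics.items():
    mlflow.log_metric(f"avg_{metric_name}_llm", avg_value)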
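
The averaging step can also be written as a small pure function that needs neither MLflow nor the judge, which makes it easy to check in isolation. This is a sketch only: `average_metrics` is a hypothetical helper and the scores in `rows` are made-up values, not results from this notebook.

```python
from collections import defaultdict

def average_metrics(per_item_scores):
    """Average each metric over a list of {metric_name: score} dicts."""
    totals = defaultdict(float)
    for scores in per_item_scores:
        for name, value in scores.items():
            # Judge scores may arrive as strings, so convert before summing.
            totals[name] += float(value)
    n = len(per_item_scores)
    # An empty input yields an empty dict instead of dividing by zero.
    return {name: total / n for name, total in totals.items()} if n else {}

# Made-up scores for two evaluated items.
rows = [
    {"context_relevance": "1.0", "faithfulness": "1.0"},
    {"context_relevance": "0.5", "faithfulness": "0.0"},
]
print(average_metrics(rows))
```

Each entry of the returned dict could then be passed to `mlflow.log_metric` inside an active run.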