# Raw RAG 08: Evaluation

## Introduction

In the development of Retrieval-Augmented Generation (RAG) systems, continuous improvement is essential. However, relying solely on intuition or "vibe" to judge whether a change helped can be misleading. This notebook introduces a systematic approach to evaluating RAG systems using key metrics inspired by the [RAGAS](https://docs.ragas.io/en/latest/concepts/metrics/index.html) framework. Additionally, we'll implement these metrics in a Streamlit application, providing an interactive interface to visualize and analyze the results.

### The Importance of Systematic Evaluation

As we implement various enhancements in our RAG pipeline, it's crucial to have reliable methods to quantify improvements. Systematic evaluation allows us to:

1. **Objectively Measure Performance**: Move beyond subjective assessments to concrete, comparable metrics.
2. **Identify Strengths and Weaknesses**: Pinpoint specific areas where our system excels or needs improvement.
3. **Guide Development Decisions**: Make informed choices about optimizations based on quantifiable results.
4. **Ensure Consistency**: Maintain a standard benchmark to track progress over time and across different implementations.

### Focus on Key Metrics

In this notebook, we'll explore several critical evaluation metrics for RAG systems, including:

1. **Faithfulness**: Assessing how accurately the generated response reflects the retrieved context.
2. **Relevance**: Evaluating how well the response addresses the given query.
3. **Context Precision**: Measuring the accuracy and relevance of the retrieved context.
4. **Context Recall**: Evaluating how well the retrieved context covers the information needed to answer the query.
5. **Context Entities Recall**: Assessing how effectively the system retrieves specific entities relevant to the query.

These metrics, while inspired by the RAGAS framework, represent a focused approach to RAG evaluation that can be implemented without the full RAGAS toolkit.

### LLM-as-Judge Approach: Considerations and Mitigation Strategies

In this evaluation framework, we employ an LLM-as-judge approach for assessing our RAG system's performance. While this method allows for efficient processing of large volumes of data, it's important to acknowledge its limitations:

1. **Potential Bias**: LLMs may have inherent biases that could affect their judgment.
2. **Consistency Concerns**: LLMs might not always provide consistent evaluations across similar inputs.
3. **Lack of Human Nuance**: LLMs may miss subtle context or nuances that a human evaluator would catch.

To mitigate these limitations and enhance the accuracy of our evaluation, we implement the following strategies:

1. **Detailed Evaluation Criteria**: We provide the LLM judge with comprehensive and specific criteria for each metric. This ensures a more structured and consistent evaluation process.

2. **Example-based Instruction**: We include carefully crafted examples of good and poor performances for each metric. This helps calibrate the LLM's judgment and provides clear benchmarks.

3. **Multiple Evaluations**: For critical or ambiguous cases, we may run multiple evaluations and aggregate the results to reduce inconsistency.

4. **Human Oversight**: While the bulk of the evaluation is automated, we suggest periodic human reviews to validate the LLM's judgments and refine our evaluation process.

5. **Transparency**: We clearly communicate that an LLM is used for evaluation, acknowledging both the benefits (scale, speed) and potential limitations of this approach.

By implementing these strategies, we aim to leverage the efficiency of LLM-based evaluation while maintaining a high standard of accuracy and fairness in our assessment of the RAG system's performance.
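As a minimal sketch of the "Multiple Evaluations" strategy, the snippet below aggregates scores from repeated judge runs on the same sample and flags samples where the runs disagree. The function name `aggregate_scores`, the `max_spread` threshold, and the example scores are illustrative assumptions, not part of the notebook.

```python
from statistics import median


def aggregate_scores(scores, max_spread=0.1):
    """Combine scores from repeated judge runs on the same sample.

    max_spread is an assumed threshold: if the runs differ by more than
    this, the sample is flagged for human review.
    """
    if not scores:
        raise ValueError("need at least one score")
    spread = max(scores) - min(scores)
    return {
        "median": median(scores),  # robust to a single outlier run
        "spread": spread,
        "needs_review": spread > max_spread,
    }


# Hypothetical scores from three judge runs on one sample
result = aggregate_scores([0.83, 0.83, 0.67])
print(result["median"], result["needs_review"])
```

The median is used here rather than the mean so that one stray run does not shift the reported score; flagged samples are natural candidates for the human review described above.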

### Streamlit Implementation

To enhance the accessibility and interactivity of our evaluation results, we'll implement a Streamlit application. This will allow users to:

- Visualize metric scores through interactive charts and graphs
- Compare performance across different RAG system configurations
- Drill down into specific examples to understand metric calculations
- Easily share and present evaluation results with stakeholders

### What to Expect

By the end of this notebook and accompanying Streamlit application, you'll understand:

- How to calculate and interpret key evaluation metrics for RAG systems
- Implementation techniques for these metrics in Python
- How to use these metrics to guide your RAG system development
- How to leverage Streamlit to create an interactive dashboard for RAG system evaluation

While we're not implementing the full RAGAS framework, the concepts and techniques covered here provide a solid foundation for systematic RAG evaluation. This approach, combined with the Streamlit visualization, allows for a comprehensive and intuitive assessment of your RAG system's performance across various dimensions, from the relevance of retrieved information to the accuracy of generated responses.

Let's dive in and explore how we can quantitatively assess, visualize, and improve our RAG implementations!

**Note**: All the evaluation prompts are generated by Claude 3.5 Sonnet.

In [1]:
%pip install openai python-dotenv streamlit pandas numpy


In [2]:
# Load the environment variables from the .env file

from dotenv import load_dotenv
import os

# Specify the path to your .env file if it's not in the same directory
dotenv_path = ".env"
load_dotenv(dotenv_path=dotenv_path)

True

In [3]:
from openai import OpenAI

client = OpenAI()

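def generate_response(system_prompt, full_query):
    """Send a system prompt and a user message to the chat model and return the reply text."""
    response = client.chat.completions.create(
        messages=[
            {
                "role": "system",
                "content": system_prompt,
            },
            {"role": "user", "content": full_query},
        ],
        model="gpt-4-turbo",
        temperature=0,  # reduce run-to-run variation in the judge's output
    )

    return response.choices[0].message.content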

In [4]:
# All the information needed for the LLM "Judge"

query = "How does climate change affect global sea levels, and what are some potential consequences?"

answer = "Climate change is causing global sea levels to rise at an accelerating rate. This is primarily due to two factors: thermal expansion of the oceans as they warm and the melting of land-based ice, particularly glaciers and ice sheets in Greenland and Antarctica. The rate of sea-level rise has more than doubled from 1.4 mm per year throughout most of the 20th century to 3.6 mm per year from 2006-2015. Consequences of rising sea levels include increased coastal flooding, erosion of coastlines, saltwater intrusion into freshwater aquifers, and the potential displacement of millions of people living in low-lying coastal areas. Some island nations, like the Maldives, are at risk of becoming completely submerged. Additionally, rising seas can damage critical infrastructure and ecosystems, such as wetlands and mangrove forests, which provide natural protection against storms and serve as important habitats for many species."

context = ["Global mean sea level has risen about 8-9 inches (21-24 centimeters) since 1880, with about a third of that coming in just the last two and a half decades.",
"The two major causes of global sea-level rise are thermal expansion caused by warming of the ocean and increased melting of land-based ice, such as glaciers and ice sheets.",
"The oceans are absorbing more than 90 percent of the increased atmospheric heat associated with emissions from human activity.",
"Scientists estimate that the global mean sea level could rise by 2-7 feet (0.6-2.1 meters) by 2100.",
"Sea level rise poses threats to coastal communities, infrastructure, and ecosystems.",
"Saltwater intrusion into freshwater aquifers can contaminate drinking water sources and affect agricultural production in coastal areas.",
"The Intergovernmental Panel on Climate Change (IPCC) projects that climate change could displace millions of people due to coastal flooding by 2100.",
"Some low-lying island nations, such as the Maldives and Marshall Islands, are particularly vulnerable to sea-level rise.",
"Wetlands and mangrove forests along coastlines provide natural protection against storms and flooding.",
"The rate of sea-level rise varies regionally due to factors such as local land subsidence, ocean currents, and variations in land height."]

ground_truth = [
    "Climate change is causing global sea levels to rise at an accelerating rate.",
    "The primary factors contributing to sea level rise are thermal expansion of warming oceans and melting of land-based ice, particularly glaciers and ice sheets in Greenland and Antarctica.",
    "The rate of sea-level rise has more than doubled from 1.4 mm per year throughout most of the 20th century to 3.6 mm per year from 2006-2015.",
    "Consequences of rising sea levels include increased coastal flooding, erosion of coastlines, and saltwater intrusion into freshwater aquifers.",
    "Millions of people living in low-lying coastal areas are at risk of displacement due to sea level rise.",
    "Some island nations, like the Maldives, are at risk of becoming completely submerged.",
    "Rising seas can damage critical infrastructure and ecosystems, such as wetlands and mangrove forests.",
    "Wetlands and mangrove forests provide natural protection against storms and serve as important habitats for many species.",
]

### Faithfulness Calculation

Faithfulness is a crucial metric in RAG systems that measures how accurately the generated response aligns with the information provided in the retrieved context. In other words, it evaluates whether the system's answer is truly based on the retrieved information rather than hallucinated or externally sourced content.

Key aspects of faithfulness include:

1. **Factual Accuracy**: The generated response should not contradict or misrepresent facts present in the retrieved context.
2. **Information Source**: All key information in the response should be traceable back to the retrieved context.
3. **Avoidance of Hallucination**: The system should not introduce new information that isn't supported by the context.
4. **Appropriate Uncertainty**: When the context doesn't provide complete information, the response should reflect this uncertainty rather than making unfounded claims.

Evaluating faithfulness typically involves comparing the generated response against the retrieved context, often using techniques like natural language inference or carefully designed prompts for LLM-based judges.

High faithfulness is essential for building trust in RAG systems, as it ensures that the system's outputs are grounded in the provided information rather than being fabricated or misleading.
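The score used in this notebook is the fraction of claims in the answer that can be inferred from the context. A minimal sketch of that arithmetic, assuming the per-claim verdicts have already been obtained (the `faithfulness_score` helper and the verdict values are hypothetical):

```python
def faithfulness_score(verdicts):
    """Return supported claims / total claims.

    verdicts: list of booleans, True when a claim can be inferred
    from the retrieved context.
    """
    if not verdicts:
        raise ValueError("answer contains no claims")
    return sum(verdicts) / len(verdicts)


# Hypothetical verdicts for a six-claim answer with two unsupported claims
verdicts = [True, True, False, True, True, False]
score = faithfulness_score(verdicts)
print(round(score, 2))  # 0.67
```

Computing the ratio in Python, rather than asking the judge to do the division, keeps the arithmetic deterministic even when the claim-level verdicts come from an LLM.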


In [5]:
# Faithfulness Calculation Prompt
faithfulness_prompt = f"""
You are an impartial judge tasked with evaluating the faithfulness of an answer to a given question based on the provided context. Your goal is to determine how well the answer is supported by the context and calculate a faithfulness score using the following formula:

Faithfulness score = (Number of claims in the generated answer that can be inferred from given context) / (Total number of claims in the generated answer)

Given:
- Question: {query}
- Answer: {answer}
- Context: {context}

Please follow these steps:

1. Identify claims in the answer:
   a. Break down the answer into individual claims or statements.
   b. List each claim separately.
   c. Count the total number of claims.

2. Analyze each claim:
   a. For each claim, determine if it can be inferred from the given context.
   b. Provide a brief explanation for your decision on each claim.
   c. Count the number of claims that can be inferred from the context.

3. Calculate the faithfulness score:
   a. Use the formula: (Number of claims inferred from context) / (Total number of claims)
   b. Express the result as a decimal between 0 and 1.

4. Provide a summary:
   a. State the faithfulness score.
   b. Summarize your evaluation, including:
      - Total number of claims in the answer
      - Number of claims that can be inferred from the context
      - Any notable observations about the answer's faithfulness to the context
   c. If applicable, highlight any claims in the answer that are not supported by the context.

Please present your evaluation in a clear, structured format, showing your work for each step of the process.

Example Output Format:

1. Claims Identification:
   - Claim 1: [Text of claim]
   - Claim 2: [Text of claim]
   ...
   Total claims: [Number]

2. Claims Analysis:
   - Claim 1: [Can/Cannot be inferred] - [Brief explanation]
   - Claim 2: [Can/Cannot be inferred] - [Brief explanation]
   ...
   Claims inferred from context: [Number]

3. Faithfulness Score Calculation:
   [Number of inferred claims] / [Total claims] = [Score]

4. Summary:
   Faithfulness Score: 
   <score>
   [Score]
   </score>
   [Summary text]
   [Unsupported claims, if any]
"""

In [6]:
faithfulness_system_prompt = (
    "You are an impartial judge evaluating the faithfulness of an answer to a question based on given context. Analyze the answer's claims, determine if each claim can be inferred from the context, and calculate a faithfulness score. Be objective and thorough in your assessment."
)

faithfulness_response = generate_response(
    faithfulness_system_prompt, faithfulness_prompt
)

print(faithfulness_response)

1. Claims Identification:
   - Claim 1: Climate change is causing global sea levels to rise at an accelerating rate.
   - Claim 2: This is primarily due to two factors: thermal expansion of the oceans as they warm and the melting of land-based ice, particularly glaciers and ice sheets in Greenland and Antarctica.
   - Claim 3: The rate of sea-level rise has more than doubled from 1.4 mm per year throughout most of the 20th century to 3.6 mm per year from 2006-2015.
   - Claim 4: Consequences of rising sea levels include increased coastal flooding, erosion of coastlines, saltwater intrusion into freshwater aquifers, and the potential displacement of millions of people living in low-lying coastal areas.
   - Claim 5: Some island nations, like the Maldives, are at risk of becoming completely submerged.
   - Claim 6: Rising seas can damage critical infrastructure and ecosystems, such as wetlands and mangrove forests, which provide natural protection against storms and serve as important ha
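
The faithfulness prompt asks the judge to wrap the final score in `<score>` tags, which makes it possible to pull the number out of the free-text response. The following is a sketch, assuming the judge follows the requested format; `extract_score` and `sample_output` are illustrative names and data, not part of the notebook.

```python
import re

# Tolerates whitespace/newlines inside the tags and optional [brackets],
# since the prompt's template shows the score as "[Score]".
SCORE_PATTERN = re.compile(r"<score>\s*\[?\s*(\d*\.?\d+)\s*\]?\s*</score>")


def extract_score(judge_output):
    """Return the score as a float, or None if no <score> tag is found."""
    match = SCORE_PATTERN.search(judge_output)
    if match is None:
        return None
    return float(match.group(1))


# Hypothetical tail of a judge response
sample_output = """4. Summary:
   Faithfulness Score:
   <score>
   0.83
   </score>
"""
print(extract_score(sample_output))  # 0.83
```

Returning `None` instead of raising lets a batch evaluation log malformed responses and retry them rather than abort.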

### Relevance Calculation

Relevance measures how well the generated response addresses the user's query. Calculating relevance involves:

1. **Query-Response Alignment**: Assessing how directly the response answers the question posed.
2. **Information Pertinence**: Evaluating whether the response contains information that is actually useful for the query.
3. **Semantic Similarity**: Measuring the topical closeness between the query and the response.
4. **Conciseness vs. Comprehensiveness**: Balancing providing enough information without irrelevant details.

Relevance calculation often employs techniques such as semantic similarity metrics, supervised learning models trained on human-labeled data, or carefully prompt-engineered language models to assess the quality of the response in relation to the query.
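
One way to turn semantic similarity into a relevance score is to reverse-generate questions from the answer, embed them, and average their cosine similarity to the original query. The sketch below shows only the averaging step; the three-dimensional vectors are made-up stand-ins for real embeddings, and `cosine_similarity` and `answer_relevancy` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def answer_relevancy(query_vec, generated_vecs):
    """Mean cosine similarity between the query and each generated question."""
    sims = [cosine_similarity(query_vec, v) for v in generated_vecs]
    return sum(sims) / len(sims)


# Toy vectors standing in for embeddings of the original query and of
# three questions generated from the answer (values are made up)
query_vec = [1.0, 0.0, 0.0]
generated_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
relevancy = answer_relevancy(query_vec, generated_vecs)
print(round(relevancy, 2))  # 0.57
```

In this notebook the similarity scores are estimated by the LLM judge rather than computed from embeddings, so the numbers it reports are approximations of this calculation.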

In [7]:
# Answer Relevancy Calculation Prompt

relevance_prompt = f"""
You are an AI assistant tasked with evaluating the relevancy of an answer to a given question. Your goal is to calculate an Answer Relevancy score using the following process:

1. Given:
   - Original Question: {query}
   - Generated Answer: {answer}
   - Context (if provided): {context}

2. Task Overview:
   Your task is to generate artificial questions based on the answer, compare these to the original question, and calculate a relevancy score.

3. Steps:

   a. Generate Questions:
      - Create 3 artificial questions that the given answer could be responding to.
      - These questions should be diverse and cover different aspects of the answer.
      - List these questions clearly.

   b. Conceptual Embedding and Similarity:
      - For each generated question and the original question, imagine you are creating a semantic embedding. This embedding would represent the meaning of the question in a high-dimensional space.
      - Conceptually compare each generated question to the original question. Consider how similar they are in meaning and intent.
      - Assign a similarity score between -1 and 1 for each comparison, where:
        * 1 indicates perfect similarity
        * 0 indicates no relation
        * -1 indicates opposite meanings
      - Explain your reasoning for each similarity score.

   c. Calculate Answer Relevancy:
      - Use the formula: Answer Relevancy = (Sum of similarity scores) / (Number of generated questions)
      - Show your calculation.

4. Final Output:
   - List the original question and generated questions
   - Show similarity scores and explanations
   - Present the final Answer Relevancy score
   - Provide a brief interpretation of the score

Remember, while you're conceptualizing embeddings and similarity, you're not actually generating numerical embeddings. Use your understanding of language and context to estimate similarity.

Example Output Format:

Original Question: [Original question text]

Generated Questions:
1. [Question 1]
2. [Question 2]
3. [Question 3]

Similarity Scores:
1. Score: [X.XX] - Explanation: [Your reasoning]
2. Score: [X.XX] - Explanation: [Your reasoning]
3. Score: [X.XX] - Explanation: [Your reasoning]

Calculation:
Answer Relevancy = (Score1 + Score2 + Score3) / 3 = <final_score>[Final Score]</final_score>

Interpretation:
[Brief interpretation of the score and what it means for the answer's relevancy]
"""

In [8]:
relevance_system_prompt = "You are an AI assistant specialized in evaluating the relevancy of answers to questions. Your task is to generate artificial questions based on a given answer, compare them to the original question, and calculate an Answer Relevancy score. Use your language understanding to conceptualize semantic similarities without performing actual mathematical calculations."

relevance_response = generate_response(relevance_system_prompt, relevance_prompt)

print(relevance_response)

Original Question: How does climate change affect global sea levels, and what are some potential consequences?

Generated Questions:
1. What are the primary causes of rising sea levels and how have they changed over time?
2. What impact does sea level rise have on coastal ecosystems and human populations?
3. How have global sea levels changed since the 20th century, and what predictions exist for future levels?

Similarity Scores:
1. Score: 0.8 - Explanation: This question directly addresses the causes of rising sea levels, which is a central aspect of the original question. It slightly deviates by focusing more on historical changes rather than future predictions or broader consequences.
2. Score: 0.9 - Explanation: This question closely aligns with the original question by exploring the consequences of rising sea levels, specifically focusing on ecosystems and human populations, which are mentioned as part of the consequences in the original question.
3. Score: 0.7 - Explanation: Thi

### Context Precision

Context precision evaluates the accuracy and relevance of the retrieved context used to generate the response. Key aspects include:

1. **Information Density**: Assessing how much of the retrieved context is actually relevant to the query.
2. **Noise Reduction**: Measuring the system's ability to filter out irrelevant or redundant information.
3. **Conciseness**: Evaluating whether the context contains only necessary information without extraneous details.
4. **Query Alignment**: Determining how well the retrieved context aligns with the specific requirements of the query.

Calculating context precision often involves comparing the retrieved context against ideal or human-curated responses, using techniques such as content overlap analysis, semantic similarity measures, or machine learning models trained on annotated datasets.
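
Context Precision@K, as the evaluation prompt in this notebook defines it, sums Precision@k over the ranks that hold a relevant chunk and divides by the number of relevant chunks. A minimal sketch of that formula, assuming the 0/1 relevance indicators are already known (`context_precision_at_k` and the example indicators are illustrative):

```python
def context_precision_at_k(relevance):
    """Compute Context Precision@K from rank-ordered 0/1 indicators v_k.

    Score = sum(Precision@k * v_k for k in 1..K) / (number of relevant chunks)
    """
    total_relevant = sum(relevance)
    if total_relevant == 0:
        return 0.0  # no relevant chunk was retrieved
    hits = 0
    weighted_sum = 0.0
    for k, v_k in enumerate(relevance, start=1):
        hits += v_k                      # relevant chunks seen up to rank k
        weighted_sum += (hits / k) * v_k  # Precision@k counts only at relevant ranks
    return weighted_sum / total_relevant


# Hypothetical indicators: chunks 1, 3 and 4 relevant, chunk 2 not
relevance = [1, 0, 1, 1]
precision = context_precision_at_k(relevance)
print(round(precision, 3))  # (1/1 + 2/3 + 3/4) / 3 = 0.806
```

Because the metric is rank-sensitive, the same set of chunks scores higher when the relevant ones appear first: `[1, 1, 1, 0]` gives 1.0, while `[0, 1, 1, 1]` gives a lower value.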

In [9]:
# Context Precision Calculation Prompt

context_precision_prompt = f"""
You are an AI assistant tasked with evaluating the precision of context chunks provided for a given question. Your goal is to calculate a Context Precision score using the following process:

1. Given:
   - Question: {query}
   - Ground Truth: {answer}
   - Contexts: {context}

2. Task Overview:
   Your task is to evaluate the relevance of each context chunk, calculate precision at each rank, and then compute the overall Context Precision score.

3. Steps:

   a. Evaluate Relevance:
      - For each context chunk, determine if it's relevant to answering the question based on the ground truth.
      - Assign a relevance indicator (v_k) of 1 if relevant, 0 if not relevant.
      - List your decisions and briefly explain your reasoning for each.

   b. Calculate Precision at each rank (Precision@k):
      - For each rank k from 1 to K:
        * Count the number of relevant items up to and including rank k (true positives).
        * Calculate Precision@k = (true positives at k) / k
      - Show your calculations for each k.

   c. Calculate Context Precision@K:
      - Sum the products of (Precision@k * v_k) for all k from 1 to K.
      - Divide this sum by the total number of relevant items in the top K results.
      - Show your calculation.

4. Final Output:
   - List the relevance decisions for each context chunk
   - Show Precision@k calculations for each k
   - Present the final Context Precision@K score
   - Provide a brief interpretation of the score

Remember to be objective in your relevance assessments and precise in your calculations.

Example Output Format:

Question: [Question text]
Ground Truth: [Ground truth text]

Relevance Evaluations:
1. Context Chunk 1: [Relevant/Not Relevant] - v_1 = [0/1] - Explanation: [Brief reasoning]
2. Context Chunk 2: [Relevant/Not Relevant] - v_2 = [0/1] - Explanation: [Brief reasoning]
...
K. Context Chunk K: [Relevant/Not Relevant] - v_K = [0/1] - Explanation: [Brief reasoning]

Precision@k Calculations:
Precision@1 = [calculation] = [result]
Precision@2 = [calculation] = [result]
...
Precision@K = [calculation] = [result]

Context Precision@K Calculation:
Sum of (Precision@k * v_k) = [calculation]
Total number of relevant items = [number]
Context Precision@K = [final calculation] = [Final Score]

Interpretation:
[Brief interpretation of the score and what it means for the context precision]
"""

In [10]:
context_precision_system_prompt = "You are an AI assistant specialized in evaluating the precision of context chunks for given questions. Your task is to assess the relevance of each context chunk, calculate precision at various ranks, and compute an overall Context Precision score. Use your understanding of the question, ground truth, and contexts to make objective assessments and perform accurate calculations."

context_precision_response = generate_response(
    context_precision_system_prompt, context_precision_prompt
)

print(context_precision_response)

Question: How does climate change affect global sea levels, and what are some potential consequences?
Ground Truth: Climate change is causing global sea levels to rise at an accelerating rate. This is primarily due to two factors: thermal expansion of the oceans as they warm and the melting of land-based ice, particularly glaciers and ice sheets in Greenland and Antarctica. The rate of sea-level rise has more than doubled from 1.4 mm per year throughout most of the 20th century to 3.6 mm per year from 2006-2015. Consequences of rising sea levels include increased coastal flooding, erosion of coastlines, saltwater intrusion into freshwater aquifers, and the potential displacement of millions of people living in low-lying coastal areas. Some island nations, like the Maldives, are at risk of becoming completely submerged. Additionally, rising seas can damage critical infrastructure and ecosystems, such as wetlands and mangrove forests, which provide natural protection against storms and s

### Context Relevancy Calculation

Context Relevancy measures how well the retrieved context aligns with the user's query. This metric is crucial for ensuring that the RAG system is working with pertinent information. Key aspects include:

1. **Query-Context Alignment**: Assessing how closely the retrieved context matches the information needs expressed in the query.
2. **Semantic Overlap**: Measuring the topical similarity between the query and the retrieved context.
3. **Information Sufficiency**: Evaluating whether the context contains enough relevant information to adequately answer the query.
4. **Contextual Appropriateness**: Determining if the retrieved context is suitable for the query's domain and complexity level.

Calculating context relevancy often involves techniques such as cosine similarity between query and context embeddings, supervised machine learning models trained on human-annotated data, or carefully designed prompts for LLM-based evaluation.

In [11]:
# Context Relevancy Calculation Prompt

context_relevance_prompt = f"""
You are an AI assistant tasked with evaluating the relevancy of retrieved context for a given question. Your goal is to calculate a Context Relevancy score using the following process:

1. Given:
   - Question: {query}
   - Retrieved Context: {context}

2. Task Overview:
   Your task is to identify relevant sentences in the retrieved context, count them, and calculate the Context Relevancy score.

3. Steps:

   a. Sentence Identification:
      - Break down the retrieved context into individual sentences.
      - Number each sentence for easy reference.

   b. Relevance Evaluation:
      - For each sentence, determine if it's relevant to answering the question.
      - Assign a relevance indicator of 1 if relevant, 0 if not relevant.
      - Briefly explain your reasoning for each decision.

   c. Calculate Context Relevancy:
      - Count the total number of sentences in the retrieved context.
      - Count the number of relevant sentences (|S|).
      - Calculate the Context Relevancy score using the formula:
        Context Relevancy = |S| / (Total number of sentences in retrieved context)
      - Show your calculation.

4. Final Output:
   - List all sentences with their relevance decisions
   - Show the calculation of the Context Relevancy score
   - Present the final Context Relevancy score
   - Provide a brief interpretation of the score

Remember to be objective in your relevance assessments and precise in your calculations.

Example Output Format:

Question: [Question text]

Sentence Evaluation:
1. [Sentence 1]: [Relevant/Not Relevant] - Explanation: [Brief reasoning]
2. [Sentence 2]: [Relevant/Not Relevant] - Explanation: [Brief reasoning]
...
N. [Sentence N]: [Relevant/Not Relevant] - Explanation: [Brief reasoning]

Calculation:
Total number of sentences: [N]
Number of relevant sentences (|S|): [X]
Context Relevancy = X / N = [Final Score]

Interpretation:
[Brief interpretation of the score and what it means for the context relevancy]
"""

In [12]:
context_relevance_system_prompt = "You are an AI assistant specialized in evaluating the relevancy of retrieved context for given questions. Your task is to analyze individual sentences, determine their relevance to the question, and compute an overall Context Relevancy score. Use your understanding of the question and context to make objective assessments and perform accurate calculations."

context_relevance_response = generate_response(
    context_relevance_system_prompt, context_relevance_prompt
)

print(context_relevance_response)

Question: How does climate change affect global sea levels, and what are some potential consequences?

Sentence Evaluation:
1. Global mean sea level has risen about 8-9 inches (21-24 centimeters) since 1880, with about a third of that coming in just the last two and a half decades.: Relevant - Explains the historical rise in sea levels, which is a direct consequence of climate change.
2. The two major causes of global sea-level rise are thermal expansion caused by warming of the ocean and increased melting of land-based ice, such as glaciers and ice sheets.: Relevant - Identifies the causes of sea level rise related to climate change.
3. The oceans are absorbing more than 90 percent of the increased atmospheric heat associated with emissions from human activity.: Relevant - Provides information on how the oceans' absorption of heat (related to climate change) contributes to thermal expansion and sea level rise.
4. Scientists estimate that the global mean sea level could rise by 2-7 fee
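
Because the judge reports one verdict per numbered sentence, the ratio |S| / N can be recomputed from the verdict lines instead of relying on the model's own arithmetic. The sketch below assumes the judge keeps the `N. sentence: Relevant - ...` layout shown above; the pattern, the helper name, and the sample text are illustrative.

```python
import re

# Matches numbered lines ending in ": Relevant -" or ": Not Relevant -"
VERDICT_PATTERN = re.compile(
    r"^\s*\d+\.\s+.*:\s*(Not Relevant|Relevant)\s+-", re.MULTILINE
)


def context_relevancy_from_output(judge_output):
    """Return relevant sentences / total sentences, or None if nothing parses."""
    verdicts = VERDICT_PATTERN.findall(judge_output)
    if not verdicts:
        return None
    relevant = sum(1 for v in verdicts if v == "Relevant")
    return relevant / len(verdicts)


# Hypothetical judge output in the requested format
sample_output = """Sentence Evaluation:
1. Sea level has risen since 1880.: Relevant - Explanation: Describes the rise.
2. Mangroves host many bird species.: Not Relevant - Explanation: Off topic.
3. Ice sheets are melting.: Relevant - Explanation: Names a cause.
"""
relevancy = context_relevancy_from_output(sample_output)
print(round(relevancy, 2))  # 0.67
```

If the judge deviates from the layout, the function returns `None`, which is a signal to inspect the raw response by hand.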

### Context Recall Calculation

Context Recall measures how completely the retrieved context covers the information necessary to answer the query. This metric ensures that the RAG system isn't missing crucial information. Key aspects include:

1. **Information Completeness**: Assessing whether all essential information required to answer the query is present in the retrieved context.
2. **Coverage of Query Aspects**: Evaluating how well the context addresses all aspects or sub-questions within the main query.
3. **Absence of Critical Gaps**: Identifying whether there are any significant information gaps that could lead to incomplete or misleading responses.
4. **Breadth vs. Depth Balance**: Determining if the context provides a good balance between broad coverage and necessary depth on specific points.

Calculating context recall often involves comparing the retrieved context against ideal or comprehensive reference answers, using techniques such as named entity recognition, key information extraction, or semantic similarity analysis between the context and a set of expected information points.
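
In this notebook, context recall is the share of ground-truth statements that can be attributed to the retrieved context. A small sketch of that ratio, which also returns the statements that were missed; the `context_recall` helper, the shortened statements, and their verdicts are hypothetical.

```python
def context_recall(attributions):
    """Return (recall, missing statements).

    attributions: dict mapping each ground-truth statement to True when it
    can be found in or inferred from the retrieved context.
    """
    if not attributions:
        raise ValueError("ground truth contains no statements")
    missing = [stmt for stmt, found in attributions.items() if not found]
    recall = (len(attributions) - len(missing)) / len(attributions)
    return recall, missing


# Hypothetical attribution verdicts for three ground-truth statements
attributions = {
    "Sea levels are rising at an accelerating rate.": True,
    "Some island nations risk being submerged.": True,
    "The rate of rise has more than doubled.": False,
}
recall, missing = context_recall(attributions)
print(round(recall, 2), missing)
```

Listing the missed statements alongside the score shows which facts the retriever failed to surface, which is usually more actionable than the number alone.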


In [13]:
# Context Recall Calculation Prompt

context_recall_prompt = f"""
You are an AI assistant tasked with evaluating the recall of retrieved context compared to a ground truth answer. Your goal is to calculate a Context Recall score using the following process:

1. Given:
   - Ground Truth Answer: {answer}
   - Retrieved Context: {context}

2. Task Overview:
   Your task is to analyze each sentence in the ground truth answer, determine if it can be attributed to the retrieved context, and calculate the Context Recall score.

3. Steps:

   a. Ground Truth Sentence Identification:
      - Break down the ground truth answer into individual sentences.
      - Number each sentence for easy reference.

   b. Attribution Evaluation:
      - For each ground truth sentence, determine if it can be attributed to (found in or inferred from) the retrieved context.
      - Assign an attribution indicator of 1 if attributable, 0 if not attributable.
      - Briefly explain your reasoning for each decision.

   c. Calculate Context Recall:
      - Count the total number of sentences in the ground truth answer.
      - Count the number of ground truth sentences that can be attributed to the context.
      - Calculate the Context Recall score using the formula:
        Context Recall = (Number of GT sentences attributed to context) / (Total number of sentences in GT)
      - Show your calculation.

4. Final Output:
   - List all ground truth sentences with their attribution decisions
   - Show the calculation of the Context Recall score
   - Present the final Context Recall score
   - Provide a brief interpretation of the score

Remember to be objective in your attribution assessments and precise in your calculations.

Example Output Format:

Ground Truth Sentence Evaluation:
1. [GT Sentence 1]: [Attributable/Not Attributable] - Explanation: [Brief reasoning]
2. [GT Sentence 2]: [Attributable/Not Attributable] - Explanation: [Brief reasoning]
...
N. [GT Sentence N]: [Attributable/Not Attributable] - Explanation: [Brief reasoning]

Calculation:
Total number of GT sentences: [N]
Number of GT sentences attributable to context: [X]
Context Recall = X / N = [Final Score]

Interpretation:
[Brief interpretation of the score and what it means for the context recall]
"""

In [14]:
context_recall_system_prompt = "You are an AI assistant specialized in evaluating the recall of retrieved context compared to ground truth answers. Your task is to analyze individual sentences from the ground truth, determine their attribution to the retrieved context, and compute an overall Context Recall score. Use your understanding of the ground truth and context to make objective assessments and perform accurate calculations."

context_recall_response = generate_response(
    context_recall_system_prompt, context_recall_prompt
)

print(context_recall_response)

Ground Truth Sentence Evaluation:
1. Climate change is causing global sea levels to rise at an accelerating rate. [Attributable] - Explanation: The retrieved context mentions that sea level has risen significantly since 1880, with a substantial increase in the last few decades, implying an accelerating rate.
2. This is primarily due to two factors: thermal expansion of the oceans as they warm and the melting of land-based ice, particularly glaciers and ice sheets in Greenland and Antarctica. [Attributable] - Explanation: The retrieved context explicitly states the two major causes of sea-level rise as thermal expansion and increased melting of land-based ice.
3. The rate of sea-level rise has more than doubled from 1.4 mm per year throughout most of the 20th century to 3.6 mm per year from 2006-2015. [Not Attributable] - Explanation: The retrieved context does not provide specific historical rates or comparisons of sea-level rise rates over these exact time frames.
4. Consequences of r

### Context Entities Recall Calculation

Context Entities Recall is a metric that evaluates how well the retrieved context captures the essential entities present in the ground truth. This metric is crucial for assessing the completeness and accuracy of the information retrieved by the RAG system.

The calculation process typically involves:

1. Entity Identification: 
   - Identify all entities in the ground truth.
   - Identify all entities in the retrieved context.
   - Entities may include named individuals, organizations, locations, dates, numerical facts, and other specific, identifiable information.

2. Set Comparison:
   - Determine the overlap between entities found in the ground truth and those in the retrieved context.

3. Score Calculation:
   - Calculate the Context Entities Recall score using the formula:
     Context Entities Recall = |GE ∩ CE| / |GE|
     Where:
     - GE is the set of entities in the ground truth
     - CE is the set of entities in the retrieved context
     - |GE ∩ CE| is the number of entities common to both sets
     - |GE| is the total number of entities in the ground truth

4. Interpretation:
   - Provide a meaningful interpretation of the calculated score, indicating how well the retrieved context captures the essential entities from the ground truth.

This metric helps in quantifying the RAG system's ability to retrieve contextual information that contains the key entities necessary for answering queries accurately and comprehensively.
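The set formula above can be sketched in a few lines of Python once the two entity lists have been extracted. This is only an illustration: the helper names `normalize` and `context_entities_recall` are hypothetical, the entity lists are a small hand-picked sample, and matching is done by case-insensitive exact comparison, which is a simplifying assumption (an LLM judge may also match paraphrased entities).

```python
from typing import Iterable, Set


def normalize(entities: Iterable[str]) -> Set[str]:
    """Lower-case and trim entities so trivial formatting differences still match."""
    return {e.strip().lower() for e in entities if e.strip()}


def context_entities_recall(
    ground_truth_entities: Iterable[str], context_entities: Iterable[str]
) -> float:
    """Compute |GE ∩ CE| / |GE| on normalized entity sets."""
    ge = normalize(ground_truth_entities)
    ce = normalize(context_entities)
    if not ge:
        raise ValueError("the ground truth must contain at least one entity")
    return len(ge & ce) / len(ge)


# Illustrative sample only, not the full entity lists from the notebook output.
ge = ["Thermal expansion", "Glaciers", "Greenland", "Maldives"]
ce = ["thermal expansion", "Glaciers", "Ice sheets", "1880"]
print(f"Context Entities Recall = {context_entities_recall(ge, ce):.2f}")  # 0.50
```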

In [15]:
# Context Entities Recall Calculation Prompt

context_entities_recall_prompt = f"""
You are an AI assistant tasked with evaluating the recall of entities in retrieved context compared to ground truth. Your goal is to calculate a Context Entities Recall score using the following process:

1. Given:
   - Ground Truth: {ground_truth}
   - Retrieved Context: {context}

2. Task Overview:
   Your task is to identify entities in both the ground truth and the retrieved context, compare these sets, and calculate the Context Entities Recall score.

3. Steps:

   a. Entity Identification:
      - Identify all entities in the ground truth. List them as set GE.
      - Identify all entities in the retrieved context. List them as set CE.
      - Entities may include named individuals, organizations, locations, dates, numerical facts, and other specific, identifiable information.

   b. Set Comparison:
      - Identify the entities that appear in both GE and CE (the intersection).
      - List these common entities.

   c. Calculate Context Entities Recall:
      - Count the number of entities in GE.
      - Count the number of entities in the intersection of GE and CE.
      - Calculate the Context Entities Recall score using the formula:
        Context Entities Recall = |GE ∩ CE| / |GE|
        (Where |GE ∩ CE| is the number of entities in the intersection, and |GE| is the total number of entities in the ground truth)
      - Show your calculation.

4. Final Output:
   - List all entities found in the ground truth (GE)
   - List all entities found in the retrieved context (CE)
   - List the entities common to both (GE ∩ CE)
   - Show the calculation of the Context Entities Recall score
   - Present the final Context Entities Recall score
   - Provide a brief interpretation of the score

Remember to be thorough in your entity identification and precise in your calculations.

Example Output Format:

Entities in Ground Truth (GE):
[List of entities]

Entities in Retrieved Context (CE):
[List of entities]

Common Entities (GE ∩ CE):
[List of common entities]

Calculation:
|GE| (Total entities in ground truth): [Number]
|GE ∩ CE| (Common entities): [Number]
Context Entities Recall = |GE ∩ CE| / |GE| = [Final Score]

Interpretation:
[Brief interpretation of the score and what it means for the context entities recall]
"""

In [16]:
context_entities_recall_system_prompt = "You are an AI assistant specialized in evaluating the recall of entities in retrieved context compared to ground truth. Your task is to identify entities in both the ground truth and context, compare these sets, and compute a Context Entities Recall score. Use your understanding of entity recognition to make thorough identifications and perform accurate calculations."

context_entities_recall_response = generate_response(
    context_entities_recall_system_prompt, context_entities_recall_prompt
)

print(context_entities_recall_response)

### Entities in Ground Truth (GE):
- Climate change
- Global sea levels
- Thermal expansion
- Warming oceans
- Melting of land-based ice
- Glaciers
- Ice sheets
- Greenland
- Antarctica
- 20th century
- 2006-2015
- Coastal flooding
- Erosion of coastlines
- Saltwater intrusion
- Freshwater aquifers
- Millions of people
- Low-lying coastal areas
- Displacement
- Island nations
- Maldives
- Critical infrastructure
- Ecosystems
- Wetlands
- Mangrove forests
- Storms
- Species

### Entities in Retrieved Context (CE):
- Global mean sea level
- 1880
- Last two and a half decades
- Thermal expansion
- Warming of the ocean
- Increased melting
- Land-based ice
- Glaciers
- Ice sheets
- Oceans
- Atmospheric heat
- Emissions
- Human activity
- 2100
- Coastal communities
- Infrastructure
- Ecosystems
- Saltwater intrusion
- Freshwater aquifers
- Drinking water sources
- Agricultural production
- Coastal areas
- Intergovernmental Panel on Climate Change (IPCC)
- Climate change
- Coastal flooding
- 

### Answer Semantic Similarity

Answer Semantic Similarity is a metric that evaluates how closely the meaning of the generated answer aligns with the ground truth or expected answer. This metric goes beyond exact word matching to assess the overall semantic correspondence between the two answers.

Key aspects of Answer Semantic Similarity calculation include:

1. Embedding Generation:
   - Convert both the generated answer and the ground truth answer into dense vector representations (embeddings) using pre-trained language models.

2. Similarity Computation:
   - Calculate the similarity between the two embeddings, often using cosine similarity or other vector similarity measures.

3. Score Interpretation:
   - Cosine similarity ranges from -1 to 1 in general, but for most text-embedding models the score falls between 0 and 1, where values near 1 indicate near-identical meaning and values near 0 indicate little semantic overlap.

4. Thresholding:
   - Determine acceptable levels of similarity based on the specific requirements of your application.

This metric is particularly useful because it can capture the correctness of an answer even when it's phrased differently from the ground truth. It allows for a more nuanced evaluation of the RAG system's ability to generate contextually and semantically appropriate responses.
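The embedding-based variant described above reduces to a cosine similarity between two vectors. The sketch below assumes the embeddings have already been produced by some embedding model. The two-dimensional vectors are toy values chosen for easy arithmetic, and `cosine_similarity` is a hypothetical helper rather than a function from this notebook (the cell that follows asks an LLM to judge similarity instead).

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero vectors")
    return dot / (norm_a * norm_b)


# Toy stand-ins for real embeddings (assumed values, for illustration only).
ground_truth_vec = [3.0, 4.0]
close_answer_vec = [4.0, 3.0]
unrelated_vec = [4.0, -3.0]

print(f"{cosine_similarity(ground_truth_vec, close_answer_vec):.2f}")  # 0.96
print(f"{cosine_similarity(ground_truth_vec, unrelated_vec):.2f}")  # 0.00
```

An embedding-based score is deterministic and cheap to compute, whereas an LLM-judged score can also explain its reasoning, so the two can be used side by side.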

In [17]:
# Answer Semantic Similarity Calculation Prompt

answer_semantic_similarity_prompt = f"""
You are an AI assistant tasked with evaluating the semantic similarity between a generated answer and a ground truth answer. Your goal is to calculate an Answer Semantic Similarity score using the following process:

1. Given:
   - Ground Truth Answer: {ground_truth}
   - Generated Answer: {answer}

2. Task Overview:
   Your task is to compare the semantic meaning of the generated answer to the ground truth answer and assign a similarity score between 0 and 1, where 1 indicates perfect semantic similarity and 0 indicates no semantic similarity.

3. Steps:

   a. Content Analysis:
      - Identify the main concepts, facts, and arguments present in both the ground truth and generated answer.
      - List these key elements for each answer.

   b. Structural Comparison:
      - Compare the organization and flow of ideas between the two answers.
      - Note any significant differences or similarities in structure.

   c. Semantic Evaluation:
      - Assess how well the generated answer captures the meaning and intent of the ground truth answer.
      - Consider factors such as:
        * Accuracy of information
        * Completeness of the response
        * Relevance of the content
        * Consistency in terminology and concepts
        * Depth of explanation

   d. Assign Similarity Score:
      - Based on your analysis, assign a similarity score between 0 and 1.
      - Provide a detailed justification for your score, referencing specific aspects of your analysis.

4. Final Output:
   - List the key elements identified in both answers
   - Summarize your structural and semantic comparison
   - Present the final Answer Semantic Similarity score
   - Provide a detailed explanation of how you arrived at this score

Remember to focus on the semantic meaning rather than exact wording, and be as objective as possible in your assessment.

Example Output Format:

Ground Truth Key Elements:
[List of key elements]

Generated Answer Key Elements:
[List of key elements]

Structural Comparison:
[Summary of structural similarities and differences]

Semantic Evaluation:
[Detailed analysis of semantic similarity, addressing the factors mentioned]

Answer Semantic Similarity Score: [Score between 0 and 1]

Justification:
[Detailed explanation of the score, referencing specific aspects of the analysis]
"""

In [18]:
answer_semantic_similarity_system_prompt = "You are an AI assistant specialized in evaluating the semantic similarity between generated answers and ground truth answers. Your task is to analyze the content, structure, and meaning of both answers, and compute an Answer Semantic Similarity score. Use your understanding of language and semantics to make thorough comparisons and provide a justified similarity score."

answer_semantic_similarity_response = generate_response(
    answer_semantic_similarity_system_prompt, answer_semantic_similarity_prompt
)

print(answer_semantic_similarity_response)

### Ground Truth Key Elements:
1. Climate change is causing global sea levels to rise at an accelerating rate.
2. Primary factors: thermal expansion of warming oceans and melting of land-based ice, particularly glaciers and ice sheets in Greenland and Antarctica.
3. Rate of sea-level rise increased from 1.4 mm per year in the 20th century to 3.6 mm per year from 2006-2015.
4. Consequences include increased coastal flooding, erosion of coastlines, and saltwater intrusion into freshwater aquifers.
5. Millions of people in low-lying coastal areas are at risk of displacement.
6. Some island nations, like the Maldives, are at risk of becoming completely submerged.
7. Rising seas can damage critical infrastructure and ecosystems, such as wetlands and mangrove forests.
8. Wetlands and mangrove forests provide natural protection against storms and serve as important habitats for many species.

### Generated Answer Key Elements:
1. Climate change is causing global sea levels to rise at an accel

### Answer Correctness

Answer Correctness is a fundamental metric in evaluating the performance of Retrieval-Augmented Generation (RAG) systems. It assesses the accuracy and factual consistency of the generated answers with respect to the provided context and the ground truth.

Key aspects of Answer Correctness evaluation include:

1. Factual Accuracy:
   - Verifying that the information presented in the generated answer aligns with the facts in the source documents and ground truth.

2. Completeness:
   - Ensuring that the generated answer covers all relevant aspects of the question without omitting crucial information.

3. Relevance:
   - Checking that the answer directly addresses the question asked and doesn't include extraneous or off-topic information.

4. Consistency:
   - Confirming that the answer maintains internal consistency and doesn't contradict itself or the known facts.

This metric is crucial because it directly impacts the reliability and trustworthiness of the RAG system. A high level of answer correctness indicates that the system is effectively retrieving relevant information and generating accurate responses based on that information.
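The prompt in the next cell combines a semantic score and a factual score using fixed weights and an optional threshold. A minimal sketch of that combination in plain Python follows. The function name `answer_correctness` and the sample scores are assumptions for illustration, and the two input scores are presumed to come from a judge or another metric.

```python
from typing import Optional


def answer_correctness(
    semantic_score: float,
    factual_score: float,
    semantic_weight: float = 0.5,
    factual_weight: float = 0.5,
    threshold: Optional[float] = None,
) -> float:
    """Weighted sum of two scores in [0, 1], optionally binarized by a threshold."""
    for name, value in (("semantic", semantic_score), ("factual", factual_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be between 0 and 1")
    score = semantic_weight * semantic_score + factual_weight * factual_score
    if threshold is not None:
        # Binary mode: 1 if the weighted score reaches the threshold, else 0.
        return 1.0 if score >= threshold else 0.0
    return score


# Hypothetical judge scores, for illustration only.
print(f"{answer_correctness(0.9, 0.7):.2f}")  # 0.80
print(answer_correctness(0.9, 0.7, threshold=0.85))  # 0.0
```

Keeping the weighting in code makes it easy to rerun an evaluation with different weights without calling the judge again.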

In [19]:
# Answer Correctness Calculation Prompt

answer_correctness_prompt = f"""
You are an AI assistant tasked with evaluating the correctness of a generated answer compared to a ground truth answer. Your goal is to calculate an Answer Correctness score using the following process:

1. Given:
   - Ground Truth Answer: {ground_truth}
   - Generated Answer: {answer}
   - Semantic Weight: 0.5
   - Factual Weight: 0.5
   - Threshold (optional): None

2. Task Overview:
   Your task is to evaluate both the semantic and factual similarity between the generated answer and the ground truth, combine these scores using the given weights, and calculate an overall Answer Correctness score.

3. Steps:

   a. Semantic Similarity Evaluation:
      - Assess how well the generated answer captures the meaning and intent of the ground truth answer.
      - Consider factors such as:
        * Consistency in terminology and concepts
        * Completeness of the response
        * Depth of explanation
      - Assign a semantic similarity score between 0 and 1.
      - Briefly justify your semantic similarity score.

   b. Factual Similarity Evaluation:
      - Identify key facts, figures, and claims in both answers.
      - Compare the accuracy of these elements between the generated answer and ground truth.
      - Consider factors such as:
        * Correctness of specific data points
        * Accuracy of statements and claims
        * Presence of all crucial facts from the ground truth
      - Assign a factual similarity score between 0 and 1.
      - Briefly justify your factual similarity score.

   c. Calculate Weighted Answer Correctness Score:
      - Use the formula: 
        Answer Correctness = (Semantic Weight * Semantic Score) + (Factual Weight * Factual Score)
      - Show your calculation.

   d. Apply Threshold (if provided):
      - If a threshold is given, convert the score to binary:
        * If Answer Correctness >= Threshold, set score to 1
        * If Answer Correctness < Threshold, set score to 0

4. Final Output:
   - Present the Semantic Similarity score with justification
   - Present the Factual Similarity score with justification
   - Show the calculation of the weighted Answer Correctness score
   - If applicable, show the binary threshold conversion
   - Provide the final Answer Correctness score
   - Give a brief interpretation of what the score means for the answer's correctness

Remember to be as objective as possible in your assessment and provide clear justifications for your scores.

Example Output Format:

Semantic Similarity Score: [Score between 0 and 1]
Justification: [Brief explanation]

Factual Similarity Score: [Score between 0 and 1]
Justification: [Brief explanation]

Weighted Answer Correctness Calculation:
([Semantic Weight] * [Semantic Score]) + ([Factual Weight] * [Factual Score]) = [Weighted Score]

[If applicable] Threshold Application:
Original Score: [Weighted Score]
Threshold: [Threshold Value]
Binary Score: [0 or 1]

Final Answer Correctness Score: [Final Score]

Interpretation:
[Brief interpretation of the score and what it means for the answer's correctness]
"""

In [20]:
answer_correctness_system_prompt = "You are an AI assistant specialized in evaluating the correctness of generated answers compared to ground truth answers. Your task is to assess both semantic and factual similarity, combine these assessments into a weighted score, and optionally apply a threshold for binary classification. Use your understanding of language and facts to make thorough comparisons and provide justified scores."

answer_correctness_response = generate_response(
    answer_correctness_system_prompt, answer_correctness_prompt
)

print(answer_correctness_response)

### Semantic Similarity Score: 0.95
**Justification:** The generated answer closely captures the meaning and intent of the ground truth answer. It maintains consistency in terminology and concepts, such as "thermal expansion," "melting of land-based ice," and "sea-level rise." The response is comprehensive, covering all major points from the ground truth, including the consequences of rising sea levels and the risks to island nations and ecosystems. The depth of explanation is also well-aligned with the ground truth, providing a detailed overview of the causes and effects of sea level rise.

### Factual Similarity Score: 0.95
**Justification:** The generated answer accurately reflects the key facts and figures presented in the ground truth. It correctly cites the rate of sea-level rise from the 20th century and from 2006-2015, and it identifies the primary causes of sea level rise. The answer also correctly lists the consequences of rising sea levels, such as coastal flooding, erosion,

### Displaying the Evaluation Results in a Streamlit App

*Note*: Streamlit may not work inside a Kaggle notebook. You can run the code on your local machine or deploy the app with Streamlit's sharing service instead.

In [21]:
import streamlit as st
import pandas as pd
import numpy as np
from typing import Any, Dict, List


def display_results(results: List[Dict[str, Any]]):
    """Display results in a Streamlit app with improved readability for long text."""
    st.title("Evaluation Results")

    if not results:
        st.info("No results to display.")
        return

    df = pd.DataFrame(results)

    # Display each result in an expander
    for index, row in df.iterrows():
        with st.expander(f"Result {index + 1}"):
            for column, value in row.items():
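                st.subheader(column)
                if isinstance(value, str) and len(value) > 100:
                    # A unique key keeps widgets distinct even when two fields
                    # hold identical text; the label is hidden because the
                    # subheader above already names the field.
                    st.text_area(
                        column,
                        value,
                        height=150,
                        key=f"{index}-{column}",
                        label_visibility="collapsed",
                    )
                else:
                    st.write(value)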

    # Export to CSV
    csv = df.to_csv(index=False).encode("utf-8")
    st.download_button(
        label="Download results as CSV",
        data=csv,
        file_name="evaluation_results.csv",
        mime="text/csv",
    )

    # Display statistics
    st.subheader("Statistics")
    numeric_columns = df.select_dtypes(include=[np.number]).columns
    if not numeric_columns.empty:
        stats_df = df[numeric_columns].describe()
        st.dataframe(stats_df)

        # Visualizations
        st.subheader("Visualizations")
        for column in numeric_columns:
            st.write(f"Distribution of {column}")
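            # Streamlit has no st.histogram; bin the values with NumPy
            # and plot the counts per bin as a bar chart instead.
            counts, edges = np.histogram(df[column].dropna(), bins=10)
            st.bar_chart(pd.Series(counts, index=edges[:-1], name="count"))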
    else:
        st.info("No numeric data available for statistics and visualizations.")

In [None]:
results = [
    {
        "Query": query,
        "Answer": answer,
        "Context": context,
        "Faithfulness": faithfulness_response,
        "Answer Correctness": answer_correctness_response,
        "Context Precision": context_precision_response,
        "Context Relevancy": context_relevance_response,
        "Context Recall": context_recall_response,
        "Context Entities Recall": context_entities_recall_response,
        "Answer Semantic Similarity": answer_semantic_similarity_response,
    },
]

# Render the results. Launch the script with `streamlit run`;
# calling this inside a notebook kernel does not start the app.
display_results(results)
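Because every judge response in `results` is free-form text, the DataFrame built by `display_results` contains no numeric columns, so the statistics and visualization branch is skipped. One possible workaround, sketched below, is to pull the final number out of each response before building the results. The helper `extract_final_score` is hypothetical, and the pattern assumes the response follows the example output formats given in the prompts above (a line mentioning "Score", "Recall" or "Precision" that ends in a number starting with 0 or 1). Real model output may deviate, so treat a `None` result as "needs manual review".

```python
import re
from typing import Optional

# Matches lines such as "Context Recall = 3 / 4 = 0.75" or
# "**Final Answer Correctness Score:** 0.95" and captures the trailing number.
_SCORE_LINE = re.compile(
    r"(?:score|recall|precision)[^\n]*?[:=]\s*\**\s*([01](?:\.\d+)?)\**\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_final_score(response: str) -> Optional[float]:
    """Return the last score-like number found in a judge response, if any."""
    matches = _SCORE_LINE.findall(response)
    return float(matches[-1]) if matches else None


# Hand-written sample in the prompt's output format (not real model output).
sample = """Calculation:
Total number of GT sentences: 4
Number of GT sentences attributable to context: 3
Context Recall = 3 / 4 = 0.75
"""
print(extract_final_score(sample))  # 0.75
```

The extracted value could then be stored under a separate key such as `"Context Recall Score"` (an assumed name) alongside the full text, giving the statistics section something to summarize.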

## Conclusion

In this notebook, we've explored a comprehensive approach to evaluating Retrieval-Augmented Generation (RAG) systems, moving beyond intuition to implement systematic, quantifiable metrics. By focusing on key aspects such as Faithfulness, Relevance, Context Precision, Context Recall, Context Entities Recall, Answer Semantic Similarity, and Answer Correctness, we've established a robust framework for assessing RAG performance.

### Key Takeaways

1. **Objective Measurement**: We've implemented metrics that provide concrete, comparable data on our RAG system's performance, enabling informed decision-making in the development process.

2. **LLM-as-Judge Approach**: While leveraging LLMs for evaluation offers efficiency at scale, we've acknowledged and addressed potential limitations through strategies like detailed criteria, example-based instruction, and human oversight.

3. **Streamlit Visualization**: By integrating our evaluation metrics into a Streamlit application, we've created an interactive, user-friendly interface for analyzing and presenting results, making it easier to identify trends and areas for improvement.

4. **Holistic Evaluation**: Our approach considers multiple dimensions of RAG performance, from the accuracy of retrieved context to the relevance of generated responses, providing a comprehensive view of system capabilities.

5. **Continuous Improvement**: With these evaluation tools in place, we're well-equipped to iteratively refine our RAG system, making data-driven decisions to enhance its effectiveness.

### Moving Forward

As you continue to develop and refine your RAG systems, remember that evaluation is an ongoing process. Regular assessment using these metrics will help you:

- Track progress over time
- Compare different system configurations
- Identify specific areas for optimization
- Communicate improvements to stakeholders more effectively

While this evaluation framework provides a solid foundation, don't hesitate to adapt and expand upon it as your specific use cases evolve. The field of RAG is rapidly advancing, and staying flexible in your evaluation approach will be key to long-term success.

By combining systematic evaluation with intuitive visualization, you're now well-positioned to build more accurate, relevant, and effective RAG systems. Keep iterating, keep measuring, and keep pushing the boundaries of what's possible with retrieval-augmented generation!