# 📊 Notebook 12: RAGAS Metrics Deep Dive

**Understanding How RAG Evaluation Metrics Work Internally**

**LangChain 1.0.5+ | RAGAS 0.3.9+ | Mixed Level Class**

---

## 🎯 Learning Objectives

By the end of this notebook, you will:

1. **Understand the internal calculation process** for each of the 6 core RAGAS metrics
2. **See intermediate outputs** like extracted claims, generated questions, and identified entities
3. **Learn to interpret scores with confidence** using threshold guidelines
4. **Debug evaluation issues** by understanding what each metric actually measures

---

## 📚 What Makes This Notebook Different?

While Notebook 10 covers **how to use** RAGAS metrics, this notebook goes deeper into **how they work internally**:

| Notebook 10 | This Notebook (12) |
|-------------|--------------------|
| Run evaluation and get scores | See how scores are calculated step-by-step |
| Use metrics as black boxes | Understand intermediate outputs |
| Focus on results | Focus on the calculation process |

---

## 🔢 The 6 Metrics We'll Explore

| # | Metric | Evaluates | Key Question |
|---|--------|-----------|-------------|
| 1 | **Faithfulness** | Generator | Is the answer grounded in context? |
| 2 | **Answer Relevancy** | Generator | Does the answer address the question? |
| 3 | **Context Precision** | Retriever | Are relevant chunks ranked at the top? |
| 4 | **Context Recall** | Retriever | Did we retrieve all necessary information? |
| 5 | **Context Entity Recall** | Retriever | Did we retrieve all important entities? |
| 6 | **Noise Sensitivity** | System | Does noise cause wrong answers? |

---

## 🔰 Section 1: Setup & Environment

Let's set up our environment with all required imports and configure our LLM/Embedding models.

In [None]:
# Environment Setup
import os
import warnings
warnings.filterwarnings('ignore')

from dotenv import load_dotenv
load_dotenv()

# Verify API key
if os.getenv("OPENAI_API_KEY"):
    print("✅ OPENAI_API_KEY found")
else:
    print("❌ OPENAI_API_KEY not found - please set it in your .env file")

✅ OPENAI_API_KEY found


In [None]:
# Core Imports

# Standard library
import asyncio
import json

# Third-party numerical/data libraries
import numpy as np
import pandas as pd

# LangChain components
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
# Only needed for the optional Gemini configuration in the next cell
from langchain_google_genai import ChatGoogleGenerativeAI, GoogleGenerativeAIEmbeddings

# RAGAS components
from ragas import SingleTurnSample, EvaluationDataset, evaluate
from ragas.metrics import (
    Faithfulness,
    ResponseRelevancy,
    LLMContextPrecisionWithReference,
    LLMContextRecall,
    ContextEntityRecall,
    NoiseSensitivity
)
from ragas.llms import LangchainLLMWrapper
from ragas.embeddings import LangchainEmbeddingsWrapper

print("✅ All imports successful")

✅ All imports successful


In [None]:
# Initialize LLM and Embeddings

# Initialize base models  (OPENAI)
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Initialize base models (GEMINI)
#llm = ChatGoogleGenerativeAI(model="gemini-2.5-flash", temperature=0)
#embeddings = GoogleGenerativeAIEmbeddings(model="models/gemini-embedding-001")

# Wrap for RAGAS compatibility
ragas_llm = LangchainLLMWrapper(llm)
ragas_embeddings = LangchainEmbeddingsWrapper(embeddings)

print("✅ LLM initialized: gpt-4o-mini")
print("✅ Embeddings initialized: text-embedding-3-small")
print("✅ RAGAS wrappers ready")

✅ LLM initialized: gpt-4o-mini
✅ Embeddings initialized: text-embedding-3-small
✅ RAGAS wrappers ready


In [None]:
# Helper function for running async code in Jupyter

def run_async(coro):
    """Run a coroutine to completion, with or without an already-running event loop."""
    try:
        loop = asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running (plain script): let asyncio create and close one
        return asyncio.run(coro)
    # A loop is already running (Jupyter): patch it so it can be re-entered
    import nest_asyncio
    nest_asyncio.apply()
    return loop.run_until_complete(coro)

print("✅ Async helper ready")

✅ Async helper ready


---

# 🔰 Section 2: Faithfulness Deep Dive

## What Faithfulness Measures

**Faithfulness** checks if the generated answer *sticks to the facts* from the retrieved context. It detects **hallucinations** - when the LLM makes things up that aren't in the source material.

### 📖 Analogy

> Imagine you're a journalist writing a news story. Faithfulness checks whether everything you wrote can be traced back to your interview notes. If you add details that weren't in your notes, that's a problem!

### 🔧 How It Works (3 Steps)

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Step 1:        │    │  Step 2:        │    │  Step 3:        │
│  Extract Claims │ -> │  Verify Each    │ -> │  Calculate      │
│  from Response  │    │  Against Context│    │  Score          │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```

### 📐 Formula

$$\text{Faithfulness} = \frac{\text{Number of claims supported by context}}{\text{Total number of claims}}$$

## 2.1 Step 1: Manual Claim Extraction

Let's first see how RAGAS extracts claims from a response. We'll mimic this process manually.

In [None]:
# Define our test case

# The response we want to evaluate
test_response = "The first Super Bowl was held on January 15, 1967 in Los Angeles. It was a sunny day with clear skies."

# The context that was retrieved (source of truth)
test_context = [
    "The First AFL-NFL World Championship Game was played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles, California."
]

print("📝 Response to evaluate:")
print(f"   '{test_response}'")
print("\n📚 Retrieved context:")
print(f"   '{test_context[0]}'")

📝 Response to evaluate:
   'The first Super Bowl was held on January 15, 1967 in Los Angeles. It was a sunny day with clear skies.'

📚 Retrieved context:
   'The First AFL-NFL World Championship Game was played on January 15, 1967, at the Los Angeles Memorial Coliseum in Los Angeles, California.'


In [None]:
# Manual claim extraction using LLM (mimicking RAGAS)

claim_extraction_prompt = ChatPromptTemplate.from_template("""
Given the following response, extract ALL factual claims as a numbered list.
Each claim should be a single, verifiable statement.

Response: {response}

Extract each factual claim:
""")

claim_chain = claim_extraction_prompt | llm | StrOutputParser()

extracted_claims_raw = claim_chain.invoke({"response": test_response})

print("🔍 STEP 1: Extracted Claims from Response")
print("=" * 50)
print(extracted_claims_raw)

🔍 STEP 1: Extracted Claims from Response
1. The first Super Bowl was held on January 15, 1967.
2. The first Super Bowl was held in Los Angeles.
3. It was a sunny day on January 15, 1967.
4. There were clear skies on January 15, 1967.


In [None]:
# Claims to verify
#
# These are written out by hand as simplified versions of the LLM output above,
# not parsed from `extracted_claims_raw`.
claims = [
    "The first Super Bowl was held on January 15, 1967",
    "The first Super Bowl was held in Los Angeles",
    "It was a sunny day",
    "There were clear skies"
]

print("📋 Claims to verify:")
for i, claim in enumerate(claims, 1):
    print(f"   {i}. {claim}")

📋 Claims to verify:
   1. The first Super Bowl was held on January 15, 1967
   2. The first Super Bowl was held in Los Angeles
   3. It was a sunny day
   4. There were clear skies


## 2.2 Step 2: Claim Verification

Now we verify each claim against the retrieved context. This is where hallucinations are detected!

In [None]:
# Manual claim verification (mimicking RAGAS)

verification_prompt = ChatPromptTemplate.from_template("""
Given the following context and claim, determine if the claim is SUPPORTED by the context.

Context: {context}

Claim: {claim}

Answer with:
- "SUPPORTED" if the claim can be verified from the context
- "NOT SUPPORTED" if the claim cannot be verified or contradicts the context

Also provide a brief explanation.

Verdict:
""")

verify_chain = verification_prompt | llm | StrOutputParser()

print("🔍 STEP 2: Verifying Each Claim Against Context")
print("=" * 60)

verification_results = []
for claim in claims:
    result = verify_chain.invoke({
        "context": test_context[0],
        "claim": claim
    })
    is_supported = "SUPPORTED" in result.upper() and "NOT SUPPORTED" not in result.upper()
    verification_results.append({
        "claim": claim,
        "supported": is_supported,
        "explanation": result
    })
    status = "✅" if is_supported else "❌"
    print(f"\n{status} Claim: '{claim}'")
    print(f"   Result: {result[:100]}..." if len(result) > 100 else f"   Result: {result}")

🔍 STEP 2: Verifying Each Claim Against Context

✅ Claim: 'The first Super Bowl was held on January 15, 1967'
   Result: SUPPORTED

Explanation: The context states that the First AFL-NFL World Championship Game was played...

✅ Claim: 'The first Super Bowl was held in Los Angeles'
   Result: SUPPORTED

Explanation: The context states that the First AFL-NFL World Championship Game, which is ...

❌ Claim: 'It was a sunny day'
   Result: NOT SUPPORTED

The context provides information about the date and location of the First AFL-NFL Wor...

❌ Claim: 'There were clear skies'
   Result: NOT SUPPORTED

The context provides information about the date and location of the First AFL-NFL Wor...
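
The substring test in the verification cell counts any reply that contains `SUPPORTED` but not `NOT SUPPORTED` as a pass, so a reply such as `UNSUPPORTED` would be misread as supported. A minimal sketch of a stricter parser follows. `parse_verdict` is a hypothetical helper (it is not part of RAGAS), and it assumes the model states its verdict before any explanation.

```python
import re

# Longest alternatives first so "NOT SUPPORTED" and "UNSUPPORTED" win over "SUPPORTED"
_VERDICT_PATTERN = re.compile(r"\b(NOT\s+SUPPORTED|UNSUPPORTED|SUPPORTED)\b")

def parse_verdict(raw: str) -> bool:
    """Return True only if the first verdict keyword in the reply is SUPPORTED."""
    match = _VERDICT_PATTERN.search(raw.upper())
    return match is not None and match.group(1) == "SUPPORTED"

# Illustrative replies (made up, not real model output)
examples = [
    "SUPPORTED\n\nExplanation: the date is stated in the context.",
    "NOT SUPPORTED\n\nThe context says nothing about the weather.",
    "Unsupported.",
]
for reply in examples:
    print(parse_verdict(reply), "<-", reply.splitlines()[0])
```

Asking the model for a structured reply (for example a JSON object with a boolean field) would remove the need for keyword matching altogether; the regex is only a small step up from substring checks.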


In [None]:
# Display verification results as a table

print("\n📊 Claim Verification Summary")
print("=" * 80)

df_verification = pd.DataFrame([
    {
        "Claim": r["claim"],
        "Supported?": "✅ Yes" if r["supported"] else "❌ No",
        "Reason": "Found in context" if r["supported"] else "HALLUCINATION - Not in context!"
    }
    for r in verification_results
])

print(df_verification.to_string(index=False))


📊 Claim Verification Summary
                                            Claim Supported?                          Reason
The first Super Bowl was held on January 15, 1967      ✅ Yes                Found in context
     The first Super Bowl was held in Los Angeles      ✅ Yes                Found in context
                               It was a sunny day       ❌ No HALLUCINATION - Not in context!
                           There were clear skies       ❌ No HALLUCINATION - Not in context!


## 2.3 Step 3: Calculate Faithfulness Score

In [None]:
# Manual faithfulness calculation

supported_count = sum(1 for r in verification_results if r["supported"])
total_claims = len(verification_results)

manual_faithfulness = supported_count / total_claims

print("🔢 STEP 3: Calculate Faithfulness Score")
print("=" * 50)
print(f"\n   Supported claims: {supported_count}")
print(f"   Total claims: {total_claims}")
print(f"\n   Formula: Faithfulness = {supported_count} / {total_claims}")
print(f"\n   📊 Manual Faithfulness Score: {manual_faithfulness:.2f}")

🔢 STEP 3: Calculate Faithfulness Score

   Supported claims: 2
   Total claims: 4

   Formula: Faithfulness = 2 / 4

   📊 Manual Faithfulness Score: 0.50


## 2.4 Verify with Actual RAGAS Metric

Now let's compare our manual calculation with the actual RAGAS Faithfulness metric!

In [None]:
# Run actual RAGAS Faithfulness metric

# Create sample in RAGAS format
faithfulness_sample = SingleTurnSample(
    user_input="When was the first Super Bowl?",
    response=test_response,
    retrieved_contexts=test_context
)

# Initialize and run the metric
faithfulness_metric = Faithfulness(llm=ragas_llm)

ragas_faithfulness = run_async(faithfulness_metric.single_turn_ascore(faithfulness_sample))

print("🔬 RAGAS Faithfulness Result")
print("=" * 50)
print(f"\n   Manual calculation:  {manual_faithfulness:.2f}")
print(f"   RAGAS metric score:  {ragas_faithfulness:.2f}")
print(f"\n   Difference: {abs(manual_faithfulness - ragas_faithfulness):.2f}")

🔬 RAGAS Faithfulness Result

   Manual calculation:  0.50
   RAGAS metric score:  0.50

   Difference: 0.00


## 2.5 Faithfulness Examples: Good vs Bad

Let's see how different types of responses score on Faithfulness.

In [None]:
# Compare different faithfulness scenarios

faithfulness_examples = [
    {
        "name": "Perfect Faithfulness (No hallucinations)",
        "response": "The first Super Bowl was played on January 15, 1967 at the Los Angeles Memorial Coliseum.",
        "context": ["The First AFL-NFL World Championship Game was played on January 15, 1967, at the Los Angeles Memorial Coliseum."]
    },
    {
        "name": "Partial Faithfulness (Some hallucinations)",
        "response": "The first Super Bowl was on January 15, 1967. The Green Bay Packers won 35-10 with Bart Starr as MVP.",
        "context": ["The First AFL-NFL World Championship Game was played on January 15, 1967."]
    },
    {
        "name": "Zero Faithfulness (Complete hallucination)",
        "response": "The first Super Bowl was held in Miami in 1970 and attracted over 100,000 spectators.",
        "context": ["The First AFL-NFL World Championship Game was played on January 15, 1967, at the Los Angeles Memorial Coliseum."]
    }
]

print("📊 Faithfulness Comparison: Different Scenarios")
print("=" * 70)

for example in faithfulness_examples:
    sample = SingleTurnSample(
        user_input="Tell me about the first Super Bowl",
        response=example["response"],
        retrieved_contexts=example["context"]
    )
    score = run_async(faithfulness_metric.single_turn_ascore(sample))
    
    print(f"\n🏷️  {example['name']}")
    print(f"   Response: '{example['response'][:80]}...'" if len(example['response']) > 80 else f"   Response: '{example['response']}'")
    print(f"   Score: {score:.2f}")

📊 Faithfulness Comparison: Different Scenarios

🏷️  Perfect Faithfulness (No hallucinations)
   Response: 'The first Super Bowl was played on January 15, 1967 at the Los Angeles Memorial ...'
   Score: 1.00

🏷️  Partial Faithfulness (Some hallucinations)
   Response: 'The first Super Bowl was on January 15, 1967. The Green Bay Packers won 35-10 wi...'
   Score: 0.33

🏷️  Zero Faithfulness (Complete hallucination)
   Response: 'The first Super Bowl was held in Miami in 1970 and attracted over 100,000 specta...'
   Score: 0.00


## 2.6 Score Interpretation Guide

| Score Range | Interpretation | Action |
|-------------|----------------|--------|
| **0.9 - 1.0** | Excellent - No or minimal hallucinations | ✅ Good to go |
| **0.7 - 0.9** | Good - Minor unsupported claims | ⚠️ Review edge cases |
| **0.5 - 0.7** | Concerning - Significant hallucinations | 🔧 Improve prompt/temperature |
| **< 0.5** | Poor - Most claims are hallucinated | 🚨 Major fixes needed |
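
The bands in this table can be applied programmatically when scanning many samples. The sketch below uses a hypothetical helper, `interpret_faithfulness`, and assumes that a score sitting exactly on a boundary (0.9, 0.7, 0.5) belongs to the higher band, since the table's ranges overlap at those points.

```python
def interpret_faithfulness(score: float) -> str:
    """Map a faithfulness score in [0, 1] to the band names used in the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= 0.9:
        return "Excellent"
    if score >= 0.7:
        return "Good"
    if score >= 0.5:
        return "Concerning"
    return "Poor"

# Scores taken from the examples earlier in this section
for s in (1.00, 0.50, 0.33, 0.00):
    print(f"{s:.2f} -> {interpret_faithfulness(s)}")
```

The thresholds themselves are guidelines rather than fixed rules, so treat the cut-offs as parameters to tune for your own application.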

---

# 🔰 Section 3: Answer Relevancy Deep Dive

## What Answer Relevancy Measures

**Answer Relevancy** checks if the answer *actually answers* the question asked. It doesn't care if the answer is factually correct - just whether it's relevant to the question.

### 📖 Analogy

> If someone asks "What's the capital of France?" and you answer "The Eiffel Tower is beautiful," your answer might be factually true but completely irrelevant to the question!

### 🔧 The "Reverse Engineering" Approach

RAGAS uses a clever technique: instead of directly comparing the answer to the question, it:

1. **Generates hypothetical questions** from the answer ("What questions would this be a good answer to?")
2. **Compares embeddings** of generated questions with the original question
3. **Calculates similarity** - if the generated questions are similar to the original, the answer is relevant!

```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Step 1:        │    │  Step 2:        │    │  Step 3:        │
│  Generate       │ -> │  Embed All      │ -> │  Calculate      │
│  Questions      │    │  Questions      │    │  Similarity     │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```

### 📐 Formula

$$\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \text{cosine\_similarity}(E_{generated_i}, E_{original})$$
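
To make the formula concrete without calling an embedding API, here is a small offline sketch that computes all N cosine similarities in one vectorized step and then averages them. The function name `answer_relevancy_from_embeddings` and the 2-D vectors are made up for illustration; real embeddings have hundreds or thousands of dimensions.

```python
import numpy as np

def answer_relevancy_from_embeddings(original, generated):
    """Mean cosine similarity between one original vector and N generated vectors."""
    original = np.asarray(original, dtype=float)    # shape (d,)
    generated = np.asarray(generated, dtype=float)  # shape (N, d)
    dots = generated @ original                     # N dot products at once
    norms = np.linalg.norm(generated, axis=1) * np.linalg.norm(original)
    similarities = dots / norms
    return float(similarities.mean()), similarities

# Toy "embeddings" (illustrative only)
original_vec = [1.0, 0.0]
generated_vecs = [
    [1.0, 0.0],  # same direction as the original question
    [1.0, 1.0],  # 45 degrees away from it
    [0.0, 1.0],  # orthogonal to it
]

score, sims = answer_relevancy_from_embeddings(original_vec, generated_vecs)
print("similarities:", np.round(sims, 4))
print("mean:", round(score, 4))
```

One orthogonal (off-topic) generated question is enough to pull the mean from 1.0 down to roughly 0.57, which is why an answer that drifts off the question lowers the score.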

## 3.1 Step 1: Hypothetical Question Generation

Let's see how RAGAS generates questions from an answer.

In [None]:
# Define our test case for relevancy

original_question = "When was the first Super Bowl?"
test_answer = "The first Super Bowl was held on January 15, 1967"

print("📝 Original Question:")
print(f"   '{original_question}'")
print("\n📝 Answer to Evaluate:")
print(f"   '{test_answer}'")

📝 Original Question:
   'When was the first Super Bowl?'

📝 Answer to Evaluate:
   'The first Super Bowl was held on January 15, 1967'


In [None]:
# Manual hypothetical question generation (mimicking RAGAS)

question_gen_prompt = ChatPromptTemplate.from_template("""
Given the following answer, generate exactly 3 different questions that this answer would be a good response to.
The questions should be varied but all answerable by this response.

Answer: {answer}

Generate 3 questions (one per line):
1.
2.
3.
""")

question_gen_chain = question_gen_prompt | llm | StrOutputParser()

generated_questions_raw = question_gen_chain.invoke({"answer": test_answer})

print("🔍 STEP 1: Generated Hypothetical Questions")
print("=" * 50)
print(generated_questions_raw)

🔍 STEP 1: Generated Hypothetical Questions
1. When was the inaugural Super Bowl played?  
2. What date marks the beginning of the Super Bowl history?  
3. Can you tell me when the first Super Bowl took place?  


In [None]:
# Questions used for the embedding comparison
#
# These are hand-written stand-ins, not a parse of `generated_questions_raw`,
# so the manual score only approximates what RAGAS computes internally.
generated_questions = [
    "When was the first Super Bowl held?",
    "What date was the inaugural Super Bowl?",
    "On what day did the first Super Bowl take place?"
]

print("📋 Questions for embedding comparison:")
print(f"   Original: '{original_question}'")
print("   Generated:")
for i, q in enumerate(generated_questions, 1):
    print(f"      {i}. '{q}'")

📋 Questions for embedding comparison:
   Original: 'When was the first Super Bowl?'
   Generated:
      1. 'When was the first Super Bowl held?'
      2. 'What date was the inaugural Super Bowl?'
      3. 'On what day did the first Super Bowl take place?'


## 3.2 Step 2: Embedding and Similarity Calculation

Now we compute embeddings and calculate cosine similarity.

In [None]:
# Define cosine similarity function

def cosine_similarity(vec1, vec2):
    """Calculate cosine similarity between two vectors"""
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

print("✅ Cosine similarity function ready")
print("\n📐 Formula: cos(θ) = (A · B) / (||A|| × ||B||)")

✅ Cosine similarity function ready

📐 Formula: cos(θ) = (A · B) / (||A|| × ||B||)


In [None]:
# Calculate embeddings and similarities

print("🔍 STEP 2: Computing Embeddings and Similarities")
print("=" * 60)

# Get embedding for original question
original_embedding = embeddings.embed_query(original_question)
print(f"\n✅ Original question embedded (dim={len(original_embedding)})")

# Get embeddings for generated questions and calculate similarities
similarities = []
for i, gen_q in enumerate(generated_questions, 1):
    gen_embedding = embeddings.embed_query(gen_q)
    sim = cosine_similarity(original_embedding, gen_embedding)
    similarities.append(sim)
    print(f"\n   Question {i}: '{gen_q}'")
    print(f"   Similarity to original: {sim:.4f}")

🔍 STEP 2: Computing Embeddings and Similarities

✅ Original question embedded (dim=1536)

   Question 1: 'When was the first Super Bowl held?'
   Similarity to original: 0.9353

   Question 2: 'What date was the inaugural Super Bowl?'
   Similarity to original: 0.8008

   Question 3: 'On what day did the first Super Bowl take place?'
   Similarity to original: 0.9063


## 3.3 Step 3: Calculate Final Score

In [None]:
# Calculate answer relevancy score

manual_relevancy = np.mean(similarities)

print("🔢 STEP 3: Calculate Answer Relevancy Score")
print("=" * 50)
print(f"\n   Similarities: {[f'{s:.4f}' for s in similarities]}")
print(f"   Formula: Average of similarities")
print(f"\n   ({' + '.join([f'{s:.4f}' for s in similarities])}) / {len(similarities)}")
print(f"\n   📊 Manual Answer Relevancy: {manual_relevancy:.4f}")

🔢 STEP 3: Calculate Answer Relevancy Score

   Similarities: ['0.9353', '0.8008', '0.9063']
   Formula: Average of similarities

   (0.9353 + 0.8008 + 0.9063) / 3

   📊 Manual Answer Relevancy: 0.8808


## 3.4 Verify with Actual RAGAS Metric

In [None]:
# Run actual RAGAS Answer Relevancy metric
# Note: an exact match with the manual score is not expected here, because the
# manual calculation used hand-written questions while RAGAS generates its own.

relevancy_sample = SingleTurnSample(
    user_input=original_question,
    response=test_answer,
    retrieved_contexts=["The First AFL-NFL World Championship Game was played on January 15, 1967."]
)

relevancy_metric = ResponseRelevancy(llm=ragas_llm, embeddings=ragas_embeddings)

ragas_relevancy = run_async(relevancy_metric.single_turn_ascore(relevancy_sample))

print("🔬 RAGAS Answer Relevancy Result")
print("=" * 50)
print(f"\n   Manual calculation:  {manual_relevancy:.4f}")
print(f"   RAGAS metric score:  {ragas_relevancy:.4f}")

🔬 RAGAS Answer Relevancy Result

   Manual calculation:  0.8808
   RAGAS metric score:  0.9353


## 3.5 Relevancy Contrast: Good vs Bad Examples

In [None]:
# Compare different relevancy scenarios

relevancy_examples = [
    {
        "name": "Highly Relevant (Directly answers WHEN)",
        "question": "When was the first Super Bowl?",
        "answer": "The first Super Bowl was held on January 15, 1967.",
    },
    {
        "name": "Partially Relevant (Answers but adds extra info)",
        "question": "When was the first Super Bowl?",
        "answer": "The Super Bowl is the annual championship game of the NFL, first held on January 15, 1967.",
    },
    {
        "name": "Low Relevancy (Doesn't answer WHEN)",
        "question": "When was the first Super Bowl?",
        "answer": "The Super Bowl is the annual championship game of the National Football League.",
    },
    {
        "name": "Off-topic (Completely irrelevant)",
        "question": "When was the first Super Bowl?",
        "answer": "Pizza is a popular Italian dish that spread worldwide in the 20th century.",
    }
]

print("📊 Answer Relevancy Comparison")
print("=" * 70)

for example in relevancy_examples:
    sample = SingleTurnSample(
        user_input=example["question"],
        response=example["answer"],
        retrieved_contexts=["Context not relevant for this metric."]
    )
    score = run_async(relevancy_metric.single_turn_ascore(sample))
    
    print(f"\n🏷️  {example['name']}")
    print(f"   Q: '{example['question']}'")
    print(f"   A: '{example['answer'][:60]}...'" if len(example['answer']) > 60 else f"   A: '{example['answer']}'")
    print(f"   Score: {score:.4f}")

📊 Answer Relevancy Comparison

🏷️  Highly Relevant (Directly answers WHEN)
   Q: 'When was the first Super Bowl?'
   A: 'The first Super Bowl was held on January 15, 1967.'
   Score: 0.9353

🏷️  Partially Relevant (Answers but adds extra info)
   Q: 'When was the first Super Bowl?'
   A: 'The Super Bowl is the annual championship game of the NFL, f...'
   Score: 0.7254

🏷️  Low Relevancy (Doesn't answer WHEN)
   Q: 'When was the first Super Bowl?'
   A: 'The Super Bowl is the annual championship game of the Nation...'
   Score: 0.6773

🏷️  Off-topic (Completely irrelevant)
   Q: 'When was the first Super Bowl?'
   A: 'Pizza is a popular Italian dish that spread worldwide in the...'
   Score: 0.1667


## 3.6 Score Interpretation Guide

| Score Range | Interpretation | Example |
|-------------|----------------|--------|
| **0.9 - 1.0** | Directly addresses the question | "When?" → "January 15, 1967" |
| **0.7 - 0.9** | Mostly relevant with some tangents | "When?" → "It was 1967, a historic game" |
| **0.4 - 0.7** | Partially relevant, missing key aspects | "When?" → "It's an NFL championship" |
| **< 0.4** | Off-topic or doesn't answer the question | "When?" → "Pizza is delicious" |

---

# 🔰 Section 4: Context Precision Deep Dive

## What Context Precision Measures

**Context Precision** evaluates whether the *most relevant chunks are ranked at the top* of your retrieval results. It's about **ranking quality**, not just whether you retrieved relevant information.

### 📖 Analogy

> Imagine you're a librarian handing someone 5 books to answer their question. Context Precision asks: "Did you put the most useful book on top of the pile?"

### 🔧 How It Works

1. For each retrieved chunk, determine whether it is **relevant** to the question/reference
2. Compute **Precision@k** at each position k, keeping the value only where the chunk at position k is relevant
3. Average the kept values over the relevant chunks, so relevant chunks near the **top** give a higher score

### 📐 Formula

$$\text{Context Precision@K} = \frac{\sum_{k=1}^{K} (\text{Precision@k} \times \text{relevance}_k)}{\text{Total relevant items in top K}}$$
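
The formula can be sketched as a small function over a list of relevance flags, one per retrieved chunk in ranked order. `context_precision` is a hypothetical helper written for this notebook, not the RAGAS implementation, and it assumes relevance has already been judged as a boolean per chunk.

```python
from typing import Sequence

def context_precision(relevance: Sequence[bool]) -> float:
    """Average of Precision@k over the positions k that hold a relevant chunk."""
    hits = 0             # relevant chunks seen so far
    precision_sum = 0.0  # sum of Precision@k at relevant positions
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / k
    # Convention assumed here: no relevant chunks at all gives a score of 0
    return precision_sum / hits if hits else 0.0

# An interleaved ranking: relevant chunks at positions 1 and 3
# Precision@1 = 1/1, Precision@3 = 2/3, so the score is (1 + 2/3) / 2
print(round(context_precision([True, False, True, False]), 4))
```

Because only positions holding a relevant chunk contribute, pushing a relevant chunk further down the list lowers its Precision@k term and therefore the overall score.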

## 4.1 Understanding Ranking Impact

Let's see how the **same chunks in different orders** produce different scores.

In [None]:
# Define our test case for context precision

question = "Where is the Eiffel Tower located?"
reference = "The Eiffel Tower is located in Paris, France."

# Chunks with known relevance
chunks_with_relevance = [
    ("The Eiffel Tower is located in Paris, France.", True),      # Directly relevant
    ("Paris is the capital of France.", True),                     # Somewhat relevant
    ("The tower was built in 1889.", False),                       # Not relevant to WHERE
    ("Pizza originated in Italy.", False),                         # Completely irrelevant
]

print("📝 Question: '{}'\n".format(question))
print("📚 Retrieved Chunks (with relevance):")
for i, (chunk, relevant) in enumerate(chunks_with_relevance, 1):
    status = "✅ Relevant" if relevant else "❌ Not relevant"
    print(f"   {i}. {status}: '{chunk}'")

📝 Question: 'Where is the Eiffel Tower located?'

📚 Retrieved Chunks (with relevance):
   1. ✅ Relevant: 'The Eiffel Tower is located in Paris, France.'
   2. ✅ Relevant: 'Paris is the capital of France.'
   3. ❌ Not relevant: 'The tower was built in 1889.'
   4. ❌ Not relevant: 'Pizza originated in Italy.'


## 4.2 Manual Relevance Classification

In [None]:
# Manual relevance classification using LLM

relevance_prompt = ChatPromptTemplate.from_template("""
Given the question and reference answer, determine if the following context chunk is RELEVANT.

Question: {question}
Reference Answer: {reference}
Context Chunk: {chunk}

Is this chunk relevant for answering the question? Answer only "RELEVANT" or "NOT RELEVANT".
""")

relevance_chain = relevance_prompt | llm | StrOutputParser()

print("🔍 Manual Relevance Classification")
print("=" * 60)

relevance_results = []
for chunk, expected in chunks_with_relevance:
    result = relevance_chain.invoke({
        "question": question,
        "reference": reference,
        "chunk": chunk
    })
    is_relevant = "RELEVANT" in result.upper() and "NOT RELEVANT" not in result.upper()
    relevance_results.append(is_relevant)
    status = "✅" if is_relevant else "❌"
    print(f"{status} '{chunk[:50]}...' → {result.strip()}")

🔍 Manual Relevance Classification
✅ 'The Eiffel Tower is located in Paris, France....' → RELEVANT
✅ 'Paris is the capital of France....' → RELEVANT
❌ 'The tower was built in 1889....' → NOT RELEVANT
❌ 'Pizza originated in Italy....' → NOT RELEVANT


## 4.3 Precision@K Calculation: Good Ranking

In [None]:
# Calculate Precision@K for GOOD ranking (relevant at top)

# Good ranking: [Relevant, Relevant, Not Relevant, Not Relevant]
good_ranking = [True, True, False, False]

print("📊 GOOD RANKING: Relevant chunks at TOP")
print("=" * 60)
print("\nRanking: [✅ Relevant, ✅ Relevant, ❌ Not Rel, ❌ Not Rel]")
print("\nPrecision@K calculation:")

precisions_good = []
relevant_count = 0
for k, is_relevant in enumerate(good_ranking, 1):
    if is_relevant:
        relevant_count += 1
    precision_at_k = relevant_count / k
    contributes = "→ Contributes" if is_relevant else "→ Does NOT contribute"
    print(f"   Position {k}: Precision@{k} = {relevant_count}/{k} = {precision_at_k:.2f} {contributes}")
    if is_relevant:
        precisions_good.append(precision_at_k)

total_relevant = sum(good_ranking)
context_precision_good = sum(precisions_good) / total_relevant if total_relevant > 0 else 0

print(f"\n   Sum of contributing precisions: {sum(precisions_good):.2f}")
print(f"   Total relevant items: {total_relevant}")
print(f"\n   📊 Context Precision (Good Ranking): {context_precision_good:.2f}")

📊 GOOD RANKING: Relevant chunks at TOP

Ranking: [✅ Relevant, ✅ Relevant, ❌ Not Rel, ❌ Not Rel]

Precision@K calculation:
   Position 1: Precision@1 = 1/1 = 1.00 → Contributes
   Position 2: Precision@2 = 2/2 = 1.00 → Contributes
   Position 3: Precision@3 = 2/3 = 0.67 → Does NOT contribute
   Position 4: Precision@4 = 2/4 = 0.50 → Does NOT contribute

   Sum of contributing precisions: 2.00
   Total relevant items: 2

   📊 Context Precision (Good Ranking): 1.00


In [None]:
# Calculate Precision@K for BAD ranking (relevant at bottom)

# Bad ranking: [Not Relevant, Not Relevant, Relevant, Relevant]
bad_ranking = [False, False, True, True]

print("📊 BAD RANKING: Relevant chunks at BOTTOM")
print("=" * 60)
print("\nRanking: [❌ Not Rel, ❌ Not Rel, ✅ Relevant, ✅ Relevant]")
print("\nPrecision@K calculation:")

precisions_bad = []
relevant_count = 0
for k, is_relevant in enumerate(bad_ranking, 1):
    if is_relevant:
        relevant_count += 1
    precision_at_k = relevant_count / k
    contributes = "→ Contributes" if is_relevant else "→ Does NOT contribute"
    print(f"   Position {k}: Precision@{k} = {relevant_count}/{k} = {precision_at_k:.2f} {contributes}")
    if is_relevant:
        precisions_bad.append(precision_at_k)

total_relevant = sum(bad_ranking)
context_precision_bad = sum(precisions_bad) / total_relevant if total_relevant > 0 else 0

print(f"\n   Sum of contributing precisions: {sum(precisions_bad):.2f}")
print(f"   Total relevant items: {total_relevant}")
print(f"\n   📊 Context Precision (Bad Ranking): {context_precision_bad:.2f}")

📊 BAD RANKING: Relevant chunks at BOTTOM

Ranking: [❌ Not Rel, ❌ Not Rel, ✅ Relevant, ✅ Relevant]

Precision@K calculation:
   Position 1: Precision@1 = 0/1 = 0.00 → Does NOT contribute
   Position 2: Precision@2 = 0/2 = 0.00 → Does NOT contribute
   Position 3: Precision@3 = 1/3 = 0.33 → Contributes
   Position 4: Precision@4 = 2/4 = 0.50 → Contributes

   Sum of contributing precisions: 0.83
   Total relevant items: 2

   📊 Context Precision (Bad Ranking): 0.42


In [None]:
# Visual comparison

print("\n" + "=" * 60)
print("üìä RANKING COMPARISON")
print("=" * 60)

print("""
GOOD RANKING (Score: {:.2f})          BAD RANKING (Score: {:.2f})
┌─────────────────────────┐          ┌─────────────────────────┐
│ 1. ✅ Eiffel Tower Paris │          │ 1. ❌ Pizza Italy        │
│ 2. ✅ Paris is capital   │          │ 2. ❌ Built in 1889      │
│ 3. ❌ Built in 1889      │          │ 3. ✅ Paris is capital   │
│ 4. ❌ Pizza Italy        │          │ 4. ✅ Eiffel Tower Paris │
└─────────────────────────┘          └─────────────────────────┘
     Relevant at TOP! ✓                  Relevant at BOTTOM! ✗
""".format(context_precision_good, context_precision_bad))

print(f"   Difference: {context_precision_good - context_precision_bad:.2f}")
print("   Same chunks, different ranking → HUGE difference in score!")


📊 RANKING COMPARISON

GOOD RANKING (Score: 1.00)          BAD RANKING (Score: 0.42)
┌─────────────────────────┐          ┌─────────────────────────┐
│ 1. ✅ Eiffel Tower Paris │          │ 1. ❌ Pizza Italy        │
│ 2. ✅ Paris is capital   │          │ 2. ❌ Built in 1889      │
│ 3. ❌ Built in 1889      │          │ 3. ✅ Paris is capital   │
│ 4. ❌ Pizza Italy        │          │ 4. ✅ Eiffel Tower Paris │
└─────────────────────────┘          └─────────────────────────┘
     Relevant at TOP! ✓                  Relevant at BOTTOM! ✗

   Difference: 0.58
   Same chunks, different ranking → HUGE difference in score!
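
The good-ranking and bad-ranking calculations follow the same recipe, so they can be folded into one helper. The sketch below is illustrative only: `context_precision` is a name chosen here (it is not a RAGAS function), and it assumes relevance is already known as a list of booleans in rank order, whereas RAGAS asks an LLM for each verdict.

```python
from typing import Sequence


def context_precision(relevance: Sequence[bool]) -> float:
    """Average Precision@k over the positions k that hold a relevant chunk."""
    hits = 0
    contributing = []
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision@k only contributes when position k is relevant
            contributing.append(hits / k)
    return sum(contributing) / len(contributing) if contributing else 0.0


good = context_precision([True, True, False, False])
bad = context_precision([False, False, True, True])
print(f"good={good:.2f} bad={bad:.2f}")  # good=1.00 bad=0.42
```

Because only positions holding a relevant chunk contribute, pushing relevant chunks further down the list shrinks every contributing Precision@k term.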


## 4.4 Verify with RAGAS Metric

In [None]:
# Run actual RAGAS Context Precision

# Good ranking sample
good_sample = SingleTurnSample(
    user_input=question,
    reference=reference,
    retrieved_contexts=[
        "The Eiffel Tower is located in Paris, France.",
        "Paris is the capital of France.",
        "The tower was built in 1889.",
        "Pizza originated in Italy."
    ]
)

# Bad ranking sample (same chunks, reversed order)
bad_sample = SingleTurnSample(
    user_input=question,
    reference=reference,
    retrieved_contexts=[
        "Pizza originated in Italy.",
        "The tower was built in 1889.",
        "Paris is the capital of France.",
        "The Eiffel Tower is located in Paris, France."
    ]
)

precision_metric = LLMContextPrecisionWithReference(llm=ragas_llm)

good_score = run_async(precision_metric.single_turn_ascore(good_sample))
bad_score = run_async(precision_metric.single_turn_ascore(bad_sample))

print("🔬 RAGAS Context Precision Results")
print("=" * 50)
print(f"\n   Good Ranking (relevant at top): {good_score:.2f}")
print(f"   Bad Ranking (relevant at bottom): {bad_score:.2f}")
print(f"\n   Difference: {good_score - bad_score:.2f}")

🔬 RAGAS Context Precision Results

   Good Ranking (relevant at top): 1.00
   Bad Ranking (relevant at bottom): 0.42

   Difference: 0.58


---

# 🎓 Section 5: Context Recall Deep Dive

## What Context Recall Measures

**Context Recall** checks if you retrieved *all the necessary information* to answer the question. It measures **retrieval completeness**.

### 📖 Analogy

> You're studying for an exam using a textbook. Context Recall asks: "Did you read all the chapters needed to answer every exam question, or did you skip some important ones?"

### 🔧 How It Works

1. Break down the **reference answer** into individual claims
2. Check if each claim can be **attributed** to the retrieved context
3. Calculate: claims found / total claims

### 📐 Formula

$$\text{Context Recall} = \frac{\text{Reference claims found in context}}{\text{Total claims in reference}}$$

In [None]:
# Context Recall example setup

recall_question = "Tell me about the Eiffel Tower."
recall_reference = "The Eiffel Tower is located in Paris. It was built in 1889. It is 330 meters tall."

# Retrieved context (missing the height information)
recall_context = [
    "The Eiffel Tower is a landmark located in Paris, France.",
    "The tower was completed in 1889 for the World's Fair."
]

print("📝 Reference Answer (Ground Truth):")
print(f"   '{recall_reference}'")
print("\n📚 Retrieved Context:")
for i, ctx in enumerate(recall_context, 1):
    print(f"   {i}. '{ctx}'")

📝 Reference Answer (Ground Truth):
   'The Eiffel Tower is located in Paris. It was built in 1889. It is 330 meters tall.'

📚 Retrieved Context:
   1. 'The Eiffel Tower is a landmark located in Paris, France.'
   2. 'The tower was completed in 1889 for the World's Fair.'


In [None]:
# Extract claims from reference

reference_claims = [
    "The Eiffel Tower is located in Paris",
    "It was built in 1889",
    "It is 330 meters tall"
]

print("🔍 STEP 1: Reference Claims")
print("=" * 50)
for i, claim in enumerate(reference_claims, 1):
    print(f"   {i}. {claim}")

🔍 STEP 1: Reference Claims
   1. The Eiffel Tower is located in Paris
   2. It was built in 1889
   3. It is 330 meters tall


In [None]:
# Check attribution of each claim

attribution_prompt = ChatPromptTemplate.from_template("""
Can the following claim be attributed to (found in) the given context?

Context:
{context}

Claim: {claim}

Answer "YES" if the claim is supported by the context, "NO" if it cannot be found.
""")

attribution_chain = attribution_prompt | llm | StrOutputParser()

print("🔍 STEP 2: Claim Attribution Check")
print("=" * 60)

combined_context = "\n".join(recall_context)
attribution_results = []

for claim in reference_claims:
    result = attribution_chain.invoke({
        "context": combined_context,
        "claim": claim
    })
    found = "YES" in result.upper()
    attribution_results.append(found)
    status = "✅ Found" if found else "❌ MISSING"
    print(f"   {status}: '{claim}'")
    if not found:
        print(f"      ⚠️ This information was NOT retrieved!")

🔍 STEP 2: Claim Attribution Check
   ✅ Found: 'The Eiffel Tower is located in Paris'
   ✅ Found: 'It was built in 1889'
   ❌ MISSING: 'It is 330 meters tall'
      ⚠️ This information was NOT retrieved!


In [None]:
# Calculate Context Recall

claims_found = sum(attribution_results)
total_claims = len(reference_claims)
manual_recall = claims_found / total_claims

print("🔢 STEP 3: Calculate Context Recall")
print("=" * 50)
print(f"\n   Claims found in context: {claims_found}")
print(f"   Total claims in reference: {total_claims}")
print(f"\n   Formula: {claims_found} / {total_claims} = {manual_recall:.2f}")
print(f"\n   📊 Context Recall: {manual_recall:.2f}")
print(f"\n   ⚠️ Interpretation: {100 - manual_recall*100:.0f}% of required info was NOT retrieved!")

🔢 STEP 3: Calculate Context Recall

   Claims found in context: 2
   Total claims in reference: 3

   Formula: 2 / 3 = 0.67

   📊 Context Recall: 0.67

   ⚠️ Interpretation: 33% of required info was NOT retrieved!


In [None]:
# Verify with RAGAS

recall_sample = SingleTurnSample(
    user_input=recall_question,
    response="The Eiffel Tower is in Paris and was built in 1889.",
    reference=recall_reference,
    retrieved_contexts=recall_context
)

recall_metric = LLMContextRecall(llm=ragas_llm)
ragas_recall = run_async(recall_metric.single_turn_ascore(recall_sample))

print("🔬 RAGAS Context Recall Result")
print("=" * 50)
print(f"\n   Manual calculation: {manual_recall:.2f}")
print(f"   RAGAS metric score: {ragas_recall:.2f}")

🔬 RAGAS Context Recall Result

   Manual calculation: 0.67
   RAGAS metric score: 0.67
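
The attribution step depends on turning a free-text LLM reply into a yes/no verdict, and a loose match can misfire when the reply contains extra words. Below is a minimal sketch of a stricter parser that only looks at how the reply opens. `parse_verdict` and the sample replies are hypothetical and are not part of RAGAS or LangChain.

```python
import re


def parse_verdict(raw: str) -> bool:
    """Return True for a reply that opens with YES, False for one that opens with NO."""
    # Skip leading whitespace, quotes or asterisks, then require a whole word
    match = re.match(r"\W*(yes|no)\b", raw, flags=re.IGNORECASE)
    if match is None:
        raise ValueError(f"Unrecognised verdict: {raw!r}")
    return match.group(1).lower() == "yes"


# Hypothetical replies, one per reference claim
replies = [
    "YES",
    "Yes, the context states 1889.",
    "NO. The height is not mentioned.",
]
verdicts = [parse_verdict(reply) for reply in replies]
recall = sum(verdicts) / len(verdicts)
print(f"Context Recall: {recall:.2f}")  # Context Recall: 0.67
```

Raising on an unrecognised reply is a design choice for this sketch: it surfaces malformed LLM output instead of silently counting it as a missing claim.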


---

# 🎓 Section 6: Context Entity Recall Deep Dive

## What Context Entity Recall Measures

**Context Entity Recall** checks if you retrieved context containing all the *important entities* (people, places, dates, organizations) mentioned in the reference answer.

### 📖 Analogy

> If the correct answer mentions "Einstein," "1905," and "Princeton," did your retrieved documents mention these specific entities?

### 📐 Formula

$$\text{Entity Recall} = \frac{\text{Entities in both reference AND context}}{\text{Total entities in reference}}$$

In [None]:
# Entity Recall example setup

entity_reference = "Albert Einstein developed the theory of relativity at Princeton University in 1905."
entity_context = [
    "Albert Einstein was a famous physicist who worked at Princeton."
]

print("📝 Reference Answer:")
print(f"   '{entity_reference}'")
print("\n📚 Retrieved Context:")
print(f"   '{entity_context[0]}'")

📝 Reference Answer:
   'Albert Einstein developed the theory of relativity at Princeton University in 1905.'

📚 Retrieved Context:
   'Albert Einstein was a famous physicist who worked at Princeton.'


In [None]:
# Manual entity extraction

entity_extraction_prompt = ChatPromptTemplate.from_template("""
Extract all named entities from the following text. 
Include: PERSON, ORGANIZATION, LOCATION, DATE, and other proper nouns.

Text: {text}

List each entity on a new line with its type:
""")

entity_chain = entity_extraction_prompt | llm | StrOutputParser()

print("🔍 Entity Extraction")
print("=" * 60)

print("\n📋 Reference Entities:")
ref_entities = entity_chain.invoke({"text": entity_reference})
print(ref_entities)

print("\n📋 Context Entities:")
ctx_entities = entity_chain.invoke({"text": entity_context[0]})
print(ctx_entities)


In [None]:
# Manual entity analysis

# Define entities for analysis
reference_entities = {
    "Albert Einstein": "PERSON",
    "Princeton University": "ORGANIZATION",
    "1905": "DATE"
}

context_entities = {
    "Albert Einstein": "PERSON",
    "Princeton": "ORGANIZATION"  # Partial match
}

print("📊 Entity Comparison")
print("=" * 60)

print("\n| Entity in Reference | Type | Found in Context? |")
print("|" + "-" * 20 + "|" + "-" * 14 + "|" + "-" * 18 + "|")

found_count = 0
for entity, entity_type in reference_entities.items():
    # Check if entity (or partial) exists in context
    found = any(entity.lower() in ctx.lower() or ctx.lower() in entity.lower() 
                for ctx in context_entities.keys())
    if found:
        found_count += 1
    status = "✅ Yes" if found else "❌ MISSING"
    print(f"| {entity:18} | {entity_type:12} | {status:16} |")

entity_recall = found_count / len(reference_entities)
print(f"\n📊 Entity Recall: {found_count}/{len(reference_entities)} = {entity_recall:.2f}")
print(f"⚠️ Missing: '1905' - Critical date not retrieved!")

📊 Entity Comparison

| Entity in Reference | Type | Found in Context? |
|--------------------|--------------|------------------|
| Albert Einstein    | PERSON       | ✅ Yes            |
| Princeton University | ORGANIZATION | ✅ Yes            |
| 1905               | DATE         | ❌ MISSING        |

📊 Entity Recall: 2/3 = 0.67
⚠️ Missing: '1905' - Critical date not retrieved!


In [None]:
# Verify with RAGAS

entity_sample = SingleTurnSample(
    reference=entity_reference,
    retrieved_contexts=entity_context
)

entity_metric = ContextEntityRecall(llm=ragas_llm)
ragas_entity_recall = run_async(entity_metric.single_turn_ascore(entity_sample))

print("🔬 RAGAS Context Entity Recall Result")
print("=" * 50)
print(f"\n   Manual estimate: {entity_recall:.2f}")
print(f"   RAGAS metric score: {ragas_entity_recall:.2f}")

🔬 RAGAS Context Entity Recall Result

   Manual estimate: 0.67
   RAGAS metric score: 0.25
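
The gap between the manual estimate (0.67) and the RAGAS score (0.25) is worth a closer look. One plausible explanation, sketched below, is that the metric compares LLM-extracted entity strings as exact-match sets, so "Princeton" does not count as "Princeton University", and that the extractor also treated "theory of relativity" as an entity. Both points are assumptions: the entities RAGAS extracted in this run are not shown, and `exact_entity_recall` is a name made up for this illustration.

```python
from typing import Iterable


def exact_entity_recall(reference_entities: Iterable[str],
                        context_entities: Iterable[str]) -> float:
    """Share of reference entities that appear verbatim among the context entities."""
    reference = set(reference_entities)
    context = set(context_entities)
    return len(reference & context) / len(reference) if reference else 0.0


# Hypothetical extraction results (assumed, not taken from the RAGAS run)
reference = ["Albert Einstein", "theory of relativity", "Princeton University", "1905"]
context = ["Albert Einstein", "Princeton"]

print(f"{exact_entity_recall(reference, context):.2f}")  # 0.25
```

Under these assumptions only "Albert Einstein" matches, giving 1/4. The manual cell scored higher because it accepted the partial match on "Princeton" and started from three reference entities instead of four.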


### Entity Types Tracked

```
PERSON:       Albert Einstein, Marie Curie, Elon Musk
ORGANIZATION: Princeton University, NASA, Google
LOCATION:     Paris, Mount Everest, Pacific Ocean
DATE:         1905, January 15, 20th century
NUMBER:       330 meters, $1 billion, 99.9%
```

---

# 🎓 Section 7: Noise Sensitivity Deep Dive

## What Noise Sensitivity Measures

**Noise Sensitivity** tests how much *irrelevant information* in the retrieved context causes errors in the answer. It measures **robustness** to noise.

### 📖 Analogy

> You're taking an open-book exam, but someone mixed random Wikipedia articles into your notes. Noise Sensitivity measures how often those random articles cause you to write wrong answers.

### ⚠️ Important: Lower is Better!

Unlike other metrics where higher is better, for Noise Sensitivity:
- **0.0** = Great! Model ignores noise completely
- **1.0** = Bad! Model is very easily confused by irrelevant information

### 📐 Formula

$$\text{Noise Sensitivity} = \frac{\text{Incorrect claims from noisy context}}{\text{Total claims}}$$

In [None]:
# Noise Sensitivity example setup

noise_question = "What is LIC known for?"
noise_response = "LIC is the largest insurance company in India, known for its vast portfolio. LIC contributes to financial stability."
noise_reference = "LIC is the largest insurance company in India, established in 1956. It is known for managing a large portfolio of investments."

noise_contexts = [
    "LIC was established in 1956 following nationalization.",           # ✅ Relevant
    "LIC is the largest insurance company with huge investments.",      # ✅ Relevant
    "LIC manages substantial funds for financial stability.",           # ✅ Relevant
    "The Indian economy is one of the fastest-growing economies..."     # ❌ NOISE!
]

print("📝 Question: '{}'\n".format(noise_question))
print("📝 Response to evaluate:")
print(f"   '{noise_response}'")
print("\n📝 Reference (Ground Truth):")
print(f"   '{noise_reference}'")
print("\n📚 Retrieved Contexts:")
for i, ctx in enumerate(noise_contexts, 1):
    noise_tag = " ← NOISE!" if i == 4 else " ✅"
    print(f"   {i}. '{ctx[:60]}...'{noise_tag}")

📝 Question: 'What is LIC known for?'

📝 Response to evaluate:
   'LIC is the largest insurance company in India, known for its vast portfolio. LIC contributes to financial stability.'

📝 Reference (Ground Truth):
   'LIC is the largest insurance company in India, established in 1956. It is known for managing a large portfolio of investments.'

📚 Retrieved Contexts:
   1. 'LIC was established in 1956 following nationalization....' ✅
   2. 'LIC is the largest insurance company with huge investments....' ✅
   3. 'LIC manages substantial funds for financial stability....' ✅
   4. 'The Indian economy is one of the fastest-growing economies.....' ← NOISE!


In [None]:
# Analyze claims in response

response_claims = [
    ("LIC is the largest insurance company in India", True, "Matches reference"),
    ("LIC is known for its vast portfolio", True, "Matches reference (portfolio)"),
    ("LIC contributes to financial stability", False, "NOT in reference - possible hallucination from noise!")
]

print("🔍 Claim Analysis")
print("=" * 70)

print("\n| Claim | Correct? | Reason |")
print("|" + "-" * 45 + "|" + "-" * 10 + "|" + "-" * 40 + "|")

incorrect_count = 0
for claim, is_correct, reason in response_claims:
    status = "✅ Yes" if is_correct else "❌ No"
    if not is_correct:
        incorrect_count += 1
    print(f"| {claim[:43]:43} | {status:8} | {reason[:38]:38} |")

🔍 Claim Analysis

| Claim | Correct? | Reason |
|---------------------------------------------|----------|----------------------------------------|
| LIC is the largest insurance company in Ind | ✅ Yes    | Matches reference                      |
| LIC is known for its vast portfolio         | ✅ Yes    | Matches reference (portfolio)          |
| LIC contributes to financial stability      | ❌ No     | NOT in reference - possible hallucinat |


In [None]:
# Calculate Noise Sensitivity

total_claims = len(response_claims)
noise_sensitivity = incorrect_count / total_claims

print("🔢 Noise Sensitivity Calculation")
print("=" * 50)
print(f"\n   Incorrect claims: {incorrect_count}")
print(f"   Total claims: {total_claims}")
print(f"\n   Formula: {incorrect_count} / {total_claims} = {noise_sensitivity:.2f}")
print(f"\n   📊 Noise Sensitivity: {noise_sensitivity:.2f}")

if noise_sensitivity < 0.3:
    print("   ✅ Good! Model is mostly resistant to noise.")
elif noise_sensitivity < 0.6:
    print("   ⚠️ Warning! Model is sometimes confused by noise.")
else:
    print("   🚨 Bad! Model is highly susceptible to noise.")

🔢 Noise Sensitivity Calculation

   Incorrect claims: 1
   Total claims: 3

   Formula: 1 / 3 = 0.33

   📊 Noise Sensitivity: 0.33
   ⚠️ Warning! Model is sometimes confused by noise.


In [None]:
# Verify with RAGAS (relevant mode only)

noise_sample = SingleTurnSample(
    user_input=noise_question,
    response=noise_response,
    reference=noise_reference,
    retrieved_contexts=noise_contexts
)

# Relevant mode: errors from relevant contexts
noise_metric_relevant = NoiseSensitivity(llm=ragas_llm, mode="relevant")

ragas_noise = run_async(noise_metric_relevant.single_turn_ascore(noise_sample))

print("🔬 RAGAS Noise Sensitivity Result")
print("=" * 50)
print(f"\n   Mode: relevant")
print(f"   Score: {ragas_noise:.2f}")
print(f"\n   Remember: Lower is better for this metric!")

🔬 RAGAS Noise Sensitivity Result

   Mode: relevant
   Score: 0.33

   Remember: Lower is better for this metric!
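
The `mode` argument decides which incorrect claims are counted: in relevant mode, those that trace back to relevant retrieved chunks; in irrelevant mode, those that trace back only to irrelevant ones. The sketch below is a simplified illustration of that idea, not RAGAS's implementation. The `Claim` class, the `noise_sensitivity` function, and every label in the example (which claims are correct, which chunk supports them, which chunks are relevant) are hand-written assumptions based on the LIC example.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Claim:
    text: str
    correct: bool            # does the reference support this claim?
    supported_by: Set[int]   # 0-based indices of retrieved chunks that support it


def noise_sensitivity(claims: List[Claim], relevant_chunks: Set[int],
                      mode: str = "relevant") -> float:
    """Fraction of response claims that are incorrect and traceable to the chosen chunk type."""
    if mode not in ("relevant", "irrelevant"):
        raise ValueError("mode must be 'relevant' or 'irrelevant'")
    if not claims:
        return 0.0
    flagged = 0
    for claim in claims:
        from_relevant = bool(claim.supported_by & relevant_chunks)
        # Count a claim as noise-driven only if no relevant chunk supports it
        from_irrelevant = bool(claim.supported_by - relevant_chunks) and not from_relevant
        traced = from_relevant if mode == "relevant" else from_irrelevant
        if traced and not claim.correct:
            flagged += 1
    return flagged / len(claims)


# Hand-labelled assumptions for the LIC example
claims = [
    Claim("LIC is the largest insurance company in India", True, {1}),
    Claim("LIC is known for its vast portfolio", True, {1}),
    Claim("LIC contributes to financial stability", False, {2}),
]
relevant_chunks = {0, 1, 2}  # index 3 is the chunk tagged as noise

print(f"relevant:   {noise_sensitivity(claims, relevant_chunks, 'relevant'):.2f}")    # 0.33
print(f"irrelevant: {noise_sensitivity(claims, relevant_chunks, 'irrelevant'):.2f}")  # 0.00
```

Under these labels the one incorrect claim echoes the third chunk, which is tagged relevant, so it is counted in relevant mode and not in irrelevant mode.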


### Score Interpretation Guide

| Score | Meaning | What it means for your RAG |
|-------|---------|---------------------------|
| **0.0 - 0.2** | Excellent | Model effectively ignores irrelevant information |
| **0.2 - 0.4** | Good | Occasional confusion but mostly robust |
| **0.4 - 0.6** | Concerning | Model frequently picks up noise |
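| **0.6 - 1.0** | Poor | Model is highly susceptible to distraction |

Note that the code cells in this notebook use coarser cutoffs (below 0.3 is good, below 0.6 is concerning), so a score such as 0.33 falls in the "Good" band of this table but triggers the warning branch in the code.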

---

# 🎓 Section 8: Putting It All Together

## 8.1 Metrics Relationship Diagram

Understanding which metrics evaluate which component of your RAG system:

```
                           USER QUESTION
                                │
                                ▼
                    ┌───────────────────────┐
                    │      RETRIEVER        │
                    │                       │
                    │  Metrics:             │
                    │  • Context Precision  │◄── Is ranking good?
                    │  • Context Recall     │◄── Is coverage complete?
                    │  • Entity Recall      │◄── Are entities captured?
                    └───────────┬───────────┘
                                │
                                ▼
                       Retrieved Chunks
                                │
                                ▼
                    ┌───────────────────────┐
                    │      GENERATOR        │
                    │        (LLM)          │
                    │                       │
                    │  Metrics:             │
                    │  • Faithfulness       │◄── No hallucinations?
                    │  • Noise Sensitivity  │◄── Ignores irrelevant?
                    └───────────┬───────────┘
                                │
                                ▼
                    ┌───────────────────────┐
                    │       ANSWER          │
                    │                       │
                    │  Metric:              │
                    │  • Answer Relevancy   │◄── Addresses question?
                    └───────────────────────┘
```

In [None]:
# Complete evaluation with all 6 metrics

# Create a comprehensive sample
complete_sample = SingleTurnSample(
    user_input="What is the Eiffel Tower and where is it located?",
    response="The Eiffel Tower is a famous iron lattice tower located in Paris, France. It was built in 1889.",
    reference="The Eiffel Tower is a wrought-iron lattice tower in Paris, France. It was constructed from 1887 to 1889.",
    retrieved_contexts=[
        "The Eiffel Tower is a wrought-iron lattice tower on the Champ de Mars in Paris, France.",
        "The tower was constructed from 1887 to 1889 as the centerpiece of the 1889 World's Fair.",
        "The Eiffel Tower is named after Gustave Eiffel, whose company designed and built the tower.",
        "Paris is known for its cafe culture and fashion industry."  # Some noise
    ]
)

print("📊 Complete Sample for Evaluation")
print("=" * 60)
print(f"\nQuestion: {complete_sample.user_input}")
print(f"\nResponse: {complete_sample.response}")
print(f"\nReference: {complete_sample.reference}")
print(f"\nContexts: {len(complete_sample.retrieved_contexts)} chunks")

📊 Complete Sample for Evaluation

Question: What is the Eiffel Tower and where is it located?

Response: The Eiffel Tower is a famous iron lattice tower located in Paris, France. It was built in 1889.

Reference: The Eiffel Tower is a wrought-iron lattice tower in Paris, France. It was constructed from 1887 to 1889.

Contexts: 4 chunks


In [None]:
# Run all 6 metrics

print("🔬 Running All 6 RAGAS Metrics")
print("=" * 60)

# Initialize all metrics
all_metrics = {
    "Faithfulness": Faithfulness(llm=ragas_llm),
    "Answer Relevancy": ResponseRelevancy(llm=ragas_llm, embeddings=ragas_embeddings),
    "Context Precision": LLMContextPrecisionWithReference(llm=ragas_llm),
    "Context Recall": LLMContextRecall(llm=ragas_llm),
    "Context Entity Recall": ContextEntityRecall(llm=ragas_llm),
    "Noise Sensitivity": NoiseSensitivity(llm=ragas_llm)
}

results = {}
for name, metric in all_metrics.items():
    try:
        score = run_async(metric.single_turn_ascore(complete_sample))
        results[name] = score
        
        # Interpret the score
        if name == "Noise Sensitivity":
            quality = "Good" if score < 0.3 else "Concerning" if score < 0.6 else "Poor"
            direction = "(lower is better)"
        else:
            quality = "Good" if score > 0.7 else "Concerning" if score > 0.5 else "Poor"
            direction = "(higher is better)"
        
        print(f"\n✅ {name}: {score:.3f} {direction}")
        print(f"   Assessment: {quality}")
    except Exception as e:
        print(f"\n❌ {name}: Error - {str(e)[:50]}")
        results[name] = None

🔬 Running All 6 RAGAS Metrics

✅ Faithfulness: 1.000 (higher is better)
   Assessment: Good

✅ Answer Relevancy: 0.892 (higher is better)
   Assessment: Good

✅ Context Precision: 1.000 (higher is better)
   Assessment: Good

✅ Context Recall: 1.000 (higher is better)
   Assessment: Good

✅ Context Entity Recall: 1.000 (higher is better)
   Assessment: Good

✅ Noise Sensitivity: 0.000 (lower is better)
   Assessment: Good


In [None]:
# Summary table

print("\n" + "=" * 70)
print("📊 EVALUATION SUMMARY")
print("=" * 70)

summary_data = []
for name, score in results.items():
    if score is not None:
        if name == "Noise Sensitivity":
            ideal = "0.0"
            status = "✅" if score < 0.3 else "⚠️" if score < 0.6 else "❌"
        else:
            ideal = "1.0"
            status = "✅" if score > 0.7 else "⚠️" if score > 0.5 else "❌"
        summary_data.append({
            "Metric": name,
            "Score": f"{score:.3f}",
            "Ideal": ideal,
            "Status": status
        })

df_summary = pd.DataFrame(summary_data)
print(df_summary.to_string(index=False))


📊 EVALUATION SUMMARY
           Metric Score Ideal Status
     Faithfulness 1.000   1.0      ✅
 Answer Relevancy 0.838   1.0      ✅
Context Precision 1.000   1.0      ✅
   Context Recall 1.000   1.0      ✅
Noise Sensitivity 0.000   0.0      ✅


## 8.2 Debugging Guide: What to Do When Scores Are Low

| Metric | If Score is Low | Action |
|--------|-----------------|--------|
| **Faithfulness** | LLM is hallucinating | Improve prompt to emphasize context adherence, reduce temperature, use stronger LLM |
| **Answer Relevancy** | Answer doesn't address the question | Review prompt template, ensure question type (when/who/what) is addressed |
| **Context Precision** | Irrelevant chunks ranked high | Improve embedding model, add re-ranking, tune retriever similarity threshold |
| **Context Recall** | Missing important information | Increase k (number of chunks), improve chunking strategy, enhance embedding quality |
| **Entity Recall** | Key entities not retrieved | Use entity-aware chunking, keyword search hybrid, increase retrieval scope |
| **Noise Sensitivity** | Model confused by noise (high score) | Filter irrelevant chunks, use re-ranker, improve prompt robustness |
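
The debugging table can be turned into a small lookup that flags metrics needing attention. This is a minimal sketch, not part of RAGAS: `ACTIONS`, `LOWER_IS_BETTER`, and `diagnose` are hypothetical names, the action texts are abbreviated from the table, and the cutoffs (0.7 for most metrics, 0.3 for Noise Sensitivity) are assumed from the "Good" thresholds used in the Section 8 code cells.

```python
from typing import Dict, Optional

# Abbreviated actions from the debugging table, keyed by the metric names used in `results`
ACTIONS = {
    "Faithfulness": "Emphasize context adherence in the prompt, reduce temperature, use a stronger LLM",
    "Answer Relevancy": "Review the prompt template so the question type is addressed",
    "Context Precision": "Improve the embedding model, add re-ranking, tune the similarity threshold",
    "Context Recall": "Increase k, improve chunking, enhance embedding quality",
    "Context Entity Recall": "Use entity-aware chunking, hybrid keyword search, wider retrieval",
    "Noise Sensitivity": "Filter irrelevant chunks, use a re-ranker, harden the prompt",
}
LOWER_IS_BETTER = {"Noise Sensitivity"}


def diagnose(results: Dict[str, Optional[float]],
             min_good: float = 0.7, max_noise: float = 0.3) -> Dict[str, str]:
    """Return suggested actions for every scored metric that misses its 'Good' cutoff."""
    advice = {}
    for name, score in results.items():
        if score is None or name not in ACTIONS:
            continue  # skip metrics that errored or are unknown
        if name in LOWER_IS_BETTER:
            needs_work = score >= max_noise
        else:
            needs_work = score <= min_good
        if needs_work:
            advice[name] = ACTIONS[name]
    return advice


# Hypothetical scores for illustration
example = {"Faithfulness": 1.0, "Context Recall": 0.5,
           "Noise Sensitivity": 0.33, "Answer Relevancy": None}
for metric, action in diagnose(example).items():
    print(f"{metric}: {action}")
```

With the hypothetical scores above, only Context Recall and Noise Sensitivity are flagged; the `None` entry is skipped in the same way the evaluation loop records failed metrics.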

---

# 🚀 Section 9: Production Patterns

Now let's see how to run these metrics over a batch of samples, the way you would when evaluating a real RAG pipeline.

In [None]:
# Create a batch of test samples

test_samples = [
    SingleTurnSample(
        user_input="What is RAG?",
        response="RAG stands for Retrieval Augmented Generation. It combines retrieval systems with LLMs to provide accurate, grounded responses.",
        reference="RAG (Retrieval Augmented Generation) is a technique that enhances LLM responses by retrieving relevant documents and using them as context.",
        retrieved_contexts=[
            "RAG combines retrieval with generation for accurate responses.",
            "Retrieval Augmented Generation uses external knowledge bases."
        ]
    ),
    SingleTurnSample(
        user_input="What are embeddings?",
        response="Embeddings are numerical vector representations of text that capture semantic meaning.",
        reference="Embeddings are dense vector representations that encode semantic information about text into numerical format.",
        retrieved_contexts=[
            "Embeddings convert text to dense vectors.",
            "Vector representations capture semantic similarity."
        ]
    ),
    SingleTurnSample(
        user_input="What is chunking?",
        response="Chunking is the process of breaking documents into smaller pieces for processing.",
        reference="Chunking divides large documents into smaller segments that can be individually embedded and retrieved.",
        retrieved_contexts=[
            "Document chunking breaks text into manageable pieces.",
            "Chunk size affects retrieval quality."
        ]
    )
]

print(f"📊 Created {len(test_samples)} test samples for batch evaluation")

📊 Created 3 test samples for batch evaluation


In [None]:
# Batch evaluation using EvaluationDataset

from ragas import EvaluationDataset, evaluate

# Create evaluation dataset
eval_dataset = EvaluationDataset(samples=test_samples)

# Select metrics for batch evaluation
batch_metrics = [
    Faithfulness(llm=ragas_llm),
    ResponseRelevancy(llm=ragas_llm, embeddings=ragas_embeddings),
    LLMContextRecall(llm=ragas_llm)
]

print("🔬 Running Batch Evaluation...")
print("=" * 50)

# Run evaluation
batch_results = evaluate(
    dataset=eval_dataset,
    metrics=batch_metrics
)

print("\n✅ Batch evaluation complete!")

🔬 Running Batch Evaluation...


LLM returned 1 generations instead of requested 3. Proceeding with 1 generations.



✅ Batch evaluation complete!


In [None]:
# Display batch results

print("📊 Batch Evaluation Results")
print("=" * 60)

# Convert to DataFrame
results_df = batch_results.to_pandas()
print(results_df.to_string())

# Calculate averages
print("\n📈 Average Scores:")
for col in results_df.columns:
    if col not in ['user_input', 'response', 'reference', 'retrieved_contexts']:
        avg = results_df[col].mean()
        print(f"   {col}: {avg:.3f}")

📊 Batch Evaluation Results
             user_input                                                                                                               retrieved_contexts                                                                                                                         response                                                                                                                                    reference  faithfulness  answer_relevancy  context_recall
0          What is RAG?  [RAG combines retrieval with generation for accurate responses., Retrieval Augmented Generation uses external knowledge bases.]  RAG stands for Retrieval Augmented Generation. It combines retrieval systems with LLMs to provide accurate, grounded responses.  RAG (Retrieval Augmented Generation) is a technique that enhances LLM responses by retrieving relevant documents and using them as context.      0.666667          0.915684             1.0
1  What are embeddings?     
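
Once batch scores are available, a common next step is to compare the averages against minimum thresholds, for example as a check in an automated test run. The sketch below uses only the standard library and assumes the per-sample scores are available as a list of dicts (something like `results_df.to_dict(orient="records")` could produce that shape). `failing_metrics`, the threshold values, and the sample scores are all hypothetical and are not the outputs of the batch run above.

```python
import math
from statistics import mean
from typing import Dict, List, Optional


def failing_metrics(rows: List[Dict[str, float]],
                    minimums: Dict[str, float]) -> Dict[str, Optional[float]]:
    """Return {metric: average} for every metric whose batch average is below its minimum."""
    failures: Dict[str, Optional[float]] = {}
    for metric, minimum in minimums.items():
        # Ignore samples where the metric is missing or NaN (e.g. a failed LLM call)
        scores = [
            row[metric] for row in rows
            if row.get(metric) is not None and not math.isnan(row[metric])
        ]
        if not scores:
            failures[metric] = None  # nothing was scored for this metric
            continue
        average = mean(scores)
        if average < minimum:
            failures[metric] = average
    return failures


# Hypothetical per-sample scores and thresholds
rows = [
    {"faithfulness": 0.67, "answer_relevancy": 0.92, "context_recall": 1.0},
    {"faithfulness": 1.0, "answer_relevancy": 0.88, "context_recall": 0.5},
    {"faithfulness": 1.0, "answer_relevancy": float("nan"), "context_recall": 1.0},
]
minimums = {"faithfulness": 0.9, "answer_relevancy": 0.8, "context_recall": 0.8}

for metric, average in failing_metrics(rows, minimums).items():
    print(f"{metric} is below its threshold: {average}")
```

Returning the failing metrics instead of raising keeps the helper reusable: a notebook can print them, while a test suite can assert that the returned dict is empty.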

---

# 🎓 Section 10: Practice Exercises

Test your understanding with these hands-on exercises!

## Exercise 1: Faithfulness Analysis

Given this response and context, manually identify claims and predict the Faithfulness score:

**Response:** "Python was created by Guido van Rossum in 1991. It is known for its elegant syntax and is the most popular programming language in 2024."

**Context:** "Python is a high-level programming language created by Guido van Rossum. It was first released in 1991."

**Your task:**
1. Extract all claims from the response
2. Verify each claim against the context
3. Calculate the expected Faithfulness score

In [None]:
# Exercise 1 - Your solution here

# TODO: Extract claims and calculate faithfulness
exercise1_claims = [
    # Add your claims here
    # ("claim text", True/False for supported)
]

# Verify your answer
exercise1_sample = SingleTurnSample(
    user_input="Tell me about Python",
    response="Python was created by Guido van Rossum in 1991. It is known for its elegant syntax and is the most popular programming language in 2024.",
    retrieved_contexts=["Python is a high-level programming language created by Guido van Rossum. It was first released in 1991."]
)

# Uncomment to check your answer:
# score = run_async(faithfulness_metric.single_turn_ascore(exercise1_sample))
# print(f"Actual Faithfulness score: {score:.2f}")

## Exercise 2: Answer Relevancy Prediction

For each question-answer pair, predict whether the relevancy will be HIGH (>0.8) or LOW (<0.5):

1. Q: "Who invented the telephone?" A: "Alexander Graham Bell invented the telephone in 1876."
2. Q: "Who invented the telephone?" A: "The telephone revolutionized communication worldwide."
3. Q: "When was the moon landing?" A: "Neil Armstrong was the first person to walk on the moon."

In [None]:
# Exercise 2 - Test your predictions

exercise2_pairs = [
    ("Who invented the telephone?", "Alexander Graham Bell invented the telephone in 1876."),
    ("Who invented the telephone?", "The telephone revolutionized communication worldwide."),
    ("When was the moon landing?", "Neil Armstrong was the first person to walk on the moon.")
]

# Your predictions: 
# 1. HIGH / LOW?
# 2. HIGH / LOW?
# 3. HIGH / LOW?

# Uncomment to verify:
# for q, a in exercise2_pairs:
#     sample = SingleTurnSample(user_input=q, response=a, retrieved_contexts=["context"])
#     score = run_async(relevancy_metric.single_turn_ascore(sample))
#     print(f"Q: {q[:40]}... Score: {score:.2f}")

## Exercise 3: Context Precision - Ranking Impact

The 4 chunks below were retrieved for the question "What is the capital of Japan?".

**Chunks:**
- A: "Tokyo is the capital of Japan."
- B: "Japan has a population of 125 million."
- C: "Japan is an island nation in East Asia."
- D: "Sushi is a popular Japanese dish."

Arrange them to:

1. **Maximize** Context Precision (best possible score)
2. **Minimize** Context Precision (worst possible score)

In [None]:
# Exercise 3 - Test your rankings

chunks = {
    "A": "Tokyo is the capital of Japan.",
    "B": "Japan has a population of 125 million.",
    "C": "Japan is an island nation in East Asia.",
    "D": "Sushi is a popular Japanese dish."
}

# Your best ranking (e.g., ["A", "C", "B", "D"]):
best_ranking = []  # Fill in your answer

# Your worst ranking:
worst_ranking = []  # Fill in your answer

# Uncomment to test:
# def test_ranking(ranking):
#     sample = SingleTurnSample(
#         user_input="What is the capital of Japan?",
#         reference="Tokyo is the capital of Japan.",
#         retrieved_contexts=[chunks[c] for c in ranking]
#     )
#     return run_async(precision_metric.single_turn_ascore(sample))
# 
# print(f"Best ranking score: {test_ranking(best_ranking):.2f}")
# print(f"Worst ranking score: {test_ranking(worst_ranking):.2f}")

---

# 📚 Summary & Key Takeaways

## Quick Reference Table

| Metric | What It Measures | Calculation | Ideal |
|--------|-----------------|-------------|-------|
| **Faithfulness** | Hallucination detection | Supported claims / Total claims | 1.0 |
| **Answer Relevancy** | Answer addresses question | Avg cosine similarity between generated questions and the original question | 1.0 |
| **Context Precision** | Ranking quality | Mean precision@k over ranks holding relevant chunks | 1.0 |
| **Context Recall** | Retrieval completeness | Reference claims found in context / Total reference claims | 1.0 |
| **Entity Recall** | Entity coverage | Common entities / Reference entities | 1.0 |
| **Noise Sensitivity** | Robustness to noise | Incorrect claims / Total claims | 0.0 |
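
Four of the rows above reduce to a simple ratio of counts. The sketch below works through that arithmetic with hypothetical counts for a single sample. The `ratio` helper and its zero-denominator behaviour are assumptions made for illustration, not RAGAS code.

```python
def ratio(numerator, denominator):
    """Ratio of two counts; returns 0.0 when the denominator is zero (assumption)."""
    return numerator / denominator if denominator else 0.0

# Hypothetical counts for one evaluated sample (illustrative only)
faithfulness = ratio(3, 4)       # 3 of 4 answer claims supported by the context
context_recall = ratio(2, 3)     # 2 of 3 reference claims found in the context

reference_entities = {"Tokyo", "Japan", "Honshu"}
context_entities = {"Tokyo", "Japan", "Sushi"}
entity_recall = ratio(len(reference_entities & context_entities), len(reference_entities))

noise_sensitivity = ratio(1, 4)  # 1 of 4 answer claims incorrect (lower is better)

print(f"Faithfulness:      {faithfulness:.2f}")
print(f"Context Recall:    {context_recall:.2f}")
print(f"Entity Recall:     {entity_recall:.2f}")
print(f"Noise Sensitivity: {noise_sensitivity:.2f}")
```

Note the direction of the last metric: it is the only one in the table where the ideal value is 0.0.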

## Key Insights

1. **Faithfulness** and **Answer Relevancy** evaluate the **Generator (LLM)**
2. **Context Precision**, **Context Recall**, and **Entity Recall** evaluate the **Retriever**
3. **Noise Sensitivity** evaluates the **overall system's robustness**
4. Understanding the **intermediate steps** helps debug evaluation issues
5. Use these metrics **together** for comprehensive RAG evaluation
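
When debugging, the first three insights can be applied by grouping scores by the component they evaluate. The following is a minimal sketch. The metric key names, the `COMPONENT_OF` mapping, and the example scores are hypothetical labels chosen for this illustration.

```python
# Which part of the pipeline each metric evaluates (per the insights above)
COMPONENT_OF = {
    "faithfulness": "generator",
    "answer_relevancy": "generator",
    "context_precision": "retriever",
    "context_recall": "retriever",
    "context_entity_recall": "retriever",
    "noise_sensitivity": "system",
}

def scores_by_component(scores):
    """Group a flat {metric: score} dict into {component: {metric: score}}."""
    grouped = {}
    for metric, value in scores.items():
        grouped.setdefault(COMPONENT_OF[metric], {})[metric] = value
    return grouped

# Hypothetical scores for one evaluation run
example_scores = {
    "faithfulness": 0.95,
    "answer_relevancy": 0.90,
    "context_precision": 0.40,
    "context_recall": 0.55,
}

for component, metrics in scores_by_component(example_scores).items():
    print(component, metrics)
```

In this made-up run the generator metrics look healthy while both retriever metrics are low, which points the investigation at retrieval rather than at the prompt or the LLM.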

## Next Steps

- 📖 Review Notebook 10 for practical RAGAS workflows
- 🔧 Apply these metrics to your own RAG pipeline
- 📊 Set up automated evaluation with thresholds
- 🔬 Experiment with different LLMs and compare scores
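
As a starting point for threshold-based automation, a check like the one below could run after each evaluation. It is a sketch under stated assumptions: the threshold values, metric key names, and the `failing_metrics` helper are hypothetical and should be tuned to your own pipeline.

```python
# Hypothetical thresholds: (bound, direction).
# "min" means the score must be at least the bound; "max" means at most the bound.
THRESHOLDS = {
    "faithfulness": (0.8, "min"),
    "answer_relevancy": (0.8, "min"),
    "context_precision": (0.7, "min"),
    "context_recall": (0.7, "min"),
    "noise_sensitivity": (0.3, "max"),  # lower is better for this metric
}

def failing_metrics(scores, thresholds=THRESHOLDS):
    """Return the names of metrics whose scores violate their threshold."""
    failures = []
    for metric, (bound, direction) in thresholds.items():
        value = scores.get(metric)
        if value is None:
            continue  # metric not computed in this run
        if direction == "min" and value < bound:
            failures.append(metric)
        elif direction == "max" and value > bound:
            failures.append(metric)
    return failures

# Hypothetical scores from one evaluation run
run_scores = {
    "faithfulness": 0.92,
    "answer_relevancy": 0.71,
    "context_precision": 0.85,
    "noise_sensitivity": 0.40,
}

print("Failing metrics:", failing_metrics(run_scores))
```

Storing the direction alongside each bound keeps Noise Sensitivity from being treated like the higher-is-better metrics.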

---

## 📚 Additional Resources

- [RAGAS Official Documentation](https://docs.ragas.io/)
- [RAGAS GitHub Repository](https://github.com/explodinggradients/ragas)
- [LangChain Documentation](https://python.langchain.com/)

---

**Notebook created for the RAGAS Metrics Deep Dive**

*Last updated: Nov 2025*