# Layer 2 Metric: Answer Relevancy

## Answer Relevancy

**Purpose of this metric:** Answer Relevancy checks whether the application's answer directly addresses the user's question. A higher score means the answer is on-topic and aligned with the input.
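
As a rough illustration of where such a score comes from: DeepEval's documentation describes Answer Relevancy as the number of relevant statements divided by the total number of statements extracted from the answer. The sketch below reproduces that ratio from a list of judge verdicts. The helper name `relevancy_score` and the verdict dictionaries are hypothetical, and two behaviours are assumptions rather than confirmed library details: only an explicit "no" verdict counts against the score, and an answer with no statements scores 1.0.

```python
def relevancy_score(verdicts):
    """Fraction of statements not judged irrelevant (hypothetical helper)."""
    if not verdicts:
        # Assumption: nothing to penalize when no statements were extracted.
        return 1.0
    # Assumption: only an explicit "no" verdict counts as irrelevant.
    relevant = sum(1 for v in verdicts if v["verdict"].strip().lower() != "no")
    return relevant / len(verdicts)


# One statement judged relevant, like a short direct answer.
one_statement = [{"verdict": "yes"}]
# Two of four statements judged irrelevant (illustrative values only).
mixed = [{"verdict": "yes"}, {"verdict": "no"}, {"verdict": "yes"}, {"verdict": "no"}]

print(relevancy_score(one_statement))  # 1.0
print(relevancy_score(mixed))          # 0.5
```

A one-sentence answer that directly answers the question yields a single relevant statement, which is why short, on-topic answers tend to score 1.0.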

In [20]:
!pip install -U deepeval


## Setup

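This notebook evaluates live values from your running application: it uploads a sample document, asks a question, and scores the response.

Backend requirement: start the API first with `uvicorn app.main:app --reload --port 8000`.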

In [21]:
import os
from pathlib import Path

import requests
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.evaluate import evaluate, AsyncConfig

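# Load API keys and overrides from .env files when python-dotenv is installed.
try:
    from dotenv import load_dotenv
    load_dotenv('./../.env')
    load_dotenv('./../../backend/.env')
except ImportError:
    pass  # python-dotenv is optional; fall back to the existing environment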

BASE_URL = os.getenv('BASE_URL', 'http://localhost:8000')
FILE_PATH = Path(os.getenv('SAMPLE_FILE', '../sample_docs/Match_Summary.pdf')).resolve()
QUESTION = os.getenv('QUESTION', 'How many sixes did Tilak Varma hit?')

print('Backend:', BASE_URL)
print('File:', FILE_PATH)
print('Question:', QUESTION)

if not FILE_PATH.exists():
    raise FileNotFoundError(f'Sample file not found: {FILE_PATH}')

Backend: http://localhost:8000
File: /Users/shubhanshurastogi_1/Learning/rag-session-qa-eval/eval/sample_docs/Match_Summary.pdf
Question: How many sixes did Tilak Varma hit?


## Get Real Input, Output, and Retrieval Context from App

In [22]:
with open(FILE_PATH, 'rb') as f:
    files = {'file': (FILE_PATH.name, f)}
    upload_res = requests.post(f'{BASE_URL}/upload', files=files, timeout=120)

upload_res.raise_for_status()
session_id = upload_res.json().get('session_id')

payload = {'session_id': session_id, 'question': QUESTION}
ask_res = requests.post(f'{BASE_URL}/ask', json=payload, timeout=120)
ask_res.raise_for_status()

ask_data = ask_res.json()
answer = ask_data.get('answer', '')
retrieval_context = ask_data.get('retrieval_context', []) or []  # tolerate an explicit null

print('Session:', session_id)
print('Input:', QUESTION)
print('Actual Output:', answer)
print('Retrieved Context Chunks:', len(retrieval_context))
print(f"Retrieved Context ({len(retrieval_context)}): {retrieval_context}")

Session: bff72c52-0792-4144-acd3-753f383bca13
Input: How many sixes did Tilak Varma hit?
Actual Output: Tilak Varma hit 3 sixes.
Retrieved Context Chunks: 1
Retrieved Context (1): ["aren't in use today so it took some time for everyone to realise what happened. A yorker from round the wicket, looks to flick it away, misses and it clips leg stump on the way. Tilak Varma b Marco Jansen 45(19) [4s-3 6s-3] 10.5 4 Marco Jansen to Tilak Varma, FOUR, round the wicket, short of length across off, and it's swatted wide of mid-on. Fielder gets a hand diving across but it still runs away 10.2 Marco Jansen to Suryakumar Yadav, 1 run, dropped! Suryakumar having all the luck out there! Length ball on leg, tries his trademark scoop but is through the shot early. Toe-ends it in the air to deep midwicket and Bosch makes a meal of it 10.1 4 Marco Jansen to Suryakumar Yadav, FOUR, another inside edge past the stumps! Good length across off, he looks to loft down the ground but the"]


## Evaluate Answer Relevancy

In [23]:
metric = AnswerRelevancyMetric()
test_case = LLMTestCase(
    input=QUESTION,
    actual_output=answer,
    retrieval_context=retrieval_context,
)

evaluate(
    test_cases=[test_case],
    metrics=[metric],
    async_config=AsyncConfig(run_async=False)
)

Output()



Metrics Summary

  - ✅ Answer Relevancy (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4.1, reason: The score is 1.00 because the answer was fully relevant and directly addressed the question without any irrelevant information. Great job!, error: None)

For test case:

  - input: How many sixes did Tilak Varma hit?
  - actual output: Tilak Varma hit 3 sixes.
  - expected output: None
  - context: None
  - retrieval context: ["aren't in use today so it took some time for everyone to realise what happened. A yorker from round the wicket, looks to flick it away, misses and it clips leg stump on the way. Tilak Varma b Marco Jansen 45(19) [4s-3 6s-3] 10.5 4 Marco Jansen to Tilak Varma, FOUR, round the wicket, short of length across off, and it's swatted wide of mid-on. Fielder gets a hand diving across but it still runs away 10.2 Marco Jansen to Suryakumar Yadav, 1 run, dropped! Suryakumar having all the luck out there! Length ball on leg, tries his trademark scoop bu

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Answer Relevancy', threshold=0.5, success=True, score=1.0, reason='The score is 1.00 because the answer was fully relevant and directly addressed the question without any irrelevant information. Great job!', strict_mode=False, evaluation_model='gpt-4.1', error=None, evaluation_cost=0.0028060000000000003, verbose_logs='Statements:\n[\n    "Tilak Varma hit 3 sixes."\n] \n \nVerdicts:\n[\n    {\n        "verdict": "yes",\n        "reason": null\n    }\n]')], conversational=False, multimodal=False, input='How many sixes did Tilak Varma hit?', actual_output='Tilak Varma hit 3 sixes.', expected_output=None, context=None, retrieval_context=["aren't in use today so it took some time for everyone to realise what happened. A yorker from round the wicket, looks to flick it away, misses and it clips leg stump on the way. Tilak Varma b Marco Jansen 45(19) [4s-3 6s-3] 10.5 4 Marco Jansen to Til

### Faithfulness

**Purpose of this metric:** Faithfulness checks whether the application's answer is fully grounded in the retrieved context. A higher score means the answer does not introduce hallucinated facts and stays strictly supported by the provided context.
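
To make the scoring concrete: DeepEval's documentation describes Faithfulness as the number of truthful claims divided by the total number of claims in the answer, where each claim is checked against the retrieval context. The sketch below computes that ratio and also lists the contradicted claims. `ClaimVerdict` and `faithfulness_report` are hypothetical names, the sample claims are invented for illustration, and treating only an explicit "no" verdict as a contradiction is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ClaimVerdict:
    claim: str
    verdict: str  # "yes", "no", or "idk"


def faithfulness_report(verdicts):
    """Return (score, contradicted_claims) for a list of ClaimVerdict."""
    # Assumption: only "no" marks a claim as contradicting the context.
    contradicted = [v.claim for v in verdicts if v.verdict.strip().lower() == "no"]
    if not verdicts:
        # Assumption: an answer with no claims cannot contradict anything.
        return 1.0, contradicted
    score = (len(verdicts) - len(contradicted)) / len(verdicts)
    return score, contradicted


# Illustrative verdicts; the second claim is made up to show a contradiction.
verdicts = [
    ClaimVerdict("Tilak Varma was dismissed by Marco Jansen.", "yes"),
    ClaimVerdict("Tilak Varma was caught at long-on.", "no"),
]
score, contradicted = faithfulness_report(verdicts)
print(score)         # 0.5
print(contradicted)  # ['Tilak Varma was caught at long-on.']
```

Returning the contradicted claims alongside the score makes it easier to see which part of an answer was not supported by the retrieved chunks.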

In [24]:
import os
from pathlib import Path
import requests

from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric
from deepeval.evaluate import evaluate, AsyncConfig

# ---- Config ----
BASE_URL = os.getenv("BASE_URL", "http://localhost:8000")
FILE_PATH = Path(os.getenv("SAMPLE_FILE", "../sample_docs/Match_Summary.pdf")).resolve()

QUESTION = os.getenv(
    "QUESTION",
    "On which delivery was Tilak Varma dismissed and how?"
)

print("Backend:", BASE_URL)
print("File:", FILE_PATH)
print("Question:", QUESTION)

# ---- Upload (fresh session) ----
with open(FILE_PATH, "rb") as f:
    files = {"file": (FILE_PATH.name, f)}
    upload_res = requests.post(f"{BASE_URL}/upload", files=files, timeout=120)

upload_res.raise_for_status()
session_id = upload_res.json().get("session_id")

# ---- Ask ----
payload = {"session_id": session_id, "question": QUESTION}
ask_res = requests.post(f"{BASE_URL}/ask", json=payload, timeout=120)
ask_res.raise_for_status()

ask_data = ask_res.json()
answer = ask_data.get("answer", "")
retrieval_context = ask_data.get("retrieval_context", []) or []

print("Session:", session_id)
print("Input:", QUESTION)
print("Actual Output:", answer)
print("Retrieved Context Chunks:", len(retrieval_context))
print(f"Retrieved Context ({len(retrieval_context)}): {[c[:200] for c in retrieval_context]}")

# ---- Faithfulness ----
faithfulness_metric = FaithfulnessMetric()

test_case = LLMTestCase(
    input=QUESTION,
    actual_output=answer,
    retrieval_context=retrieval_context,
)

evaluate(
    test_cases=[test_case],
    metrics=[faithfulness_metric],
    async_config=AsyncConfig(run_async=False),
)


Backend: http://localhost:8000
File: /Users/shubhanshurastogi_1/Learning/rag-session-qa-eval/eval/sample_docs/Match_Summary.pdf
Question: On which delivery was Tilak Varma dismissed and how?
Session: 76621490-1a9f-437f-b86c-0854c4462ab7
Input: On which delivery was Tilak Varma dismissed and how?
Actual Output: Tilak Varma was dismissed on the 11th delivery by Marco Jansen, bowled behind his legs.
Retrieved Context Chunks: 5
Retrieved Context (5): ["aren't in use today so it took some time for everyone to realise what happened. A yorker from round the wicket, looks to flick it away, misses and it clips leg stump on the way. Tilak Varma b Marco Ja", 'f 8.5 6 Nortje to Tilak Varma, SIX, wow, just wow! Advances down to a fast length ball and smokes it over mid-on. Nortje is travelling the distance! 8.2 6 Nortje to Suryakumar Yadav, SIX, first-ball s', "or Linde at deep backward square and Surya's scratchy knock is over. Just never found his timing. Despite that, he walks off with 30 off ju

Output()



Metrics Summary

  - ✅ Faithfulness (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4.1, reason: The score is 1.00 because there are no contradictions listed, indicating the actual output aligns perfectly with the retrieval context. Great job staying faithful to the source!, error: None)

For test case:

  - input: On which delivery was Tilak Varma dismissed and how?
  - actual output: Tilak Varma was dismissed on the 11th delivery by Marco Jansen, bowled behind his legs.
  - expected output: None
  - context: None
  - retrieval context: ["aren't in use today so it took some time for everyone to realise what happened. A yorker from round the wicket, looks to flick it away, misses and it clips leg stump on the way. Tilak Varma b Marco Jansen 45(19) [4s-3 6s-3] 10.5 4 Marco Jansen to Tilak Varma, FOUR, round the wicket, short of length across off, and it's swatted wide of mid-on. Fielder gets a hand diving across but it still runs away 10.2 Marco Jansen to Suryakuma

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Faithfulness', threshold=0.5, success=True, score=1.0, reason='The score is 1.00 because there are no contradictions listed, indicating the actual output aligns perfectly with the retrieval context. Great job staying faithful to the source!', strict_mode=False, evaluation_model='gpt-4.1', error=None, evaluation_cost=0.008678, verbose_logs='Truths (limit=None):\n[\n    "Tilak Varma was bowled by Marco Jansen for 45 runs off 19 balls.",\n    "Suryakumar Yadav scored 30 runs off 16 balls, including 2 fours and 2 sixes, before being caught by Linde off the bowling of Kwena Maphaka.",\n    "Ishan Kishan scored fifty runs off just 23 balls and was retired out.",\n    "Marco Jansen bowled a yorker from round the wicket to dismiss Tilak Varma.",\n    "The zing bails were not in use during this match, which caused a delay in realizing Tilak Varma was bowled.",\n    "Suryakumar Yadav was dr

### Completeness

**Purpose of this metric:** Completeness evaluates whether the application's answer covers every part of the user's question, including any specifics it asks for. A higher score means fewer missing details. It is implemented here as a custom `GEval` metric that judges only the input and the actual output; the retrieved context is kept on the test case for inspection.

In [25]:
import os
from pathlib import Path
import requests

from deepeval.test_case import LLMTestCase, LLMTestCaseParams
from deepeval.metrics import GEval
from deepeval.evaluate import evaluate, AsyncConfig

# ---- Config ----
BASE_URL = os.getenv("BASE_URL", "http://localhost:8000")
FILE_PATH = Path(os.getenv("SAMPLE_FILE", "../sample_docs/Match_Summary.pdf")).resolve()

QUESTION = os.getenv(
    "QUESTION",
    "Which bowler conceded 49 runs in two overs and what key events happened during that spell?"
)

print("Backend:", BASE_URL)
print("File:", FILE_PATH)
print("Question:", QUESTION)

# ---- Upload to create a valid session (required if backend was restarted) ----
with open(FILE_PATH, "rb") as f:
    files = {"file": (FILE_PATH.name, f)}
    upload_res = requests.post(f"{BASE_URL}/upload", files=files, timeout=120)

upload_res.raise_for_status()
session_id = upload_res.json().get("session_id")

# ---- Ask the new question ----
payload = {"session_id": session_id, "question": QUESTION}
ask_res = requests.post(f"{BASE_URL}/ask", json=payload, timeout=120)
ask_res.raise_for_status()

ask_data = ask_res.json()
answer = ask_data.get("answer", "")
retrieval_context = ask_data.get("retrieval_context", []) or []

# ---- Debug prints ----
print("Session:", session_id)
print("Input:", QUESTION)
print("Actual Output:", answer)
print("Retrieved Context Chunks:", len(retrieval_context))
print(f"Retrieved Context ({len(retrieval_context)}): {[c[:200] for c in retrieval_context]}")

# ---- Completeness (Answer Completeness) ----
completeness_metric = GEval(
    name="Completeness",
    criteria=(
        "Judge whether the answer fully addresses the user's question. "
        "A complete answer covers all parts of the question and includes necessary specifics "
        "when requested. Penalize missing key details or partial coverage. "
        "Do not reward irrelevant extra information."
    ),
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
)

test_case = LLMTestCase(
    input=QUESTION,
    actual_output=answer,
    retrieval_context=retrieval_context,  # optional; kept for inspection
)

evaluate(
    test_cases=[test_case],
    metrics=[completeness_metric],
    async_config=AsyncConfig(run_async=False),
)


Backend: http://localhost:8000
File: /Users/shubhanshurastogi_1/Learning/rag-session-qa-eval/eval/sample_docs/Match_Summary.pdf
Question: Which bowler conceded 49 runs in two overs and what key events happened during that spell?
Session: 06c6c146-3712-45df-920b-0d5cf4a7b3c7
Input: Which bowler conceded 49 runs in two overs and what key events happened during that spell?
Actual Output: The bowler who conceded 49 runs in two overs is Nortje. During that spell, he was smashed for runs, and a key event was that Rinku Singh was caught by Stubbs after attempting to hit a pitched-up delivery downtown, which provided some respite for Nortje.
Retrieved Context Chunks: 5
Retrieved Context (5): ["or Linde at deep backward square and Surya's scratchy knock is over. Just never found his timing. Despite that, he walks off with 30 off just 16. Suryakumar Yadav c Linde b Kwena Maphaka 30(16) [4s-2 ", "aren't in use today so it took some time for everyone to realise what happened. A yorker from round t

Output()



Metrics Summary

  - ✅ Completeness [GEval] (score: 0.6971465031793828, threshold: 0.5, strict: False, evaluation model: gpt-4.1, reason: The response correctly identifies Nortje as the bowler who conceded 49 runs in two overs and mentions a key event (Rinku Singh being caught by Stubbs). However, it lacks detail about other key events during the spell, such as the sequence of boundaries or sixes, and does not specify the match context. The answer is focused and relevant but misses some specifics requested by the question., error: None)

For test case:

  - input: Which bowler conceded 49 runs in two overs and what key events happened during that spell?
  - actual output: The bowler who conceded 49 runs in two overs is Nortje. During that spell, he was smashed for runs, and a key event was that Rinku Singh was caught by Stubbs after attempting to hit a pitched-up delivery downtown, which provided some respite for Nortje.
  - expected output: None
  - context: None
  - retrieval con

EvaluationResult(test_results=[TestResult(name='test_case_0', success=True, metrics_data=[MetricData(name='Completeness [GEval]', threshold=0.5, success=True, score=0.6971465031793828, reason='The response correctly identifies Nortje as the bowler who conceded 49 runs in two overs and mentions a key event (Rinku Singh being caught by Stubbs). However, it lacks detail about other key events during the spell, such as the sequence of boundaries or sixes, and does not specify the match context. The answer is focused and relevant but misses some specifics requested by the question.', strict_mode=False, evaluation_model='gpt-4.1', error=None, evaluation_cost=0.0024619999999999998, verbose_logs='Criteria:\nJudge whether the answer fully addresses the user\'s question. A complete answer covers all parts of the question and includes necessary specifics when requested. Penalize missing key details or partial coverage. Do not reward irrelevant extra information. \n \nEvaluation Steps:\n[\n    "Re

### Combined View (reuse previous results)

This section does not run evaluation again. It only formats scores from the three metric objects already computed above.
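
The row-building idea can be tried without pandas or a live evaluation. The sketch below uses `SimpleNamespace` stand-ins for already-measured metric objects; `summary_row` is a hypothetical helper, and deriving PASS/FAIL from `score >= threshold` is an assumption made for this sketch (the notebook cell itself reads the metric's `success` attribute).

```python
from types import SimpleNamespace


def summary_row(name, metric_obj):
    """Build one plain-dict summary row from an already-measured metric object."""
    if metric_obj is None:
        # The metric's section was never executed in this session.
        return {"metric": name, "result": "NOT RUN", "score_pct": None}
    score = getattr(metric_obj, "score", None)
    threshold = getattr(metric_obj, "threshold", None)
    if score is None or threshold is None:
        return {"metric": name, "result": "N/A", "score_pct": None}
    # Assumption: a metric passes when its score meets or exceeds the threshold.
    result = "PASS" if score >= threshold else "FAIL"
    return {"metric": name, "result": result, "score_pct": round(score * 100, 2)}


# Stand-in carrying the Completeness score and threshold reported in this notebook.
measured = SimpleNamespace(score=0.6971465031793828, threshold=0.5)
rows = [summary_row("Completeness", measured), summary_row("Faithfulness", None)]
for row in rows:
    print(row)
```

Passing `None` for a metric that has not been run mirrors the `globals().get(...)` lookup used in the cell that follows.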


In [29]:
import pandas as pd

# Reuse already-computed metric objects from previous sections
metric_objects = [
    ("Answer Relevancy", globals().get("metric")),
    ("Faithfulness", globals().get("faithfulness_metric")),
    ("Completeness", globals().get("completeness_metric")),
]

rows = []
for fallback_name, metric_obj in metric_objects:
    if metric_obj is None:
        rows.append({
            "metric": fallback_name,
            "status_icon": "⬜",
            "result": "NOT RUN",
            "score": None,
            "threshold": None,
            "score_pct": None,
            "threshold_pct": None,
            "reason": "Run its section above first.",
        })
        continue

    score = getattr(metric_obj, "score", None)
    threshold = getattr(metric_obj, "threshold", None)
    success = getattr(metric_obj, "success", None)
    reason = getattr(metric_obj, "reason", None)

    if success is True:
        result = "PASS"
        status_icon = "✅"
    elif success is False:
        result = "FAIL"
        status_icon = "❌"
    else:
        result = "N/A"
        status_icon = "⬜"

    rows.append({
        "metric": getattr(metric_obj, "name", fallback_name),
        "status_icon": status_icon,
        "result": result,
        "score": score,
        "threshold": threshold,
        "score_pct": round(score * 100, 2) if isinstance(score, (int, float)) else None,
        "threshold_pct": round(threshold * 100, 2) if isinstance(threshold, (int, float)) else None,
        "reason": reason,
    })

summary_df = pd.DataFrame(rows)

display(summary_df[["metric", "status_icon", "result", "score", "threshold", "score_pct", "threshold_pct", "reason"]])

def color_by_threshold(row):
    styles = [""] * len(row.index)
    score_pct = row.get("score_pct")
    threshold_pct = row.get("threshold_pct")

    if pd.notna(score_pct) and pd.notna(threshold_pct):
        passed = score_pct >= threshold_pct
        color = "#166534" if passed else "#b91c1c"
        score_idx = row.index.get_loc("score_pct")
        styles[score_idx] = f"background-color: {color}; color: white;"

    return styles

# Cleaner demo table
styled_summary = (
    summary_df[["metric", "status_icon", "result", "score_pct", "threshold_pct", "reason"]]
    .style
    .hide(axis="index")
    .format(
        {
            "score_pct": lambda v: "" if pd.isna(v) else f"{v:.2f}%",
            "threshold_pct": lambda v: "" if pd.isna(v) else f"{v:.2f}%",
        }
    )
    .apply(color_by_threshold, axis=1)
)

styled_summary



| | metric | status_icon | result | score | threshold | score_pct | threshold_pct | reason |
|---|---|---|---|---|---|---|---|---|
| 0 | Answer Relevancy | ✅ | PASS | 1.0 | 0.5 | 100.0 | 50.0 | The score is 1.00 because the answer was fully... |
| 1 | Faithfulness | ✅ | PASS | 1.0 | 0.5 | 100.0 | 50.0 | The score is 1.00 because there are no contrad... |
| 2 | Completeness | ✅ | PASS | 0.697147 | 0.5 | 69.71 | 50.0 | The response correctly identifies Nortje as th... |


| metric | status_icon | result | score_pct | threshold_pct | reason |
|---|---|---|---|---|---|
| Answer Relevancy | ✅ | PASS | 100.00% | 50.00% | The score is 1.00 because the answer was fully relevant and directly addressed the question without any irrelevant information. Great job! |
| Faithfulness | ✅ | PASS | 100.00% | 50.00% | The score is 1.00 because there are no contradictions listed, indicating the actual output aligns perfectly with the retrieval context. Great job staying faithful to the source! |
| Completeness | ✅ | PASS | 69.71% | 50.00% | The response correctly identifies Nortje as the bowler who conceded 49 runs in two overs and mentions a key event (Rinku Singh being caught by Stubbs). However, it lacks detail about other key events during the spell, such as the sequence of boundaries or sixes, and does not specify the match context. The answer is focused and relevant but misses some specifics requested by the question. |
