In [1]:
import sys
#!{sys.executable} -m pip install --upgrade deepeval openai

In [2]:
!python --version

Python 3.10.0


In [3]:
#!pip list

In [5]:
import os
import getpass
os.environ["GOOGLEAI_API_KEY"] = getpass.getpass(
    prompt="Enter your GOOGLEAI_API_KEY: "
)


Enter your GOOGLEAI_API_KEY:  ········


In [6]:
import os
from deepeval.models import GeminiModel
from deepeval.test_case import LLMTestCase
from deepeval.metrics import FaithfulnessMetric

In [7]:
## Create a GeminiModel with your AI Studio key  
EVAL_MODEL        = "gemini-1.5-pro"
GOOGLEAI_API_KEY  = os.environ["GOOGLEAI_API_KEY"]

In [8]:
eval_model = GeminiModel(
    model_name=EVAL_MODEL,
    api_key=GOOGLEAI_API_KEY
)

## Use that model in your FaithfulnessMetric  
metric = FaithfulnessMetric(
    model=eval_model,
    include_reason=True
)

## 1. Misinformation baked into the context

The context itself is simply wrong, so a perfectly “grounded” answer is wrong too.

In [9]:
tc = LLMTestCase(
    input="What is the boiling point of water at sea level in Celsius?",
    retrieval_context=[
        ## Flat-out misinformation
        "At standard atmospheric pressure, water boils at 110 °C."
    ],
    actual_output="Water boils at 110 °C at sea level."
)

metric.measure(tc)
print(metric.score, metric.reason)   # ≈ 1.00, “All claims supported…”

Output()

1.0 The score is 1.00 because the actual output aligns perfectly with the retrieval context, with no contradictions present.


## 2. Invented co-author (hallucination lives only in the context)

The context adds a spurious co-author and the answer repeats it, so the score is still 1.0.

In [10]:
tc = LLMTestCase(
    input="Who wrote the novel *Frankenstein*?",
    retrieval_context=[
        # Hallucinated collaborator
        "Mary Shelley co-wrote *Frankenstein* with Lord Byron in 1818."
    ],
    actual_output="*Frankenstein* was co-written by Mary Shelley and Lord Byron."
)

metric.measure(tc)
print(metric.score, metric.reason)   # ≈ 1.00

Output()

1.0 The score is 1.00 because the actual output aligns perfectly with the retrieval context, with no contradictions present.


## 3. Out-of-date context

Context was true once, but the world has changed. Answer is outdated yet perfectly faithful.


In [11]:
tc = LLMTestCase(
    input="Who is the current CEO of Amazon?",
    retrieval_context=[
        # Fact that used to be true
        "Jeff Bezos is the current CEO of Amazon."
    ],
    actual_output="Jeff Bezos is the current CEO of Amazon."
)

metric.measure(tc)
print(metric.score, metric.reason)   # ≈ 1.00

Output()

1.0 The score is 1.00 because the actual output aligns perfectly with the retrieval context, with no contradictions present.


### Take-away

`FaithfulnessMetric` checks only whether the answer is consistent with the retrieved context; it never checks the context against the real world. All three cases above score 1.0 even though every answer is factually wrong.

To catch externally incorrect context, pair faithfulness with a second metric (e.g. truthfulness, reference-based QA, or retrieval precision), or make sure your retrieval step always pulls authoritative documents. That combination catches both hallucinations and poisoned context.
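
A minimal sketch of how a faithfulness score could be paired with a reference check, assuming you have a trusted reference answer for each query. The helper names (`normalize`, `matches_reference`, `combined_verdict`) and the 0.5 threshold are illustrative assumptions, not part of DeepEval, and the word-level string match only stands in for whatever truthfulness or reference-based metric you actually use.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse every run of non-alphanumeric characters to one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def matches_reference(answer: str, reference: str) -> bool:
    """Crude stand-in for a reference-based check: the reference phrase
    must appear in the answer as whole words."""
    return f" {normalize(reference)} " in f" {normalize(answer)} "


def combined_verdict(
    faithfulness_score: float,
    answer: str,
    reference: str,
    threshold: float = 0.5,  # hypothetical cut-off, tune for your data
) -> str:
    """Combine a faithfulness score with the reference check."""
    faithful = faithfulness_score >= threshold
    correct = matches_reference(answer, reference)
    if faithful and correct:
        return "pass"
    if faithful:
        return "suspect context"  # grounded in the context, but the context disagrees with the reference
    if correct:
        return "correct but ungrounded"
    return "fail"


# Case 1 from above: faithfulness is 1.0, yet the answer disagrees with the reference.
print(combined_verdict(1.0, "Water boils at 110 °C at sea level.", "100 °C"))
```

A "suspect context" result points at the retrieval step rather than at the generator, which is the failure mode the three cases above share.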

In [12]:
## Keep the query + context constant  
query = "Who wrote the novel '1984'?"
retrieval_context = [
    "George Orwell wrote the dystopian novel 'Nineteen Eighty-Four' (often "
    "just called '1984'); it was published in 1949."
]

## Create three answers of increasing factual quality
## (trailing comments are the scores expected before running)
answers = {
    "unfaithful":  "Ernest Hemingway wrote 1984.",                           # expected ≈ 0.00
    "partly true": "George Orwell wrote 1984 together with Aldous Huxley.",  # expected ≈ 0.50, but scored 0.00 in the run below
    "faithful":    "George Orwell wrote 1984."                               # expected ≈ 1.00
}

## Evaluate each answer 
for label, answer in answers.items():
    tc = LLMTestCase(
        input=query,
        actual_output=answer,
        retrieval_context=retrieval_context
    )
    metric.measure(tc)
    print(f"{label:12} ⇢ score={metric.score:.2f} • reason={metric.reason}")


Output()

Output()

unfaithful   ⇢ score=0.00 • reason=The score is 0.00 because the actual output incorrectly attributes the authorship of "Nineteen Eighty-Four" to Ernest Hemingway instead of the true author, George Orwell.


Output()

partly true  ⇢ score=0.00 • reason=The score is 0.00 because the actual output incorrectly states that Aldous Huxley co-authored "1984" with George Orwell.  The provided context clearly indicates that George Orwell was the sole author.


faithful     ⇢ score=1.00 • reason=The score is 1.00 because the actual output aligns perfectly with the retrieval context, with no contradictions present.


In [13]:
## Query and retrieval context with two atomic facts
query = "Give me two facts about the Great Wall of China."
retrieval_context = [
    # Fact A – dynasty
    "The Great Wall of China was completed mainly during the Ming Dynasty.",
    # Fact B – length
    "The Great Wall of China is about 21,196 kilometres long."
]

## Answer: keeps Fact A correct but muddles Fact B
answer = (
    "The Great Wall of China was completed mainly during the Ming Dynasty "
    "and is about 10,000 kilometres long."
)

##  Evaluate
tc = LLMTestCase(
    input=query,
    actual_output=answer,
    retrieval_context=retrieval_context
)
metric.measure(tc)
print(f"faithfulness={metric.score:.2f} • reason={metric.reason}")


Output()

faithfulness=0.50 • reason=The score is 0.50 because the actual output incorrectly states the length of the Great Wall of China as 10,000 kilometers, contradicting the provided text which clearly indicates its length as approximately 21,196 kilometers. This significant discrepancy warrants a lower faithfulness score.


#### Supported claims ÷ total claims = 1 ÷ 2 = 0.5.
DeepEval breaks the answer into two atomic claims (the dynasty and the length) and asks the judge LLM for a verdict on each. The score is the fraction of claims that do not contradict the context, so one supported claim out of two gives 0.5, and the reason explains which claim conflicts.
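
The ratio can be sketched in plain Python. This illustrates the arithmetic only, assuming each claim has already been judged as supported or not; `faithfulness_ratio` and the hand-written verdicts are hypothetical and do not call DeepEval.

```python
def faithfulness_ratio(verdicts: list[bool]) -> float:
    """Return supported claims / total claims.

    Scoring an answer with no claims as 1.0 is an assumption of this sketch.
    """
    if not verdicts:
        return 1.0
    return sum(verdicts) / len(verdicts)


# Hand-labelled verdicts for the Great Wall answer: one claim per fact.
great_wall = {
    "completed mainly during the Ming Dynasty": True,   # agrees with Fact A
    "about 10,000 kilometres long": False,              # contradicts Fact B
}
print(f"{faithfulness_ratio(list(great_wall.values())):.2f}")  # 0.50

# If a whole answer is extracted as a single claim and that claim is
# contradicted, the ratio collapses to 0 / 1.
print(f"{faithfulness_ratio([False]):.2f}")  # 0.00
```

The second case is one possible explanation for the "partly true" Orwell/Huxley answer scoring 0.00 rather than 0.50: the score depends on how finely the answer is split into claims.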