This is a walkthrough of the basic functionality of the `Judgeval` library.
First, let's set up our client.

In [2]:
import sys
sys.path.append("/Users/alexshan/Desktop/judgment_labs/judgeval/")  # root of the local judgeval checkout; adjust for your machine

# We need to ensure that our environment variables are set up with our Judgment API key.
from dotenv import load_dotenv
load_dotenv()

from judgeval.judgment_client import JudgmentClient

client = JudgmentClient()

Successfully initialized JudgmentClient, welcome back Joseph Camyre!
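If the API key is missing, the client setup above will not work, so it can help to fail early with a clear message. The following is a minimal sketch using only the standard library. The variable name `JUDGMENT_API_KEY` and the helper `require_env` are assumptions made for illustration, not names confirmed by this guide.

```python
import os

# Assumed variable name, for illustration only; check the Judgeval docs for the real one.
API_KEY_VAR = "JUDGMENT_API_KEY"


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "")
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or shell environment."
        )
    return value
```

Calling `require_env(API_KEY_VAR)` right after `load_dotenv()` would surface a missing key before the client is constructed.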


**Let's set up our first experiment for evaluation!**

Before jumping into the different kinds of evaluations we can run with `Judgeval`, let's walk through a toy example:

Imagine we're building a customer support bot for a clothing company. We might have a RAG system set up to help the bot answer questions that require examining policies, such as the return policy. We would therefore be interested in measuring whether our bot's answers are consistent with our policies.

Let's use `Judgeval` to test whether our bot is working as intended. First, we'll set up an `Example`, which represents a single case that we would like to evaluate our system (bot) on.

In [3]:
from judgeval.data import Example 

return_policy_example = Example(
    # User query
    input="What if I bought the wrong color or size?",
    # Fill this in with the output of your system
    actual_output=("You can return the item within 30 days of purchase for a full refund." 
                   "You do not need to present your receipt."),
    retrieval_context=[
        "Customers may return items for a full refund within 30 days of purchase.", 
        "To return an item, customers must have the purchase receipt in physical or digital form."
    ]
)
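One detail worth noticing in the `actual_output` above: Python joins adjacent string literals with no separator, so the two sentences end up fused as `refund.You`. The short sketch below illustrates this behavior in plain Python; the variable names `joined` and `spaced` are only for illustration.

```python
# Adjacent string literals are concatenated with no separator between them.
joined = ("You can return the item within 30 days of purchase for a full refund."
          "You do not need to present your receipt.")
print("refund.You" in joined)  # True

# A trailing space on the first literal keeps the two sentences separate.
spaced = ("You can return the item within 30 days of purchase for a full refund. "
          "You do not need to present your receipt.")
print("refund. You" in spaced)  # True
```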

In order to determine whether the `actual_output` of our bot is consistent with our policies (the `retrieval_context`), we can use Judgment's `Faithfulness` scorer to execute an evaluation.

In this case, we want to make sure that everything our bot says is factually consistent with our retrieved documents. We can do this by setting the `threshold` of the `JudgmentScorer` to 1.0, meaning that the evaluation passes only if 100% of the claims within `actual_output` do NOT contradict the retrieved information.
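To make the threshold concrete, here is a rough sketch of the idea in plain Python. It is not Judgeval's implementation: it assumes the score is the fraction of claims judged consistent with the context and that a score at or above the threshold passes. The helpers `consistent_fraction` and `passes` are hypothetical.

```python
def consistent_fraction(verdicts: list[str]) -> float:
    """Fraction of claims whose verdict is 'yes' (assumed scoring rule)."""
    if not verdicts:
        raise ValueError("at least one verdict is required")
    return sum(v == "yes" for v in verdicts) / len(verdicts)


def passes(score: float, threshold: float) -> bool:
    """Assumed pass rule: the score must meet or exceed the threshold."""
    return score >= threshold


# Three of four hypothetical claims are consistent with the context.
score = consistent_fraction(["yes", "yes", "no", "yes"])
print(score)               # 0.75
print(passes(score, 1.0))  # False: a single contradiction fails a 1.0 threshold
print(passes(score, 0.7))  # True
```

Under these assumptions, a threshold of 1.0 behaves like a strict unit test: any contradicted claim fails the example.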

In [4]:
from judgeval.scorers import JudgmentScorer
from judgeval.constants import JudgmentMetric
from judgeval.evaluation_run import EvaluationRun


faithfulness = JudgmentScorer(
    threshold=1.0,  # Performance bar for a passing score, if running a test.
    score_type=JudgmentMetric.FAITHFULNESS
)

# Set up our evaluation with our Scorer, and choose GPT-4.1 as our judge LLM.
# Named eval_run to avoid shadowing Python's built-in eval().
eval_run = EvaluationRun(
    examples=[return_policy_example],
    scorers=[faithfulness],
    model="gpt-4.1"
)

results = client.run_eval(eval_run)

print(results)


  Expected `enum` but got `str` with value `'faithfulness'` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(


[ScoringResult(success=False, scorers_data=[{'name': 'Faithfulness', 'threshold': 1.0, 'success': False, 'score': 0.5, 'reason': 'The score is 0.50 because the actual output incorrectly states that you do not need to present your receipt to return the item, which contradicts the retrieval context that explicitly requires a purchase receipt in physical or digital form for returns.', 'strict_mode': False, 'evaluation_model': 'gpt-4.1', 'error': None, 'evaluation_cost': None, 'verbose_logs': 'Claims:\n[\n    {\'claim\': \'You can return the item within 30 days of purchase for a full refund.\', \'quote\': \'You can return the item within 30 days of purchase for a full refund.You do not need to present your receipt.\'},\n    {\'claim\': \'You do not need to present your receipt to return the item.\', \'quote\': \'You can return the item within 30 days of purchase for a full refund.You do not need to present your receipt.\'}\n] \n \nVerdicts:\n[\n    {\n        "verdict": "yes",\n        "re

Woohoo! We've executed our first `Judgeval` evaluation. If we're just interested in measuring the score of an evaluation, this guide is complete. 

However, we might want to do some more analysis to see exactly why the evaluation score is 0.5 in this case.

Let's break down the `ScoringResult` that we received by running our evaluation!

Our evaluation has failed, meaning that when treating the `Example` as a unit test, our bot did not produce an answer that was 100% factually consistent with the retrieved documents. Let's dig into what happened.

In [10]:
result = results[0]
print(f"Success? {result.success}")

score = result.scorers_data[0].get('score')
print(f"Score: {score} ({score * 100}%)")
print(f"Reasoning: {result.scorers_data[0].get('reason')}")

Success? False
Score: 0.5 (50.0%)
Reasoning: The score is 0.50 because the actual output incorrectly states that you do not need to present your receipt to return the item, which contradicts the retrieval context that explicitly requires a purchase receipt in physical or digital form for returns.
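The cell above reads only the first entry of `scorers_data`. When an evaluation uses several scorers, a loop gives a one-line summary per scorer. The sketch below works on plain dictionaries shaped like the printed output, with values copied from it; `summarize` is a hypothetical helper, not part of `Judgeval`.

```python
# Entries shaped like the dictionaries printed in the ScoringResult above.
scorers_data = [
    {"name": "Faithfulness", "threshold": 1.0, "success": False, "score": 0.5},
]


def summarize(scorers_data: list[dict]) -> list[str]:
    """Build one summary line per scorer entry."""
    lines = []
    for entry in scorers_data:
        status = "PASS" if entry.get("success") else "FAIL"
        lines.append(
            f"{entry.get('name')}: {status} "
            f"(score {entry.get('score')}, threshold {entry.get('threshold')})"
        )
    return lines


for line in summarize(scorers_data):
    print(line)
# Faithfulness: FAIL (score 0.5, threshold 1.0)
```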


Aha! Our bot did indeed hallucinate that customers do not need to present a receipt when returning items. Let's inspect the `result` for more detail on how `Judgeval`'s scorer found this contradiction:

In [11]:
metadata = result.scorers_data[0].get('additional_metadata')

print(f"Additional Metadata: {metadata}")

Additional Metadata: {'claims': [{'claim': 'You can return the item within 30 days of purchase for a full refund.', 'quote': 'You can return the item within 30 days of purchase for a full refund.You do not need to present your receipt.'}, {'claim': 'You do not need to present your receipt to return the item.', 'quote': 'You can return the item within 30 days of purchase for a full refund.You do not need to present your receipt.'}], 'verdicts': [{'verdict': 'yes', 'reason': "The claim that you can return the item within 30 days of purchase for a full refund is supported by the retrieval context. Quote: 'Customers may return items for a full refund within 30 days of purchase.'"}, {'verdict': 'no', 'reason': "The claim that you do not need to present your receipt to return the item is contradicted by the retrieval context. Quote: 'To return an item, customers must have the purchase receipt in physical or digital form.'"}]}


Formatted for readability, the metadata looks like this:

```
{
    "claims": [
        {
            "claim": "You can return the item within 30 days of purchase for a full refund.",
            "quote": "You can return the item within 30 days of purchase for a full refund.You do not need to present your receipt."
        },
        {
            "claim": "You do not need to present your receipt to return the item.",
            "quote": "You can return the item within 30 days of purchase for a full refund.You do not need to present your receipt."
        }
    ],
    "verdicts": [
        {
            "verdict": "yes",
            "reason": "The claim that you can return the item within 30 days of purchase for a full refund is supported by the retrieval context. Quote: 'Customers may return items for a full refund within 30 days of purchase.'"
        },
        {
            "verdict": "no",
            "reason": "The claim that you do not need to present your receipt to return the item is contradicted by the retrieval context. Quote: 'To return an item, customers must have the purchase receipt in physical or digital form.'"
        }
    ]
}
```
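The same information can be pulled out programmatically. The sketch below assumes that `claims` and `verdicts` are index-aligned, which matches the output above but is not something this guide states explicitly. The `metadata` dictionary is an abbreviated copy of that output, and `contradicted_claims` is a hypothetical helper.

```python
# Abbreviated copy of the metadata shown above (quotes and reasons omitted).
metadata = {
    "claims": [
        {"claim": "You can return the item within 30 days of purchase for a full refund."},
        {"claim": "You do not need to present your receipt to return the item."},
    ],
    "verdicts": [{"verdict": "yes"}, {"verdict": "no"}],
}


def contradicted_claims(metadata: dict) -> list[str]:
    """Return the claims whose verdict is 'no', assuming aligned ordering."""
    pairs = zip(metadata["claims"], metadata["verdicts"])
    return [claim["claim"] for claim, verdict in pairs if verdict["verdict"] == "no"]


for claim in contradicted_claims(metadata):
    print(claim)
# You do not need to present your receipt to return the item.
```

One of the two claims is contradicted, which is consistent with the 0.5 score reported earlier.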

Now we know exactly what our bot hallucinated! Using this information, we can fix our bot's response and, hopefully, get a perfect score the next time we run the evaluation.