This is a walkthrough of the basic functionality of the Judgeval library.
First, let's set up our client.

In [2]:
import sys
sys.path.append("/Users/alexshan/Desktop/judgment_labs/judgeval/")  # root of judgeval

# We need to ensure that our environment variables are set up with our Judgment API key.
from dotenv import load_dotenv
load_dotenv(dotenv_path="/Users/alexshan/Desktop/judgment_labs/judgeval/judgeval/.env")

import os
os.environ["JUDGMENT_API_KEY"] = "<your-judgment-api-key>"  # TODO: use your own key; never commit a real key
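
Hardcoding a key in a notebook makes it easy to leak. As a minimal sketch, the key can instead be read from the environment (for example after `load_dotenv`), failing early with a clear message if it is missing. `require_env` is a hypothetical helper introduced here for illustration; it is not part of Judgeval.

```python
import os


def require_env(name: str) -> str:
    """Return the value of environment variable `name`, or raise if unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it in your shell."
        )
    return value
```

Calling `require_env("JUDGMENT_API_KEY")` before creating the client surfaces a missing key immediately rather than at the first API call.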

In [3]:
from judgeval.judgment_client import JudgmentClient


client = JudgmentClient()



Successfully initialized JudgmentClient, welcome back Joseph Camyre!


**Let's set up our first experiment for evaluation!**

In this demo, we show how a legal company might use Judgeval to evaluate its
document-generation workflows.

Imagine we're using an LLM to write documents (e.g. letters) supporting the immigration cases of
highly skilled entrepreneurs, engineers, and leaders. However, we are wary of the LLM hallucinating
facts and stories in the letter, which could result in the applicant's case being rejected.
Therefore, we want to make sure that our letters are supported by documents that contain verifiable
information about the applicant.

**Here's an example of an LLM-generated letter:**

"Dear Sir or Madam:

I am writing this recommendation on behalf of Ms. Aria Tanaka, a contemporary dancer and choreographer whose innovative work has made a significant impact on the international dance scene. As a prominent figure in the Japanese contemporary dance community...

...By way of introduction, my name is Saburo Teshigawara, and I am a choreographer, dancer, and director known for my experimental and visually stunning dance works...

In 1985, I founded the dance company KARAS, through which I have created numerous acclaimed productions..."

We also have the ground truth data for our example, such as the resume and personal documents written by the immigration applicant and their recommender. 

We are interested in evaluating whether the letter has any hallucinated content, i.e. statements that are unsubstantiated by the ground truth evidence.

Now that our goal is clear, let's set up an evaluation using `Judgeval`! First, we take the prompt for the task we are running; it will serve as the `input` of our example.

In [4]:
TASK_INPUT = (
    "You are an immigration lawyer. Your task is to write a letter supporting the immigration case for an "
    "exceptional individual. To write the letter, you have access to two documents: a document from the "
    "beneficiary and a document from the recommender. The beneficiary document contains information about the "
    "beneficiary's background, achievements, and reasons for immigrating. The recommender document contains "
    "information about the recommender's relationship with the beneficiary and why they believe the beneficiary "
    "should be granted entry. Write a letter supporting the beneficiary from the perspective of the recommender."
)  # we already have this on hand

Next, let's load our data. Since our workflow is already built, we simply run it on our inputs and save the results. We now have the `input` from the last step, the `output` (the letter), and the ground truth information for the letter.

In [5]:
PATH_TO_LETTER = "/Users/alexshan/Desktop/judgment_labs/judgeval/docs/demo_files/tanaka.txt"
# The "resume" is the beneficiary's document.
PATH_TO_RESUME = "/Users/alexshan/Desktop/judgment_labs/judgeval/docs/demo_files/tanaka_beneficiary.txt"
# NOTE: despite the variable name, this path points to the recommender's document.
PATH_TO_BENEFICIARY_INFO = "/Users/alexshan/Desktop/judgment_labs/judgeval/docs/demo_files/tanaka_recommender.txt"

with open(PATH_TO_LETTER, "r") as letter_file, \
    open(PATH_TO_RESUME, "r") as resume_file, \
    open(PATH_TO_BENEFICIARY_INFO, "r") as beneficiary_file:
    letter = letter_file.read()
    resume = resume_file.read()
    beneficiary_info = beneficiary_file.read()
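
The absolute paths above only work on one machine. As a minimal sketch, the same files can be loaded relative to a base directory with `pathlib`, with an explicit encoding and a clear error when a file is missing. `read_demo_files` and the labels passed to it are hypothetical names used for illustration.

```python
from pathlib import Path


def read_demo_files(base_dir: Path, names: dict[str, str]) -> dict[str, str]:
    """Read each named text file under base_dir and return {label: contents}."""
    contents = {}
    for label, filename in names.items():
        path = base_dir / filename
        if not path.is_file():
            raise FileNotFoundError(f"Missing demo file for {label!r}: {path}")
        contents[label] = path.read_text(encoding="utf-8")
    return contents


# Example call (assumes the notebook runs from the repository root):
# files = read_demo_files(
#     Path("docs/demo_files"),
#     {
#         "letter": "tanaka.txt",
#         "beneficiary": "tanaka_beneficiary.txt",
#         "recommender": "tanaka_recommender.txt",
#     },
# )
```

Labelling the files by role (letter, beneficiary, recommender) also keeps the variable names aligned with what each file contains.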

Now, we have all of the data needed to run our evaluation!

In [6]:
from judgeval.constants import JudgmentMetric
from judgeval.data import Example
from judgeval.scorers import JudgmentScorer

# One example: the prompt, the generated letter, and the ground truth documents.
example_1 = Example(
    input=TASK_INPUT,
    actual_output=letter,
    retrieval_context=[resume, beneficiary_info],
)

# The example passes if its faithfulness score reaches the threshold.
faithfulness = JudgmentScorer(
    threshold=0.5,
    score_type=JudgmentMetric.FAITHFULNESS,
)

results = client.run_evaluation(
    examples=[example_1],
    scorers=[faithfulness],
    model="gpt-4.1"
)

print(results)

  Expected `enum` but got `str` with value `'faithfulness'` - serialized value may not be as expected
  return self.__pydantic_serializer__.to_python(


filtered_result={'success': True, 'metrics_data': [{'name': 'Faithfulness', 'threshold': 0.5, 'success': True, 'score': 1.0, 'reason': 'The score is 1.00 because there are no contradictions between the actual output and the retrieval context. Everything aligns perfectly!', 'strict_mode': False, 'evaluation_model': 'gpt-4.1', 'error': None, 'evaluation_cost': None, 'verbose_logs': 'Claims:\n[\n    {\'claim\': \'Ms. Aria Tanaka is a contemporary dancer and choreographer whose innovative work has made a significant impact on the international dance scene.\', \'quote\': \'I am writing this recommendation on behalf of Ms. Aria Tanaka, a contemporary dancer and choreographer whose innovative work has made a significant impact on the international dance scene.\'},\n    {\'claim\': \'Saburo Teshigawara is a choreographer, dancer, and director known for experimental and visually stunning dance works.\', \'quote\': \'By way of introduction, my name is Saburo Teshigawara, and I am a choreographer

Let's inspect this output:
```
[ScoringResult(success=True, metrics_data=[
    {
        'name': 'Faithfulness',
        'threshold': 0.5,
        'success': True,
        'score': 1.0,
        'reason': 'The score is 1.00 because there are no contradictions between the actual output and the retrieval context. Everything aligns perfectly!',
        'strict_mode': False,
        'evaluation_model': 'gpt-4.1',
        'error': None,
        'additional_metadata': {
            'claims': [
                {
                    'claim': "Ms. Aria Tanaka served as a guest choreographer for the Netherlands Dance Theater's NDT 2 in 2021.",
                    'quote': "In 2021, she was invited to serve as a guest choreographer for the Netherlands Dance Theater's NDT 2."
                }
            ],
            'verdicts': [
                {
                    'verdict': 'yes',
                    'reason': "The retrieval context supports the claim that Aria Tanaka served as a guest choreographer for the Netherlands Dance Theater's NDT 2 in 2021. Quote: 'Guest Choreographer - Netherlands Dance Theater's NDT 2 (2021).'"
                }
            ]
        }
    }
])]
```
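
The structure above pairs each extracted claim with a verdict. As a minimal sketch, a single `metrics_data` entry can be condensed into a short summary that lists any unsupported claims. This relies only on the dictionary keys visible in the output above; `summarize_metric` is a hypothetical helper, and two points are assumptions rather than documented behaviour: that `claims` and `verdicts` are index-aligned, and that a metric passes when `score >= threshold`.

```python
def summarize_metric(metric: dict) -> dict:
    """Summarize one `metrics_data` entry using the fields shown in the output."""
    metadata = metric.get("additional_metadata") or {}
    claims = metadata.get("claims", [])
    verdicts = metadata.get("verdicts", [])

    # Assumption: the i-th verdict refers to the i-th claim.
    pairs = list(zip(claims, verdicts))
    unsupported = [
        claim["claim"]
        for claim, verdict in pairs
        if verdict["verdict"].strip().lower() != "yes"
    ]
    supported_fraction = (
        (len(pairs) - len(unsupported)) / len(pairs) if pairs else None
    )
    return {
        "name": metric["name"],
        # Assumption: passing means the score reaches the threshold.
        "passed": metric["score"] >= metric["threshold"],
        "supported_fraction": supported_fraction,
        "unsupported_claims": unsupported,
    }
```

For a letter with hallucinated content, `unsupported_claims` would give a reviewer the exact statements to check against the applicant's documents.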

We've successfully run an evaluation! In production, we will usually have many examples to work with, which can be stored on the Judgment platform as a `Dataset`. Here's a short guide to creating datasets with the Judgment API.

In [7]:
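# Create a dataset locally and add an example to it.
dataset = client.create_dataset()
dataset.add_example(Example(input="input 1", actual_output="output 1"))

# Push the dataset to the Judgment platform under an alias.
client.push_dataset(alias="test_dataset_5", dataset=dataset, overwrite=False)

# Pull the dataset back from the platform by its alias.
dataset = client.pull_dataset(alias="test_dataset_5")
print(dataset)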

EvalDataset(ground_truths=[], examples=[Example(input='input 1', actual_output='output 1', expected_output=None, context=None, retrieval_context=None, additional_metadata=None, tools_called=None, expected_tools=None, name=None), Example(input='input 1', actual_output='output 1', expected_output=None, context=None, retrieval_context=None, additional_metadata=None, tools_called=None, expected_tools=None, name=None), Example(input='input 1', actual_output='output 1', expected_output=None, context=None, retrieval_context=None, additional_metadata=None, tools_called=None, expected_tools=None, name=None), Example(input='input 1', actual_output='output 1', expected_output=None, context=None, retrieval_context=None, additional_metadata=None, tools_called=None, expected_tools=None, name=None), Example(input='input 1', actual_output='output 1', expected_output=None, context=None, retrieval_context=None, additional_metadata=None, tools_called=None, expected_tools=None, name=None), Example(input='
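
The dataset printed above contains the same example several times. One plausible cause, which the source does not confirm, is that the cell was run repeatedly against the same alias with `overwrite=False`. As a minimal sketch, duplicates can be dropped locally before pushing. `DemoExample` is a hypothetical stand-in for Judgeval's `Example`, reduced to the two fields used here; `dedupe_examples` only assumes that each object exposes `input` and `actual_output` attributes, as the printed representation suggests.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DemoExample:
    """Hypothetical stand-in for judgeval's Example, reduced to two fields."""
    input: str
    actual_output: Optional[str] = None


def dedupe_examples(examples):
    """Drop repeated (input, actual_output) pairs, keeping first-seen order."""
    seen = set()
    unique = []
    for example in examples:
        key = (example.input, example.actual_output)
        if key not in seen:
            seen.add(key)
            unique.append(example)
    return unique
```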

Perhaps none of our ready-made scorers is what you're looking for, and you want to measure a custom metric. With Judgeval, you can still use our infrastructure to run evaluations and store their results by using a `CustomScorer`!

In [None]:
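from judgeval.playground import CustomFaithfulnessMetric
# TogetherJudge and MixtureOfJudges are imported for reference but not used in this cell.
from judgeval.judges import TogetherJudge, MixtureOfJudges, LiteLLMJudge
# CustomFaithfulnessMetric is just our choice for this demo; you can write your own scorer.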

model = LiteLLMJudge(model="gpt-4.1")
c_scorer = CustomFaithfulnessMetric(
    threshold=0.6,
    model=model,
)

results = client.run_evaluation(
    examples=[example_1],
    scorers=[c_scorer],
    model="gpt-4.1"
)
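
To make the idea of a custom metric concrete, here is a minimal, self-contained sketch of the score/threshold/success pattern. It does not use or reproduce Judgeval's `CustomScorer` interface, whose required methods are not shown in this walkthrough. `KeywordGroundingScorer`, `score_example`, and `min_overlap` are hypothetical names, and the metric itself is a naive lexical-overlap proxy rather than the LLM-judged faithfulness used above.

```python
import re


class KeywordGroundingScorer:
    """Hypothetical scorer: fraction of output sentences lexically grounded in the context.

    NOT judgeval's CustomScorer interface; it only illustrates how a score is
    compared against a threshold to decide success.
    """

    def __init__(self, threshold: float = 0.6, min_overlap: float = 0.5):
        self.threshold = threshold      # minimum score for the example to pass
        self.min_overlap = min_overlap  # share of a sentence's words found in the context
        self.score = None
        self.success = None

    @staticmethod
    def _words(text: str) -> set:
        return set(re.findall(r"[a-z0-9']+", text.lower()))

    def score_example(self, actual_output: str, retrieval_context: list) -> float:
        context_words = self._words(" ".join(retrieval_context))
        # Split on whitespace that follows sentence-ending punctuation.
        sentences = [s for s in re.split(r"(?<=[.!?])\s+", actual_output.strip()) if s]
        if not sentences:
            self.score, self.success = 0.0, False
            return self.score
        grounded = 0
        for sentence in sentences:
            words = self._words(sentence)
            overlap = len(words & context_words) / len(words) if words else 0.0
            if overlap >= self.min_overlap:
                grounded += 1
        self.score = grounded / len(sentences)
        self.success = self.score >= self.threshold
        return self.score
```

A word-overlap metric like this cannot detect paraphrased or subtly altered facts, which is why the walkthrough relies on an LLM judge for faithfulness.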