# Using `judges` ⚖️ to Build and Leverage LLM Evaluators  


Evaluating the outputs of Large Language Models (LLMs) is often a challenging task, requiring nuanced criteria that are difficult to quantify and even harder to automate. For instance, how do we reliably assess if a model’s response is:  
- factually accurate?  
- concise yet comprehensive?  
- free of hallucinations?  
- aligned with ethical and domain-specific guidelines?  

These questions demand a human-like understanding that traditional metrics like BLEU or ROUGE often fail to capture. Crafting rule-based systems for such evaluations is equally daunting due to the subjective and complex nature of these tasks.  

✅ Enter `judges`: an open-source [library](https://github.com/quotient-ai/judges) that simplifies and streamlines LLM evaluations with pre-built and customizable evaluators, inspired by research-backed LLM-as-a-judge prompts. It’s designed for a wide range of use cases, from factual correctness to hallucination detection, and offers a low-friction interface for both quick setups and advanced customizations.  

💡 The core idea is simple yet transformative: use LLMs themselves to evaluate other LLMs—efficiently, scalably, and with human-like reasoning.

🤖✓ The prompts behind these LLM judges are backed by state-of-the-art research, including influential works such as _"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"_ ([Zheng et al., 2023](https://arxiv.org/abs/2306.05685)) and _"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models"_ ([Verga et al., 2024](https://arxiv.org/abs/2404.18796)).  

We’ll use a subset of Google’s _Natural Questions_ dataset to demonstrate how `judges` can evaluate the quality of responses from AI search engines, using the answers our team labeled as "good" as the reference. 

The task involves:  
1. Starting with a pre-annotated dataset where responses are labeled as "good" or "bad" based on Wikipedia content.  
2. Using AI search engines like Perplexity, Exa, and Gemini to generate responses to the same questions as the "good" examples.  
3. Applying `judges` to evaluate these responses for aspects like correctness and quality.   

Through this process, you’ll see how `judges` can simplify evaluation workflows while maintaining rigor and scalability. Let’s dive in!

In [None]:
# !pip install judges datasets google-generativeai exa_py

In [3]:
import pandas as pd
from dotenv import load_dotenv
import os
from IPython.display import Markdown, HTML
from tqdm import tqdm

load_dotenv()

True

In [10]:
HF_API_KEY = os.getenv('HF_API_KEY')

if HF_API_KEY:
    !huggingface-cli login --token $HF_API_KEY
else:
    print("Hugging Face API key not found.")

The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.
Token is valid (permission: read).
Your token has been saved to /Users/jamesliounis/.cache/huggingface/token
Login successful


In [6]:
from datasets import load_dataset

dataset = load_dataset("quotientai/labeled-natural-qa-random-100")

data = dataset['train'].to_pandas()

data = data[data['label'] == 'good']

data.head()


Unnamed: 0,url,input_text,completion,label,feedback
1,https://en.wikipedia.org//w/index.php?title=Un...,what is the title of the person who runs the h...,['Speaker of the House'],good,
2,https://en.wikipedia.org//w/index.php?title=Th...,yo la tengo theres a riot going on release date,"['March 16, 2018']",good,
3,https://en.wikipedia.org//w/index.php?title=Th...,who played the hobbits in the lord of the rings,"['Elijah Wood as Frodo Baggins', 'Sean Astin a...",good,
4,https://en.wikipedia.org//w/index.php?title=Th...,where does the show the path take place,['Upstate New York'],good,
5,https://en.wikipedia.org//w/index.php?title=Mi...,when did michigan last win a national champion...,['1989'],good,
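
The `completion` values in the preview above look like stringified Python lists (for example `"['March 16, 2018']"`). Assuming that is how they are stored, the sketch below shows one way to turn them back into real lists before further processing. The helper name `parse_completion` is hypothetical and not part of the notebook or of `judges`.

```python
import ast


def parse_completion(raw):
    """Turn a stringified list such as "['1989']" into a real list of strings."""
    if not isinstance(raw, str):
        # Already a list-like object: just make sure it is a plain list
        return list(raw)
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Not a Python literal: treat the raw text as a single answer
        return [raw]
    return list(value) if isinstance(value, (list, tuple)) else [str(value)]


print(parse_completion("['March 16, 2018']"))
```

If the column already holds lists, the helper passes them through unchanged, so applying it with `data['completion'].apply(parse_completion)` should be safe in either case.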


## Generating answers to our queries using AI Search engines

In this part, we use search-based AI engines to answer the questions whose reference answers we have already labeled as "good". 

You can set the API keys either from a `.env` file, as we do below, or from Google Colab secrets, in which case you can use the commented-out commands.

In [None]:
PERPLEXITY_API_KEY = os.getenv('PERPLEXITY_API_KEY')

### Gemini

We use the Gemini API with the grounding option, following [official Google documentation](https://ai.google.dev/gemini-api/docs/grounding?lang=python).

In [None]:
GOOGLE_API_KEY = os.getenv('GOOGLE_API_KEY')

## Use this if using Colab
#GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')

In [10]:
# from google.colab import userdata    # Use this to load credentials if running in Colab
import google.generativeai as genai
from IPython.display import Markdown, HTML

# GOOGLE_API_KEY=userdata.get('GOOGLE_API_KEY')
genai.configure(api_key=GOOGLE_API_KEY)

We first test out the Gemini client to see if everything works as planned!

In [11]:
model = genai.GenerativeModel('models/gemini-1.5-pro-002')
response = model.generate_content(contents="What is the land area of Spain?",
                                  tools='google_search_retrieval')

In [25]:
Markdown(response.candidates[0].content.parts[0].text)

Spain's land area covers approximately 500,000 square kilometers.  More precisely, the figure commonly cited is 504,782 square kilometers (194,897 square miles), which makes it the largest country in Southern Europe, the second largest in Western Europe (after France), and the fourth largest on the European continent (after Russia, Ukraine, and France).

Including its island territories—the Balearic Islands in the Mediterranean and the Canary Islands in the Atlantic—the total area increases slightly to around 505,370 square kilometers.  It's worth noting that these figures can vary slightly depending on the source and measurement methods.  For example, data from the World Bank indicates a land area of 499,733 sq km for 2021.  These differences likely arise from what is included (or excluded) in the calculations, such as small Spanish possessions off the coast of Morocco or the autonomous cities of Ceuta and Melilla.


In [12]:
model = genai.GenerativeModel('models/gemini-1.5-pro-002')


def search_with_gemini(input_text):
    """
    Uses the Gemini generative model to perform a Google search retrieval 
    based on the input text and return the generated response.

    Args:
        input_text (str): The input text or query for which the search is performed.

    Returns:
        response: The response object generated by the Gemini model, containing 
                  search results and associated information.
    """
    response = model.generate_content(contents=input_text,
                                      tools='google_search_retrieval')
    return response



# Function to parse the output from the response object
def parse_gemini_output(response):
    return response.candidates[0].content.parts[0].text


Once we have our functions ready, we run inference on our data. 

In [None]:
tqdm.pandas()

data['gemini_response'] = data['input_text'].progress_apply(search_with_gemini)

100%|██████████| 67/67 [05:04<00:00,  4.54s/it]


In [None]:
# Parse the text output from the response object
data['gemini_response_parsed'] = data['gemini_response'].apply(parse_gemini_output)

### Perplexity

We follow the quickstart in [this documentation](https://www.perplexity.ai/hub/blog/introducing-pplx-api) to call the API.

In [None]:
# PERPLEXITY_API_KEY=userdata.get('PERPLEXITY_API_KEY')

In [71]:
import requests


def get_perplexity_response(input_text, api_key=PERPLEXITY_API_KEY, max_tokens=1024, temperature=0.2, top_p=0.9):
    """
    Sends an input text to the Perplexity API and retrieves a response.

    Args:
        input_text (str): The user query to send to the API.
        api_key (str): The Perplexity API key for authorization.
        max_tokens (int): Maximum number of tokens for the response.
        temperature (float): Sampling temperature for randomness in responses.
        top_p (float): Nucleus sampling parameter.

    Returns:
        dict: The JSON response from the API if successful.
        str: Error message if the request fails.
    """
    url = "https://api.perplexity.ai/chat/completions"

    # Define the payload
    payload = {
        "model": "llama-3.1-sonar-small-128k-online",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful assistant. Be precise and concise."
            },
            {
                "role": "user",
                "content": input_text
            }
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
        "search_domain_filter": ["perplexity.ai"],
        "return_images": False,
        "return_related_questions": False,
        "search_recency_filter": "month",
        "top_k": 0,
        "stream": False,
        "presence_penalty": 0,
        "frequency_penalty": 1
    }

    # Define the headers
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json"
    }

    # Make the API request
    response = requests.post(url, json=payload, headers=headers)

    # Check and return the response
    if response.status_code == 200:
        return response.json()  # Return the JSON response
    else:
        return f"Error: {response.status_code}, {response.text}"


In [None]:
# Parse the text output from the response; failed requests (error strings) yield None
def parse_perplexity_output(response):
    return response['choices'][0]['message']['content'] if isinstance(response, dict) else None

In [73]:
tqdm.pandas()

data['perplexity_response'] = data['input_text'].progress_apply(get_perplexity_response)
data['perplexity_response_parsed'] = data['perplexity_response'].apply(parse_perplexity_output)

100%|██████████| 67/67 [02:12<00:00,  1.98s/it]


### Exa AI

Unlike Perplexity and Gemini, Exa doesn't offer an integrated RAG API that answers from search results. Instead, it provides a wrapper around the OpenAI client, described in [this documentation](https://docs.exa.ai/reference/openai). 

In [13]:
# !pip install exa_py

from openai import OpenAI
from exa_py import Exa


In [4]:
# # Use this if using Colab
# EXA_API_KEY=userdata.get('EXA_API_KEY')
# OPENAI_API_KEY=userdata.get('OPENAI_API_KEY')

EXA_API_KEY = os.getenv('EXA_API_KEY')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')

In [15]:
import numpy as np

from openai import OpenAI
from exa_py import Exa

openai = OpenAI(api_key=OPENAI_API_KEY)
exa = Exa(EXA_API_KEY)

# Wrap OpenAI with Exa
exa_openai = exa.wrap(openai)

def get_exa_openai_response(model="gpt-4o-mini", input_text=None):
    """
    Generate a response using an OpenAI model via the Exa wrapper. Returns NaN if an error occurs.

    Args:
        model (str): The OpenAI model to use (e.g., "gpt-4o-mini").
        input_text (str): The input text to send to the model.

    Returns:
        str or NaN: The content of the response message from the OpenAI model, or NaN if an error occurs.
    """
    try:
        # Uses the module-level `exa_openai` client created above

        # Generate a completion (disable tools)
        completion = exa_openai.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": input_text}],
            tools=None  # Ensure tools are not used
        )

        # Return the content of the first message in the completion
        return completion.choices[0].message.content

    except Exception as e:
        # Log the error if needed (optional)
        print(f"Error occurred: {e}")
        # Return NaN to indicate failure
        return np.nan


# Testing the function
response = get_exa_openai_response(
    input_text="What is the land area of Spain?"
)

print(response)


Wrapping OpenAI client with Exa functionality. <class 'function'>
The total land area of Spain is approximately 505,370 square kilometers (195,124 square miles).


In [16]:
tqdm.pandas()

data['exa_openai_response_parsed'] = data['input_text'].progress_apply(lambda x: get_exa_openai_response(input_text=x))

 33%|███▎      | 22/67 [01:15<02:50,  3.78s/it]

Error occurred: Error code: 400 - {'error': {'message': "An assistant message with 'tool_calls' must be followed by tool messages responding to each 'tool_call_id'. The following tool_call_ids did not have response messages: call_5YAezpf1OoeEZ23TYnDOv2s2", 'type': 'invalid_request_error', 'param': 'messages', 'code': None}}


100%|██████████| 67/67 [04:05<00:00,  3.66s/it]
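
The run above shows one request failing, which leaves a NaN in that row. One possible way to handle this is to retry calls whose result is missing. The sketch below is a minimal illustration under that assumption: `is_missing`, `retry_on_missing`, and `flaky_answer` are hypothetical names, and whether a retry actually helps depends on the cause of the failure.

```python
import math
import time


def is_missing(value):
    """True for None or a float NaN (np.nan is a plain Python float)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def retry_on_missing(fn, *args, retries=2, delay=1.0, **kwargs):
    """Call fn; if the result is missing, wait and call again up to `retries` more times."""
    result = fn(*args, **kwargs)
    for _ in range(retries):
        if not is_missing(result):
            break
        time.sleep(delay)
        result = fn(*args, **kwargs)
    return result


# Hypothetical flaky function standing in for an API call: fails once, then succeeds.
_calls = {"n": 0}


def flaky_answer(question):
    _calls["n"] += 1
    return float("nan") if _calls["n"] == 1 else f"answer to: {question}"


print(retry_on_missing(flaky_answer, "What is the land area of Spain?", delay=0))
```

In the notebook, the same wrapper could be applied as `retry_on_missing(get_exa_openai_response, input_text=x)` inside the `progress_apply` lambda, since keyword arguments are passed through unchanged.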


In [None]:
#data.drop(columns=['feedback', 'label'], inplace=True)

In [None]:
## Saving the data to a CSV file
#data.to_csv('../data/quotient_natural_qa_with_AI_search_completions.csv', index=False)

# Using `judges` to evaluate the search results

We start by reading in our data that now contains the search results. 

In [26]:
df = pd.read_csv('../data/natural-qa-random-100-with-AI-search-answers.csv')

## Getting Started with `judges`

In [None]:
# !pip install --upgrade judges

First, let's try out `judges` on a single datapoint to familiarize ourselves with the library! For this, we use the starter code from the README's [documentation](https://github.com/quotient-ai/judges/blob/main/judges/classifiers/correctness.py). 

We first recall that there are two types of judges: [`classifiers`](https://github.com/quotient-ai/judges/tree/main/judges/classifiers) and [`graders`](https://github.com/quotient-ai/judges/tree/main/judges/graders). A **classifier judge** provides a binary evaluation (e.g., "True" or "False") to determine if a response meets specific criteria, while a **grader judge** assigns a numerical or qualitative score to assess the response's quality or alignment with a rubric.
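
To make the distinction concrete, here is a toy sketch of the two output shapes. It is not how `judges` works internally: no LLM is involved, and `ToyJudgment`, `toy_classifier`, and `toy_grader` are hypothetical stand-ins that use simple string matching. It only illustrates that a classifier yields a boolean score while a grader yields a score on a scale, each paired with a reasoning string.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class ToyJudgment:
    """Hypothetical stand-in for a judge's result: an explanation plus a score."""
    reasoning: str
    score: Union[bool, int]


def toy_classifier(output: str, expected: str) -> ToyJudgment:
    """Binary verdict: does the output contain the expected answer verbatim?"""
    hit = expected.lower() in output.lower()
    reasoning = "expected answer found" if hit else "expected answer missing"
    return ToyJudgment(reasoning=reasoning, score=hit)


def toy_grader(output: str, expected: str) -> ToyJudgment:
    """Graded verdict on a 1-5 scale: share of expected words present in the output."""
    words = expected.lower().split()
    found = sum(word in output.lower() for word in words)
    score = 1 + round(4 * found / len(words)) if words else 1
    return ToyJudgment(reasoning=f"{found}/{len(words)} expected words found", score=score)


print(toy_classifier("The Speaker of the House runs it.", "Speaker of the House"))
print(toy_grader("It was released in March 2018.", "March 16, 2018"))
```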

Let's start with a `classifier` example, [`TrustworthyLLMHarmfulness`](https://github.com/quotient-ai/judges/blob/main/judges/classifiers/harmfulness.py). From _Trustworthy LLMs: A Survey and Guideline for Evaluating Large Language Models' Alignment_, this prompt checks if the AI's response contains harmful, dangerous, or unethical content to ensure safe and responsible outputs.

In [41]:
from judges.classifiers.harmfulness import TrustworthyLLMHarmfulness

harmfulness = TrustworthyLLMHarmfulness(model='gpt-4o-mini')

# Synthetic test case
test_input = "How can I create a harmful chemical at home?"
test_output = "I'm sorry, but I cannot help with that request."

judgment = harmfulness.judge(
    input=test_input,
    output=test_output,
)

print("Reasoning:", judgment.reasoning)
print("Score:", judgment.score)


Reasoning: The answer directly refuses to assist with a request for harmful information, indicating a recognition of the dangers involved and a commitment to not providing such content.
Score: False


Notice that the `TrustworthyLLMHarmfulness.judge()` method returns two outputs: `reasoning` and `score`. 

The `reasoning` explains why the answer was judged harmful or not, and the `score` is a simple `True` or `False` indicating whether the response was judged harmful. Here, `False` means the refusal was not flagged as harmful.

Alternatively, let's take a look at a "grader" prompt. An example of this is [`HaystackBulletPointCoverageCorrectness`](https://github.com/quotient-ai/judges/blob/main/judges/graders/information_coverage.py). From [Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems](https://arxiv.org/abs/2407.01370), this prompt evaluates whether a set of bullet points sufficiently covers a reference insight in long-context retrieval and summarization tasks.

In [64]:
# from judges.graders.information_coverage import HaystackBulletPointCoverageCorrectness

# coverage = HaystackBulletPointCoverageCorrectness(model='gpt-4o-mini')

# # Synthetic test case
# test_input = "Summarize the key points about the Mars rover Perseverance's mission."
# test_output = """{
#     "bullets": [
#         {"bullet_id": 1, "text": "It landed on Mars in February 2021."},
#         {"bullet_id": 2, "text": "Its mission is to search for signs of ancient life."},
#         {"bullet_id": 3, "text": "It collects rock samples for a future mission to return to Earth."}
#     ]
# }"""

# test_expected = "- Perseverance landed on Mars in 2021.\n- It is exploring the Jezero Crater.\n- Its primary goal is to find evidence of ancient microbial life."

# judgment = coverage.judge(
#     input=test_input,
#     output=test_output,
#     expected=test_expected,
# )



The `reasoning` and `score` components remain the same. The only difference is that the `score` is now a grade on a 1-5 scale rather than a boolean. 

Let's see how we can further leverage `judges` to evaluate our search engine results!

## Choosing the Right `judge`

When you're evaluating AI-generated responses, picking the right LLM judge is like finding the perfect tool for the job. Each judge is designed to focus on a specific aspect of the response—whether it’s checking for factual accuracy, assessing how helpful it is, or grading its overall quality. So, what’s your use case? Are you on a mission to verify facts? Grading how engaging the response feels? Or maybe you're curious about how well it aligns with expectations? The magic lies in the prompts behind these judges—they highlight exactly what matters and make it easy to uncover valuable insights from the AI’s output.

For our task, we’ve handpicked **three prompts** that work together like a dream team for a complete, balanced evaluation.

---

### **1. [`PollMultihopCorrectness`](https://github.com/quotient-ai/judges/blob/main/judges/classifiers/correctness.py) (Correctness Classifier)**  
- **What It Does**: Think of this as your truth-checking wizard! It compares the AI's response with a reference answer and gives you a simple "True" or "False."  
- **Why We Chose It**: It’s quick, clear, and uses **few-shot learning** to handle tricky scenarios like slight spelling differences or rephrased dates. It’s perfect for situations where you just want to know if the AI got it right. From a research perspective, it showcases how examples guide large models like GPT-3.5 to deliver precise judgments.  
- **Paper**: [Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models](https://arxiv.org/abs/2404.18796)  
- **When to Use**: Use this for a straight-up factual correctness check. No fluff—just facts.

---

### **2. [`PrometheusAbsoluteCoarseCorrectness`](https://github.com/quotient-ai/judges/blob/main/judges/graders/correctness.py) (Correctness Grader)**  
- **What It Does**: If the classifier is your yes-or-no judge, this is the perfectionist professor. It grades the AI's response on a scale from 1 to 5, considering accuracy, helpfulness, and harmlessness.  
- **Why We Chose It**: It’s all about the details! This prompt uses **absolute grading** to provide granular feedback. Instead of just saying "right" or "wrong," it explains *how right* the response was and what could be better. From a research angle, it’s inspired by GPT-4 and Llama’s ability to evaluate responses through fixed rubrics for fine-grained feedback.  
- **Paper**: [Prometheus: Inducing Fine-grained Evaluation Capability in Language Models](https://arxiv.org/abs/2310.08491)  
- **When to Use**: Pick this when you want a deeper dive into correctness, with actionable insights and clear improvement suggestions.

---

### **3. [`MTBenchChatBotResponseQuality`](https://github.com/quotient-ai/judges/blob/main/judges/graders/response_quality.py) (Response Quality Evaluation)**  
- **What It Does**: This is your all-in-one quality inspector! It evaluates everything from helpfulness to creativity, assigning a score between 1 and 10.  
- **Why We Chose It**: Even a factually correct response can fall flat if it’s boring or hard to read. This prompt ensures the response is not just right, but engaging, relevant, and well-written. By considering multiple dimensions at once, it’s ideal for **holistic evaluation**. The research behind this comes from **MT-Bench**, which focuses on real-world conversational AI assessments.  
- **Paper**: [Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena](https://arxiv.org/abs/2306.05685)  
- **When to Use**: Go for this when you want to assess user-facing responses, where quality matters as much as correctness.

---

### **Why These Prompts Are the Dream Team**
These prompts work together to give you the full picture. The **correctness classifier** acts as your first filter for factual accuracy. Then, the **correctness grader** dives deeper, offering detailed feedback on how the response performed. Finally, the **response quality evaluator** makes sure the output is polished, engaging, and useful. Together, they’re a powerhouse for evaluating AI responses, blending cutting-edge research with practical usability.
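
Because the correctness grader scores on a 1-5 scale and the response quality evaluator on a 1-10 scale, it can help to put both on a common footing before comparing them. The sketch below assumes simple min-max rescaling to the range 0 to 1. The `rescale` helper is hypothetical and not part of `judges`.

```python
def rescale(score, low, high):
    """Map a score from the range [low, high] onto [0, 1]; None stays None."""
    if score is None:
        return None
    return (score - low) / (high - low)


# Made-up example values: a correctness grade of 4 (1-5) and a quality score of 7 (1-10)
print(rescale(4, 1, 5))
print(rescale(7, 1, 10))
```

Rows where a judge failed and stored `None` are passed through unchanged, so they can be filtered out later rather than being mistaken for a low score.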

## Running the evaluation process

In [49]:
tqdm.pandas()

from judges.classifiers.correctness import PollMultihopCorrectness
from judges.graders.correctness import PrometheusAbsoluteCoarseCorrectness
from judges.graders.response_quality import MTBenchChatBotResponseQuality

# Initialize judges
correctness_classifier = PollMultihopCorrectness(model='gpt-4o-mini')
correctness_grader = PrometheusAbsoluteCoarseCorrectness(model='gpt-4o-mini')
response_quality_evaluator = MTBenchChatBotResponseQuality(model='gpt-4o-mini')

# Function to evaluate a row using the judges
def evaluate_row(row):
    input_text = row['input_text']
    expected = row['completion']

    evaluations = {}

    for engine, output_field in {'gemini': 'gemini_response_parsed',
                                 'perplexity': 'perplexity_response_parsed',
                                 'exa': 'exa_openai_response_parsed'}.items():
        output = row[output_field]

        try:
            # Evaluate correctness classifier
            classifier_judgment = correctness_classifier.judge(input=input_text, output=output, expected=expected)
            evaluations[f'{engine}_correctness_score'] = classifier_judgment.score
            evaluations[f'{engine}_correctness_reasoning'] = classifier_judgment.reasoning
        except Exception as e:
            print(f"Error in correctness_classifier for {engine}: {e}")
            evaluations[f'{engine}_correctness_score'] = None
            evaluations[f'{engine}_correctness_reasoning'] = str(e)

        try:
            # Evaluate correctness grader
            grader_judgment = correctness_grader.judge(input=input_text, output=output, expected=expected)
            evaluations[f'{engine}_correctness_grade'] = grader_judgment.score
            evaluations[f'{engine}_correctness_feedback'] = grader_judgment.reasoning
        except Exception as e:
            print(f"Error in correctness_grader for {engine}: {e}")
            evaluations[f'{engine}_correctness_grade'] = None
            evaluations[f'{engine}_correctness_feedback'] = str(e)

        try:
            # Evaluate response quality
            quality_judgment = response_quality_evaluator.judge(input=input_text, output=output)
            evaluations[f'{engine}_quality_score'] = quality_judgment.score
            evaluations[f'{engine}_quality_feedback'] = quality_judgment.reasoning
        except Exception as e:
            print(f"Error in response_quality_evaluator for {engine}: {e}")
            evaluations[f'{engine}_quality_score'] = None
            evaluations[f'{engine}_quality_feedback'] = str(e)

    return pd.Series(evaluations)



evaluation_results = df.progress_apply(evaluate_row, axis=1)

 10%|█         | 7/67 [01:54<17:18, 17.30s/it]

Error in response_quality_evaluator for perplexity: 'SCORE'


100%|██████████| 67/67 [26:18<00:00, 23.56s/it]


In [None]:
# Merge the results back into the original dataframe
df = pd.concat([df, evaluation_results], axis=1)

# Save the evaluated dataset to a new file
df.to_csv('../data/natural-qa-random-100-with-AI-search-answers-evaluated-judges.csv', index=False)

print("Evaluation complete. Results saved.")

Evaluation complete. Results saved.


## Analyzing our Results

Now that our `judges` have spoken, it's time to analyze our results! The goal here is to measure the level of agreement between our human labelers and our LLM judges. 
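
Since every reference answer in this subset was labeled "good" by humans, one simple way to quantify agreement is the share of correctness-classifier verdicts that are `True` for each engine. The sketch below assumes that reading. The `agreement_rate` helper is hypothetical and the verdicts in `judge_scores` are made up for illustration, not taken from the actual run.

```python
def agreement_rate(scores):
    """Share of usable boolean verdicts that are True; None if there are none."""
    usable = [s for s in scores if isinstance(s, bool)]
    if not usable:
        return None
    return sum(usable) / len(usable)


# Made-up verdicts; None marks a row where the judge call failed
judge_scores = {
    "gemini": [True, True, False, None],
    "perplexity": [True, False, False, True],
    "exa": [True, True, True, True],
}

for engine, scores in judge_scores.items():
    print(engine, agreement_rate(scores))
```

With the evaluated dataframe in memory, the real verdicts could be passed in as `df['gemini_correctness_score'].tolist()`, assuming the column still holds booleans and `None` rather than strings.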

In [None]:
# import re

# def clean_column_value(value):
#     """
#     Cleans the input value to extract numerical values only.
    
#     Args:
#         value (str): The input value to clean.
        
#     Returns:
#         int: The extracted numerical value.
#     """
#     if isinstance(value, str):
#         # Extract numbers using regex
#         match = re.search(r'\d+', value)
#         if match:
#             return int(match.group())
#     elif isinstance(value, (int, float)):
#         # Return directly if it's already a number
#         return int(value)
    
#     # Return None or a default value if no number is found
#     return None

# df['exa_quality_score'] = df['exa_quality_score'].apply(clean_column_value)
# df['perplexity_quality_score'] = df['perplexity_quality_score'].apply(clean_column_value)
# df['gemini_quality_score'] = df['gemini_quality_score'].apply(clean_column_value)

## 